{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T22:29:35Z","timestamp":1774477775608,"version":"3.50.1"},"reference-count":51,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,5,23]],"date-time":"2025-05-23T00:00:00Z","timestamp":1747958400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,23]],"date-time":"2025-05-23T00:00:00Z","timestamp":1747958400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"FernUniversit\u00e4t in Hagen"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>To provide good results and decisions in data-driven systems, data quality must be ensured as a primary consideration. An important aspect of this is data cleaning. Although many different algorithms and tools already exist for data cleaning, an end-to-end data quality solution is still needed. In this paper, we present FONDUE, our vision of a well-founded end-to-end data quality optimizer. In contrast to many studies that consider data cleaning in the context of machine learning, our approach focuses on various scenarios, such as when preprocessing and downstream analysis are separated. As an adaptive and easily extendable framework, FONDUE operates similarly to proven methods of database query optimization. Analogously, it consists of the following parts: Rule-based optimization, where the appropriate data cleaning algorithms are selected based on use case constraints, optimizer hints in the form of best practices, and cost-based optimization, where the costs are measured in terms of data quality. 
Accordingly, the result is an optimized data cleaning pipeline. The choice of different optimization goals enables further flexibility, e.g. for environments with limited resources. As a first building block of FONDUE, we present CheDDaR, which is used to detect errors and measure data quality. Both are important tasks for improving data quality with FONDUE.<\/jats:p>","DOI":"10.1186\/s40537-025-01158-x","type":"journal-article","created":{"date-parts":[[2025,5,23]],"date-time":"2025-05-23T03:20:35Z","timestamp":1747970435000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["FONDUE\u2014Fine-Tuned Optimization: Nurturing Data Usability &amp; Efficiency"],"prefix":"10.1186","volume":"12","author":[{"given":"Valerie","family":"Restat","sequence":"first","affiliation":[]},{"given":"Indra","family":"Diestelk\u00e4mper","sequence":"additional","affiliation":[]},{"given":"Meike","family":"Klettke","sequence":"additional","affiliation":[]},{"given":"Uta","family":"St\u00f6rl","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,5,23]]},"reference":[{"issue":"12","key":"1158_CR1","doi-asserted-by":"publisher","first-page":"993","DOI":"10.14778\/2994509.2994518","volume":"9","author":"Z Abedjan","year":"2016","unstructured":"Abedjan Z, et al. Detecting data errors: where are we and what needs to be done? Proc VLDB Endow. 2016;9(12):993\u20131004.","journal-title":"Proc VLDB Endow"},{"issue":"1","key":"1158_CR2","first-page":"24","volume":"44","author":"F Neutatz","year":"2021","unstructured":"Neutatz F, et al. From Cleaning before ML to Cleaning for ML. IEEE Data Eng Bull. 2021;44(1):24\u201341.","journal-title":"IEEE Data Eng Bull"},{"key":"1158_CR3","unstructured":"Boehm M, et al. SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle. , Amsterdam (2020)."},{"key":"1158_CR4","unstructured":"Krishnan S, Wu E. 
AlphaClean: Automatic Generation of Data Cleaning Pipelines. CoRR abs\/1904.11827 (2019)."},{"key":"1158_CR5","doi-asserted-by":"publisher","DOI":"10.1145\/2939502.2939511","volume-title":"Towards reliable interactive data cleaning: a user survey and recommendations","author":"S Krishnan","year":"2016","unstructured":"Krishnan S, et al. Towards reliable interactive data cleaning: a user survey and recommendations. New York, NY: ACM; 2016."},{"key":"1158_CR6","doi-asserted-by":"publisher","DOI":"10.1109\/ICDEW61823.2024.00039","volume-title":"Towards an End-to-End Data Quality Optimizer","author":"V Restat","year":"2024","unstructured":"Restat V, Klettke M, St\u00f6rl U. Towards an End-to-End Data Quality Optimizer. New York, NY: IEEE; 2024."},{"key":"1158_CR7","doi-asserted-by":"publisher","first-page":"123738","DOI":"10.1155\/2013\/123738","volume":"2013","author":"H Yun","year":"2013","unstructured":"Yun H, Jeong S, Kim K. Advanced harmony search with ant colony optimization for solving the traveling salesman problem. J Appl Math. 2013;2013:123738\u201311237388.","journal-title":"J Appl Math"},{"issue":"1","key":"1158_CR8","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1007\/s00521-011-0794-0","volume":"23","author":"MS Kiran","year":"2013","unstructured":"Kiran MS, Iscan H, G\u00fcnd\u00fcz M. The analysis of discrete artificial bee colony algorithm with neighborhood operator on traveling salesman problem. Neural Comput Appl. 2013;23(1):9\u201321.","journal-title":"Neural Comput Appl"},{"key":"1158_CR9","doi-asserted-by":"publisher","first-page":"94","DOI":"10.1016\/j.ins.2013.09.034","volume":"258","author":"MB Dowlatshahi","year":"2014","unstructured":"Dowlatshahi MB, Nezamabadi-pour H, Mashinchi M. A discrete gravitational search algorithm for solving combinatorial optimization problems. Inf Sci. 
2014;258:94\u2013107.","journal-title":"Inf Sci"},{"key":"1158_CR10","doi-asserted-by":"publisher","DOI":"10.1145\/2908812.2908935","volume-title":"A New Discrete Particle Swarm Optimization Algorithm","author":"S Strasser","year":"2016","unstructured":"Strasser S, et al. A New Discrete Particle Swarm Optimization Algorithm. NY: ACML New York; 2016."},{"key":"1158_CR11","doi-asserted-by":"publisher","DOI":"10.1145\/3597465.3605229","volume-title":"Interactive data cleaning for real-time streaming applications","author":"T R\u00e4th","year":"2023","unstructured":"R\u00e4th T, Onah N, Sattler K. Interactive data cleaning for real-time streaming applications. New York, NY: ACM; 2023."},{"key":"1158_CR12","volume-title":"Data Cleaning of Data Streams","author":"V Restat","year":"2025","unstructured":"Restat V, et al. Data Cleaning of Data Streams. Heidelberg: Springer; 2025."},{"issue":"12","key":"1158_CR13","doi-asserted-by":"publisher","first-page":"4377","DOI":"10.14778\/3685800.3685879","volume":"17","author":"X Ding","year":"2024","unstructured":"Ding X, et al. Clean4TSDB: a data cleaning tool for time series databases. Proc VLDB Endow. 2024;17(12):4377\u201380.","journal-title":"Proc VLDB Endow"},{"key":"1158_CR14","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.1993.344012","volume-title":"Data quality requirements analysis and modeling","author":"RY Wang","year":"1993","unstructured":"Wang RY, Kon HB, Madnick SE. Data quality requirements analysis and modeling. New York, NY: IEEE Computer Society; 1993."},{"issue":"4","key":"1158_CR15","doi-asserted-by":"publisher","first-page":"623","DOI":"10.1109\/69.404034","volume":"7","author":"RY Wang","year":"1995","unstructured":"Wang RY, Storey VC, Firth CP. A framework for analysis of data quality research. IEEE Trans Knowl Data Eng. 
1995;7(4):623\u201340.","journal-title":"IEEE Trans Knowl Data Eng"},{"issue":"1","key":"1158_CR16","doi-asserted-by":"publisher","first-page":"16","DOI":"10.1080\/12460125.2015.1080494","volume":"25","author":"B Heinrich","year":"2016","unstructured":"Heinrich B, Hristova D. A quantitative approach for modelling the influence of currency of information on decision-making under uncertainty. J Decis Syst. 2016;25(1):16\u201341.","journal-title":"J Decis Syst"},{"issue":"2","key":"1158_CR17","first-page":"8","volume":"2","author":"RH Blake","year":"2011","unstructured":"Blake RH, Mangiameli P. The effects and interactions of data quality and problem complexity on classification. ACM J Data Inf Qual. 2011;2(2):8\u20131828.","journal-title":"ACM J Data Inf Qual"},{"issue":"12","key":"1158_CR18","doi-asserted-by":"publisher","first-page":"1781","DOI":"10.14778\/3229863.3229867","volume":"11","author":"S Schelter","year":"2018","unstructured":"Schelter S, et al. Automating large-scale data quality verification. Proc VLDB Endow. 2018;11(12):1781\u201394.","journal-title":"Proc VLDB Endow"},{"issue":"4","key":"1158_CR19","doi-asserted-by":"publisher","first-page":"153","DOI":"10.3390\/bdcc6040153","volume":"6","author":"W Elouataoui","year":"2022","unstructured":"Elouataoui W, et al. An advanced big data quality framework based on weighted metrics. Big Data Cogn Comput. 2022;6(4):153.","journal-title":"Big Data Cogn Comput"},{"key":"1158_CR20","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956","volume-title":"Raha: a configuration-free error detection system","author":"M Mahdavi","year":"2019","unstructured":"Mahdavi M, et al. Raha: a configuration-free error detection system. New York, NY: ACM; 2019."},{"key":"1158_CR21","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319888","volume-title":"HoloDetect: few-shot learning for error detection","author":"A Heidari","year":"2019","unstructured":"Heidari A, et al. 
HoloDetect: few-shot learning for error detection. New York, NY: ACM; 2019."},{"issue":"1","key":"1158_CR22","first-page":"3","volume":"10","author":"C Bors","year":"2018","unstructured":"Bors C, et al. Visual interactive creation, customization, and analysis of data quality metrics. ACM J Data Inf Qual. 2018;10(1):3\u20131326.","journal-title":"ACM J Data Inf Qual"},{"issue":"2","key":"1158_CR23","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1145\/3589280","volume":"1","author":"R Shrestha","year":"2023","unstructured":"Shrestha R, et al. Exploratory training: when annotators learn about data. Proc ACM Manag Data. 2023;1(2):135\u2013113525.","journal-title":"Proc. ACM Manag Data"},{"key":"1158_CR24","unstructured":"Restat V, Klettke M, St\u00f6rl U. FAIR is not enough - A Metrics Framework to ensure Data Quality through Data Preparation. In: BTW. LNI, vol. P-331, pp. 917\u2013929. Gesellschaft f\u00fcr Informatik e.V., Bonn (2023)."},{"key":"1158_CR25","doi-asserted-by":"publisher","DOI":"10.1145\/3310205","volume-title":"Data Cleaning","author":"IF Ilyas","year":"2019","unstructured":"Ilyas IF, Chu X. Data Cleaning. New York, NY: ACM; 2019."},{"key":"1158_CR26","unstructured":"Mahdavi M, Abedjan Z. Semi-Supervised Data Cleaning with Raha and Baran. In: CIDR. www.cidrdb.org, online (2021)."},{"issue":"3","key":"1158_CR27","doi-asserted-by":"publisher","first-page":"18","DOI":"10.1145\/3444831.3444835","volume":"49","author":"M Hameed","year":"2020","unstructured":"Hameed M, Naumann F. Data preparation: a survey of commercial tools. SIGMOD Rec. 2020;49(3):18\u201329.","journal-title":"SIGMOD Rec"},{"key":"1158_CR28","doi-asserted-by":"crossref","unstructured":"Klettke M, St\u00f6rl U. Four Generations in Data Engineering for Data Science: The Past, Presence and Future of a Field of Science. 
Datenbank-Spektrum, 59\u201366 (2021).","DOI":"10.1007\/s13222-021-00399-3"},{"key":"1158_CR29","volume-title":"CleanML: a study for evaluating the impact of data cleaning on ml classification tasks","author":"P Li","year":"2021","unstructured":"Li P, et al. CleanML: a study for evaluating the impact of data cleaning on ml classification tasks. New York, NY: IEEE; 2021."},{"key":"1158_CR30","volume-title":"A rule-based view of query optimization","author":"JC Freytag","year":"1987","unstructured":"Freytag JC. A rule-based view of query optimization. New York, NY: ACM Press; 1987."},{"key":"1158_CR31","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-8265-9_293","volume-title":"Query Optimization (in Relational Databases)","author":"T Neumann","year":"2018","unstructured":"Neumann T. Query Optimization (in Relational Databases). Heidelberg: Springer; 2018."},{"key":"1158_CR32","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-01865-7","volume-title":"Data Profiling","author":"Z Abedjan","year":"2019","unstructured":"Abedjan Z. Data Profiling. Heidelberg: Springer; 2019."},{"key":"1158_CR33","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.126585","volume":"554","author":"F Clemente","year":"2023","unstructured":"Clemente F, et al. ydata-profiling: Accelerating data-centric AI with high-quality data. Neurocomputing. 2023;554: 126585.","journal-title":"Neurocomputing"},{"key":"1158_CR34","unstructured":"Restat V, St\u00f6rl U. ALPINE: Abstract Language for Pipeline Integration and Execution. In: BTW. LNI. Gesellschaft f\u00fcr Informatik e.V., Bonn (2025)."},{"issue":"11","key":"1158_CR35","doi-asserted-by":"publisher","first-page":"3310","DOI":"10.14778\/3611479.3611528","volume":"16","author":"L Woltmann","year":"2023","unstructured":"Woltmann L, et al. FASTgres: making learned query optimizer hinting effective. Proc VLDB Endow. 
2023;16(11):3310\u201322.","journal-title":"Proc VLDB Endow"},{"issue":"4","key":"1158_CR36","doi-asserted-by":"publisher","first-page":"1047","DOI":"10.1007\/s10115-022-01661-0","volume":"64","author":"C Tsai","year":"2022","unstructured":"Tsai C, Hu Y. Empirical comparison of supervised learning techniques for missing value imputation. Knowl Inf Syst. 2022;64(4):1047\u201375.","journal-title":"Knowl Inf Syst"},{"key":"1158_CR37","doi-asserted-by":"publisher","DOI":"10.1145\/3533028.3533311","volume-title":"GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines","author":"V Restat","year":"2022","unstructured":"Restat V, et al. GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines. New York, NY: ACM; 2022."},{"key":"1158_CR38","unstructured":"Giovanelli J, Pisano G. Towards Human-centric AutoML via Logic and Argumentation. In: EDBT\/ICDT Workshops. CEUR Workshop Proceedings, vol. 3135. CEUR-WS.org, Aachen (2022)."},{"key":"1158_CR39","volume":"27","author":"K Hasan","year":"2021","unstructured":"Hasan K, et al. Missing value imputation affects the performance of machine learning: a review and analysis of the literature (2010\u20132021). Inf Med Unloc. 2021;27: 100799.","journal-title":"Inf Med Unloc"},{"key":"1158_CR40","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872803","volume-title":"Estimating compilation time of a query optimizer","author":"IF Ilyas","year":"2003","unstructured":"Ilyas IF, et al. Estimating compilation time of a query optimizer. New York, NY: ACM; 2003."},{"issue":"4","key":"1158_CR41","doi-asserted-by":"publisher","first-page":"523","DOI":"10.1007\/s00778-006-0004-3","volume":"16","author":"NN Dalvi","year":"2007","unstructured":"Dalvi NN, Suciu D. Efficient query evaluation on probabilistic databases. VLDB J. 
2007;16(4):523\u201344.","journal-title":"VLDB J"},{"key":"1158_CR42","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-8265-9_80692","volume-title":"Monte carlo methods for uncertain data","author":"PJ Haas","year":"2018","unstructured":"Haas PJ. Monte carlo methods for uncertain data. Heidelberg: Springer; 2018."},{"key":"1158_CR43","doi-asserted-by":"publisher","DOI":"10.1016\/j.rico.2023.100315","volume":"13","author":"PK Mandal","year":"2023","unstructured":"Mandal PK. A review of classical methods and Nature-Inspired Algorithms (NIAs) for optimization problems. Results Control Optimiz. 2023;13: 100315.","journal-title":"Results Control Optimiz"},{"issue":"12","key":"1158_CR44","doi-asserted-by":"publisher","first-page":"1482","DOI":"10.4249\/scholarpedia.1482","volume":"7","author":"JH Holland","year":"2012","unstructured":"Holland JH. Genetic algorithms. Scholarpedia. 2012;7(12):1482.","journal-title":"Scholarpedia"},{"key":"1158_CR45","volume-title":"Simulated Annealing","author":"KA Dowsland","year":"2012","unstructured":"Dowsland KA, Thompson JM. Simulated Annealing. Cham: Springer; 2012."},{"issue":"1","key":"1158_CR46","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1147\/sj.421.0098","volume":"42","author":"V Markl","year":"2003","unstructured":"Markl V, Lohman GM, Raman V. LEO: An autonomic query optimizer for DB2. IBM Syst J. 2003;42(1):98\u2013106.","journal-title":"IBM Syst J"},{"key":"1158_CR47","volume-title":"MLINSPECT: a data distribution debugger for machine learning pipelines","author":"S Grafberger","year":"2021","unstructured":"Grafberger S, et al. MLINSPECT: a data distribution debugger for machine learning pipelines. New York, NY: ACM; 2021."},{"issue":"6","key":"1158_CR48","doi-asserted-by":"publisher","first-page":"377","DOI":"10.1145\/362384.362685","volume":"13","author":"EF Codd","year":"1970","unstructured":"Codd EF. A relational model of data for large shared data banks. Commun ACM. 
1970;13(6):377\u201387.","journal-title":"Commun ACM"},{"issue":"3","key":"1158_CR49","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1145\/3381028","volume":"53","author":"A Boukerche","year":"2021","unstructured":"Boukerche A, Zheng L, Alfandi O. Outlier detection: methods, models, and classification. ACM Comput Surv. 2021;53(3):55\u201315537.","journal-title":"ACM Comput Surv"},{"key":"1158_CR50","doi-asserted-by":"publisher","first-page":"107964","DOI":"10.1109\/ACCESS.2019.2932769","volume":"7","author":"H Wang","year":"2019","unstructured":"Wang H, Bah MJ, Hammad M. Progress in outlier detection techniques: a survey. IEEE Access. 2019;7:107964\u20138000.","journal-title":"IEEE Access"},{"key":"1158_CR51","doi-asserted-by":"crossref","unstructured":"Bantilan N. pandera: Statistical Data Validation of Pandas Dataframes. In: SciPy, pp. 116\u2013124. scipy.org, online (2020).","DOI":"10.25080\/Majora-342d178e-010"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01158-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-025-01158-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01158-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,23]],"date-time":"2025-05-23T04:02:50Z","timestamp":1747972970000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-025-01158-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,23]]},"references-count":51,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1158"],"U
RL":"https:\/\/doi.org\/10.1186\/s40537-025-01158-x","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-5182530\/v1","asserted-by":"object"}]},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,23]]},"assertion":[{"value":"30 September 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 April 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 May 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no Conflict of interest.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"131"}}