{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,9]],"date-time":"2026-04-09T11:39:12Z","timestamp":1775734752746,"version":"3.50.1"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2022,5,13]],"date-time":"2022-05-13T00:00:00Z","timestamp":1652400000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,5,13]],"date-time":"2022-05-13T00:00:00Z","timestamp":1652400000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100006764","name":"Technische Universit\u00e4t Berlin","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100006764","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Datenbank Spektrum"],"published-print":{"date-parts":[[2022,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Data cleaning is widely acknowledged as an important yet tedious task when dealing with large amounts of data. Thus, there is always a\u00a0cost-benefit trade-off to consider. In particular, it is important to assess this trade-off when not every data point and data error is equally important for a\u00a0task. This is often the case when statistical analysis or machine learning (ML) models derive knowledge about data. If we only care about maximizing the utility score of the applications, such as accuracy or F1 scores, many tasks can afford some degree of data quality problems. Recent studies analyzed the impact of various data error types on vanilla ML tasks, showing that missing values and outliers significantly impact the outcome of such models. In this paper, we expand the setting to one where data cleaning is not considered in isolation but as an equal parameter among many other hyper-parameters that influence feature selection, regularization, and model selection. In particular, we use state-of-the-art AutoML frameworks to automatically learn the parameters that benefit a\u00a0particular ML binary classification task. In our study, we see that specific cleaning routines still play a\u00a0significant role but can also be entirely avoided if the choice of a\u00a0specific model or the filtering of specific features diminishes the overall impact.<\/jats:p>","DOI":"10.1007\/s13222-022-00413-2","type":"journal-article","created":{"date-parts":[[2022,5,13]],"date-time":"2022-05-13T11:08:00Z","timestamp":1652440080000},"page":"121-130","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":24,"title":["Data Cleaning and AutoML: Would an Optimizer Choose to Clean?"],"prefix":"10.1007","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8698-8010","authenticated-orcid":false,"given":"Felix","family":"Neutatz","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Binger","family":"Chen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yazan","family":"Alkhatib","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jingwen","family":"Ye","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ziawasch","family":"Abedjan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,5,13]]},"reference":[{"key":"413_CR1","volume-title":"NeurIPS","author":"D Alistarh","year":"2017","unstructured":"Alistarh D, Grubic D, Li J, Tomioka R, Vojnovic M (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In: NeurIPS"},{"key":"413_CR2","volume-title":"NeurIPS","author":"M Belkin","year":"2018","unstructured":"Belkin M, Hsu DJ, Mitra P (2018) Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In: NeurIPS"},{"key":"413_CR3","doi-asserted-by":"crossref","unstructured":"Breiman L (2001) Random forests. Mach Learn 45(1):5\u201332","DOI":"10.1023\/A:1010933404324"},{"key":"413_CR4","volume-title":"SIGMOD","author":"MM Breunig","year":"2000","unstructured":"Breunig MM, Kriegel H, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: SIGMOD"},{"key":"413_CR5","series-title":"CoRR, abs\/2003.06505","volume-title":"Autogluon-tabular: Robust and accurate automl for structured data","author":"N Erickson","year":"2020","unstructured":"Erickson N, Mueller J, Shirkov A, Zhang H, Larroy P, Li M, Smola AJ (2020) Autogluon-tabular: Robust and accurate automl for structured data. CoRR, abs\/2003.06505"},{"key":"413_CR6","volume-title":"NeurIPS","author":"M Feurer","year":"2015","unstructured":"Feurer M, Klein A, Eggensperger K, Springenberg JT, Blum M, Hutter F (2015) Efficient and robust automated machine learning. In: NeurIPS"},{"key":"413_CR7","doi-asserted-by":"crossref","unstructured":"Fr\u00e9nay B, Verleysen M (2013) Classification in the presence of label noise: a\u00a0survey. IEEE Trans Neural Netw Learning Syst 25(5):845\u2013869","DOI":"10.1109\/TNNLS.2013.2292894"},{"key":"413_CR8","volume-title":"UAI","author":"MA Gelbart","year":"2014","unstructured":"Gelbart MA, Snoek J, Adams RP (2014) Bayesian optimization with unknown constraints. In: UAI"},{"key":"413_CR9","doi-asserted-by":"crossref","unstructured":"Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3\u201342","DOI":"10.1007\/s10994-006-6226-1"},{"key":"413_CR10","first-page":"1","volume-title":"IJCNN","author":"I Guyon","year":"2015","unstructured":"Guyon I, Bennett KP, Cawley GC, Escalante HJ, Escalera S, Ho TK, Maci\u00e0 N, Ray B, Saeed M, Statnikov AR, Viegas E (2015) Design of the 2015 chalearn automl challenge. In: IJCNN, pp 1\u20138"},{"key":"413_CR11","doi-asserted-by":"crossref","unstructured":"Karlas B, Li P, Wu R, G\u00fcrel NM, Chu X, Wu W, Zhang C (2021) Nearest neighbor classifiers over incomplete information: From certain answers to certain predictions. PVLDB 14:255\u2013267","DOI":"10.14778\/3430915.3430917"},{"key":"413_CR12","doi-asserted-by":"crossref","unstructured":"Khoshgoftaar TM, Hulse JV, Napolitano A (2011) Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Trans Syst Man Cybern Part\u00a0A 41(3):552\u2013568","DOI":"10.1109\/TSMCA.2010.2084081"},{"key":"413_CR13","series-title":"CoRR, abs\/1711.01299","volume-title":"BoostClean: automated error detection and repair for machine learning","author":"S Krishnan","year":"2017","unstructured":"Krishnan S, Franklin MJ, Goldberg K, Wu E (2017) BoostClean: automated error detection and repair for machine learning. CoRR, abs\/1711.01299"},{"key":"413_CR14","doi-asserted-by":"crossref","unstructured":"Krishnan S, Wang J, Wu E, Franklin MJ, Goldberg K (2016) ActiveClean: interactive data cleaning for statistical modeling. PVLDB 9(12):948\u2013959","DOI":"10.14778\/2994509.2994514"},{"key":"413_CR15","volume-title":"ICDE","author":"P Li","year":"2021","unstructured":"Li P, Rao X, Blase J, Zhang Y, Chu X, Zhang C (2021) Cleanml: A study for evaluating the impact of data cleaning on ML classification tasks. In: ICDE"},{"key":"413_CR16","volume-title":"ICDM","author":"FT Liu","year":"2008","unstructured":"Liu FT, Ting KM, Zhou Z (2008) Isolation forest. In: ICDM"},{"key":"413_CR17","volume-title":"ICLR","author":"A Madry","year":"2018","unstructured":"Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A (2018) Towards deep learning models resistant to adversarial attacks. In: ICLR"},{"key":"413_CR18","doi-asserted-by":"crossref","unstructured":"Mahdavi M, Abedjan Z (2020) Baran: Effective error correction via a\u00a0unified context representation and transfer learning. PVLDB 13(11):1948\u20131961","DOI":"10.14778\/3407790.3407801"},{"key":"413_CR19","unstructured":"Neutatz F (2022) Data-cleaning-specific AutoSklearn. https:\/\/github.com\/LUH-DBS\/AutoClean. Accessed 06\u00a0May 2022"},{"key":"413_CR20","unstructured":"Neutatz F, Chen B, Abedjan Z, Wu E (2021) From cleaning before ML to cleaning for ML. IEEE Data Eng Bull 44:24\u201341"},{"key":"413_CR21","volume-title":"CIKM","author":"F Neutatz","year":"2019","unstructured":"Neutatz F, Mahdavi M, Abedjan Z (2019) ED2: A case for active learning in error detection. In: CIKM"},{"key":"413_CR22","unstructured":"Ng A (2021) The batch: regulating ai, where the mles are, real-time voice replacement, robot painter. https:\/\/read.deeplearning.ai\/the-batch\/issue-114\/. Accessed 06\u00a0May 2022"},{"key":"413_CR23","doi-asserted-by":"crossref","unstructured":"Northcutt CG, Jiang L, Chuang IL (2021) Confident learning: Estimating uncertainty in dataset labels. JAIR 70:1373\u20131411","DOI":"10.1613\/jair.1.12125"},{"key":"413_CR24","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in python. JMLR 12:2825\u20132830"},{"key":"413_CR25","doi-asserted-by":"crossref","unstructured":"Quinlan JR (1999) Simplifying decision trees. Int J Hum Comput Stud 51(2):497\u2013510","DOI":"10.1006\/ijhc.1987.0321"},{"key":"413_CR26","volume-title":"ICLR","author":"A Raghunathan","year":"2018","unstructured":"Raghunathan A, Steinhardt J, Liang P (2018) Certified defenses against adversarial examples. In: ICLR"},{"key":"413_CR27","unstructured":"Rahimi A, Recht B (2007) Random features for large-scale kernel machines. NeurIPS 20:1177\u20131184"},{"key":"413_CR28","doi-asserted-by":"crossref","unstructured":"Rousseeuw PJ, van Driessen K (1999) A\u00a0fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212\u2013223","DOI":"10.1080\/00401706.1999.10485670"},{"key":"413_CR29","volume-title":"NeurIPS","author":"B Sch\u00f6lkopf","year":"1999","unstructured":"Sch\u00f6lkopf B, Williamson RC, Smola AJ, Shawe-Taylor J, Platt JC (1999) Support vector method for novelty detection. In: NeurIPS"},{"key":"413_CR30","volume-title":"Robust learning: Information theory and algorithms","author":"J Steinhardt","year":"2018","unstructured":"Steinhardt J (2018) Robust learning: Information theory and algorithms. Stanford University"},{"key":"413_CR31","volume-title":"PRICAI","author":"C Teng","year":"2000","unstructured":"Teng C (2000) Evaluating noise correction. In: PRICAI, vol 1886"},{"key":"413_CR32","doi-asserted-by":"crossref","unstructured":"Troyanskaya OG, Cantor MN, Sherlock G, Brown PO, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinform 17(6):520\u2013525","DOI":"10.1093\/bioinformatics\/17.6.520"},{"key":"413_CR33","doi-asserted-by":"crossref","unstructured":"Van Buuren S, Groothuis-Oudshoorn K (2011) mice: Multivariate imputation by chained equations in r. J\u00a0Stat Soft 45:1\u201367","DOI":"10.18637\/jss.v045.i03"},{"key":"413_CR34","volume-title":"Introduction to robust estimation and hypothesis testing","author":"RR Wilcox","year":"2011","unstructured":"Wilcox RR (2011) Introduction to robust estimation and hypothesis testing. Academic Press"},{"key":"413_CR35","volume-title":"SIGMOD","author":"W Wu","year":"2020","unstructured":"Wu W, Flokas L, Wu E, Wang J (2020) Complaint-driven training data debugging for query 2.0. In: SIGMOD"},{"key":"413_CR36","volume-title":"ICML","author":"H Zhang","year":"2017","unstructured":"Zhang H, Li J, Kara K, Alistarh D, Liu J, Zhang C (2017) Zipml: Training linear models with end-to-end low precision, and a\u00a0little bit of deep learning. In: ICML, vol 70"}],"container-title":["Datenbank-Spektrum"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s13222-022-00413-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s13222-022-00413-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s13222-022-00413-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,7,29]],"date-time":"2022-07-29T11:25:29Z","timestamp":1659093929000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s13222-022-00413-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,13]]},"references-count":36,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,7]]}},"alternative-id":["413"],"URL":"https:\/\/doi.org\/10.1007\/s13222-022-00413-2","relation":{},"ISSN":["1618-2162","1610-1995"],"issn-type":[{"value":"1618-2162","type":"print"},{"value":"1610-1995","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,13]]},"assertion":[{"value":"31 January 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 April 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 May 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}