{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T02:23:33Z","timestamp":1773887013213,"version":"3.50.1"},"reference-count":109,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,11,13]],"date-time":"2023-11-13T00:00:00Z","timestamp":1699833600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2023,11,13]]},"abstract":"<jats:p>In the exploratory data science lifecycle, data scientists often spent the majority of their time finding, integrating, validating and cleaning relevant datasets. Despite recent work on data validation, and numerous error detection and correction algorithms, in practice, data cleaning for ML remains largely a manual, unpleasant, and labor-intensive trial and error process, especially in large-scale, distributed computation. The target ML application---such as classification or regression models---can be used as a signal of valuable feedback though, for selecting effective data cleaning strategies. In this paper, we introduce SAGA, a framework for automatically generating the top-K most effective data cleaning pipelines. SAGA adopts ideas from Auto-ML, feature selection, and hyper-parameter tuning. Our framework is extensible for user-provided constraints, new data cleaning primitives, and ML applications; automatically generates hybrid runtime plans of local and distributed operations; and performs pruning by interesting properties (e.g., monotonicity). Instead of full automation---which is rather unrealistic---SAGA simplifies the mechanical aspects of data cleaning. 
Our experiments show that SAGA yields robust accuracy improvements over state-of-the-art, and good scalability regarding increasing data sizes and number of evaluated pipelines.<\/jats:p>","DOI":"10.1145\/3617338","type":"journal-article","created":{"date-parts":[[2023,11,13]],"date-time":"2023-11-13T22:28:39Z","timestamp":1699914519000},"page":"1-26","source":"Crossref","is-referenced-by-count":13,"title":["SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0031-9911","authenticated-orcid":false,"given":"Shafaq","family":"Siddiqi","sequence":"first","affiliation":[{"name":"Graz University of Technology, Graz, Austria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0202-6100","authenticated-orcid":false,"given":"Roman","family":"Kern","sequence":"additional","affiliation":[{"name":"Graz University of Technology, Graz, Austria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1344-3663","authenticated-orcid":false,"given":"Matthias","family":"Boehm","sequence":"additional","affiliation":[{"name":"Technische Universit\u00e4t Berlin, Berlin, Germany"}]}],"member":"320","published-online":{"date-parts":[[2023,11,13]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"crossref","unstructured":"Ziawasch Abedjan Lukasz Golab Felix Naumann and Thorsten Papenbrock. 2018. Data Profiling. In Synthesis Lectures on Data Management. http:\/\/sites.computer.org\/debull\/A18june\/p3.pdf","DOI":"10.1007\/978-3-031-01865-7"},{"key":"e_1_2_2_2_1","volume-title":"QueryER: A Framework for Fast Analysis-Aware Deduplication over Dirty Data. CoRR","author":"Alexiou Giorgos","year":"2022","unstructured":"Giorgos Alexiou, George Papastefanatos, Vassilis Stamatopoulos, Georgia Koutrika, and Nectarios Koziris. 2022. QueryER: A Framework for Fast Analysis-Aware Deduplication over Dirty Data. CoRR, Vol. abs\/2202.01546 (2022). 
https:\/\/arxiv.org\/abs\/2202.01546"},{"key":"e_1_2_2_3_1","unstructured":"ASQ\/ANSI\/ISO. 2015. 9001:2015: Quality management systems - Requirements."},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","unstructured":"Denis Baylor Eric Breck Heng-Tze Cheng Noah Fiedel Chuan Yu Foo Zakaria Haque Salem Haykal Mustafa Ispir Vihan Jain Levent Koc Chiu Yuen Koo Lukasz Lew Clemens Mewald Akshay Naresh Modi Neoklis Polyzotis Sukriti Ramesh Sudip Roy Steven Euijong Whang Martin Wicke Jarek Wilkiewicz Xin Zhang and Martin Zinkevich. 2017. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. In KDD. 1387--1395. https:\/\/doi.org\/10.1145\/3097983.3098021","DOI":"10.1145\/3097983.3098021"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/2503308.2188395"},{"key":"e_1_2_2_6_1","first-page":"695","article-title":"Generic Schema Matching, Ten Years Later","volume":"4","author":"Bernstein Philip A.","year":"2011","unstructured":"Philip A. Bernstein, Jayant Madhavan, and Erhard Rahm. 2011. Generic Schema Matching, Ten Years Later. PVLDB, Vol. 4, 11 (2011), 695--701. http:\/\/www.vldb.org\/pvldb\/vol4\/p695-bernstein_madhavan_rahm.pdf","journal-title":"PVLDB"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1247480.1247482"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313602"},{"key":"e_1_2_2_9_1","volume-title":"JMLR","volume":"20","author":"Bie\u00dfmann Felix","year":"2019","unstructured":"Felix Bie\u00dfmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, and David Salinas. 2019. DataWig: Missing Value Imputation for Tables. JMLR, Vol. 20 (2019). http:\/\/jmlr.org\/papers\/v20\/18--753.html"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209889.3209891"},{"key":"e_1_2_2_11_1","volume-title":"Kevin Innerebner, Florijan Klezin, Stefanie N. 
Lindstaedt, Arnab Phani, Benjamin Rath, Berthold Reinwald, Shafaq Siddiqui, and Sebastian Benjamin Wrede.","author":"Boehm Matthias","year":"2020","unstructured":"Matthias Boehm, Iulian Antonov, Sebastian Baunsgaard, Mark Dokter, Robert Ginth\u00f6r, Kevin Innerebner, Florijan Klezin, Stefanie N. Lindstaedt, Arnab Phani, Benjamin Rath, Berthold Reinwald, Shafaq Siddiqui, and Sebastian Benjamin Wrede. 2020. SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle. In CIDR. http:\/\/cidrdb.org\/cidr2020\/papers\/p22-boehm-cidr20.pdf"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732286.2732292"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","unstructured":"Christoph B\u00f6hm Gerard de Melo Felix Naumann and Gerhard Weikum. 2012. LINDA: distributed web-of-data-scale entity matching. In CIKM. 2104--2108. https:\/\/doi.org\/10.1145\/2396761.2398582","DOI":"10.1145\/2396761.2398582"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1516360.1516494"},{"key":"e_1_2_2_15_1","unstructured":"Eric Breck Neoklis Polyzotis Sudip Roy Steven Whang and Martin Zinkevich. 2019. Data Validation for Machine Learning. In MLSys. https:\/\/proceedings.mlsys.org\/book\/267.pdf"},{"key":"e_1_2_2_16_1","volume-title":"A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. CoRR","author":"Brochu Eric","year":"2010","unstructured":"Eric Brochu, Vlad M. Cora, and Nando de Freitas. 2010. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. CoRR (2010). 
http:\/\/arxiv.org\/abs\/1012.2599"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1561\/2200000024"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcss.2018.09.001"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137641"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","unstructured":"Emily Caveness Paul Suganthan G. C. Zhuo Peng Neoklis Polyzotis Sudip Roy and Martin Zinkevich. 2020. TensorFlow Data Validation: Data Analysis and Validation in Continuous ML Pipelines. In SIGMOD. 2793--2796. https:\/\/doi.org\/10.1145\/3318464.3384707","DOI":"10.1145\/3318464.3384707"},{"key":"e_1_2_2_21_1","unstructured":"Austin Animal Center. 2022. Shelter Animal Outcomes competition dataset from Kaggle. https:\/\/www.kaggle.com\/competitions\/shelter-animal-outcomes\/data"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1613\/jair.953"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","unstructured":"Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In SIGKDD. 785--794. https:\/\/doi.org\/10.1145\/2939672.2939785","DOI":"10.1145\/2939672.2939785"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/2983200.2983203"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824109"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2019.2916074"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00020"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","unstructured":"Can Cui Wei Wang Meihui Zhang Gang Chen Zhaojing Luo and Beng Chin Ooi. 2021. AlphaEvolve: A Learning Framework to Discover Novel Alphas in Quantitative Investment. In SIGMOD. 2208--2216. https:\/\/doi.org\/10.1145\/3448016.3457324","DOI":"10.1145\/3448016.3457324"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","unstructured":"Michele Dallachiesa Amr Ebaid Ahmed Eldawy Ahmed K. Elmagarmid Ihab F. 
Ilyas Mourad Ouzzani and Nan Tang. 2013. NADEEF: a commodity data cleaning system. In SIGMOD. 541--552. https:\/\/doi.org\/10.1145\/2463676.2465327","DOI":"10.1145\/2463676.2465327"},{"key":"e_1_2_2_30_1","unstructured":"Data.Nashville.gov. 2020. Nashville Traffic Accidents Dataset. https:\/\/data.nashville.gov\/Police\/Traffic-Accidents\/6v6w-hpcw"},{"key":"e_1_2_2_31_1","unstructured":"Delve Datasets. 2022. Puma Dataset. https:\/\/www.cs.toronto.edu\/~delve\/data\/datasets.html"},{"key":"e_1_2_2_32_1","unstructured":"data.world. 2016. OLS Regression Challenge - Cancer. https:\/\/data.world\/nrippner\/ols-regression-challenge"},{"key":"e_1_2_2_33_1","volume-title":"Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang.","author":"Deng Dong","year":"2017","unstructured":"Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017. The Data Civilizer System. (2017). http:\/\/cidrdb.org\/cidr2017\/papers\/p44-deng-cidr17.pdf"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","unstructured":"Mike Dreves Gene Huang Zhuo Peng Neoklis Polyzotis Evan Rosen and Paul Suganthan G. C. 2020. From Data to Models and Back. In DEEM@SIGMOD Workshop. https:\/\/doi.org\/10.1145\/3399579.3399868","DOI":"10.1145\/3399579.3399868"},{"key":"e_1_2_2_35_1","volume-title":"BOHB: Robust and Efficient Hyperparameter Optimization at Scale. In ICML. 1436--1445","author":"Falkner Stefan","year":"2018","unstructured":"Stefan Falkner, Aaron Klein, and Frank Hutter. 2018. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. In ICML. 1436--1445. http:\/\/proceedings.mlr.press\/v80\/falkner18a.html"},{"key":"e_1_2_2_36_1","volume-title":"The Next Generation. 
CoRR","author":"Feurer Matthias","year":"2020","unstructured":"Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2020. Auto-Sklearn 2.0: The Next Generation. CoRR (2020). https:\/\/arxiv.org\/abs\/2007.04074"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978--3-030-05318--5"},{"key":"e_1_2_2_38_1","unstructured":"Nicol\u00f3 Fusi Rishit Sheth and Melih Elibol. 2018. Probabilistic Matrix Factorization for Automated Machine Learning. In NeurIPS. 3352--3361. https:\/\/proceedings.neurips.cc\/paper\/2018\/file\/b59a51a3c0bf9c5228fde841714f523a-Paper.pdf"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452747"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457287"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2005.140"},{"key":"e_1_2_2_42_1","unstructured":"Chicago health services. 2022. Chicago Food Inspection Dataset. https:\/\/data.cityofchicago.org\/Health-Human-Services\/Food-Inspections\/4ijn-s7e5\/data"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","unstructured":"Yuval Heffetz Roman Vainshtein Gilad Katz and Lior Rokach. 2020. DeepLine: AutoML Tool for Pipelines Generation using Deep Reinforcement Learning and Hierarchical Actions Filtering. In KDD. 2103--2113. https:\/\/doi.org\/10.1145\/3394486.3403261","DOI":"10.1145\/3394486.3403261"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319888"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","unstructured":"Christoph Hube Besnik Fetahu and Ujwal Gadiraju. 2019. Understanding and Mitigating Worker Biases in the Crowdsourced Collection of Subjective Judgments. In CHI. ACM 407. https:\/\/doi.org\/10.1145\/3290605.3300637","DOI":"10.1145\/3290605.3300637"},{"key":"e_1_2_2_46_1","volume-title":"Non-stochastic Best Arm Identification and Hyperparameter Optimization. 
In AISTATS (JMLR Workshop and Conference Proceedings","volume":"248","author":"Kevin","unstructured":"Kevin G. Jamieson and Ameet Talwalkar. 2016. Non-stochastic Best Arm Identification and Hyperparameter Optimization. In AISTATS (JMLR Workshop and Conference Proceedings, Vol. 51). 240--248. http:\/\/proceedings.mlr.press\/v51\/jamieson16.html"},{"key":"e_1_2_2_47_1","unstructured":"Kaggle. 2022a. House Prices - Advanced Regression Techniques. https:\/\/www.kaggle.com\/competitions\/house-prices-advanced-regression-techniques\/data"},{"key":"e_1_2_2_48_1","unstructured":"Kaggle. 2022b. Titanic Dataset. https:\/\/www.kaggle.com\/competitions\/titanic\/data"},{"key":"e_1_2_2_49_1","volume-title":"Barnab\u00e1s P\u00f3czos, and Eric P. Xing","author":"Kandasamy Kirthevasan","year":"2018","unstructured":"Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnab\u00e1s P\u00f3czos, and Eric P. Xing. 2018. Neural Architecture Search with Bayesian Optimisation and Optimal Transport. In NeurIPS. 2020--2029. https:\/\/proceedings.neurips.cc\/paper\/2018\/hash\/f33ba15effa5c10e873bf3842afb46a6-Abstract.html"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","unstructured":"Sean Kandel Andreas Paepcke Joseph M. Hellerstein and Jeffrey Heer. 2011. Wrangler: interactive visual specification of data transformation scripts. In CHI. 3363--3372. https:\/\/doi.org\/10.1145\/1978942.1979444","DOI":"10.1145\/1978942.1979444"},{"key":"e_1_2_2_51_1","unstructured":"Daniel Kang Deepti Raghavan Peter Bailis and Matei Zaharia. 2020. Model Assertions for Monitoring and Improving ML Models. In MLSys. https:\/\/proceedings.mlsys.org\/book\/319.pdf"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403290"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.5555\/3430915.3442426"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","unstructured":"Zuhair Khayyat Ihab F. 
Ilyas Alekh Jindal Samuel Madden Mourad Ouzzani Paolo Papotti Jorge-Arnulfo Quian\u00e9-Ruiz Nan Tang and Si Yin. 2015. BigDansing: A System for Big Data Cleansing. In SIGMOD. 1215--1230. https:\/\/doi.org\/10.1145\/2723372.2747646","DOI":"10.1145\/2723372.2747646"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3007263.3007314"},{"key":"e_1_2_2_56_1","article-title":"Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA","volume":"18","author":"Kotthoff Lars","year":"2017","unstructured":"Lars Kotthoff, Chris Thornton, Holger H. Hoos, Frank Hutter, and Kevin Leyton-Brown. 2017. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. J. Mach. Learn. Res., Vol. 18 (2017), 25:1--25:5. http:\/\/jmlr.org\/papers\/v18\/16--261.html","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_2_57_1","volume-title":"BoostClean: Automated Error Detection and Repair for Machine Learning. CoRR","author":"Krishnan Sanjay","year":"2017","unstructured":"Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, and Eugene Wu. 2017. BoostClean: Automated Error Detection and Repair for Machine Learning. CoRR, Vol. abs\/1711.01299 (2017). http:\/\/arxiv.org\/abs\/1711.01299"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_2_2_59_1","unstructured":"Sanjay Krishnan and Eugene Wu. 2019. AlphaClean: Automatic Generation of Data Cleaning Pipelines. (2019). http:\/\/arxiv.org\/abs\/1904.11827"},{"key":"e_1_2_2_60_1","article-title":"Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization","volume":"18","author":"Li Lisha","year":"2017","unstructured":"Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2017. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. J. Mach. Learn. Res., Vol. 18 (2017), 185:1--185:52. http:\/\/jmlr.org\/papers\/v18\/16--558.html","journal-title":"J. Mach. Learn. 
Res."},{"key":"e_1_2_2_61_1","doi-asserted-by":"publisher","unstructured":"Peng Li Xi Rao Jennifer Blase Yue Zhang Xu Chu and Ce Zhang. 2021. CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks. In ICDE. 13--24. https:\/\/doi.org\/10.1109\/ICDE51399.2021.00009","DOI":"10.1109\/ICDE51399.2021.00009"},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3187009.3177737"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407801"},{"key":"e_1_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956"},{"key":"e_1_2_2_65_1","doi-asserted-by":"publisher","unstructured":"Chris Mayfield Jennifer Neville and Sunil Prabhakar. 2010. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD. 75--86. https:\/\/doi.org\/10.1145\/1807167.1807178","DOI":"10.1145\/1807167.1807178"},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/3329486.3329496"},{"key":"e_1_2_2_67_1","doi-asserted-by":"publisher","DOI":"10.14778\/3447689.3447691"},{"key":"e_1_2_2_68_1","first-page":"24","article-title":"From Cleaning before ML to Cleaning for ML","volume":"44","author":"Neutatz Felix","year":"2021","unstructured":"Felix Neutatz, Binger Chen, Ziawasch Abedjan, and Eugene Wu. 2021. From Cleaning before ML to Cleaning for ML. IEEE Data Eng. Bull., Vol. 44, 1 (2021), 24--41. http:\/\/sites.computer.org\/debull\/A21mar\/p24.pdf","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_2_69_1","doi-asserted-by":"publisher","DOI":"10.1007\/s13222-022-00413--2"},{"key":"e_1_2_2_70_1","doi-asserted-by":"publisher","unstructured":"Uchechukwu Njoku Besim Bilalli Alberto Abell\u00f3 and Gianluca Bontempi. 2023. Wrapper Methods for Multi-Objective Feature Selection. In EDBT. 697--709. 
https:\/\/doi.org\/10.48786\/edbt.2023.58","DOI":"10.48786\/edbt.2023.58"},{"key":"e_1_2_2_71_1","doi-asserted-by":"publisher","DOI":"10.1007\/978--3-030-05318--5_8"},{"key":"e_1_2_2_72_1","doi-asserted-by":"publisher","unstructured":"Yongjoo Park Jingyi Qing Xiaoyang Shen and Barzan Mozafari. 2019. BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees. In SIGMOD. 1135--1152. https:\/\/doi.org\/10.1145\/3299869.3300077","DOI":"10.1145\/3299869.3300077"},{"key":"e_1_2_2_73_1","doi-asserted-by":"publisher","unstructured":"Eliana Pastor Elena Baralis and Luca de Alfaro. 2023. A Hierarchical Approach to Anomalous Subgroup Discovery. In ICDE. 2647--2659. https:\/\/doi.org\/10.1109\/ICDE55515.2023.00203","DOI":"10.1109\/ICDE55515.2023.00203"},{"key":"e_1_2_2_74_1","doi-asserted-by":"publisher","DOI":"10.1007\/978--3-030--58793--2_3"},{"key":"e_1_2_2_75_1","unstructured":"Hieu Pham Melody Y. Guan Barret Zoph Quoc V. Le and Jeff Dean. 2018. Efficient Neural Architecture Search via Parameter Sharing. In ICML. 4092--4101. http:\/\/proceedings.mlr.press\/v80\/pham18a.html"},{"key":"e_1_2_2_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452788"},{"key":"e_1_2_2_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3054782"},{"key":"e_1_2_2_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3220109"},{"key":"e_1_2_2_80_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3220109"},{"key":"e_1_2_2_81_1","doi-asserted-by":"publisher","DOI":"10.14778\/3352063.3352068"},{"key":"e_1_2_2_82_1","volume-title":"Hellerstein","author":"Raman Vijayshankar","year":"2001","unstructured":"Vijayshankar Raman and Joseph M. Hellerstein. 2001. Potter's Wheel: An Interactive Data Cleaning System. In VLDB. 381--390. 
http:\/\/www.vldb.org\/conf\/2001\/P381.pdf"},{"key":"e_1_2_2_83_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_2_84_1","unstructured":"C\u00e9dric Renggli Bojan Karlas Bolin Ding Feng Liu Kevin Schawinski Wentao Wu and Ce Zhang. 2019. Continuous Integration of Machine Learning Models with ease.ml\/ci: Towards a Rigorous Yet Practical Treatment. In MLSys. https:\/\/proceedings.mlsys.org\/book\/266.pdf"},{"key":"e_1_2_2_85_1","unstructured":"UCI Repository. 2013. EEG Eye State Dataset. https:\/\/archive.ics.uci.edu\/ml\/datasets\/EEGEyeState"},{"key":"e_1_2_2_86_1","doi-asserted-by":"publisher","unstructured":"Svetlana Sagadeeva and Matthias Boehm. 2021. SliceLine: Fast Linear-Algebra-based Slice Finding for ML Model Debugging. In SIGMOD. 2290--2299. https:\/\/doi.org\/10.1145\/3448016.3457323","DOI":"10.1145\/3448016.3457323"},{"key":"e_1_2_2_87_1","doi-asserted-by":"publisher","DOI":"10.14778\/3461535.3463474"},{"key":"e_1_2_2_88_1","doi-asserted-by":"publisher","unstructured":"Sebastian Schelter Felix Bie\u00dfmann Dustin Lange Tammo Rukat Philipp Schmidt Stephan Seufert Pierre Brunelle and Andrey Taptunov. 2019. Unit Testing Data with Deequ. In SIGMOD. 1993--1996. https:\/\/doi.org\/10.1145\/3299869.3320210","DOI":"10.1145\/3299869.3320210"},{"key":"e_1_2_2_89_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3229867"},{"key":"e_1_2_2_90_1","doi-asserted-by":"publisher","unstructured":"Sebastian Schelter Tammo Rukat and Felix Bie\u00dfmann. 2020. Learning to Validate the Predictions of Black Box Classifiers on Unseen Data. In SIGMOD. 1289--1299. https:\/\/doi.org\/10.1145\/3318464.3380604","DOI":"10.1145\/3318464.3380604"},{"key":"e_1_2_2_91_1","doi-asserted-by":"publisher","unstructured":"Sebastian Schelter Tammo Rukat and Felix Biessmann. 2021. JENGA - A Framework to Study the Impact of Data Errors on the Predictions of Machine Learning Models. In EDBT. 529--534. 
https:\/\/doi.org\/10.5441\/002\/edbt.2021.63","DOI":"10.5441\/002"},{"key":"e_1_2_2_92_1","doi-asserted-by":"publisher","DOI":"10.1145\/3068335"},{"key":"e_1_2_2_93_1","doi-asserted-by":"publisher","unstructured":"Zeyuan Shang Emanuel Zgraggen Benedetto Buratti Ferdinand Kossmann Philipp Eichmann Yeounoh Chung Carsten Binnig Eli Upfal and Tim Kraska. 2019. Democratizing Data Science through Interactive Curation of ML Pipelines. In SIGMOD. 1171--1188. https:\/\/doi.org\/10.1145\/3299869.3319863","DOI":"10.1145\/3299869.3319863"},{"key":"e_1_2_2_94_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465247"},{"key":"e_1_2_2_95_1","doi-asserted-by":"publisher","unstructured":"Alkis Simitsis Kevin Wilkinson Mal\u00fa Castellanos and Umeshwar Dayal. 2009. QoX-driven ETL design: reducing the cost of ETL consulting engagements. In SIGMOD. 953--960. https:\/\/doi.org\/10.1145\/1559845.1559954","DOI":"10.1145\/1559845.1559954"},{"key":"e_1_2_2_96_1","doi-asserted-by":"crossref","unstructured":"Alkis Simitsis Kevin Wilkinson and Petar Jovanovic. 2013. xPAD: a platform for analytic data flows. In SIGMOD. 1109--1112.","DOI":"10.1145\/2463676.2465247"},{"key":"e_1_2_2_97_1","doi-asserted-by":"publisher","unstructured":"Evan R. Sparks Ameet Talwalkar Daniel Haas Michael J. Franklin Michael I. Jordan and Tim Kraska. 2015. Automating Model Search for Large Scale Machine Learning. In SoCC. 368--380. https:\/\/doi.org\/10.1145\/2806777.2806945","DOI":"10.1145\/2806777.2806945"},{"key":"e_1_2_2_98_1","volume-title":"CIDR","author":"Stonebraker Michael","unstructured":"Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch Cherniack, Stanley B. Zdonik, Alexander Pagan, and Shan Xu. 2013. Data Curation at Scale: The Data Tamer System. In CIDR. 
http:\/\/cidrdb.org\/cidr2013\/Papers\/CIDR13_Paper28.pdf"},{"key":"e_1_2_2_99_1","first-page":"3","article-title":"Data Integration: The Current Status and the Way Forward","volume":"41","author":"Stonebraker Michael","year":"2018","unstructured":"Michael Stonebraker and Ihab F. Ilyas. 2018. Data Integration: The Current Status and the Way Forward. IEEE Data Eng. Bull., Vol. 41 (2018), 3--9.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_2_100_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452792"},{"key":"e_1_2_2_101_1","doi-asserted-by":"publisher","unstructured":"Chris Thornton Frank Hutter Holger H. Hoos and Kevin Leyton-Brown. 2013. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In KDD. 847--855. https:\/\/doi.org\/10.1145\/2487575.2487629","DOI":"10.1145\/2487575.2487629"},{"key":"e_1_2_2_102_1","first-page":"1","article-title":"mice: Multivariate Imputation by Chained Equations in R","volume":"45","author":"van Buuren Stef","year":"2011","unstructured":"Stef van Buuren and Karin Groothuis-Oudshoorn. 2011. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, Vol. 45, 3 (2011), 1--67. https:\/\/www.jstatsoft.org\/index.php\/jss\/article\/view\/v045i03","journal-title":"Journal of Statistical Software"},{"key":"e_1_2_2_103_1","doi-asserted-by":"publisher","DOI":"10.14778\/3297753.3297763"},{"key":"e_1_2_2_104_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2463706"},{"key":"e_1_2_2_105_1","doi-asserted-by":"publisher","DOI":"10.14778\/1952376.1952378"},{"key":"e_1_2_2_106_1","volume-title":"An Architecture for Fast and General Data Processing on Large Clusters. Ph.D. Dissertation","author":"Zaharia Matei A.","unstructured":"Matei A. Zaharia. 2013. An Architecture for Fast and General Data Processing on Large Clusters. Ph.D. Dissertation. 
University of California, Berkeley, USA."},{"key":"e_1_2_2_107_1","doi-asserted-by":"publisher","unstructured":"Amrapali Zaveri and Anisa Rula. 2019. Data Quality and Data Cleansing of Semantic Data. (2019). https:\/\/doi.org\/10.1007\/978--3--319--63962--8_289--1","DOI":"10.1007\/978--3--319--63962--8_289--1"},{"key":"e_1_2_2_108_1","doi-asserted-by":"publisher","DOI":"10.14778\/3115404.3115410"},{"key":"e_1_2_2_109_1","volume-title":"How Good Are Machine Learning Clouds for Binary Classification with Good Features? CoRR","author":"Zhang Hantian","year":"2017","unstructured":"Hantian Zhang, Luyuan Zeng, Wentao Wu, and Ce Zhang. 2017b. How Good Are Machine Learning Clouds for Binary Classification with Good Features? CoRR, Vol. abs\/1707.09562 (2017). http:\/\/arxiv.org\/abs\/1707.09562"},{"key":"e_1_2_2_110_1","volume-title":"Le","author":"Zoph Barret","year":"2017","unstructured":"Barret Zoph and Quoc V. Le. 2017. Neural Architecture Search with Reinforcement Learning. In ICLR. https:\/\/openreview.net\/forum?id=r1Ue8Hcxg"}],"container-title":["Proceedings of the ACM on Management of 
Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3617338","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3617338","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:45:53Z","timestamp":1750178753000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3617338"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,13]]},"references-count":109,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,11,13]]}},"alternative-id":["10.1145\/3617338"],"URL":"https:\/\/doi.org\/10.1145\/3617338","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11,13]]}}}