{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T15:59:41Z","timestamp":1776441581134,"version":"3.51.2"},"reference-count":277,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2025,1,24]],"date-time":"2025-01-24T00:00:00Z","timestamp":1737676800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSF","award":["IIS-2224843, ITE-2429680"],"award-info":[{"award-number":["IIS-2224843, ITE-2429680"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2025,5,31]]},"abstract":"<jats:p>\n            Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of\n            <jats:italic>data-centric AI<\/jats:italic>\n            . The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. 
We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/daochenzha\/data-centric-AI\">https:\/\/github.com\/daochenzha\/data-centric-AI<\/jats:ext-link>\n            .\n          <\/jats:p>\n          <jats:p\/>","DOI":"10.1145\/3711118","type":"journal-article","created":{"date-parts":[[2025,1,6]],"date-time":"2025-01-06T11:29:56Z","timestamp":1736162996000},"page":"1-42","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":145,"title":["Data-centric Artificial Intelligence: A Survey"],"prefix":"10.1145","volume":"57","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6677-7504","authenticated-orcid":false,"given":"Daochen","family":"Zha","sequence":"first","affiliation":[{"name":"Computer Science, Rice University, Houston, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8331-3154","authenticated-orcid":false,"given":"Zaid Pervaiz","family":"Bhat","sequence":"additional","affiliation":[{"name":"Computer Science, Texas A&amp;M University, College Station, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8933-7117","authenticated-orcid":false,"given":"Kwei-Herng","family":"Lai","sequence":"additional","affiliation":[{"name":"Computer Science, Rice University, Houston, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3442-754X","authenticated-orcid":false,"given":"Fan","family":"Yang","sequence":"additional","affiliation":[{"name":"Computer Science, Rice University, Houston, United 
States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6933-3952","authenticated-orcid":false,"given":"Zhimeng","family":"Jiang","sequence":"additional","affiliation":[{"name":"Computer Science, Texas A&amp;M University, College Station, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-7289-3667","authenticated-orcid":false,"given":"Shaochen","family":"Zhong","sequence":"additional","affiliation":[{"name":"Computer Science, Rice University, Houston, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2234-3226","authenticated-orcid":false,"given":"Xia","family":"Hu","sequence":"additional","affiliation":[{"name":"Computer Science, Rice University, Houston, United States"}]}],"member":"320","published-online":{"date-parts":[[2025,1,24]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Mohamed Abdelaal Christian Hammacher and Harald Schoening. 2023. REIN: A comprehensive benchmark framework for data cleaning methods in ML pipelines. Retrieved from https:\/\/arXiv:2302.04702"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1002\/wics.101"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3328526.3329589"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.3390\/technologies9030052"},{"issue":"1","key":"e_1_3_1_6_2","first-page":"1","article-title":"Data normalization and standardization: A technical report","volume":"1","author":"Ali Peshawa Jamal Muhammad","year":"2014","unstructured":"Peshawa Jamal Muhammad Ali, Rezhna H. Faraj, Erbil Koya, Peshawa J. Muhammad Ali, and Rezhna H. Faraj. 2014. Data normalization and standardization: A technical report. Mach. Learn. Tech. Rep. 1, 1 (2014), 1\u20136.","journal-title":"Mach. Learn. Tech. Rep."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/4236.612229"},{"issue":"2","key":"e_1_3_1_8_2","first-page":"47","article-title":"Benchmarking data curation systems.","volume":"39","author":"Arocena Patricia C.","year":"2016","unstructured":"Patricia C. 
Arocena, Boris Glavic, Giansalvatore Mecca, Ren\u00e9e J. Miller, Paolo Papotti, and Donatello Santoro. 2016. Benchmarking data curation systems. IEEE Data Eng. Bull. 39, 2 (2016), 47\u201362.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3517337"},{"issue":"2","key":"e_1_3_1_10_2","first-page":"18","article-title":"Feature selection based on information gain","volume":"2","author":"Azhagusundari B.","year":"2013","unstructured":"B. Azhagusundari, Antony Selvadoss Thanamani et\u00a0al. 2013. Feature selection based on information gain. Int. J. Innov. Technol. Explor. Eng. 2, 2 (2013), 18\u201321.","journal-title":"Int. J. Innov. Technol. Explor. Eng."},{"key":"e_1_3_1_11_2","article-title":"Regularized learning for domain adaptation under label shifts","author":"Azizzadenesheli Kamyar","year":"2019","unstructured":"Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. 2019. Regularized learning for domain adaptation under label shifts. In Proceedings of the ICLR (2019).","journal-title":"Proceedings of the ICLR"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00041"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-24628-9_16"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.softx.2020.100456"},{"key":"e_1_3_1_15_2","unstructured":"Matias Barenstein. 2019. Propublica\u2019s compas data revisited. 
Retrieved from https:\/\/arXiv:1906.04711"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/HICSS.1995.375612"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/1541880.1541883"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3097983.3098021"},{"issue":"2","key":"e_1_3_1_19_2","first-page":"90","article-title":"A step towards global counterfactual explanations: Approximating the feature space through hierarchical division and graph search.","volume":"1","author":"Becker Maximilian","year":"2021","unstructured":"Maximilian Becker, Nadia Burkart, Pascal Birnstill, and J\u00fcrgen Beyerer. 2021. A step towards global counterfactual explanations: Approximating the feature space through hierarchical division and graph search. Adv. Artif. Intell. Mach. Learn. 1, 2 (2021), 90\u2013110.","journal-title":"Adv. Artif. Intell. Mach. Learn."},{"key":"e_1_3_1_20_2","unstructured":"Eyal Betzalel Coby Penso Aviv Navon and Ethan Fetaya. 2022. A study on the evaluation of generative models. Retrieved from https:\/\/arXiv:2206.10935"},{"key":"e_1_3_1_21_2","volume-title":"Proceedings of the CIDR","author":"Bhardwaj Anant","year":"2015","unstructured":"Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya G. Parameswaran. 2015. Datahub: Collaborative data science & dataset version management at scale. In Proceedings of the CIDR."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-40994-3_25"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4842-4470-8_18"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.3390\/app10113933"},{"key":"e_1_3_1_25_2","unstructured":"Pierre Blanchart. 2021. An exact counterfactual-example-based approach to tree-ensemble models interpretability. 
Retrieved from https:\/\/arXiv:2105.14820"},{"key":"e_1_3_1_26_2","volume-title":"Proceedings of the ICLR","author":"Boecking Benedikt","year":"2021","unstructured":"Benedikt Boecking, Willie Neiswanger, Eric Xing, and Artur Dubrawski. 2021. Interactive weak supervision: Learning useful heuristics for data labeling. In Proceedings of the ICLR."},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00067"},{"key":"e_1_3_1_28_2","article-title":"Conditional functional dependencies for data cleaning","author":"Bohannon Philip","year":"2006","unstructured":"Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. 2006. Conditional functional dependencies for data cleaning. Proceedings of the CDE.","journal-title":"Proceedings of the CDE"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376746"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbab354"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btl242"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2013.234"},{"key":"e_1_3_1_33_2","article-title":"Language models are few-shot learners","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell et\u00a0al. 2020. Language models are few-shot learners. In Proceedings of the NeurIPS.","journal-title":"Proceedings of the NeurIPS"},{"key":"e_1_3_1_34_2","first-page":"147592172211111","article-title":"A feature extraction & selection benchmark for structural health monitoring","author":"Buckley Tadhg","year":"2022","unstructured":"Tadhg Buckley, Bidisha Ghosh, and Vikram Pakrashi. 2022. A feature extraction & selection benchmark for structural health monitoring. Struct. Health Monitor. (2022), 14759217221111141.","journal-title":"Struct. 
Health Monitor."},{"key":"e_1_3_1_35_2","volume-title":"Proceedings of the FAccT","author":"Buolamwini Joy","year":"2018","unstructured":"Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the FAccT."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-7485-2_17"},{"key":"e_1_3_1_37_2","volume-title":"Proceedings of the AAAI","author":"Carreira-Perpin\u00e1n Miguel A.","year":"2021","unstructured":"Miguel A. Carreira-Perpin\u00e1n and Suryabhan Singh Hada. 2021. Counterfactual explanations for oblique decision trees: Exact, efficient algorithms. In Proceedings of the AAAI."},{"key":"e_1_3_1_38_2","volume-title":"Proceedings of the VLDB","author":"Chaudhuri Surajit","year":"1997","unstructured":"Surajit Chaudhuri and Vivek R. Narasayya. 1997. An efficient, cost-driven index selection tool for Microsoft SQL server. In Proceedings of the VLDB."},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767949"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.5555\/1622407.1622416"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.194"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3128572.3140448"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58135-0_3"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1111\/1754-9485.13261"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2019.12.130"},{"key":"e_1_3_1_46_2","first-page":"603","article-title":"Natural language processing","author":"Chowdhary K. R.","year":"2020","unstructured":"K. R. Chowdhary. 2020. Natural language processing. Fund. Artific. Intell. (2020), 603\u2013649.","journal-title":"Fund. Artific. 
Intell."},{"key":"e_1_3_1_47_2","volume-title":"Proceedings of the NeurIPS","author":"Christiano Paul F.","year":"2017","unstructured":"Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.14778\/2536258.2536262"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00139"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1093\/database\/bax061"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.5555\/1622737.1622744"},{"key":"e_1_3_1_52_2","volume-title":"Proceedings of the CVPR","author":"Cubuk Ekin D.","year":"2019","unstructured":"Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. 2019. Autoaugment: Learning augmentation policies from data. In Proceedings of the CVPR."},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58112-1_31"},{"key":"e_1_3_1_54_2","volume-title":"Proceedings of the COLT","author":"Dekel Ofer","year":"2009","unstructured":"Ofer Dekel and Ohad Shamir. 2009. Vox Populi: Collecting high-quality labels from a crowd. In Proceedings of the COLT."},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_56_2","unstructured":"Meghana Deodhar Xiao Ma Yixin Cai Alex Koes Alex Beutel and Jilin Chen. 2022. A Human-ML collaboration framework for improving video content reviews. Retrieved from https:\/\/arXiv:2210.09500"},{"issue":"2","key":"e_1_3_1_57_2","first-page":"119","article-title":"Toward a taxonomy of visuals in science communication","volume":"58","author":"Desnoyers Luc","year":"2011","unstructured":"Luc Desnoyers. 2011. Toward a taxonomy of visuals in science communication. Techn. Commun. 58, 2 (2011), 119\u2013134.","journal-title":"Techn. 
Commun."},{"key":"e_1_3_1_58_2","unstructured":"Amit Dhurandhar Tejaswini Pedapati Avinash Balakrishnan Pin-Yu Chen Karthikeyan Shanmugam and Ruchir Puri. 2019. Model agnostic contrastive explanations for structured data. Retrieved from https:\/\/arXiv:1906.00117"},{"key":"e_1_3_1_59_2","volume-title":"Proceedings of the NeurIPS","author":"Ding Frances","year":"2021","unstructured":"Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. 2021. Retiring adult: New datasets for fair machine learning. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/3575637.3575646"},{"key":"e_1_3_1_61_2","unstructured":"Sirui Ding Ruixiang Tang Daochen Zha Na Zou Kai Zhang Xiaoqian Jiang and Xia Hu. 2023. Fairly predicting graft failure in liver transplant for organ assigning. Retrieved from https:\/\/arXiv:2302.09400"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3539597.3570368"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00040"},{"key":"e_1_3_1_64_2","unstructured":"Iddo Drori Yamuna Krishnamurthy Remi Rampin Raoni de Paula Lourenco Jorge Piazentin Ono Kyunghyun Cho Claudio Silva and Juliana Freire. 2021. AlphaD3M: Machine learning pipeline synthesis. Retrieved from https:\/\/arXiv:2111.02508"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687767"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2019.2944182"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00175"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/TETC.2014.2330519"},{"key":"e_1_3_1_69_2","first-page":"877","article-title":"A brief review of domain adaptation","author":"Farahani Abolfazl","year":"2021","unstructured":"Abolfazl Farahani, Sahar Voghoei, Khaled Rasheed, and Hamid R. Arabnia. 2021. A brief review of domain adaptation. In Proceedings from ICDATA and IKE. 
877\u2013894.","journal-title":"Proceedings from ICDATA and IKE"},{"key":"e_1_3_1_70_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.84"},{"key":"e_1_3_1_71_2","volume-title":"Proceedings of the ICDE","author":"Fernandez Raul Castro","year":"2018","unstructured":"Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. 2018. Aurum: A data discovery system. In Proceedings of the ICDE."},{"key":"e_1_3_1_72_2","volume-title":"Proceedings of the NeurIPS","author":"Feurer Matthias","year":"2015","unstructured":"Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and robust automated machine learning. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_73_2","unstructured":"Apache Software Foundation. 2023. Hadoop. Retrieved from https:\/\/hadoop.apache.org"},{"key":"e_1_3_1_74_2","doi-asserted-by":"publisher","DOI":"10.1177\/15291006211051956"},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISBI.2018.8363576"},{"key":"e_1_3_1_76_2","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457334"},{"key":"e_1_3_1_77_2","volume-title":"Proceedings of the CIKM Workshop","author":"Gamboa Edwin","year":"2022","unstructured":"Edwin Gamboa, Alejandro Libreros, Matthias Hirth, and Dan Dubiner. 2022. Human-AI collaboration for improving the identification of cars for autonomous driving. In Proceedings of the CIKM Workshop."},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.295"},{"key":"e_1_3_1_79_2","volume-title":"Proceedings of the ICML","author":"Ghorbani Amirata","year":"2020","unstructured":"Amirata Ghorbani, Michael Kim, and James Zou. 2020. A distributional framework for data valuation. In Proceedings of the ICML."},{"key":"e_1_3_1_80_2","volume-title":"Proceedings of the ICML","author":"Ghorbani Amirata","year":"2019","unstructured":"Amirata Ghorbani and James Zou. 2019. 
Data shapley: Equitable valuation of data for machine learning. In Proceedings of the ICML."},{"key":"e_1_3_1_81_2","unstructured":"Pieter Gijsbers Marcos L. P. Bueno Stefan Coors Erin LeDell S\u00e9bastien Poirier Janek Thomas Bernd Bischl and Joaquin Vanschoren. 2022. Amlb: An automl benchmark. Retrieved from https:\/\/arXiv:2207.12560"},{"key":"e_1_3_1_82_2","doi-asserted-by":"publisher","DOI":"10.1145\/3422622"},{"issue":"4","key":"e_1_3_1_83_2","first-page":"5","article-title":"Covariate shift by kernel mean matching","volume":"3","author":"Gretton Arthur","year":"2009","unstructured":"Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Sch\u00f6lkopf. 2009. Covariate shift by kernel mean matching. Dataset Shift Mach. Learn. 3, 4 (2009), 5.","journal-title":"Dataset Shift Mach. Learn."},{"key":"e_1_3_1_84_2","unstructured":"Georges G. Grinstein Patrick Hoffman Ronald M. Pickett and Sharon J. Laskowski. 2002. Benchmark Development for the Evaluation of Visualization for Data Mining. Morgan Kaufmann Publishers Inc. San Francisco CA USA 129\u2013176."},{"key":"e_1_3_1_85_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-24844-6_87"},{"key":"e_1_3_1_86_2","unstructured":"Keren Gu Brandon Yang Jiquan Ngiam Quoc Le and Jonathon Shlens. 2019. Using videos to evaluate image model robustness. Retrieved from https:\/\/arXiv:1904.10076"},{"key":"e_1_3_1_87_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBME.2021.3117407"},{"key":"e_1_3_1_88_2","doi-asserted-by":"publisher","DOI":"10.2307\/j.ctv14jx6sm"},{"key":"e_1_3_1_89_2","volume-title":"Proceedings of the ICML","author":"Han Xiaotian","year":"2022","unstructured":"Xiaotian Han, Zhimeng Jiang, Ninghao Liu, and Xia Hu. 2022. G-Mixup: Graph data augmentation for graph classification. 
In Proceedings of the ICML."},{"key":"e_1_3_1_90_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.eacl-main.316"},{"key":"e_1_3_1_91_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00059"},{"key":"e_1_3_1_92_2","volume-title":"Proceedings of the WCCI","author":"He Haibo","year":"2008","unstructured":"Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the WCCI."},{"key":"e_1_3_1_93_2","doi-asserted-by":"publisher","DOI":"10.1145\/2983323.2983835"},{"key":"e_1_3_1_94_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403261"},{"key":"e_1_3_1_95_2","doi-asserted-by":"publisher","DOI":"10.1145\/2661829.2661885"},{"key":"e_1_3_1_96_2","article-title":"Benchmarking neural network robustness to common corruptions and perturbations","author":"Hendrycks Dan","year":"2019","unstructured":"Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the ICLR.","journal-title":"Proceedings of the ICLR"},{"key":"e_1_3_1_97_2","volume-title":"Proceedings of the CIDR","author":"Herodotou Herodotos","year":"2011","unstructured":"Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu. 2011. Starfish: A self-tuning system for big data analytics. In Proceedings of the CIDR."},{"key":"e_1_3_1_98_2","volume-title":"Proceedings of the NeurIPS","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Proceedings of the NeurIPS."},{"issue":"47","key":"e_1_3_1_99_2","first-page":"1","article-title":"Cascaded diffusion models for high fidelity image generation.","volume":"23","author":"Ho Jonathan","year":"2022","unstructured":"Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. 2022. 
Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 23, 47 (2022), 1\u201333.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_1_100_2","volume-title":"Proceedings of the ICLR","author":"Hooper Sarah","year":"2021","unstructured":"Sarah Hooper, Michael Wornow, Ying Hang Seah, Peter Kellman, Hui Xue, Frederic Sala, Curtis Langlotz, and Christopher Re. 2021. Cut out the annotator, keep the cutout: Better segmentation with weak supervision. In Proceedings of the ICLR."},{"key":"e_1_3_1_101_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2017.8268911"},{"key":"e_1_3_1_102_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0254841"},{"key":"e_1_3_1_103_2","doi-asserted-by":"publisher","DOI":"10.3389\/fdata.2021.693674"},{"key":"e_1_3_1_104_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3406477"},{"key":"e_1_3_1_105_2","unstructured":"Johannes Jakubik Michael V\u00f6ssing Niklas K\u00fchl Jannis Walk and Gerhard Satzger. 2022. Data-centric artificial intelligence. Retrieved from https:\/\/arXiv:2212.11854"},{"key":"e_1_3_1_106_2","unstructured":"Mohammad Hossein Jarrahi Ali Memariani and Shion Guha. 2022. The principles of data-centric AI (DCAI). Retrieved from https:\/\/arXiv:2211.14611"},{"key":"e_1_3_1_107_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00814"},{"key":"e_1_3_1_108_2","unstructured":"Minqi Jiang Chaochuan Hou Ao Zheng Xiyang Hu Songqiao Han Hailiang Huang Xiangnan He Philip S. Yu and Yue Zhao. 2023. Weakly supervised anomaly detection: A survey. Retrieved from https:\/\/arXiv:2302.04549"},{"key":"e_1_3_1_109_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00324"},{"key":"e_1_3_1_110_2","volume-title":"Proceedings of the ICLR","author":"Jiang Zhimeng","year":"2022","unstructured":"Zhimeng Jiang, Kaixiong Zhou, Zirui Liu, Li Li, Rui Chen, Soo-Hyun Choi, and Xia Hu. 2022. An information fusion approach to learning with instance-dependent label noise. 
In Proceedings of the ICLR."},{"key":"e_1_3_1_111_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-021-03819-2"},{"key":"e_1_3_1_112_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2020\/395"},{"key":"e_1_3_1_113_2","article-title":"Chart-to-text: A large-scale benchmark for chart summarization","author":"Kanthara Shankar","year":"2022","unstructured":"Shankar Kanthara, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq Joty. 2022. Chart-to-text: A large-scale benchmark for chart summarization. In Proceedings of the ACL.","journal-title":"Proceedings of the ACL"},{"key":"e_1_3_1_114_2","doi-asserted-by":"publisher","DOI":"10.1145\/3442188.3445899"},{"key":"e_1_3_1_115_2","volume-title":"Proceedings of the NAACL","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL."},{"key":"e_1_3_1_116_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11678"},{"key":"e_1_3_1_117_2","doi-asserted-by":"publisher","DOI":"10.1145\/3306618.3314287"},{"key":"e_1_3_1_118_2","volume-title":"Proceedings of the NeurIPS","author":"Kingma Diederik","year":"2021","unstructured":"Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. 2021. Variational diffusion models. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_119_2","volume-title":"Proceedings of the ICML","author":"Koh Pang Wei","year":"2021","unstructured":"Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao et\u00a0al. 2021. Wilds: A benchmark of in-the-wild distribution shifts. In Proceedings of the ICML."},{"key":"e_1_3_1_120_2","unstructured":"Sanjay Krishnan and Eugene Wu. 2019. Alphaclean: Automatic generation of data cleaning pipelines. 
Retrieved from https:\/\/arXiv:1904.11827"},{"key":"e_1_3_1_121_2","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"key":"e_1_3_1_122_2","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2882952"},{"key":"e_1_3_1_123_2","doi-asserted-by":"publisher","DOI":"10.1201\/9781351251389-8"},{"key":"e_1_3_1_124_2","doi-asserted-by":"publisher","DOI":"10.1613\/jair.1.12012"},{"key":"e_1_3_1_125_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i18.18012"},{"key":"e_1_3_1_126_2","volume-title":"Proceedings of the NeurIPS","author":"Lai Kwei-Herng","year":"2021","unstructured":"Kwei-Herng Lai, Daochen Zha, Junjie Xu, Yue Zhao, Guanchu Wang, and Xia Hu. 2021. Revisiting time series outlier detection: Definitions and benchmarks. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_127_2","volume-title":"Proceedings of the KDD","author":"Lakshminarayan Kamakshi","year":"1996","unstructured":"Kamakshi Lakshminarayan, Steven A. Harp, Robert P. Goldman, Tariq Samad et\u00a0al. 1996. Imputation of missing data using machine learning techniques. In Proceedings of the KDD."},{"key":"e_1_3_1_128_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-91473-2_9"},{"key":"e_1_3_1_129_2","doi-asserted-by":"publisher","DOI":"10.1145\/543613.543644"},{"key":"e_1_3_1_130_2","doi-asserted-by":"publisher","DOI":"10.1145\/3136625"},{"key":"e_1_3_1_131_2","unstructured":"Peng Li Xi Rao Jennifer Blase Yue Zhang Xu Chu and Ce Zhang. 2019. Cleanml: A benchmark for joint data cleaning and machine learning [experiments and analysis]. 
Retrieved from https:\/\/arXiv:1904.09483"},{"key":"e_1_3_1_132_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-09342-5_13"},{"key":"e_1_3_1_133_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2021.3105636"},{"key":"e_1_3_1_134_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00210"},{"key":"e_1_3_1_135_2","doi-asserted-by":"publisher","DOI":"10.1145\/3366424.3383530"},{"key":"e_1_3_1_136_2","volume-title":"Proceedings of the ICML","author":"Lipton Zachary","year":"2018","unstructured":"Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. 2018. Detecting and correcting for label shift with black box predictors. In Proceedings of the ICML."},{"key":"e_1_3_1_137_2","doi-asserted-by":"publisher","DOI":"10.1145\/3560815"},{"key":"e_1_3_1_138_2","volume-title":"Proceedings of the NeurIPS","author":"Liu Zhining","year":"2020","unstructured":"Zhining Liu, Pengfei Wei, Jing Jiang, Wei Cao, Jiang Bian, and Yi Chang. 2020. MESA: Boost ensemble imbalanced learning with meta-sampler. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_139_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i5.20468"},{"key":"e_1_3_1_140_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2018.00019"},{"key":"e_1_3_1_141_2","article-title":"Towards deep learning models resistant to adversarial attacks","author":"Madry Aleksander","year":"2018","unstructured":"Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In Proceedings of the ICLR.","journal-title":"Proceedings of the ICLR"},{"key":"e_1_3_1_142_2","unstructured":"Cloudera Performance Management. 2023. ClouderaYarnTuning. 
Retrieved from https:\/\/docs.cloudera.com\/documentation\/enterprise\/latest\/topics\/cdh_ig_yarn_tuning.html"},{"key":"e_1_3_1_143_2","doi-asserted-by":"publisher","DOI":"10.14778\/3421424.3421425"},{"key":"e_1_3_1_144_2","unstructured":"Diego Martinez Daochen Zha Qiaoyu Tan and Xia Hu. 2023. Towards personalized preprocessing pipeline search. Retrieved from https:\/\/arXiv:2302.14329"},{"key":"e_1_3_1_145_2","article-title":"Dataperf: Benchmarks for data-centric ai development","author":"Mazumder Mark","year":"2024","unstructured":"Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karla\u0161, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Douwe Kiela, David Jurado et\u00a0al. 2024. Dataperf: Benchmarks for data-centric ai development. In Proceedings of the NeurIPS.","journal-title":"Proceedings of the NeurIPS"},{"key":"e_1_3_1_146_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380597"},{"key":"e_1_3_1_147_2","doi-asserted-by":"publisher","DOI":"10.1145\/3457607"},{"key":"e_1_3_1_148_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-022-11012-2"},{"key":"e_1_3_1_149_2","volume-title":"Proceedings of the ICML Workshop","author":"Milutinovic Mitar","year":"2020","unstructured":"Mitar Milutinovic, Brandon Schoenfeld, Diego Martinez-Garcia, Saswati Ray, Sujen Shah, and David Yan. 2020. On evaluation of automl systems. In Proceedings of the ICML Workshop."},{"key":"e_1_3_1_150_2","doi-asserted-by":"publisher","DOI":"10.3115\/1690219.1690287"},{"key":"e_1_3_1_151_2","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbx044"},{"key":"e_1_3_1_152_2","unstructured":"Lester James Miranda. 2021. Towards data-centric machine learning: A short review. Retrieved from https:\/\/ljvmiranda921.github.io"},{"key":"e_1_3_1_153_2","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkw1081"},{"key":"e_1_3_1_154_2","unstructured":"Volodymyr Mnih Koray Kavukcuoglu David Silver Alex Graves Ioannis Antonoglou Daan Wierstra and Martin Riedmiller. 2013. 
Playing atari with deep reinforcement learning. Retrieved from https:\/\/arXiv:1312.5602"},{"key":"e_1_3_1_155_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.282"},{"key":"e_1_3_1_156_2","doi-asserted-by":"publisher","DOI":"10.3390\/jimaging7120254"},{"key":"e_1_3_1_157_2","doi-asserted-by":"publisher","DOI":"10.14778\/3192965.3192973"},{"key":"e_1_3_1_158_2","article-title":"Data-centric AI resource hub","author":"Ng Andrew","year":"2021","unstructured":"Andrew Ng. 2021. Data-centric AI resource hub. Snorkel AI. Retrieved February 8, 2023 from https:\/\/snorkel.ai\/","journal-title":"Snorkel AI"},{"key":"e_1_3_1_159_2","article-title":"Landing AI","author":"Ng Andrew","year":"2023","unstructured":"Andrew Ng. 2023. Landing AI. Landing AI. Retrieved February 8, 2023 from https:\/\/landing.ai\/","journal-title":"Landing AI"},{"key":"e_1_3_1_160_2","article-title":"Data-Centric AI competition","author":"Ng Andrew","year":"2021","unstructured":"Andrew Ng, Dillon Laird, and Lynn He. 2021. Data-Centric AI competition. DeepLearning AI. Retrieved December 8, 2021 from https:\/\/https-deeplearning-ai.github.io\/data-centric-comp\/","journal-title":"DeepLearning AI"},{"key":"e_1_3_1_161_2","volume-title":"Proceedings of the CoMeSySo","author":"Obukhov Artem","year":"2020","unstructured":"Artem Obukhov and Mikhail Krasnyanskiy. 2020. Quality assessment method for GAN based on modified metrics inception score and Fr\u00e9chet inception distance. In Proceedings of the CoMeSySo."},{"key":"e_1_3_1_162_2","unstructured":"OpenAI. 2023. GPT-4 Technical Report."},{"key":"e_1_3_1_163_2","volume-title":"Proceedings of the MLHC","author":"Otles Erkin","year":"2021","unstructured":"Erkin Otles, Jeeheh Oh, Benjamin Li, Michelle Bochinski, Hyeon Joo, Justin Ortwine, Erica Shenoy, Laraine Washer, Vincent B. Young, Krishna Rao et\u00a0al. 2021. Mind the performance gap: Examining dataset shift during prospective validation. 
In Proceedings of the MLHC."},{"key":"e_1_3_1_164_2","volume-title":"Proceedings of the NeurIPS","author":"Ouyang Long","year":"2022","unstructured":"Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray et\u00a0al. 2022. Training language models to follow instructions with human feedback. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_165_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2020.106384"},{"key":"e_1_3_1_166_2","doi-asserted-by":"publisher","DOI":"10.1145\/3439950"},{"key":"e_1_3_1_167_2","doi-asserted-by":"publisher","DOI":"10.1145\/3052973.3053009"},{"key":"e_1_3_1_168_2","article-title":"Carla: A python library to benchmark algorithmic recourse and counterfactual explanation algorithms","author":"Pawelczyk Martin","year":"2021","unstructured":"Martin Pawelczyk, Sascha Bielawski, Johannes van den Heuvel, Tobias Richter, and Gjergji Kasneci. 2021. Carla: A python library to benchmark algorithmic recourse and counterfactual explanation algorithms. In Proceedings of the NeurIPS.","journal-title":"Proceedings of the NeurIPS"},{"key":"e_1_3_1_169_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-92639-1_60"},{"key":"e_1_3_1_170_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-18818-8_2"},{"key":"e_1_3_1_171_2","doi-asserted-by":"publisher","DOI":"10.1145\/505248.506010"},{"key":"e_1_3_1_172_2","doi-asserted-by":"publisher","DOI":"10.14778\/2733004.2733009"},{"key":"e_1_3_1_173_2","unstructured":"Neoklis Polyzotis and Matei Zaharia. 2021. What can data-centric AI learn from data and ML engineering? Retrieved from https:\/\/arXiv:2112.06439"},{"key":"e_1_3_1_174_2","doi-asserted-by":"publisher","DOI":"10.1145\/3375627.3375850"},{"key":"e_1_3_1_175_2","unstructured":"Gil Press. 2022. Cleaning Big Data: Most Time-consuming Least Enjoyable Data Science Task Survey Says. 
Retrieved from https:\/\/www.forbes.com\/sites\/gilpress\/2016\/03\/23\/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says\/?sh=4e0b70766f63"},{"key":"e_1_3_1_176_2","doi-asserted-by":"publisher","DOI":"10.1109\/IRI.2015.39"},{"key":"e_1_3_1_177_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467224"},{"key":"e_1_3_1_178_2","article-title":"Improving language understanding by generative pre-training","author":"Radford Alec","year":"2018","unstructured":"Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever et\u00a0al. 2018. Improving language understanding by generative pre-training. OpenAI (2018).","journal-title":"OpenAI"},{"key":"e_1_3_1_179_2","article-title":"Language models are unsupervised multitask learners","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever et\u00a0al. 2019. Language models are unsupervised multitask learners. OpenAI (2019).","journal-title":"OpenAI"},{"key":"e_1_3_1_180_2","article-title":"Snorkel AI","author":"Ratner Alexander","year":"2023","unstructured":"Alexander Ratner. 2023. Snorkel AI. Snorkel AI. Retrieved February 8, 2023 from https:\/\/snorkel.ai\/","journal-title":"Snorkel AI"},{"key":"e_1_3_1_181_2","doi-asserted-by":"publisher","DOI":"10.14778\/3157794.3157797"},{"key":"e_1_3_1_182_2","article-title":"Data programming: Creating large training sets, quickly","author":"Ratner Alexander J.","year":"2016","unstructured":"Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and Christopher R\u00e9. 2016. Data programming: Creating large training sets, quickly. 
In Proceedings of the NeurIPS.","journal-title":"Proceedings of the NeurIPS"},{"key":"e_1_3_1_183_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472291"},{"key":"e_1_3_1_184_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0031-3203(02)00119-X"},{"key":"e_1_3_1_185_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_1_186_2","doi-asserted-by":"publisher","DOI":"10.1145\/3186549.3186559"},{"key":"e_1_3_1_187_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDEW49219.2020.00035"},{"key":"e_1_3_1_188_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15561-1_16"},{"key":"e_1_3_1_189_2","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457323"},{"key":"e_1_3_1_190_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSC45622.2019.8938371"},{"key":"e_1_3_1_191_2","volume-title":"Proceedings of the CHI","author":"Sambasivan Nithya","year":"2021","unstructured":"Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M. Aroyo. 2021. \u201cEveryone wants to do the model work, not the data work\u201d: Data Cascades in High-Stakes AI. In Proceedings of the CHI."},{"key":"e_1_3_1_192_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.723"},{"key":"e_1_3_1_193_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE.2013.6606695"},{"key":"e_1_3_1_194_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-9473(01)00072-X"},{"key":"e_1_3_1_195_2","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3229867"},{"key":"e_1_3_1_196_2","unstructured":"Timo Schick and Hinrich Sch\u00fctze. 2020. Few-shot text generation with pattern-exploiting training. Retrieved from https:\/\/arXiv:2012.11926"},{"key":"e_1_3_1_197_2","article-title":"Exploiting cloze questions for few shot text classification and natural language inference","author":"Schick Timo","year":"2021","unstructured":"Timo Schick and Hinrich Sch\u00fctze. 2021. 
Exploiting cloze questions for few shot text classification and natural language inference. In Proceedings of the EACL.","journal-title":"Proceedings of the EACL"},{"key":"e_1_3_1_198_2","article-title":"It\u2019s not just size that matters: Small language models are also few-shot learners","author":"Schick Timo","year":"2021","unstructured":"Timo Schick and Hinrich Sch\u00fctze. 2021. It\u2019s not just size that matters: Small language models are also few-shot learners. In Proceedings of the NAACL.","journal-title":"Proceedings of the NAACL"},{"key":"e_1_3_1_199_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i11.17144"},{"key":"e_1_3_1_200_2","unstructured":"Nabeel Seedat Fergus Imrie and Mihaela van der Schaar. 2022. DC-Check: A data-centric AI checklist to guide the development of reliable machine learning systems. Retrieved from https:\/\/arXiv:2211.05764"},{"key":"e_1_3_1_201_2","volume-title":"Proceedings of the NeurIPS","author":"Shafahi Ali","year":"2018","unstructured":"Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. 2018. Poison frogs! Targeted clean-label poisoning attacks on neural networks. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_202_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00952"},{"key":"e_1_3_1_203_2","doi-asserted-by":"crossref","unstructured":"Shubham Sharma Jette Henderson and Joydeep Ghosh. 2019. Certifai: Counterfactual explanations for robustness transparency interpretability and fairness of artificial intelligence models. Retrieved from https:\/\/arXiv:1905.07857","DOI":"10.1145\/3375627.3375812"},{"key":"e_1_3_1_204_2","unstructured":"Zheyan Shen Jiashuo Liu Yue He Xingxuan Zhang Renzhe Xu Han Yu and Peng Cui. 2021. Towards out-of-distribution generalization: A survey. 
Retrieved from https:\/\/arXiv:2108.13624"},{"key":"e_1_3_1_205_2","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-019-0197-0"},{"key":"e_1_3_1_206_2","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-021-00492-0"},{"key":"e_1_3_1_207_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jnca.2016.04.008"},{"key":"e_1_3_1_208_2","doi-asserted-by":"crossref","unstructured":"Prerna Singh. 2023. Systematic review of data-centric approaches in artificial intelligence and machine learning. Data Science and Management 6 3 (2023) 144\u2013157.","DOI":"10.1016\/j.dsm.2023.06.001"},{"key":"e_1_3_1_209_2","volume-title":"Proceedings of the NeurIPS","author":"Sohoni Nimit","year":"2020","unstructured":"Nimit Sohoni, Jared Dunnmon, Geoffrey Angus, Albert Gu, and Christopher R\u00e9. 2020. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_210_2","doi-asserted-by":"publisher","DOI":"10.3390\/su11041077"},{"key":"e_1_3_1_211_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472749.3474792"},{"key":"e_1_3_1_212_2","article-title":"Beyond the imitation game: Quantifying and extrapolating the capabilities of language models","author":"Srivastava Aarohi","year":"2023","unstructured":"Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adri\u00e0 Garriga-Alonso et\u00a0al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. In Proceedings of the TMLR.","journal-title":"Proceedings of the TMLR"},{"key":"e_1_3_1_213_2","volume-title":"Proceedings of the CIDR","author":"Stonebraker Michael","year":"2013","unstructured":"Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch Cherniack, Stanley B. Zdonik, Alexander Pagan, and Shan Xu. 2013. Data curation at scale: The data tamer system. 
In Proceedings of the CIDR."},{"issue":"2","key":"e_1_3_1_214_2","first-page":"3","article-title":"Data integration: The current status and the way forward.","volume":"41","author":"Stonebraker Michael","year":"2018","unstructured":"Michael Stonebraker, Ihab F. Ilyas et\u00a0al. 2018. Data integration: The current status and the way forward. IEEE Data Eng. Bull. 41, 2 (2018), 3\u20139.","journal-title":"IEEE Data Eng. Bull."},{"issue":"5","key":"e_1_3_1_215_2","article-title":"Covariate shift adaptation by importance weighted cross validation.","volume":"8","author":"Sugiyama Masashi","year":"2007","unstructured":"Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert M\u00fcller. 2007. Covariate shift adaptation by importance weighted cross validation. J. Mach. Learn. Res. 8, 5 (2007).","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_1_216_2","doi-asserted-by":"publisher","DOI":"10.14778\/3368289.3368296"},{"key":"e_1_3_1_217_2","article-title":"Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction","volume":"1","author":"Sutton Oliver","year":"2012","unstructured":"Oliver Sutton. 2012. Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction. University lectures, University of Leicester 1 (2012).","journal-title":"University lectures, University of Leicester"},{"key":"e_1_3_1_218_2","volume-title":"Proceedings of the SIGIR Workshop","author":"Tang Wei","year":"2011","unstructured":"Wei Tang and Matthew Lease. 2011. Semi-supervised consensus labeling for crowdsourcing. In Proceedings of the SIGIR Workshop."},{"key":"e_1_3_1_219_2","unstructured":"Yuchao Tao Ryan McKenna Michael Hay Ashwin Machanavajjhala and Gerome Miklau. 2021. Benchmarking differentially private synthetic data generation algorithms. 
Retrieved from https:\/\/arXiv:2112.09238"},{"key":"e_1_3_1_220_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jksuci.2015.12.004"},{"key":"e_1_3_1_221_2","volume-title":"Proceedings of the EDBT","author":"Thirumuruganathan Saravanan","year":"2020","unstructured":"Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, and AnHai Doan. 2020. Data curation with deep learning. In Proceedings of the EDBT."},{"key":"e_1_3_1_222_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2000.839397"},{"key":"e_1_3_1_223_2","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3064029"},{"key":"e_1_3_1_224_2","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbac315"},{"key":"e_1_3_1_225_2","doi-asserted-by":"publisher","DOI":"10.1155\/2018\/7068349"},{"key":"e_1_3_1_226_2","first-page":"841","article-title":"Counterfactual explanations without opening the black box: Automated decisions and the GDPR","volume":"31","author":"Wachter Sandra","year":"2017","unstructured":"Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL Tech. 31 (2017), 841.","journal-title":"Harv. JL Tech."},{"issue":"1","key":"e_1_3_1_227_2","first-page":"1033","article-title":"A comparison of radial and linear charts for visualizing daily patterns","volume":"26","author":"Waldner Manuela","year":"2019","unstructured":"Manuela Waldner, Alexandra Diehl, Denis Gra\u010danin, Rainer Splechtna, Claudio Delrieux, and Kre\u0161imir Matkovi\u0107. 2019. A comparison of radial and linear charts for visualizing daily patterns. IEEE Trans. Visual. Comput. Graph. 26, 1 (2019), 1033\u20131042.","journal-title":"IEEE Trans. Visual. Comput. Graph."},{"key":"e_1_3_1_228_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1221"},{"key":"e_1_3_1_229_2","doi-asserted-by":"crossref","unstructured":"Mingyang Wan Daochen Zha Ninghao Liu and Na Zou. 2022. 
In-processing modeling techniques for machine learning fairness: A survey. Transactions on Knowledge Discovery from Data 17 3 (2023) 1\u201327.","DOI":"10.1145\/3551390"},{"key":"e_1_3_1_230_2","article-title":"Scale AI","author":"Wang Alexandr","year":"2023","unstructured":"Alexandr Wang. 2023. Scale AI. Scale AI. Retrieved February 8, 2023 from https:\/\/scale.com\/","journal-title":"Scale AI"},{"key":"e_1_3_1_231_2","doi-asserted-by":"publisher","DOI":"10.1145\/3511808.3557168"},{"key":"e_1_3_1_232_2","doi-asserted-by":"publisher","DOI":"10.14778\/2350229.2350263"},{"key":"e_1_3_1_233_2","doi-asserted-by":"publisher","DOI":"10.1145\/3440207"},{"key":"e_1_3_1_234_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v29i1.9211"},{"key":"e_1_3_1_235_2","volume-title":"Proceedings of the NeurIPS","author":"Wang Yidong","year":"2022","unstructured":"Yidong Wang, Hao Chen, Yue Fan, Sun Wang, Ran Tao, Wenxin Hou, Renjie Wang, Linyi Yang, Zhi Zhou, Lan-Zhe Guo et\u00a0al. 2022. Usb: A unified semi-supervised learning benchmark for classification. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_236_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pbio.3001464"},{"key":"e_1_3_1_237_2","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2017.7966039"},{"key":"e_1_3_1_238_2","doi-asserted-by":"publisher","DOI":"10.1038\/d41586-018-02174-z"},{"key":"e_1_3_1_239_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2021\/631"},{"key":"e_1_3_1_240_2","volume-title":"Hadoop: The Definitive Guide","author":"White Tom","year":"2012","unstructured":"Tom White. 2012. Hadoop: The Definitive Guide. O\u2019Reilly Media, Inc."},{"key":"e_1_3_1_241_2","volume-title":"Artificial Intelligence","author":"Winston Patrick Henry","year":"1984","unstructured":"Patrick Henry Winston. 1984. Artificial Intelligence. 
Addison-Wesley Longman Publishing."},{"key":"e_1_3_1_242_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2015.2467191"},{"key":"e_1_3_1_243_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4419-9878-1_4"},{"key":"e_1_3_1_244_2","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3482106"},{"key":"e_1_3_1_245_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2022.3150080"},{"key":"e_1_3_1_246_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.snb.2015.02.025"},{"key":"e_1_3_1_247_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2018.06.004"},{"key":"e_1_3_1_248_2","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/1168\/2\/022022"},{"key":"e_1_3_1_249_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00840"},{"key":"e_1_3_1_250_2","unstructured":"Jin Yong Yoo John X. Morris Eli Lifland and Yanjun Qi. 2020. Searching for a search method: Benchmarking search algorithms for generating nlp adversarial examples. Retrieved from https:\/\/arXiv:2009.06368"},{"key":"e_1_3_1_251_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-emnlp.192"},{"key":"e_1_3_1_252_2","volume-title":"Proceedings of the NeurIPS","author":"Yuan Weizhe","year":"2021","unstructured":"Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. Bartscore: Evaluating generated text as text generation. In Proceedings of the NeurIPS."},{"key":"e_1_3_1_253_2","doi-asserted-by":"publisher","DOI":"10.1109\/PASSAT\/SocialCom.2011.203"},{"key":"e_1_3_1_254_2","doi-asserted-by":"publisher","DOI":"10.1145\/2934664"},{"key":"e_1_3_1_255_2","doi-asserted-by":"publisher","DOI":"10.1190\/1.1444352"},{"key":"e_1_3_1_256_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2021.3114814"},{"key":"e_1_3_1_257_2","unstructured":"Daochen Zha Zaid Pervaiz Bhat Kwei-Herng Lai Fan Yang and Xia Hu. 2023. Data-centric AI: Perspectives and challenges. 
Retrieved from https:\/\/arXiv:2301.04819"},{"key":"e_1_3_1_258_2","doi-asserted-by":"publisher","DOI":"10.1145\/3511808.3557474"},{"key":"e_1_3_1_259_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM50108.2020.00086"},{"key":"e_1_3_1_260_2","doi-asserted-by":"publisher","DOI":"10.1145\/3580305.3599553"},{"key":"e_1_3_1_261_2","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611977172.23"},{"key":"e_1_3_1_262_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-018-1280-0"},{"key":"e_1_3_1_263_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2022\/862"},{"key":"e_1_3_1_264_2","volume-title":"Proceedings of the ICML","author":"Zha Daochen","year":"2021","unstructured":"Daochen Zha, Jingru Xie, Wenye Ma, Sheng Zhang, Xiangru Lian, Xia Hu, and Ji Liu. 2021. Douzero: Mastering doudizhu with self-play deep reinforcement learning. In Proceedings of the ICML."},{"key":"e_1_3_1_265_2","volume-title":"Proceedings of the ICLR","author":"Zhang Hongyi","year":"2018","unstructured":"Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond empirical risk minimization. In Proceedings of the ICLR."},{"key":"e_1_3_1_266_2","volume-title":"Proceedings of the ICML","author":"Zhang Han","year":"2019","unstructured":"Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. 2019. Self-attention generative adversarial networks. In Proceedings of the ICML."},{"key":"e_1_3_1_267_2","unstructured":"Jieyu Zhang Cheng-Yu Hsieh Yue Yu Chao Zhang and Alexander Ratner. 2022. A survey on programmatic weak supervision. 
Retrieved from https:\/\/arXiv:2202.05433"},{"key":"e_1_3_1_268_2","doi-asserted-by":"publisher","DOI":"10.1145\/3158369"},{"key":"e_1_3_1_269_2","doi-asserted-by":"publisher","DOI":"10.14778\/3538598.3538604"},{"key":"e_1_3_1_270_2","doi-asserted-by":"publisher","DOI":"10.1109\/TFUZZ.2019.2959995"},{"key":"e_1_3_1_271_2","volume-title":"Proceedings of the NeurIPS","author":"Zhang Xiang","year":"2015","unstructured":"Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of the NeurIPS."},{"issue":"1","key":"e_1_3_1_272_2","article-title":"Missing data imputation: Focusing on single imputation","volume":"4","author":"Zhang Zhongheng","year":"2016","unstructured":"Zhongheng Zhang. 2016. Missing data imputation: Focusing on single imputation. Ann. Translat. Med. 4, 1 (2016).","journal-title":"Ann. Translat. Med."},{"key":"e_1_3_1_273_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.aiopen.2021.01.001"},{"key":"e_1_3_1_274_2","doi-asserted-by":"publisher","DOI":"10.14778\/3476311.3476334"},{"key":"e_1_3_1_275_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICTAI.2004.48"},{"key":"e_1_3_1_276_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.11"},{"key":"e_1_3_1_277_2","doi-asserted-by":"publisher","DOI":"10.1613\/jair.1.11854"},{"key":"e_1_3_1_278_2","volume-title":"Proceedings of the NeurIPS","author":"Zoph Barret","year":"2020","unstructured":"Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin Dogus Cubuk, and Quoc Le. 2020. Rethinking pre-training and self-training. 
In Proceedings of the NeurIPS."}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711118","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3711118","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:10:26Z","timestamp":1750295426000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711118"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,24]]},"references-count":277,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,5,31]]}},"alternative-id":["10.1145\/3711118"],"URL":"https:\/\/doi.org\/10.1145\/3711118","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,24]]},"assertion":[{"value":"2023-03-17","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-12-14","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-01-24","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}