{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T04:42:10Z","timestamp":1773895330630,"version":"3.50.1"},"reference-count":87,"publisher":"Association for Computing Machinery (ACM)","issue":"10","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,6]]},"abstract":"<jats:p>Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from generative models. The study of repairing erroneous data has gained significant importance. Existing data repair algorithms differ in information utilization, problem settings, and are tested in limited scenarios. In this paper, we compare and summarize these algorithms with a driven information-based taxonomy. We systematically conduct a comprehensive evaluation of 12 mainstream data repair algorithms on 12 datasets under the settings of various data error rates, error types, and 4 downstream analysis tasks, assessing their error reduction performance with a <jats:italic>novel but practical<\/jats:italic> metric. We develop an effective and unified repair optimization strategy that substantially benefits the state of the arts. We conclude that, it is always worthy of data repair. The clean data does not determine the upper bound of data analysis performance. We provide valuable guidelines, challenges, and promising directions in the data repair domain. 
We anticipate this paper enabling researchers and users to well understand and deploy data repair algorithms in practice.<\/jats:p>","DOI":"10.14778\/3675034.3675051","type":"journal-article","created":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T22:19:11Z","timestamp":1722982751000},"page":"2617-2630","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Automatic Data Repair: Are We Ready to Deploy?"],"prefix":"10.14778","volume":"17","author":[{"given":"Wei","family":"Ni","sequence":"first","affiliation":[{"name":"Center for Data Science, Zhejiang University, Hangzhou, China and School of Data Science, City University of Hong Kong, Hong Kong, China"}]},{"given":"Xiaoye","family":"Miao","sequence":"additional","affiliation":[{"name":"Center for Data Science, Zhejiang University, Hangzhou, China and The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, China"}]},{"given":"Xiangyu","family":"Zhao","sequence":"additional","affiliation":[{"name":"School of Data Science, City University of Hong Kong, Hong Kong, China"}]},{"given":"Yangyang","family":"Wu","sequence":"additional","affiliation":[{"name":"School of Software Technology, Zhejiang University, Ningbo, China"}]},{"given":"Shuwei","family":"Liang","sequence":"additional","affiliation":[{"name":"Center for Data Science, Zhejiang University, Hangzhou, China and School of Software Technology, Zhejiang University, Ningbo, China"}]},{"given":"Jianwei","family":"Yin","sequence":"additional","affiliation":[{"name":"Center for Data Science, Zhejiang University, Hangzhou, China and School of Software Technology, Zhejiang University, Ningbo, China and College of Computer Science, Zhejiang University, Hangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2024,8,6]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"REIN: A comprehensive benchmark framework for data cleaning methods in ML pipelines. In EDBT. 
499--511.","author":"Abdelaal Mohamed","year":"2023","unstructured":"Mohamed Abdelaal, Christian Hammacher, and Harald Sch\u00f6ning. 2023. REIN: A comprehensive benchmark framework for data cleaning methods in ML pipelines. In EDBT. 499--511."},{"key":"e_1_2_1_2_1","doi-asserted-by":"crossref","first-page":"993","DOI":"10.14778\/2994509.2994518","article-title":"Detecting data errors: Where are we and what needs to be done","volume":"9","author":"Abedjan Ziawasch","year":"2016","unstructured":"Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, and Nan Tang. 2016. Detecting data errors: Where are we and what needs to be done? Proceedings of the VLDB Endowment 9, 12 (2016), 993--1004.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_3_1","volume-title":"Foundations of Databases: The Logical Level","author":"Abiteboul Serge","unstructured":"Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases: The Logical Level. Addison-Wesley Longman Publishing Cooperation, Inc."},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","first-page":"36","DOI":"10.14778\/2850578.2850579","article-title":"Messing up with BART: Error generation for evaluating data-cleaning algorithms","volume":"9","author":"Arocena Patricia C.","year":"2015","unstructured":"Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Ren\u00e9e J. Miller, Paolo Papotti, and Donatello Santoro. 2015. Messing up with BART: Error generation for evaluating data-cleaning algorithms. 
Proceedings of the VLDB Endowment 9, 2 (2015), 36--47.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_5_1","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1007\/s00778-008-0098-x","article-title":"Swoosh: A generic approach to entity resolution","volume":"18","author":"Benjelloun Omar","year":"2009","unstructured":"Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: A generic approach to entity resolution. The VLDB Journal 18, 1 (2009), 255--276.","journal-title":"The VLDB Journal"},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1145\/361002.361007","article-title":"Multidimensional binary search trees used for associative searching","volume":"18","author":"Bentley Jon Louis","year":"1975","unstructured":"Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509--517.","journal-title":"Commun. ACM"},{"key":"e_1_2_1_7_1","doi-asserted-by":"crossref","unstructured":"Laure Berti-Equille Tamraparni Dasu and Divesh Srivastava. 2011. Discovery of complex glitch patterns: A novel approach to quantitative data cleaning. In ICDE. 733--744.","DOI":"10.1109\/ICDE.2011.5767864"},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","unstructured":"George Beskales Ihab F. Ilyas Lukasz Golab and Artur Galiullin. 2013. On the relative trust between inconsistent data and inaccurate constraints. In ICDE. 541--552.","DOI":"10.1109\/ICDE.2013.6544854"},{"key":"e_1_2_1_9_1","unstructured":"BigDaMa. 2023. Error generator. https:\/\/github.com\/BigDaMa\/error-generator. Accessed: 2024-06-14."},{"key":"e_1_2_1_10_1","doi-asserted-by":"crossref","first-page":"311","DOI":"10.14778\/3157794.3157800","article-title":"Efficient denial constraint discovery with hydra","volume":"11","author":"Bleifu\u00df Tobias","year":"2017","unstructured":"Tobias Bleifu\u00df, Sebastian Kruse, and Felix Naumann. 
2017. Efficient denial constraint discovery with hydra. Proceedings of the VLDB Endowment 11, 3 (2017), 311--323.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_11_1","volume-title":"Generated faces in the wild: Quantitative comparison of stable diffusion, midjourney and DALL-E 2. ArXiv Preprint ArXiv:2210.00586","author":"Borji Ali","year":"2023","unstructured":"Ali Borji. 2023. Generated faces in the wild: Quantitative comparison of stable diffusion, midjourney and DALL-E 2. ArXiv Preprint ArXiv:2210.00586 (2023)."},{"key":"e_1_2_1_12_1","volume-title":"Article 55","author":"Boukerche Azzedine","year":"2020","unstructured":"Azzedine Boukerche, Lining Zheng, and Omar Alfandi. 2020. Outlier detection: Methods, models, and classification. Comput. Surveys 53, 3, Article 55 (2020), 37 pages."},{"key":"e_1_2_1_13_1","unstructured":"Andrei Z Broder. 1997. On the resemblance and containment of documents. In SEQUENCES. 21--29."},{"key":"e_1_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In KDD. 785--794.","DOI":"10.1145\/2939672.2939785"},{"key":"e_1_2_1_15_1","volume-title":"SEED: Simple, efficient, and effective data management via large language models. ArXiv Preprint ArXiv:2310.00749","author":"Chen Zui","year":"2023","unstructured":"Zui Chen, Lei Cao, Sam Madden, Ju Fan, Nan Tang, Zihui Gu, Zeyuan Shang, Chunwei Liu, Michael Cafarella, and Tim Kraska. 2023. SEED: Simple, efficient, and effective data management via large language models. ArXiv Preprint ArXiv:2310.00749 (2023)."},{"key":"e_1_2_1_16_1","volume-title":"Miller","author":"Chiang Fei","year":"2011","unstructured":"Fei Chiang and Ren\u00e9e J. Miller. 2011. A unified model for data and constraint repair. In ICDE. 446--457."},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Xu Chu Ihab F. Ilyas Sanjay Krishnan and Jiannan Wang. 2016. Data cleaning: Overview and emerging challenges. 
In SIGMOD. 2201--2206.","DOI":"10.1145\/2882903.2912574"},{"key":"e_1_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Xu Chu I. F. Ilyas and P. Papotti. 2013. Holistic data cleaning: Putting violations into context. In ICDE. 458--469.","DOI":"10.1109\/ICDE.2013.6544847"},{"key":"e_1_2_1_19_1","volume-title":"KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD. 1247--1261.","author":"Chu Xu","year":"2015","unstructured":"Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. 2015. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD. 1247--1261."},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1007\/s10115-010-0375-z","article-title":"Cluster-based instance selection for machine classification","volume":"30","author":"Czarnowski Ireneusz","year":"2012","unstructured":"Ireneusz Czarnowski. 2012. Cluster-based instance selection for machine classification. Knowledge and Information Systems 30, 5 (2012), 113--133.","journal-title":"Knowledge and Information Systems"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Xin Luna Dong and Theodoros Rekatsinas. 2018. Data integration and machine learning: A natural synergy. In SIGMOD. 1645--1650.","DOI":"10.1145\/3183713.3197387"},{"key":"e_1_2_1_22_1","doi-asserted-by":"crossref","first-page":"1218","DOI":"10.14778\/2536274.2536280","article-title":"NADEEF: A generalized data cleaning system","volume":"6","author":"Ebaid Amr","year":"2013","unstructured":"Amr Ebaid, Ahmed Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, Jorge-Arnulfo Quiane-Ruiz, Nan Tang, and Si Yin. 2013. NADEEF: A generalized data cleaning system. 
Proceedings of the VLDB Endowment 6, 12 (2013), 1218--1221.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","first-page":"1454","DOI":"10.14778\/3236187.3236198","article-title":"Distributed representations of tuples for entity resolution","volume":"11","author":"Ebraheem Muhammad","year":"2018","unstructured":"Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed representations of tuples for entity resolution. Proceedings of the VLDB Endowment 11, 11 (2018), 1454--1467.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_24_1","article-title":"Conditional functional dependencies for capturing data inconsistencies","volume":"33","author":"Fan Wenfei","year":"2008","unstructured":"Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. 2008. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems 33, 2, Article 6 (2008), 48 pages.","journal-title":"ACM Transactions on Database Systems"},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","first-page":"407","DOI":"10.14778\/1687627.1687674","article-title":"Reasoning about record matching rules","volume":"2","author":"Fan Wenfei","year":"2009","unstructured":"Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. 2009. Reasoning about record matching rules. Proceedings of the VLDB Endowment 2, 1 (2009), 407--418.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1007\/s00778-011-0253-7","article-title":"Towards certain fixes with editing rules and master data","volume":"21","author":"Fan Wenfei","year":"2012","unstructured":"Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu. 2012. Towards certain fixes with editing rules and master data. 
The VLDB Journal 21, 2 (2012), 213--238.","journal-title":"The VLDB Journal"},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","first-page":"1183","DOI":"10.1080\/01621459.1969.10501049","article-title":"A theory for record linkage","volume":"64","author":"Fellegi Ivan P.","year":"1969","unstructured":"Ivan P. Fellegi and Alan B. Sunter. 1969. A theory for record linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183--1210.","journal-title":"J. Amer. Statist. Assoc."},{"key":"e_1_2_1_28_1","volume-title":"Clustering by passing messages between data points. Science 315, 5814","author":"Frey Brendan J","year":"2007","unstructured":"Brendan J Frey and Delbert Dueck. 2007. Clustering by passing messages between data points. Science 315, 5814 (2007), 972--976."},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the VLDB Endowment, 371--380","author":"Galhardas Helena","year":"2001","unstructured":"Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, and Cristian-Augustin Saita. 2001. Declarative data cleaning: Language, model, and algorithms. Proceedings of the VLDB Endowment, 371--380."},{"key":"e_1_2_1_30_1","doi-asserted-by":"crossref","first-page":"2048","DOI":"10.1109\/TKDE.2020.3012472","article-title":"A hybrid data cleaning framework using markov logic networks","volume":"34","author":"Ge Congcong","year":"2022","unstructured":"Congcong Ge, Yunjun Gao, Xiaoye Miao, Bin Yao, and Haobo Wang. 2022. A hybrid data cleaning framework using markov logic networks. IEEE Transactions on Knowledge and Data Engineering 34, 5 (2022), 2048--2062.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","first-page":"625","DOI":"10.14778\/2536360.2536363","article-title":"The LLUNATIC data-cleaning framework","volume":"6","author":"Geerts Floris","year":"2013","unstructured":"Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. 2013. The LLUNATIC data-cleaning framework. 
Proceedings of the VLDB Endowment 6, 9 (2013), 625--636.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_32_1","doi-asserted-by":"crossref","unstructured":"Stella Giannakopoulou Manos Karpathiotakis and Anastasia Ailamaki. 2020. Cleaning denial constraint violations through relaxation. In SIGMOD. 805--815.","DOI":"10.1145\/3318464.3389775"},{"key":"e_1_2_1_33_1","volume-title":"Julia Stoyanovich, and Sebastian Schelter.","author":"Guha Shubha","year":"2023","unstructured":"Shubha Guha, Falaah Arif Khan, Julia Stoyanovich, and Sebastian Schelter. 2023. Automated data cleaning can hurt fairness in machine learning-based decision making. In ICDE. 3747--3754."},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","first-page":"727","DOI":"10.1109\/TKDE.2016.2637928","article-title":"A novel cost-based model for data repairing","volume":"29","author":"Hao Shuang","year":"2017","unstructured":"Shuang Hao, Nan Tang, Guoliang Li, Jian He, Na Ta, and Jianhua Feng. 2017. A novel cost-based model for data repairing. IEEE Transactions on Knowledge and Data Engineering 29, 4 (2017), 727--742.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_2_1_35_1","unstructured":"Satoshi Hara Atsushi Nitanda and Takanori Maehara. 2019. Data cleansing for models trained with SGD. In NeurIPS. 4213--4222."},{"key":"e_1_2_1_36_1","doi-asserted-by":"crossref","unstructured":"Md Kamrul Hasan and Mohammad Mahdavi. 2021. Automatic Error Correction Using the Wikipedia Page Revision History. In CIKM. 3073--3077.","DOI":"10.1145\/3459637.3482062"},{"key":"e_1_2_1_37_1","doi-asserted-by":"crossref","first-page":"1282","DOI":"10.14778\/1687627.1687771","article-title":"Framework for evaluating clustering algorithms in duplicate detection","volume":"2","author":"Hassanzadeh Oktie","year":"2009","unstructured":"Oktie Hassanzadeh, Fei Chiang, Hyun Chul Lee, and Ren\u00e9e J. Miller. 2009. 
Framework for evaluating clustering algorithms in duplicate detection. Proceedings of the VLDB Endowment 2, 1 (2009), 1282--1293.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_38_1","doi-asserted-by":"crossref","unstructured":"Alireza Heidari Joshua McGrath Ihab F. Ilyas and Theodoros Rekatsinas. 2019. HoloDetect: Few-shot learning for error detection. In SIGMOD. 829--846.","DOI":"10.1145\/3299869.3319888"},{"key":"e_1_2_1_39_1","volume-title":"Proceedings of the NeurIPS Workshop on Tackling Climate Change with Machine Learning.","author":"Hu Yue","year":"2019","unstructured":"Yue Hu, Yanbing Wang, Canwen Jiao, Rajesh Sankaran, Charles E Catlett, and Daniel B Work. 2019. Automatic data cleaning via tensor factorization for large urban environmental sensor networks. In Proceedings of the NeurIPS Workshop on Tackling Climate Change with Machine Learning."},{"key":"e_1_2_1_40_1","volume-title":"Towards reasoning in large language models: A survey. ArXiv Preprint ArXiv:2212.10403","author":"Huang Jie","year":"2023","unstructured":"Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. ArXiv Preprint ArXiv:2212.10403 (2023)."},{"key":"e_1_2_1_41_1","volume-title":"Auto-detect: Data-driven error detection in tables. In SIGMOD. 1377--1392.","author":"Huang Zhipeng","year":"2018","unstructured":"Zhipeng Huang and Yeye He. 2018. Auto-detect: Data-driven error detection in tables. In SIGMOD. 1377--1392."},{"key":"e_1_2_1_42_1","unstructured":"Kadir Ider and Andreas Schmietendorf. 2018. Data privacy for AI fraud detection models. In ICDS. 102--107."},{"key":"e_1_2_1_43_1","doi-asserted-by":"crossref","first-page":"281","DOI":"10.1561\/1900000045","article-title":"Trends in cleaning relational data: Consistency and deduplication","volume":"5","author":"Ilyas Ihab F.","year":"2015","unstructured":"Ihab F. Ilyas and Xu Chu. 2015. Trends in cleaning relational data: Consistency and deduplication. 
Foundations and Trends in Databases 5, 4 (2015), 281--393.","journal-title":"Foundations and Trends in Databases"},{"key":"e_1_2_1_44_1","volume-title":"Ilyas and Xu Chu","author":"Ihab","year":"2019","unstructured":"Ihab F. Ilyas and Xu Chu. 2019. Data cleaning. Association for Computing Machinery."},{"key":"e_1_2_1_45_1","doi-asserted-by":"crossref","unstructured":"Zuhair Khayyat Ihab F. Ilyas Alekh Jindal Samuel Madden Mourad Ouzzani Paolo Papotti Jorge-Arnulfo Quian\u00e9-Ruiz Nan Tang and Si Yin. 2015. BigDansing: A system for big data cleansing. In SIGMOD. 1215--1230.","DOI":"10.1145\/2723372.2747646"},{"key":"e_1_2_1_46_1","unstructured":"Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In ICML. 1885--1894."},{"key":"e_1_2_1_47_1","volume-title":"BoostClean: Automated error detection and repair for machine learning. ArXiv Preprint ArXiv:1711.01299","author":"Krishnan Sanjay","year":"2017","unstructured":"Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, and Eugene Wu. 2017. BoostClean: Automated error detection and repair for machine learning. ArXiv Preprint ArXiv:1711.01299 (2017)."},{"key":"e_1_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Sanjay Krishnan Daniel Haas Michael J. Franklin and Eugene Wu. 2016. Towards reliable interactive data cleaning: A user survey and recommendations. In HILDA. Article 9 5 pages.","DOI":"10.1145\/2939502.2939511"},{"key":"e_1_2_1_49_1","doi-asserted-by":"crossref","first-page":"948","DOI":"10.14778\/2994509.2994514","article-title":"ActiveClean: Interactive data cleaning for statistical modeling","volume":"9","author":"Krishnan Sanjay","year":"2016","unstructured":"Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive data cleaning for statistical modeling. 
Proceedings of the VLDB Endowment 9, 12 (2016), 948--959.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_50_1","doi-asserted-by":"crossref","first-page":"341","DOI":"10.14778\/2732269.2732271","article-title":"A data- and workload-aware algorithm for range queries under differential privacy","volume":"7","author":"Li Chao","year":"2014","unstructured":"Chao Li, Michael Hay, Gerome Miklau, and Yue Wang. 2014. A data- and workload-aware algorithm for range queries under differential privacy. Proceedings of the VLDB Endowment 7, 5 (2014), 341--352.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_51_1","doi-asserted-by":"crossref","unstructured":"Peng Li Xi Rao Jennifer Blase Yue Zhang Xu Chu and Ce Zhang. 2021. CleanML: A study for evaluating the impact of data cleaning on ML classification tasks. In ICDE. 13--24.","DOI":"10.1109\/ICDE51399.2021.00009"},{"key":"e_1_2_1_52_1","first-page":"1475","article-title":"Approximate nearest neighbor search on high dimensional data---experiments, analyses, and improvement","volume":"32","author":"Li Wen","year":"2019","unstructured":"Wen Li, Ying Zhang, Yifang Sun, Wei Wang, Mingjie Li, Wenjie Zhang, and Xuemin Lin. 2019. Approximate nearest neighbor search on high dimensional data---experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering 32, 8 (2019), 1475--1488.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_2_1_53_1","doi-asserted-by":"crossref","first-page":"50","DOI":"10.14778\/3421424.3421431","article-title":"Deep entity matching with pre-trained language models","volume":"14","author":"Li Yuliang","year":"2020","unstructured":"Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep entity matching with pre-trained language models. 
Proceedings of the VLDB Endowment 14, 1 (2020), 50--60.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_54_1","volume-title":"Matthew Wiener, et al","author":"Liaw Andy","year":"2002","unstructured":"Andy Liaw, Matthew Wiener, et al. 2002. Classification and regression by randomForest. R news 2, 3 (2002), 18--22."},{"key":"e_1_2_1_55_1","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1109\/TIT.1982.1056489","article-title":"Least squares quantization in PCM","volume":"28","author":"Lloyd Stuart","year":"1982","unstructured":"Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129--137.","journal-title":"IEEE Transactions on Information Theory"},{"key":"e_1_2_1_56_1","doi-asserted-by":"crossref","first-page":"1948","DOI":"10.14778\/3407790.3407801","article-title":"Baran: Effective error correction via a unified context representation and transfer learning","volume":"13","author":"Mahdavi Mohammad","year":"2020","unstructured":"Mohammad Mahdavi and Ziawasch Abedjan. 2020. Baran: Effective error correction via a unified context representation and transfer learning. Proceedings of the VLDB Endowment 13, 12 (2020), 1948--1961.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_57_1","volume-title":"Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang.","author":"Mahdavi Mohammad","year":"2019","unstructured":"Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2019. Raha: A configuration-free error detection system. In SIGMOD. 865--882."},{"key":"e_1_2_1_58_1","volume-title":"Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. ArXiv Preprint ArXiv:2303.08896","author":"Manakul Potsawee","year":"2023","unstructured":"Potsawee Manakul, Adian Liusie, and Mark JF Gales. 2023. 
Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. ArXiv Preprint ArXiv:2303.08896 (2023)."},{"key":"e_1_2_1_59_1","doi-asserted-by":"crossref","unstructured":"Yinan Mei Shaoxu Song Chenguang Fang Ziheng Wei Jingyun Fang and Jiang Long. 2023. Discovering editing rules by deep reinforcement learning. In ICDE. 355--367.","DOI":"10.1109\/ICDE55515.2023.00034"},{"key":"e_1_2_1_60_1","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1007\/s11704-016-6195-x","article-title":"Incomplete data management: A survey","volume":"12","author":"Miao Xiaoye","year":"2018","unstructured":"Xiaoye Miao, Yunjun Gao, Su Guo, and Wanqi Liu. 2018. Incomplete data management: A survey. Frontiers of Computer Science 12, 1 (2018), 4--25.","journal-title":"Frontiers of Computer Science"},{"key":"e_1_2_1_61_1","doi-asserted-by":"crossref","first-page":"624","DOI":"10.14778\/3494124.3494143","article-title":"Efficient and effective data imputation with influence functions","volume":"15","author":"Miao Xiaoye","year":"2021","unstructured":"Xiaoye Miao, Yangyang Wu, Lu Chen, Yunjun Gao, Jun Wang, and Jianwei Yin. 2021. Efficient and effective data imputation with influence functions. Proceedings of the VLDB Endowment 15, 3 (2021), 624--632.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_62_1","first-page":"6630","article-title":"An experimental survey of missing data imputation algorithms","volume":"35","author":"Miao Xiaoye","year":"2023","unstructured":"Xiaoye Miao, Yangyang Wu, Lu Chen, Yunjun Gao, and Jianwei Yin. 2023. An experimental survey of missing data imputation algorithms. IEEE Transactions on Knowledge and Data Engineering 35, 7 (2023), 6630--6650.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_2_1_63_1","doi-asserted-by":"crossref","unstructured":"Xiaoye Miao Yangyang Wu Jun Wang Yunjun Gao Xudong Mao and Jianwei Yin. 2021. 
Generative semi-supervised learning for multivariate time series imputation. In AAAI. 8983--8991.","DOI":"10.1609\/aaai.v35i10.17086"},{"key":"e_1_2_1_64_1","doi-asserted-by":"crossref","unstructured":"Sidharth Mudgal Han Li Theodoros Rekatsinas AnHai Doan Youngchoon Park Ganesh Krishnan Rohit Deep Esteban Arcaute and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In SIGMOD. 19--34.","DOI":"10.1145\/3183713.3196926"},{"key":"e_1_2_1_65_1","doi-asserted-by":"crossref","unstructured":"Felix Neutatz Mohammad Mahdavi and Ziawasch Abedjan. 2019. ED2: A case for active learning in error detection. In CIKM. 2249--2252.","DOI":"10.1145\/3357384.3358129"},{"key":"e_1_2_1_66_1","unstructured":"Andrew Ng Michael Jordan and Yair Weiss. 2001. On spectral clustering: Analysis and an algorithm. In NeurIPS. 849--856."},{"key":"e_1_2_1_67_1","volume-title":"Training language models to follow instructions with human feedback. ArXiv Preprint ArXiv:2203.02155","author":"Ouyang Long","year":"2022","unstructured":"Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv Preprint ArXiv:2203.02155 (2022)."},{"key":"e_1_2_1_68_1","doi-asserted-by":"crossref","first-page":"1726","DOI":"10.14778\/3529337.3529356","article-title":"Analyzing how BERT performs entity matching","volume":"15","author":"Paganelli Matteo","year":"2022","unstructured":"Matteo Paganelli, Francesco Del Buono, Andrea Baraldi, and Francesco Guerra. 2022. Analyzing how BERT performs entity matching. 
Proceedings of the VLDB Endowment 15, 8 (2022), 1726--1738.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_69_1","doi-asserted-by":"crossref","first-page":"1082","DOI":"10.14778\/2794367.2794377","article-title":"Functional dependency discovery: An experimental evaluation of seven algorithms","volume":"8","author":"Papenbrock Thorsten","year":"2015","unstructured":"Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Sch\u00f6nberg, Jakob Zwiener, and Felix Naumann. 2015. Functional dependency discovery: An experimental evaluation of seven algorithms. Proceedings of the VLDB Endowment 8, 10 (2015), 1082--1093.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_70_1","doi-asserted-by":"crossref","first-page":"266","DOI":"10.14778\/3368289.3368293","article-title":"Discovery of approximate (and exact) denial constraints","volume":"13","author":"Pena Eduardo H. M.","year":"2019","unstructured":"Eduardo H. M. Pena, Eduardo C. de Almeida, and Felix Naumann. 2019. Discovery of approximate (and exact) denial constraints. Proceedings of the VLDB Endowment 13, 3 (2019), 266--278.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_71_1","volume-title":"SPADE: A semi-supervised probabilistic approach for detecting errors in tables. In IJCAI. 3543--3551.","author":"Pham Minh","year":"2021","unstructured":"Minh Pham, Craig A. Knoblock, Muhao Chen, Binh Vu, and Jay Pujara. 2021. SPADE: A semi-supervised probabilistic approach for detecting errors in tables. In IJCAI. 3543--3551."},{"key":"e_1_2_1_72_1","doi-asserted-by":"crossref","first-page":"1190","DOI":"10.14778\/3137628.3137631","article-title":"Holo-Clean: Holistic data repairs with probabilistic inference","volume":"10","author":"Rekatsinas Theodoros","year":"2017","unstructured":"Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher R\u00e9. 2017. Holo-Clean: Holistic data repairs with probabilistic inference. 
Proceedings of the VLDB Endowment 10, 11 (2017), 1190--1201.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_73_1","doi-asserted-by":"crossref","first-page":"2546","DOI":"10.14778\/3476249.3476301","article-title":"Horizon: Scalable dependency-driven data cleaning","volume":"14","author":"Rezig El Kindi","year":"2021","unstructured":"El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood, and Michael Stonebraker. 2021. Horizon: Scalable dependency-driven data cleaning. Proceedings of the VLDB Endowment 14, 11 (2021), 2546--2554.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_74_1","doi-asserted-by":"crossref","first-page":"465","DOI":"10.1016\/0005-1098(78)90005-5","article-title":"Modeling by shortest data description","volume":"14","author":"Rissanen Jorma","year":"1978","unstructured":"Jorma Rissanen. 1978. Modeling by shortest data description. Automatica 14, 5 (1978), 465--471.","journal-title":"Automatica"},{"key":"e_1_2_1_75_1","doi-asserted-by":"crossref","first-page":"1310","DOI":"10.14778\/2809974.2809991","article-title":"Incremental knowledge base construction using deepdive","volume":"8","author":"Shin Jaeho","year":"2015","unstructured":"Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher R\u00e9. 2015. Incremental knowledge base construction using deepdive. Proceedings of the VLDB Endowment 8, 11 (2015), 1310--1321.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_76_1","volume-title":"Ahmed K. Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quian\u00e9-Ruiz, Armando Solar-Lezama, and Nan Tang.","author":"Singh Rohit","year":"2017","unstructured":"Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed K. Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quian\u00e9-Ruiz, Armando Solar-Lezama, and Nan Tang. 2017. Generating concise entity matching rules. In SIGMOD. 
1635--1638."},{"key":"e_1_2_1_77_1","doi-asserted-by":"crossref","unstructured":"Shaoxu Song, Han Zhu, and Jianmin Wang. 2016. Constraint-variance tolerant data repairing. In SIGMOD. 877--892.","DOI":"10.1145\/2882903.2882955"},{"key":"e_1_2_1_78_1","doi-asserted-by":"crossref","first-page":"809","DOI":"10.1109\/TNNLS.2015.2424995","article-title":"Extreme learning machine for multilayer perceptron","volume":"27","author":"Tang Jiexiong","year":"2015","unstructured":"Jiexiong Tang, Chenwei Deng, and Guang-Bin Huang. 2015. Extreme learning machine for multilayer perceptron. IEEE Transactions on Neural Networks and Learning Systems 27, 4 (2015), 809--821.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_2_1_79_1","volume-title":"VerifAI: verified generative AI. ArXiv Preprint ArXiv:2307.02796","author":"Tang Nan","year":"2023","unstructured":"Nan Tang, Chenyu Yang, Ju Fan, Lei Cao, Yuyu Luo, and Alon Halevy. 2023. VerifAI: verified generative AI. ArXiv Preprint ArXiv:2307.02796 (2023)."},{"key":"e_1_2_1_80_1","volume-title":"Uni-detect: A unified approach to automated error detection in tables. In SIGMOD. 811--828.","author":"Wang Pei","year":"2019","unstructured":"Pei Wang and Yeye He. 2019. Uni-detect: A unified approach to automated error detection in tables. In SIGMOD. 811--828."},{"key":"e_1_2_1_81_1","volume-title":"Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative models. ArXiv Preprint arXiv:2210.14896","author":"Wang Zijie","year":"2022","unstructured":"Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. 2022. Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative models. 
ArXiv Preprint arXiv:2210.14896 (2022)."},{"key":"e_1_2_1_82_1","volume-title":"Chi, Quoc Le, and Denny Zhou","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS (2022), 24824--24837."},{"key":"e_1_2_1_83_1","doi-asserted-by":"crossref","unstructured":"Renzhi Wu, Sanya Chaba, Saurabh Sawlani, Xu Chu, and Saravanan Thirumuruganathan. 2020. ZeroER: Entity resolution using zero labeled examples. In SIGMOD. 1149--1164.","DOI":"10.1145\/3318464.3389743"},{"key":"e_1_2_1_84_1","doi-asserted-by":"crossref","unstructured":"Yangyang Wu, Xiaoye Miao, Zilinghan Li, Shilan He, Xinkai Yuan, and Jianwei Yin. 2023. An efficient generative data imputation toolbox with adversarial learning. In ICDE. 3651--3654.","DOI":"10.1109\/ICDE55515.2023.00290"},{"key":"e_1_2_1_85_1","volume-title":"Elmagarmid","author":"Yakout Mohamed","year":"2013","unstructured":"Mohamed Yakout, Laure Berti-\u00c9quille, and Ahmed K. Elmagarmid. 2013. Don't be scared: Use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD. 553--564."},{"key":"e_1_2_1_86_1","unstructured":"Haojun Zhang, Chengliang Chai, AnHai Doan, Paris Koutris, and Esteban Arcaute. 2020. Manually detecting errors for data cleaning using adaptive crowd-sourcing strategies. In EDBT. 311--322."},{"key":"e_1_2_1_87_1","unstructured":"Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. 
ArXiv Preprint ArXiv:2303.18223 (2023)."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3675034.3675051","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,26]],"date-time":"2024-11-26T07:41:38Z","timestamp":1732606898000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3675034.3675051"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6]]},"references-count":87,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2024,6]]}},"alternative-id":["10.14778\/3675034.3675051"],"URL":"https:\/\/doi.org\/10.14778\/3675034.3675051","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,6]]},"assertion":[{"value":"2024-08-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}