{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T09:01:31Z","timestamp":1775638891082,"version":"3.50.1"},"reference-count":110,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2024,12,18]],"date-time":"2024-12-18T00:00:00Z","timestamp":1734480000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,12,18]]},"abstract":"<jats:p>\n                    Data quality is critical across many applications. The utility of data is undermined by various errors, making rigorous data cleaning a necessity. Traditional data cleaning systems depend heavily on predefined rules and constraints, which necessitate significant domain knowledge and manual effort. Moreover, while configuration-free approaches and deep learning methods have been explored, they struggle with complex error patterns, lacking interpretability, requiring extensive feature engineering or labeled data. This paper introduces GIDCL (\n                    <jats:underline>G<\/jats:underline>\n                    raph-enhanced\n                    <jats:underline>I<\/jats:underline>\n                    nterpretable\n                    <jats:underline>D<\/jats:underline>\n                    ata\n                    <jats:underline>C<\/jats:underline>\n                    leaning with\n                    <jats:underline>L<\/jats:underline>\n                    arge language models), a pioneering framework that harnesses the capabilities of Large Language Models (LLMs) alongside Graph Neural Network (GNN) to address the challenges of traditional and machine learning-based data cleaning methods. 
By converting relational tables into graph structures, GIDCL utilizes GNN to effectively capture and leverage structural correlations among data, enhancing the model's ability to understand and rectify complex dependencies and errors. The framework's creator-critic workflow innovatively employs LLMs to automatically generate interpretable data cleaning rules and tailor feature engineering with minimal labeled data. This process includes the iterative refinement of error detection and correction models through few-shot learning, significantly reducing the need for extensive manual configuration. GIDCL not only improves the precision and efficiency of data cleaning but also enhances its interpretability, making it accessible and practical for non-expert users. Our extensive experiments demonstrate that GIDCL significantly outperforms existing methods, improving F1-scores by 10% on average while requiring only 20 labeled tuples.\n                  <\/jats:p>","DOI":"10.1145\/3698811","type":"journal-article","created":{"date-parts":[[2024,12,20]],"date-time":"2024-12-20T16:40:35Z","timestamp":1734712835000},"page":"1-29","source":"Crossref","is-referenced-by-count":8,"title":["GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-8249-9695","authenticated-orcid":false,"given":"Mengyi","family":"Yan","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Beihang University, Beijing, CN"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5760-5145","authenticated-orcid":false,"given":"Yaoshu","family":"Wang","sequence":"additional","affiliation":[{"name":"Shenzhen Institute of Computing Sciences, Shenzhen University, Shenzhen, Guangdong, 
CN"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8618-9806","authenticated-orcid":false,"given":"Yue","family":"Wang","sequence":"additional","affiliation":[{"name":"Shenzhen Institute of Computing Sciences, Shenzhen, Guangdong, CN"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8632-1539","authenticated-orcid":false,"given":"Xiaoye","family":"Miao","sequence":"additional","affiliation":[{"name":"Center for Data Science, Zhejiang University, Hangzhou, Zhejiang, CN"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5152-0055","authenticated-orcid":false,"given":"Jianxin","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Beihang University, Beijing, CN"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,12,20]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2023. IMDB. https:\/\/www.imdb.com\/interfaces\/."},{"key":"e_1_2_1_2_1","unstructured":"2024. Code datasets and full version. https:\/\/github.com\/SICS-Fundamental-Research-Center\/GIDCL."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/2856318.2856328"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994518"},{"key":"e_1_2_1_5_1","volume-title":"Foundations of databases","author":"Abiteboul Serge","unstructured":"Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of databases. Vol. 8. Addison-Wesley Reading."},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","unstructured":"Marcelo Arenas Leopoldo Bertossi and Jan Chomicki. 1999. Consistent Query Answers in Inconsistent Databases. In PODS. 
68--79.","DOI":"10.1145\/303976.303983"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/2850578.2850579"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767864"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/2371212"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1938551.1938585"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544854"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/3157794.3157800"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","unstructured":"Philip Bohannon Wenfei Fan Floris Geerts Xibei Jia and Anastasios Kementsietsidis. 2007. Conditional Functional Dependencies for Data Cleaning. In ICDE. 746--755. https:\/\/doi.org\/10.1109\/ICDE.2007.367920","DOI":"10.1109\/ICDE.2007.367920"},{"key":"e_1_2_1_14_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_2_1_15_1","unstructured":"Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]"},{"key":"e_1_2_1_16_1","volume-title":"LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. 
arXiv preprint arXiv:2309.12307","author":"Chen Yukang","year":"2023","unstructured":"Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2023. LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. arXiv preprint arXiv:2309.12307 (2023)."},{"key":"e_1_2_1_17_1","volume-title":"Seed: Domain-specific data curation with large language models. arXiv e-prints","author":"Chen Zui","year":"2023","unstructured":"Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, Zihui Gu, Chunwei Liu, and Michael Cafarella. 2023. Seed: Domain-specific data curation with large language models. arXiv e-prints (2023), arXiv--2310."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536258.2536262"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544847"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2749431"},{"key":"e_1_2_1_21_1","unstructured":"Gao Cong Wenfei Fan Floris Geerts Xibei Jia and Shuai Ma. 2007. Improving data quality: Consistency and accuracy. In VLDB."},{"key":"e_1_2_1_22_1","unstructured":"Gao Cong Wenfei Fan Floris Geerts Xibei Jia and Shuai Ma. 2007. Improving Data Quality: Consistency and Accuracy. In VLDB. 315--326."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.52202\/068431-1189"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2992787"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3542700.3542709"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11573"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","unstructured":"Daniel Deutch Nave Frost Amir Gilad and Oren Sheffer. 2021. Explanations for Data Repair Through Shapley Values. In CIKM. 362--371. 
https:\/\/doi.org\/10.1145\/3459637.3482341","DOI":"10.1145\/3459637.3482341"},{"key":"e_1_2_1_28_1","volume-title":"Leveraging currency for repairing inconsistent and incomplete data. TKDE","author":"Ding Xiaoou","year":"2020","unstructured":"Xiaoou Ding, Hongzhi Wang, Jiaxuan Su, Muxian Wang, Jianzhong Li, and Hong Gao. 2020. Leveraging currency for repairing inconsistent and incomplete data. TKDE (2020)."},{"key":"e_1_2_1_29_1","volume-title":"A survey for in-context learning. arXiv preprint arXiv:2301.00234","author":"Dong Qingxiu","year":"2022","unstructured":"Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234 (2022)."},{"key":"e_1_2_1_30_1","unstructured":"Prashant Doshi Lloyd G Greenwald and John R Clarke. 2003. Using Bayesian Networks for Cleansing Trauma Data. In FLAIRS. 72--76."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536274.2536280"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/1454159.1454200"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1366102.1366103"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","unstructured":"Wenfei Fan Floris Geerts Laks V. S. Lakshmanan and Ming Xiong. 2009. Discovering Conditional Functional Dependencies. In ICDE. 1231--1234. https:\/\/doi.org\/10.1109\/ICDE.2009.208","DOI":"10.1109\/ICDE.2009.208"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2010.154"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2010.154"},{"key":"e_1_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Wenfei Fan Ziyan Han Yaoshu Wang and Min Xie. 2022. Parallel Rule Discovery from Large Datasets by Sampling. In SIGMOD Zachary Ives Angela Bonifati and Amr El Abbadi (Eds.). 
384--398.","DOI":"10.1145\/3514221.3526165"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.14778\/3402755.3402774"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-011-0253-7"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-020-2917-1"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3317315.3317318"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.14778\/3457390.3457400"},{"key":"e_1_2_1_43_1","unstructured":"Centers for Medicare and Medicaid Services. [n.d.]. Provider data catalog. https:\/\/data.cms.gov\/provider-data\/."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2020.3012472"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536360.2536363"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-019-00586-5"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389775"},{"key":"e_1_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Amir Gilad Daniel Deutch and Sudeepa Roy. 2020. On multiple semantics for declarative database repairs. In SIGMOD. 817--831.","DOI":"10.1145\/3318464.3389721"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2016.2637928"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319888"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3595360.3595857"},{"key":"e_1_2_1_52_1","volume-title":"Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685","author":"Hu Edward J","year":"2021","unstructured":"Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)."},{"key":"e_1_2_1_53_1","volume-title":"Bayesian Data Cleaning for Web Data. 
CoRR abs\/1204.3677","author":"Hu Yuheng","year":"2012","unstructured":"Yuheng Hu, Sushovan De, Yi Chen, and Subbarao Kambhampati. 2012. Bayesian Data Cleaning for Web Data. CoRR abs\/1204.3677 (2012). arXiv:1204.3677 http:\/\/arxiv.org\/abs\/1204.3677"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196889"},{"key":"e_1_2_1_55_1","volume-title":"Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al.","author":"Jiang Albert Q","year":"2023","unstructured":"Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)."},{"key":"e_1_2_1_56_1","volume-title":"Xu Chu, Wentao Wu, and Ce Zhang.","author":"Karlas Bojan","year":"2020","unstructured":"Bojan Karlas, Peng Li, Renzhi Wu, Nezihe Merve G\u00fcrel, Xu Chu, Wentao Wu, and Ce Zhang. 2020. Nearest neighbor classifiers over incomplete information: From certain answers to certain predictions. arXiv preprint arXiv:2005.05117 (2020)."},{"key":"e_1_2_1_57_1","volume-title":"Boostclean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299","author":"Krishnan Sanjay","year":"2017","unstructured":"Sanjay Krishnan, Michael J Franklin, Ken Goldberg, and Eugene Wu. 2017. Boostclean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299 (2017)."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_2_1_59_1","volume-title":"Joseph E Gonzalez, Hao Zhang, and Ion Stoica.","author":"Kwon Woosuk","year":"2023","unstructured":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. 
arXiv preprint arXiv:2309.06180 (2023)."},{"key":"e_1_2_1_60_1","unstructured":"Alexander K. Lew Monica Agrawal David A. Sontag and Vikash Mansinghka. 2021. PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. In AISTATS. 1927--1935. http:\/\/proceedings.mlr.press\/v130\/lew21a.html"},{"key":"e_1_2_1_61_1","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive nlp tasks","volume":"33","author":"Lewis Patrick","year":"2020","unstructured":"Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\u00fcttler, Mike Lewis, Wen-tau Yih, Tim Rockt\u00e4schel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459--9474.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_62_1","doi-asserted-by":"crossref","unstructured":"Peng Li Xi Rao Jennifer Blase Yue Zhang Xu Chu and Ce Zhang. 2021. CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks. In ICDE. 13--24.","DOI":"10.1109\/ICDE51399.2021.00009"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00009"},{"key":"e_1_2_1_64_1","volume-title":"Kenneth Lyons, Weiyi Meng, and Divesh Srivastava.","author":"Li Xian","year":"2015","unstructured":"Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. 2015. Truth finding on the deep web: Is the problem solved? arXiv preprint arXiv:1503.00303 (2015)."},{"key":"e_1_2_1_65_1","volume-title":"Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. 
arXiv preprint arXiv:1907.11692 (2019)."},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-021-00699-w"},{"key":"e_1_2_1_67_1","volume-title":"Approximate denial constraints. arXiv preprint arXiv:2005.08540","author":"Livshits Ester","year":"2020","unstructured":"Ester Livshits, Alireza Heidari, Ihab F Ilyas, and Benny Kimelfeld. 2020. Approximate denial constraints. arXiv preprint arXiv:2005.08540 (2020)."},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407801"},{"key":"e_1_2_1_69_1","volume-title":"Semi-Supervised Data Cleaning with Raha and Baran. In 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11--15, 2021, Online Proceedings. www.cidrdb.org.","author":"Mahdavi Mohammad","year":"2021","unstructured":"Mohammad Mahdavi and Ziawasch Abedjan. 2021. Semi-Supervised Data Cleaning with Raha and Baran. In 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11--15, 2021, Online Proceedings. www.cidrdb.org."},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1016\/0169-023X(94)90023-X"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457258"},{"key":"e_1_2_1_73_1","volume-title":"1 Blog: Probabilistic Models with Unknown Objects. Statistical Relational Learning","author":"Milch Brian","year":"2007","unstructured":"Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L Ong, and Andrey Kolobov. 2007. 1 Blog: Probabilistic Models with Unknown Objects. Statistical Relational Learning (2007), 373."},{"key":"e_1_2_1_74_1","volume-title":"Metaicl: Learning to learn in context. arXiv preprint arXiv:2110.15943","author":"Min Sewon","year":"2021","unstructured":"Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2021. Metaicl: Learning to learn in context. 
arXiv preprint arXiv:2110.15943 (2021)."},{"key":"e_1_2_1_75_1","volume-title":"Can foundation models wrangle your data? arXiv preprint arXiv:2205.09911","author":"Narayan Avanika","year":"2022","unstructured":"Avanika Narayan, Ines Chami, Laurel Orr, Simran Arora, and Christopher R\u00e9. 2022. Can foundation models wrangle your data? arXiv preprint arXiv:2205.09911 (2022)."},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3358129"},{"key":"e_1_2_1_77_1","volume-title":"Rayyan-a web and mobile app for systematic reviews. Systematic reviews 5","author":"Ouzzani Mourad","year":"2016","unstructured":"Mourad Ouzzani, Hossam Hammady, Zbys Fedorowicz, and Ahmed Elmagarmid. 2016. Rayyan-a web and mobile app for systematic reviews. Systematic reviews 5 (2016), 1--10."},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.14778\/3368289.3368293"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.14778\/3570690.3570694"},{"key":"e_1_2_1_80_1","volume-title":"SPADE: A Semi-supervised Probabilistic Approach for Detecting Errors in Tables. In IJCAI. 3543--3551.","author":"Pham Minh","year":"2021","unstructured":"Minh Pham, Craig A Knoblock, Muhao Chen, Binh Vu, and Jay Pujara. 2021. SPADE: A Semi-supervised Probabilistic Approach for Detecting Errors in Tables. In IJCAI. 3543--3551."},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389708"},{"key":"e_1_2_1_82_1","unstructured":"Clement Pit-Claudel Zelda Mariet Rachael Harding and Sam Madden. 2016. Outlier detection in heterogeneous datasets using automatic tuple expansion. (2016)."},{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.14778\/3377369.3377377"},{"key":"e_1_2_1_84_1","volume-title":"BClean: A Bayesian Data Cleaning System. arXiv preprint arXiv:2311.06517","author":"Qin Jianbin","year":"2023","unstructured":"Jianbin Qin, Sifan Huang, Yaoshu Wang, Jing Zhu, Yifan Zhang, Yukai Miao, Rui Mao, Makoto Onizuka, and Chuan Xiao. 2023. 
BClean: A Bayesian Data Cleaning System. arXiv preprint arXiv:2311.06517 (2023)."},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.5555\/3455716.3455856"},{"key":"e_1_2_1_86_1","first-page":"3","article-title":"Data cleaning: Problems and current approaches","volume":"23","author":"Rahm Erhard","year":"2000","unstructured":"Erhard Rahm, Hong Hai Do, et al. 2000. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23, 4 (2000), 3--13.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.14778\/3236187.3236193"},{"key":"e_1_2_1_88_1","doi-asserted-by":"crossref","unstructured":"Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084 [cs.CL]","DOI":"10.18653\/v1\/D19-1410"},{"key":"e_1_2_1_89_1","volume-title":"Holoclean: Holistic data repairs with probabilistic inference. arXiv preprint arXiv:1702.00820","author":"Rekatsinas Theodoros","year":"2017","unstructured":"Theodoros Rekatsinas, Xu Chu, Ihab F Ilyas, and Christopher R\u00e9. 2017. Holoclean: Holistic data repairs with probabilistic inference. arXiv preprint arXiv:1702.00820 (2017)."},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476301"},{"key":"e_1_2_1_91_1","doi-asserted-by":"publisher","DOI":"10.4230\/LIPIcs.ICDT.2019.6"},{"key":"e_1_2_1_92_1","doi-asserted-by":"crossref","unstructured":"Michael Sejr Schlichtkrull Thomas N. Kipf Peter Bloem Rianne van den Berg Ivan Titov and Max Welling. 2018. Modeling Relational Data with Graph Convolutional Networks. 
In ESWC.","DOI":"10.1007\/978-3-319-93417-4_38"},{"key":"e_1_2_1_93_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2882955"},{"key":"e_1_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.1145\/3616855.3635752"},{"key":"e_1_2_1_95_1","doi-asserted-by":"publisher","DOI":"10.1145\/3329486.3329493"},{"key":"e_1_2_1_96_1","volume-title":"RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation. arXiv preprint arXiv:2012.02469","author":"Tang Nan","year":"2020","unstructured":"Nan Tang, Ju Fan, Fangyi Li, Jianhong Tu, Xiaoyong Du, Guoliang Li, Sam Madden, and Mourad Ouzzani. 2020. RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation. arXiv preprint arXiv:2012.02469 (2020)."},{"key":"e_1_2_1_97_1","volume-title":"Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)."},{"key":"e_1_2_1_98_1","volume-title":"International conference on machine learning. PMLR","author":"Trouillon Th\u00e9o","year":"2016","unstructured":"Th\u00e9o Trouillon, Johannes Welbl, Sebastian Riedel, \u00c9ric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International conference on machine learning. PMLR, 2071--2080."},{"key":"e_1_2_1_99_1","volume-title":"Composition-based multi-relational graph convolutional networks. arXiv preprint arXiv:1911.03082","author":"Vashishth Shikhar","year":"2019","unstructured":"Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. 2019. Composition-based multi-relational graph convolutional networks. 
arXiv preprint arXiv:1911.03082 (2019)."},{"key":"e_1_2_1_100_1","doi-asserted-by":"publisher","DOI":"10.1145\/3221269.3223028"},{"key":"e_1_2_1_101_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610505"},{"key":"e_1_2_1_102_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319855"},{"key":"e_1_2_1_103_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2463706"},{"key":"e_1_2_1_104_1","volume-title":"Guided data repair. arXiv preprint arXiv:1103.3103","author":"Yakout Mohamed","year":"2011","unstructured":"Mohamed Yakout, Ahmed K Elmagarmid, Jennifer Neville, Mourad Ouzzani, and Ihab F Ilyas. 2011. Guided data repair. arXiv preprint arXiv:1103.3103 (2011)."},{"key":"e_1_2_1_105_1","volume-title":"International conference on machine learning. PMLR, 5689--5698","author":"Yoon Jinsung","year":"2018","unstructured":"Jinsung Yoon, James Jordon, and Mihaela Schaar. 2018. Gain: Missing data imputation using generative adversarial nets. In International conference on machine learning. PMLR, 5689--5698."},{"key":"e_1_2_1_106_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.10804"},{"key":"e_1_2_1_107_1","volume-title":"Jellyfish: A Large Language Model for Data Preprocessing. arXiv preprint arXiv:2312.01678","author":"Zhang Haochen","year":"2023","unstructured":"Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada. 2023. Jellyfish: A Large Language Model for Data Preprocessing. arXiv preprint arXiv:2312.01678 (2023)."},{"key":"e_1_2_1_108_1","volume-title":"TabMentor: Detect Errors on Tabular Data with Noisy Labels. In International Conference on Advanced Data Mining and Applications. Springer, 167--182","author":"Zhang Yaru","year":"2023","unstructured":"Yaru Zhang, Jianbin Qin, Yaoshu Wang, Muhammad Asif Ali, Yan Ji, and Rui Mao. 2023. TabMentor: Detect Errors on Tabular Data with Noisy Labels. In International Conference on Advanced Data Mining and Applications. 
Springer, 167--182."},{"key":"e_1_2_1_109_1","volume-title":"The Twelfth International Conference on Learning Representations.","author":"Zheng Chujie","year":"2023","unstructured":"Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2023. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_2_1_110_1","volume-title":"Deep graph structure learning for robust representations: A survey. arXiv preprint arXiv:2103.03036 14","author":"Zhu Yanqiao","year":"2021","unstructured":"Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Qiang Liu, Shu Wu, and Liang Wang. 2021. Deep graph structure learning for robust representations: A survey. arXiv preprint arXiv:2103.03036 14 (2021)."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3698811","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3698811","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T17:45:08Z","timestamp":1774979108000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3698811"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,18]]},"references-count":110,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,12,18]]}},"alternative-id":["10.1145\/3698811"],"URL":"https:\/\/doi.org\/10.1145\/3698811","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,12,18]]}}}