{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,15]],"date-time":"2025-08-15T02:06:30Z","timestamp":1755223590243,"version":"3.43.0"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"name":"German Federal Ministry of Education and Research"},{"name":"\u201dCenter for Scalable Data Analytics and Artificial Intelligence Dresden\/Leipzig\u201d"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Data and Information Quality"],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:p>Entity resolution plays an important role in data integration. However, most entity resolution methods focus on pairwise linkage and ignore potential errors generated by the transitive closure based on the determined equality links between two or more data sources. The transitive closure of a record forms a cluster where each record represents the same entity. Cluster repair methods aim to determine these errors and correct them. In the first category of methods, the assumption is that the data sources themselves do not contain any duplicates. Consequently, each cluster can contain at most one record from the same data source. However, real-world data often deviates from this assumption due to quality issues. Recent approaches apply clustering methods in combination with link categorization methods or graph clustering algorithms based on a single graph metric so they can be applied to data sources with duplicates. Nevertheless, the quality of the results highly varies depending on the configuration and dataset. In this study, we introduce a novel approach for cluster repair that utilizes graph metrics derived from the underlying similarity graphs. These metrics enable a comprehensive characterization of links and the generation of enhanced classification models. In addition to graph metric-based models, we integrate an active learning mechanism tailored to cluster-specific attributes. Moreover, we integrate large language models as an oracle. The evaluation shows that the graph metric-based method outperforms existing cluster repair methods and is more robust regarding different datasets and configurations.<\/jats:p>","DOI":"10.1145\/3735511","type":"journal-article","created":{"date-parts":[[2025,6,4]],"date-time":"2025-06-04T07:33:03Z","timestamp":1749022383000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Graph Metrics-driven Record Cluster Repair meets LLM-based active learning"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7175-7359","authenticated-orcid":false,"given":"Victor","family":"Christen","sequence":"first","affiliation":[{"name":"Faculty of Mathematics and Computer Science, Leipzig University","place":["Leipzig, Germany"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0366-9872","authenticated-orcid":false,"given":"Daniel","family":"Obraczka","sequence":"additional","affiliation":[{"name":"Faculty of Mathematics and Computer Science, Leipzig University","place":["Leipzig, Germany"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4667-5743","authenticated-orcid":false,"given":"Marvin","family":"Hofer","sequence":"additional","affiliation":[{"name":"Faculty of Mathematics and Computer Science, Leipzig University","place":["Leipzig, Germany"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4157-8637","authenticated-orcid":false,"given":"Martin","family":"Franke","sequence":"additional","affiliation":[{"name":"Faculty of Mathematics and Computer Science, Leipzig University","place":["Leipzig, Germany"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2665-1114","authenticated-orcid":false,"given":"Erhard","family":"Rahm","sequence":"additional","affiliation":[{"name":"Faculty of Mathematics and Computer Science, Leipzig University","place":["Leipzig, Germany"]}]}],"member":"320","published-online":{"date-parts":[[2025,6,24]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/2339530.2339707"},{"key":"e_1_3_3_3_2","unstructured":"Alessio Benavoli Giorgio Corani Janez Dem\u0161ar and Marco Zaffalon. 2017. Time for a change: A tutorial for comparing multiple classifiers through bayesian analysis. Journal of Machine Learning Research 18 77 (2017) 1\u201336. Retrieved from http:\/\/jmlr.org\/papers\/v18\/16-305.html"},{"key":"e_1_3_3_4_2","series-title":"JMLR Workshop and Conference Proceedings","first-page":"1026","volume-title":"Proceedings of the 31st International Conference on Machine Learning, ICML","volume":"32","author":"Benavoli Alessio","year":"2014","unstructured":"Alessio Benavoli, Giorgio Corani, Francesca Mangili, Marco Zaffalon, and Fabrizio Ruggeri. 2014. A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In Proceedings of the 31st International Conference on Machine Learning, ICML(JMLR Workshop and Conference Proceedings, Vol. 32). JMLR.org, 1026\u20131034. Retrieved from http:\/\/proceedings.mlr.press\/v32\/benavoli14.html"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/1456650.1456651"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0169-7552(98)00110-X"},{"key":"e_1_3_3_7_2","volume-title":"Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014","author":"Bruna Joan","year":"2014","unstructured":"Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2014. Spectral networks and locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014. Retrieved from http:\/\/arxiv.org\/abs\/1312.6203"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330925"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/1401890.1402020"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-31164-2"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2011.127"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-43887-6_11"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3418896"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1609\/AAAI.V27I1.8468"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3405476"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611527"},{"key":"e_1_3_3_17_2","doi-asserted-by":"crossref","unstructured":"James Fox and Sivasankaran Rajamanickam. 2019. How robust are graph neural networks to structural noise?arXiv:1912.10206. Retrieved from http:\/\/arxiv.org\/abs\/1912.10206","DOI":"10.2172\/1592845"},{"key":"e_1_3_3_18_2","doi-asserted-by":"crossref","unstructured":"Linton C. Freeman. 1977. A set of measures of centrality based on betweenness. Sociometry 40 1 (1977) 35\u201341. Retrieved June 12 2025 from http:\/\/www.jstor.org\/stable\/3033543","DOI":"10.2307\/3033543"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","unstructured":"Shivani Gupta and Atul Gupta. 2019. Dealing with noise problem in machine learning data-sets: A systematic review. Procedia Computer Science 161 (2019) 466\u2013474. DOI:10.1016\/j.procs.2019.11.146","DOI":"10.1016\/j.procs.2019.11.146"},{"key":"e_1_3_3_20_2","first-page":"1024","volume-title":"Advances in Neural Information Processing Systems 30: Proceedings of the Annual Conference on Neural Information Processing Systems 2017","author":"Hamilton William L.","year":"2017","unstructured":"William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Proceedings of the Annual Conference on Neural Information Processing Systems 2017. 1024\u20131034."},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687771"},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.21105\/JOSS.02173"},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2016.2637378"},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1177\/104649647100200201"},{"key":"e_1_3_3_25_2","unstructured":"Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_3_26_2","first-page":"3","volume-title":"Proceedings of the QDB\/MUD","author":"K\u00f6pcke Hanna","year":"2008","unstructured":"Hanna K\u00f6pcke and Erhard Rahm. 2008. Training selection for tuning entity matching.. In Proceedings of the QDB\/MUD. Auckland, 3\u201312."},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIC.2010.58"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.18420\/btw2021-11"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1609\/AAAI.V35I15.17562"},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.14778\/3421424.3421431"},{"key":"e_1_3_3_31_2","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Retrieved from https:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_3_3_32_2","doi-asserted-by":"publisher","DOI":"10.14778\/2735471.2735474"},{"key":"e_1_3_3_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196926"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574258"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.3233\/SW-150210"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/S13218-021-00713-X"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-07443-6_26"},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-30284-8_17"},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","unstructured":"Curtis Northcutt Lu Jiang and Isaac Chuang. 2021. Confident Learning: Estimating uncertainty in dataset labels. J. Artif. Int. Res. 70 (May 2021) 1373\u20131411. DOI:10.1613\/jair.1.12125","DOI":"10.1613\/jair.1.12125"},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","unstructured":"Ralph Peeters and Christian Bizer. 2023. Using ChatGPT for entity matching. In New Trends in Database and Information Systems - ADBIS 2023 Short Papers Doctoral Consortium and Workshops: AIDMA DOING K-Gals MADEISD PeRS Barcelona Spain September 4-7 2023 Proceedings (Communications in Computer and Information Science) Alberto Abell\u00f3 Panos Vassiliadis Oscar Romero Robert Wrembel Francesca Bugiotti Johann Gamper Genoveva Vargas-Solar and Ester Zumpano (Eds.). Springer 221\u2013230. DOI:10.1007\/978-3-031-42941-5_20","DOI":"10.1007\/978-3-031-42941-5_20"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-88361-4_11"},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-00671-6_23"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF02289527"},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.5220\/0010649600003064"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-93417-4_37"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-49461-2_23"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-36257-6_11"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1016\/0020-0190(74)90003-9"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3106426.3106497"},{"key":"e_1_3_3_50_2","volume-title":"Proceedings of the 6th International Conference on Learning Representations, ICLR 2018","author":"Velickovic Petar","year":"2018","unstructured":"Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li\u00f2, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018. OpenReview.net. Retrieved from https:\/\/openreview.net\/forum?id=rJXMpikCZ"},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.18653\/V1\/2023.EMNLP-MAIN.896"},{"key":"e_1_3_3_52_2","unstructured":"Zeyu Zhang and Yulong Pei. 2021. A comparative study on robust Graph Neural Networks to structural noises. Retrieved from https:\/\/arxiv.org\/abs\/2112.06070"},{"key":"e_1_3_3_53_2","volume-title":"Advances in Neural Information Processing Systems 36: Proceedings of the Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023","author":"Zhou Zhanke","year":"2023","unstructured":"Zhanke Zhou, Jiangchao Yao, Jiaxu Liu, Xiawei Guo, Quanming Yao, LI He, Liang Wang, Bo Zheng, and Bo Han. 2023. Combating bilateral edge noise for robust link prediction. In Advances in Neural Information Processing Systems 36: Proceedings of the Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023. Retrieved from http:\/\/papers.nips.cc\/paper_files\/paper\/2023\/hash\/435986a8cc3e0667648df5d1c2d55c83-Abstract-Conference.html"}],"container-title":["Journal of Data and Information Quality"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3735511","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,12]],"date-time":"2025-08-12T04:44:29Z","timestamp":1754973869000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3735511"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,24]]},"references-count":52,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6,30]]}},"alternative-id":["10.1145\/3735511"],"URL":"https:\/\/doi.org\/10.1145\/3735511","relation":{},"ISSN":["1936-1955","1936-1963"],"issn-type":[{"type":"print","value":"1936-1955"},{"type":"electronic","value":"1936-1963"}],"subject":[],"published":{"date-parts":[[2025,6,24]]},"assertion":[{"value":"2024-08-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-02","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-24","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}