{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T09:55:25Z","timestamp":1773482125842,"version":"3.50.1"},"reference-count":50,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2024,11,25]],"date-time":"2024-11-25T00:00:00Z","timestamp":1732492800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>In the era of Big Data, entity resolution (ER), i.e., the process of identifying which records refer to the same entity in the real world, plays a critical role in data-integration tasks, especially in mission-critical applications where accuracy is mandatory, since we want to avoid integrating different entities or missing matches. However, existing approaches struggle with the challenges posed by rapidly changing data and the presence of dirtiness, which requires an iterative refinement during the time. We present Detective Gadget, a novel system for iterative ER that seamlessly integrates data-cleaning into the ER workflow. Detective Gadget employs an alias-based hashing mechanism for fast and scalable matching, check functions to detect and correct mismatches, and a human-in-the-loop framework to refine results through expert feedback. The system iteratively improves data quality and matching accuracy by leveraging evidence from both automated and manual decisions. 
Extensive experiments across diverse real-world scenarios demonstrate its effectiveness, achieving high accuracy and efficiency while adapting to evolving datasets.<\/jats:p>","DOI":"10.3390\/data9120139","type":"journal-article","created":{"date-parts":[[2024,11,25]],"date-time":"2024-11-25T05:21:32Z","timestamp":1732512092000},"page":"139","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Detective Gadget: Generic Iterative Entity Resolution over Dirty Data"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2022-7475","authenticated-orcid":false,"given":"Marcello","family":"Buoncristiano","sequence":"first","affiliation":[{"name":"Svelto!\u2014Big Data-Cleaning and Analytics, 85100 Potenza, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1189-1481","authenticated-orcid":false,"given":"Giansalvatore","family":"Mecca","sequence":"additional","affiliation":[{"name":"Dipartimento di Ingegneria, Universit\u00e0 degli Studi della Basilicata, 85100 Potenza, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5651-8584","authenticated-orcid":false,"given":"Donatello","family":"Santoro","sequence":"additional","affiliation":[{"name":"Dipartimento di Ingegneria, Universit\u00e0 degli Studi della Basilicata, 85100 Potenza, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9947-8909","authenticated-orcid":false,"given":"Enzo","family":"Veltri","sequence":"additional","affiliation":[{"name":"Dipartimento di Ingegneria, Universit\u00e0 degli Studi della Basilicata, 85100 Potenza, 
Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2024,11,25]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1411","DOI":"10.1109\/TKDE.2006.152","article-title":"A Survey of Web Information Extraction Systems","volume":"18","author":"Chang","year":"2006","journal-title":"IEEE Trans. Data Know. Eng."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TKDE.2007.250581","article-title":"Duplicate Record Detection: A Survey","volume":"19","author":"Elmagarmid","year":"2007","journal-title":"IEEE Trans. Data Know. Eng."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer.","DOI":"10.1007\/978-3-642-31164-2"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/978-3-031-01878-7","article-title":"The Four Generations of Entity Resolution","volume":"16","author":"Papadakis","year":"2021","journal-title":"Synth. Lect. Data Manag."},{"key":"ref_5","first-page":"1574","article-title":"Comparative Evaluation of Entity Resolution Approaches with FEVER","volume":"2","author":"Thor","year":"2009","journal-title":"PVLDB"},{"key":"ref_6","first-page":"305","article-title":"Humanity Is Overrated. or Not. 
Automatic Diagnostic Suggestions by Greg, ML (Extended Abstract)","volume":"Volume 909","author":"Lapadula","year":"2018","journal-title":"Communications in Computer and Information Science, Proceedings of the New Trends in Databases and Information Systems\u2014ADBIS 2018 Short Papers and Workshops, AI*QA, BIGPMED, CSACDB, M2U, BigDataMAPS, ISTREND, DC, Budapest, Hungary, 2\u20135 September 2018"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1485","DOI":"10.1007\/s12553-020-00468-9","article-title":"Greg, ML\u2013Machine Learning for Healthcare at a Scale","volume":"10","author":"Lapadula","year":"2020","journal-title":"Health Technol."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Sagiroglu, S., and Sinanc, D. (2013, January 20\u201324). Big data: A review. Proceedings of the 2013 International Conference on Collaboration Technologies and Systems (CTS), San Diego, CA, USA.","DOI":"10.1109\/CTS.2013.6567202"},{"key":"ref_9","unstructured":"Glavic, B., Mecca, G., Miller, R.J., Papotti, P., Santoro, D., and Veltri, E. (2024, January 25\u201328). Similarity Measures For Incomplete Database Instances. Proceedings of the 27th International Conference on Extending Database Technology, EDBT 2024, Paestum, Italy."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Fan, W., and Geerts, F. (2012). Foundations of Data Quality Management, Morgan & Claypool.","DOI":"10.1007\/978-3-031-01892-3"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1483","DOI":"10.14778\/2350229.2350263","article-title":"CrowdER: Crowdsourcing Entity Resolution","volume":"5","author":"Wang","year":"2012","journal-title":"Proc. 
VLDB Endow."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"711","DOI":"10.1007\/s00778-013-0328-8","article-title":"Hybrid Entity Clustering Using Crowds and Data","volume":"22","author":"Lee","year":"2013","journal-title":"VLDB J."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Veltri, E., Badaro, G., Saeed, M., and Papotti, P. (2023, January 3\u20137). Data Ambiguity Profiling for the Generation of Training Examples. Proceedings of the 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA.","DOI":"10.1109\/ICDE55515.2023.00041"},{"key":"ref_14","unstructured":"Ives, Z.G., Bonifati, A., and Abbadi, A.E. (2022, January 12\u201317). Pythia: Unsupervised Generation of Ambiguous Textual Claims from Relational Data. Proceedings of the SIGMOD \u201922: International Conference on Management of Data, Philadelphia, PA, USA."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1007\/s00778-008-0098-x","article-title":"Swoosh: A Generic Approach to Entity Resolution","volume":"18","author":"Benjelloun","year":"2009","journal-title":"VLDB J."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1016\/j.is.2019.01.003","article-title":"INDIANA: An interactive system for assisting database exploration","volume":"83","author":"Giuzio","year":"2019","journal-title":"Inf. Syst."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"eabi8021","DOI":"10.1126\/sciadv.abi8021","article-title":"(Almost) all of entity resolution","volume":"8","author":"Binette","year":"2022","journal-title":"Sci. Adv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Verroios, V., and Garcia-Molina, H. (2015, January 13\u201317). Entity resolution with crowd errors. 
Proceedings of the 2015 IEEE 31st International Conference on Data Engineering, Seoul, Republic of Korea.","DOI":"10.1109\/ICDE.2015.7113286"},{"key":"ref_19","unstructured":"Konstantinos, N., Ioannou, E., and Papadakis, G. (2024, January 17\u201320). The Five Generations of Entity Resolution on Web Data. Proceedings of the International Conference on Web Engineering, Tampere, Finland."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1878","DOI":"10.14778\/2367502.2367527","article-title":"Dedoop: Efficient Deduplication with Hadoop","volume":"5","author":"Kolb","year":"2012","journal-title":"Proc. VLDB"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Chu, X., Ilyas, I.F., Krishnan, S., and Wang, J. (July, January 26). Data cleaning: Overview and emerging challenges. Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA.","DOI":"10.1145\/2882903.2912574"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Ilyas, I.F., and Chu, X. (2019). Data Cleaning, Morgan & Claypool.","DOI":"10.1145\/3310205"},{"key":"ref_23","unstructured":"Chu, X., Ilyas, I.F., and Papotti, P. (2013, January 8\u201312). Holistic data cleaning: Putting violations into context. Proceedings of the 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, QLD, Australia."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"867","DOI":"10.1007\/s00778-019-00586-5","article-title":"Cleaning data with LLUNATIC","volume":"29","author":"Geerts","year":"2020","journal-title":"VLDB J."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"He, J., Veltri, E., Santoro, D., Li, G., Mecca, G., Papotti, P., and Tang, N. (July, January 26). Interactive and deterministic data cleaning. 
Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA.","DOI":"10.1145\/2882903.2915242"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1261","DOI":"10.1007\/s00778-009-0136-3","article-title":"Generic Entity Resolution with Negative Rules","volume":"18","author":"Whang","year":"2009","journal-title":"VLDB J."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"414","DOI":"10.1080\/01621459.1989.10478785","article-title":"Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida","volume":"84","author":"Jaro","year":"1989","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Talburt, J.R., and Zhou, Y. (2013). A practical guide to entity resolution with OYSTER. Handbook of Data Quality, Springer.","DOI":"10.1007\/978-3-642-36257-6_11"},{"key":"ref_29","unstructured":"Forest, G., and Derek, E. (2024, November 22). Dedupe. Available online: https:\/\/github.com\/dedupeio\/dedupe."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Christen, P. (2008, January 24\u201327). Febrl\u2014An open source data cleaning, deduplication and record linkage system with a graphical user interface. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA.","DOI":"10.1145\/1401890.1402020"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wu, R., Chaba, S., Sawlani, S., Chu, X., and Thirumuruganathan, S. (2020, January 14\u201319). Zeroer: Entity resolution using zero labeled examples. 
Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA.","DOI":"10.1145\/3318464.3389743"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"1950","DOI":"10.14778\/3229863.3236232","article-title":"The return of jedai: End-to-end entity resolution for structured and semi-structured data","volume":"11","author":"Papadakis","year":"2018","journal-title":"Proc. VLDB Endow."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Wang, Y., Qin, J., and Wang, W. (2017). Efficient approximate entity matching using jaro-winkler distance. Proceedings of the International Conference on Web Information Systems Engineering, Springer.","DOI":"10.1007\/978-3-319-68783-4_16"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"635","DOI":"10.1016\/0888-7543(91)90071-L","article-title":"Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms","volume":"11","author":"Pearson","year":"1991","journal-title":"Genomics"},{"key":"ref_35","unstructured":"Holmes, D., and McCabe, M.C. (2002, January 8\u201310). Improving precision and recall for soundex retrieval. Proceedings of the Proceedings. International Conference on Information Technology: Coding and Computing, Las Vegas, NV, USA."},{"key":"ref_36","first-page":"12:1","article-title":"BUNNI: Learning Repair Actions in Rule-driven Data Cleaning","volume":"16","author":"Mecca","year":"2024","journal-title":"ACM J. Data Inf. Qual."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"101565","DOI":"10.1016\/j.is.2020.101565","article-title":"Three-dimensional entity resolution with JedAI","volume":"93","author":"Papadakis","year":"2020","journal-title":"Inf. Syst."},{"key":"ref_38","unstructured":"Zhou, Y., Talburt, J., Su, Y., and Yin, L. (2010, January 27\u201329). OYSTER: A tool for entity resolution in health information exchange. 
Proceedings of the 5th International Conference on Cooperation and Promotion of Information Resources in Science and Technology, Beijing, China."},{"key":"ref_39","first-page":"15","article-title":"Efficient similarity joins for near-duplicate detection","volume":"36","author":"Xiao","year":"2011","journal-title":"ACM Trans. Database Syst. TODS"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Benjelloun, O., Garcia-Molina, H., Gong, H., Kawai, H., Larson, T.E., Menestrina, D., and Thavisomboon, S. (2007, January 25\u201327). D-swoosh: A family of algorithms for generic, distributed entity resolution. Proceedings of the Distributed Computing Systems, ICDCS\u201907, Toronto, ON, Canada.","DOI":"10.1109\/ICDCS.2007.96"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1581","DOI":"10.14778\/3007263.3007314","article-title":"Magellan: Toward building entity matching management systems over data science stacks","volume":"9","author":"Konda","year":"2016","journal-title":"Proc. VLDB Endow."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J., and Zhu, X. (2014, January 22\u201327). Corleone: Hands-off crowdsourcing for entity matching. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, USA.","DOI":"10.1145\/2588555.2588576"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1111\/j.2517-6161.1977.tb01600.x","article-title":"Maximum likelihood from incomplete data via the EM algorithm","volume":"39","author":"Dempster","year":"1977","journal-title":"J. R. Stat. Soc. Ser. B Methodol."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1794","DOI":"10.14778\/3352063.3352068","article-title":"Systemer: A human-in-the-loop system for explainable entity resolution","volume":"12","author":"Qian","year":"2019","journal-title":"Proc. 
VLDB Endow."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"50","DOI":"10.14778\/3421424.3421431","article-title":"Deep Entity Matching with Pre-Trained Language Models","volume":"14","author":"Li","year":"2020","journal-title":"Proc. VLDB Endow."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"2459","DOI":"10.14778\/3476249.3476294","article-title":"Deep Learning for Blocking in Entity Matching: A Design Space Exploration","volume":"14","author":"Thirumuruganathan","year":"2021","journal-title":"Proc. VLDB Endow."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"84:1","DOI":"10.1145\/3588938","article-title":"Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration","volume":"1","author":"Tu","year":"2023","journal-title":"Proc. ACM Manag. Data"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"1316","DOI":"10.1109\/TKDE.2014.2359666","article-title":"Progressive duplicate detection","volume":"27","author":"Papenbrock","year":"2014","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"1208","DOI":"10.1109\/TKDE.2018.2852763","article-title":"Schema-agnostic progressive entity resolution","volume":"31","author":"Simonini","year":"2018","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3641289","article-title":"A survey on evaluation of large language models","volume":"15","author":"Chang","year":"2024","journal-title":"ACM Trans. Intell. Syst. 
Technol."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/9\/12\/139\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:38:56Z","timestamp":1760114336000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/9\/12\/139"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,25]]},"references-count":50,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["data9120139"],"URL":"https:\/\/doi.org\/10.3390\/data9120139","relation":{},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,25]]}}}