{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T19:01:58Z","timestamp":1774983718511,"version":"3.50.1"},"reference-count":57,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2025,2,10]],"date-time":"2025-02-10T00:00:00Z","timestamp":1739145600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100006374","name":"HORIZON EUROPE Framework Programme","doi-asserted-by":"publisher","award":["101070122"],"award-info":[{"award-number":["101070122"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,2,10]]},"abstract":"<jats:p>Entity Resolution (ER) is typically implemented as a batch task that processes all available data before identifying duplicate records. However, applications with time or computational constraints, e.g., those running in the cloud, require a progressive approach that produces results in a pay-as-you-go fashion. Numerous algorithms have been proposed for Progressive ER in the literature. In this work, we propose a novel framework for Progressive Entity Matching that organizes relevant techniques into four consecutive steps: (i) filtering, which reduces the search space to the most likely candidate matches, (ii) weighting, which associates every pair of candidate matches with a similarity score, (iii) scheduling, which prioritizes the execution of the candidate matches so that the real duplicates precede the non-matching pairs, and (iv) matching, which applies a complex, matching function to the pairs in the order defined by the previous step. We associate each step with existing and novel techniques, illustrating that our framework overall generates a superset of the main existing works in the field. We select the most representative combinations resulting from our framework and fine-tune them over 10 established datasets for Record Linkage and 8 for Deduplication, with our results indicating that our taxonomy yields a wide range of high performing progressive techniques both in terms of effectiveness and time efficiency.<\/jats:p>","DOI":"10.1145\/3709715","type":"journal-article","created":{"date-parts":[[2025,2,11]],"date-time":"2025-02-11T15:45:06Z","timestamp":1739288706000},"page":"1-25","source":"Crossref","is-referenced-by-count":6,"title":["Progressive Entity Matching: A Design Space Exploration"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-8307-8843","authenticated-orcid":false,"given":"Jakub","family":"Maciejewski","sequence":"first","affiliation":[{"name":"National and Kapodistrian University of Athens, Athens, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3465-1197","authenticated-orcid":false,"given":"Konstantinos","family":"Nikoletos","sequence":"additional","affiliation":[{"name":"National and Kapodistrian University of Athens, Athens, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7298-9431","authenticated-orcid":false,"given":"George","family":"Papadakis","sequence":"additional","affiliation":[{"name":"National and Kapodistrian University of Athens, Athens, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6332-0296","authenticated-orcid":false,"given":"Yannis","family":"Velegrakis","sequence":"additional","affiliation":[{"name":"University of Trento and Utrecht University, Utrecht, Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,2,11]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2019.02.006"},{"key":"e_1_2_2_2_1","volume-title":"Enriching Word Vectors with Subword Information. CoRR","author":"Bojanowski Piotr","year":"2016","unstructured":"Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tom\u00e1s Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR, Vol. abs\/1607.04606 (2016)."},{"key":"e_1_2_2_3_1","volume-title":"Entity Resolution, and Duplicate Detection","author":"Christen Peter","unstructured":"Peter Christen. 2012. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer."},{"key":"e_1_2_2_4_1","doi-asserted-by":"crossref","unstructured":"Peter Christen Ross W. Gayler and David Hawking. 2009. Similarity-aware indexing for real-time entity resolution. In CIKM. 1565--1568.","DOI":"10.1145\/1645953.1646173"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3418896"},{"key":"e_1_2_2_6_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171--4186.","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171--4186."},{"key":"e_1_2_2_7_1","volume-title":"Big Data Integration","author":"Dong Xin Luna","unstructured":"Xin Luna Dong and Divesh Srivastava. 2015. Big Data Integration. Morgan & Claypool Publishers."},{"key":"e_1_2_2_8_1","first-page":"44","article-title":"Unicorn","volume":"53","author":"Fan Ju","year":"2024","unstructured":"Ju Fan, Jianhong Tu, Guoliang Li, Peng Wang, Xiaoyong Du, Xiaofeng Jia, Song Gao, and Nan Tang. 2024. Unicorn: A Unified Multi-Tasking Matching Model. SIGMOD Rec., Vol. 53, 1 (2024), 44--53.","journal-title":"A Unified Multi-Tasking Matching Model. SIGMOD Rec."},{"key":"e_1_2_2_9_1","volume-title":"BEER: Blocking for Effective Entity Resolution. In SIGMOD. 2711--2715.","author":"Galhotra Sainyam","year":"2021","unstructured":"Sainyam Galhotra, Donatella Firmani, Barna Saha, and Divesh Srivastava. 2021a. BEER: Blocking for Effective Entity Resolution. In SIGMOD. 2711--2715."},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-021-00656-7"},{"key":"e_1_2_2_11_1","unstructured":"Leonardo Gazzarri and Melanie Herschel. 2023. Progressive Entity Resolution over Incremental Data. In EDBT. 80--91."},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3626711"},{"key":"e_1_2_2_13_1","first-page":"2018","article-title":"Entity Resolution","volume":"5","author":"Getoor Lise","year":"2012","unstructured":"Lise Getoor and Ashwin Machanavajjhala. 2012. Entity Resolution: Theory, Practice & Open Challenges. PVLDB, Vol. 5, 12 (2012), 2018--2019.","journal-title":"Theory, Practice & Open Challenges. PVLDB"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687771"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/3485450.3485455"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2019.2921572"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920904"},{"key":"e_1_2_2_18_1","volume-title":"ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In ICLR.","author":"Lan Zhenzhong","year":"2020","unstructured":"Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In ICLR."},{"key":"e_1_2_2_19_1","volume-title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, Vol. abs\/1907.11692 (2019)."},{"key":"e_1_2_2_20_1","unstructured":"Jakub Maciejewski Konstantinos Nikoletos George Papadakis and Yannis Velegrakis. 2022. Progressive Entity Matching: A Design Space Exploration. https:\/\/github.com\/JacobMaciejewski\/PER-Design-Space-Exploration\/paper\/PEMextended.pdf. In SIGMOD (extended version)."},{"key":"e_1_2_2_21_1","doi-asserted-by":"crossref","unstructured":"Venkata Vamsikrishna Meduri Lucian Popa Prithviraj Sen and Mohamed Sarwat. 2020. A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching. In SIGMOD. 1133--1147.","DOI":"10.1145\/3318464.3380597"},{"key":"e_1_2_2_22_1","unstructured":"Tom\u00e1s Mikolov Kai Chen Greg Corrado and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In ICLR."},{"key":"e_1_2_2_23_1","unstructured":"Tom\u00e1s Mikolov Ilya Sutskever Kai Chen Gregory S. Corrado and Jeffrey Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. 3111--3119."},{"key":"e_1_2_2_24_1","doi-asserted-by":"crossref","unstructured":"Sidharth Mudgal Han Li Theodoros Rekatsinas AnHai Doan Youngchoon Park Ganesh Krishnan Rohit Deep Esteban Arcaute and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In SIGMOD. ACM 19--34.","DOI":"10.1145\/3183713.3196926"},{"key":"e_1_2_2_25_1","volume-title":"Open benchmark for filtering techniques in entity resolution. The VLDB Journal","author":"Neuhof Franziska","year":"2024","unstructured":"Franziska Neuhof, Marco Fisichella, George Papadakis, Konstantinos Nikoletos, Nikolaus Augsten, Wolfgang Nejdl, and Manolis Koubarakis. 2024. Open benchmark for filtering techniques in entity resolution. The VLDB Journal (2024), 1--26."},{"key":"e_1_2_2_26_1","unstructured":"Daniel Obraczka Jonathan Schuchart and Erhard Rahm. 2021. Embedding-Assisted Entity Resolution for Knowledge Graphs. In KGCW@ESWC."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-023-00791-3"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00389"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-01878-7"},{"key":"e_1_2_2_30_1","doi-asserted-by":"crossref","unstructured":"George Papadakis Nishadi Kirielle Peter Christen and Themis Palpanas. 2024. A Critical Re-evaluation of Record Linkage Benchmarks for Learning-Based Matching Algorithms. In ICDE. 3435--3448.","DOI":"10.1109\/ICDE60146.2024.00265"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2020.101565"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3377455"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/2947618.2947624"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2014.2359666"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/3583140.3583163"},{"key":"e_1_2_2_36_1","volume-title":"Manning","author":"Pennington Jeffrey","year":"2014","unstructured":"Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In EMNLP. 1532--1543."},{"key":"e_1_2_2_37_1","article-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., Vol. 21 (2020), 140:1--140:67.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_2_38_1","doi-asserted-by":"crossref","unstructured":"Banda Ramadan and Peter Christen. 2015. Unsupervised Blocking Key Selection for Real-Time Entity Resolution. In PAKDD. 574--585.","DOI":"10.1007\/978-3-319-18032-8_45"},{"key":"e_1_2_2_39_1","article-title":"Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution","volume":"6","author":"Ramadan Banda","year":"2015","unstructured":"Banda Ramadan, Peter Christen, Huizhi Liang, and Ross W. Gayler. 2015. Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution. ACM J. Data Inf. Qual., Vol. 6, 4 (2015), 15:1--15:29.","journal-title":"ACM J. Data Inf. Qual."},{"key":"e_1_2_2_40_1","doi-asserted-by":"crossref","unstructured":"Banda Ramadan Peter Christen Huizhi Liang Ross W. Gayler and David Hawking. 2013. Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution. In PAKDD. 47--58.","DOI":"10.1007\/978-3-642-40319-4_5"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.7250\/csimq.2018-16.04"},{"key":"e_1_2_2_42_1","volume-title":"a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, Vol. abs\/1910.01108 (2019)."},{"key":"e_1_2_2_43_1","doi-asserted-by":"crossref","unstructured":"Sunita Sarawagi and Anuradha Bhamidipaty. 2002. Interactive deduplication using active learning. In KDD. ACM 269--278.","DOI":"10.1145\/775047.775087"},{"key":"e_1_2_2_44_1","volume-title":"Proceedings of the 5th International Workshop on Ontology Matching (OM-2010)","volume":"689","author":"Shvaiko Pavel","year":"2010","unstructured":"Pavel Shvaiko, J\u00e9r\u00f4me Euzenat, Fausto Giunchiglia, Heiner Stuckenschmidt, Ming Mao, and Isabel F. Cruz (Eds.). 2010. Proceedings of the 5th International Workshop on Ontology Matching (OM-2010), Shanghai, China, November 7, 2010. Vol. 689."},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2852763"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.14778\/3523210.3523226"},{"key":"e_1_2_2_47_1","unstructured":"Kaitao Song Xu Tan Tao Qin Jianfeng Lu and Tie-Yan Liu. 2020. MPNet: Masked and Permuted Pre-training for Language Understanding. In NIPS."},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3139987"},{"key":"e_1_2_2_49_1","first-page":"228","article-title":"Exploring the Design Space of Unsupervised Blocking with Pre-trained Language Models in Entity Resolution","volume":"14176","author":"Sun Chenchen","year":"2023","unstructured":"Chenchen Sun, Yuyuan Jin, Yang Xu, Derong Shen, Tiezheng Nie, and Xite Wang. 2023. Exploring the Design Space of Unsupervised Blocking with Pre-trained Language Models in Entity Resolution. In ADMA, Vol. 14176. 228--244.","journal-title":"ADMA"},{"key":"e_1_2_2_50_1","first-page":"2459","article-title":"Deep Learning for Blocking in Entity Matching","volume":"14","author":"Thirumuruganathan Saravanan","year":"2021","unstructured":"Saravanan Thirumuruganathan, Han Li, Nan Tang, Mourad Ouzzani, Yash Govind, Derek Paulsen, Glenn Fung, and AnHai Doan. 2021. Deep Learning for Blocking in Entity Matching: A Design Space Exploration. PVLDB, Vol. 14, 11 (2021), 2459--2472.","journal-title":"A Design Space Exploration. PVLDB"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732977.2732982"},{"key":"e_1_2_2_52_1","unstructured":"Runhui Wang and Yongfeng Zhang. 2024. Pre-trained Language Models for Entity Blocking: A Reproducibility Study. In NAACL. 8712--8722."},{"key":"e_1_2_2_53_1","unstructured":"Wenhui Wang Furu Wei Li Dong Hangbo Bao Nan Yang and Ming Zhou. 2020. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In NIPS."},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.43"},{"key":"e_1_2_2_55_1","volume-title":"Le","author":"Yang Zhilin","year":"2019","unstructured":"Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS. 5754--5764."},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.14778\/3598581.3598594"},{"key":"e_1_2_2_57_1","first-page":"4026","article-title":"BrewER","volume":"16","author":"Zecchini Luca","year":"2023","unstructured":"Luca Zecchini, Giovanni Simonini, Sonia Bergamaschi, and Felix Naumann. 2023. BrewER: Entity Resolution On-Demand. PVLDB, Vol. 16, 12 (2023), 4026--4029.","journal-title":"Entity Resolution On-Demand. PVLDB"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3709715","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3709715","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T18:20:55Z","timestamp":1774981255000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3709715"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,10]]},"references-count":57,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,2,10]]}},"alternative-id":["10.1145\/3709715"],"URL":"https:\/\/doi.org\/10.1145\/3709715","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,10]]}}}