{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T11:39:23Z","timestamp":1763552363531,"version":"3.45.0"},"reference-count":47,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2025,11,16]],"date-time":"2025-11-16T00:00:00Z","timestamp":1763251200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100006958","name":"United States Census Bureau","doi-asserted-by":"publisher","award":["CB21RMD0160003"],"award-info":[{"award-number":["CB21RMD0160003"]}],"id":[{"id":"10.13039\/100006958","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>Record linkage is an essential task in data integration in the fields of healthcare, law enforcement, fraud detection, transportation, biology, and supply chain management. The problem of record linkage is to cluster records from various sources such that each cluster belongs to a single entity. Scalability in record linking is limited by the large number of pairwise comparisons required. Blocking addresses this challenge by partitioning data into smaller parts, substantially reducing the computational cost. With the advancement of Large Language Models (LLMs), there are several possibilities to improve record linkage by leveraging their semantic understanding of textual attributes. LLM-based record linkage algorithms in the literature have very large runtimes. In this paper, we show that the employment of blocking can result in significant improvements not only in the runtime but also in the accuracy. Specifically, we propose a record linkage algorithm that combines LLMs with blocking. Experimental evaluation demonstrates that our algorithm achieves lower runtimes while simultaneously improving F1 scores compared to the approaches relying solely on LLMs. These findings demonstrate the importance of blocking even in the era of advanced machine learning models.<\/jats:p>","DOI":"10.3390\/a18110723","type":"journal-article","created":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T11:17:27Z","timestamp":1763551047000},"page":"723","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Efficient Record Linkage in the Age of Large Language Models: The Critical Role of Blocking"],"prefix":"10.3390","volume":"18","author":[{"given":"Nidhibahen","family":"Shah","sequence":"first","affiliation":[{"name":"School of Computing, University of Connecticut, 371 Fairfield Way, Storrs, CT 06269, USA"}]},{"given":"Sreevar","family":"Patiyara","sequence":"additional","affiliation":[{"name":"CS Department, Purdue University, 305 N. University St., West Lafayette, IN 47907, USA"}]},{"given":"Joyanta","family":"Basak","sequence":"additional","affiliation":[{"name":"School of Computing, University of Connecticut, 371 Fairfield Way, Storrs, CT 06269, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8129-1676","authenticated-orcid":false,"given":"Sartaj","family":"Sahni","sequence":"additional","affiliation":[{"name":"CISE Department, University of Florida, Gainesville, FL 32611, USA"}]},{"given":"Anup","family":"Mathur","sequence":"additional","affiliation":[{"name":"U.S. Census Bureau, 4600 Silver Hill Road, Washington, DC 20233, USA"}]},{"given":"Krista","family":"Park","sequence":"additional","affiliation":[{"name":"U.S. Census Bureau, 4600 Silver Hill Road, Washington, DC 20233, USA"}]},{"given":"Sanguthevar","family":"Rajasekaran","sequence":"additional","affiliation":[{"name":"School of Computing, University of Connecticut, 371 Fairfield Way, Storrs, CT 06269, USA"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Papadakis, G., Ioannou, E., Thanos, E., and Palpanas, T. (2021). Four Generations of Entity Resolution, Springer.","DOI":"10.1007\/978-3-031-01878-7"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Shah, N., Soliman, A., Basak, J., Sahni, S., Haase, K., Mathur, A., Park, K., Weinberg, D., White, J., and Rajasekaran, S. (2024, January 15\u201318). The Soundex Blocking: A Novel Blocking Approach for Record Linkage. Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA.","DOI":"10.1109\/BigData62323.2024.10825041"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Shah, N., Basak, J., Sahni, S., Mathur, A., Park, K., Weinberg, D., and Rajasekaran, S. (2025). Double Metaphone Blocking: An Innovative Blocking Approach to Record Linkage. International Symposium on Bioinformatics Research and Applications, Springer Nature.","DOI":"10.1007\/978-981-95-0695-8_12"},{"key":"ref_4","first-page":"5998","article-title":"Attention Is All You Need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_5","unstructured":"Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018, January 24). Improving Language Understanding by Generative Pre-Training. Proceedings of the 2018 OpenAI Workshop, San Francisco, CA, USA."},{"key":"ref_6","unstructured":"Chu, Z., Ni, S., Wang, Z., Feng, X., Li, C., Hu, X., Xu, R., Yang, M., and Zhang, W. (2024). History, Development, and Principles of Large Language Models-An Introductory Survey. arXiv."},{"key":"ref_7","unstructured":"Xhst (2025, September 24). Unstructured Record Linkage Using Siamese Networks and Large Language Models (LLMs). Available online: https:\/\/github.com\/Xhst\/ml-record-linkage."},{"key":"ref_8","unstructured":"Liu, M., Roy, S., Li, W., Zhong, Z., Sebe, N., and Ricci, E. (2024, January 7\u201311). Democratizing Fine-grained Visual Recognition with Large Language Models. Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria. Available online: https:\/\/openreview.net\/forum?id=c7DND1iIgb."},{"key":"ref_9","unstructured":"Liu, H., Zeng, S., Deng, L., Liu, T., Liu, X., Zhang, Z., and Li, Y.-F. HPCTrans: Heterogeneous Plumage Cues-Aware Texton Correlation Representation for FBIC via Transformers, IEEE Trans. Circuits Syst. Video Technol., in press."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"105266","DOI":"10.1016\/j.dsp.2025.105266","article-title":"DSR-Net: Distinct Selective Rollback Queries for Road Cracks Detection with Detection Transformer","volume":"164","author":"Deng","year":"2025","journal-title":"Digit. Signal Process."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"404","DOI":"10.1017\/S0003055418000783","article-title":"Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records","volume":"113","author":"Enamorado","year":"2019","journal-title":"Am. Polit. Sci. Rev."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1183","DOI":"10.1080\/01621459.1969.10501049","article-title":"A Theory for Record Linkage","volume":"64","author":"Fellegi","year":"1969","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"385","DOI":"10.1080\/01621459.2012.757231","article-title":"A Generalized Fellegi-Sunter Framework for Multiple Record Linkage with Application to Homicide Record Systems","volume":"108","author":"Sadinle","year":"2013","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_14","unstructured":"Enamorado, T., Fifield, B., and Imai, K. (2019). FastLink: Fast Probabilistic Record Linkage with Missing Data. R Package, Version 0.6.1, CRAN. Available online: https:\/\/CRAN.R-project.org\/package=fastLink."},{"key":"ref_15","unstructured":"Ministry of Justice (MoJ) (2025, September 24). Splink: MoJ\u2019s Open Source Library for Probabilistic Record Linkage at Scale. Version 1.0, Available online: https:\/\/github.com\/moj-analytical-services\/splink."},{"key":"ref_16","unstructured":"Christen, P., and Churches, T. (2002). FEBRL\u2014Freely Extensible Biomedical Record Linkage. Joint Computer Science Technical Report Series (Online), Australian National University, Department of Computer Science. TRCS-02-05."},{"key":"ref_17","unstructured":"(2025, September 24). FEBRL. Available online: http:\/\/sourceforge.net\/projects\/febrl\/."},{"key":"ref_18","unstructured":"U.S. Department of Commerce and Labor, and Bureau of the Census (1900). The Deaf, Special Reports: The Blind and the Deaf."},{"key":"ref_19","unstructured":"Box, J.F. (1978). R.A. Fisher, the Life of a Scientist, Wiley."},{"key":"ref_20","first-page":"1409","article-title":"Record Linkage of Healthcare Insurance Claims","volume":"84","author":"Victor","year":"2001","journal-title":"Stud. Health Technol. Inform."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Sauleau, E.A., Paumier, J.P., and Buemi, A. (2005). Medical Record Linkage in Health Information Systems by Approximate String Matching and Clustering. BMC Med. Inform. Decis. Mak., 5.","DOI":"10.1186\/1472-6947-5-32"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1007\/s10654-018-0442-4","article-title":"Approach to Record Linkage of Primary Care Data from Clinical Practice Research Datalink to Other Health-Related Patient Data: Overview and Implications","volume":"34","author":"Padmanabhan","year":"2019","journal-title":"Eur. J. Epidemiol."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1197\/jamia.M2605","article-title":"Opportunities for Electronic Health Record Data to Support Business Functions in the Pharmaceutical Industry\u2014A Case Study from Pfizer, Inc","volume":"15","author":"Kim","year":"2008","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1537","DOI":"10.1109\/TKDE.2011.127","article-title":"A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication","volume":"24","author":"Christen","year":"2012","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_25","first-page":"31","article-title":"Blocking and Filtering Techniques for Entity Resolution: A Survey","volume":"53","author":"Papadakis","year":"2020","journal-title":"ACM Comput. Surv."},{"key":"ref_26","unstructured":"Odell, M., and Russell, R. (1918). The Soundex Coding System. (US1261167A), U.S. Patent."},{"key":"ref_27","first-page":"38","article-title":"The Double Metaphone Search Algorithm","volume":"18","author":"Philips","year":"2000","journal-title":"C\/C++ Users J."},{"key":"ref_28","unstructured":"Christen, P. (2006, January 6\u201311). A Comparison of Phonetic Encoding Algorithms for Historical Name Matching. Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM), Arlington, VA, USA."},{"key":"ref_29","unstructured":"Talburt, J.R., and Zhou, Y. (2010, January 4\u20136). Entity Resolution Using Double Metaphone in Commercial Datasets. Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI), Las Vegas, NV, USA."},{"key":"ref_30","first-page":"45","article-title":"Improving Record Linkage with Phonetic Algorithms in Healthcare","volume":"48","author":"Ong","year":"2014","journal-title":"J. Biomed. Inform."},{"key":"ref_31","unstructured":"Behm, A., Ji, S., Li, C., and Lu, J. (April, January 29). Fuzzy Search with Double Metaphone for Approximate Matching. Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE), Shanghai, China."},{"key":"ref_32","unstructured":"Hassanzadeh, O., Chiang, F., and Miller, R.J. (2011, January 22\u201325). Clustering Records with Double Metaphone: A Scalability Study. Proceedings of the 16th International Conference on Database Systems for Advanced Applications (DASFAA), Hong Kong, China."},{"key":"ref_33","first-page":"345","article-title":"Approximate String Joins with Double Metaphone in Databases","volume":"12","author":"Gravano","year":"2003","journal-title":"VLDB J."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Karakasidis, A., and Verykios, V.S. (2009, January 10\u201312). Privacy-Preserving Record Linkage Using Phonetic Codes. Proceedings of the 13th Panhellenic Conference on Informatics (PCI), Corfu, Greece.","DOI":"10.1109\/BCI.2009.29"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Mudgal, S., Li, H., Ko, T., Srivastava, A., Wang, R., Mitra, S., Srivatsa, S., Popa, R.A., Elmore, A.J., and Halevy, A. (2018, January 10\u201315). Deep Learning for Entity Matching: A Design Space Exploration. Proceedings of the 2018 International Conference on Management of Data (SIGMOD\u201918), Houston, TX, USA.","DOI":"10.1145\/3183713.3196926"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"50","DOI":"10.14778\/3421424.3421431","article-title":"Deep Entity Matching with Pre-Trained Language Models","volume":"14","author":"Li","year":"2020","journal-title":"Proc. VLDB Endow."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"107378","DOI":"10.1016\/j.engappai.2023.107378","article-title":"Network-based exploratory data analysis and explainable three-stage deep clustering for financial customer profiling","volume":"128","author":"Choi","year":"2024","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Deo, N., Rajasekaran, S., and Kamel, R. (2023, January 15\u201318). Identifying Suitable Attributes for Record Linkage using Association Analysis. Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy.","DOI":"10.1109\/BigData59044.2023.10386177"},{"key":"ref_39","unstructured":"Meta AI (2024). LLaMA 3.1 8B Instruct Model, Version 3.1, Meta. Available online: https:\/\/huggingface.co\/meta-llama\/Llama-3.1-8B-Instruct."},{"key":"ref_40","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"8068","DOI":"10.1109\/TII.2023.3266366","article-title":"LDCNet: Limb Direction Cues-Aware Network for Flexible Human Pose Estimation in Industrial Behavioral Biometrics Systems","volume":"20","author":"Liu","year":"2023","journal-title":"IEEE Trans. Ind. Informat."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"8464","DOI":"10.1109\/TMM.2022.3197364","article-title":"EHPE: Skeleton Cues-Based Gaussian Coordinate Encoding for Efficient Human Pose Estimation","volume":"26","author":"Liu","year":"2022","journal-title":"IEEE Trans. Multimed."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"1677","DOI":"10.1109\/TMM.2023.3238548","article-title":"TransIFC: Invariant Cues-Aware Feature Concentration Learning for Efficient Fine-Grained Bird Image Classification","volume":"27","author":"Liu","year":"2023","journal-title":"IEEE Trans. Multimed."},{"key":"ref_44","unstructured":"(2025, September 24). SSDMF Homepage. Available online: http:\/\/ssdmf.info\/download.html."},{"key":"ref_45","unstructured":"North Carolina State Board of Elections (NCSBE) (2025, September 24). Voter Registration Data, Available online: https:\/\/www.ncsbe.gov\/results-data\/voter-registration-data."},{"key":"ref_46","unstructured":"Revinate Engineering (2016). CRM Data Pipeline Record Linkage (Part I). Revinate Engineering Blog (Online), Revinate. Available online: https:\/\/underthehood.meltwater.com\/blog\/2020\/06\/29\/the-record-linking-pipeline-for-our-knowledge-graph-part-1\/."},{"key":"ref_47","first-page":"1420","article-title":"Health Information Exchange among U.S. Hospitals: Who\u2019s In, Who\u2019s Out, and What Are the Implications?","volume":"36","author":"Jha","year":"2017","journal-title":"Health Aff."}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/18\/11\/723\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T11:30:50Z","timestamp":1763551850000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/18\/11\/723"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,16]]},"references-count":47,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2025,11]]}},"alternative-id":["a18110723"],"URL":"https:\/\/doi.org\/10.3390\/a18110723","relation":{},"ISSN":["1999-4893"],"issn-type":[{"value":"1999-4893","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,16]]}}}