{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,15]],"date-time":"2026-03-15T07:36:23Z","timestamp":1773560183501,"version":"3.50.1"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"6","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,2]]},"abstract":"<jats:p>Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf\/idf measure has received virtually no attention. Yet, when we experimented with tf\/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf\/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf\/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall\/output size and runtime. 
Our findings suggest that (a) tf\/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.<\/jats:p>","DOI":"10.14778\/3583140.3583163","type":"journal-article","created":{"date-parts":[[2023,4,20]],"date-time":"2023-04-20T16:45:59Z","timestamp":1682009159000},"page":"1507-1519","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":27,"title":["Sparkly: A Simple yet Surprisingly Strong TF\/IDF Blocker for Entity Matching"],"prefix":"10.14778","volume":"16","author":[{"given":"Derek","family":"Paulsen","sequence":"first","affiliation":[{"name":"University of Wisconsin-Madison and Informatica Inc."}]},{"given":"Yash","family":"Govind","sequence":"additional","affiliation":[{"name":"Apple Inc."}]},{"given":"AnHai","family":"Doan","sequence":"additional","affiliation":[{"name":"University of Wisconsin-Madison and Informatica Inc."}]}],"member":"320","published-online":{"date-parts":[[2023,4,20]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Neural networks for entity matching. arXiv preprint arXiv:2010.11075","author":"Barlaug Nils","year":"2020","unstructured":"Nils Barlaug and Jon Atle Gulla. 2020. Neural networks for entity matching. arXiv preprint arXiv:2010.11075 (2020)."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-65965-3_20"},{"key":"e_1_2_1_3_1","volume-title":"In Proc. of the 12th ACM Conf. on Information and Knowledge Management.","author":"Broder Andrei Z.","year":"2003","unstructured":"Andrei Z. Broder, Michael Herscovici, and Jason Zien. 2003.
Efficient query evaluation using a two-level retrieval process. In Proc. of the 12th ACM Conf. on Information and Knowledge Management."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/872757.872796"},{"key":"e_1_2_1_5_1","volume-title":"A survey of indexing techniques for scalable record linkage and deduplication","author":"Christen Peter","year":"2011","unstructured":"Peter Christen. 2011. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering 24, 9 (2011), 1537--1555."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-31164-2"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3418896"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03)","author":"Cohen William W.","year":"2003","unstructured":"William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), August 9-10, 2003, Acapulco, Mexico, Subbarao Kambhampati and Craig A.
Knoblock (Eds.). 73--78. http:\/\/www.isi.edu\/info-agents\/workshops\/ijcai03\/papers\/Cohen-p.pdf"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3035960"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2433396.2433412"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2009916.2010048"},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","unstructured":"A. Doan, A. Halevy, and Z. Ives. 2012. Principles of Data Integration. Elsevier.","DOI":"10.1016\/B978-0-12-416044-6.00019-3"},{"key":"e_1_2_1_13_1","first-page":"1454","article-title":"Distributed representations of tuples for entity resolution","volume":"11","author":"Ebraheem Muhammad","year":"2018","unstructured":"Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed representations of tuples for entity resolution. PVLDB 11, 11 (2018), 1454--1467.","journal-title":"PVLDB"},{"key":"e_1_2_1_14_1","volume-title":"Verykios","author":"Elmagarmid Ahmed K.","year":"2007","unstructured":"Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. TKDE 19, 1 (2007)."},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. 2014. Corleone: hands-off crowdsourcing for entity matching.
SIGMOD.","DOI":"10.1145\/2588555.2588576"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3314042"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3236255"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-45442-5_3"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3007263.3007314"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/371578.371598"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000055"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452824"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/3421424.3421431"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511809071"},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In SIGMOD.","DOI":"10.1145\/3183713.3196926"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/1841211"},{"key":"e_1_2_1_27_1","volume-title":"A review of unsupervised and semi-supervised blocking methods for record linkage. Linking and Mining Heterogeneous and Multi-view Data","author":"O'Hare Kevin","year":"2019","unstructured":"Kevin O'Hare, Anna Jurek-Loughrey, and Cassio de Campos. 2019.
A review of unsupervised and semi-supervised blocking methods for record linkage. Linking and Mining Heterogeneous and Multi-view Data (2019), 79--105."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3450527"},{"key":"e_1_2_1_29_1","doi-asserted-by":"crossref","unstructured":"G. Papadakis, M. Fisichella, F. Schoger, G. Mandilaras, N. Augsten, and W. Nejdl. 2022. Benchmarking Filtering Techniques for Entity Resolution. Technical Report. arXiv:2022.12521v3.","DOI":"10.1109\/ICDE55515.2023.00389"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.2200\/S01067ED1V01Y202012DTM064"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2020.101565"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3377455"},{"key":"e_1_2_1_33_1","unstructured":"George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, Nikiforos Pittaras, Giovanni Simonini, Dimitrios Skoutas, Paul Isaris, George Giannakopoulos, Themis Palpanas, and Manolis Koubarakis. 2020. JedAI3: beyond batch blocking-based Entity Resolution. In EDBT. 603--606."},{"key":"e_1_2_1_34_1","unstructured":"D. Paulsen, Y. Govind, and A. Doan. 2022. Homepage of the Sparkly Blocking System.
https:\/\/github.com\/anhaidgroup\/sparkly."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308560.3316609"},{"key":"e_1_2_1_37_1","first-page":"4","article-title":"The Case for Shared Nothing","volume":"9","author":"Stonebraker Michael","year":"1986","unstructured":"Michael Stonebraker. 1986. The Case for Shared Nothing. IEEE Database Eng. Bull. 9, 1 (1986), 4--9. http:\/\/sites.computer.org\/debull\/86MAR-CD.pdf","journal-title":"IEEE Database Eng. Bull."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.14778\/3476249.3476294"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.2307\/3001968"},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","unstructured":"C. Xiao, W. Wang, X. Lin, and H. Shang. 2009. Top-k set similarity joins. ICDE.","DOI":"10.1109\/ICDE.2009.111"},{"key":"e_1_2_1_41_1","volume-title":"Christos Faloutsos, and David Page.","author":"Zhang Wei","year":"2020","unstructured":"Wei Zhang, Hao Wei, Bunyamin Sisman, Xin Luna Dong, Christos Faloutsos, and David Page. 2020. AutoBlock: A hands-off blocking framework for entity matching. In WSDM.
744--752."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3583140.3583163","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,11]],"date-time":"2023-12-11T03:56:18Z","timestamp":1702266978000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3583140.3583163"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2]]},"references-count":40,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2023,2]]}},"alternative-id":["10.14778\/3583140.3583163"],"URL":"https:\/\/doi.org\/10.14778\/3583140.3583163","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,2]]},"assertion":[{"value":"2023-04-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}