{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,28]],"date-time":"2025-11-28T12:18:41Z","timestamp":1764332321142},"reference-count":27,"publisher":"Association for Computing Machinery (ACM)","issue":"8","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2014,4]]},"abstract":"<jats:p>String similarity join is an important operation in data integration and cleansing that finds similar string pairs from two collections of strings. More than ten algorithms have been proposed to address this problem in the recent two decades. However, existing algorithms have not been thoroughly compared under the same experimental framework. For example, some algorithms are tested only on specific datasets. This makes it rather difficult for practitioners to decide which algorithms should be used for various scenarios. To address this problem, in this paper we provide a comprehensive survey on a wide spectrum of existing string similarity join algorithms, classify them into different categories based on their main techniques, and compare them through extensive experiments on a variety of real-world datasets with different characteristics. We also report comprehensive findings obtained from the experiments and provide new insights about the strengths and weaknesses of existing similarity join algorithms which can guide practitioners to select appropriate algorithms for various scenarios.<\/jats:p>","DOI":"10.14778\/2732296.2732299","type":"journal-article","created":{"date-parts":[[2015,5,12]],"date-time":"2015-05-12T15:37:52Z","timestamp":1431445072000},"page":"625-636","source":"Crossref","is-referenced-by-count":96,"title":["String similarity joins"],"prefix":"10.14778","volume":"7","author":[{"given":"Yu","family":"Jiang","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Guoliang","family":"Li","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Jianhua","family":"Feng","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Wen-Syan","family":"Li","sequence":"additional","affiliation":[{"name":"SAP Lab, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2014,4]]},"reference":[{"key":"e_1_2_1_1_1","first-page":"918","volume-title":"VLDB","author":"Arasu A.","year":"2006","unstructured":"A. Arasu , V. Ganti , and R. Kaushik . Efficient exact set-similarity joins . In VLDB , pages 918 -- 929 , 2006 . A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242591"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2011.5767856"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.9"},{"key":"e_1_2_1_5_1","first-page":"426","volume-title":"VLDB","author":"Ciaccia P.","year":"1997","unstructured":"P. Ciaccia , M. Patella , and P. Zezula . M-tree: An efficient access method for similarity search in metric spaces . In VLDB , pages 426 -- 435 , 1997 . P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB, pages 426--435, 1997."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2593675"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544886"},{"key":"e_1_2_1_8_1","first-page":"491","volume-title":"VLDB","author":"Gravano L.","year":"2001","unstructured":"L. Gravano , P. G. Ipeirotis , H. V. Jagadish , N. Koudas , S. Muthukrishnan , and D. Srivastava . Approximate string joins in a database (almost) for free . In VLDB , pages 491 -- 500 , 2001 . L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491--500, 2001."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807204"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2012.87"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687702"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/1978665.1978666"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2008.4497434"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/2078331.2078340"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920924"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/2212351.2212353"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989431"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007652"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/2140436.2140440"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807222"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920992"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213847"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.79"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453957"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.111"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1367497.1367516"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807266"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/2732296.2732299","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T09:17:33Z","timestamp":1672219053000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/2732296.2732299"}},"subtitle":["an experimental evaluation"],"short-title":[],"issued":{"date-parts":[[2014,4]]},"references-count":27,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2014,4]]}},"alternative-id":["10.14778\/2732296.2732299"],"URL":"https:\/\/doi.org\/10.14778\/2732296.2732299","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2014,4]]}}}