{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,12]],"date-time":"2025-09-12T19:02:00Z","timestamp":1757703720836,"version":"3.32.0"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:p>String matching is at the core of data cleaning, record matching, and information retrieval. String matching relies on a similarity measure that evaluates the similarity of two strings, regarding the two as a match if their similarity is larger than a user-defined threshold. In our collaboration with journalists and public defenders, we found that real-world datasets, such as police rosters that journalists and public defenders work with, often contain acronyms, abbreviations, and typos, thanks to errors during manual entry, into, say, a spreadsheet or a form. Unfortunately, traditional similarity measures lead to low accuracy since they do not consider all three aspects together. Some recent work proposes leveraging synonym rules to improve matching, but either requires these rules to be provided upfront, or generated prior to matching, which leads to low accuracy in our setting and similar ones. To address these limitations, we propose Smash, a simple yet effective measure to assess the similarity of two strings with acronyms, abbreviations, and typos, all without relying on synonym rules. We design a dynamic programming algorithm to efficiently compute this measure, along with two optimizations that improve accuracy. We show that compared to the best baselines, including one based on ChatGPT with GPT-4, Smash improves the max and mean F-score by 23.5% and 110.8%, respectively. 
We implement Smash in OpenRefine, a graphical data cleaning tool, to facilitate its use by journalists, public defenders, and other non-programmers for data cleaning.<\/jats:p>","DOI":"10.14778\/3685800.3685830","type":"journal-article","created":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T17:25:21Z","timestamp":1731086721000},"page":"4104-4116","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity Matching"],"prefix":"10.14778","volume":"17","author":[{"given":"Joshua","family":"Wu","sequence":"first","affiliation":[{"name":"UC Berkeley"}]},{"given":"Dixin","family":"Tang","sequence":"additional","affiliation":[{"name":"UT Austin"}]},{"given":"Nithin","family":"Chalapathi","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"Tristan","family":"Chambers","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]},{"given":"Julie","family":"Ciccolini","sequence":"additional","affiliation":[{"name":"Techtivist"}]},{"given":"Cheryl","family":"Phillips","sequence":"additional","affiliation":[{"name":"Dept. of Communication, Stanford University"}]},{"given":"Lisa","family":"Pickoff-White","sequence":"additional","affiliation":[{"name":"Investigative Reporting Program, UC Berkeley"}]},{"given":"Aditya","family":"Parameswaran","sequence":"additional","affiliation":[{"name":"UC Berkeley"}]}],"member":"320","published-online":{"date-parts":[[2024,11,8]]},"reference":[{"volume-title":"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.sparse.csgraph.maximum_bipartite_matching.html [Online","year":"2023","key":"e_1_2_1_1_1","unstructured":"Bipartite matching. 
https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.sparse.csgraph.maximum_bipartite_matching.html [Online; accessed 17-November-2023]."},{"volume-title":"https:\/\/chat.openai.com\/ [Online","year":"2024","key":"e_1_2_1_2_1","unstructured":"Chatgpt. https:\/\/chat.openai.com\/ [Online; accessed 12-Mar-2024]."},{"volume-title":"https:\/\/zenodo.org\/record\/4266963 [Online","year":"2023","key":"e_1_2_1_3_1","unstructured":"Disease dataset. https:\/\/zenodo.org\/record\/4266963 [Online; accessed 27-February-2023]."},{"volume-title":"https:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance [Online","year":"2023","key":"e_1_2_1_4_1","unstructured":"Levenshtein distance. https:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance [Online; accessed 27-February-2023]."},{"volume-title":"https:\/\/github.com\/OpenRefine\/OpenRefine [Online","year":"2023","key":"e_1_2_1_5_1","unstructured":"Openrefine. https:\/\/github.com\/OpenRefine\/OpenRefine [Online; accessed 27-February-2023]."},{"volume-title":"https:\/\/www.ranks.nl\/stopwords [Online","year":"2023","key":"e_1_2_1_6_1","unstructured":"Stopwords. https:\/\/www.ranks.nl\/stopwords [Online; accessed 27-February-2023]."},{"volume-title":"https:\/\/en.wikipedia.org\/wiki\/Subsequence [Online","year":"2023","key":"e_1_2_1_7_1","unstructured":"Subsequence. https:\/\/en.wikipedia.org\/wiki\/Subsequence [Online; accessed 17-November-2023]."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3580305.3599402"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7--12, 2008","author":"Arasu A.","year":"2008","unstructured":"A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In G. Alonso, J. A. Blakeley, and A. L. P. Chen, editors, Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7--12, 2008, Canc\u00fan, Mexico, pages 40--49. 
IEEE Computer Society, 2008."},{"key":"e_1_2_1_10_1","first-page":"918","volume-title":"Proceedings of the 32nd International Conference on Very Large Data Bases","author":"Arasu A.","year":"2006","unstructured":"A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In U. Dayal, K. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y. Kim, editors, Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12--15, 2006, pages 918--929. ACM, 2006."},{"key":"e_1_2_1_11_1","first-page":"67","volume-title":"IIWeb","author":"Bilenko M.","year":"2003","unstructured":"M. Bilenko and R. J. Mooney. Employing trainable string similarity metrics for information integration. In IIWeb, pages 67--72, 2003."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/3291264.3291272"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3--8","author":"Chaudhuri S.","year":"2006","unstructured":"S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In L. Liu, A. Reuter, K. Whang, and J. Zhang, editors, Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3--8 April 2006, Atlanta, GA, USA, page 5. IEEE Computer Society, 2006."},{"key":"e_1_2_1_14_1","first-page":"51","volume-title":"Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018","author":"Dai J.","year":"2018","unstructured":"J. Dai, M. Zhang, G. Chen, J. Fan, K. Y. Ngiam, and B. C. Ooi. Fine-grained concept linking using neural networks in healthcare. In G. Das, C. M. Jermaine, and P. A. Bernstein, editors, Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10--15, 2018, pages 51--66. 
ACM, 2018."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.14778\/3115404.3115413"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.250581"},{"key":"e_1_2_1_17_1","first-page":"11","volume-title":"Proceedings of the Fifth International Workshop on Quality in Databases, QDB 2007, at the VLDB 2007 conference","author":"Hassanzadeh O.","year":"2007","unstructured":"O. Hassanzadeh, M. Sadoghi, and R. J. Miller. Accuracy of approximate string joins using grams. In V. Ganti and F. Naumann, editors, Proceedings of the Fifth International Workshop on Quality in Databases, QDB 2007, at the VLDB 2007 conference, Vienna, Austria, September 23, 2007, pages 11--18, 2007."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824036"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732296.2732299"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-020-0350-4"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7--12, 2008","author":"Li C.","year":"2008","unstructured":"C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In G. Alonso, J. A. Blakeley, and A. L. P. Chen, editors, Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7--12, 2008, Canc\u00fan, Mexico, pages 257--266. 
IEEE Computer Society, 2008."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.14778\/2078331.2078340"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/3421424.3421431"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465313"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/2947618.2947620"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00121"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/2212351.2212356"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-59419-0_24"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00191"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/3151113.3151118"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920992"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213847"},{"key":"e_1_2_1_33_1","first-page":"109","volume-title":"Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT","author":"Wang J.","year":"2019","unstructured":"J. Wang, C. Lin, M. Li, and C. Zaniolo. An efficient sliding window approach for approximate entity extraction with synonyms. In M. Herschel, H. Galhardas, B. Reinwald, I. Fundulaki, C. Binnig, and Z. Kaoudi, editors, Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019, Lisbon, Portugal, March 26--29, 2019, pages 109--120. OpenProceedings.org, 2019."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.79"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00093"},{"volume-title":"Jaccard index","year":"2023","key":"e_1_2_1_36_1","unstructured":"Wikipedia contributors. Jaccard index, 2023. 
[Online; accessed 27-February-2023]."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453957"},{"key":"e_1_2_1_38_1","first-page":"131","volume-title":"Proceedings of the 17th International Conference on World Wide Web, WWW 2008","author":"Xiao C.","year":"2008","unstructured":"C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In J. Huai, R. Chen, H. Hon, Y. Liu, W. Ma, A. Tomkins, and X. Zhang, editors, Proceedings of the 17th International Conference on World Wide Web, WWW 2008, Beijing, China, April 21--25, 2008, pages 131--140. ACM, 2008."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3342263.3342268"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3685800.3685830","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,31]],"date-time":"2024-12-31T05:30:13Z","timestamp":1735623013000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3685800.3685830"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8]]},"references-count":39,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["10.14778\/3685800.3685830"],"URL":"https:\/\/doi.org\/10.14778\/3685800.3685830","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2024,8]]},"assertion":[{"value":"2024-11-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}