{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T18:09:04Z","timestamp":1757614144195,"version":"3.44.0"},"reference-count":83,"publisher":"Association for Computing Machinery (ACM)","issue":"11","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,7]]},"abstract":"<jats:p>We study the similarity join problem from a systems perspective. A similarity join retrieves all similar record pairs from two collections based on a given distance function. Existing solutions are often optimized for a single distance function and domain. Such monolithic solutions are limited in both their extensibility to new distance functions and their robustness against changing data characteristics.<\/jats:p>\n          <jats:p>To address these challenges, we introduce Fast, a similarity join algorithm designed for extensible and robust query evaluation. It leverages a novel abstraction called reductions, which transform similarity join problems from complex domains into simpler ones. A reduction graph is constructed to systematically enumerate query plans. Since cost models for similarity queries are typically unavailable, Fast employs runtime partitioning and a sampling-based strategy to select a near-optimal query plan with performance guarantees. It can utilize prebuilt indexes or build them on-the-fly, incorporating caching techniques to accelerate index construction and probing. Extensive experiments across diverse datasets, domains, and distance functions show that Fast consistently performs close to the optimal plan. Finally, two case studies highlight its strength as a baseline and its utility for prototyping future similarity join algorithms.<\/jats:p>","DOI":"10.14778\/3749646.3749660","type":"journal-article","created":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T17:55:06Z","timestamp":1757008506000},"page":"3868-3882","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Extensible and Robust Evaluation of Similarity Queries"],"prefix":"10.14778","volume":"18","author":[{"given":"Daniel","family":"Schmitt","sequence":"first","affiliation":[{"name":"University of Salzburg, Austria"}]},{"given":"Thomas","family":"H\u00fctter","sequence":"additional","affiliation":[{"name":"Software Competence, Center Hagenberg, Austria"}]},{"given":"Nikolaus","family":"Augsten","sequence":"additional","affiliation":[{"name":"University of Salzburg, Austria"}]}],"member":"320","published-online":{"date-parts":[[2025,9,4]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipl.2006.06.002"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.5555\/1182635.1164206"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5441\/002\/EDBT.2018.59"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.2200\/S00544ED1V01Y201310DTM038"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/1083592.1083630"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1670243.1670247"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335420"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the 32nd International Conference on Very Large Data Bases","author":"Nardini Barioni Maria Camila","year":"2006","unstructured":"Maria Camila Nardini Barioni, Humberto Luiz Razente, Agma J. M. Traina, and Caetano Traina Jr. 2006. SIREN: A Similarity Retrieval Engine for Complex Data. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12\u201315, 2006, Umeshwar Dayal, Kyu-Young Whang, David B. Lomet, Gustavo Alonso, Guy M. Lohman, Martin L. Kersten, Sang Kyun Cha, and Young-Kuk Kim (Eds.). ACM, 1155\u20131158. http:\/\/dl.acm.org\/citation.cfm?id=1164232"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242591"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/NOMS.2008.4575140"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/354756.354832"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/375663.375714"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.14778\/2428536.2428537"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.9"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3055399.3055443"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/352595.352598"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03)","author":"Cohen William W.","year":"2003","unstructured":"William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), August 9\u201310, 2003, Acapulco, Mexico, Subbarao Kambhampati and Craig A. Knoblock (Eds.). 73\u201378. http:\/\/www.isi.edu\/info-agents\/workshops\/ijcai03\/papers\/Cohen-p.pdf"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/3020488.3020497"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/2856318.2856330"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3183748"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1609\/SOCS.V4I1.18299"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.3233\/SW-150209"},{"key":"e_1_2_1_23_1","unstructured":"Kevin Dre\u00dfler and Axel-Cyrille Ngonga Ngomo. 2017. On the efficient execution of bounded Jaro-Winkler distances (Source Code). https:\/\/github.com\/dice-group\/LIMES. Source code accompanying the paper."},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI","author":"Feldman Zohar","year":"2013","unstructured":"Zohar Feldman and Carmel Domshlak. 2013. Monte-Carlo Planning: Theoretically Fast Convergence Meets Practical Efficiency. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 2013, Bellevue, WA, USA, August 11\u201315, 2013, Ann E. Nicholson and Padhraic Smyth (Eds.). AUAI Press. https:\/\/dslpitt.org\/uai\/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=2382&proceeding_id=29"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1613\/JAIR.4432"},{"key":"e_1_2_1_26_1","volume-title":"VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11\u201314","author":"Gravano Luis","year":"2001","unstructured":"Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. 2001. Approximate String Joins in a Database (Almost) for Free. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11\u201314, 2001, Roma, Italy, Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass (Eds.). Morgan Kaufmann, 491\u2013500. http:\/\/www.vldb.org\/conf\/2001\/P491.pdf"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/564691.564725"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517850"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00081"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732296.2732299"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/S00778-007-0061-2"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-24741-8_39"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/S00778-023-00806-Z"},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Nikolai Karpov Haoyu Zhang and Qin Zhang. 2024. MinJoin++: a fast algorithm for string similarity joins under edit distance (Source Code). https:\/\/github.com\/kedayuge\/MinJoin. Source code accompanying the paper.","DOI":"10.1007\/s00778-023-00806-z"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.14778\/3565816.3565833"},{"key":"e_1_2_1_36_1","doi-asserted-by":"crossref","unstructured":"Nikolai Karpov and Qin Zhang. 2022. SyncSignature: A Simple Efficient Parallelizable Framework for Tree Similarity Joins (Source Code). https:\/\/github.com\/nkkarpov\/syncsignature. Source code accompanying the paper.","DOI":"10.14778\/3565816.3565833"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.18420\/BTW2019-13"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1609\/icaps.v22i1.13518"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3275366.3275370"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-17928-0_12"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/11871842_29"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2008.4497434"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2590989.2590994"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/2078331.2078340"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2007.367848"},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of the 26th GI-Workshop Grundlagen von Datenbanken, Bozen-Bolzano, Italy, October 21st to 24th, 2014 (CEUR Workshop Proceedings), Friederike Klan, G\u00fcnther Specht, and Johann Gamper (Eds.)","volume":"1313","author":"Mann Willi","year":"2014","unstructured":"Willi Mann and Nikolaus Augsten. 2014. PEL: Position-Enhanced Length Filter for Set Similarity Joins. In Proceedings of the 26th GI-Workshop Grundlagen von Datenbanken, Bozen-Bolzano, Italy, October 21st to 24th, 2014 (CEUR Workshop Proceedings), Friederike Klan, G\u00fcnther Specht, and Johann Gamper (Eds.), Vol. 1313. CEUR-WS.org, 89\u201394. https:\/\/ceur-ws.org\/Vol-1313\/paper_16.pdf"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.14778\/2947618.2947620"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242524.1242529"},{"key":"e_1_2_1_49_1","volume-title":"Tappert","author":"Muniswamaiah Manoj","year":"2023","unstructured":"Manoj Muniswamaiah, Tilak Agerwala, and Charles C. Tappert. 2023. Applications of Binary Similarity and Distance Measures. arXiv:2307.00411 [cs.CV] https:\/\/arxiv.org\/abs\/2307.00411"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1007\/S13218-021-00713-X"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1007\/S10462-020-09821-W"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.14778\/2856318.2856325"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989431"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.14778\/3275536.3275539"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.195"},{"key":"e_1_2_1_56_1","doi-asserted-by":"crossref","unstructured":"Daniel Schmitt Thomas H\u00fctter and Nikolaus Augsten. 2025. Extensible and Robust Evaluation of Similarity Queries. Technical Report. University of Salzburg. https:\/\/frosch.cosy.sbg.ac.at\/dschmitt\/fast-similarity-evaluation\/-\/tree\/squashed\/paper","DOI":"10.14778\/3749646.3749660"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611480"},{"key":"e_1_2_1_58_1","unstructured":"Avi Silberschatz Henry F. Korth and S. Sudarshan. 2020. Database System Concepts Seventh Edition. McGraw-Hill Book Company. https:\/\/www.db-book.com\/"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-012-0296-4"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1038\/NATURE16961"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452790"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.14778\/3329772.3329774"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137765.3137810"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1007\/S10462-022-10228-Y"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/1401890.1402008"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.14778\/2809974.2809976"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.COSREV.2009.03.001"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3464389"},{"key":"e_1_2_1_69_1","volume-title":"Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Carl","year":"1994","unstructured":"Carl A. Waldspurger and William E. Weihl. 1994. Lottery Scheduling: Flexible Proportional-Share Resource Management. In Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, California, USA, November 14\u201317, 1994. USENIX Association, 1\u201311. http:\/\/dl.acm.org\/citation.cfm?id=1267639"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213847"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.79"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1007\/S00778-018-0529-2"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/CIG.2007.368095"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-68783-4_16"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380570"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00085"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453957"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.111"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000824.2000825"},{"key":"e_1_2_1_80_1","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066243"},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1609\/SOCS.V2I1.18194"},{"key":"e_1_2_1_82_1","doi-asserted-by":"publisher","DOI":"10.1145\/3097983.3098003"},{"key":"e_1_2_1_83_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330853"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3749646.3749660","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,5]],"date-time":"2025-09-05T03:19:06Z","timestamp":1757042346000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3749646.3749660"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7]]},"references-count":83,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025,7]]}},"alternative-id":["10.14778\/3749646.3749660"],"URL":"https:\/\/doi.org\/10.14778\/3749646.3749660","relation":{},"ISSN":["2150-8097"],"issn-type":[{"type":"print","value":"2150-8097"}],"subject":[],"published":{"date-parts":[[2025,7]]},"assertion":[{"value":"2025-09-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}