{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T05:12:38Z","timestamp":1775538758493,"version":"3.50.1"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,12,4]]},"abstract":"<jats:p>Similarity join-a widely used operation in data science-finds all pairs of items that have distance smaller than a threshold. Prior work has explored distributed computation methods to scale similarity join to large data volumes but these methods require a cluster deployment, and efficiency suffers from expensive inter-machine communication. On the other hand, disk-based solutions are more cost-effective by using a single machine and storing the large dataset on high-performance external storage, such as NVMe SSDs, but in these methods the disk I\/O time is a serious bottleneck. In this paper, we propose DiskJoin, the first disk-based similarity join algorithm that can process billion-scale vector datasets efficiently on a single machine. DiskJoin improves disk I\/O by tailoring the data access patterns to avoid repetitive accesses and read amplification. It also uses main memory as a dynamic cache and carefully manages cache eviction to improve cache hit rate and reduce disk retrieval time. For further acceleration, we adopt a probabilistic pruning technique that can effectively prune a large number of vector pairs from computation. Our evaluation on real-world, large-scale datasets shows that DiskJoin significantly outperforms alternatives, achieving speedups from 50\u00d7 to 1000\u00d7.<\/jats:p>","DOI":"10.1145\/3769780","type":"journal-article","created":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T04:32:13Z","timestamp":1764995533000},"page":"1-27","source":"Crossref","is-referenced-by-count":0,"title":["DiskJoin: Large-scale Vector Similarity Join with SSD"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-9494-4947","authenticated-orcid":false,"given":"Yanqi","family":"Chen","sequence":"first","affiliation":[{"name":"University of Massachusetts Amherst, Amherst, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2122-915X","authenticated-orcid":false,"given":"Xiao","family":"Yan","sequence":"additional","affiliation":[{"name":"Institute for Math &amp; AI, Wuhan, Wuhan University, Wuhan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7346-6002","authenticated-orcid":false,"given":"Alexandra","family":"Meliou","sequence":"additional","affiliation":[{"name":"University of Massachusetts Amherst, Amherst, MA, USA and Archimedes\/Athena RC, Marousi, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2679-3945","authenticated-orcid":false,"given":"Eric","family":"Lo","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, China"}]}],"member":"320","published-online":{"date-parts":[[2025,12,5]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540","author":"Abbas Amro","year":"2023","unstructured":"Amro Abbas, Kushal Tirumala, D\u00e1niel Simig, Surya Ganguli, and Ari S Morcos. 2023. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540 (2023)."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.14778\/3611479.3611537"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/1182635.1164206"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1177\/1354856517737222"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2055-2063","author":"Babenko Artem","year":"2016","unstructured":"Artem Babenko and Victor Lempitsky. 2016. Efficient indexing of billion-scale datasets of deep descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2055-2063."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1147\/sj.52.0078"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/376284.375714"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/2428536.2428537"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/170036.170075"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1541880.1541882"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.9"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2016.2631599"},{"key":"e_1_2_1_13_1","first-page":"5199","article-title":"Spann: Highly-efficient billion-scale approximate nearest neighborhood search","volume":"34","author":"Chen Qi","year":"2021","unstructured":"Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. 2021. Spann: Highly-efficient billion-scale approximate nearest neighborhood search. Advances in Neural Information Processing Systems, Vol. 34 (2021), 5199-5212.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_14_1","first-page":"426","article-title":"M-tree: An efficient access method for similarity search in metric spaces","volume":"97","author":"Ciaccia Paolo","year":"1997","unstructured":"Paolo Ciaccia, Marco Patella, Pavel Zezula, et al., 1997. M-tree: An efficient access method for similarity search in metric spaces. In Vldb, Vol. 97. 426-435.","journal-title":"Vldb"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CBMI.2016.7500278"},{"key":"e_1_2_1_16_1","first-page":"1","article-title":"Near-duplicate video detection based on an approximate similarity self-join strategy","author":"da Silva Henrique B.","year":"2016","unstructured":"Henrique B. da Silva, Zenilton K. G. do Patroc\u00ednio, Guillaume Gravier, Laurent Amsaleg, Arnaldo de A. Ara\u00fajo, and Silvio Jamil F. Guimar\u00e3es. 2016b. Near-duplicate video detection based on an approximate similarity self-join strategy. In CBMI. 1-6.","journal-title":"CBMI."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732977.2732981"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-45227-0_48"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/3303753.3303754"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2213836.2213898"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3725413"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/2168651.2168654"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3543507.3583552"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2022.3161156"},{"key":"e_1_2_1_26_1","first-page":"901","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation Conference","author":"Gyawali Bikash","year":"2020","unstructured":"Bikash Gyawali, Lucas Anastasiou, and Petr Knoth. 2020. Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings. In Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France, 901-910."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.14778\/2850469.2850470"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/276698.276876"},{"key":"e_1_2_1_29_1","volume-title":"An overview of electronic commerce (e-Commerce). The journal of contemporary issues in business and government","author":"Jain Vipin","year":"2021","unstructured":"Vipin Jain, BINDOO Malviya, and SATYENDRA Arya. 2021. An overview of electronic commerce (e-Commerce). The journal of contemporary issues in business and government, Vol. 27, 3 (2021), 665-670."},{"key":"e_1_2_1_30_1","volume-title":"Product quantization for nearest neighbor search","author":"Jegou Herve","year":"2010","unstructured":"Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, Vol. 33, 1 (2010), 117-128."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-012-0305-7"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2005.07.002"},{"key":"e_1_2_1_33_1","volume-title":"Jyothi Vedurada, et al.","author":"Khan Saim","year":"2024","unstructured":"Saim Khan, Somesh Singh, Harsha Vardhan Simhadri, Jyothi Vedurada, et al., 2024. Bang: Billion-scale approximate nearest neighbor search using a single gpu. arXiv preprint arXiv:2401.11324 (2024)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1005"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD56317.2022.00051"},{"key":"e_1_2_1_36_1","first-page":"30233","article-title":"Matryoshka representation learning","volume":"35","author":"Kusupati Aditya","year":"2022","unstructured":"Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al., 2022. Matryoshka representation learning. Advances in Neural Information Processing Systems, Vol. 35 (2022), 30233-30249.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2836464"},{"key":"e_1_2_1_38_1","volume-title":"Proceedings of the 2018 World Wide Web Conference. 1439-1448","author":"Luo Chen","year":"2018","unstructured":"Chen Luo and Anshumali Shrivastava. 2018. Arrays of (locality-sensitive) count estimators (ace) anomaly detection on the edge. In Proceedings of the 2018 World Wide Web Conference. 1439-1448."},{"key":"e_1_2_1_39_1","volume-title":"Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs","author":"Malkov Yu A","year":"2018","unstructured":"Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, Vol. 42, 4 (2018), 824-836."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2019.106970"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2020.06.050"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/1989323.1989431"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2012.195"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3662010.3663448"},{"key":"e_1_2_1_45_1","first-page":"25278","article-title":"Laion-5b: An open large-scale dataset for training next generation image-text models","volume":"35","author":"Schuhmann Christoph","year":"2022","unstructured":"Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, Vol. 35 (2022), 25278-25294.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_46_1","volume-title":"Ravishankar Krishnaswamy, and Harsha Vardhan Simhadri.","author":"Singh Aditi","year":"2021","unstructured":"Aditi Singh, Suhas Jayaram Subramanya, Ravishankar Krishnaswamy, and Harsha Vardhan Simhadri. 2021. Freshdiskann: A fast and accurate graph-based ann index for streaming similarity search. arXiv preprint arXiv:2105.09613 (2021)."},{"key":"e_1_2_1_47_1","first-page":"13748","article-title":"DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node","author":"Subramanya Suhas Jayaram","year":"2019","unstructured":"Suhas Jayaram Subramanya, Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnaswamy, and Rohan Kadekodi. 2019. DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. In Advances in Neural Information Processing Systems. 13748-13758.","journal-title":"Advances in Neural Information Processing Systems."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_1_49_1","first-page":"1135","volume-title":"2024 USENIX Annual Technical Conference (USENIX ATC 24)","author":"Tian Bing","year":"2024","unstructured":"Bing Tian, Haikun Liu, Zhuohui Duan, Xiaofei Liao, Hai Jin, and Yu Zhang. 2024. Scalable billion-point approximate nearest neighbor search using {SmartSSDs}. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 1135-1150."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2959100.2959160"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-014-2185-x"},{"key":"e_1_2_1_52_1","volume-title":"Proceedings of 2012 International Conference on Management of Data. 85-96","author":"Wang Jiannan","year":"2012","unstructured":"Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering? An adaptive framework for similarity join and search. In Proceedings of 2012 International Conference on Management of Data. 85-96."},{"key":"e_1_2_1_53_1","first-page":"15738","article-title":"An efficient and robust framework for approximate nearest neighbor search with attribute constraint","volume":"36","author":"Wang Mengzhao","year":"2023","unstructured":"Mengzhao Wang, Lingwei Lv, Xiaoliang Xu, Yuxiang Wang, Qiang Yue, and Jiongkang Ni. 2023. An efficient and robust framework for approximate nearest neighbor search with attribute constraint. Advances in Neural Information Processing Systems, Vol. 36 (2023), 15738-15751.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3639269"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2487575.2487625"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915220"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/1291233.1291280"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2008.2009673"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453957"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2016.2638838"},{"key":"e_1_2_1_61_1","volume-title":"A Topology-Aware Localized Update Strategy for Graph-Based ANN Index. arXiv preprint arXiv:2503.00402","author":"Yu Song","year":"2025","unstructured":"Song Yu, Shengyuan Lin, Shufeng Gong, Yongqing Xie, Ruicheng Liu, Yijie Zhou, Ji Sun, Yanfeng Zhang, Guoliang Li, and Ge Yu. 2025. A Topology-Aware Localized Update Strategy for Graph-Based ANN Index. arXiv preprint arXiv:2503.00402 (2025)."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3357938"},{"key":"e_1_2_1_63_1","volume-title":"Approximate Vector Queries on Very Large Unstructured Datasets. In 20th Symposium on Networked Systems Design and Implementation. 995-1011","author":"Zhang Zili","year":"2023","unstructured":"Zili Zhang, Chao Jin, Linpeng Tang, Xuanzhe Liu, and Xin Jin. 2023. Fast, Approximate Vector Queries on Very Large Unstructured Datasets. In 20th Symposium on Networked Systems Design and Implementation. 995-1011."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE48307.2020.00094"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/3639324"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3769780","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T04:31:50Z","timestamp":1775536310000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3769780"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,4]]},"references-count":65,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2025,12,4]]}},"alternative-id":["10.1145\/3769780"],"URL":"https:\/\/doi.org\/10.1145\/3769780","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,4]]}}}