{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T23:31:25Z","timestamp":1780356685358,"version":"3.54.1"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2025,2,12]],"date-time":"2025-02-12T00:00:00Z","timestamp":1739318400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-sa\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2025,3,31]]},"abstract":"<jats:p>\n            Clustering is a fundamental task in machine learning. One of the most successful and broadly used algorithms is DBSCAN, a density-based clustering algorithm. DBSCAN requires \u03f5-nearest neighbor graphs of the input dataset, which are computed with range-search algorithms and spatial data structures like KD-trees. Despite many efforts to design scalable implementations for DBSCAN, existing work is limited to low-dimensional datasets, as constructing \u03f5-nearest neighbor graphs can be expensive in high-dimensions. This article introduces a modified DBSCAN, using\n            <jats:italic>k<\/jats:italic>\n            -nearest neighbor (\n            <jats:italic>k<\/jats:italic>\n            NN) graphs to improve efficiency. We outline conditions for\n            <jats:italic>k<\/jats:italic>\n            NN-DBSCAN to match DBSCAN\u2019s results and present a parallel implementation using OpenMP and MPI for shared and distributed memory systems. Testing on datasets up to 32 dimensions, we achieve remarkable scalability. Our implementation clusters one billion 3D points in under one second on 28K cores at TACC\u2019s Frontera system. In a larger run, we cluster 65 billion points in 20 dimensions in under 40 seconds using 114,688 cores. Our method is up to 37\u00d7 faster than state-of-the-art parallel DBSCAN on a 20-dimensional dataset with 4 million points. Code is available at\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"url\" xlink:href=\"https:\/\/github.com\/ut-padas\/knndbscan\">https:\/\/github.com\/ut-padas\/knndbscan<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3701624","type":"journal-article","created":{"date-parts":[[2024,10,24]],"date-time":"2024-10-24T09:40:54Z","timestamp":1729762854000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["KNN-DBSCAN: a DBSCAN in high dimensions"],"prefix":"10.1145","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0633-6691","authenticated-orcid":false,"given":"Youguang","family":"Chen","sequence":"first","affiliation":[{"name":"Oden Institute, UT Austin, Austin, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5702-022X","authenticated-orcid":false,"given":"William","family":"Ruys","sequence":"additional","affiliation":[{"name":"Oden Institute, UT Austin, Austin, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0033-3994","authenticated-orcid":false,"given":"George","family":"Biros","sequence":"additional","affiliation":[{"name":"Oden Institute, UT Austin, Austin, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,2,12]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2783405"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/FOCS.2006.49"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327494"},{"key":"e_1_3_2_5_2","first-page":"3","volume-title":"Proceedings of the International European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning","volume":"3","year":"2013","unstructured":"Davide Anguita, Alessandro Ghio, Luca Oneto, Francesc Parra, and J. Reyes-Ortiz. 2013. A public domain dataset for human activity recognition using smartphones. In Proceedings of the International European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Vol. 3. 3."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/304181.304187"},{"key":"e_1_3_2_7_2","unstructured":"Arjun Bhasin. 2018. Credit Card Data for Clustering. Retrieved 10 Nov. 2024 from https:\/\/www.kaggle.com\/datasets\/arjunbhasin2013\/ccdata"},{"key":"e_1_3_2_8_2","unstructured":"Shiva Chandel. 2018. KC House Data. Retrieved 10 Nov. 2024 from https:\/\/www.kaggle.com\/datasets\/astronautelvis\/kc-house-data\/data"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-46502-2_13"},{"key":"e_1_3_2_10_2","first-page":"226","volume-title":"Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining","volume":"96","year":"1996","unstructured":"Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise.. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Vol. 96, 226\u2013231."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3083897"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/1217299.1217303"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/1656274.1656278"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2009.09.011"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.5555\/3305381.3305555"},{"key":"e_1_3_2_16_2","unstructured":"Heinrich Jiang Jennifer Jang and Jakub Lacki. 2020. Faster DBSCAN via subsampled similarity queries. Advances in Neural Information Processing Systems 33 (2020) 22407\u201322419."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.1107769108"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/2.781637"},{"key":"e_1_3_2_19_2","volume-title":"Algorithm Design","author":"Kleinberg J.","year":"2006","unstructured":"J. Kleinberg. 2006. Algorithm Design. Vol. 92. Pearson Education."},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3007748.3007773"},{"key":"e_1_3_2_21_2","unstructured":"Leland McInnes John Healy and James Melville. 2018. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426. Retrieved from https:\/\/arxiv.org\/abs\/1802.03426"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2014.2321376"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503255"},{"key":"e_1_3_2_24_2","unstructured":"Md Mostofa Ali Patwary. 2012. Parallel DBSCAN (PDSCAN). Retrieved 10 Nov. 2024 from http:\/\/cucis.ece.northwestern.edu\/projects\/Clustering\/download_code_dbscan.html"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/2807591.2807616"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.5555\/2388996.2389081"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.51"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_3_2_29_2","volume-title":"R: A Language and Environment for Statistical Computing","author":"Team R Core","year":"2013","unstructured":"R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from http:\/\/www.R-project.org\/ISBN 3-900051-07-0."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.24432\/C5F88Q"},{"key":"e_1_3_2_31_2","volume-title":"Foundations of Multidimensional and Metric Data Structures","author":"Samet H.","year":"2006","unstructured":"H. Samet. 2006. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann."},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009745219419"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2019.8891020"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824115"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3068335"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/2442516.2442530"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2019.10.019"},{"key":"e_1_3_2_38_2","doi-asserted-by":"crossref","unstructured":"Yiqiu Wang Yan Gu and Julian Shun. 2019. Theoretically-efficient and practical parallel DBSCAN. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2555\u20132571.","DOI":"10.1145\/3318464.3380582"},{"key":"e_1_3_2_39_2","first-page":"194","volume-title":"Proceedings of the International Conference on Very Large Data Bases","author":"Weber R.","year":"1998","unstructured":"R. Weber, H. J. Schek, and S. Blott. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Data Bases. IEEE, 194\u2013205."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1137\/15M1026377"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00018"},{"key":"e_1_3_2_42_2","unstructured":"Chenhan D. Yu Severin Reiz and James Levitt. 2018. GOFMM. Retrieved 10 Nov. 2024 from https:\/\/github.com\/ChenhanYu\/hmlp"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3701624","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3701624","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:10:05Z","timestamp":1750295405000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3701624"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,12]]},"references-count":41,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,3,31]]}},"alternative-id":["10.1145\/3701624"],"URL":"https:\/\/doi.org\/10.1145\/3701624","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"value":"2329-4949","type":"print"},{"value":"2329-4957","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,12]]},"assertion":[{"value":"2024-06-05","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-10-04","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}