{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,5,10]],"date-time":"2025-05-10T06:05:39Z","timestamp":1746857139242},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,2,15]],"date-time":"2024-02-15T00:00:00Z","timestamp":1707955200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,15]],"date-time":"2024-02-15T00:00:00Z","timestamp":1707955200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Zhejiang Provincial Natural Science Foundation of China","award":["No. LDT23F01013F01","No. LDT23F01013F01"],"award-info":[{"award-number":["No. LDT23F01013F01","No. LDT23F01013F01"]}]},{"name":"Zhejiang Provincial Natural Science Foundation of China","award":["No. LDT23F01013F01","No. LDT23F01013F01","No. LDT23F01013F01"],"award-info":[{"award-number":["No. LDT23F01013F01","No. LDT23F01013F01","No. LDT23F01013F01"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Process Lett"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Despite the promising progress that has been made, large-scale clustering tasks still face various challenges: (i) high time and space complexity in K-nearest neighbors (KNN), which is often overlooked by most methods, and (ii) low recall rate caused by simply splitting the dataset. In this paper, we propose a novel framework for large-scale clustering tasks named large-scale clustering via recall KNN and subgraph segmentation (LS-RKSS) to perform faster clustering with guaranteed clustering performance, which embraces the ability of handling large-scale data up to 100 million using a single T4 GPU with less than 10% of the running time. We propose recall KNN (RKNN) and subgraph segmentation (SS) to effectively address the primary challenges in large-scale clustering tasks. Firstly, the recall KNN is proposed to perform efficient similarity search among dense vectors with lower time and space complexity compared to traditional exact search methods of KNN. Then, the subgraph segmentation is proposed to split the whole dataset into multiple subgraphs based on the recall KNN. Given the recall rate of RKNN based on traditional exact search methods, it is theoretically proved that dividing the dataset into multiple subgraphs using recall KNN and subgraph segmentation is a more reasonable and effective approach. Finally, clusters are generated independently on each subgraph, and the final clustering result is obtained by combining the results of all subgraphs. Extensive experiments demonstrate that LS-RKSS outperforms previous large-scale clustering methods in both effectiveness and efficiency.\n<\/jats:p>","DOI":"10.1007\/s11063-024-11444-z","type":"journal-article","created":{"date-parts":[[2024,2,15]],"date-time":"2024-02-15T04:23:52Z","timestamp":1707971032000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Large-Scale Clustering on 100\u00a0M-Scale Datasets Using a Single T4 GPU via Recall KNN and Subgraph Segmentation"],"prefix":"10.1007","volume":"56","author":[{"given":"Junjie","family":"Liu","sequence":"first","affiliation":[]},{"given":"Rongxin","family":"Jiang","sequence":"additional","affiliation":[]},{"given":"Xuesong","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Fan","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Yaowu","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Chen","family":"Shen","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,2,15]]},"reference":[{"issue":"4","key":"11444_CR1","doi-asserted-by":"publisher","first-page":"461","DOI":"10.1007\/s10791-008-9066-8","volume":"12","author":"E Amig\u00f3","year":"2009","unstructured":"Amig\u00f3 E, Gonzalo J, Artiles J et al (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retrieval 12(4):461\u2013486","journal-title":"Inf Retrieval"},{"key":"11444_CR2","doi-asserted-by":"crossref","unstructured":"An X, Zhu X, Gao Y, et\u00a0al (2021) Partial fc: Training 10 million identities on a single machine. In: Proceedings of the IEEE\/CVF international conference on computer vision (ICCV) workshops, pp 1445\u20131449","DOI":"10.1109\/ICCVW54120.2021.00166"},{"key":"11444_CR3","doi-asserted-by":"crossref","unstructured":"Baumann P (2016) Sparse-reduced computation for large-scale spectral clustering. In: 2016 IEEE international conference on industrial engineering and engineering management (IEEM), IEEE, pp 1284\u20131288","DOI":"10.1109\/IEEM.2016.7798085"},{"key":"11444_CR4","doi-asserted-by":"crossref","unstructured":"Bendechache M, Kechadi MT, Le-Khac NA (2016) Efficient large scale clustering based on data partitioning. In: 2016 IEEE international conference on data science and advanced analytics (DSAA), IEEE, pp 612\u2013621","DOI":"10.1109\/DSAA.2016.70"},{"issue":"6","key":"11444_CR5","doi-asserted-by":"publisher","first-page":"671","DOI":"10.1080\/17445760.2018.1446210","volume":"34","author":"M Bendechache","year":"2019","unstructured":"Bendechache M, Tari AK, Kechadi MT (2019) Parallel and distributed clustering framework for big spatial data mining. Int J Parallel Emergent Distrib Syst 34(6):671\u2013689","journal-title":"Int J Parallel Emergent Distrib Syst"},{"key":"11444_CR6","doi-asserted-by":"crossref","unstructured":"Cai J, Fan J, Guo W, et\u00a0al (2022) Efficient deep embedded subspace clustering. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 1\u201310","DOI":"10.1109\/CVPR52688.2022.00012"},{"issue":"3","key":"11444_CR7","doi-asserted-by":"publisher","first-page":"725","DOI":"10.1109\/TNNLS.2019.2909425","volume":"31","author":"X Chen","year":"2019","unstructured":"Chen X, Chen R, Wu Q et al (2019) Labin: Balanced min cut for large-scale data. IEEE Trans Neural Netw Learn Syst 31(3):725\u2013736","journal-title":"IEEE Trans Neural Netw Learn Syst"},{"key":"11444_CR8","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2022.109238","volume":"136","author":"S Ding","year":"2023","unstructured":"Ding S, Li C, Xu X et al (2023) A sampling-based density peaks clustering algorithm for large-scale data. Pattern Recogn 136:109238","journal-title":"Pattern Recogn"},{"key":"11444_CR9","doi-asserted-by":"publisher","unstructured":"Du M, Zhao J, Sun J, et\u00a0al (2022) M3w: multistep three-way clustering. IEEE Trans Neural Netw Learn Syst pp 1\u201314. https:\/\/doi.org\/10.1109\/TNNLS.2022.3208418","DOI":"10.1109\/TNNLS.2022.3208418"},{"key":"11444_CR10","unstructured":"Ester M, Kriegel HP, Sander J, et\u00a0al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp 226\u2013231"},{"key":"11444_CR11","doi-asserted-by":"crossref","unstructured":"Fan J (2021) Large-scale subspace clustering via k-factorization. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pp 342\u2013352","DOI":"10.1145\/3447548.3467267"},{"issue":"5814","key":"11444_CR12","doi-asserted-by":"publisher","first-page":"972","DOI":"10.1126\/science.1136800","volume":"315","author":"BJ Frey","year":"2007","unstructured":"Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972\u2013976","journal-title":"Science"},{"key":"11444_CR13","doi-asserted-by":"crossref","unstructured":"Guo Y, Zhang L, Hu Y, et\u00a0al (2016) Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In: European conference on computer vision, Springer, pp 87\u2013102","DOI":"10.1007\/978-3-319-46487-9_6"},{"issue":"11","key":"11444_CR14","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0277012","volume":"17","author":"G Jin","year":"2022","unstructured":"Jin G, Gao J, Tan L (2022) Robust large-scale clustering based on correntropy. PLoS ONE 17(11):e0277012","journal-title":"PLoS ONE"},{"issue":"3","key":"11444_CR15","doi-asserted-by":"publisher","first-page":"535","DOI":"10.1109\/TBDATA.2019.2921572","volume":"7","author":"J Johnson","year":"2019","unstructured":"Johnson J, Douze M, J\u00e9gou H (2019) Billion-scale similarity search with gpus. IEEE Trans Big Data 7(3):535\u2013547","journal-title":"IEEE Trans Big Data"},{"key":"11444_CR16","doi-asserted-by":"crossref","unstructured":"Kim M, Jain AK, Liu X (2022) Adaface: Quality adaptive margin for face recognition. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 18750\u201318759","DOI":"10.1109\/CVPR52688.2022.01819"},{"key":"11444_CR17","doi-asserted-by":"publisher","first-page":"664","DOI":"10.1016\/j.neucom.2022.06.006","volume":"501","author":"H Li","year":"2022","unstructured":"Li H, Ye X, Imakura A et al (2022) Divide-and-conquer based large-scale spectral clustering. Neurocomputing 501:664\u2013678","journal-title":"Neurocomputing"},{"issue":"1","key":"11444_CR18","doi-asserted-by":"publisher","first-page":"59","DOI":"10.3233\/IDA-216240","volume":"27","author":"H Li","year":"2023","unstructured":"Li H, Ye X, Imakura A et al (2023) Lsec: Large-scale spectral ensemble clustering. Intell Data Anal 27(1):59\u201377","journal-title":"Intell Data Anal"},{"key":"11444_CR19","doi-asserted-by":"publisher","first-page":"3231","DOI":"10.1007\/s00500-015-1698-1","volume":"20","author":"Y Li","year":"2016","unstructured":"Li Y, Yang G, He H et al (2016) A study of large-scale data clustering based on fuzzy clustering. Soft Comput 20:3231\u20133242","journal-title":"Soft Comput"},{"key":"11444_CR20","unstructured":"Liu Y, Liang K, Xia J, et\u00a0al (2023) Dink-net: Neural clustering on large graphs. arXiv preprint arXiv:2305.18405"},{"issue":"2","key":"11444_CR21","doi-asserted-by":"publisher","first-page":"129","DOI":"10.1109\/TIT.1982.1056489","volume":"28","author":"S Lloyd","year":"1982","unstructured":"Lloyd S (1982) Least squares quantization in pcm. IEEE Trans Inf Theory 28(2):129\u2013137","journal-title":"IEEE Trans Inf Theory"},{"key":"11444_CR22","doi-asserted-by":"publisher","first-page":"20926","DOI":"10.1109\/ACCESS.2022.3149548","volume":"10","author":"MM Madbouly","year":"2022","unstructured":"Madbouly MM, Darwish SM, Bagi NA et al (2022) Clustering big data based on distributed fuzzy k-medoids: an application to geospatial informatics. IEEE Access 10:20926\u201320936","journal-title":"IEEE Access"},{"issue":"4","key":"11444_CR23","doi-asserted-by":"publisher","first-page":"824","DOI":"10.1109\/TPAMI.2018.2889473","volume":"42","author":"YA Malkov","year":"2020","unstructured":"Malkov YA, Yashunin DA (2020) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans Pattern Anal Mach Intell 42(4):824\u2013836","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"1","key":"11444_CR24","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1140\/epjst\/e2010-01179-1","volume":"178","author":"M Rosvall","year":"2009","unstructured":"Rosvall M, Axelsson D, Bergstrom CT (2009) The map equation. The Eur Phys J Spec Top 178(1):13\u201323. https:\/\/doi.org\/10.1140\/epjst\/e2010-01179-1","journal-title":"The Eur Phys J Spec Top"},{"issue":"1","key":"11444_CR25","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1140\/epjst\/e2010-01179-1","volume":"178","author":"M Rosvall","year":"2009","unstructured":"Rosvall M, Axelsson D, Bergstrom CT (2009) The map equation. Eur Phys J Spec Top 178(1):13\u201323","journal-title":"Eur Phys J Spec Top"},{"key":"11444_CR26","doi-asserted-by":"crossref","unstructured":"Shen S, Li W, Zhu Z, et\u00a0al (2021) Structure-aware face clustering on a large-scale graph with 107 nodes. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 9085\u20139094","DOI":"10.1109\/CVPR46437.2021.00897"},{"key":"11444_CR27","doi-asserted-by":"crossref","unstructured":"Shen X, Liu W, Tsang I, et\u00a0al (2017) Compressed k-means for large-scale clustering. In: Proceedings of the AAAI conference on artificial intelligence","DOI":"10.1609\/aaai.v31i1.10852"},{"issue":"8","key":"11444_CR28","doi-asserted-by":"publisher","first-page":"888","DOI":"10.1109\/34.868688","volume":"22","author":"J Shi","year":"2000","unstructured":"Shi J, Malik J (2000) Normalized cuts and image segmentation. PAMI 22(8):888\u2013905","journal-title":"PAMI"},{"issue":"1","key":"11444_CR29","doi-asserted-by":"publisher","first-page":"30","DOI":"10.1093\/comjnl\/16.1.30","volume":"16","author":"R Sibson","year":"1973","unstructured":"Sibson R (1973) Slink: An optimally efficient algorithm for the single-link cluster method. Comput J 16(1):30\u201334","journal-title":"Comput J"},{"key":"11444_CR30","doi-asserted-by":"crossref","unstructured":"Vakhnin A, Sopov E (2020) Large-scale clustering using decomposition-based evolutionary algorithms. In: 2020 IEEE symposium series on computational intelligence (SSCI), IEEE, pp 345\u2013352","DOI":"10.1109\/SSCI47803.2020.9308257"},{"key":"11444_CR31","unstructured":"Wang Y, Zhang Y, Zhang F, et\u00a0al (2022) Ada-nets: Face clustering via adaptive neighbour discovery in the structure space. In: International conference on learning representations (ICLR)"},{"key":"11444_CR32","doi-asserted-by":"crossref","unstructured":"Wang Z, Zheng L, Li Y, et\u00a0al (2019) Linkage based face clustering via graph convolution network. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 1117\u20131125","DOI":"10.1109\/CVPR.2019.00121"},{"key":"11444_CR33","doi-asserted-by":"crossref","unstructured":"Yang L, Zhan X, Chen D, et\u00a0al (2019) Learning to cluster faces on an affinity graph. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 2298\u20132306","DOI":"10.1109\/CVPR.2019.00240"},{"key":"11444_CR34","doi-asserted-by":"crossref","unstructured":"Yang L, Chen D, Zhan X, et\u00a0al (2020) Learning to cluster faces via confidence and connectivity estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)","DOI":"10.1109\/CVPR42600.2020.01338"},{"key":"11444_CR35","unstructured":"Yu X, Yang Y, Wang A, et\u00a0al (2022) Facemap: Towards unsupervised face clustering via map equation. arXiv preprint arXiv:2203.10090"},{"key":"11444_CR36","doi-asserted-by":"publisher","first-page":"416","DOI":"10.1016\/j.knosys.2018.09.007","volume":"163","author":"X Zhao","year":"2019","unstructured":"Zhao X, Liang J, Dang C (2019) A stratified sampling based clustering algorithm for large-scale data. Knowl-Based Syst 163:416\u2013428","journal-title":"Knowl-Based Syst"}],"container-title":["Neural Processing Letters"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11444-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11063-024-11444-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11444-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,29]],"date-time":"2024-02-29T20:16:45Z","timestamp":1709237805000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11063-024-11444-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,15]]},"references-count":36,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,2]]}},"alternative-id":["11444"],"URL":"https:\/\/doi.org\/10.1007\/s11063-024-11444-z","relation":{},"ISSN":["1573-773X"],"issn-type":[{"value":"1573-773X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,15]]},"assertion":[{"value":"18 October 2023","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 February 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"34"}}