{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T03:02:09Z","timestamp":1769569329670,"version":"3.49.0"},"reference-count":54,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2020,12,27]],"date-time":"2020-12-27T00:00:00Z","timestamp":1609027200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100002341","name":"Academy of Finland","doi-asserted-by":"publisher","award":["315550"],"award-info":[{"award-number":["315550"]}],"id":[{"id":"10.13039\/501100002341","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002341","name":"Academy of Finland","doi-asserted-by":"publisher","award":["311877"],"award-info":[{"award-number":["311877"]}],"id":[{"id":"10.13039\/501100002341","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>Two new initialization methods for K-means clustering are proposed. Both proposals are based on applying a divide-and-conquer approach for the K-means\u2016 type of an initialization strategy. The second proposal also uses multiple lower-dimensional subspaces produced by the random projection method for the initialization. The proposed methods are scalable and can be run in parallel, which make them suitable for initializing large-scale problems. In the experiments, comparison of the proposed methods to the K-means++ and K-means\u2016 methods is conducted using an extensive set of reference and synthetic large-scale datasets. Concerning the latter, a novel high-dimensional clustering data generation algorithm is given. The experiments show that the proposed methods compare favorably to the state-of-the-art by improving clustering accuracy and the speed of convergence. 
We also observe that the currently most popular K-means++ initialization behaves like the random one in the very high-dimensional cases.<\/jats:p>","DOI":"10.3390\/a14010006","type":"journal-article","created":{"date-parts":[[2020,12,27]],"date-time":"2020-12-27T20:04:58Z","timestamp":1609099498000},"page":"6","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":34,"title":["Improving Scalable K-Means++"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8466-9232","authenticated-orcid":false,"given":"Joonas","family":"H\u00e4m\u00e4l\u00e4inen","sequence":"first","affiliation":[{"name":"Faculty of Information Technology, University of Jyv\u00e4skyl\u00e4, 40014 Jyv\u00e4skyl\u00e4, Finland"}]},{"given":"Tommi","family":"K\u00e4rkk\u00e4inen","sequence":"additional","affiliation":[{"name":"Faculty of Information Technology, University of Jyv\u00e4skyl\u00e4, 40014 Jyv\u00e4skyl\u00e4, Finland"}]},{"given":"Tuomo","family":"Rossi","sequence":"additional","affiliation":[{"name":"Faculty of Information Technology, University of Jyv\u00e4skyl\u00e4, 40014 Jyv\u00e4skyl\u00e4, Finland"}]}],"member":"1968","published-online":{"date-parts":[[2020,12,27]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1109\/TIT.1982.1056489","article-title":"Least squares quantization in PCM","volume":"28","author":"Lloyd","year":"1982","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_2","first-page":"200","article-title":"A comparative study of efficient initialization methods for the k-means clustering algorithm","volume":"40","author":"Kingravi","year":"2012","journal-title":"Expert Syst. 
Appl."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1016\/j.patcog.2019.04.014","article-title":"How much can k-means be improved by using better initialization and repeats?","volume":"93","author":"Sieranoja","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"H\u00e4m\u00e4l\u00e4inen, J., Jauhiainen, S., and K\u00e4rkk\u00e4inen, T. (2017). Comparison of Internal Clustering Validation Indices for Prototype-Based Clustering. Algorithms, 10.","DOI":"10.3390\/a10030105"},{"key":"ref_5","unstructured":"Arthur, D., and Vassilvitskii, S. (2007, January 7\u20139). k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, New Orleans, LA, USA."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"622","DOI":"10.14778\/2180912.2180915","article-title":"Scalable K-means++","volume":"5","author":"Bahmani","year":"2012","journal-title":"Proc. VLDB Endow."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"3135","DOI":"10.1109\/TPDS.2014.2306193","article-title":"Efficient k -Means++ Approximation with MapReduce","volume":"25","author":"Xu","year":"2014","journal-title":"IEEE Trans. Paral. Distrib. Syst."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Dhillon, I.S., and Modha, D.S. (2002). A data-clustering algorithm on distributed memory multiprocessors. Large-Scale Parallel Data Mining, Springer.","DOI":"10.1007\/3-540-46502-2_13"},{"key":"ref_9","unstructured":"Zhao, W., Ma, H., and He, Q. (2009, January 19\u201321). Parallel K-Means Clustering Based on MapReduce. Proceedings of the 1st International Conference on Cloud Computing, CloudCom \u201909, Munich, Germany."},{"key":"ref_10","unstructured":"Elkan, C. (2003, January 21\u201324). Using the triangle inequality to accelerate k-means. 
Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Hamerly, G. (May, January 29). Making k-means even faster. Proceedings of the 2010 SIAM International Conference on Data Mining, Columbus, OH, USA.","DOI":"10.1137\/1.9781611972801.12"},{"key":"ref_12","unstructured":"Drake, J., and Hamerly, G. (2012, January 7\u20138). Accelerated k-means with adaptive distance bounds. Proceedings of the 5th NIPS Workshop On Optimization for Machine Learning, Lake Tahoe, NV, USA."},{"key":"ref_13","first-page":"579","article-title":"Yinyang k-means: A drop-in replacement of the classic k-means with consistent speedup","volume":"37","author":"Ding","year":"2015","journal-title":"Int. Conf. Mach. Learn."},{"key":"ref_14","unstructured":"Bottesch, T., B\u00fchler, T., and K\u00e4chele, M. (2016). Speeding up k-means by approximating Euclidean distances via block vectors. Int. Conf. Mach. Learn., 2578\u20132586. Available online: http:\/\/proceedings.mlr.press\/v48\/bottesch16.pdf."},{"key":"ref_15","unstructured":"Bachem, O., Lucic, M., and Lattanzi, S. (2018). One-shot coresets: The case of k-clustering. International Conference On Artificial Intelligence And Statistics, PMLR."},{"key":"ref_16","unstructured":"Cap\u00f3, M., P\u00e9rez, A., and Lozano, J.A. (2018). An efficient K-means clustering algorithm for massive data. arXiv."},{"key":"ref_17","first-page":"298","article-title":"Random projections for k-means clustering","volume":"23","author":"Boutsidis","year":"2010","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_18","unstructured":"Fern, X., and Brodley, C. (2003, January 21\u201324). Random projection for high dimensional data clustering: A cluster ensemble approach. 
Proceedings of the 20th International Conference on Machine Learning, Washington, DC, USA."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1109\/TPAMI.2008.292","article-title":"Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA","volume":"32","author":"Alzate","year":"2010","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_20","first-page":"41","article-title":"A new method for dimensionality reduction using K-means clustering algorithm for high dimensional data set","volume":"13","author":"Napoleon","year":"2011","journal-title":"Int. J. Comput. Appl."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Liu, H., and Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers.","DOI":"10.1007\/978-1-4615-5689-3"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"671","DOI":"10.1016\/S0022-0000(03)00025-4","article-title":"Database-friendly random projections: Johnson-Lindenstrauss with binary coins","volume":"66","author":"Achlioptas","year":"2003","journal-title":"J. Comput. Syst. Sci."},{"key":"ref_23","unstructured":"Jolliffe, I.T. (2002). Principal Component Analysis, Springer."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Bingham, E., and Mannila, H. (2001, January 26\u201329). Random Projection in Dimensionality Reduction: Applications to Image and Text Data. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA.","DOI":"10.1145\/502512.502546"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Cohen, M.B., Elder, S., Musco, C., Musco, C., and Persu, M. (2015, January 15\u201317). Dimensionality reduction for k-means clustering and low rank approximation. 
Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, Portland, OR, USA.","DOI":"10.1145\/2746539.2746569"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1045","DOI":"10.1109\/TIT.2014.2375327","article-title":"Randomized dimensionality reduction for k-means clustering","volume":"61","author":"Boutsidis","year":"2014","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1749","DOI":"10.1016\/j.patrec.2012.06.007","article-title":"Iterative random projections for high-dimensional data clustering","volume":"33","author":"Cardoso","year":"2012","journal-title":"Pattern Recognit. Lett."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Chan, J.Y., and Leung, A.P. (2017, January 14\u201319). Efficient k-means++ with random projection. Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, Anchorage, AK, USA.","DOI":"10.1109\/IJCNN.2017.7965841"},{"key":"ref_29","unstructured":"Sarkar, S., and Ghosh, A.K. (2020, December 24). On perfect clustering of high dimension, low sample size data. IEEE Trans. Pattern Anal. Mach. Intell., Available online: https:\/\/www.isical.ac.in\/~statmath\/report\/74969-cluster.pdf."},{"key":"ref_30","unstructured":"H\u00e4m\u00e4l\u00e4inen, J., and K\u00e4rkk\u00e4inen, T. (2016, January 27\u201329). Initialization of Big Data Clustering using Distributionally Balanced Folding. Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning-ESANN 2016, Bruges, Belgium."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"1304","DOI":"10.1109\/TNNLS.2012.2199516","article-title":"Study on the Impact of Partition-Induced Dataset Shift on k-fold Cross-Validation","volume":"23","author":"Herrera","year":"2012","journal-title":"IEEE Trans. Neural Netw. Learn. 
Syst."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"596","DOI":"10.1007\/s00454-011-9340-1","article-title":"K-means requires exponentially many iterations even in the plane","volume":"45","author":"Vattani","year":"2011","journal-title":"Discr. Comput. Geom."},{"key":"ref_33","first-page":"281","article-title":"Some methods for classification and analysis of multivariate observations","volume":"Volume 1","author":"MacQueen","year":"1967","journal-title":"Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability"},{"key":"ref_34","first-page":"768","article-title":"Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications","volume":"21","author":"Forgy","year":"1965","journal-title":"Biometrics"},{"key":"ref_35","first-page":"91","article-title":"Refining Initial Points for K-Means Clustering","volume":"98","author":"Bradley","year":"1998","journal-title":"ICML"},{"key":"ref_36","first-page":"292","article-title":"Distributed and provably good seedings for k-means in constant rounds","volume":"70","author":"Bachem","year":"2017","journal-title":"Int. Conf. Mach. Learn."},{"key":"ref_37","unstructured":"H\u00e4m\u00e4l\u00e4inen, J., K\u00e4rkk\u00e4inen, T., and Rossi, T. (2018, January 25\u201327). Scalable robust clustering method for large and sparse data. Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning-ESANN 2018, Bruges, Belgium."},{"key":"ref_38","first-page":"1","article-title":"Extensions of Lipschitz mappings into a Hilbert space","volume":"26","author":"Johnson","year":"1984","journal-title":"Contemp. 
Math."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"89239","DOI":"10.1109\/ACCESS.2020.2993295","article-title":"Can the Number of Clusters Be Determined by External Indices?","volume":"8","author":"Rezaei","year":"2020","journal-title":"IEEE Access"},{"key":"ref_40","unstructured":"Bottou, L., and Bengio, Y. (1995). Convergence properties of the k-means algorithms. Adv. Neural Inf. Process. Syst., 585\u2013592. Available online: https:\/\/papers.nips.cc\/paper\/1994\/file\/a1140a3d0df1c81e24ae954d935e8926-Paper.pdf."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Broder, A., Garcia-Pueyo, L., Josifovski, V., Vassilvitskii, S., and Venkatesan, S. (2014, January 24\u201328). Scalable k-means by ranked retrieval. Proceedings of the 7th ACM International Conference on Web Search and Data Mining, New York, NY, USA.","DOI":"10.1145\/2556195.2556260"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1007\/s10766-008-0082-5","article-title":"MATLAB\u00ae: A Language for Parallel Computing","volume":"37","author":"Sharma","year":"2009","journal-title":"Int. J. Parallel Progr."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"569","DOI":"10.1214\/aoms\/1177728169","article-title":"Some continuous Monte Carlo methods for the Dirichlet problem","volume":"27","author":"Muller","year":"1956","journal-title":"Ann. Math. Stat."},{"key":"ref_44","first-page":"1","article-title":"K-means properties on six clustering benchmark datasets","volume":"48","author":"Sieranoja","year":"2018","journal-title":"Appl. Intell."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"De Vries, C.M., and Geva, S. (2009, January 19\u201323). K-tree: Large scale document clustering. 
Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA.","DOI":"10.1145\/1571941.1572094"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"341","DOI":"10.1007\/s10115-016-1004-2","article-title":"The (black) art of runtime evaluation: Are we comparing algorithms or implementations?","volume":"52","author":"Kriegel","year":"2017","journal-title":"Knowl. Inf. Syst."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"531","DOI":"10.1016\/j.patcog.2017.09.038","article-title":"Clustering-based k-nearest neighbor classification for large-scale data with neural codes representation","volume":"74","author":"Gallego","year":"2018","journal-title":"Pattern Recognit."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1080\/01621459.1952.10483441","article-title":"Use of ranks in one-criterion variance analysis","volume":"47","author":"Kruskal","year":"1952","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Saarela, M., H\u00e4m\u00e4l\u00e4inen, J., and K\u00e4rkk\u00e4inen, T. (2017, January 23\u201326). Feature Ranking of Large, Robust, and Weighted Clustering Result. Proceedings of the 21st Pacific Asia Conference on Knowledge Discovery and Data Mining-PAKDD 2017, Jeju, Korea.","DOI":"10.1007\/978-3-319-57454-7_8"},{"key":"ref_50","unstructured":"\u00c4yr\u00e4m\u00f6, S. (2006). Knowledge Mining Using Robust Clustering, University of Jyv\u00e4skyl\u00e4. Vol. 63 of Jyv\u00e4skyl\u00e4 Studies in Computing."},{"key":"ref_51","first-page":"483","article-title":"Validity of the single processor approach to achieving large scale computing capabilities","volume":"30","author":"Amdahl","year":"1967","journal-title":"AFIPS Conf. Proc."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Aggarwal, C.C., Hinneburg, A., and Keim, D.A. (2001). 
On the surprising behavior of distance metrics in high dimensional space. International Conference On Database Theory, Springer.","DOI":"10.1007\/3-540-44503-X_27"},{"key":"ref_53","unstructured":"Strehl, A. (2002). Relationship-Based Clustering and Cluster Ensembles for High-Dimensional Data Mining. [Ph.D. Thesis, The University of Texas at Austin]."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Li, P., Hastie, T.J., and Church, K.W. (2006, January 20\u201323). Very sparse random projections. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA.","DOI":"10.1145\/1150402.1150436"}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/14\/1\/6\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T10:46:37Z","timestamp":1760179597000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/14\/1\/6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,12,27]]},"references-count":54,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2021,1]]}},"alternative-id":["a14010006"],"URL":"https:\/\/doi.org\/10.3390\/a14010006","relation":{},"ISSN":["1999-4893"],"issn-type":[{"value":"1999-4893","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,12,27]]}}}