{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,22]],"date-time":"2025-12-22T10:48:06Z","timestamp":1766400486180,"version":"3.37.3"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,4,1]],"date-time":"2023-04-01T00:00:00Z","timestamp":1680307200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,4,1]],"date-time":"2023-04-01T00:00:00Z","timestamp":1680307200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61972261"],"award-info":[{"award-number":["61972261"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Clustering a big dataset without knowing the number of clusters presents a big challenge to many existing clustering algorithms. In this paper, we propose a Random Sample Partition-based Centers Ensemble (RSPCE) algorithm to identify the number of clusters in a big dataset. In this algorithm, a set of disjoint random samples is selected from the big dataset, and the I-niceDP algorithm is used to identify the number of clusters and initial centers in each sample. Subsequently, a cluster ball model is proposed to merge two clusters in the random samples that are likely sampled from the same cluster in the big dataset. Finally, based on the ball model, the RSPCE ensemble method is used to ensemble the results of all samples into the final result as a set of initial cluster centers in the big dataset. 
Intensive experiments were conducted on both synthetic and real datasets to validate the feasibility and effectiveness of the proposed RSPCE algorithm. The experimental results show that the ensemble result from multiple random samples is a reliable approximation of the actual number of clusters, and the RSPCE algorithm is scalable to big data.<\/jats:p>","DOI":"10.1186\/s40537-023-00709-4","type":"journal-article","created":{"date-parts":[[2023,4,3]],"date-time":"2023-04-03T05:50:27Z","timestamp":1680501027000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["An ensemble method for estimating the number of clusters in a big data set using multiple random samples"],"prefix":"10.1186","volume":"10","author":[{"given":"Mohammad Sultan","family":"Mahmud","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Joshua Zhexue","family":"Huang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rukhsana","family":"Ruby","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kaishun","family":"Wu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2023,4,1]]},"reference":[{"key":"709_CR1","doi-asserted-by":"publisher","DOI":"10.1007\/BF02289263","author":"RL Thorndike","year":"1953","unstructured":"Thorndike RL. Who belongs in the family. Psychometrika. 1953. https:\/\/doi.org\/10.1007\/BF02289263.","journal-title":"Psychometrika"},{"key":"709_CR2","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1016\/0377-0427(87)90125-7","volume":"20","author":"PJ Rousseeuw","year":"1987","unstructured":"Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53\u201365. 
https:\/\/doi.org\/10.1016\/0377-0427(87)90125-7.","journal-title":"J Comput Appl Math"},{"issue":"2","key":"709_CR3","doi-asserted-by":"publisher","first-page":"411","DOI":"10.1111\/1467-9868.00293","volume":"63","author":"R Tibshirani","year":"2001","unstructured":"Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Series B. 2001;63(2):411\u201323. https:\/\/doi.org\/10.1111\/1467-9868.00293.","journal-title":"J R Stat Soc Series B"},{"key":"709_CR4","doi-asserted-by":"publisher","first-page":"129","DOI":"10.1016\/j.ins.2018.07.034","volume":"466","author":"MA Masud","year":"2018","unstructured":"Masud MA, Huang JZ, Wei C, Wang J, Khan I, Zhong M. I-nice: a new approach for identifying the number of clusters and initial cluster centres. Inf Sci. 2018;466:129\u201351. https:\/\/doi.org\/10.1016\/j.ins.2018.07.034.","journal-title":"Inf Sci"},{"issue":"1","key":"709_CR5","doi-asserted-by":"publisher","first-page":"104","DOI":"10.1145\/2688072","volume":"58","author":"R Nair","year":"2014","unstructured":"Nair R. Big data needs approximate computing: technical perspective. Commun ACM. 2014;58(1):104\u2013104. https:\/\/doi.org\/10.1145\/2688072.","journal-title":"Commun ACM"},{"issue":"2","key":"709_CR6","doi-asserted-by":"publisher","first-page":"685","DOI":"10.1214\/18-AOAS1161SF","volume":"12","author":"X-L Meng","year":"2018","unstructured":"Meng X-L. Statistical paradises and paradoxes in big data (i): law of large populations, big data paradox, and the 2016 US presidential election. Ann Appl Stat. 2018;12(2):685\u2013726. https:\/\/doi.org\/10.1214\/18-AOAS1161SF.","journal-title":"Ann Appl Stat"},{"key":"709_CR7","unstructured":"Rojas, J.A.R., Beth Kery, M., Rosenthal, S., Dey, A.: Sampling techniques to improve big data exploration. In: 2017 IEEE 7th Symp. Large Data Analy Vis. 2017. 
10.1109\/LDAV.2017.8231848"},{"issue":"11","key":"709_CR8","doi-asserted-by":"publisher","first-page":"5846","DOI":"10.1109\/TII.2019.2912723","volume":"15","author":"S Salloum","year":"2019","unstructured":"Salloum S, Huang JZ, He Y. Random sample partition: a distributed data model for big data analysis. IEEE Trans Ind Informat. 2019;15(11):5846\u201354. https:\/\/doi.org\/10.1109\/TII.2019.2912723.","journal-title":"IEEE Trans Ind Informat"},{"issue":"2","key":"709_CR9","doi-asserted-by":"publisher","first-page":"85","DOI":"10.26599\/BDMA.2019.9020015","volume":"3","author":"MS Mahmud","year":"2020","unstructured":"Mahmud MS, Huang JZ, Salloum S, Emara TZ, Sadatdiynov K. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining Anal. 2020;3(2):85\u2013101.","journal-title":"Big Data Mining Anal"},{"key":"709_CR10","doi-asserted-by":"publisher","first-page":"177","DOI":"10.1016\/j.ins.2020.09.068","volume":"548","author":"Y He","year":"2021","unstructured":"He Y, Wu Y, Qin H, Huang JZ, Jin Y. Improved i-nice clustering algorithm based on density peaks mechanism. Inf Sci. 2021;548:177\u201390. https:\/\/doi.org\/10.1016\/j.ins.2020.09.068.","journal-title":"Inf Sci"},{"key":"709_CR11","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1016\/j.ins.2020.11.050","volume":"554","author":"X Xu","year":"2021","unstructured":"Xu X, Ding S, Wang Y, Wang L, Jia W. A fast density peaks clustering algorithm with sparse search. Inform Sci. 2021;554:61\u201383. https:\/\/doi.org\/10.1016\/j.ins.2020.11.050.","journal-title":"Inform Sci"},{"issue":"6191","key":"709_CR12","doi-asserted-by":"publisher","first-page":"1492","DOI":"10.1126\/science.1242072","volume":"344","author":"A Rodriguez","year":"2014","unstructured":"Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science. 2014;344(6191):1492\u20136. 
https:\/\/doi.org\/10.1126\/science.1242072.","journal-title":"Science"},{"key":"709_CR13","doi-asserted-by":"publisher","DOI":"10.1145\/3068335","author":"E Schubert","year":"2017","unstructured":"Schubert E, Sander J, Ester M, Kriegel HP, Xu X. Dbscan revisited, revisited: why and how you should (still) use dbscan. ACM Trans Database Syst. 2017. https:\/\/doi.org\/10.1145\/3068335.","journal-title":"ACM Trans Database Syst"},{"key":"709_CR14","doi-asserted-by":"publisher","first-page":"132","DOI":"10.1007\/s41019-019-0091-y","volume":"4","author":"C Patil","year":"2019","unstructured":"Patil C, Baidari I. Estimating the optimal number of clusters k in a dataset using data depth. Data Sci Eng. 2019;4:132\u201340.","journal-title":"Data Sci Eng"},{"key":"709_CR15","doi-asserted-by":"publisher","first-page":"416","DOI":"10.1016\/j.knosys.2018.09.007","volume":"163","author":"X Zhao","year":"2019","unstructured":"Zhao X, Liang J, Dang C. A stratified sampling based clustering algorithm for large-scale data. Know Based Syst. 2019;163:416\u201328. https:\/\/doi.org\/10.1016\/j.knosys.2018.09.007.","journal-title":"Know Based Syst"},{"issue":"10","key":"709_CR16","doi-asserted-by":"publisher","first-page":"1456","DOI":"10.1016\/j.patrec.2011.04.008","volume":"32","author":"J Jia","year":"2011","unstructured":"Jia J, Xiao X, Liu B, Jiao L. Bagging-based spectral clustering ensemble selection. Pattern Recognit Lett. 2011;32(10):1456\u201367. https:\/\/doi.org\/10.1016\/j.patrec.2011.04.008.","journal-title":"Pattern Recognit Lett"},{"issue":"6","key":"709_CR17","doi-asserted-by":"publisher","first-page":"1557","DOI":"10.1109\/TFUZZ.2014.2298244","volume":"22","author":"Y Wang","year":"2014","unstructured":"Wang Y, Chen L, Mei J. Incremental fuzzy clustering with multiple medoids for large data. IEEE Trans Fuzzy Syst. 2014;22(6):1557\u201368. 
https:\/\/doi.org\/10.1109\/TFUZZ.2014.2298244.","journal-title":"IEEE Trans Fuzzy Syst"},{"key":"709_CR18","doi-asserted-by":"publisher","first-page":"144","DOI":"10.1016\/j.knosys.2017.06.020","volume":"132","author":"J Hu","year":"2017","unstructured":"Hu J, Li T, Luo C, Fujita H, Yang Y. Incremental fuzzy cluster ensemble learning based on rough set theory. Know Based Syst. 2017;132:144\u201355. https:\/\/doi.org\/10.1016\/j.knosys.2017.06.020.","journal-title":"Know Based Syst"},{"issue":"4","key":"709_CR19","doi-asserted-by":"publisher","first-page":"866","DOI":"10.1016\/j.patcog.2010.10.018","volume":"44","author":"AM Bagirov","year":"2011","unstructured":"Bagirov AM, Ugon J, Webb D. Fast modified global k-means algorithm for incremental cluster construction. Pattern Recognit. 2011;44(4):866\u201376. https:\/\/doi.org\/10.1016\/j.patcog.2010.10.018.","journal-title":"Pattern Recognit"},{"key":"709_CR20","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2010.09.008","author":"S Mimaroglu","year":"2011","unstructured":"Mimaroglu S, Erdil E. Combining multiple clusterings using similarity graph. Pattern Recogn. 2011. https:\/\/doi.org\/10.1016\/j.patcog.2010.09.008.","journal-title":"Pattern Recogn"},{"issue":"C","key":"709_CR21","doi-asserted-by":"publisher","first-page":"131","DOI":"10.1016\/j.patcog.2015.08.015","volume":"50","author":"D Huang","year":"2016","unstructured":"Huang D, Lai J, Wang CD. Ensemble clustering using factor graph. Pattern Recognit. 2016;50(C):131\u201342. https:\/\/doi.org\/10.1016\/j.patcog.2015.08.015.","journal-title":"Pattern Recognit"},{"issue":"5","key":"709_CR22","doi-asserted-by":"publisher","first-page":"1943","DOI":"10.1016\/j.patcog.2009.11.012","volume":"43","author":"HG Ayad","year":"2010","unstructured":"Ayad HG, Kamel MS. On voting-based consensus of cluster ensembles. Pattern Recognit. 2010;43(5):1943\u201353. 
https:\/\/doi.org\/10.1016\/j.patcog.2009.11.012.","journal-title":"Pattern Recognit"},{"issue":"12","key":"709_CR23","doi-asserted-by":"publisher","first-page":"2396","DOI":"10.1109\/TPAMI.2011.84","volume":"33","author":"N Iam-On","year":"2011","unstructured":"Iam-On N, Boongoen T, Garrett S, Price C. A link-based approach to the cluster ensemble problem. IEEE Trans Pattern Anal Mach Intell. 2011;33(12):2396\u2013409. https:\/\/doi.org\/10.1109\/TPAMI.2011.84.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"6","key":"709_CR24","doi-asserted-by":"publisher","first-page":"1537","DOI":"10.1109\/TPAMI.2019.2913863","volume":"42","author":"J Yang","year":"2020","unstructured":"Yang J, Liang J, Wang K, Rosin PL, Yang M. Subspace clustering via good neighbors. IEEE Trans Pattern Anal. 2020;42(6):1537\u201344. https:\/\/doi.org\/10.1109\/TPAMI.2019.2913863.","journal-title":"IEEE Trans Pattern Anal"},{"issue":"3","key":"709_CR25","doi-asserted-by":"publisher","first-page":"468","DOI":"10.1016\/j.csda.2011.09.003","volume":"56","author":"Y Fang","year":"2012","unstructured":"Fang Y, Wang J. Selection of the number of clusters via the bootstrap method. Comput Stat Data Anal. 2012;56(3):468\u201377. https:\/\/doi.org\/10.1016\/j.csda.2011.09.003.","journal-title":"Comput Stat Data Anal"},{"key":"709_CR26","doi-asserted-by":"publisher","first-page":"38","DOI":"10.1016\/j.bdr.2018.05.003","volume":"13","author":"H Estiri","year":"2018","unstructured":"Estiri H, Abounia Omran B, Murphy SN. kluster: an efficient scalable procedure for approximating the number of clusters in unsupervised learning. Big Data Res. 2018;13:38\u201351. https:\/\/doi.org\/10.1016\/j.bdr.2018.05.003.","journal-title":"Big Data Res"},{"key":"709_CR27","unstructured":"Pelleg, D., Moore, A.W.: X-means: Extending k-means with efficient estimation of the number of clusters. In: Proc. 17th Int. Conf. Mach. Learn. ICML \u201900, pp. 727\u2013734. 
Morgan Kaufmann Publishers Inc., CA, USA 2000."},{"key":"709_CR28","doi-asserted-by":"crossref","unstructured":"Bachem, O., Lucic, M., Krause, A.: Scalable k-means clustering via lightweight coresets. In: Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (KDD\u201918), NY, USA, pp. 1119\u20131127 (2018). 10.1145\/3219819.3219973.","DOI":"10.1145\/3219819.3219973"},{"issue":"1","key":"709_CR29","doi-asserted-by":"publisher","first-page":"155","DOI":"10.1109\/TKDE.2014.2316512","volume":"27","author":"J Wu","year":"2015","unstructured":"Wu J, Liu H, Xiong H, Cao J, Chen J. K-means-based consensus clustering: a unified view. IEEE Trans Knowl Data Eng. 2015;27(1):155\u201369. https:\/\/doi.org\/10.1109\/TKDE.2014.2316512.","journal-title":"IEEE Trans Knowl Data Eng"},{"issue":"3","key":"709_CR30","doi-asserted-by":"publisher","first-page":"413","DOI":"10.1109\/TKDE.2010.268","volume":"24","author":"N Iam-On","year":"2012","unstructured":"Iam-On N, Boongeon T, Garrett S, Price C. A link-based cluster ensemble approach for categorical data clustering. IEEE Trans Knowl Data Eng. 2012;24(3):413\u201325.","journal-title":"IEEE Trans Knowl Data Eng"},{"issue":"2","key":"709_CR31","doi-asserted-by":"publisher","first-page":"661","DOI":"10.1007\/s10115-016-0988-y","volume":"51","author":"Y Ren","year":"2017","unstructured":"Ren Y, Domeniconi C, Zhang G, Yu G. Weighted-object ensemble clustering: Methods and analysis. Knowl Inf Syst. 2017;51(2):661\u201389. https:\/\/doi.org\/10.1007\/s10115-016-0988-y.","journal-title":"Knowl Inf Syst"},{"issue":"4","key":"709_CR32","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v025.i04","volume":"25","author":"G Brock","year":"2008","unstructured":"Brock G, Pihur V, Datta S, Datta S. clvalid: an r package for cluster validation. J Stat Softw. 2008;25(4):1\u201322. 
https:\/\/doi.org\/10.18637\/jss.v025.i04.","journal-title":"J Stat Softw"},{"issue":"2","key":"709_CR33","doi-asserted-by":"publisher","first-page":"224","DOI":"10.1109\/TPAMI.1979.4766909","volume":"1","author":"DL Davies","year":"1979","unstructured":"Davies DL, Bouldin DW. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell. 1979;1(2):224\u20137. https:\/\/doi.org\/10.1109\/TPAMI.1979.4766909.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"709_CR34","unstructured":"Rosenberg, A., Hirschberg, J.: V-measure: A conditional entropy-based external cluster evaluation measure. In: Proc. 2007 Joint Conf. Empir. Methods Nat. Lang. Process. Comput. Nat. Lang. Learn. (EMNLP-CoNLL), pp. 410\u2013420. Association for Computational Linguistics, Prague, Czech Republic 2007. 10.1109\/10.7916\/D80V8N84."},{"issue":"1","key":"709_CR35","doi-asserted-by":"publisher","first-page":"193","DOI":"10.1007\/BF01908075","volume":"2","author":"H Lawrence","year":"1985","unstructured":"Lawrence H, Phipps A. Comparing partitions. J Classif. 1985;2(1):193\u2013218. https:\/\/doi.org\/10.1007\/BF01908075.","journal-title":"J Classif"},{"key":"709_CR36","doi-asserted-by":"publisher","first-page":"2837","DOI":"10.5555\/1756006.1953024","volume":"11","author":"NX Vinh","year":"2010","unstructured":"Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J Mach Learn Res. 2010;11:2837\u201354. 
https:\/\/doi.org\/10.5555\/1756006.1953024.","journal-title":"J Mach Learn Res"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-023-00709-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-023-00709-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-023-00709-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,4,3]],"date-time":"2023-04-03T05:59:06Z","timestamp":1680501546000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-023-00709-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,1]]},"references-count":36,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["709"],"URL":"https:\/\/doi.org\/10.1186\/s40537-023-00709-4","relation":{},"ISSN":["2196-1115"],"issn-type":[{"type":"electronic","value":"2196-1115"}],"subject":[],"published":{"date-parts":[[2023,4,1]]},"assertion":[{"value":"15 July 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 February 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 April 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declaration"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not 
applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"40"}}