{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T21:04:42Z","timestamp":1761253482396,"version":"build-2065373602"},"reference-count":44,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T00:00:00Z","timestamp":1761177600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T00:00:00Z","timestamp":1761177600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>As the amount of unlabeled data has continued to grow and present challenges to machine learning practitioners, the need for unsupervised solutions is more evident than ever. With many unsupervised algorithms available to classify instances, the challenge remains that these algorithms require fine-tuning and\/or appropriate parameter selection to produce reliable results. The difficulty remains that given an unlabeled dataset, the true class distribution is unknown, which impacts appropriateness of the selection of unsupervised algorithms and hyperparameter tuning, as well as the evaluation metrics chosen. Our novel approach addresses this critical gap in current literature. Through a fully automated and unsupervised framework, we take a binary unlabeled dataset, and return the class distribution without prior domain knowledge and regardless of the class distribution - imbalanced or balanced. 
We thoroughly investigate multiple datasets ranging in size, class distribution, and domain, and our empirical evidence demonstrates the successful determination of the class distribution given this variety of factors. Our approach uses data-driven threshold and parameter settings to improve model performance, particularly in imbalanced class scenarios. This helps in selecting suitable algorithms, guiding appropriate evaluation metrics, and promoting fairer, evidence-based decision making in fields such as fraud detection and healthcare.<\/jats:p>","DOI":"10.1186\/s40537-025-01231-5","type":"journal-article","created":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T08:55:16Z","timestamp":1761209716000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["A novel approach to automating unsupervised estimation of class distribution"],"prefix":"10.1186","volume":"12","author":[{"given":"Mary Anne","family":"Walauskis","sequence":"first","affiliation":[]},{"given":"Taghi M.","family":"Khoshgoftaar","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,10,23]]},"reference":[{"key":"1231_CR1","doi-asserted-by":"publisher","first-page":"249","DOI":"10.1016\/j.neunet.2018.07.011","volume":"106","author":"M Buda","year":"2018","unstructured":"Buda M, Maki A, Mazurowski MA. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018;106:249\u201359.","journal-title":"Neural Netw"},{"key":"1231_CR2","doi-asserted-by":"publisher","unstructured":"Kong J, Kowalczyk W, Nguyen DA, B\u00e4ck T, Menzel S: Hyperparameter optimisation for improving classification under class imbalance. In: 2019 IEEE Symposium Series on Computational Intelligence (SSCI), 2019; 3072\u20133078. 
https:\/\/doi.org\/10.1109\/SSCI44817.2019.9002679.","DOI":"10.1109\/SSCI44817.2019.9002679"},{"issue":"9","key":"1231_CR3","doi-asserted-by":"publisher","first-page":"1263","DOI":"10.1109\/TKDE.2008.239","volume":"21","author":"H He","year":"2009","unstructured":"He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263\u201384.","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"1231_CR4","doi-asserted-by":"publisher","first-page":"25256","DOI":"10.1109\/ACCESS.2020.2970293","volume":"8","author":"S Choi","year":"2020","unstructured":"Choi S, Khan MKH, Chen C-L. Performance evaluation metrics for classification algorithms: a review. IEEE Access. 2020;8:25256\u201380. https:\/\/doi.org\/10.1109\/ACCESS.2020.2970293.","journal-title":"IEEE Access"},{"issue":"6","key":"1231_CR5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12864-019-6413-7","volume":"21","author":"D Chicco","year":"2020","unstructured":"Chicco D, Jurman G. The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC Genom. 2020;21(6):1\u201313. https:\/\/doi.org\/10.1186\/s12864-019-6413-7.","journal-title":"BMC Genom"},{"issue":"1","key":"1231_CR6","doi-asserted-by":"publisher","first-page":"232","DOI":"10.32473\/flairs.38.1.139140","volume":"38","author":"MA Walauskis","year":"2025","unstructured":"Walauskis MA, Khoshgoftaar TM. Choosing the right metrics: a study of performance measurement for binary classification in imbalanced and big data. Int FLAIRS Conf Proc. 2025;38(1):232\u20139. https:\/\/doi.org\/10.32473\/flairs.38.1.139140.","journal-title":"Int FLAIRS Conf Proc"},{"key":"1231_CR7","doi-asserted-by":"crossref","unstructured":"Liu FT, Ting KM, Zhou Z-H. Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, 2008; 413\u2013422. 
IEEE.","DOI":"10.1109\/ICDM.2008.17"},{"issue":"7","key":"1231_CR8","doi-asserted-by":"publisher","first-page":"1443","DOI":"10.1162\/089976601750264965","volume":"13","author":"B Sch\u00f6lkopf","year":"2001","unstructured":"Sch\u00f6lkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC. Estimating the support of a high-dimensional distribution. Neural Comput. 2001;13(7):1443\u201371.","journal-title":"Neural Comput"},{"key":"1231_CR9","unstructured":"Wagstaff K, Cardie C, Rogers S, Schr\u00f6dl S: Constrained k-means clustering with background knowledge. In: Proceedings of the 18th International Conference on Machine Learning, 2001;577\u2013584. Morgan Kaufmann Publishers Inc."},{"key":"1231_CR10","doi-asserted-by":"publisher","unstructured":"Gao C, Goswami M, Chen J, Dubrawski A: Classifying unstructured clinical notes via automatic weak supervision. In: Proceedings of the 7th Machine Learning for Healthcare Conference, 2022;673\u2013690. https:\/\/doi.org\/10.48550\/arXiv.2207.12345.","DOI":"10.48550\/arXiv.2207.12345"},{"key":"1231_CR11","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-021-06008-0","volume":"618","author":"T Yoshida","year":"2021","unstructured":"Yoshida T, Shin\u2019ya E, Washio T. Class-prior probability estimation using density ratio between positive and unlabeled data. Neurocomputing. 2021;618: 129074. https:\/\/doi.org\/10.1007\/s10994-021-06008-0.","journal-title":"Neurocomputing"},{"key":"1231_CR12","unstructured":"Zhang M, Zhang A, Xiao TZ, McDonagh S. Out-of-Distribution Detection with Class Ratio Estimation. 2022. arXiv preprint arXiv:2206.03955"},{"key":"1231_CR13","doi-asserted-by":"publisher","unstructured":"Walauskis MA, Khoshgoftaar TM. Confident labels: A novel approach to new class labeling and evaluation on highly imbalanced data. In: 2024 IEEE 36th International Conference on Tools with Artificial Intelligence (ICTAI), 2024;232\u2013239. 
https:\/\/doi.org\/10.1109\/ICTAI62512.2024.00042.","DOI":"10.1109\/ICTAI62512.2024.00042"},{"issue":"1","key":"1231_CR14","doi-asserted-by":"publisher","first-page":"63","DOI":"10.1186\/s40537-025-01120-x","volume":"12","author":"MA Walauskis","year":"2025","unstructured":"Walauskis MA, Khoshgoftaar TM. Unsupervised label generation for severely imbalanced fraud data. J Big Data. 2025;12(1):63. https:\/\/doi.org\/10.1186\/s40537-025-01120-x.","journal-title":"J Big Data"},{"key":"1231_CR15","unstructured":"scikit-learn developers. sklearn.preprocessing.normalize. https:\/\/scikit-learn.org\/1.5\/modules\/generated\/sklearn.preprocessing.normalize.html 2024."},{"issue":"11","key":"1231_CR16","doi-asserted-by":"publisher","first-page":"1942","DOI":"10.3390\/math10111942","volume":"10","author":"I Izonin","year":"2022","unstructured":"Izonin I, Tkachenko R, Shakhovska N, Ilchyshyn B, Singh KK. A two-step data normalization approach for improving classification accuracy in the medical diagnosis domain. Mathematics. 2022;10(11):1942. https:\/\/doi.org\/10.3390\/math10111942.","journal-title":"Mathematics"},{"issue":"11","key":"1231_CR17","doi-asserted-by":"publisher","first-page":"2913","DOI":"10.1162\/neco.2007.19.11.2913","volume":"19","author":"MA Montemurro","year":"2007","unstructured":"Montemurro MA, Senatore R, Panzeri S. Tight data-robust bounds to mutual information combining shuffling and model selection techniques. Neural Comput. 2007;19(11):2913\u201357. https:\/\/doi.org\/10.1162\/neco.2007.19.11.2913.","journal-title":"Neural Comput"},{"key":"1231_CR18","unstructured":"Team PD. pandas.DataFrame.sample \u2013 Pandas Documentation. https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.sample.html. Accessed: 2024-12-02 2024."},{"issue":"12","key":"1231_CR19","doi-asserted-by":"publisher","first-page":"3302","DOI":"10.1021\/ci500480b","volume":"54","author":"S Gan","year":"2014","unstructured":"Gan S, Cosgrove DA, Gardiner EJ, Gillet VJ. 
Investigation of the use of spectral clustering for the analysis of molecular data. J Chem Inf Model. 2014;54(12):3302\u201319. https:\/\/doi.org\/10.1021\/ci500480b.","journal-title":"J Chem Inf Model"},{"key":"1231_CR20","unstructured":"Yu Z, Jiang W, Alonso G. Efficient Tabular Data Preprocessing of ML Pipelines 2024. arXiv:2409.14912."},{"key":"1231_CR21","doi-asserted-by":"publisher","first-page":"309","DOI":"10.1016\/j.aap.2012.03.020","volume":"58","author":"M Dozza","year":"2013","unstructured":"Dozza M, B\u00e4rgman J, Lee JD. Chunking: A procedure to improve naturalistic data analysis. Accid Anal Prev. 2013;58:309\u201317. https:\/\/doi.org\/10.1016\/j.aap.2012.03.020.","journal-title":"Accid Anal Prev"},{"issue":"8","key":"1231_CR22","first-page":"1982","volume":"35","author":"W Chen","year":"2011","unstructured":"Chen W, Cai D. Large scale spectral clustering using landmark-based representation. IEEE Trans Pattern Anal Mach Intell. 2011;35(8):1982\u201395.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"1231_CR23","first-page":"15631","volume":"33","author":"T Wang","year":"2020","unstructured":"Wang T, Isola P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. Adv Neural Inf Process Syst. 2020;33:15631\u201341.","journal-title":"Adv Neural Inf Process Syst"},{"key":"1231_CR24","doi-asserted-by":"publisher","unstructured":"Khan I, Huang J, Tung N, Williams G. Ensemble clustering of high dimensional data with fastmap projection. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining 2014. https:\/\/doi.org\/10.1007\/978-3-319-13186-3_43.","DOI":"10.1007\/978-3-319-13186-3_43"},{"key":"1231_CR25","doi-asserted-by":"crossref","unstructured":"Dundar M, Kou Q, Zhang B, He Y, Rajwa B. Simplicity of kmeans versus deepness of deep learning: A case of unsupervised feature learning with limited data. 
In: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015; 883\u2013888. IEEE.","DOI":"10.1109\/ICMLA.2015.78"},{"issue":"19","key":"1231_CR26","doi-asserted-by":"publisher","first-page":"20149","DOI":"10.1007\/s11042-017-4566-4","volume":"76","author":"Q Zhan","year":"2017","unstructured":"Zhan Q, Mao Y. Improved spectral clustering based on nystr\u00f6m method. Multimed Tools Appl. 2017;76(19):20149\u201365. https:\/\/doi.org\/10.1007\/s11042-017-4566-4.","journal-title":"Multimed Tools Appl"},{"key":"1231_CR27","doi-asserted-by":"publisher","first-page":"158","DOI":"10.1016\/j.procs.2020.04.017","volume":"171","author":"E Patel","year":"2020","unstructured":"Patel E, Kushwaha DS. Clustering cloud workloads: K-means vs gaussian mixture model. Proced Comput Sci. 2020;171:158\u201367. https:\/\/doi.org\/10.1016\/j.procs.2020.04.017. (Third International Conference on Computing and Network Communications (CoCoNet &apos;19)).","journal-title":"Proced Comput Sci"},{"issue":"10","key":"1231_CR28","doi-asserted-by":"publisher","first-page":"2546","DOI":"10.38124\/ijisrt\/IJISRT24OCT1507","volume":"9","author":"VNS Gummadi","year":"2024","unstructured":"Gummadi VNS, Tubagus RA, Vasala R, Donavalli H. Comparative analysis of kmeans technique on non convex cluster. Int J Innovative Sci Res Tech (IJISRT). 2024;9(10):2546\u201352. https:\/\/doi.org\/10.38124\/ijisrt\/IJISRT24OCT1507.","journal-title":"Int J Innovative Sci Res Tech (IJISRT)"},{"key":"1231_CR29","doi-asserted-by":"publisher","first-page":"242","DOI":"10.1109\/ojsp.2020.3039330","volume":"1","author":"F Pourkamali-Anaraki","year":"2020","unstructured":"Pourkamali-Anaraki F. Scalable spectral clustering with nystr\u00f6m approximation: Practical and theoretical aspects. IEEE Open J Signal Proces. 2020;1:242\u201356. https:\/\/doi.org\/10.1109\/ojsp.2020.3039330.","journal-title":"IEEE Open J Signal Proces"},{"key":"1231_CR30","unstructured":"scikit-learn: Nystroem 2024. 
https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.kernel_approximation.Nystroem.html"},{"key":"1231_CR31","unstructured":"Blackard J. UCI Machine Learning Repository: Covertype Data Set. https:\/\/archive.ics.uci.edu\/dataset\/31\/covertype. Accessed 18 Feb 2025"},{"key":"1231_CR32","unstructured":"Yasser H. Titanic Dataset. https:\/\/www.kaggle.com\/datasets\/yasserh\/titanic-dataset. Kaggle dataset. Accessed 18 Feb 2025"},{"key":"1231_CR33","unstructured":"scikit-learn developers: DecisionTreeClassifier. https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html. Accessed: 2025-02-27 2023."},{"key":"1231_CR34","unstructured":"scikit-learn developers: PolynomialFeatures. 2023. https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.PolynomialFeatures.html. Accessed 27 Feb 2025"},{"key":"1231_CR35","unstructured":"scikit-learn developers: LinearRegression. https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.linear_model.LinearRegression.html. Accessed: 2025-02-27, 2023."},{"key":"1231_CR36","unstructured":"Centers for Medicare and Medicaid Services: Medicare Part D Prescribers \u2013 By Provider and Drug. 2024. https:\/\/data.cms.gov\/provider-summary-by-type-of-service\/medicare-part-d-prescribers\/medicare-part-d-prescribers-by-provider-and-drug. Accessed 26 Nov 2024"},{"key":"1231_CR37","unstructured":"U.S. Department of Health and Human Services, Office of Inspector General: LEIE Downloadable Databases. 2024. https:\/\/oig.hhs.gov\/exclusions\/exclusions_list.asp. Accessed 26 Nov 2024"},{"key":"1231_CR38","doi-asserted-by":"crossref","unstructured":"Bauder RA, Khoshgoftaar TM. A novel method for fraudulent medicare claims detection from expected payment deviations (application paper). 
In: Proceedings of the 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), 2016; 11\u201319.","DOI":"10.1109\/IRI.2016.11"},{"issue":"4","key":"1231_CR39","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1007\/s42979-023-01134-5","volume":"4","author":"JM Johnson","year":"2023","unstructured":"Johnson JM, Khoshgoftaar TM. Data-centric ai for healthcare fraud detection. SN Comput Sci. 2023;4(4):389. https:\/\/doi.org\/10.1007\/s42979-023-01134-5.","journal-title":"SN Comput Sci"},{"key":"1231_CR40","unstructured":"CDC: UCI Machine Learning Repository: CDC Diabetes Health Indicators. 2019. https:\/\/archive.ics.uci.edu\/dataset\/891\/cdc+diabetes+health+indicators. Accessed 8 Feb 2025"},{"key":"1231_CR41","unstructured":"Chaves RM. Financial Distress Prediction. 2021. https:\/\/www.kaggle.com\/datasets\/rubensmchaves\/ml-fdp-ds. Kaggle dataset. Accessed 18 Feb 2025"},{"key":"1231_CR42","unstructured":"OpenML. EEG Eye State Dataset. 2020. https:\/\/www.openml.org\/d\/1471. Accessed 18 Feb 2025"},{"key":"1231_CR43","unstructured":"Sakar C, Kastro Y. UCI Machine Learning Repository: Online Shoppers Purchasing Intention Dataset. 2019. https:\/\/archive.ics.uci.edu\/dataset\/468\/online+shoppers+purchasing+intention+dataset. Accessed 18 Feb 2025"},{"key":"1231_CR44","unstructured":"Lung Cancer Dataset. 2020. https:\/\/www.kaggle.com\/datasets\/figolm10\/lung-cancer-dataset. Kaggle dataset. 
Accessed 22 Feb 2025"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01231-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-025-01231-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01231-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T21:02:08Z","timestamp":1761253328000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-025-01231-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,23]]},"references-count":44,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1231"],"URL":"https:\/\/doi.org\/10.1186\/s40537-025-01231-5","relation":{},"ISSN":["2196-1115"],"issn-type":[{"type":"electronic","value":"2196-1115"}],"subject":[],"published":{"date-parts":[[2025,10,23]]},"assertion":[{"value":"10 April 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 June 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 October 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not 
applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no conflict of interest.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"237"}}