{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T23:17:53Z","timestamp":1761175073507,"version":"build-2065373602"},"reference-count":63,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T00:00:00Z","timestamp":1761091200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T00:00:00Z","timestamp":1761091200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>There is a growing need for labeled data, yet manual annotation is costly, error-prone, and often infeasible in privacy-sensitive, highly imbalanced domains such as fraud detection. We introduce a fully unsupervised framework that combines unsupervised SHapley Additive exPlanations (SHAP) feature selection with our novel unsupervised labeling method. We apply unsupervised SHAP to the Kaggle Credit Card Fraud Detection and Medicare Part D datasets to produce high-impact feature subsets, and then label the datasets with our unsupervised labeling approach. To effectively evaluate the labels generated by our novel methodology, we apply a baseline unsupervised learner, Isolation Forest (IF), to both the original datasets and their subsets. We calculate Matthews Correlation Coefficient (MCC), Jaccard Index (JI), Precision, Recall, and F1-score by comparing our generated labels against the ground truth labels. The ground truth labels were used solely for evaluation. Our empirical results surpass the results obtained with the full feature dataset and baseline. 
By improving label quality while reducing computational complexity and preserving privacy, our approach offers a practical solution for learning from unlabeled, severely imbalanced data.<\/jats:p>","DOI":"10.1186\/s40537-025-01248-w","type":"journal-article","created":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T07:46:30Z","timestamp":1761119190000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Scalable unsupervised labeling with SHAP feature selection for fraud detection in imbalanced data"],"prefix":"10.1186","volume":"12","author":[{"given":"Mary Anne","family":"Walauskis","sequence":"first","affiliation":[]},{"given":"Taghi M.","family":"Khoshgoftaar","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,10,22]]},"reference":[{"key":"1248_CR1","unstructured":"Gao C, Goswami M, Chen J, Dubrawski A. Classifying unstructured clinical notes via automatic weak supervision. In: Proceedings of the 7th Machine Learning for Healthcare Conference; 2022. p. 673\u2013690. 10.48550\/arXiv.2207.12345."},{"issue":"1","key":"1248_CR2","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1007\/s10515-024-00435-y","volume":"31","author":"J Shen","year":"2024","unstructured":"Shen J, Li Z, Lu Y, Pan M, Li X. Mitigating the impact of mislabeled data on deep predictive models: an empirical study of learning with noise approaches in software engineering tasks. Autom Softw Eng. 2024;31(1):33. https:\/\/doi.org\/10.1007\/s10515-024-00435-y.","journal-title":"Autom Softw Eng"},{"key":"1248_CR3","doi-asserted-by":"publisher","first-page":"202","DOI":"10.1007\/978-3-030-64148-1_13","volume-title":"Product-focused software process improvement","author":"T Fredriksson","year":"2020","unstructured":"Fredriksson T, Mattos DI, Bosch J, Olsson H. Data labeling: an empirical investigation into industrial challenges and mitigation strategies. 
In: Product-focused software process improvement. Cham: Springer; 2020. p. 202\u201316."},{"issue":"4","key":"1248_CR4","doi-asserted-by":"publisher","first-page":"221","DOI":"10.1007\/s13748-016-0094-0","volume":"5","author":"B Krawczyk","year":"2016","unstructured":"Krawczyk B. Learning from imbalanced data: open challenges and future directions. Progress Artif Intell. 2016;5(4):221\u201332. https:\/\/doi.org\/10.1007\/s13748-016-0094-0.","journal-title":"Progress Artif Intell"},{"key":"1248_CR5","unstructured":"Kaggle: credit card fraud detection. 2018. https:\/\/www.kaggle.com\/mlg-ulb\/creditcardfraud."},{"issue":"4","key":"1248_CR6","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1007\/s42979-023-01134-5","volume":"4","author":"JM Johnson","year":"2023","unstructured":"Johnson JM, Khoshgoftaar TM. Data-centric AI for healthcare fraud detection. SN Comput Sci. 2023;4(4):389. https:\/\/doi.org\/10.1007\/s42979-023-01134-5.","journal-title":"SN Comput Sci"},{"issue":"2","key":"1248_CR7","doi-asserted-by":"publisher","first-page":"293","DOI":"10.1093\/nsr\/nwt032","volume":"1","author":"J Fan","year":"2014","unstructured":"Fan J, Han F, Liu H. Challenges of big data analysis. Natl Sci Rev. 2014;1(2):293\u2013314. https:\/\/doi.org\/10.1093\/nsr\/nwt032 (https:\/\/academic.oup.com\/nsr\/article-pdf\/1\/2\/293\/31565398\/nwt032.pdf).","journal-title":"Natl Sci Rev"},{"issue":"6","key":"1248_CR8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3136625","volume":"50","author":"J Li","year":"2017","unstructured":"Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H. Feature selection: a data perspective. ACM Comput Surv. 2017;50(6):1\u201345. https:\/\/doi.org\/10.1145\/3136625.","journal-title":"ACM Comput Surv"},{"key":"1248_CR9","unstructured":"Lundberg SM, Lee S. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS\u201917, p. 
4768\u20134777. Red Hook, NY: Curran Associates Inc.; 2017."},{"key":"1248_CR10","doi-asserted-by":"publisher","unstructured":"Liu FT, Ting KM, Zhou ZH. Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining; 2008. p. 413\u2013422. https:\/\/doi.org\/10.1109\/ICDM.2008.17.","DOI":"10.1109\/ICDM.2008.17"},{"key":"1248_CR11","doi-asserted-by":"publisher","first-page":"117","DOI":"10.1007\/978-981-99-6346-1_10","volume-title":"Data analytics and learning","author":"KT Vasudev","year":"2024","unstructured":"Vasudev KT, Manohara Pai MM, Pai RM. Comparative analysis of generic outlier detection techniques. In: Guru DS, Kumar NV, Javed M, editors. Data analytics and learning. Singapore: Springer; 2024. p. 117\u201326."},{"issue":"5","key":"1248_CR12","doi-asserted-by":"publisher","first-page":"645","DOI":"10.3390\/e24050611","volume":"24","author":"C Shao","year":"2022","unstructured":"Shao C, Du X, Yu J, Chen J. Cluster-based improved isolation forest. Entropy. 2022;24(5):645. https:\/\/doi.org\/10.3390\/e24050611.","journal-title":"Entropy"},{"issue":"1","key":"1248_CR13","doi-asserted-by":"publisher","first-page":"155","DOI":"10.1016\/j.datak.2007.01.002","volume":"63","author":"R Gelbard","year":"2007","unstructured":"Gelbard R, Goldman O, Spiegler I. Investigating diversity of clustering methods: an empirical comparison. Data Knowl Eng. 2007;63(1):155\u201366. https:\/\/doi.org\/10.1016\/j.datak.2007.01.002. (Data Warehouse and Knowledge Discovery (DAWAK &apos;05)).","journal-title":"Data Knowl Eng"},{"key":"1248_CR14","unstructured":"U.S. Department of Justice: Criminal Resource Manual: 976. Health Care Fraud Generally. n.d. Accessed from https:\/\/www.justice.gov\/archives\/jm\/criminal-resource-manual-976-health-care-fraud-generally."},{"key":"1248_CR15","unstructured":"Bureau of Justice Statistics: Victims of Identity Theft, 2021. 
Accessed from https:\/\/bjs.ojp.gov\/press-release\/victims-identity-theft-2021."},{"key":"1248_CR16","unstructured":"Security.org: Credit Card Fraud Report. 2024. Accessed from https:\/\/www.security.org\/digital-safety\/credit-card-fraud-report\/."},{"key":"1248_CR17","unstructured":"U.S. Department of Justice: Criminal Resource Manual: 1007. Fraud. n.d. Accessed from https:\/\/www.justice.gov\/archives\/jm\/criminal-resource-manual-1007-fraud."},{"key":"1248_CR18","unstructured":"Federal Trade Commission: Nationwide fraud losses top \\$10 billion in 2023 as FTC steps up efforts to protect the public. 2024. Accessed from https:\/\/www.ftc.gov\/news-events\/news\/press-releases\/2024\/02\/nationwide-fraud-losses-top-10-billion-2023-ftc-steps-efforts-protect-public."},{"key":"1248_CR19","unstructured":"Social Security Administration Blog: Medicare Fraud Prevention Week. 2024. Accessed from https:\/\/blog.ssa.gov\/medicare-fraud-prevention-week\/."},{"key":"1248_CR20","unstructured":"U.S. Government Accountability Office: Federal fraud: challenges and costs. 2024. https:\/\/www.gao.gov\/products\/gao-24-105833."},{"key":"1248_CR21","unstructured":"Civil Division, U.S. Department of Justice: Fraud statistics, overview. 2020. Accessed from https:\/\/www.justice.gov\/opa\/press-release\/file\/1354316\/download."},{"key":"1248_CR22","doi-asserted-by":"publisher","unstructured":"Walauskis MA, Khoshgoftaar TM. Confident labels: A novel approach to new class labeling and evaluation on highly imbalanced data. In: 2024 IEEE 36th International Conference on Tools with Artificial Intelligence (ICTAI), 2024. p. 232\u2013239. https:\/\/doi.org\/10.1109\/ICTAI62512.2024.00042.","DOI":"10.1109\/ICTAI62512.2024.00042"},{"issue":"1","key":"1248_CR23","doi-asserted-by":"publisher","first-page":"63","DOI":"10.1186\/s40537-025-01120-x","volume":"12","author":"MA Walauskis","year":"2025","unstructured":"Walauskis MA, Khoshgoftaar TM. 
Unsupervised label generation for severely imbalanced fraud data. J Big Data. 2025;12(1):63.","journal-title":"J Big Data"},{"key":"1248_CR24","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1186\/s40537-024-01041-1","volume":"12","author":"JT Hancock","year":"2024","unstructured":"Hancock JT, Khoshgoftaar TM, Liang Q. A problem-agnostic approach to feature selection and analysis using SHAP. J Big Data. 2024;12:12.","journal-title":"J Big Data"},{"issue":"4","key":"1248_CR25","doi-asserted-by":"publisher","first-page":"129","DOI":"10.1007\/s10044-024-01340-6","volume":"27","author":"J Park","year":"2024","unstructured":"Park J, Lee Y. Advanced pseudo-labeling approach in mixing-based text data augmentation method. Pattern Anal Appl. 2024;27(4):129. https:\/\/doi.org\/10.1007\/s10044-024-01340-6.","journal-title":"Pattern Anal Appl"},{"issue":"3","key":"1248_CR26","doi-asserted-by":"publisher","first-page":"856","DOI":"10.1016\/j.bbe.2022.06.007","volume":"42","author":"Y Liu","year":"2022","unstructured":"Liu Y, Liu Z, Luo X, Zhao H. Diagnosis of Parkinson\u2019s disease based on SHAP value feature selection. Biocybernet Biomed Eng. 2022;42(3):856\u201369. https:\/\/doi.org\/10.1016\/j.bbe.2022.06.007.","journal-title":"Biocybernet Biomed Eng"},{"issue":"13","key":"1248_CR27","doi-asserted-by":"publisher","first-page":"1858","DOI":"10.1080\/10255842.2023.2263125","volume":"27","author":"P Ghaheri","year":"2024","unstructured":"Ghaheri P, Nasiri ASH, Homafar A. Diagnosis of Parkinson\u2019s disease based on voice signals using SHAP and hard voting ensemble method. Comput Methods Biomech Biomed Eng. 2024;27(13):1858\u201374. https:\/\/doi.org\/10.1080\/10255842.2023.2263125.","journal-title":"Comput Methods Biomech Biomed Eng"},{"key":"1248_CR28","unstructured":"Bin\u00a0Sulaiman R, Schetinin V, Sant P. Review of credit card fraud detection using machine learning. 2020. 
https:\/\/api.semanticscholar.org\/CorpusID:262477642."},{"issue":"7","key":"1248_CR29","doi-asserted-by":"publisher","first-page":"651","DOI":"10.3390\/app7070651","volume":"7","author":"A So","year":"2017","unstructured":"So A, Hooshyar D, Park KW, Lim HS. Early diagnosis of dementia from clinical data by machine learning techniques. Appl Sci. 2017;7(7):651.","journal-title":"Appl Sci"},{"issue":"1","key":"1248_CR30","first-page":"4190023","volume":"2022","author":"A Revathi","year":"2022","unstructured":"Revathi A, Kaladevi R, Ramana K, Jhaveri RH, Rudrakumar M, Prasanna Kumar MS. Early detection of cognitive decline using machine learning algorithm and cognitive ability test. Secur Commun Netw. 2022;2022(1):4190023.","journal-title":"Secur Commun Netw"},{"key":"1248_CR31","doi-asserted-by":"publisher","unstructured":"El\u00a0Naby AA, El-Din\u00a0Hemdan E, El-Sayed A. Deep learning approach for credit card fraud detection. In: 2021 International Conference on Electronic Engineering (ICEEM), 2021. p. 1\u20135. https:\/\/doi.org\/10.1109\/ICEEM52022.2021.9480639.","DOI":"10.1109\/ICEEM52022.2021.9480639"},{"issue":"2","key":"1248_CR32","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1109\/MIS.2009.36","volume":"24","author":"A Halevy","year":"2009","unstructured":"Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. IEEE Intell Syst. 2009;24(2):8\u201312. https:\/\/doi.org\/10.1109\/MIS.2009.36.","journal-title":"IEEE Intell Syst"},{"key":"1248_CR33","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1016\/j.ins.2019.05.042","volume":"557","author":"F Carcillo","year":"2021","unstructured":"Carcillo F, Borgne YAL, Caelen O, Kessaci Y, Obl\u00e9 F, Bontempi G. Combining unsupervised and supervised learning in credit card fraud detection. Inf Sci. 2021;557:317\u201331. 
https:\/\/doi.org\/10.1016\/j.ins.2019.05.042.","journal-title":"Inf Sci"},{"issue":"6","key":"1248_CR34","doi-asserted-by":"publisher","first-page":"305","DOI":"10.3390\/systems11060305","volume":"11","author":"S Jiang","year":"2023","unstructured":"Jiang S, Dong R, Wang J, Xia M. Credit card fraud detection based on unsupervised attentional anomaly detection network. Systems. 2023;11(6):305. https:\/\/doi.org\/10.3390\/systems11060305.","journal-title":"Systems"},{"issue":"5","key":"1248_CR35","doi-asserted-by":"publisher","first-page":"2621","DOI":"10.24200\/sci.2019.51110.2010","volume":"27","author":"F Moslehi","year":"2020","unstructured":"Moslehi F, Haeri A, Gholamian MR. A novel selective clustering framework for appropriate labeling of clusters based on k-means algorithm. Sci Iran. 2020;27(5):2621\u201334. https:\/\/doi.org\/10.24200\/sci.2019.51110.2010.","journal-title":"Sci Iran"},{"key":"1248_CR36","doi-asserted-by":"publisher","unstructured":"Babu AM, Pratap A. Credit card fraud detection using deep learning. In: 2020 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2020. p. 32\u201336. https:\/\/doi.org\/10.1109\/RAICS51191.2020.9332497.","DOI":"10.1109\/RAICS51191.2020.9332497"},{"key":"1248_CR37","doi-asserted-by":"publisher","unstructured":"Li J, Stones RJ, Wang G, Li Z, Liu X, Xiao K. Being accurate is not enough: New metrics for disk failure prediction. 2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS), 2016. p. 71\u201380. https:\/\/doi.org\/10.1109\/SRDS.2016.019.","DOI":"10.1109\/SRDS.2016.019"},{"key":"1248_CR38","doi-asserted-by":"publisher","first-page":"67","DOI":"10.1007\/978-3-030-88942-5_6","volume-title":"Discovery science","author":"JG Gaudreault","year":"2021","unstructured":"Gaudreault JG, Branco P, Gama J. An analysis of performance metrics for imbalanced classification. In: Discovery science. Cham: Springer; 2021. p. 
67\u201377."},{"key":"1248_CR39","doi-asserted-by":"publisher","unstructured":"Zhou Y, Yan H, Wang J, Chen Z, Ma A. Knowledge transfer-based network from medium and high-resolution SAR imagery for built-up extraction with class-imbalanced data. 2023 SAR in Big Data Era (BIGSARDATA), 2023. p. 1\u20134. https:\/\/doi.org\/10.1109\/BIGSARDATA59007.2023.10294846.","DOI":"10.1109\/BIGSARDATA59007.2023.10294846"},{"key":"1248_CR40","first-page":"60527","volume-title":"Advances in neural information processing systems","author":"A Gadetsky","year":"2023","unstructured":"Gadetsky A, Brbic M. The pursuit of human labeling: a new perspective on unsupervised learning. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S, editors. Advances in neural information processing systems, vol. 36. Cham: Springer; 2023. p. 60527\u201346."},{"key":"1248_CR41","doi-asserted-by":"publisher","first-page":"02386","DOI":"10.1016\/j.sciaf.2024.e02386","volume":"26","author":"EF Agyemang","year":"2024","unstructured":"Agyemang EF. Anomaly detection using unsupervised machine learning algorithms: a simulation study. Sci Afr. 2024;26:02386. https:\/\/doi.org\/10.1016\/j.sciaf.2024.e02386.","journal-title":"Sci Afr"},{"key":"1248_CR42","doi-asserted-by":"crossref","unstructured":"Zhang J, Wang Y, Yang Y, Luo Y, Ratner A. Binary classification with positive labeling sources. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management. CIKM \u201922, p. 4672\u20134676. 2022; New York, NY: Association for Computing Machinery.","DOI":"10.1145\/3511808.3557552"},{"issue":"3","key":"1248_CR43","first-page":"2299","volume":"35","author":"D Shi","year":"2023","unstructured":"Shi D, Zhu L, Li J, Cheng Z, Liu Z. Binary label learning for semi-supervised feature selection. IEEE Trans Knowl Data Eng. 
2023;35(3):2299\u2013312.","journal-title":"IEEE Trans Knowl Data Eng"},{"issue":"2","key":"1248_CR44","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3631326","volume":"1","author":"M Hort","year":"2024","unstructured":"Hort M, Chen Z, Zhang JM, Harman M, Sarro F. Bias mitigation for machine learning classifiers: a comprehensive survey. ACM J Respons Comput. 2024;1(2):1\u201352. https:\/\/doi.org\/10.1145\/3631326.","journal-title":"ACM J Respons Comput"},{"key":"1248_CR45","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825\u201330.","journal-title":"J Mach Learn Res"},{"key":"1248_CR46","unstructured":"Scikit-learn: sklearn.preprocessing.normalize. 2024. https:\/\/scikit-learn.org\/1.5\/modules\/generated\/sklearn.preprocessing.normalize.html. Accessed 02 Dec 2024."},{"key":"1248_CR47","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2022.109924","volume":"133","author":"LBV de Amorim","year":"2023","unstructured":"de Amorim LBV, Cavalcanti GDC, Cruz RMO. The choice of scaling technique matters for classification performance. Appl Soft Comput. 2023;133: 109924. https:\/\/doi.org\/10.1016\/j.asoc.2022.109924.","journal-title":"Appl Soft Comput"},{"key":"1248_CR48","unstructured":"Team PD. pandas.dataframe.samples - Pandas Documentation. 2024. pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.sample.html. Accessed 02 Dec 2024."},{"key":"1248_CR49","unstructured":"Yu Z, Jiang W, Alonso G. Efficient Tabular Data Preprocessing of ML Pipelines. 2024. 
https:\/\/arxiv.org\/abs\/2409.14912."},{"issue":"10","key":"1248_CR50","doi-asserted-by":"publisher","first-page":"2546","DOI":"10.38124\/ijisrt\/IJISRT24OCT1507","volume":"9","author":"VNS Gummadi","year":"2024","unstructured":"Gummadi VNS, Tubagus RA, Vasala R, Donavalli H. Comparative analysis of kmeans technique on non convex cluster. Int J Innov Sci Res Technol. 2024;9(10):2546\u201352. https:\/\/doi.org\/10.38124\/ijisrt\/IJISRT24OCT1507.","journal-title":"Int J Innov Sci Res Technol"},{"issue":"19","key":"1248_CR51","doi-asserted-by":"publisher","first-page":"20149","DOI":"10.1007\/s11042-017-4566-4","volume":"76","author":"Q Zhan","year":"2017","unstructured":"Zhan Q, Mao Y. Improved spectral clustering based on nystr\u00f6m method. Multimedia Tools Appl. 2017;76(19):20149\u201365. https:\/\/doi.org\/10.1007\/s11042-017-4566-4.","journal-title":"Multimedia Tools Appl"},{"key":"1248_CR52","doi-asserted-by":"publisher","unstructured":"Patel E, Kushwaha DS. Clustering cloud workloads: K-means vs gaussian mixture model. Proc Comput Sci. 2020;171:158\u201367. https:\/\/doi.org\/10.1016\/j.procs.2020.04.017. Third International Conference on Computing and Network Communications (CoCoNet\u201919).","DOI":"10.1016\/j.procs.2020.04.017"},{"key":"1248_CR53","doi-asserted-by":"publisher","first-page":"242","DOI":"10.1109\/ojsp.2020.3039330","volume":"1","author":"F Pourkamali-Anaraki","year":"2020","unstructured":"Pourkamali-Anaraki F. Scalable spectral clustering with nystr\u00f6m approximation: practical and theoretical aspects. IEEE Open J Signal Process. 2020;1:242\u201356. https:\/\/doi.org\/10.1109\/ojsp.2020.3039330.","journal-title":"IEEE Open J Signal Process"},{"key":"1248_CR54","unstructured":"Scikit-learn: Nystroem. 2024. https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.kernel_approximation.Nystroem.html. 
Accessed 02 Dec 2024."},{"key":"1248_CR55","doi-asserted-by":"publisher","first-page":"121","DOI":"10.1016\/j.patcog.2016.03.028","volume":"58","author":"SM Erfani","year":"2016","unstructured":"Erfani SM, Rajasegarar S, Karunasekera S, Leckie C. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recogn. 2016;58:121\u201334. https:\/\/doi.org\/10.1016\/j.patcog.2016.03.028.","journal-title":"Pattern Recogn"},{"key":"1248_CR56","doi-asserted-by":"publisher","unstructured":"Leevy JL, Khoshgoftaar TM, Hancock J. Evaluating performance metrics for credit card fraud classification. 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), p. 1336\u20131341. https:\/\/doi.org\/10.1109\/ICTAI56018.2022.00202.","DOI":"10.1109\/ICTAI56018.2022.00202"},{"key":"1248_CR57","unstructured":"Centers for Medicare and Medicaid Services: Medicare Part D Prescribers-By Provider and Drug. https:\/\/data.cms.gov\/provider-summary-by-type-of-service\/medicare-part-d-prescribers\/medicare-part-d-prescribers-by-provider-and-drug. Accessed 26 Nov 2024."},{"key":"1248_CR58","unstructured":"U.S. Department of Health and Human Services, Office of Inspector General: LEIE Downloadable Databases. 2024. https:\/\/oig.hhs.gov\/exclusions\/exclusions_list.asp. Accessed 26 Nov 2024."},{"key":"1248_CR59","doi-asserted-by":"crossref","unstructured":"Bauder RA, Khoshgoftaar TM. A novel method for fraudulent medicare claims detection from expected payment deviations (application paper). 2016. In: Proceedings of the 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), p. 11\u201319.","DOI":"10.1109\/IRI.2016.11"},{"key":"1248_CR60","doi-asserted-by":"crossref","unstructured":"M\u00fcller D, Soto-Rey I, Kramer F. Towards a guideline for evaluation metrics in medical image segmentation. 2022. 
https:\/\/arxiv.org\/abs\/2202.05273.","DOI":"10.1186\/s13104-022-06096-y"},{"issue":"6","key":"1248_CR61","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12864-019-6413-7","volume":"21","author":"D Chicco","year":"2020","unstructured":"Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over f1 score and accuracy in binary classification evaluation. BMC Genom. 2020;21(6):1\u201313. https:\/\/doi.org\/10.1186\/s12864-019-6413-7.","journal-title":"BMC Genom"},{"key":"1248_CR62","doi-asserted-by":"publisher","first-page":"429","DOI":"10.1016\/j.ins.2019.11.004","volume":"513","author":"F Thabtah","year":"2020","unstructured":"Thabtah F, Hammoud S, Kamalov F, Gonsalves A. Data imbalance in classification: experimental evaluation. Inf Sci. 2020;513:429\u201341. https:\/\/doi.org\/10.1016\/j.ins.2019.11.004.","journal-title":"Inf Sci"},{"issue":"9","key":"1248_CR63","doi-asserted-by":"publisher","first-page":"4180","DOI":"10.1021\/acs.jcim.9b01162","volume":"60","author":"S Korkmaz","year":"2020","unstructured":"Korkmaz S. Deep learning-based imbalanced data classification for drug discovery. J Chem Inf Model. 2020;60(9):4180\u201390. 
https:\/\/doi.org\/10.1021\/acs.jcim.9b01162.","journal-title":"J Chem Inf Model"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01248-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-025-01248-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01248-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T07:46:37Z","timestamp":1761119197000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-025-01248-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,22]]},"references-count":63,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1248"],"URL":"https:\/\/doi.org\/10.1186\/s40537-025-01248-w","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,22]]},"assertion":[{"value":"25 February 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 July 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 October 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to 
participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"236"}}