{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T14:38:52Z","timestamp":1777646332711,"version":"3.51.4"},"reference-count":36,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2022,11,1]],"date-time":"2022-11-01T00:00:00Z","timestamp":1667260800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Major Project of Natural Science Research in Colleges and Universities of Anhui Province","award":["KJ2021ZD0007"],"award-info":[{"award-number":["KJ2021ZD0007"]}]},{"name":"Major Project of Natural Science Research in Colleges and Universities of Anhui Province","award":["2021xjxm049"],"award-info":[{"award-number":["2021xjxm049"]}]},{"name":"2021 cultivation project of Anhui Normal University","award":["KJ2021ZD0007"],"award-info":[{"award-number":["KJ2021ZD0007"]}]},{"name":"2021 cultivation project of Anhui Normal University","award":["2021xjxm049"],"award-info":[{"award-number":["2021xjxm049"]}]},{"name":"Wuhu Science and Technology Bureau Project","award":["KJ2021ZD0007"],"award-info":[{"award-number":["KJ2021ZD0007"]}]},{"name":"Wuhu Science and Technology Bureau Project","award":["2021xjxm049"],"award-info":[{"award-number":["2021xjxm049"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Axioms"],"abstract":"<jats:p>Machine learning models may not be able to effectively learn and predict from imbalanced data in the fields of machine learning and data mining. This study proposed a method for analyzing the performance impact of imbalanced binary data on machine learning models. It systematically analyzes 1. the relationship between varying performance in machine learning models and imbalance rate (IR); 2. the performance stability of machine learning models on imbalanced binary data. 
In the proposed method, imbalanced data augmentation algorithms are first designed to obtain imbalanced datasets with gradually varying IR. Then, to obtain more objective classification results, the evaluation metric AFG, the arithmetic mean of the area under the receiver operating characteristic curve (AUC), F-measure and G-mean, is used to evaluate the classification performance of machine learning models. Finally, based on AFG and the coefficient of variation (CV), a performance stability evaluation method for machine learning models is proposed. Experiments with eight widely used machine learning models on 48 different imbalanced datasets demonstrate that classification performance decreases as IR increases on the same imbalanced data. Meanwhile, the classification performance of LR, DT and SVC is unstable, while GNB, BNB, KNN, RF and GBDT are relatively stable and not susceptible to imbalanced data. In particular, BNB has the most stable classification performance. The Friedman and Nemenyi post hoc statistical tests also confirm this result. SMOTE is used for oversampling-based imbalanced data augmentation, and whether other oversampling methods yield consistent results needs further research. 
In the future, an imbalanced data augmentation algorithm based on undersampling and hybrid sampling should be used to analyze the performance impact of imbalanced binary data on machine learning models.<\/jats:p>","DOI":"10.3390\/axioms11110607","type":"journal-article","created":{"date-parts":[[2022,11,2]],"date-time":"2022-11-02T06:49:02Z","timestamp":1667371742000},"page":"607","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":33,"title":["A Method for Analyzing the Performance Impact of Imbalanced Binary Data on Machine Learning Models"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9001-0859","authenticated-orcid":false,"given":"Ming","family":"Zheng","sequence":"first","affiliation":[{"name":"School of Computer and Information, Anhui Normal University, Wuhu 241002, China"},{"name":"Anhui Provincial Key Laboratory of Network and Information Security, Wuhu 241002, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fei","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computer and Information, Anhui Normal University, Wuhu 241002, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaowen","family":"Hu","sequence":"additional","affiliation":[{"name":"School of Computer and Information, Anhui Normal University, Wuhu 241002, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuhao","family":"Miao","sequence":"additional","affiliation":[{"name":"Affiliated Institution of Anhui Normal University, Wuhu 241002, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Huo","family":"Cao","sequence":"additional","affiliation":[{"name":"School of Computer and Information, Anhui Normal University, Wuhu 241002, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mingjing","family":"Tang","sequence":"additional","affiliation":[{"name":"School of Life Science, 
Yunnan Normal University, Kunming 650500, China"},{"name":"Engineering Research Center of Sustainable Development and Utilization of Biomass Energy, Ministry of Education, Yunnan Normal University, Kunming 650500, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,11,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1109\/TPAMI.2019.2929166","article-title":"Multiset feature learning for highly imbalanced data classification","volume":"43","author":"Jing","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1009","DOI":"10.1016\/j.ins.2019.10.014","article-title":"Conditional Wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification","volume":"512","author":"Zheng","year":"2020","journal-title":"Inf. Sci."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"658","DOI":"10.1016\/j.ins.2021.07.053","article-title":"UFFDFR: Undersampling framework with denoising, fuzzy c-means clustering, and representative sample selection for imbalanced data classification","volume":"576","author":"Zheng","year":"2021","journal-title":"Inf. Sci."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"116051","DOI":"10.1016\/j.eswa.2021.116051","article-title":"Exploring ensemble oversampling method for imbalanced keyword extraction learning in policy text based on three-way decisions and SMOTE","volume":"188","author":"Liang","year":"2022","journal-title":"Expert Syst. 
Appl."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"176","DOI":"10.1016\/j.neunet.2020.06.026","article-title":"Hybrid neural network with cost-sensitive support vector machine for class-imbalanced multimodal data","volume":"130","author":"Kim","year":"2020","journal-title":"Neural Netw."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1613\/jair.953","article-title":"SMOTE: Synthetic minority over-sampling technique","volume":"16","author":"Chawla","year":"2002","journal-title":"J. Artif. Intell. Res."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"79","DOI":"10.32614\/RJ-2014-008","article-title":"ROSE: A Package for Binary Imbalanced Learning","volume":"6","author":"Lunardon","year":"2014","journal-title":"R J."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"102435","DOI":"10.1016\/j.cose.2021.102435","article-title":"STL-HDL: A new hybrid network intrusion detection system for imbalanced dataset on big data environment","volume":"110","author":"Al","year":"2021","journal-title":"Comput. Secur."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"104814","DOI":"10.1016\/j.knosys.2019.06.022","article-title":"SMOTE based class-specific extreme learning machine for imbalanced learning","volume":"187","author":"Raghuwanshi","year":"2020","journal-title":"Knowl.-Based Syst."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"128","DOI":"10.1016\/j.inffus.2019.07.006","article-title":"Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting","volume":"54","author":"Sun","year":"2020","journal-title":"Inf. Fusion"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1214","DOI":"10.1016\/j.ins.2019.10.048","article-title":"Learning imbalanced datasets based on SMOTE and Gaussian distribution","volume":"512","author":"Pan","year":"2020","journal-title":"Inf. 
Sci."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Saini, M., and Susan, S. (2022). VGGIN-Net: Deep Transfer Network for Imbalanced Breast Cancer Dataset. IEEE\/ACM Trans. Comput. Biol. Bioinform.","DOI":"10.1109\/TCBB.2022.3163277"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zhu, Q., Zhu, T., Zhang, R., Ye, H., Sun, K., Xu, Y., and Zhang, D. (2022). A Cognitive Driven Ordinal Preservation for Multi-Modal Imbalanced Brain Disease Diagnosis. IEEE Trans. Cogn. Dev. Syst.","DOI":"10.1109\/TCDS.2022.3175360"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Sun, Y., Cai, L., Liao, B., Zhu, W., and Xu, J. (2022). A Robust Oversampling Approach for Class Imbalance Problem with Small Disjuncts. IEEE Trans. Knowl. Data Eng.","DOI":"10.1109\/TKDE.2022.3161291"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1016\/j.eswa.2017.03.073","article-title":"Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning","volume":"82","author":"Douzas","year":"2017","journal-title":"Expert Syst. Appl."},{"key":"ref_16","first-page":"809","article-title":"The impact study of class imbalance on the performance of software defect prediction models","volume":"41","author":"Yu","year":"2018","journal-title":"Chin. J. Comput."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"233","DOI":"10.1080\/03610920802187448","article-title":"Estimator and tests for common coefficients of variation in normal distributions","volume":"38","author":"Forkman","year":"2009","journal-title":"Commun. Stat.\u2014Theory Methods"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1104","DOI":"10.1109\/TKDE.2019.2898861","article-title":"Ensemble of classifiers based on multiobjective genetic sampling for imbalanced data","volume":"32","author":"Fernandes","year":"2019","journal-title":"IEEE Trans. Knowl. 
Data Eng."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"3525","DOI":"10.1109\/TNNLS.2019.2944962","article-title":"Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem","volume":"31","author":"Lu","year":"2020","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"2799","DOI":"10.1109\/TFUZZ.2019.2939989","article-title":"Fuzzy Ordered c-Means Clustering and Least Angle Regression for Fuzzy Rule-Based Classifier: Study for Imbalanced Data","volume":"28","author":"Leski","year":"2019","journal-title":"IEEE Trans. Fuzzy Syst."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"849","DOI":"10.1007\/s40815-020-00936-4","article-title":"A New Bayesian Network Based on Gaussian Naive Bayes with Fuzzy Parameters for Training Assessment in Virtual Simulators","volume":"23","author":"Moraes","year":"2020","journal-title":"Int. J. Fuzzy Syst."},{"key":"ref_22","unstructured":"Raschka, S. (2014). Naive bayes and text classification i-introduction and theory. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"10844","DOI":"10.1109\/TIE.2019.2962465","article-title":"A Reinforced k-Nearest Neighbors Method with Application to Chatter Identification in High Speed Milling","volume":"67","author":"Shi","year":"2020","journal-title":"IEEE Trans. Ind. Electron."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1713","DOI":"10.1109\/TPAMI.2019.2901688","article-title":"Logistic regression confined by cardinality-constrained sample and feature selection","volume":"42","author":"Adeli","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"54","DOI":"10.1109\/TII.2019.2915559","article-title":"Enhanced random forest with concurrent analysis of static and dynamic nodes for industrial fault classification","volume":"16","author":"Chai","year":"2019","journal-title":"IEEE Trans. Ind. Inform."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"113783","DOI":"10.1016\/j.eswa.2020.113783","article-title":"Efficiency analysis trees: A new methodology for estimating production frontiers through decision trees","volume":"162","author":"Esteve","year":"2020","journal-title":"Expert Syst. Appl."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"2706","DOI":"10.1109\/TPDS.2019.2920131","article-title":"Exploiting GPUs for efficient gradient boosting decision tree training","volume":"30","author":"Wen","year":"2019","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"105754","DOI":"10.1016\/j.knosys.2020.105754","article-title":"One-class support vector classifiers: A survey","volume":"196","author":"Alam","year":"2020","journal-title":"Knowl.-Based Syst."},{"key":"ref_29","first-page":"2825","article-title":"Scikit-learn: Machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"2159","DOI":"10.1109\/TKDE.2019.2913859","article-title":"Entropy-based Sampling Approaches for Multi-class Imbalanced Problems","volume":"32","author":"Li","year":"2020","journal-title":"IEEE Trans. Knowl. 
Data Eng."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"427","DOI":"10.1016\/j.neunet.2007.12.031","article-title":"Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance","volume":"21","author":"Mazurowski","year":"2008","journal-title":"Neural Netw."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"935","DOI":"10.1016\/j.neucom.2015.04.120","article-title":"Study of the impact of resampling methods for contrast pattern based classifiers in imbalanced databases","volume":"175","year":"2016","journal-title":"Neurocomputing"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"216","DOI":"10.1016\/j.patcog.2019.02.023","article-title":"The impact of class imbalance in classification performance metrics based on the binary confusion matrix","volume":"91","author":"Luque","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"105662","DOI":"10.1016\/j.asoc.2019.105662","article-title":"An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets","volume":"83","year":"2019","journal-title":"Appl. Soft Comput."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"429","DOI":"10.1016\/j.ins.2019.11.004","article-title":"Data imbalance in classification: Experimental evaluation","volume":"513","author":"Thabtah","year":"2020","journal-title":"Inf. Sci."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"18473","DOI":"10.1007\/s00521-022-07454-4","article-title":"Adam or Eve? Automatic users\u2019 gender classification via gestures analysis on touch devices","volume":"34","author":"Guarino","year":"2022","journal-title":"Neural Comput. 
Appl."}],"container-title":["Axioms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2075-1680\/11\/11\/607\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:08:41Z","timestamp":1760144921000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2075-1680\/11\/11\/607"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,1]]},"references-count":36,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2022,11]]}},"alternative-id":["axioms11110607"],"URL":"https:\/\/doi.org\/10.3390\/axioms11110607","relation":{},"ISSN":["2075-1680"],"issn-type":[{"value":"2075-1680","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,11,1]]}}}