{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T23:47:33Z","timestamp":1772236053429,"version":"3.50.1"},"reference-count":35,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2023,12,22]],"date-time":"2023-12-22T00:00:00Z","timestamp":1703203200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Science and Technology Council, Taiwan","award":["NSC 112-2221-E-032-038-MY2"],"award-info":[{"award-number":["NSC 112-2221-E-032-038-MY2"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>When the binary response variable contains an excess of zero counts, the data are imbalanced. Imbalanced data cause trouble for binary classification. To simplify the numerical computation to obtain the maximum likelihood estimators of the zero-inflated Bernoulli (ZIBer) model parameters with imbalanced data, an expectation-maximization (EM) algorithm is proposed to derive the maximum likelihood estimates of the model parameters. The logistic regression model links the Bernoulli probabilities with the covariates in the ZIBer model, and the prediction performance among the ZIBer model, LightGBM, and artificial neural network (ANN) procedures is compared by Monte Carlo simulation. The results show that no method can dominate the other methods regarding predictive performance under the imbalanced data. The LightGBM and ZIBer models are more competitive than the ANN model for zero-inflated-imbalanced data sets.<\/jats:p>","DOI":"10.3390\/e26010015","type":"journal-article","created":{"date-parts":[[2023,12,22]],"date-time":"2023-12-22T08:53:01Z","timestamp":1703235181000},"page":"15","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["Binary Classification with Imbalanced Data"],"prefix":"10.3390","volume":"26","author":[{"given":"Jyun-You","family":"Chiang","sequence":"first","affiliation":[{"name":"School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1080-0231","authenticated-orcid":false,"given":"Yuhlong","family":"Lio","sequence":"additional","affiliation":[{"name":"Department of Mathematical Sciences, University of South Dakota, Vermillion, SD 57069, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chien-Ya","family":"Hsu","sequence":"additional","affiliation":[{"name":"Department of Statistics, Tamkang University, New Taipei City 251301, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6257-9677","authenticated-orcid":false,"given":"Chia-Ling","family":"Ho","sequence":"additional","affiliation":[{"name":"Department of Risk Management and Insurance, Tamkang University, New Taipei City 251301, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6548-7663","authenticated-orcid":false,"given":"Tzong-Ru","family":"Tsai","sequence":"additional","affiliation":[{"name":"Department of Statistics, Tamkang University, New Taipei City 251301, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,12,22]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"386","DOI":"10.1037\/h0042519","article-title":"The perceptron: A probabilistic model for information storage and organization in the brain","volume":"65","author":"Rosenblatt","year":"1958","journal-title":"Psychol. Rev."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"533","DOI":"10.1038\/323533a0","article-title":"Learning representations by back-propagating errors","volume":"323","author":"Rumelhart","year":"1986","journal-title":"Nature"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1016\/0893-6080(91)90032-Z","article-title":"Back-propagation algorithm which varies the number of hidden units","volume":"4","author":"Hirose","year":"1991","journal-title":"Neural Netw."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1016\/0893-6080(91)90033-2","article-title":"Creating artificial neural networks that generalize","volume":"4","author":"Sietsma","year":"1991","journal-title":"Neural Netw."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"465","DOI":"10.1016\/j.procs.2016.06.105","article-title":"Classification of cervical cancer using artificial neural networks","volume":"89","author":"Devi","year":"2016","journal-title":"Procedia Comput. Sci."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"2295","DOI":"10.1109\/JPROC.2017.2761740","article-title":"Efficient processing of deep neural networks: A tutorial and survey","volume":"105","author":"Sze","year":"2017","journal-title":"Proc. IEEE"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"e00938","DOI":"10.1016\/j.heliyon.2018.e00938","article-title":"State-of-the-art in artificial neural network applications: A survey","volume":"4","author":"Abiodun","year":"2018","journal-title":"Heliyon"},{"key":"ref_8","first-page":"17","article-title":"Lung cancer detection using artificial neural network","volume":"3","author":"Nasser","year":"2019","journal-title":"Int. J. Eng."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"2","DOI":"10.3389\/frai.2019.00002","article-title":"Pancreatic cancer prediction through an artificial neural network","volume":"2","author":"Muhammad","year":"2019","journal-title":"Front. Artif. Intell."},{"key":"ref_10","first-page":"24","article-title":"Student academic performance prediction using artificial neural networks: A case study","volume":"178","author":"Umar","year":"2019","journal-title":"Int. J. Comput. Appl."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"121073","DOI":"10.1016\/j.physa.2019.121073","article-title":"A novel ensemble classification model based on neural networks and a classifier optimisation technique for imbalanced credit risk evaluation","volume":"526","author":"Shen","year":"2019","journal-title":"Phys. A Stat. Mech. Its Appl."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1","DOI":"10.2307\/1269547","article-title":"Zero-inflated Poisson regression, with an application to defects in manufacturing","volume":"34","author":"Lambert","year":"1992","journal-title":"Technometrics"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"1030","DOI":"10.1111\/j.0006-341X.2000.01030.x","article-title":"Zero-inflated Poisson and binomial regression with random effects: A case study","volume":"56","author":"Hall","year":"2000","journal-title":"Biometrics"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1461","DOI":"10.1002\/sim.1088","article-title":"Zero-inflated models for regression analysis of count data: A study of growth and development","volume":"21","author":"Cheung","year":"2002","journal-title":"Stat. Med."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"341","DOI":"10.1023\/A:1020910605990","article-title":"Zero-inflated models with application to spatial count data","volume":"9","author":"Gelfand","year":"2002","journal-title":"Environ. Ecol. Stat."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"281","DOI":"10.1081\/STA-120018186","article-title":"Bayesian analysis of zero-inflated distributions","volume":"32","author":"Rodrigues","year":"2003","journal-title":"Commun. Stat.-Theory Methods"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1360","DOI":"10.1016\/j.jspi.2004.10.008","article-title":"Bayesian analysis of zero-inflated regression models","volume":"136","author":"Ghosh","year":"2006","journal-title":"J. Stat. Plan. Inference"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1073","DOI":"10.1016\/j.jeconom.2007.01.002","article-title":"A zero-inflated ordered probit model, with an application to modelling tobacco consumption","volume":"141","author":"Harris","year":"2007","journal-title":"J. Econom."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"163","DOI":"10.1111\/j.2044-8317.2011.02031.x","article-title":"The analysis of zero-inflated count data: Beyond zero-inflated Poisson regression","volume":"65","author":"Loeys","year":"2011","journal-title":"Br. J. Math. Stat. Psychol."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"460","DOI":"10.1214\/11-EJS616","article-title":"Maximum likelihood estimation in the logistic regression model with a cure fraction","volume":"5","author":"Diop","year":"2011","journal-title":"Electron. J. Stat."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"673","DOI":"10.1002\/hec.2844","article-title":"Consistent estimation of zero-inflated count models","volume":"22","author":"Staub","year":"2012","journal-title":"Health Econ."},{"key":"ref_22","first-page":"236","article-title":"Structural zeroes and zero-inflated models","volume":"26","author":"He","year":"2014","journal-title":"Shanghai Arch. Psychiatry"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"3597","DOI":"10.1080\/03610918.2014.950743","article-title":"Simulation-based inference in a zero-inflated Bernoulli regression model","volume":"45","author":"Diop","year":"2016","journal-title":"Commun. Stat.-Simul. Comput."},{"key":"ref_24","unstructured":"Alsabti, K., Ranka, S., and Singh, V. (1998). CLOUDS: A decision tree classifier for large datasets. Electr. Eng. Comput. Sci.-All Scholarsh., 41, Available online: https:\/\/surface.syr.edu\/eecs\/41."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1189","DOI":"10.1214\/aos\/1013203451","article-title":"Greedy function approximation: A gradient boosting machine","volume":"29","author":"Friedman","year":"2001","journal-title":"Ann. Stat."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Jin, R., and Agrawal, G. (2003, January 1\u20133). Communication and memory efficient parallel decision tree construction. Proceedings of the 2003 SIAM International Conference on Data Mining, San Francisco, CA, USA.","DOI":"10.1137\/1.9781611972733.11"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Chen, T., and Guestrin, C. (2016, January 13\u201317). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA.","DOI":"10.1145\/2939672.2939785"},{"key":"ref_28","first-page":"3149","article-title":"LightGBM: A highly efficient gradient boosting decision tree","volume":"30","author":"Ke","year":"2017","journal-title":"Neural Inf. Process. Syst."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Wang, D., Zhang, Y., and Zhao, Y. (2017, January 18). LightGBM: An effective miRNA classification method in breast cancer patients. Proceedings of the 2017 International Conference on Computational Biology and Bioinformatics, New York, NY, USA.","DOI":"10.1145\/3155077.3155079"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1016\/j.elerap.2018.08.002","article-title":"Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning","volume":"31","author":"Ma","year":"2018","journal-title":"Electron. Commer. Res. Appl."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Machado, M.R., Karray, S., and De Sousa, I.T. (2019, January 19\u201321). LightGBM: An effective decision tree gradient boosting method to predict customer loyalty in the finance industry. Proceedings of the 14th International Conference on Computer Science & Education (ICCSE), Toronto, ON, Canada.","DOI":"10.1109\/ICCSE.2019.8845529"},{"key":"ref_32","first-page":"263928","article-title":"Light GBM machine learning algorithm to online click fraud detection","volume":"2019","author":"Minstireanu","year":"2019","journal-title":"J. Inf. Assur. Cyber Secur."},{"key":"ref_33","first-page":"6","article-title":"Comparison between XGBoost, LightGBM and CatBoost using a home credit dataset","volume":"13","author":"Daoud","year":"2019","journal-title":"Int. J. Comput. Inf. Eng."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1007\/s13748-016-0094-0","article-title":"Learning from imbalanced data: Open challenges and future directions","volume":"5","author":"Krawczyk","year":"2016","journal-title":"Prog. Artif. Intell."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"857","DOI":"10.1007\/s10462-017-9611-1","article-title":"Selecting training sets for support vector machines: A review","volume":"52","author":"Nalepa","year":"2019","journal-title":"Artif. Intell. Rev."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/26\/1\/15\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T21:40:32Z","timestamp":1760132432000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/26\/1\/15"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,22]]},"references-count":35,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,1]]}},"alternative-id":["e26010015"],"URL":"https:\/\/doi.org\/10.3390\/e26010015","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,22]]}}}