{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,19]],"date-time":"2026-02-19T00:19:35Z","timestamp":1771460375747,"version":"3.50.1"},"reference-count":68,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2023,3,22]],"date-time":"2023-03-22T00:00:00Z","timestamp":1679443200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Data completeness is one of the most common challenges that hinder the performance of data analytics platforms. Different studies have assessed the effect of missing values on different classification models based on a single evaluation metric, namely, accuracy. However, accuracy on its own is a misleading measure of classifier performance because it does not consider unbalanced datasets. This paper presents an experimental study that assesses the effect of incomplete datasets on the performance of five classification models. The analysis was conducted with different ratios of missing values in six datasets that vary in size, type, and balance. Moreover, for unbiased analysis, the performance of the classifiers was measured using three different metrics, namely, the Matthews correlation coefficient (MCC), the F1-score, and accuracy. The results show that the sensitivity of the supervised classifiers to missing data differs according to a set of factors. The most significant factor is the missing data pattern and ratio, followed by the imputation method, and then the type, size, and balance of the dataset. The sensitivity of the classifiers when data are missing due to the Missing Completely At Random (MCAR) pattern is less than their sensitivity when data are missing due to the Missing Not At Random (MNAR) pattern. 
Furthermore, using the MCC as an evaluation measure better reflects the variation in the sensitivity of the classifiers to the missing data.<\/jats:p>","DOI":"10.3390\/bdcc7010055","type":"journal-article","created":{"date-parts":[[2023,3,23]],"date-time":"2023-03-23T03:07:57Z","timestamp":1679540877000},"page":"55","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["Effect of Missing Data Types and Imputation Methods on Supervised Classifiers: An Evaluation Study"],"prefix":"10.3390","volume":"7","author":[{"given":"Menna Ibrahim","family":"Gabr","sequence":"first","affiliation":[{"name":"Department of Business Information Systems (BIS), Faculty of Commerce and Business Administration, Helwan University, Cairo 11795, Egypt"}]},{"given":"Yehia Mostafa","family":"Helmy","sequence":"additional","affiliation":[{"name":"Department of Business Information Systems (BIS), Faculty of Commerce and Business Administration, Helwan University, Cairo 11795, Egypt"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2590-0000","authenticated-orcid":false,"given":"Doaa Saad","family":"Elzanfaly","sequence":"additional","affiliation":[{"name":"Department of Information Systems, Faculty of Computer and Artificial Intelligence, Helwan University, Cairo 11795, Egypt"},{"name":"Department of Information Systems, Faculty of Informatics Computer Science, British University in Egypt, Cairo 11837, Egypt"}]}],"member":"1968","published-online":{"date-parts":[[2023,3,22]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"3","DOI":"10.54623\/fue.fcij.6.1.3","article-title":"Data Quality Dimensions, Metrics, and Improvement Techniques","volume":"6","author":"Gabr","year":"2021","journal-title":"Future Comput. Inform. 
J."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"157","DOI":"10.2147\/CLEP.S129785","article-title":"Missing data and multiple imputation in clinical epidemiological research","volume":"9","author":"Pedersen","year":"2017","journal-title":"Clin. Epidemiol."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"134","DOI":"10.1007\/s42979-020-00131-0","article-title":"Multiple imputation ensembles (MIE) for dealing with missing data","volume":"1","author":"Aleryani","year":"2020","journal-title":"SN Comput. Sci."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Blomberg, L.C., and Ruiz, D.D.A. (2013, January 22). Evaluating the influence of missing data on classification algorithms in data mining applications. Proceedings of the Anais do IX Simp\u00f3sio Brasileiro de Sistemas de Informa\u00e7\u00e3o, SBC, Porto Alegre, Brazil.","DOI":"10.5753\/sbsi.2013.5736"},{"key":"ref_5","unstructured":"Acuna, E., and Rodriguez, C. (2004). Classification, Clustering, and Data Mining Applications, Springer."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"693674","DOI":"10.3389\/fdata.2021.693674","article-title":"A benchmark for data imputation methods","volume":"4","author":"Allhorn","year":"2021","journal-title":"Front. Big Data"},{"key":"ref_7","first-page":"5321","article-title":"Missing value imputation in multi attribute data set","volume":"5315","author":"Gimpy","year":"2014","journal-title":"Int. J. Comput. Sci. Inf. Technol."},{"key":"ref_8","first-page":"19075","article-title":"Handling missing data with graph representation learning","volume":"33","author":"You","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_9","first-page":"264","article-title":"Effects of missing data imputation on classifier accuracy","volume":"2","author":"Samant","year":"2013","journal-title":"Int. J. Eng. Res. Technol. 
IJERT"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Christopher, S.Z., Siswantining, T., Sarwinda, D., and Bustaman, A. (2019, January 29\u201330). Missing value analysis of numerical data using fractional hot deck imputation. Proceedings of the 2019 3rd International Conference on Informatics and Computational Sciences (ICICoS), Semarang, Indonesia.","DOI":"10.1109\/ICICoS48119.2019.8982412"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Aljuaid, T., and Sasi, S. (2016, January 23\u201325). Proper imputation techniques for missing values in data sets. Proceedings of the 2016 International Conference on Data Science and Engineering (ICDSE), Cochin, India.","DOI":"10.1109\/ICDSE.2016.7823957"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Thirukumaran, S., and Sumathi, A. (2012, January 13\u201315). Missing value imputation techniques depth survey and an imputation algorithm to improve the efficiency of imputation. Proceedings of the 2012 Fourth International Conference on Advanced Computing (ICoAC), Chennai, India.","DOI":"10.1109\/ICoAC.2012.6416805"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Hossin, M., Sulaiman, M., Mustapha, A., Mustapha, N., and Rahmat, R. (2011, January 28\u201329). A hybrid evaluation metric for optimizing classifier. Proceedings of the 2011 3rd Conference on Data Mining and Optimization (DMO), Kuala Lumpur, Malaysia.","DOI":"10.1109\/DMO.2011.5976522"},{"key":"ref_14","first-page":"27","article-title":"Evaluation measures for models assessment over imbalanced data sets","volume":"3","author":"Bekkar","year":"2013","journal-title":"J. Inf. Eng. Appl."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"168","DOI":"10.1016\/j.aci.2018.08.003","article-title":"Classification assessment methods","volume":"17","author":"Tharwat","year":"2021","journal-title":"Appl. Comput. 
Inform."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"78368","DOI":"10.1109\/ACCESS.2021.3084050","article-title":"The Matthews correlation coefficient (MCC) is more informative than Cohen\u2019s Kappa and Brier score in binary classification assessment","volume":"9","author":"Chicco","year":"2021","journal-title":"IEEE Access"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1186\/s13040-021-00244-z","article-title":"The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation","volume":"14","author":"Chicco","year":"2021","journal-title":"BioData Min."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1000197","DOI":"10.4172\/2161-0487.1000197","article-title":"Five ways to look at Cohen\u2019s kappa","volume":"5","author":"Warrens","year":"2015","journal-title":"J. Psychol. Psychother."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Jeni, L.A., Cohn, J.F., and De La Torre, F. (2013, January 2\u20135). Facing imbalanced data\u2013recommendations for the use of performance metrics. Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Geneva, Switzerland.","DOI":"10.1109\/ACII.2013.47"},{"key":"ref_20","first-page":"220","article-title":"Understanding auc-roc curve","volume":"26","author":"Narkhede","year":"2018","journal-title":"Towards Data Sci."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"7137524","DOI":"10.1155\/2022\/7137524","article-title":"Investigating the role of image fusion in brain tumor classification models based on machine learning algorithm for personalized medicine","volume":"2022","author":"Nanmaran","year":"2022","journal-title":"Comput. Math. 
Methods Med."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"427","DOI":"10.1016\/j.ipm.2009.03.002","article-title":"A systematic analysis of performance measures for classification tasks","volume":"45","author":"Sokolova","year":"2009","journal-title":"Inf. Process. Manag."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"113391","DOI":"10.1016\/j.eswa.2020.113391","article-title":"A novel weighted TPR-TNR measure to assess performance of the classifiers","volume":"152","author":"Jadhav","year":"2020","journal-title":"Expert Syst. Appl."},{"key":"ref_24","unstructured":"Liu, P., Lei, L., and Wu, N. (2005, January 21\u201323). A quantitative study of the effect of missing data in classifiers. Proceedings of the Fifth International Conference on Computer and Information Technology (CIT\u201905), Shanghai, China."},{"key":"ref_25","unstructured":"Hunt, L.A. (2017). Data Science, Springer."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"5621","DOI":"10.1016\/j.eswa.2015.02.050","article-title":"Hybrid prediction model with missing value imputation for medical data","volume":"42","author":"Purwar","year":"2015","journal-title":"Expert Syst. Appl."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Su, X., Khoshgoftaar, T.M., and Greiner, R. (2008, January 3\u20135). Using imputation techniques to help learn accurate classifiers. Proceedings of the 2008 20th IEEE International Conference on Tools with Artificial Intelligence, Dayton, OH, USA.","DOI":"10.1109\/ICTAI.2008.60"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1515\/jaiscr-2018-0002","article-title":"Classifiers accuracy improvement based on missing data imputation","volume":"8","author":"Jordanov","year":"2018","journal-title":"J. Artif. Intell. Soft Comput. 
Res."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1007\/s10115-011-0424-2","article-title":"On the choice of the best imputation methods for missing values considering three groups of classification methods","volume":"32","author":"Luengo","year":"2012","journal-title":"Knowl. Inf. Syst."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"52","DOI":"10.1016\/j.eswa.2017.07.026","article-title":"An extensive analysis of the interaction between missing data types, imputation methods, and supervised classifiers","volume":"89","author":"Garciarena","year":"2017","journal-title":"Expert Syst. Appl."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Aggarwal, U., Popescu, A., and Hudelot, C. (2020, January 1\u20137). Active learning for imbalanced datasets. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093475"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Garc\u00eda, V., Mollineda, R.A., and S\u00e1nchez, J.S. (2010, January 23\u201326). Theoretical analysis of a performance measure for imbalanced data. Proceedings of the 2010 20th International Conference on Pattern Recognition, Washington, DC, USA.","DOI":"10.1109\/ICPR.2010.156"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Lei, L., Wu, N., and Liu, P. (2005, January 13\u201315). Applying sensitivity analysis to missing data in classifiers. Proceedings of the ICSSSM\u201905, 2005 International Conference on Services Systems and Services Management, Chongqing, China.","DOI":"10.1109\/ICSSSM.2005.1500155"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1186\/s13321-018-0281-z","article-title":"Effect of missing data on multitask prediction methods","volume":"10","author":"Chen","year":"2018","journal-title":"J. 
Cheminform."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Hossain, T., and Inoue, S. (June, January 30). A comparative study on missing data handling using machine learning for human activity recognition. Proceedings of the 2019 Joint 8th International Conference on Informatics, Electronics & Vision (ICIEV) and 2019 3rd International Conference on Imaging, Vision & Pattern Recognition (icIVPR), Spokane, WA, USA.","DOI":"10.1109\/ICIEV.2019.8858520"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"739","DOI":"10.1109\/TCYB.2018.2872800","article-title":"A transfer-based additive LS-SVM classifier for handling missing data","volume":"50","author":"Wang","year":"2018","journal-title":"IEEE Trans. Cybern."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Makaba, T., and Dogo, E. (2019, January 21\u201322). A comparison of strategies for missing values in data on machine learning classification algorithms. Proceedings of the 2019 International Multidisciplinary Information Technology and Engineering Conference (IMITEC), Vanderbijlpark, South Africa.","DOI":"10.1109\/IMITEC45504.2019.9015889"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Liu, Q., and Hauswirth, M. (2020, January 28\u201331). A provenance meta learning framework for missing data handling methods selection. Proceedings of the 11th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), Virtual Conference.","DOI":"10.1109\/UEMCON51285.2020.9298089"},{"key":"ref_39","first-page":"749","article-title":"An approach towards missing data management using improved GRNN-SGTM ensemble method","volume":"24","author":"Izonin","year":"2021","journal-title":"Eng. Sci. Technol. Int. 
J."},{"key":"ref_40","first-page":"88","article-title":"Data mining: Concepts and techniques","volume":"10","author":"Han","year":"2006","journal-title":"Morgan Kaufmann"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"12","DOI":"10.9790\/0661-0651215","article-title":"K-NN classifier performs better than K-means clustering in missing value imputation","volume":"6","author":"Malarvizhi","year":"2012","journal-title":"IOSR J. Comput. Eng."},{"key":"ref_42","first-page":"34","article-title":"Comparative analysis of different imputation methods to treat missing values in data mining environment","volume":"82","author":"Singhai","year":"2013","journal-title":"Int. J. Comput. Appl."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"401","DOI":"10.1080\/1743727X.2016.1168798","article-title":"Random forest as an imputation method for education and psychology research: Its impact on item fit and difficulty of the Rasch model","volume":"39","author":"Golino","year":"2016","journal-title":"Int. J. Res. Method Educ."},{"key":"ref_44","first-page":"17","article-title":"Probabilistic neural network based categorical data imputation","volume":"218","author":"Nishanth","year":"2016","journal-title":"Neuro Comput."},{"key":"ref_45","unstructured":"Grandini, M., Bagli, E., and Visani, G. (2020). Metrics for multi-class classification: An overview. arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"216","DOI":"10.1016\/j.patcog.2019.02.023","article-title":"The impact of class imbalance in classification performance metrics based on the binary confusion matrix","volume":"91","author":"Luque","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_47","first-page":"1","article-title":"A survey of predictive modeling on imbalanced domains","volume":"49","author":"Branco","year":"2016","journal-title":"ACM Comput. Surv. 
CSUR"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1186\/s12864-019-6413-7","article-title":"The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation","volume":"21","author":"Chicco","year":"2020","journal-title":"BMC Genom."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Sa\u2019id, A.A., Rustam, Z., Wibowo, V.V.P., Setiawan, Q.S., and Laeli, A.R. (2020, January 8\u20139). Linear support vector machine and logistic regression for cerebral infarction classification. Proceedings of the 2020 International Conference on Decision Aid Sciences and Application (DASA), Online.","DOI":"10.1109\/DASA51403.2020.9317065"},{"key":"ref_50","unstructured":"Cao, C., Chicco, D., and Hoffman, M.M. (2020). The MCC-F1 curve: A performance evaluation technique for binary classification. arXiv."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"164","DOI":"10.1504\/IJSISE.2018.093268","article-title":"Evaluating compressive sensing algorithms in through-the-wall radar via F1-score","volume":"11","author":"AlBeladi","year":"2018","journal-title":"Int. J. Signal Imaging Syst. Eng."},{"key":"ref_52","unstructured":"Glazkova, A. (2020). A comparison of synthetic oversampling methods for multi-class text classification. arXiv."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Toupas, P., Chamou, D., Giannoutakis, K.M., Drosou, A., and Tzovaras, D. (2019, January 16\u201319). An intrusion detection system for multi-class classification based on deep neural networks. 
Proceedings of the 2019 18th IEEE International Conference on Machine Learning And Applications (ICMLA), Boca Raton, FL, USA.","DOI":"10.1109\/ICMLA.2019.00206"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"2461","DOI":"10.1109\/JBHI.2020.2981526","article-title":"Deep multi-scale fusion neural network for multi-class arrhythmia detection","volume":"24","author":"Wang","year":"2020","journal-title":"IEEE J. Biomed. Health Inform."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"64486","DOI":"10.1109\/ACCESS.2018.2876674","article-title":"Multi-class sentiment analysis in Twitter: What if classification is not the answer","volume":"6","author":"Bouazizi","year":"2018","journal-title":"IEEE Access"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Baker, C., Deng, L., Chakraborty, S., and Dehlinger, J. (2019, January 15\u201319). Automatic multi-class non-functional software requirements classification using neural networks. Proceedings of the 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Milwaukee, WI, USA.","DOI":"10.1109\/COMPSAC.2019.10275"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Liu, C., Osama, M., and De Andrade, A. (2019). DENS: A dataset for multi-class emotion analysis. arXiv.","DOI":"10.18653\/v1\/D19-1656"},{"key":"ref_58","unstructured":"Opitz, J., and Burst, S. (2019). Macro f1 and macro f1. arXiv."},{"key":"ref_59","unstructured":"Josephine, S.A. (2017, January 2\u20135). Predictive Accuracy: A Misleading Performance Measure for Highly Imbalanced Data Classified negative. Proceedings of the SAS Global Forum, Orlando, FL, USA."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Boughorbel, S., Jarray, F., and El-Anbari, M. (2017). Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE, 12.","DOI":"10.1371\/journal.pone.0177678"},{"key":"ref_61","unstructured":"Fisher, R. (2022, April 18). UCI Iris Data Set. 
Available online: https:\/\/archive.ics.uci.edu\/ml\/datasets\/iris."},{"key":"ref_62","unstructured":"Moro, S., Paulo, C., and Paulo, R. (2022, April 21). UCI Bank Marketing Data Set. Available online: https:\/\/archive.ics.uci.edu\/ml\/."},{"key":"ref_63","unstructured":"Bohanec, M., and Zupan, B. (2022, April 21). UCI Nursery Data Set. Available online: https:\/\/archive.ics.uci.edu\/ml\/datasets\/nursery."},{"key":"ref_64","unstructured":"Bohanec, M. (2022, April 21). Car Evaluation Data Set. Available online: https:\/\/www.kaggle.com\/datasets\/elikplim\/car-evaluation-data-setl."},{"key":"ref_65","unstructured":"Mehmet, A. (2022, April 21). Churn for Bank Customers. Available online: https:\/\/www.kaggle.com\/datasets\/mathchi\/churn-for-bank-customers."},{"key":"ref_66","unstructured":"Elawady, A., and Iskander, G. (2022, April 21). Dry Beans Classification. Available online: https:\/\/kaggle.com\/competitions\/dry-beans-classification-iti-ai-pro-intake01."},{"key":"ref_67","first-page":"1","article-title":"A novel performance measure for machine learning classification","volume":"13","author":"Gong","year":"2021","journal-title":"Int. J. Manag. Inf. Technol. IJMIT"},{"key":"ref_68","doi-asserted-by":"crossref","first-page":"78","DOI":"10.3389\/frobt.2022.876814","article-title":"An invitation to greater use of Matthews correlation coefficient (MCC) in robotics and artificial intelligence","volume":"9","author":"Chicco","year":"2022","journal-title":"Front. Robot. 
AI"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/7\/1\/55\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:00:50Z","timestamp":1760122850000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/7\/1\/55"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,22]]},"references-count":68,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,3]]}},"alternative-id":["bdcc7010055"],"URL":"https:\/\/doi.org\/10.3390\/bdcc7010055","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,22]]}}}