{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T09:09:41Z","timestamp":1763543381016,"version":"3.45.0"},"reference-count":46,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2025,11,10]],"date-time":"2025-11-10T00:00:00Z","timestamp":1762732800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Unidata S.p.A.","award":["CUP-B53C22003700004"],"award-info":[{"award-number":["CUP-B53C22003700004"]}]},{"name":"National Recovery and Resilience Plan of Italy","award":["38-033-26-DOT1326HYC-3052"],"award-info":[{"award-number":["38-033-26-DOT1326HYC-3052"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>The use of Artificial Intelligence (AI) in healthcare has significantly advanced early disease detection, enabling timely diagnosis and improved patient outcomes. This work proposes an end-to-end machine learning (ML) model for predicting diabetes based on data quality by following key steps, including advanced preprocessing by KNN imputation, intelligent feature selection, class imbalance with a hybrid approach of SMOTEENN, and multi-model classification. We rigorously compared nine ML classifiers, namely ensemble approaches (Random Forest, CatBoost, XGBoost), Support Vector Machines (SVM), and Logistic Regression (LR) for the prediction of diabetes disease. We evaluated performance on specificity, accuracy, recall, precision, and F1-score to assess generalizability and robustness. We employed SHapley Additive exPlanations (SHAP) for explainability, ranking, and identifying the most influential clinical risk factors. SHAP analysis identified glucose levels as the dominant predictor, followed by BMI and age, providing clinically interpretable risk factors that align with established medical knowledge. Results indicate that ensemble models have the highest performance among the others, and CatBoost performed the best, which achieved an ROC-AUC of 0.972, an accuracy of 0.968, and an F1-score of 0.971. The model was successfully validated on two larger datasets (CDC BRFSS and a 130-hospital dataset), confirming its generalizability. This data-driven design provides a reproducible platform for applying useful and interpretable ML models in clinical practice as a primary application for future Internet-of-Things-based smart healthcare systems.<\/jats:p>","DOI":"10.3390\/fi17110513","type":"journal-article","created":{"date-parts":[[2025,11,10]],"date-time":"2025-11-10T15:07:31Z","timestamp":1762787251000},"page":"513","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Machine Learning Pipeline for Early Diabetes Detection: A Comparative Study with Explainable AI"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5826-4488","authenticated-orcid":false,"given":"Yas","family":"Barzegar","sequence":"first","affiliation":[{"name":"Department of Management, Banking and Commodity Sciences, Sapienza University, 00161 Rome, Italy"}]},{"given":"Atrin","family":"Barzegar","sequence":"additional","affiliation":[{"name":"Mathematics, Physics and Applications to Engineering Department, Universit\u00e0 degli Studi della Campania \u201cLuigi Vanvitelli\u201d, Viale Lincoln n\u00b05, 81100 Caserta, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0609-8796","authenticated-orcid":false,"given":"Francesco","family":"Bellini","sequence":"additional","affiliation":[{"name":"Department of Management, Banking and Commodity Sciences, Sapienza University, 00161 Rome, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7627-265X","authenticated-orcid":false,"given":"Fabrizio","family":"D'Ascenzo","sequence":"additional","affiliation":[{"name":"Department of Management, Banking and Commodity Sciences, Sapienza University, 00161 Rome, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1205-3658","authenticated-orcid":false,"given":"Irina","family":"Gorelova","sequence":"additional","affiliation":[{"name":"Department of Management, Banking and Commodity Sciences, Sapienza University, 00161 Rome, Italy"}]},{"given":"Patrizio","family":"Pisani","sequence":"additional","affiliation":[{"name":"Unidata S.p.A., Viale A. G. Eiffel, 00148 Rome, Italy"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,10]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"baaa010","DOI":"10.1093\/database\/baaa010","article-title":"Artificial intelligence with a multi-functional machine learning platform development for better healthcare and precision medicine","volume":"2020","author":"Ahmed","year":"2020","journal-title":"Database"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Caball\u00e9-Cervig\u00f3n, N., Castillo-Sequera, J.L., G\u00f3mez-Pulido, J.A., G\u00f3mez-Pulido, J.M., and Polo-Luque, M.L. (2020). Machine learning applied to diagnosis of human diseases: A systematic review. Appl. Sci., 10.","DOI":"10.3390\/app10155135"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1177\/2047487319881021","article-title":"The global epidemics of diabetes in the 21st century: Current situation and perspectives","volume":"26","author":"Standl","year":"2019","journal-title":"Eur. J. Prev. Cardiol."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"100263","DOI":"10.1016\/j.slast.2025.100263","article-title":"AI-driven predictive modeling for disease prevention and early detection","volume":"31","author":"Behera","year":"2025","journal-title":"SLAS Technol."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Chae, S., Kwon, S., and Lee, D. (2018). Predicting Infectious Disease Using Deep Learning and Big Data. Int. J. Environ. Res. Public Health, 15.","DOI":"10.3390\/ijerph15081596"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Nilashi, M., Asadi, S., Abumalloh, R.A., Samad, S., Ghabban, F., Supriyanto, E., and Osman, R. (2021). Sustainability performance assessment using self-organizing maps (SOM) and classification and ensembles of regression trees (CART). Sustainability, 13.","DOI":"10.3390\/su13073870"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Bhatt, C.M., Patel, P., Ghetia, T., and Mazzeo, P.L. (2023). Effective heart disease prediction using machine learning techniques. Algorithms, 16.","DOI":"10.3390\/a16020088"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"290","DOI":"10.1007\/s42979-020-00305-w","article-title":"Breast cancer prediction: A comparative study using machine learning techniques","volume":"1","author":"Islam","year":"2020","journal-title":"SN Comput. Sci."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Sakib, S., Tanzeem, A.K., Tasawar, I.K., Shorna, F., Siddique, M.A.B., and Alam, S.B. (2021, January 27\u201330). Blood cancer recognition based on discriminant gene expressions: A comparative analysis of optimized machine learning algorithms. Proceedings of the 2021 IEEE 12th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada.","DOI":"10.1109\/IEMCON53756.2021.9623210"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"121574","DOI":"10.1109\/ACCESS.2023.3328909","article-title":"A robust heart disease prediction system using hybrid deep neural networks","volume":"11","author":"Amin","year":"2023","journal-title":"IEEE Access"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"14723","DOI":"10.1007\/s00521-021-06124-1","article-title":"An intelligent heart disease prediction system based on swarm-artificial neural network","volume":"35","author":"Nandy","year":"2023","journal-title":"Neural Comput. Appl."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1007\/s44163-023-00049-5","article-title":"Evaluation of artificial intelligence techniques in disease diagnosis and prediction","volume":"3","author":"Kaplanoglu","year":"2023","journal-title":"Discov. Artif. Intell."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Kor, C.-T., Li, Y.-R., Lin, P.-R., Lin, S.-H., Wang, B.-Y., and Lin, C.-H. (2022). Explainable Machine Learning Model for Predicting First-Time Acute Exacerbation in Patients with Chronic Obstructive Pulmonary Disease. J. Pers. Med., 12.","DOI":"10.3390\/jpm12020228"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1319","DOI":"10.1007\/s00521-021-06431-7","article-title":"Deep convolutional neural network for diabetes mellitus prediction","volume":"34","author":"Alex","year":"2022","journal-title":"Neural Comput. Appl."},{"key":"ref_15","first-page":"54","article-title":"Diabetes prediction using artificial neural network","volume":"121","year":"2018","journal-title":"Int. J. Adv. Sci. Technol."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"El-Bashbishy, A.E.S., and El-Bakry, H.M. (2024). Pediatric diabetes prediction using deep learning. Sci. Rep., 14.","DOI":"10.1038\/s41598-024-51438-4"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wei, S., Zhao, X., and Miao, C. (2018, January 5\u20138). A comprehensive exploration to the machine learning techniques for diabetes identification. Proceedings of the 2018 IEEE 4th World Forum on Internet of Things (WF-IoT), Singapore.","DOI":"10.1109\/WF-IoT.2018.8355130"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Chandgude, N., and Pawar, S. (2016, January 12\u201313). Diagnosis of diabetes using Fuzzy inference System. Proceedings of the 2016 International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India.","DOI":"10.1109\/ICCUBEA.2016.7860001"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Barzegar, Y., Gorelova, I., Bellini, F., and D\u2019ascenzo, F. (2023). Drinking water quality assessment using a fuzzy inference system method: A case study of Rome (Italy). Int. J. Environ. Res. Public Health, 20.","DOI":"10.3390\/ijerph20156522"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"4431","DOI":"10.1016\/j.procs.2024.09.293","article-title":"Fuzzy inference system for risk assessment of wheat flour product manufacturing systems","volume":"246","author":"Barzegar","year":"2024","journal-title":"Procedia Comput. Sci."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Bellini, F., Barzegar, Y., Barzegar, A., Marrone, S., Verde, L., and Pisani, P. (2025). Sustainable water quality evaluation based on cohesive Mamdani and Sugeno fuzzy inference system in Tivoli (Italy). Sustainability, 17.","DOI":"10.3390\/su17020579"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Barzegar, Y., Barzegar, A., Marrone, S., Verde, L., Bellini, F., and Pisani, P. (2025). Computational Risk Assessment in Water Distribution Network. International Conference on Computational Science, Springer Nature.","DOI":"10.1007\/978-3-031-97567-7_14"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Barzegar, A., Campanile, L., Marrone, S., Marulli, F., Verde, L., and Mastroianni, M. (2024, January 8\u201311). Fuzzy-based severity evaluation in privacy problems: An application to healthcare. Proceedings of the 2024 19th European Dependable Computing Conference (EDCC), Leuven, Belgium.","DOI":"10.1109\/EDCC61798.2024.00037"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Naseem, A., Habib, R., Naz, T., Atif, M., Arif, M., and Allaoua Chelloug, S. (2022). Novel Internet of Things based approach toward diabetes prediction using deep learning models. Front. Public Health, 10.","DOI":"10.3389\/fpubh.2022.914106"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"e7219","DOI":"10.1002\/cpe.7219","article-title":"Machine learning and IoT-based model for patient monitoring and early prediction of diabetes","volume":"34","author":"Verma","year":"2022","journal-title":"Concurr. Comput. Pract. Exp."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1417","DOI":"10.1007\/s10115-022-01679-4","article-title":"Data pricing in machine learning pipelines","volume":"64","author":"Cong","year":"2022","journal-title":"Knowl. Inf. Syst."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"3055","DOI":"10.1109\/TPAMI.2021.3056950","article-title":"Predicting machine learning pipeline runtimes in the context of automated machine learning","volume":"43","author":"Mohr","year":"2021","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_28","unstructured":"Olson, R.S., and Moore, J.H. (2016, January 24). TPOT: A tree-based pipeline optimization tool for automating machine learning. Proceedings of the Workshop on Automatic Machine Learning, New York, NY, USA."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Baccouche, A., Garcia-Zapirain, B., Castillo Olea, C., and Elmaghraby, A. (2020). Ensemble deep learning models for heart disease classification: A case study from Mexico. Information, 11.","DOI":"10.3390\/info11040207"},{"key":"ref_30","first-page":"134","article-title":"Application of machine learning k-nearest neighbour algorithm to predict diabetes","volume":"6","author":"Chandra","year":"2023","journal-title":"Int. J. Electr. Energy Power Syst. Eng."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"93","DOI":"10.2298\/SJEE2501093K","article-title":"Machine learning for early diabetes screening: A comparative study of algorithmic approaches","volume":"22","author":"Korkmaz","year":"2025","journal-title":"Serbian J. Electr. Eng."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Rezvani, S., Pourpanah, F., Lim, C.P., and Wu, Q.M. (2024). Methods for class-imbalanced learning with support vector machines: A review and an empirical evaluation. arXiv.","DOI":"10.1007\/s00500-024-09931-5"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1186\/s40537-020-00369-8","article-title":"CatBoost for big data: An interdisciplinary review","volume":"7","author":"Hancock","year":"2020","journal-title":"J. Big Data"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Lokker, C., Abdelkader, W., Bagheri, E., Parrish, R., Cotoi, C., Navarro, T., Germini, F., Linkins, L.A., Haynes, R.B., and Chu, L. (2024). Boosting efficiency in a clinical literature surveillance system with LightGBM. PLOS Digit. Health, 3.","DOI":"10.1371\/journal.pdig.0000299"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Parmar, A., Katariya, R., and Patel, V. (2018). A review on random forest: An ensemble classifier. International Conference on Intelligent Data Communication Technologies and Internet of Things, Springer International Publishing.","DOI":"10.1007\/978-3-030-03146-6_86"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"e13064","DOI":"10.1111\/exsy.13064","article-title":"Cardiovascular disease prediction using recursive feature elimination and gradient boosting classification techniques","volume":"39","author":"Theerthagiri","year":"2022","journal-title":"Expert Syst."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"4514","DOI":"10.1016\/j.jksuci.2020.10.013","article-title":"An optimized XGBoost based diagnostic system for effective prediction of heart disease","volume":"34","author":"Budholiya","year":"2022","journal-title":"J. King Saud Univ.-Comput. Inf. Sci."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Asra, T., Setiadi, A., Safudin, M., Lestari, E.W., Hardi, N., and Alamsyah, D.P. (2021, January 1\u20133). Implementation of AdaBoost algorithm in prediction of chronic kidney disease. Proceedings of the 2021 7th International Conference on Engineering, Applied Sciences and Technology (ICEAST), Pattaya, Thailand.","DOI":"10.1109\/ICEAST52143.2021.9426291"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"33","DOI":"10.31937\/ijnmt.v7i1.1340","article-title":"Logistic regression prediction model for cardiovascular disease","volume":"7","author":"Ciu","year":"2020","journal-title":"IJNMT (Int. J. New Media Technol.)"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Uddin, S., Haque, I., Lu, H., Moni, M.A., and Gide, E. (2022). Comparative performance analysis of K-nearest neighbour (KNN) algorithm and its different variants for disease prediction. Sci. Rep., 12.","DOI":"10.1038\/s41598-022-10358-x"},{"key":"ref_41","first-page":"8","article-title":"A comparative study of diabetes detection using the Pima Indian diabetes database","volume":"7","author":"Mousa","year":"2023","journal-title":"Methods"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"138","DOI":"10.52783\/cana.v31.1008","article-title":"Diabetic prediction based on machine learning using PIMA Indian dataset","volume":"31","author":"Salih","year":"2024","journal-title":"Commun. Appl. Nonlinear Anal."},{"key":"ref_43","first-page":"3074","article-title":"Prediction of diabetes in females of pima Indian heritage: A complete supervised learning approach","volume":"12","author":"Bhoi","year":"2021","journal-title":"Turk. J. Comput. Math. Educ."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1016\/j.inffus.2021.11.011","article-title":"Tabular data: Deep learning is not all you need","volume":"81","author":"Armon","year":"2022","journal-title":"Inf. Fusion"},{"key":"ref_45","first-page":"507","article-title":"Why do tree-based models still outperform deep learning on typical tabular data?","volume":"35","author":"Grinsztajn","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"24","DOI":"10.1038\/s41591-018-0316-z","article-title":"A guide to deep learning in healthcare","volume":"25","author":"Esteva","year":"2019","journal-title":"Nat. Med."}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/17\/11\/513\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T09:04:57Z","timestamp":1763543097000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/17\/11\/513"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,10]]},"references-count":46,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2025,11]]}},"alternative-id":["fi17110513"],"URL":"https:\/\/doi.org\/10.3390\/fi17110513","relation":{},"ISSN":["1999-5903"],"issn-type":[{"type":"electronic","value":"1999-5903"}],"subject":[],"published":{"date-parts":[[2025,11,10]]}}}