{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,20]],"date-time":"2026-05-20T21:24:30Z","timestamp":1779312270767,"version":"3.51.4"},"reference-count":36,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T00:00:00Z","timestamp":1750118400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Insurance is essential for financial risk protection, but claim management is complex and requires accurate classification and forecasting strategies. This study aimed to empirically evaluate the performance of classification algorithms, including Logistic Regression, Decision Tree, Random Forest, XGBoost, K-Nearest Neighbors, Support Vector Machine, and Na\u00efve Bayes to predict high insurance claims. The research analyses the variables of claims, vehicles, and insured parties that influence the classification of high-cost claims. This investigation utilizes a dataset comprising 802 observations of bodily injury claims from the motor liability portfolio of a private insurance company in Albania, covering the period from 2018 to 2024. In order to evaluate and compare the performance of the models, we employed evaluation criteria, including classification accuracy (CA), area under the curve (AUC), confusion matrix, and error rates. We found that Random Forest performs better, achieving the highest classification accuracy (CA = 0.8867, AUC = 0.9437) with the lowest error rates, followed by the XGBoost model. At the same time, logistic regression demonstrated the weakest performance. Key predictive factors in high claim classification include claim type, deferred period, vehicle brand and age of driver. These findings highlight the potential of machine learning models in improving claim classification and risk assessment and refine underwriting policy.<\/jats:p>","DOI":"10.3390\/data10060090","type":"journal-article","created":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T11:53:14Z","timestamp":1750161194000},"page":"90","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Machine Learning Applications for Predicting High-Cost Claims Using Insurance Data"],"prefix":"10.3390","volume":"10","author":[{"given":"Esmeralda","family":"Brati","sequence":"first","affiliation":[{"name":"Department of Statistics and Applied Informatics, Faculty of Economy, University of Tirana, 1010 Tirana, Albania"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3162-8987","authenticated-orcid":false,"given":"Alma","family":"Braimllari","sequence":"additional","affiliation":[{"name":"Department of Statistics and Applied Informatics, Faculty of Economy, University of Tirana, 1010 Tirana, Albania"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-2173-1470","authenticated-orcid":false,"given":"Ardit","family":"Gje\u00e7i","sequence":"additional","affiliation":[{"name":"Department of Economics and Finance, University of New York Tirana, 1000 Tirana, Albania"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,6,17]]},"reference":[{"key":"ref_1","first-page":"100012","article-title":"Application of Machine Learning and Data Visualization Techniques for Decision Support in the Insurance Sector","volume":"1","author":"Rawat","year":"2021","journal-title":"Int. J. Inf. Manag. Data Insights"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Poufinas, T., Gogas, P., Papadimitriou, T., and Zaganidis, E. (2023). Machine Learning in Forecasting Motor Insurance Claims. Risks, 11.","DOI":"10.2139\/ssrn.4610457"},{"key":"ref_3","first-page":"3","article-title":"Indemnification of non-material damages caused by road traffic accidents\u2014Ethical and financial aspects","volume":"4","author":"Prodanov","year":"2017","journal-title":"Econ. Arch."},{"key":"ref_4","first-page":"225","article-title":"A practitioners approach to individual claims models for bodily injury claims in German non-life insurance","volume":"110","author":"Wiedemann","year":"2021","journal-title":"Z. Gesamte Versicherungswiss."},{"key":"ref_5","first-page":"47","article-title":"A comparative study of data mining algorithms in the prediction of auto insurance claims","volume":"5","author":"Weerasinghe","year":"2016","journal-title":"Eur. Int. J. Sci. Technol."},{"key":"ref_6","first-page":"1","article-title":"Classification of the Insureds Using Integrated Machine Learning Algorithms: A Comparative Study","volume":"36","author":"Hanafy","year":"2022","journal-title":"Appl. Artif. Intell. AAI"},{"key":"ref_7","first-page":"3","article-title":"Motor Insurance Claim Status Prediction Using Machine Learning Techniques","volume":"12","author":"Alamir","year":"2021","journal-title":"Int. J. Adv. Comput. Sci. Appl."},{"key":"ref_8","first-page":"66","article-title":"Review of Statistical and Machine Learning Methods Applied in Private Health Insurance","volume":"41","author":"Brati","year":"2024","journal-title":"Albanian J. Econ. Bus."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"100102","DOI":"10.1016\/j.jfds.2023.100102","article-title":"The Applications of Big Data in the Insurance Industry: A Bibliometric and Systematic Review of Relevant Literature","volume":"9","author":"Ellili","year":"2023","journal-title":"J. Finance Data Sci."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Clemente, C., Guerreiro, G.R., and Bravo, J.M. (2023). Modelling Motor Insurance Claim Frequency and Severity Using Gradient Boosting. Risks, 11.","DOI":"10.3390\/risks11090163"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1012","DOI":"10.1016\/j.procs.2023.10.610","article-title":"Prediction of Health Insurance Claims Using Logistic Regression and XGBoost Methods","volume":"227","author":"Permai","year":"2023","journal-title":"Procedia Comput. Sci."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"17","DOI":"10.14445\/22315373\/IJMTT-V69I12P503","article-title":"Application of Bootstrap and Deterministic Methods for Reserving Claims in Private Health Insurance","volume":"69","author":"Brati","year":"2023","journal-title":"Int. J. Math. Trends Technol."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"130","DOI":"10.37394\/23207.2025.22.12","article-title":"A Comparative Analysis of Stochastic Approaches for Claims Reserving in Private Health Insurance","volume":"22","author":"Brati","year":"2025","journal-title":"WSEAS Trans. Bus. Econ."},{"key":"ref_14","first-page":"100516","article-title":"Machine Learning for an Explainable Cost Prediction of Medical Insurance","volume":"15","author":"Orji","year":"2024","journal-title":"Mach. Learn. Appl."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Vinora, A., Surya, V., Lloyds, E., Kathir Pandian, B., Deborah, R.N., and Gobinath, A. (2023, January 14\u201315). An Efficient Health Insurance Prediction System Using Machine Learning. Proceedings of the 2023 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India.","DOI":"10.1109\/ICSES60034.2023.10465334"},{"key":"ref_16","unstructured":"Maisog, J.M., Li, W., Xu, Y., Hurley, B., Shah, H., Lemberg, R., and Gutfraind, A. (2019). Using Massive Health Insurance Claims Data to Predict Very High-Cost Claimants: A Machine Learning Approach. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Langenberger, B., Schulte, T., and Groene, O. (2023). The application of machine learning to predict high-cost patients: A performance comparison of different models using healthcare claims data. PLoS ONE, 18.","DOI":"10.1371\/journal.pone.0279540"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"523","DOI":"10.1002\/asmb.2543","article-title":"Machine Learning Applications in Nonlife Insurance","volume":"36","author":"Grize","year":"2020","journal-title":"Appl. Stoch. Models Bus. Ind."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Alomair, G. (2024). Predictive Performance of Count Regression Models Versus Machine Learning Techniques: A Comparative Analysis Using an Automobile Insurance Claims Frequency Dataset. PLoS ONE, 19.","DOI":"10.1371\/journal.pone.0314975"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Hanafy, M., and Ming, R. (2021). Machine Learning Approaches for Auto Insurance Big Data. Risks, 9.","DOI":"10.3390\/risks9020042"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Nabrawi, E., and Alanazi, A. (2023). Fraud Detection in Healthcare Insurance Claims Using Machine Learning. Risks, 11.","DOI":"10.3390\/risks11090160"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Mavundla, K., Thakur, S., Adetiba, E., and Abayomi, A. (2024). Predicting Cross-Selling Health Insurance Products Using Machine-Learning Techniques. J. Comput. Inf. Syst., 1\u201318.","DOI":"10.1080\/08874417.2024.2395913"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Yego, N.K.K., Nkurunziza, J., and Kasozi, J. (2023). Predicting Health Insurance Uptake in Kenya Using Random Forest: An Analysis of Socio-Economic and Demographic Factors. PLoS ONE, 18.","DOI":"10.1371\/journal.pone.0294166"},{"key":"ref_24","unstructured":"Wang, Y. (2021, January 19\u201322). Predictive Machine Learning for Underwriting Life and Health Insurance. Proceedings of the Actuarial Society of South Africa\u2019s 2021 Virtual Convention, Virtual. Available online: https:\/\/www.actuarialsociety.org.za\/convention\/wp-content\/uploads\/2021\/10\/2021-ASSA-Wang-FIN-reduced.pdf."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Taha, A., Cosgrave, B., and Mckeever, S. (2022). Using Feature Selection with Machine Learning for Generation of Insurance Insights. Appl. Sci., 12.","DOI":"10.3390\/app12063209"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Adnan Aslam, M., Murtaza, F., Ehatisham Ul Haq, M., Yasin, A., and Ali, N. (2025). SAPEx-D: A Comprehensive Dataset for Predictive Analytics in Personalized Education Using Machine Learning. Data, 10.","DOI":"10.3390\/data10030027"},{"key":"ref_27","first-page":"3","article-title":"Categorical Feature Encoding Techniques for Improved Classifier Performance When Dealing with Imbalanced Data of Fraudulent Transactions","volume":"18","author":"Dzemyda","year":"2023","journal-title":"Int. J. Comput. Commun. Control"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"79","DOI":"10.32614\/RJ-2014-008","article-title":"ROSE: A Package for Binary Imbalanced Learning","volume":"6","author":"Lunardon","year":"2014","journal-title":"R J."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Lee, C.-W., Fu, M.-W., Wang, C.-C., and Azis, M.I. (2025). Evaluating Machine Learning Algorithms for Financial Fraud Detection: Insights from Indonesia. Mathematics, 13.","DOI":"10.3390\/math13040600"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Dhamo, Z., Gje\u00e7i, A., Zibri, A., and Prendi, X. (2025). Business Distress Prediction in Albania: An Analysis of Classification Methods. J. Risk Financ. Manag., 18.","DOI":"10.3390\/jrfm18030118"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"55","DOI":"10.21608\/jocc.2024.380150","article-title":"An Efficient Framework for Predicting Medical Insurance Costs Using Machine Learning","volume":"3","author":"AbdElminaam","year":"2024","journal-title":"J. Comput. Commun."},{"key":"ref_32","unstructured":"Therneau, T., Atkinson, B., Ripley, B., Venables, W.N., Liaw, A., Wiener, M., Chen, T., He, T., Benesty, M., and Tang, Y. (2023). R Packages Used for Classification Modeling: Rpart, randomForest, xgboost, e1071, and Class, R Foundation for Statistical Computing. Available online: https:\/\/cran.r-project.org."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Liu, C.-J., Huang, T.-S., Ho, P.-T., Huang, J.-C., and Hsieh, C.-T. (2024). Correction: Machine Learning-Based E-Commerce Platform Repurchase Customer Prediction Model. PLoS ONE, 19.","DOI":"10.1371\/journal.pone.0315518"},{"key":"ref_34","first-page":"608","article-title":"The Applicability of Credit Scoring Models in Emerging Economies: An Evidence from Jordan","volume":"11","author":"Abbod","year":"2018","journal-title":"Int. J. Islam. Middle East Financ. Manag."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Rajput, D., Wang, W.J., and Chen, C.C. (2023). Evaluation of a Decided Sample Size in Machine Learning Applications. BMC Bioinform., 24.","DOI":"10.1186\/s12859-023-05156-9"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Vabalas, A., Gowen, E., Poliakoff, E., and Casson, A.J. (2019). Machine Learning Algorithm Validation with a Limited Sample Size. PLoS ONE, 14.","DOI":"10.1371\/journal.pone.0224365"}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/6\/90\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:53:34Z","timestamp":1760032414000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/6\/90"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,17]]},"references-count":36,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2025,6]]}},"alternative-id":["data10060090"],"URL":"https:\/\/doi.org\/10.3390\/data10060090","relation":{},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,17]]}}}