{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T18:43:15Z","timestamp":1772822595792,"version":"3.50.1"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,1,3]],"date-time":"2024-01-03T00:00:00Z","timestamp":1704240000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,1,3]],"date-time":"2024-01-03T00:00:00Z","timestamp":1704240000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In the domain of Medicare insurance fraud detection, handling imbalanced Big Data and high dimensionality remains a significant challenge. This study assesses the combined efficacy of two data reduction techniques: Random Undersampling (RUS), and a novel ensemble supervised feature selection method. The techniques are applied to optimize Machine Learning models for fraud identification in the classification of highly imbalanced Big Medicare Data. Utilizing two datasets from The Centers for Medicare &amp; Medicaid Services (CMS) labeled by the List of Excluded Individuals\/Entities (LEIE), our principal contribution lies in empirically demonstrating that data reduction techniques applied to these datasets significantly improves classification performance. The study employs a systematic experimental design to investigate various scenarios, ranging from using each technique in isolation to employing them in combination. The results indicate that a synergistic application of both techniques outperforms models that utilize all available features and data. Moreover, reduction in the number of features leads to more explainable models. Given the enormous financial implications of Medicare fraud, our findings not only offer computational advantages but also significantly enhance the effectiveness of fraud detection systems, thereby having the potential to improve healthcare services.<\/jats:p>","DOI":"10.1186\/s40537-023-00869-3","type":"journal-article","created":{"date-parts":[[2024,1,3]],"date-time":"2024-01-03T20:02:39Z","timestamp":1704312159000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":28,"title":["Data reduction techniques for highly imbalanced medicare Big Data"],"prefix":"10.1186","volume":"11","author":[{"given":"John T.","family":"Hancock","sequence":"first","affiliation":[]},{"given":"Huanjing","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Taghi M.","family":"Khoshgoftaar","sequence":"additional","affiliation":[]},{"given":"Qianxin","family":"Liang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,1,3]]},"reference":[{"key":"869_CR1","doi-asserted-by":"crossref","unstructured":"Boyd K, Eng KH, Page CD. Area under the precision-recall curve: point estimates and confidence intervals. Joint European conference on machine learning and knowledge discovery in databases, 451\u2013466. Springer 2013","DOI":"10.1007\/978-3-642-40994-3_29"},{"key":"869_CR2","doi-asserted-by":"crossref","unstructured":"Bekkar M, Djemaa HK, Alitouche TA. Evaluation measures for models assessment over imbalanced data sets. J Inf Eng Appl. 2013;3(10).","DOI":"10.5121\/ijdkp.2013.3402"},{"issue":"1","key":"869_CR3","doi-asserted-by":"publisher","first-page":"42","DOI":"10.1186\/s40537-023-00724-5","volume":"10","author":"JT Hancock","year":"2023","unstructured":"Hancock JT, Khoshgoftaar TM, Johnson JM. Evaluating classifier performance with highly imbalanced big data. J Big Data. 2023;10(1):42.","journal-title":"J Big Data"},{"key":"869_CR4","doi-asserted-by":"crossref","unstructured":"Hancock J, Khoshgoftaar TM, Johnson JM. Informative evaluation metrics for highly imbalanced big data classification. In: 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1419\u20131426, 2022","DOI":"10.1109\/ICMLA55696.2022.00224"},{"key":"869_CR5","unstructured":"Civil Division, U.S. Department of Justice: Fraud Statistics, Overview. https:\/\/www.justice.gov\/opa\/press-release\/file\/1354316\/download, 2020"},{"key":"869_CR6","unstructured":"Centers for Medicare and Medicaid Services: 2019 Estimated Improper Payment Rates for Centers for Medicare & Medicaid Services (CMS) Programs (2019). https:\/\/www.cms.gov\/newsroom\/fact-sheets\/2019-estimated-improper-payment-rates-centers-medicare-medicaid-services-cms-programs"},{"key":"869_CR7","unstructured":"LEIE: Office of Inspector General Leie Downloadable Databases. https:\/\/oig.hhs.gov\/exclusions\/index.asp"},{"key":"869_CR8","first-page":"4785","volume":"7","author":"N Sateesh","year":"2020","unstructured":"Sateesh N, Kumar BP, Jyothi P. Supervised learning framework for healthcare fraud detection system with excluded provider labels. J Crit Rev. 2020;7:4785\u201394.","journal-title":"J Crit Rev"},{"key":"869_CR9","doi-asserted-by":"crossref","unstructured":"Mayaki MZA, Riveill M. Multiple inputs neural networks for fraud detection. In: 2022 International Conference on Machine Learning, Control, and Robotics (MLCR), pp. 8\u201313,2022. IEEE","DOI":"10.1109\/MLCR57210.2022.00011"},{"issue":"1","key":"869_CR10","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-018-0138-3","volume":"5","author":"M Herland","year":"2018","unstructured":"Herland M, Khoshgoftaar TM, Bauder RA. Big data fraud detection using multiple medicare data sources. J Big Data. 2018;5(1):1\u201321.","journal-title":"J Big Data"},{"key":"869_CR11","unstructured":"The Centers for Medicare and Medicaid Services: Medicare Durable Medical Equipment, Devices & Supplies \u2013 by Referring Provider and Service (2021). https:\/\/data.cms.gov\/provider-summary-by-type-of-service\/medicare-durable-medical-equipment-devices-supplies\/medicare-durable-medical-equipment-devices-supplies-by-referring-provider-and-service Accessed 2 July 2022."},{"issue":"2","key":"869_CR12","first-page":"223","volume":"9","author":"JA Lopo","year":"2023","unstructured":"Lopo JA, Hartomo KD. Evaluating sampling techniques for healthcare insurance fraud detection in imbalanced dataset. Jurnal Ilmiah Teknik Elektro Komputer dan Informatika (JITEKI). 2023;9(2):223\u201338.","journal-title":"Jurnal Ilmiah Teknik Elektro Komputer dan Informatika (JITEKI)"},{"key":"869_CR13","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","volume":"16","author":"NV Chawla","year":"2002","unstructured":"Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. Smote: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321\u201357.","journal-title":"J Artif Intell Res"},{"issue":"5","key":"869_CR14","doi-asserted-by":"publisher","first-page":"1113","DOI":"10.1007\/s10796-020-10022-7","volume":"22","author":"JM Johnson","year":"2020","unstructured":"Johnson JM, Khoshgoftaar TM. The effects of data sampling with deep learning and highly imbalanced big data. Inform Syst Front. 2020;22(5):1113\u201331.","journal-title":"Inform Syst Front"},{"key":"869_CR15","doi-asserted-by":"crossref","unstructured":"Hasanin T, Khoshgoftaar TM, Leevy J, Seliya N. Investigating random undersampling and feature selection on bioinformatics big data. In: 2019 IEEE Fifth International Conference on Big Data Computing Service and Applications (BigDataService), pp. 346\u2013356, 2019. IEEE","DOI":"10.1109\/BigDataService.2019.00063"},{"issue":"1","key":"869_CR16","doi-asserted-by":"publisher","first-page":"154","DOI":"10.1186\/s40537-023-00821-5","volume":"10","author":"JT Hancock","year":"2023","unstructured":"Hancock JT, Bauder RA, Wang H, Khoshgoftaar TM. Explainable machine learning models for medicare fraud detection. J Big Data. 2023;10(1):154.","journal-title":"J Big Data"},{"issue":"4","key":"869_CR17","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1007\/s42979-023-01809-x","volume":"4","author":"JM Johnson","year":"2023","unstructured":"Johnson JM, Khoshgoftaar TM. Data-centric ai for healthcare fraud detection. SN Comp Sci. 2023;4(4):389.","journal-title":"SN Comp Sci"},{"key":"869_CR18","unstructured":"The Centers for Medicare and Medicaid Services: Medicare Physician & Other Practitioners \u2013 by Provider Data Dictionary. https:\/\/data.cms.gov\/resources\/medicare-physician-other-practitioners-by-provider-data-dictionary 2021."},{"key":"869_CR19","unstructured":"The Centers for Medicare and Medicaid Services: Medicare Physician & Other Practitioners \u2013 by Provider (2021). https:\/\/data.cms.gov\/provider-summary-by-type-of-service\/medicare-physician-other-practitioners\/medicare-physician-other-practitioners-by-provider Accessed 2 July 2022."},{"key":"869_CR20","unstructured":"The Centers for Medicare and Medicaid Services: Medicare Part D Prescribers \u2013 by Provider and Drug Data Dictionary (2021). https:\/\/data.cms.gov\/resources\/medicare-part-d-prescribers-by-provider-and-drug-data-dictionary Accessed 16 April 2022."},{"key":"869_CR21","unstructured":"The Centers for Medicare and Medicaid Services: Medicare Part D Prescribers \u2013 by Provider Data Dictionary (2020). https:\/\/data.cms.gov\/resources\/medicare-part-d-prescribers-by-provider-data-dictionary Accessed 27 May 2023."},{"key":"869_CR22","unstructured":"The Centers for Medicare and Medicaid Services: Medicare Part D Prescribers \u2013 by Provider and Drug (2021). https:\/\/data.cms.gov\/provider-summary-by-type-of-service\/medicare-part-d-prescribers\/medicare-part-d-prescribers-by-provider-and-drug Accessed 16 April 2022."},{"key":"869_CR23","unstructured":"The Centers for Medicare and Medicaid Services: Medicare Part D Prescribers - by Provider (2021). https:\/\/data.cms.gov\/provider-summary-by-type-of-service\/medicare-part-d-prescribers\/medicare-part-d-prescribers-by-provider Accessed 16 April 2022."},{"key":"869_CR24","unstructured":"The Centers for Medicare and Medicaid Services: Medicare Physician & Other Practitioners \u2013 by Provider and Service Data Dictionary. https:\/\/data.cms.gov\/resources\/medicare-physician-other-practitioners-by-provider-and-service-data-dictionary 2021."},{"key":"869_CR25","unstructured":"The Centers for Medicare and Medicaid Services: Medicare Physician & Other Practitioners \u2013 by Provider and Service (2021). https:\/\/data.cms.gov\/provider-summary-by-type-of-service\/medicare-physician-other-practitioners\/medicare-physician-other-practitioners-by-provider-and-service Accessed 2 July 2022."},{"key":"869_CR26","doi-asserted-by":"crossref","unstructured":"Bauder RA, Khoshgoftaar TM. A novel method for fraudulent medicare claims detection from expected payment deviations (application paper). In: 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), pp. 11\u201319 2016. IEEE.","DOI":"10.1109\/IRI.2016.11"},{"key":"869_CR27","doi-asserted-by":"crossref","unstructured":"Chen T, Guestrin C. Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD \u201916, 2016.","DOI":"10.1145\/2939672.2939785"},{"key":"869_CR28","first-page":"3146","volume":"30","author":"G Ke","year":"2017","unstructured":"Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y. Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Inform Proc Syst. 2017;30:3146\u201354.","journal-title":"Adv Neural Inform Proc Syst"},{"issue":"1","key":"869_CR29","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1007\/s10994-006-6226-1","volume":"63","author":"P Geurts","year":"2006","unstructured":"Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63(1):3\u201342.","journal-title":"Mach Learn"},{"issue":"1","key":"869_CR30","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman L. Random forests. Mach Learn. 2001;45(1):5\u201332.","journal-title":"Mach Learn"},{"key":"869_CR31","unstructured":"Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. Catboost: unbiased boosting with categorical features. Advances in neural information processing systems 2018;31."},{"issue":"1","key":"869_CR32","first-page":"191","volume":"41","author":"S Le Cessie","year":"1992","unstructured":"Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J Royal Stat Soc Series C Appl Stat. 1992;41(1):191\u2013201.","journal-title":"J Royal Stat Soc Series C Appl Stat"},{"key":"869_CR33","volume-title":"Classification and Regression Trees","author":"L Breiman","year":"1984","unstructured":"Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. US: Taylor & Francis; 1984."},{"issue":"2","key":"869_CR34","doi-asserted-by":"publisher","first-page":"123","DOI":"10.1007\/BF00058655","volume":"24","author":"L Breiman","year":"1996","unstructured":"Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123\u201340.","journal-title":"Mach Learn"},{"key":"869_CR35","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1201\/9780429246593","volume-title":"An Introduction to the Bootstrap","author":"B Efron","year":"1994","unstructured":"Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Boca Raton: CRC Press; 1994. p. 5\u20136."},{"key":"869_CR36","doi-asserted-by":"publisher","first-page":"1189","DOI":"10.1214\/aos\/1013203451","volume":"29","author":"JH Friedman","year":"2001","unstructured":"Friedman JH. Greedy function approximation: a gradient boosting machine. Ann stat. 2001;29:1189\u2013232.","journal-title":"Ann stat"},{"issue":"1","key":"869_CR37","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-019-0274-4","volume":"6","author":"T Hasanin","year":"2019","unstructured":"Hasanin T, Khoshgoftaar TM, Leevy JL, Bauder RA. Severely imbalanced big data challenges: investigating data sampling approaches. J Big Data. 2019;6(1):1\u201325.","journal-title":"J Big Data"},{"key":"869_CR38","doi-asserted-by":"publisher","DOI":"10.4135\/9781412983327","volume-title":"Analysis of Variance","author":"GR Iversen","year":"1987","unstructured":"Iversen GR, Norpoth H. Analysis of Variance, vol. 1. Newbury Park: Sage; 1987."},{"key":"869_CR39","doi-asserted-by":"publisher","first-page":"99","DOI":"10.2307\/3001913","volume":"5","author":"JW Tukey","year":"1949","unstructured":"Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;5:99\u2013114.","journal-title":"Biometrics"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-023-00869-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-023-00869-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-023-00869-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,1,3]],"date-time":"2024-01-03T20:11:12Z","timestamp":1704312672000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-023-00869-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,3]]},"references-count":39,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["869"],"URL":"https:\/\/doi.org\/10.1186\/s40537-023-00869-3","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,3]]},"assertion":[{"value":"22 September 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 December 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 January 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"8"}}