{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,19]],"date-time":"2026-07-19T02:44:15Z","timestamp":1784429055318,"version":"3.55.0"},"reference-count":26,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,1,3]],"date-time":"2024-01-03T00:00:00Z","timestamp":1704240000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,1,3]],"date-time":"2024-01-03T00:00:00Z","timestamp":1704240000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Innovative Medicines Initiative 2 Joint Undertaking","award":["806968"],"award-info":[{"award-number":["806968"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>There is currently no consensus on the impact of class imbalance methods on the performance of clinical prediction models. We aimed to empirically investigate the impact of random oversampling and random undersampling, two commonly used class imbalance methods, on the internal and external validation performance of prediction models developed using observational health data.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Methods<\/jats:title>\n                <jats:p>We developed and externally validated prediction models for various outcomes of interest within a target population of people with pharmaceutically treated depression across four large observational health databases. We used three different classifiers (lasso logistic regression, random forest, XGBoost) and varied the target imbalance ratio. 
We evaluated the impact on model performance in terms of discrimination and calibration. Discrimination was assessed using the area under the receiver operating characteristic curve (AUROC) and calibration was assessed using calibration plots.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>We developed and externally validated a total of 1,566 prediction models. On internal and external validation, random oversampling and random undersampling generally did not result in higher AUROCs. Moreover, we found overestimated risks, although this miscalibration could largely be corrected by recalibrating the models towards the imbalance ratios in the original dataset.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>Overall, we found that random oversampling or random undersampling generally does not improve the internal and external validation performance of prediction models developed in large observational health databases. 
Based on our findings, we do not recommend applying random oversampling or random undersampling when developing prediction models in large observational health databases.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s40537-023-00857-7","type":"journal-article","created":{"date-parts":[[2024,1,3]],"date-time":"2024-01-03T20:02:39Z","timestamp":1704312159000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":85,"title":["Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data"],"prefix":"10.1186","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6769-3153","authenticated-orcid":false,"given":"Cynthia","family":"Yang","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Egill A.","family":"Fridgeirsson","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jan A.","family":"Kors","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jenna M.","family":"Reps","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Peter R.","family":"Rijnbeek","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2024,1,3]]},"reference":[{"issue":"9","key":"857_CR1","doi-asserted-by":"publisher","first-page":"1263","DOI":"10.1109\/TKDE.2008.239","volume":"21","author":"H He","year":"2009","unstructured":"He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263\u201384.","journal-title":"IEEE Trans Knowl Data Eng"},{"issue":"2","key":"857_CR2","first-page":"Article31","volume":"49","author":"P Branco","year":"2016","unstructured":"Branco P, Torgo L, Ribeiro RP. 
A survey of predictive modeling on imbalanced domains. ACM Comput Surv. 2016;49(2):Article31.","journal-title":"ACM Comput Surv"},{"key":"857_CR3","doi-asserted-by":"publisher","first-page":"983","DOI":"10.1093\/jamia\/ocac002","volume":"29","author":"C Yang","year":"2022","unstructured":"Yang C, Kors JA, Ioannou S, John LH, Markus AF, Rekkas A, et al. Trends in the conduct and reporting of clinical prediction model development and validation: a systematic review. J Am Med Inform Assoc. 2022;29:983\u20139.","journal-title":"J Am Med Inform Assoc"},{"issue":"8","key":"857_CR4","doi-asserted-by":"publisher","first-page":"1756","DOI":"10.1093\/jamia\/ocab048","volume":"28","author":"J Liu","year":"2021","unstructured":"Liu J, Wong ZSY, So HY, Tsui KL. Evaluating resampling methods and structured features to improve fall incident report identification by the severity level. J Am Med Inform Assoc. 2021;28(8):1756\u201364.","journal-title":"J Am Med Inform Assoc"},{"key":"857_CR5","doi-asserted-by":"publisher","first-page":"103089","DOI":"10.1016\/j.jbi.2018.12.003","volume":"90","author":"S Fotouhi","year":"2019","unstructured":"Fotouhi S, Asadi S, Kattan MW. A comprehensive data level analysis for cancer diagnosis on imbalanced data. J Biomed Inform. 2019;90:103089.","journal-title":"J Biomed Inform"},{"key":"857_CR6","doi-asserted-by":"crossref","unstructured":"van Goorbergh Rvd M, Timmerman D, Van Calster B. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression. arXiv Preprint arXiv:220209101. 2022.","DOI":"10.1093\/jamia\/ocac093"},{"issue":"8","key":"857_CR7","doi-asserted-by":"publisher","first-page":"969","DOI":"10.1093\/jamia\/ocy032","volume":"25","author":"JM Reps","year":"2018","unstructured":"Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. 
Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. J Am Med Inform Assoc. 2018;25(8):969\u201375.","journal-title":"J Am Med Inform Assoc"},{"key":"857_CR8","doi-asserted-by":"publisher","DOI":"10.1016\/j.cmpb.2021.106394","volume":"211","author":"S Khalid","year":"2021","unstructured":"Khalid S, Yang C, Blacketer C, Duarte-Salles T, Fern\u00e1ndez-Bertol\u00edn S, Kim C, et al. A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data. Comput Methods Programs Biomed. 2021;211: 106394.","journal-title":"Comput Methods Programs Biomed"},{"issue":"1","key":"857_CR9","doi-asserted-by":"publisher","first-page":"102","DOI":"10.1186\/s12874-020-00991-3","volume":"20","author":"JM Reps","year":"2020","unstructured":"Reps JM, Williams RD, You SC, Falconer T, Minty E, Callahan A, et al. Feasibility and evaluation of a large-scale external validation approach for patient-level prediction in an international data network: validation of models predicting stroke in female patients newly diagnosed with atrial fibrillation. BMC Med Res Methodol. 2020;20(1):102.","journal-title":"BMC Med Res Methodol"},{"issue":"1","key":"857_CR10","doi-asserted-by":"publisher","first-page":"54","DOI":"10.1136\/amiajnl-2011-000376","volume":"19","author":"JM Overhage","year":"2012","unstructured":"Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc. 2012;19(1):54\u201360.","journal-title":"J Am Med Inform Assoc"},{"issue":"2","key":"857_CR11","doi-asserted-by":"publisher","first-page":"214","DOI":"10.1002\/sim.6787","volume":"35","author":"GS Collins","year":"2016","unstructured":"Collins GS, Ogundimu EO, Altman DG. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. 
Stat Med. 2016;35(2):214\u201326.","journal-title":"Stat Med"},{"issue":"1","key":"857_CR12","doi-asserted-by":"publisher","first-page":"42","DOI":"10.1186\/s40537-018-0151-6","volume":"5","author":"JL Leevy","year":"2018","unstructured":"Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data. 2018;5(1):42.","journal-title":"J Big Data"},{"issue":"1","key":"857_CR13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v033.i01","volume":"33","author":"JH Friedman","year":"2010","unstructured":"Friedman JH, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1\u201322.","journal-title":"J Stat Softw"},{"key":"857_CR14","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825\u201330.","journal-title":"J Mach Learn Res"},{"key":"857_CR15","doi-asserted-by":"crossref","unstructured":"Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; San Francisco, California, USA: Association for Computing Machinery; 2016. p. 785\u201394.","DOI":"10.1145\/2939672.2939785"},{"issue":"12","key":"857_CR16","doi-asserted-by":"publisher","DOI":"10.1136\/bmjopen-2021-050146","volume":"11","author":"JM Reps","year":"2021","unstructured":"Reps JM, Ryan P, Rijnbeek P. Investigating the impact of development and internal validation design when training prognostic models using a retrospective cohort in big US observational healthcare data. BMJ Open. 
2021;11(12): e050146.","journal-title":"BMJ Open"},{"key":"857_CR17","doi-asserted-by":"publisher","first-page":"363","DOI":"10.1186\/s12859-015-0784-9","volume":"16","author":"R Blagus","year":"2015","unstructured":"Blagus R, Lusa L. Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models. BMC Bioinform. 2015;16:363.","journal-title":"BMC Bioinform"},{"issue":"11","key":"857_CR18","doi-asserted-by":"publisher","first-page":"1389","DOI":"10.1109\/LSP.2014.2337313","volume":"21","author":"X Sun","year":"2014","unstructured":"Sun X, Xu W. Fast implementation of DeLong\u2019s Algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Process Lett. 2014;21(11):1389\u201393.","journal-title":"IEEE Signal Process Lett"},{"issue":"1","key":"857_CR19","doi-asserted-by":"publisher","first-page":"230","DOI":"10.1186\/s12916-019-1466-7","volume":"17","author":"B Van Calster","year":"2019","unstructured":"Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230.","journal-title":"BMC Med"},{"key":"857_CR20","doi-asserted-by":"publisher","first-page":"167","DOI":"10.1016\/j.jclinepi.2015.12.005","volume":"74","author":"B Van Calster","year":"2016","unstructured":"Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016;74:167\u201376.","journal-title":"J Clin Epidemiol"},{"key":"857_CR21","volume-title":"Clinical prediction models: a practical approach to development, validation, and updating","author":"EW Steyerberg","year":"2008","unstructured":"Steyerberg EW. Clinical prediction models: a practical approach to development, validation, and updating. 
New York: Springer, New York; 2008."},{"issue":"5","key":"857_CR22","doi-asserted-by":"publisher","first-page":"563","DOI":"10.1007\/s40264-022-01161-8","volume":"45","author":"RD Williams","year":"2022","unstructured":"Williams RD, Reps JM, Kors JA, Ryan PB, Steyerberg E, Verhamme KM, et al. Using iterative pairwise external validation to contextualize prediction model performance: a use case predicting 1-year heart failure risk in patients with diabetes across five data sources. Drug Saf. 2022;45(5):563\u201370.","journal-title":"Drug Saf"},{"issue":"6","key":"857_CR23","doi-asserted-by":"publisher","first-page":"1133","DOI":"10.1097\/SLA.0000000000003297","volume":"272","author":"CJ Chiew","year":"2020","unstructured":"Chiew CJ, Liu N, Wong TH, Sim YE, Abdullah HR. Utilizing machine learning methods for preoperative prediction of postsurgical mortality and intensive care unit admission. Ann Surg. 2020;272(6):1133\u20139.","journal-title":"Ann Surg"},{"key":"857_CR24","doi-asserted-by":"publisher","first-page":"234","DOI":"10.1016\/j.ijmedinf.2019.06.007","volume":"129","author":"L Liu","year":"2019","unstructured":"Liu L, Ni Y, Zhang N, Nick Pratap J. Mining patient-specific and contextual data with machine learning technologies to predict cancellation of children\u2019s Surgery. Int J Med Inform. 2019;129:234\u201341.","journal-title":"Int J Med Inform"},{"issue":"1","key":"857_CR25","doi-asserted-by":"publisher","first-page":"11862","DOI":"10.1038\/s41598-019-48263-5","volume":"9","author":"M Makino","year":"2019","unstructured":"Makino M, Yoshimoto R, Ono M, Itoko T, Katsuki T, Koseki A, et al. Artificial intelligence predicts the progression of diabetic kidney disease using big data machine learning. Sci Rep. 2019;9(1):11862.","journal-title":"Sci Rep"},{"key":"857_CR26","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","volume":"16","author":"NV Chawla","year":"2002","unstructured":"Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. 
SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321\u201357.","journal-title":"J Artif Intell Res"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-023-00857-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-023-00857-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-023-00857-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,19]],"date-time":"2024-02-19T09:13:39Z","timestamp":1708334019000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-023-00857-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,3]]},"references-count":26,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["857"],"URL":"https:\/\/doi.org\/10.1186\/s40537-023-00857-7","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,3]]},"assertion":[{"value":"27 January 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 December 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 January 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"All patient data included in this study were deidentified. 
The New England Institutional Review Board (IRB) determined that studies conducted in these data are exempt from study-specific IRB review, as these studies do not qualify as human subjects research. No experiments were conducted on humans in this study. The research methods were conducted in accordance with appropriate guidelines.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"JMR is an employee of Janssen Research and Development and shareholder of Johnson and Johnson. CY, EAF, JAK and PRR work for a research group who received unconditional research grants from Janssen Research and Development, none of which relate to the content of this work.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"7"}}