{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,21]],"date-time":"2026-02-21T06:28:52Z","timestamp":1771655332339,"version":"3.50.1"},"reference-count":28,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2023,10,11]],"date-time":"2023-10-11T00:00:00Z","timestamp":1696982400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Institute on Aging of the National Institutes of Health","award":["R01AG058537"],"award-info":[{"award-number":["R01AG058537"]}]},{"name":"National Institute on Aging of the National Institutes of Health","award":["R01AG054073"],"award-info":[{"award-number":["R01AG054073"]}]},{"name":"National Institute on Aging of the National Institutes of Health","award":["R01AG058533"],"award-info":[{"award-number":["R01AG058533"]}]},{"name":"National Institute on Aging of the National Institutes of Health","award":["3R01AG058533-02S1"],"award-info":[{"award-number":["3R01AG058533-02S1"]}]},{"name":"National Institute on Aging of the National Institutes of Health","award":["P41EB015922"],"award-info":[{"award-number":["P41EB015922"]}]},{"name":"National Institute on Aging of the National Institutes of Health","award":["U19AG078109"],"award-info":[{"award-number":["U19AG078109"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Informatics"],"abstract":"<jats:p>The Health and Aging Brain Study\u2013Health Disparities (HABS\u2013HD) project seeks to understand the biological, social, and environmental factors that impact brain aging among diverse communities. A common issue for HABS\u2013HD is missing data. It is impossible to achieve accurate machine learning (ML) if data contain missing values. Therefore, developing a new imputation methodology has become an urgent task for HABS\u2013HD. The three missing data assumptions, (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR), necessitate distinct imputation approaches for each mechanism of missingness. Several popular imputation methods, including listwise deletion, min, mean, predictive mean matching (PMM), classification and regression trees (CART), and missForest, may result in biased outcomes and reduced statistical power when applied to downstream analyses such as testing hypotheses related to clinical variables or utilizing machine learning to predict AD or MCI. Moreover, these commonly used imputation techniques can produce unreliable estimates of missing values if they do not account for the missingness mechanisms or if there is an inconsistency between the imputation method and the missing data mechanism in HABS\u2013HD. Therefore, we proposed a three-step workflow to handle missing data in HABS\u2013HD: (1) missing data evaluation, (2) imputation, and (3) imputation evaluation. First, we explored the missingness in HABS\u2013HD. Then, we developed a machine learning-based multiple imputation method (MLMI) for imputing missing values. We built four ML-based imputation models (support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and lasso and elastic-net regularized generalized linear model (GLMNET)) and adapted the four ML-based models to multiple imputations using the simple averaging method. Lastly, we evaluated and compared MLMI with other common methods. Our results showed that the three-step workflow worked well for handling missing values in HABS\u2013HD and the ML-based multiple imputation method outperformed other common methods in terms of prediction performance and change in distribution and correlation. The choice of missing handling methodology has a significant impact on the accompanying statistical analyses of HABS\u2013HD. The conceptual three-step workflow and the ML-based multiple imputation method perform well for our Alzheimer\u2019s disease models. They can also be applied to other disease data analyses.<\/jats:p>","DOI":"10.3390\/informatics10040077","type":"journal-article","created":{"date-parts":[[2023,10,11]],"date-time":"2023-10-11T08:18:57Z","timestamp":1697012337000},"page":"77","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["A Machine Learning-Based Multiple Imputation Method for the Health and Aging Brain Study\u2013Health Disparities"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3502-1808","authenticated-orcid":false,"given":"Fan","family":"Zhang","sequence":"first","affiliation":[{"name":"Institute for Translational Research, University of North Texas Health Science Center, Fort Worth, TX 76107, USA"},{"name":"Department of Family Medicine, University of North Texas Health Science Center, Fort Worth, TX 76107, USA"}]},{"given":"Melissa","family":"Petersen","sequence":"additional","affiliation":[{"name":"Institute for Translational Research, University of North Texas Health Science Center, Fort Worth, TX 76107, USA"},{"name":"Department of Family Medicine, University of North Texas Health Science Center, Fort Worth, TX 76107, USA"}]},{"given":"Leigh","family":"Johnson","sequence":"additional","affiliation":[{"name":"Institute for Translational Research, University of North Texas Health Science Center, Fort Worth, TX 76107, USA"},{"name":"Department of Pharmacology and Neuroscience, University of North Texas Health Science Center, Fort Worth, TX 76107, USA"}]},{"given":"James","family":"Hall","sequence":"additional","affiliation":[{"name":"Institute for Translational Research, University of North Texas Health Science Center, Fort Worth, TX 76107, USA"},{"name":"Department of Family Medicine, University of North Texas Health Science Center, Fort Worth, TX 76107, USA"}]},{"given":"Raymond F.","family":"Palmer","sequence":"additional","affiliation":[{"name":"Department of Family and Community Medicine, University of Texas Health Science Center, San Antonio, TX 78229, USA"}]},{"given":"Sid E.","family":"O\u2019Bryant","sequence":"additional","affiliation":[{"name":"Institute for Translational Research, University of North Texas Health Science Center, Fort Worth, TX 76107, USA"},{"name":"Department of Family Medicine, University of North Texas Health Science Center, Fort Worth, TX 76107, USA"}]},{"name":"on behalf of the Health and Aging Brain Study (HABS\u2013HD) Study Team","sequence":"additional","affiliation":[]}],"member":"1968","published-online":{"date-parts":[[2023,10,11]]},"reference":[{"key":"ref_1","unstructured":"(2023, February 13). Alzheimer\u2019s Association 2022 Alzheimer\u2019s Disease Facts and Figures. Available online: https:\/\/www.alz.org\/alzheimers-dementia\/facts-figures."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1243","DOI":"10.3233\/JAD-210543","article-title":"Proteomic Profiles of Neurodegeneration Among Mexican Americans and Non-Hispanic Whites in the HABS-HD Study","volume":"86","author":"Zhang","year":"2022","journal-title":"J. Alzheimers Dis."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"549","DOI":"10.1146\/annurev.psych.58.110405.085530","article-title":"Missing data analysis: Making it work in the real world","volume":"60","author":"Graham","year":"2009","journal-title":"Annu. Rev. Psychol."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1037\/1082-989X.7.2.147","article-title":"Missing data: Our view of the state of the art","volume":"7","author":"Schafer","year":"2002","journal-title":"Psychol. Methods"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1093\/biomet\/63.3.581","article-title":"Inference and Missing Data","volume":"63","author":"Rubin","year":"1976","journal-title":"Biometrika"},{"key":"ref_6","unstructured":"Enders, C.K. (2022). Applied Missing Data Analysis, Guilford Press. Available online: https:\/\/www.guilford.com\/books\/Applied-Missing-Data-Analysis\/Craig-Enders\/9781462549863."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Little, R.J.A., and Rubin, D.B. (2019). Statistical Analysis with Missing Data, Wiley.","DOI":"10.1002\/9781119482260"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"473","DOI":"10.1080\/01621459.1996.10476908","article-title":"Multiple Imputation After 18+ Years","volume":"91","author":"Rubin","year":"1996","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1080\/07350015.1986.10509497","article-title":"Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations","volume":"4","author":"Rubin","year":"1986","journal-title":"J. Bus. Econ. Stat."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"287","DOI":"10.1080\/07350015.1988.10509663","article-title":"Missing-Data Adjustments in Large Surveys","volume":"6","author":"Roderick","year":"1988","journal-title":"J. Bus. Econ. Stat."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Marshall, A., Altman, D.G., and Holder, R.L. (2010). Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: A resampling study. BMC Med. Res. Methodol., 10.","DOI":"10.1186\/1471-2288-10-112"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Marshall, A., Altman, D.G., Royston, P., and Holder, R.L. (2010). Comparison of techniques for handling missing covariate data within prognostic modelling studies: A simulation study. BMC Med. Res. Methodol., 10.","DOI":"10.1186\/1471-2288-10-7"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Mirza, B., Wang, W., Wang, J., Choi, H., Chung, N.C., and Ping, P. (2019). Machine Learning and Integrative Analysis of Biomedical Big Data. Genes, 10.","DOI":"10.3390\/genes10020087"},{"key":"ref_14","unstructured":"Breiman, L., Freidman, J.H., Olshen, R.A., and Stone, C.J. (1984). CART: Classification and Regression Trees, Routledge."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"112","DOI":"10.1093\/bioinformatics\/btr597","article-title":"MissForest--non-parametric missing value imputation for mixed-type data","volume":"28","author":"Stekhoven","year":"2012","journal-title":"Bioinformatics"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"680054","DOI":"10.3389\/fpubh.2021.680054","article-title":"The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model","volume":"9","author":"Guo","year":"2021","journal-title":"Front. Public. Health"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wang, H., Tang, J., Wu, M., Wang, X., and Zhang, T. (2022). Application of machine learning missing data imputation techniques in clinical decision making: Taking the discharge assessment of patients with spontaneous supratentorial intracerebral hemorrhage as an example. BMC Med. Inform. Decis. Mak., 22.","DOI":"10.1186\/s12911-022-01752-6"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"939","DOI":"10.1212\/WNL.34.7.939","article-title":"Clinical diagnosis of Alzheimer\u2019s disease: Report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer\u2019s Disease","volume":"34","author":"McKhann","year":"1984","journal-title":"Neurology"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1186\/alzrt138","article-title":"Alzheimer\u2019s disease diagnostic criteria: Practical applications","volume":"4","author":"Cummings","year":"2012","journal-title":"Alzheimers Res. Ther."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1198","DOI":"10.1001\/archneur.1994.00540240042014","article-title":"Reliability and validity of NINCDS-ADRDA criteria for Alzheimer\u2019s disease. The National Institute of Mental Health Genetics Initiative","volume":"51","author":"Blacker","year":"1994","journal-title":"Arch. Neurol."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1002\/alz.12382","article-title":"A blood screening tool for detecting mild cognitive impairment and Alzheimer\u2019s disease among community-dwelling Mexican Americans and non-Hispanic Whites: A method for increasing representation of diverse populations in clinical research","volume":"18","author":"Zhang","year":"2022","journal-title":"Alzheimers Dement."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Zhang, F., Petersen, M., Johnson, L., Hall, J., and O\u2019Bryant, S.E. (2022). Combination of Serum and Plasma Biomarkers Could Improve Prediction Performance for Alzheimer\u2019s Disease. Genes, 13.","DOI":"10.3390\/genes13101738"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1325","DOI":"10.3233\/JAD-141041","article-title":"Validation of a serum screen for Alzheimer\u2019s disease across assay platforms, species, and tissues","volume":"42","author":"Xiao","year":"2014","journal-title":"J. Alzheimers Dis."},{"key":"ref_24","first-page":"83","article-title":"A blood screening test for Alzheimer\u2019s disease","volume":"3","author":"Edwards","year":"2016","journal-title":"Alzheimers Dement."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1198","DOI":"10.1080\/01621459.1988.10478722","article-title":"A Test of Missing Completely at Random for Multivariate Data with Missing Values","volume":"83","author":"Little","year":"1988","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_26","unstructured":"(2023, October 02). Available online: https:\/\/github.com\/microsat2018\/figure1."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1093\/gerona\/glac169","article-title":"Plasma Biomarkers of Alzheimer\u2019s Disease Are Associated with Physical Functioning Outcomes Among Cognitively Normal Adults in the Multiethnic HABS-HD Cohort","volume":"78","author":"Petersen","year":"2023","journal-title":"J. Gerontol. A Biol. Sci. Med. Sci."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"3254","DOI":"10.1038\/s41598-020-74399-w","article-title":"Multimodal deep learning models for early detection of Alzheimer\u2019s disease stage","volume":"11","author":"Venugopalan","year":"2021","journal-title":"Sci. Rep."}],"container-title":["Informatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2227-9709\/10\/4\/77\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T21:04:49Z","timestamp":1760130289000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2227-9709\/10\/4\/77"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,11]]},"references-count":28,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["informatics10040077"],"URL":"https:\/\/doi.org\/10.3390\/informatics10040077","relation":{},"ISSN":["2227-9709"],"issn-type":[{"value":"2227-9709","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,11]]}}}