{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T08:18:33Z","timestamp":1773476313994,"version":"3.50.1"},"reference-count":21,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,5,16]],"date-time":"2022-05-16T00:00:00Z","timestamp":1652659200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,5,16]],"date-time":"2022-05-16T00:00:00Z","timestamp":1652659200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["U01 CA235488"],"award-info":[{"award-number":["U01 CA235488"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["U01 CA235488"],"award-info":[{"award-number":["U01 CA235488"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["U01 CA235488"],"award-info":[{"award-number":["U01 CA235488"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["U01 CA235488"],"award-info":[{"award-number":["U01 CA235488"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["U01 CA235488"],"award-info":[{"award-number":["U01 CA235488"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Computational Bioscience NLM 
Training Grant","award":["T15 LM009451"],"award-info":[{"award-number":["T15 LM009451"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2022,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>When analyzing large datasets from high-throughput technologies, researchers often encounter missing quantitative measurements, which are particularly frequent in metabolomics datasets. Metabolomics data, which comprehensively profile metabolite abundances, are typically measured using mass spectrometry technologies that often introduce missingness via multiple mechanisms: (1) the metabolite signal may be smaller than the instrument limit of detection; (2) the conditions under which the data are collected and processed may lead to missing values; (3) missing values can be introduced randomly. Missingness resulting from mechanism (1) would be classified as Missing Not At Random (MNAR), that from mechanism (2) would be classified as Missing At Random (MAR), and that from mechanism (3) would be classified as Missing Completely At Random (MCAR). Two common approaches for handling missing data are the following: (1) omit missing data from the analysis; (2) impute the missing values. Both approaches may introduce bias and reduce statistical power in downstream analyses such as testing metabolite associations with clinical variables. Further, standard imputation methods in metabolomics often ignore the mechanisms causing missingness and inaccurately estimate missing values within a data set. We propose a mechanism-aware imputation algorithm that leverages a two-step approach in imputing missing values. First, we use a random forest classifier to classify the missingness mechanism for each missing value in the data set. Second, we impute each missing value using imputation algorithms that are specific to the predicted missingness mechanism (i.e., MAR\/MCAR or MNAR). 
Using complete data, we conducted simulations, where we imposed different missingness patterns within the data and tested the performance of combinations of imputation algorithms. Our proposed algorithm provided imputations closer to the original data than those using only one imputation algorithm for all the missing values. Consequently, our two-step approach was able to reduce bias for improved downstream analyses.<\/jats:p>","DOI":"10.1186\/s12859-022-04659-1","type":"journal-article","created":{"date-parts":[[2022,5,16]],"date-time":"2022-05-16T07:13:28Z","timestamp":1652685208000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":28,"title":["Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics"],"prefix":"10.1186","volume":"23","author":[{"given":"Jonathan P.","family":"Dekermanjian","sequence":"first","affiliation":[]},{"given":"Elin","family":"Shaddox","sequence":"additional","affiliation":[]},{"given":"Debmalya","family":"Nandy","sequence":"additional","affiliation":[]},{"given":"Debashis","family":"Ghosh","sequence":"additional","affiliation":[]},{"given":"Katerina","family":"Kechris","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,5,16]]},"reference":[{"key":"4659_CR1","doi-asserted-by":"publisher","first-page":"189","DOI":"10.1576\/toag.13.3.189.27672","volume":"13","author":"RP Horgan","year":"2011","unstructured":"Horgan RP, Kenny LC. \u2018Omic\u2019 technologies: genomics, transcriptomics, proteomics and metabolomics. Obstet Gynaecol. 2011;13:189\u201395.","journal-title":"Obstet Gynaecol"},{"key":"4659_CR2","doi-asserted-by":"publisher","DOI":"10.3390\/metabo9070123","author":"AH Emwas","year":"2019","unstructured":"Emwas AH, Roy R, McKay RT, et al. NMR spectroscopy for metabolomics research. Metabolites. 2019. 
https:\/\/doi.org\/10.3390\/metabo9070123.","journal-title":"Metabolites"},{"issue":"11","key":"4659_CR3","doi-asserted-by":"publisher","first-page":"592","DOI":"10.1016\/j.tree.2008.06.014","volume":"23","author":"S Nakagawa","year":"2008","unstructured":"Nakagawa S, Freckleton RP. Missing inaction: the dangers of ignoring missing data. Trends Ecol Evol. 2008;23(11):592\u20136. https:\/\/doi.org\/10.1016\/j.tree.2008.06.014.","journal-title":"Trends Ecol Evol"},{"issue":"1","key":"4659_CR4","doi-asserted-by":"publisher","first-page":"663","DOI":"10.1038\/s41598-017-19120-0","volume":"8","author":"R Wei","year":"2018","unstructured":"Wei R, Wang J, Su M, et al. Missing value imputation approach for mass spectrometry-based metabolomics data. Sci Rep. 2018;8(1):663. https:\/\/doi.org\/10.1038\/s41598-017-19120-0.","journal-title":"Sci Rep"},{"issue":"404","key":"4659_CR5","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1080\/01621459.1988.10478722","volume":"83","author":"RJA Little","year":"1988","unstructured":"Little RJA. A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 1988;83(404):5.","journal-title":"J Am Stat Assoc"},{"issue":"5","key":"4659_CR6","doi-asserted-by":"publisher","first-page":"402","DOI":"10.4097\/kjae.2013.64.5.402","volume":"64","author":"H Kang","year":"2013","unstructured":"Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64(5):402\u20136. https:\/\/doi.org\/10.4097\/kjae.2013.64.5.402.","journal-title":"Korean J Anesthesiol"},{"issue":"1","key":"4659_CR7","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1005973","volume":"14","author":"R Wei","year":"2018","unstructured":"Wei R, Wang J, Jia E, Chen T, Ni Y, Jia W. GSimp: a Gibbs sampler based left-censored missing value imputation approach for metabolomics studies. PLoS Comput Biol. 2018;14(1): e1005973. 
https:\/\/doi.org\/10.1371\/journal.pcbi.1005973.","journal-title":"PLoS Comput Biol"},{"issue":"2","key":"4659_CR8","doi-asserted-by":"publisher","first-page":"313","DOI":"10.1111\/rssc.12164","volume":"66","author":"FD Atem","year":"2017","unstructured":"Atem FD, Qian J, Maye JE, Johnson KA, Betensky RA. Linear regression with a randomly censored covariate: application to an Alzheimer\u2019s study. J R Stat Soc Ser C Appl Stat. 2017;66(2):313\u201328. https:\/\/doi.org\/10.1111\/rssc.12164.","journal-title":"J R Stat Soc Ser C Appl Stat"},{"issue":"12","key":"4659_CR9","doi-asserted-by":"publisher","first-page":"153","DOI":"10.1007\/s11306-018-1451-8","volume":"14","author":"JY Lee","year":"2018","unstructured":"Lee JY, Styczynski MP. NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics. 2018;14(12):153. https:\/\/doi.org\/10.1007\/s11306-018-1451-8.","journal-title":"Metabolomics"},{"issue":"1","key":"4659_CR10","doi-asserted-by":"publisher","first-page":"112","DOI":"10.1093\/bioinformatics\/btr597","volume":"28","author":"DJ Stekhoven","year":"2012","unstructured":"Stekhoven DJ, Buhlmann P. MissForest\u2013non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112\u20138. https:\/\/doi.org\/10.1093\/bioinformatics\/btr597.","journal-title":"Bioinformatics"},{"key":"4659_CR11","first-page":"27","volume":"45","author":"L Brieman","year":"2001","unstructured":"Brieman L. Random forests. Mach Learn. 2001;45:27.","journal-title":"Mach Learn"},{"issue":"1","key":"4659_CR12","doi-asserted-by":"publisher","first-page":"e66","DOI":"10.1093\/sysbio\/syw077","volume":"66","author":"J Lintusaari","year":"2017","unstructured":"Lintusaari J, Gutmann MU, Dutta R, Kaski S, Corander J. Fundamentals and recent developments in approximate Bayesian computation. Syst Biol. 2017;66(1):e66\u201382. 
https:\/\/doi.org\/10.1093\/sysbio\/syw077.","journal-title":"Syst Biol"},{"key":"4659_CR13","unstructured":"Team RC. R: a language and environment for statistical computing. https:\/\/www.R-project.org\/."},{"key":"4659_CR14","unstructured":"Kuhn M. caret: classification and regression training. R package version 6.0\u201388. https:\/\/CRAN.R-project.org\/package=caret."},{"issue":"1","key":"4659_CR15","doi-asserted-by":"publisher","first-page":"159","DOI":"10.1089\/nsm.2020.0009","volume":"3","author":"LA Gillenwater","year":"2020","unstructured":"Gillenwater LA, Pratte KA, Hobbs BD, et al. Plasma metabolomic signatures of chronic obstructive pulmonary disease and the impact of genetic variants on phenotype-driven modules. Netw Syst Med. 2020;3(1):159\u201381. https:\/\/doi.org\/10.1089\/nsm.2020.0009.","journal-title":"Netw Syst Med"},{"key":"4659_CR16","unstructured":"World Health Organization-Chronic obstructive pulmonary disease (COPD). 2020. https:\/\/www.who.int\/news-room\/fact-sheets\/detail\/chronic-obstructive-pulmonary-disease-(copd)."},{"issue":"1","key":"4659_CR17","doi-asserted-by":"publisher","first-page":"17132","DOI":"10.1038\/s41598-018-35372-w","volume":"8","author":"CI Cruickshank-Quinn","year":"2018","unstructured":"Cruickshank-Quinn CI, Jacobson S, Hughes G, et al. Metabolomics and transcriptomics pathway approach reveals outcome-specific perturbations in COPD. Sci Rep. 2018;8(1):17132. https:\/\/doi.org\/10.1038\/s41598-018-35372-w.","journal-title":"Sci Rep"},{"key":"4659_CR18","doi-asserted-by":"crossref","unstructured":"Fix E, Hodges JL. Discriminatory analysis, nonparametric discrimination: consistency properties. USAF School of Aviation Medicine, Randolph Field, Texas. 
1951;(Technical Report 4).","DOI":"10.1037\/e471672008-001"},{"issue":"9","key":"4659_CR19","doi-asserted-by":"publisher","first-page":"1164","DOI":"10.1093\/bioinformatics\/btm069","volume":"23","author":"W Stacklies","year":"2007","unstructured":"Stacklies W, Redestig H, Scholz M, Walther D, Selbig J. pcaMethods\u2013a bioconductor package providing PCA methods for incomplete data. Bioinformatics. 2007;23(9):1164\u20137. https:\/\/doi.org\/10.1093\/bioinformatics\/btm069.","journal-title":"Bioinformatics"},{"key":"4659_CR20","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","volume":"16","author":"NV Chawla","year":"2002","unstructured":"Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321\u201357.","journal-title":"J Artif Intell Res"},{"key":"4659_CR21","doi-asserted-by":"publisher","first-page":"32","DOI":"10.3109\/15412550903499522","volume":"7","author":"EA Regan","year":"2010","unstructured":"Regan EA, Hokanson JE, Murphy JR, et al. Genetic epidemiology of COPD (COPDGene) study design. COPD. 
2010;7:32\u201343.","journal-title":"COPD"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-022-04659-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-022-04659-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-022-04659-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,5,16]],"date-time":"2022-05-16T07:13:35Z","timestamp":1652685215000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-022-04659-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,16]]},"references-count":21,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,12]]}},"alternative-id":["4659"],"URL":"https:\/\/doi.org\/10.1186\/s12859-022-04659-1","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,16]]},"assertion":[{"value":"25 June 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 March 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 May 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The NIH-sponsored multicenter Genetic Epidemiology of COPD (COPDGene; ClinicalTrials.gov Identifier: NCT00608764) study was approved and reviewed by the institutional review board at all 
participating centers. All study participants provided written informed consent.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"179"}}