{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,13]],"date-time":"2026-03-13T12:15:35Z","timestamp":1773404135727,"version":"3.50.1"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,9,3]],"date-time":"2021-09-03T00:00:00Z","timestamp":1630627200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,9,3]],"date-time":"2021-09-03T00:00:00Z","timestamp":1630627200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BioData Mining"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>Missing data is a common issue in different fields, such as electronics, image processing, medical records and genomics. They can limit or even bias the posterior analysis. The data collection process can lead to different distribution, frequency, and structure of missing data points. They can be classified into four categories: Structurally Missing Data (SMD), Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). For the three later, and in the context of genomic data (especially non-coding data), we will discuss six imputation approaches using 31,245 variants collected from ClinVar and annotated with 13 genome-wide features.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>Random Forest and kNN algorithms showed the best performance in the evaluated dataset. Additionally, some features show robust imputation regardless of the algorithm (e.g. conservation scores phyloP7 and phyloP20), while other features show poor imputation across algorithms (e.g. PhasCons). We also developed an R package that helps to test which imputation method is the best for a particular data set.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>We found that Random Forest and kNN are the best imputation method for genomics data, including non-coding variants. Since Random Forest is computationally more challenging, kNN remains a more realistic approach. Future work on variant prioritization thru genomic screening tests could largely profit from this methodology.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s13040-021-00274-7","type":"journal-article","created":{"date-parts":[[2021,9,3]],"date-time":"2021-09-03T12:31:45Z","timestamp":1630672305000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":45,"title":["Evaluation of different approaches for missing data imputation on features associated to genomic data"],"prefix":"10.1186","volume":"14","author":[{"given":"Ben Omega","family":"Petrazzini","sequence":"first","affiliation":[]},{"given":"Hugo","family":"Naya","sequence":"additional","affiliation":[]},{"given":"Fernando","family":"Lopez-Bello","sequence":"additional","affiliation":[]},{"given":"Gustavo","family":"Vazquez","sequence":"additional","affiliation":[]},{"given":"Luc\u00eda","family":"Spangenberg","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,9,3]]},"reference":[{"key":"274_CR1","doi-asserted-by":"publisher","first-page":"549","DOI":"10.1146\/annurev.psych.58.110405.085530","volume":"60","author":"JW Graham","year":"2009","unstructured":"Graham JW. Missing data analysis: making it work in the real world. Annu Rev Psychol. 2009;60:549\u201376.","journal-title":"Annu Rev Psychol"},{"issue":"1","key":"274_CR2","doi-asserted-by":"publisher","first-page":"78","DOI":"10.1093\/bioinformatics\/btq613","volume":"27","author":"S Oh","year":"2011","unstructured":"Oh S, Kang DD, Brock GN, Tseng GC. Biological impact of missing-value imputation on downstream analyses of gene expression profiles. Bioinformatics. 2011;27(1):78\u201386.","journal-title":"Bioinformatics"},{"key":"274_CR3","doi-asserted-by":"publisher","unstructured":"Little R, Rubin D. Missing Data. International Encyclopedia of the Social & Behavioral Sciences, 2020, 2nd edition, volume 15, 2015. https:\/\/doi.org\/10.1016\/B978-0-08-097086-8.42082-9.","DOI":"10.1016\/B978-0-08-097086-8.42082-9"},{"key":"274_CR4","doi-asserted-by":"publisher","first-page":"581","DOI":"10.1093\/biomet\/63.3.581","volume":"63","author":"DB Rubin","year":"1976","unstructured":"Rubin DB. Inference and missing data. Biometrika. 1976;63:581\u201392.","journal-title":"Biometrika"},{"key":"274_CR5","unstructured":"Tim Bock. What are the Different Types of Missing Data?. Displayr. https:\/\/www.displayr.com\/different-types-of-missingdata\/."},{"key":"274_CR6","volume-title":"Statistical Analysis with Missing Data","author":"JA Little Roderick","year":"1987","unstructured":"Little Roderick JA, Rubin Donald B. Statistical Analysis with Missing Data. New York: Wiley; 1987."},{"key":"274_CR7","doi-asserted-by":"crossref","unstructured":"Mack C, Su Z, Westreich D. Managing Missing Data in Patient Registries. Rockville: Agency for Healthcare Research and Quality (US); 2018. Report No.: 17(18)-EHC015-EF. PMID: 29671990.","DOI":"10.23970\/AHRQREGISTRIESMISSINGDATA"},{"key":"274_CR8","doi-asserted-by":"publisher","first-page":"b2393","DOI":"10.1136\/bmj.b2393","volume":"338","author":"AC Jonathan","year":"2009","unstructured":"Jonathan AC, Sterne IR, White JB, Carlin M, Spratt P, Royston MG, Kenward, Angela M. Wood and James R Carpenter. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393.","journal-title":"BMJ"},{"key":"274_CR9","doi-asserted-by":"publisher","unstructured":"Stephens Z, Lee S, Faghri F, Campbell R, Zhai C, Efron M, et al. Big Data: Astronomical or Genomical? PLOS Biology. 2015;13(7):e1002195. https:\/\/doi.org\/10.1371\/journal.pbio.1002195.","DOI":"10.1371\/journal.pbio.1002195"},{"key":"274_CR10","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1038\/s41588-018-0062-7","volume":"50","author":"J di Iulio","year":"2018","unstructured":"di Iulio J, Bartha I, Wong EHM, et al. The human noncoding genome defined by genetic diversity. Nat Genet. 2018;50:333\u20137. https:\/\/doi.org\/10.1038\/s41588-018-0062-7.","journal-title":"Nat Genet"},{"key":"274_CR11","doi-asserted-by":"publisher","unstructured":"Makrythanasis P, Antonarakis S. Pathogenic variants in non-protein-coding sequences. Clin Genet. 2013;84(5):422\u20138. https:\/\/doi.org\/10.1111\/cge.12272.","DOI":"10.1111\/cge.12272"},{"key":"274_CR12","doi-asserted-by":"publisher","unstructured":"Stekhoven D, Buhlmann P. MissForest\u2013non-parametric missing value imputation for mixed-type data. Bioinformatics. 2011;28(1):112\u20138. https:\/\/doi.org\/10.1093\/bioinformatics\/btr597.","DOI":"10.1093\/bioinformatics\/btr597"},{"key":"274_CR13","doi-asserted-by":"publisher","unstructured":"Luis Torgo. Data Mining with. R, learning with case studies. CRC Press; 2010. https:\/\/doi.org\/10.1201\/9780429292859.","DOI":"10.1201\/9780429292859"},{"key":"274_CR14","doi-asserted-by":"publisher","unstructured":"Buuren S, Groothuis-Oudshoorn K. Mice: Multivariate Imputation by Chained Equiation in R. J Stat Softw. 2011;45(3). https:\/\/doi.org\/10.18637\/jss.v045.i03.","DOI":"10.18637\/jss.v045.i03"},{"key":"274_CR15","doi-asserted-by":"publisher","unstructured":"King G, Honaker J, Anne Joseph, Kenneth Scheve. \u201cAnalyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation\u201d. Am POl Sci Rev. 2001;95(1)49\u201369. https:\/\/doi.org\/10.1017\/S0003055401000235.","DOI":"10.1017\/S0003055401000235"},{"key":"274_CR16","doi-asserted-by":"publisher","unstructured":"Su Y-S, Gelman A, Jennifer Hill, and Yajima M. Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. J Stat Softw. 2011;45(2). https:\/\/doi.org\/10.18637\/jss.v045.i02.","DOI":"10.18637\/jss.v045.i02"},{"issue":"11","key":"274_CR17","doi-asserted-by":"publisher","first-page":"1623","DOI":"10.1002\/humu.23641","volume":"39","author":"MJ Human Mutation Landrum","year":"2018","unstructured":"Human Mutation Landrum MJ, Kattman BL. ClinVar at five years: Delivering on the promise. Hum Mutat. 2018;39(11):1623\u201330.","journal-title":"Hum Mutat"},{"key":"274_CR18","doi-asserted-by":"publisher","unstructured":"Wang K, Li M, Hakonarson H. Annovar: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16). https:\/\/doi.org\/10.1093\/nar\/gkq603.","DOI":"10.1093\/nar\/gkq603"},{"key":"274_CR19","doi-asserted-by":"publisher","unstructured":"Rentzsch P, Witten D, Cooper G, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2018;47(D1):D886\u201394. https:\/\/doi.org\/10.1093\/nar\/gky1016.","DOI":"10.1093\/nar\/gky1016"},{"key":"274_CR20","doi-asserted-by":"publisher","unstructured":"Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2014;31(5):761\u20133. https:\/\/doi.org\/10.1093\/bioinformatics\/btu703.","DOI":"10.1093\/bioinformatics\/btu703"},{"key":"274_CR21","doi-asserted-by":"publisher","first-page":"575","DOI":"10.1038\/nmeth0810-575","volume":"7","author":"JM Schwarz","year":"2010","unstructured":"Schwarz JM, Rodelsperger C. Schuelke M. Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nat Methods. 2010;7:575\u20136.","journal-title":"Nat Methods"},{"key":"274_CR22","doi-asserted-by":"publisher","first-page":"e1001025","DOI":"10.1371\/journal.pcbi.1001025","volume":"6","author":"EV Davydov","year":"2010","unstructured":"Davydov EV, Goode DL, Sirota M. Cooper GM, Sidow A. Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol. 2010;6:e1001025.","journal-title":"PLoS Comput Biol"},{"key":"274_CR23","doi-asserted-by":"publisher","unstructured":"Shihab H, Rogers M, Gough J, Mort M, Cooper D, Day I, et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics. 2015;31(10):1536\u201343. https:\/\/doi.org\/10.1093\/bioinformatics\/btv009.","DOI":"10.1093\/bioinformatics\/btv009"},{"key":"274_CR24","doi-asserted-by":"publisher","unstructured":"Gulko B, Hubisz M, Gronau I, Siepel A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat Genet. 2015;47(3):276\u201383. https:\/\/doi.org\/10.1038\/ng.3196.","DOI":"10.1038\/ng.3196"},{"key":"274_CR25","doi-asserted-by":"publisher","first-page":"i54","DOI":"10.1093\/bioinformatics\/btp190","volume":"25","author":"M Garber","year":"2009","unstructured":"Garber M. Guttman M. Clamp M. Zody MC, Friedman N, Xie X. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics. 2009;25:i54\u201362.","journal-title":"Bioinformatics"},{"key":"274_CR26","doi-asserted-by":"publisher","first-page":"901","DOI":"10.1101\/gr.3577405","volume":"15","author":"GM Cooper","year":"2005","unstructured":"Cooper GM, Stone EA, Asimenos G. Program NCS, Green ED, Batzoglou S. Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15:901\u201313.","journal-title":"Genome Res"},{"key":"274_CR27","doi-asserted-by":"publisher","first-page":"325","DOI":"10.1007\/0-387-27733-1_12","volume-title":"Statistical Methods in Molecular Evolution","author":"A Siepel","year":"2005","unstructured":"Siepel A, Haussler D. Phylogenetic hidden Markov models. In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: Springer; 2005. pp.\u00a0325\u201351."},{"key":"274_CR28","doi-asserted-by":"publisher","unstructured":"Ritchie G, Dunham I, Zeggini E, Flicek P. Functional annotation of noncoding sequence variants. Nat Methods. 2014;11(3):294\u20136. https:\/\/doi.org\/10.1038\/nmeth.2832.","DOI":"10.1038\/nmeth.2832"},{"key":"274_CR29","doi-asserted-by":"publisher","unstructured":"Glusman G. Caballero J, Mauldin DE, Hood L, Roach J. KAVIAR: an accessible system for testing SNV novelty. Bioinformatics. 2011;27(22):3216\u20137. https:\/\/doi.org\/10.1093\/bioinformatics\/btr540.","DOI":"10.1093\/bioinformatics\/btr540"},{"issue":"4","key":"274_CR30","doi-asserted-by":"publisher","first-page":"679","DOI":"10.1016\/j.ijforecast.2006.03.001","volume":"22","author":"RJ Hyndman","year":"2006","unstructured":"Hyndman RJ, Koehler AB. \u00abAnother look at measures of forecast accuracy\u00bb. Int J Forecast. 2006;22(4):679\u201388.","journal-title":"Int J Forecast"},{"key":"274_CR31","doi-asserted-by":"publisher","unstructured":"Shivaram Venkataraman Z, Yang D, Liu E, Liang H, Falaki X, Meng R, Xin A, Ghodsi MJ, Franklin I, Stoica. Matei A Zaharia \u201cSparkR: Scaling R Programs with Spark\u201d. SIGMOD; 2016. p. 1099\u2013104. https:\/\/doi.org\/10.1145\/2882903.2903740.","DOI":"10.1145\/2882903.2903740"},{"key":"274_CR32","doi-asserted-by":"publisher","unstructured":"Lin W, Tsai C. Missing value imputation: a review and analysis of the literature (2006\u20132017). Artif Intell Rev. 2019;53(2):1487\u2013509. https:\/\/doi.org\/10.1007\/s10462-019-09709-4.","DOI":"10.1007\/s10462-019-09709-4"},{"issue":"3","key":"274_CR33","first-page":"18","volume":"2","author":"A Liaw","year":"2002","unstructured":"Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002;2(3):18\u201322.","journal-title":"R News"},{"key":"274_CR34","doi-asserted-by":"crossref","unstructured":"Karatzoglou A, Smola A, Hornik K, Zeileis A. kernlab \u2013 An S4 Package for Kernel Methods in R. J Stat Softw. 2004;11(9):1\u201320. http:\/\/www.jstatsoft.org\/v11\/i09\/.","DOI":"10.18637\/jss.v011.i09"},{"key":"274_CR35","doi-asserted-by":"publisher","unstructured":"Mean Absolute Error. In: Sammut C, Webb GI, editors. Encyclopedia of Machine Learning. Boston: Springer; 2011. https:\/\/doi.org\/10.1007\/978-0-387-30164-8_525.","DOI":"10.1007\/978-0-387-30164-8_525"}],"container-title":["BioData Mining"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-021-00274-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13040-021-00274-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-021-00274-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,3]],"date-time":"2021-09-03T12:36:39Z","timestamp":1630672599000},"score":1,"resource":{"primary":{"URL":"https:\/\/biodatamining.biomedcentral.com\/articles\/10.1186\/s13040-021-00274-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,3]]},"references-count":35,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["274"],"URL":"https:\/\/doi.org\/10.1186\/s13040-021-00274-7","relation":{},"ISSN":["1756-0381"],"issn-type":[{"value":"1756-0381","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,9,3]]},"assertion":[{"value":"24 February 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 August 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 September 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"44"}}