{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T04:34:15Z","timestamp":1774499655907,"version":"3.50.1"},"reference-count":33,"publisher":"Oxford University Press (OUP)","issue":"13","license":[{"start":{"date-parts":[[2021,1,18]],"date-time":"2021-01-18T00:00:00Z","timestamp":1610928000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/501100003554","name":"Lundbeck foundation","doi-asserted-by":"publisher","award":["R215-2015-4174"],"award-info":[{"award-number":["R215-2015-4174"]}],"id":[{"id":"10.13039\/501100003554","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["31900487"],"award-info":[{"award-number":["31900487"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,7,27]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Principal component analysis (PCA) is a commonly used tool in genetics to capture and visualize population structure. Due to technological advances in sequencing, such as the widely used non-invasive prenatal test, massive datasets of ultra-low coverage sequencing are being generated. 
These datasets are characterized by a large amount of missing genotype information.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We present EMU, a method for inferring population structure in the presence of rampant non-random missingness. We show through simulations that several commonly used PCA methods cannot handle missing data arising from various sources, which leads to biased results as individuals are projected into the PC space based on their amount of missingness. In terms of accuracy, EMU outperforms an existing method that also accommodates missingness while being competitively fast. We further tested EMU on around 100K individuals of the Phase 1 dataset of the Chinese Millionome Project, who were shallowly sequenced to around 0.08\u00d7. From this data we are able to capture the population structure of the Han Chinese and to reproduce previous analysis in a matter of CPU hours instead of CPU years. 
EMU\u2019s capability to accurately infer population structure in the presence of missingness will be of increasing importance with the rising number of large-scale genetic datasets.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>EMU is written in Python and is freely available at https:\/\/github.com\/rosemeis\/emu.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab027","type":"journal-article","created":{"date-parts":[[2021,1,12]],"date-time":"2021-01-12T18:57:05Z","timestamp":1610477825000},"page":"1868-1875","source":"Crossref","is-referenced-by-count":37,"title":["Large-scale inference of population structure in presence of missingness using PCA"],"prefix":"10.1093","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9540-6673","authenticated-orcid":false,"given":"Jonas","family":"Meisner","sequence":"first","affiliation":[{"name":"Department of Biology, University of Copenhagen , Copenhagen DK-2200, Denmark"}]},{"given":"Siyang","family":"Liu","sequence":"additional","affiliation":[{"name":"BGI-Shenzhen , Shenzhen 518083, China"}]},{"given":"Mingxi","family":"Huang","sequence":"additional","affiliation":[{"name":"BGI-Shenzhen , Shenzhen 518083, China"}]},{"given":"Anders","family":"Albrechtsen","sequence":"additional","affiliation":[{"name":"Department of Biology, University of Copenhagen , Copenhagen DK-2200, Denmark"}]}],"member":"286","published-online":{"date-parts":[[2021,1,18]]},"reference":[{"key":"2023051611453118400_btab027-B1","doi-asserted-by":"crossref","first-page":"2776","DOI":"10.1093\/bioinformatics\/btx299","article-title":"Flashpca2: 
principal component analysis of biobank-scale genotype datasets","volume":"33","author":"Abraham","year":"2017","journal-title":"Bioinformatics"},{"key":"2023051611453118400_btab027-B2","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1109\/MCSE.2010.118","article-title":"Cython: the best of both worlds","volume":"13","author":"Behnel","year":"2011","journal-title":"Comput. Sci. Eng"},{"key":"2023051611453118400_btab027-B3","doi-asserted-by":"crossref","first-page":"261b","DOI":"10.1126\/science.296.5566.261b","article-title":"A human genome diversity cell line panel","volume":"296","author":"Cann","year":"2002","journal-title":"Science"},{"key":"2023051611453118400_btab027-B4","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1186\/s13742-015-0047-8","article-title":"Second-generation plink: rising to the challenge of larger and richer datasets","volume":"4","author":"Chang","year":"2015","journal-title":"Gigascience"},{"key":"2023051611453118400_btab027-B5","doi-asserted-by":"crossref","first-page":"127","DOI":"10.1016\/j.ajhg.2015.11.022","article-title":"Model-free estimation of recent genetic relatedness","volume":"98","author":"Conomos","year":"2016","journal-title":"Am. J. Hum. 
Genet"},{"key":"2023051611453118400_btab027-B6","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","author":"Consortium","year":"2015","journal-title":"Nature"},{"key":"2023051611453118400_btab027-B7","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1038\/nature14962","article-title":"The uk10k project identifies rare variants in health and disease","volume":"526","author":"Consortium","year":"2015","journal-title":"Nature"},{"key":"2023051611453118400_btab027-B8","author":"Dryden","year":"2016"},{"key":"2023051611453118400_btab027-B9","doi-asserted-by":"crossref","first-page":"e1001117","DOI":"10.1371\/journal.pgen.1001117","article-title":"Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis","volume":"6","author":"Engelhardt","year":"2010","journal-title":"PLoS Genet"},{"key":"2023051611453118400_btab027-B10","doi-asserted-by":"crossref","first-page":"818","DOI":"10.1038\/ng.3021","article-title":"Whole-genome sequence variation, population structure and demographic history of the Dutch population","volume":"46","author":"Francioli","year":"2014","journal-title":"Nat. Genet"},{"key":"2023051611453118400_btab027-B11","doi-asserted-by":"crossref","first-page":"e79667","DOI":"10.1371\/journal.pone.0079667","article-title":"Assessing the effect of sequencing depth and sample size in population genetics inferences","volume":"8","author":"Fumagalli","year":"2013","journal-title":"PLoS One"},{"key":"2023051611453118400_btab027-B12","doi-asserted-by":"crossref","first-page":"456","DOI":"10.1016\/j.ajhg.2015.12.022","article-title":"Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia","volume":"98","author":"Galinsky","year":"2016","journal-title":"Am. J. Hum. 
Genet"},{"key":"2023051611453118400_btab027-B13","doi-asserted-by":"crossref","first-page":"435","DOI":"10.1038\/ng.3247","article-title":"Large-scale whole-genome sequencing of the Icelandic population","volume":"47","author":"Gudbjartsson","year":"2015","journal-title":"Nat. Genet"},{"key":"2023051611453118400_btab027-B14","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1137\/090771806","article-title":"Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions","volume":"53","author":"Halko","year":"2011","journal-title":"SIAM Rev"},{"key":"2023051611453118400_btab027-B15","doi-asserted-by":"crossref","first-page":"713","DOI":"10.1093\/bioinformatics\/btv641","article-title":"Probabilistic models of genetic variation in structured populations applied to global human studies","volume":"32","author":"Hao","year":"2016","journal-title":"Bioinformatics"},{"key":"2023051611453118400_btab027-B16","first-page":"79","article-title":"Handling missing values in exploratory multivariate data analysis methods","volume":"153","author":"Josse","year":"2012","journal-title":"J. Soc. 
Fran\u00e7aise Stat"},{"key":"2023051611453118400_btab027-B17","doi-asserted-by":"crossref","first-page":"251","DOI":"10.1007\/BF02295279","article-title":"Weighted least squares fitting using ordinary least squares algorithms","volume":"62","author":"Kiers","year":"1997","journal-title":"Psychometrika"},{"key":"2023051611453118400_btab027-B18","doi-asserted-by":"crossref","first-page":"409","DOI":"10.1038\/nature13673","article-title":"Ancient human genomes suggest three ancestral populations for present-day Europeans","volume":"513","author":"Lazaridis","year":"2014","journal-title":"Nature"},{"key":"2023051611453118400_btab027-B19","author":"Lehoucq","year":"1998"},{"key":"2023051611453118400_btab027-B20","doi-asserted-by":"crossref","first-page":"2987","DOI":"10.1093\/bioinformatics\/btr509","article-title":"A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data","volume":"27","author":"Li","year":"2011","journal-title":"Bioinformatics"},{"key":"2023051611453118400_btab027-B21","doi-asserted-by":"crossref","first-page":"347","DOI":"10.1016\/j.cell.2018.08.016","article-title":"Genomic analyses from non-invasive prenatal testing reveal genetic associations, patterns of viral infections, and Chinese population history","volume":"175","author":"Liu","year":"2018","journal-title":"Cell"},{"key":"2023051611453118400_btab027-B22","doi-asserted-by":"crossref","first-page":"512","DOI":"10.1038\/ng1337","article-title":"The effects of human population structure on large genetic association studies","volume":"36","author":"Marchini","year":"2004","journal-title":"Nat. 
Genet"},{"key":"2023051611453118400_btab027-B23","doi-asserted-by":"crossref","first-page":"719","DOI":"10.1534\/genetics.118.301336","article-title":"Inferring population structure and admixture proportions in low-depth NGS data","volume":"210","author":"Meisner","year":"2018","journal-title":"Genetics"},{"key":"2023051611453118400_btab027-B24","doi-asserted-by":"crossref","first-page":"1144","DOI":"10.1111\/1755-0998.13019","article-title":"Testing for Hardy-Weinberg equilibrium in structured populations using genotype or low-depth NGS data","volume":"19","author":"Meisner","year":"2019","journal-title":"Mol. Ecol. Resources"},{"key":"2023051611453118400_btab027-B25","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1038\/nrg2626","article-title":"Sequencing technologies-the next generation","volume":"11","author":"Metzker","year":"2010","journal-title":"Nat. Rev. Genet"},{"key":"2023051611453118400_btab027-B26","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1038\/nrg2986","article-title":"Genotype and SNP calling from next-generation sequencing data","volume":"12","author":"Nielsen","year":"2011","journal-title":"Nat. Rev. Genet"},{"key":"2023051611453118400_btab027-B27","doi-asserted-by":"crossref","first-page":"e190","DOI":"10.1371\/journal.pgen.0020190","article-title":"Population structure and Eigen analysis","volume":"2","author":"Patterson","year":"2006","journal-title":"PLoS Genet"},{"key":"2023051611453118400_btab027-B28","first-page":"2825","article-title":"Scikit-learn: machine learning in python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res"},{"key":"2023051611453118400_btab027-B29","doi-asserted-by":"crossref","first-page":"904","DOI":"10.1038\/ng1847","article-title":"Principal components analysis corrects for stratification in genome-wide association studies","volume":"38","author":"Price","year":"2006","journal-title":"Nat. 
Genet"},{"key":"2023051611453118400_btab027-B30","doi-asserted-by":"crossref","first-page":"945","DOI":"10.1093\/genetics\/155.2.945","article-title":"Inference of population structure using multilocus genotype data","volume":"155","author":"Pritchard","year":"2000","journal-title":"Genetics"},{"key":"2023051611453118400_btab027-B31","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1109\/MCSE.2011.37","article-title":"The numpy array: a structure for efficient numerical computation","volume":"13","author":"Van Der Walt","year":"2011","journal-title":"Comput. Sci. Eng"},{"key":"2023051611453118400_btab027-B32","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1111\/j.1467-9469.2007.00585.x","article-title":"Simple and globally convergent methods for accelerating the convergence of any EM algorithm","volume":"35","author":"Varadhan","year":"2008","journal-title":"Scand. J. Stat"},{"key":"2023051611453118400_btab027-B33","doi-asserted-by":"crossref","first-page":"3326","DOI":"10.1093\/bioinformatics\/bts606","article-title":"A high-performance computing toolset for relatedness and principal component analysis of SNP 
data","volume":"28","author":"Zheng","year":"2012","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab027\/36297325\/btab027.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/13\/1868\/50340083\/btab027.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/13\/1868\/50340083\/btab027.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,16]],"date-time":"2023-05-16T07:46:48Z","timestamp":1684223208000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/13\/1868\/6103565"}},"subtitle":[],"editor":[{"given":"Russell","family":"Schwartz","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,1,18]]},"references-count":33,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2021,7,27]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab027","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.04.29.067496","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,7,1]]},"published":{"date-parts":[[2021,1,18]]}}}