{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T19:36:46Z","timestamp":1772739406476,"version":"3.50.1"},"reference-count":17,"publisher":"Oxford University Press (OUP)","issue":"11","license":[{"start":{"date-parts":[[2020,3,20]],"date-time":"2020-03-20T00:00:00Z","timestamp":1584662400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01 HG008773"],"award-info":[{"award-number":["R01 HG008773"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"name":"UK Biobank Resource","award":["45227"],"award-info":[{"award-number":["45227"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,6,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Population stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false-positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Simple projection (SP) based on principal component loadings and the recently developed data augmentation, decomposition and Procrustes (ADP) transformation, such as LASER and TRACE, are popular methods for predicting PC scores. However, the predicted PC scores from SP can be biased toward NULL. On the other hand, ADP has a high computation cost because it requires running PCA separately for each study sample on the augmented dataset.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We develop and propose two alternative approaches: bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses a computationally efficient online singular value decomposition algorithm, which can greatly reduce the computation cost of ADP. We carried out extensive simulation studies to show that these alternative approaches are unbiased and the computation speed can be 16\u201316 000 times faster than ADP. We applied our approaches to the UK Biobank data of 488 366 study samples with 2492 samples from the 1000 Genomes data as the reference. AP and OADP required 0.82 and 21 CPU hours, respectively, while the projected computation time of ADP was 1628 CPU hours. Furthermore, when inferring sub-European ancestry, SP clearly showed bias, unlike the proposed approaches.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>The OADP and AP methods, as well as SP and ADP, have been implemented in the open-source Python software FRAPOSA, available at github.com\/daviddaiweizhang\/fraposa.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Contact<\/jats:title>\n                    <jats:p>leeshawn@umich.edu<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa152","type":"journal-article","created":{"date-parts":[[2020,2,27]],"date-time":"2020-02-27T15:16:34Z","timestamp":1582816594000},"page":"3439-3446","source":"Crossref","is-referenced-by-count":57,"title":["Fast and robust ancestry prediction using principal component analysis"],"prefix":"10.1093","volume":"36","author":[{"given":"Daiwei","family":"Zhang","sequence":"first","affiliation":[{"name":"Department of Biostatistics , University of Michigan, Ann Arbor, MI 48109, USA"}]},{"given":"Rounak","family":"Dey","sequence":"additional","affiliation":[{"name":"Department of Biostatistics , Harvard University, Boston, MA 02115, USA"}]},{"given":"Seunggeun","family":"Lee","sequence":"additional","affiliation":[{"name":"Department of Biostatistics , University of Michigan, Ann Arbor, MI 48109, USA"}]}],"member":"286","published-online":{"date-parts":[[2020,3,20]]},"reference":[{"key":"2023062312020797700_btaa152-B1","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","year":"2015","journal-title":"Nature"},{"key":"2023062312020797700_btaa152-B2","first-page":"707","volume-title":"European Conference on Computer Vision,","author":"Brand","year":"2002"},{"key":"2023062312020797700_btaa152-B3","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1038\/s41586-018-0579-z","article-title":"The UK Biobank resource with deep phenotyping and genomic data","volume":"562","author":"Bycroft","year":"2018","journal-title":"Nature"},{"key":"2023062312020797700_btaa152-B4"},{"key":"2023062312020797700_btaa152-B5","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1016\/j.jmva.2019.02.007","article-title":"Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model","volume":"173","author":"Dey","year":"2019","journal-title":"J. Multivariate Anal"},{"key":"2023062312020797700_btaa152-B6","doi-asserted-by":"crossref","first-page":"456","DOI":"10.1016\/j.ajhg.2015.12.022","article-title":"Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia","volume":"98","author":"Galinsky","year":"2016","journal-title":"Am. J. Hum. Genet"},{"key":"2023062312020797700_btaa152-B7","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1137\/090771806","article-title":"Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions","volume":"53","author":"Halko","year":"2011","journal-title":"SIAM Rev"},{"key":"2023062312020797700_btaa152-B8","volume-title":"Principal Component Analysis","author":"Jolliffe","year":"2002"},{"key":"2023062312020797700_btaa152-B9","doi-asserted-by":"crossref","first-page":"3605","DOI":"10.1214\/10-AOS821","article-title":"Convergence and prediction of principal component scores in high-dimensional settings","volume":"38","author":"Lee","year":"2010","journal-title":"Ann. Statist"},{"key":"2023062312020797700_btaa152-B10","doi-asserted-by":"crossref","first-page":"243","DOI":"10.1038\/ng.1074","article-title":"Differential confounding of rare and common variants in spatially structured populations","volume":"44","author":"Mathieson","year":"2012","journal-title":"Nat. Genet"},{"key":"2023062312020797700_btaa152-B11","doi-asserted-by":"crossref","first-page":"904","DOI":"10.1038\/ng1847","article-title":"Principal components analysis corrects for stratification in genome-wide association studies","volume":"38","author":"Price","year":"2006","journal-title":"Nat. Genet"},{"key":"2023062312020797700_btaa152-B12","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1038\/ng0508-491","article-title":"Principal component analysis of genetic data","volume":"40","author":"Reich","year":"2008","journal-title":"Nat. Genet"},{"key":"2023062312020797700_btaa152-B13","doi-asserted-by":"crossref","first-page":"e1001779","DOI":"10.1371\/journal.pmed.1001779","article-title":"UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age","volume":"12","author":"Sudlow","year":"2015","journal-title":"PLoS Med"},{"key":"2023062312020797700_btaa152-B14","doi-asserted-by":"crossref","first-page":"409","DOI":"10.1038\/ng.2924","article-title":"Ancestry estimation and control of population stratification for sequence-based association studies","volume":"46","author":"Wang","year":"2014","journal-title":"Nat. Genet"},{"key":"2023062312020797700_btaa152-B15","doi-asserted-by":"crossref","first-page":"926","DOI":"10.1016\/j.ajhg.2015.04.018","article-title":"Improved ancestry estimation for both genotyping and sequencing data using projection Procrustes analysis and genotype imputation","volume":"96","author":"Wang","year":"2015","journal-title":"Am. J. Hum. Genet"},{"key":"2023062312020797700_btaa152-B16","first-page":"1358","article-title":"Estimating f-statistics for the analysis of population structure","volume":"38","author":"Weir","year":"1984","journal-title":"Evolution"},{"key":"2023062312020797700_btaa152-B17","doi-asserted-by":"crossref","first-page":"1375","DOI":"10.1038\/ng.2758","article-title":"Identification of a rare coding variant in complement 3 associated with age-related macular degeneration","volume":"45","author":"Zhan","year":"2013","journal-title":"Nat. Genet"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaa152\/33151260\/btaa152.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/11\/3439\/50670795\/bioinformatics_36_11_3439.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/11\/3439\/50670795\/bioinformatics_36_11_3439.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,24]],"date-time":"2023-06-24T17:24:25Z","timestamp":1687627465000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/36\/11\/3439\/5810493"}},"subtitle":[],"editor":[{"given":"Russell","family":"Schwartz","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2020,3,20]]},"references-count":17,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2020,6,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa152","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/713172","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,6]]},"published":{"date-parts":[[2020,3,20]]}}}