{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T17:15:01Z","timestamp":1776446101392,"version":"3.51.2"},"reference-count":43,"publisher":"Oxford University Press (OUP)","issue":"16","license":[{"start":{"date-parts":[[2020,5,16]],"date-time":"2020-05-16T00:00:00Z","timestamp":1589587200000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001732","name":"Danish National Research Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100001732","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH","award":["R248-2017-2003"],"award-info":[{"award-number":["R248-2017-2003"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,8,15]]},"abstract":"<jats:title>ABSTRACT<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>For example, we find that PC19\u2013PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16\u201318 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https:\/\/github.com\/privefl\/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https:\/\/privefl.github.io\/bigsnpr\/articles\/bedpca.html. All code used for this paper is available at https:\/\/github.com\/privefl\/paper4-bedpca\/tree\/master\/code.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa520","type":"journal-article","created":{"date-parts":[[2020,5,12]],"date-time":"2020-05-12T09:21:44Z","timestamp":1589275304000},"page":"4449-4457","source":"Crossref","is-referenced-by-count":150,"title":["Efficient toolkit implementing best practices for principal component analysis of population genetic data"],"prefix":"10.1093","volume":"36","author":[{"given":"Florian","family":"Priv\u00e9","sequence":"first","affiliation":[{"name":"National Centre for Register-Based Research, Aarhus University , Aarhus 8210, Denmark"},{"name":"Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes , La Tronche 38700, France"}]},{"given":"Keurcien","family":"Luu","sequence":"additional","affiliation":[{"name":"Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes , La Tronche 38700, France"}]},{"given":"Michael G B","family":"Blum","sequence":"additional","affiliation":[{"name":"Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes , La Tronche 38700, France"},{"name":"OWKIN France , Paris 75010, France"}]},{"given":"John J","family":"McGrath","sequence":"additional","affiliation":[{"name":"National Centre for Register-Based Research, Aarhus University , Aarhus 8210, Denmark"},{"name":"Queensland Brain Institute, University of Queensland , St. Lucia, 4072 Queensland, Australia"},{"name":"Queensland Centre for Mental Health Research, The Park Centre for Mental Health , Wacol, 4076 Queensland, Australia"}]},{"given":"Bjarni J","family":"Vilhj\u00e1lmsson","sequence":"additional","affiliation":[{"name":"National Centre for Register-Based Research, Aarhus University , Aarhus 8210, Denmark"}]}],"member":"286","published-online":{"date-parts":[[2020,5,16]]},"reference":[{"key":"2023062213525253900_btaa520-B1","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","year":"2015","journal-title":"Nature"},{"key":"2023062213525253900_btaa520-B2","doi-asserted-by":"crossref","first-page":"1277","DOI":"10.1038\/ejhg.2013.48","article-title":"Population structure, migration, and diversifying selection in the Netherlands","volume":"21","author":"Abdellaoui","year":"2013","journal-title":"Eur. J. Hum. Genet"},{"key":"2023062213525253900_btaa520-B3","doi-asserted-by":"crossref","first-page":"2776","DOI":"10.1093\/bioinformatics\/btx299","article-title":"FlashPCA2: principal component analysis of biobank-scale genotype datasets","volume":"33","author":"Abraham","year":"2017","journal-title":"Bioinformatics"},{"key":"2023062213525253900_btaa520-B4","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pgen.1008773","article-title":"Scalable probabilistic PCA for large-scale genetic variation data","author":"Agrawal","year":"2019"},{"key":"2023062213525253900_btaa520-B5","doi-asserted-by":"crossref","first-page":"134","DOI":"10.1093\/bioinformatics\/btr599","article-title":"A robust clustering algorithm for identifying problematic samples in genome-wide association studies","volume":"28","author":"Bellenguez","year":"2012","journal-title":"Bioinformatics"},{"key":"2023062213525253900_btaa520-B6","doi-asserted-by":"crossref","first-page":"3679","DOI":"10.1093\/bioinformatics\/btz157","article-title":"TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes","volume":"35","author":"Bose","year":"2019","journal-title":"Bioinformatics"},{"key":"2023062213525253900_btaa520-B7","first-page":"37","article-title":"Fast online SVD revisions for lightweight recommender systems","author":"Brand","year":"2003"},{"key":"2023062213525253900_btaa520-B8","doi-asserted-by":"crossref","first-page":"996","DOI":"10.1198\/106186004X12632","article-title":"A robust measure of skewness","volume":"13","author":"Brys","year":"2004","journal-title":"J. Comput. Graph. Stat"},{"key":"2023062213525253900_btaa520-B9","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1038\/s41586-018-0579-z","article-title":"The UK biobank resource with deep phenotyping and genomic data","volume":"562","author":"Bycroft","year":"2018","journal-title":"Nature"},{"key":"2023062213525253900_btaa520-B10","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1186\/s13742-015-0047-8","article-title":"Second-generation PLINK: rising to the challenge of larger and richer datasets","volume":"4","author":"Chang","year":"2015","journal-title":"Gigascience"},{"key":"2023062213525253900_btaa520-B11","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1186\/s12859-019-3307-2","article-title":"Guidelines for cell-type heterogeneity quantification based on a comparative analysis of reference-free DNA methylation deconvolution software","volume":"21","author":"Decamps","year":"2020","journal-title":"BMC Bioinform"},{"key":"2023062213525253900_btaa520-B12","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1016\/j.jmva.2019.02.007","article-title":"Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model","volume":"173","author":"Dey","year":"2019","journal-title":"J. Multivar. Anal"},{"key":"2023062213525253900_btaa520-B13","first-page":"2","article-title":"Comparison of nearest-neighbor-search strategies and implementations for efficient shape registration","volume":"3","author":"Elseberg","year":"2012","journal-title":"J. Softw. Eng. Rob"},{"key":"2023062213525253900_btaa520-B14","doi-asserted-by":"crossref","first-page":"456","DOI":"10.1016\/j.ajhg.2015.12.022","article-title":"Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia","volume":"98","author":"Galinsky","year":"2016","journal-title":"Am. J. Hum. Genet"},{"key":"2023062213525253900_btaa520-B15","doi-asserted-by":"crossref","first-page":"81","DOI":"10.2307\/2528963","article-title":"Robust estimates, residuals, and outlier detection with multiresponse data","volume":"28","author":"Gnanadesikan","year":"1972","journal-title":"Biometrics"},{"key":"2023062213525253900_btaa520-B16","doi-asserted-by":"crossref","first-page":"5186","DOI":"10.1016\/j.csda.2007.11.008","article-title":"An adjusted boxplot for skewed distributions","volume":"52","author":"Hubert","year":"2008","journal-title":"Comput. Stat. Data Anal"},{"key":"2023062213525253900_btaa520-B17","doi-asserted-by":"crossref","first-page":"52","DOI":"10.1038\/nature09298","article-title":"Integrating common and rare genetic variation in diverse human populations","volume":"467","year":"2010","journal-title":"Nature"},{"key":"2023062213525253900_btaa520-B18","first-page":"1649","article-title":"LoOP: local outlier probabilities","author":"Kriegel","year":"2009"},{"key":"2023062213525253900_btaa520-B19","doi-asserted-by":"crossref","first-page":"3605","DOI":"10.1214\/10-AOS821","article-title":"Convergence and prediction of principal component scores in high-dimensional settings","volume":"38","author":"Lee","year":"2010","journal-title":"Ann. Stat"},{"key":"2023062213525253900_btaa520-B20","doi-asserted-by":"crossref","first-page":"789","DOI":"10.1137\/S0895479895281484","article-title":"Deflation techniques for an implicitly restarted Arnoldi iteration","volume":"17","author":"Lehoucq","year":"1996","journal-title":"SIAM J. Mat. Anal. Appl"},{"key":"2023062213525253900_btaa520-B21","doi-asserted-by":"crossref","first-page":"1385","DOI":"10.1038\/ng.3431","article-title":"Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis","volume":"47","author":"Loh","year":"2015","journal-title":"Nat. Genet"},{"key":"2023062213525253900_btaa520-B22","doi-asserted-by":"crossref","first-page":"284","DOI":"10.1038\/ng.3190","article-title":"Efficient Bayesian mixed-model analysis increases association power in large cohorts","volume":"47","author":"Loh","year":"2015","journal-title":"Nat. Genet"},{"key":"2023062213525253900_btaa520-B23","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1111\/1755-0998.12592","article-title":"pcadapt: an R package to perform genome scans for selection based on principal component analysis","volume":"17","author":"Luu","year":"2017","journal-title":"Mol. Ecol. Resour"},{"key":"2023062213525253900_btaa520-B24","doi-asserted-by":"crossref","first-page":"2867","DOI":"10.1093\/bioinformatics\/btq559","article-title":"Robust relationship inference in genome-wide association studies","volume":"26","author":"Manichaikul","year":"2010","journal-title":"Bioinformatics"},{"key":"2023062213525253900_btaa520-B25","doi-asserted-by":"crossref","first-page":"307","DOI":"10.1198\/004017002188618509","article-title":"Robust estimates of location and dispersion for high-dimensional datasets","volume":"44","author":"Maronna","year":"2002","journal-title":"Technometrics"},{"key":"2023062213525253900_btaa520-B26","doi-asserted-by":"crossref","first-page":"e1000686","DOI":"10.1371\/journal.pgen.1000686","article-title":"A genealogical interpretation of principal components analysis","volume":"5","author":"McVean","year":"2009","journal-title":"PLoS Genet"},{"key":"2023062213525253900_btaa520-B27","article-title":"Processing 1000 genomes reference data for ancestry estimation","author":"Meyer","year":"2019"},{"key":"2023062213525253900_btaa520-B28","doi-asserted-by":"crossref","first-page":"646","DOI":"10.1038\/ng.139","article-title":"Interpreting principal component analyses of spatial population genetic variation","volume":"40","author":"Novembre","year":"2008","journal-title":"Nat. Genet"},{"key":"2023062213525253900_btaa520-B29","doi-asserted-by":"crossref","first-page":"e190","DOI":"10.1371\/journal.pgen.0020190","article-title":"Population structure and eigenanalysis","volume":"2","author":"Patterson","year":"2006","journal-title":"PLoS Genet"},{"key":"2023062213525253900_btaa520-B30","doi-asserted-by":"crossref","first-page":"768","DOI":"10.1038\/nature08872","article-title":"Understanding mechanisms underlying human gene expression variation with RNA sequencing","volume":"464","author":"Pickrell","year":"2010","journal-title":"Nature"},{"key":"2023062213525253900_btaa520-B31","doi-asserted-by":"crossref","first-page":"904","DOI":"10.1038\/ng1847","article-title":"Principal components analysis corrects for stratification in genome-wide association studies","volume":"38","author":"Price","year":"2006","journal-title":"Nat. Genet"},{"key":"2023062213525253900_btaa520-B32","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1016\/j.ajhg.2008.06.005","article-title":"Long-range LD can confound genome scans in admixed populations","volume":"83","author":"Price","year":"2008","journal-title":"Am. J. Hum. Genet"},{"key":"2023062213525253900_btaa520-B33","doi-asserted-by":"crossref","first-page":"459","DOI":"10.1038\/nrg2813","article-title":"New approaches to population stratification in genome-wide association studies","volume":"11","author":"Price","year":"2010","journal-title":"Nat. Rev. Genet"},{"key":"2023062213525253900_btaa520-B34","doi-asserted-by":"crossref","first-page":"2781","DOI":"10.1093\/bioinformatics\/bty185","article-title":"Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr","volume":"34","author":"Priv\u00e9","year":"2018","journal-title":"Bioinformatics"},{"key":"2023062213525253900_btaa520-B35","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1534\/genetics.119.302019","article-title":"Efficient implementation of penalized regression for genetic risk prediction","volume":"212","author":"Priv\u00e9","year":"2019","journal-title":"Genetics"},{"key":"2023062213525253900_btaa520-B36","doi-asserted-by":"crossref","DOI":"10.1093\/molbev\/msaa053","article-title":"Performing highly efficient genome scans for local adaptation with R package pcadapt version 4","author":"Priv\u00e9","year":"2020","journal-title":"Mol. Biol. Evol"},{"key":"2023062213525253900_btaa520-B37","volume-title":"Exploratory Data Analysis","author":"Tukey","year":"1977"},{"key":"2023062213525253900_btaa520-B38","doi-asserted-by":"crossref","first-page":"926","DOI":"10.1016\/j.ajhg.2015.04.018","article-title":"Improved ancestry estimation for both genotyping and sequencing data using projection procrustes analysis and genotype imputation","volume":"96","author":"Wang","year":"2015","journal-title":"Am. J. Hum. Genet"},{"key":"2023062213525253900_btaa520-B39","doi-asserted-by":"crossref","first-page":"S248","DOI":"10.1137\/16M1082214","article-title":"Primme_svds: a high-performance preconditioned SVD solver for accurate large-scale computations","volume":"39","author":"Wu","year":"2017","journal-title":"SIAM J. Sci. Comput"},{"key":"2023062213525253900_btaa520-B40","doi-asserted-by":"crossref","first-page":"565","DOI":"10.1038\/ng.608","article-title":"Common SNPS explain a large proportion of the heritability for human height","volume":"42","author":"Yang","year":"2010","journal-title":"Nat. Genet"},{"key":"2023062213525253900_btaa520-B41","doi-asserted-by":"crossref","first-page":"406","DOI":"10.1080\/01621459.1988.10478611","article-title":"High breakdown-point estimates of regression by means of the minimization of an efficient scale","volume":"83","author":"Yohai","year":"1988","journal-title":"J. Am. Stat. Assoc"},{"key":"2023062213525253900_btaa520-B42","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btaa152","article-title":"Fast and robust ancestry prediction using principal component analysis","author":"Zhang","year":"2020","journal-title":"Bioinformatics"},{"key":"2023062213525253900_btaa520-B43","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1159\/000288706","article-title":"Quantification of population structure using correlated SNPS by shrinkage principal components","volume":"70","author":"Zou","year":"2010","journal-title":"Hum. Hered"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaa520\/33694623\/btaa520.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/16\/4449\/50676539\/btaa520.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/16\/4449\/50676539\/btaa520.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,23]],"date-time":"2023-06-23T08:27:39Z","timestamp":1687508859000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/36\/16\/4449\/5838185"}},"subtitle":[],"editor":[{"given":"Russell","family":"Schwartz","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2020,5,16]]},"references-count":43,"journal-issue":{"issue":"16","published-print":{"date-parts":[[2020,8,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa520","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/841452","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,8,15]]},"published":{"date-parts":[[2020,5,16]]}}}