{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,22]],"date-time":"2025-02-22T00:45:09Z","timestamp":1740185109307,"version":"3.37.3"},"reference-count":23,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2017,11,27]],"date-time":"2017-11-27T00:00:00Z","timestamp":1511740800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/about_us\/legal\/notices"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01GM116065"],"award-info":[{"award-number":["R01GM116065"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Inferring population structure is important for both population genetics and genetic epidemiology. Principal components analysis (PCA) has been effective in ascertaining population structure with array genotype data but can be difficult to use with sequencing data, especially when low depth leads to uncertainty in called genotypes. Because PCA is sensitive to differences in variability, PCA using sequencing data can result in components that correspond to differences in sequencing quality (read depth and error rate), rather than differences in population structure. We demonstrate that even existing methods for PCA specifically designed for sequencing data can still yield biased conclusions when used with data having sequencing properties that are systematically different across different groups of samples (i.e. sequencing groups). This situation can arise in population genetics when combining sequencing data from different studies, or in genetic epidemiology when using historical controls such as samples from the 1000 Genomes Project.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>To allow inference on population structure using PCA in these situations, we provide an approach that is based on using sequencing reads directly without calling genotypes. Our approach is to adjust the data from different sequencing groups to have the same read depth and error rate so that PCA does not generate spurious components representing sequencing quality. To accomplish this, we have developed a subsampling procedure to match the depth distributions in different sequencing groups, and a read-flipping procedure to match the error rates. We average over subsamples and read flips to minimize loss of information. We demonstrate the utility of our approach using two datasets from 1000 Genomes, and further evaluate it using simulation studies.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>TASER-PC software is publicly available at http:\/\/web1.sph.emory.edu\/users\/yhu30\/software.html.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btx708","type":"journal-article","created":{"date-parts":[[2017,11,24]],"date-time":"2017-11-24T12:10:32Z","timestamp":1511525432000},"page":"1157-1163","source":"Crossref","is-referenced-by-count":1,"title":["Robust inference of population structure from next-generation sequencing data with systematic differences in sequencing"],"prefix":"10.1093","volume":"34","author":[{"given":"Peizhou","family":"Liao","sequence":"first","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA"}]},{"given":"Glen A","family":"Satten","sequence":"additional","affiliation":[{"name":"Division of Reproductive Health, Centers for Disease Control and Prevention, Atlanta, GA, USA"}]},{"given":"Yi-Juan","family":"Hu","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA"}]}],"member":"286","published-online":{"date-parts":[[2017,11,27]]},"reference":[{"key":"2023012712570916600_btx708-B1","doi-asserted-by":"crossref","first-page":"639","DOI":"10.1126\/science.8430313","article-title":"Demic expansions and human evolution","volume":"259","author":"Cavalli-Sforza","year":"1993","journal-title":"Science"},{"key":"2023012712570916600_btx708-B2","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1038\/ng.806","article-title":"A framework for variation discovery and genotyping using next-generation DNA sequencing data","volume":"43","author":"DePristo","year":"2011","journal-title":"Nat. Genet"},{"key":"2023012712570916600_btx708-B3","doi-asserted-by":"crossref","first-page":"979","DOI":"10.1534\/genetics.113.154740","article-title":"Quantifying population genetic differentiation from next-generation sequencing data","volume":"195","author":"Fumagalli","year":"2013","journal-title":"Genetics"},{"key":"2023012712570916600_btx708-B4","doi-asserted-by":"crossref","first-page":"1020","DOI":"10.1101\/gr.074187.107","article-title":"Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals","volume":"18","author":"Hellmann","year":"2008","journal-title":"Genome Res"},{"key":"2023012712570916600_btx708-B5","doi-asserted-by":"crossref","first-page":"e1006040","DOI":"10.1371\/journal.pgen.1006040","article-title":"Testing rare-variant association without calling genotypes allows for systematic differences in sequencing between cases and controls","volume":"12","author":"Hu","year":"2016","journal-title":"PLoS Genet"},{"key":"2023012712570916600_btx708-B6","doi-asserted-by":"crossref","first-page":"199.","DOI":"10.1093\/molbev\/msm239","article-title":"Accounting for bias from sequencing error in population genetic estimates","volume":"25","author":"Johnson","year":"2008","journal-title":"Mol. Biol. Evol"},{"key":"2023012712570916600_btx708-B7","doi-asserted-by":"crossref","first-page":"294","DOI":"10.1038\/nbt.2522","article-title":"Updating benchtop sequencing performance comparison","volume":"31","author":"Junemann","year":"2013","journal-title":"Nat. Biotechnol"},{"key":"2023012712570916600_btx708-B8","doi-asserted-by":"crossref","first-page":"231.","DOI":"10.1186\/1471-2105-12-231","article-title":"Estimation of allele frequency and association mapping using next-generation sequencing data","volume":"12","author":"Kim","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023012712570916600_btx708-B9","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The Sequence Alignment\/Map format and SAMtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012712570916600_btx708-B10","doi-asserted-by":"crossref","first-page":"186","DOI":"10.1038\/ng.3761","article-title":"Exploring the genetic architecture of inflammatory bowel disease by whole-genome sequencing identifies association at ADCY7","volume":"49","author":"Luo","year":"2017","journal-title":"Nat. Genet"},{"key":"2023012712570916600_btx708-B11","first-page":"1","article-title":"Principal components analysis of population admixture","volume":"7","author":"Ma","year":"2012","journal-title":"PLoS ONE"},{"key":"2023012712570916600_btx708-B12","doi-asserted-by":"crossref","first-page":"2803","DOI":"10.1093\/bioinformatics\/btq526","article-title":"SeqEM: an adaptive genotype-calling approach for next-generation sequencing studies","volume":"26","author":"Martin","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012712570916600_btx708-B13","doi-asserted-by":"crossref","first-page":"786","DOI":"10.1126\/science.356262","article-title":"Synthetic maps of human gene frequencies in Europeans","volume":"201","author":"Menozzi","year":"1978","journal-title":"Science"},{"key":"2023012712570916600_btx708-B14","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1038\/nrg2986","article-title":"Genotype and SNP calling from next-generation sequencing data","volume":"12","author":"Nielsen","year":"2011","journal-title":"Nat. Rev. Genet"},{"key":"2023012712570916600_btx708-B15","doi-asserted-by":"crossref","first-page":"543.","DOI":"10.1186\/1471-2164-15-543","article-title":"Evaluating the accuracy of AIM panels at quantifying genome ancestry","volume":"15","author":"Pardo-Seco","year":"2014","journal-title":"BMC Genomics"},{"key":"2023012712570916600_btx708-B16","doi-asserted-by":"crossref","first-page":"904","DOI":"10.1038\/ng1847","article-title":"Principal components analysis corrects for stratification in genome-wide association studies","volume":"38","author":"Price","year":"2006","journal-title":"Nat. Genet"},{"key":"2023012712570916600_btx708-B17","doi-asserted-by":"crossref","first-page":"559","DOI":"10.1086\/519795","article-title":"Plink: a tool set for whole-genome association and population-based linkage analyses","volume":"81","author":"Purcell","year":"2007","journal-title":"Am. J. Hum. Genet"},{"key":"2023012712570916600_btx708-B18","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1038\/ng0508-491","article-title":"Principal component analysis of genetic data","volume":"40","author":"Reich","year":"2008","journal-title":"Nat. Genet"},{"key":"2023012712570916600_btx708-B19","doi-asserted-by":"crossref","first-page":"1061","DOI":"10.1038\/nature09534","article-title":"A map of human genome variation from population-scale sequencing","volume":"467","author":"The 1000 Genomes Project Consortium","year":"2010","journal-title":"Nature"},{"key":"2023012712570916600_btx708-B20","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1038\/nature14962","article-title":"The UK10K project identifies rare variants in health and disease","volume":"526","author":"The UK10K Consortium","year":"2015","journal-title":"Nature"},{"key":"2023012712570916600_btx708-B21","doi-asserted-by":"crossref","first-page":"e3862","DOI":"10.1371\/journal.pone.0003862","article-title":"Analysis of East Asia genetic substructure using genome-wide SNP arrays","volume":"3","author":"Tian","year":"2008","journal-title":"PLoS ONE"},{"key":"2023012712570916600_btx708-B22","doi-asserted-by":"crossref","first-page":"409","DOI":"10.1038\/ng.2924","article-title":"Ancestry estimation and control of population stratification for sequence-based association studies","volume":"46","author":"Wang","year":"2014","journal-title":"Nat. Genet"},{"key":"2023012712570916600_btx708-B23","doi-asserted-by":"crossref","first-page":"1352","DOI":"10.1038\/ng.3403","article-title":"Height-reducing variants and selection for short stature in Sardinia","volume":"47","author":"Zoledziewska","year":"2015","journal-title":"Nat. Genet"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/7\/1157\/48914479\/bioinformatics_34_7_1157.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/7\/1157\/48914479\/bioinformatics_34_7_1157.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T13:45:23Z","timestamp":1674827123000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/34\/7\/1157\/4665415"}},"subtitle":[],"editor":[{"given":"Oliver","family":"Stegle","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2017,11,27]]},"references-count":23,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2018,4,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btx708","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2018,4,1]]},"published":{"date-parts":[[2017,11,27]]}}}