{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T17:08:44Z","timestamp":1775063324959,"version":"3.50.1"},"reference-count":30,"publisher":"Oxford University Press (OUP)","issue":"15","license":[{"start":{"date-parts":[[2017,3,16]],"date-time":"2017-03-16T00:00:00Z","timestamp":1489622400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/about_us\/legal\/notices"}],"funder":[{"DOI":"10.13039\/100000002","name":"NIH","doi-asserted-by":"publisher","award":["GM099568"],"award-info":[{"award-number":["GM099568"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2017,8,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Whole-genome sequencing (WGS) data are being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here we introduce a new WGS variant data format implemented in the R\/Bioconductor package \u2018SeqArray\u2019 for storing variant calls in an array-oriented manner which provides the same capabilities as VCF, but with multiple high compression options and data access using high-performance parallel computing.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>Benchmarks using 1000 Genomes Phase 3 data show file sizes are 14.0\u2009Gb (VCF), 12.3\u2009Gb (BCF, binary VCF), 3.5\u2009Gb (BGT) and 2.6\u2009Gb (SeqArray) respectively. Reading genotypes in the SeqArray package are two to three times faster compared with the htslib C library using BCF files. For the allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with R\/Bioconductor packages, the SeqArray package provides users a flexible, feature-rich, high-performance programming environment for analysis of WGS variant data.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and Implementation<\/jats:title>\n                    <jats:p>http:\/\/www.bioconductor.org\/packages\/SeqArray<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btx145","type":"journal-article","created":{"date-parts":[[2017,3,15]],"date-time":"2017-03-15T00:12:38Z","timestamp":1489536758000},"page":"2251-2257","source":"Crossref","is-referenced-by-count":155,"title":["SeqArray\u2014a storage-efficient high-performance data format for WGS variant calls"],"prefix":"10.1093","volume":"33","author":[{"given":"Xiuwen","family":"Zheng","sequence":"first","affiliation":[{"name":"Department of Biostatistics, University of Washington, Seattle, WA, USA"}]},{"given":"Stephanie M","family":"Gogarten","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, University of Washington, Seattle, WA, USA"}]},{"given":"Michael","family":"Lawrence","sequence":"additional","affiliation":[{"name":"Bioinformatics and Computational Biology, Genentech, Inc, South San Francisco, CA, USA"}]},{"given":"Adrienne","family":"Stilp","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, University of Washington, Seattle, WA, USA"}]},{"given":"Matthew P","family":"Conomos","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, University of Washington, Seattle, WA, USA"}]},{"given":"Bruce S","family":"Weir","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, University of Washington, Seattle, WA, USA"}]},{"given":"Cathy","family":"Laurie","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, University of Washington, Seattle, WA, USA"}]},{"given":"David","family":"Levine","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, University of Washington, Seattle, WA, USA"}]}],"member":"286","published-online":{"date-parts":[[2017,3,16]]},"reference":[{"key":"2023063012483456800_btx145-B1","doi-asserted-by":"crossref","first-page":"1061","DOI":"10.1038\/nature09534","article-title":"A map of human genome variation from population-scale sequencing","volume":"467","author":"1000 Genomes Project Consortium","year":"2010","journal-title":"Nature"},{"key":"2023063012483456800_btx145-B2","doi-asserted-by":"crossref","first-page":"7.","DOI":"10.1186\/s13742-015-0047-8","article-title":"Second-generation plink: rising to the challenge of larger and richer datasets","volume":"4","author":"Chang","year":"2015","journal-title":"GigaScience"},{"key":"2023063012483456800_btx145-B3","doi-asserted-by":"crossref","first-page":"793","DOI":"10.1056\/NEJMp1500523","article-title":"A new initiative on precision medicine","volume":"372","author":"Collins","year":"2015","journal-title":"N. Engl. J. Med"},{"key":"2023063012483456800_btx145-B4","doi-asserted-by":"crossref","first-page":"276","DOI":"10.1002\/gepi.21896","article-title":"Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness","volume":"39","author":"Conomos","year":"2015","journal-title":"Genet. Epidemiol"},{"key":"2023063012483456800_btx145-B5","doi-asserted-by":"crossref","first-page":"127","DOI":"10.1016\/j.ajhg.2015.11.022","article-title":"Model-free estimation of recent genetic relatedness","volume":"98","author":"Conomos","year":"2016","journal-title":"Am. J. Hum. Genet"},{"key":"2023063012483456800_btx145-B6","doi-asserted-by":"crossref","first-page":"2156","DOI":"10.1093\/bioinformatics\/btr330","article-title":"The variant call format and vcftools","volume":"27","author":"Danecek","year":"2011","journal-title":"Bioinformatics"},{"key":"2023063012483456800_btx145-B7","doi-asserted-by":"crossref","first-page":"1266","DOI":"10.1093\/bioinformatics\/btu014","article-title":"Efficient haplotype matching and storage using the positional burrows-wheeler transform (pbwt)","volume":"30","author":"Durbin","year":"2014","journal-title":"Bioinformatics"},{"key":"2023063012483456800_btx145-B8","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v040.i08","article-title":"Rcpp: Seamless R and C\u2009++ integration","volume":"40","author":"Eddelbuettel","year":"2011","journal-title":"J. Stat. Softw"},{"key":"2023063012483456800_btx145-B9","doi-asserted-by":"crossref","first-page":"456","DOI":"10.1016\/j.ajhg.2015.12.022","article-title":"Fast principal-component analysis reveals convergent evolution of ADH1B in europe and east asia","volume":"98","author":"Galinsky","year":"2016","journal-title":"Am. J. Hum. Genet"},{"key":"2023063012483456800_btx145-B10","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/gb-2004-5-10-r80","article-title":"Bioconductor: open software development for computational biology and bioinformatics","volume":"5","author":"Gentleman","year":"2004","journal-title":"Genome Biol"},{"key":"2023063012483456800_btx145-B11","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1038\/nrg.2016.49","article-title":"Coming of age: ten years of next-generation sequencing technologies","volume":"17","author":"Goodwin","year":"2016","journal-title":"Nat. Rev. Genet"},{"key":"2023063012483456800_btx145-B12","doi-asserted-by":"crossref","first-page":"e1003118.","DOI":"10.1371\/journal.pcbi.1003118","article-title":"Software for computing and annotating genomic ranges","volume":"9","author":"Lawrence","year":"2013","journal-title":"PLoS Comput. Biol"},{"key":"2023063012483456800_btx145-B13","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1038\/nmeth.3654","article-title":"Efficient genotype compression and analysis of large genetic-variation data sets","volume":"13","author":"Layer","year":"2016","journal-title":"Nat. Methods"},{"key":"2023063012483456800_btx145-B14","doi-asserted-by":"crossref","first-page":"2987","DOI":"10.1093\/bioinformatics\/btr509","article-title":"A statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data","volume":"27","author":"Li","year":"2011","journal-title":"Bioinformatics"},{"key":"2023063012483456800_btx145-B15","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1093\/bioinformatics\/btv613","article-title":"BGT: efficient and flexible genotype query across many samples","volume":"32","author":"Li","year":"2016","journal-title":"Bioinformatics"},{"key":"2023063012483456800_btx145-B16","doi-asserted-by":"crossref","first-page":"2867","DOI":"10.1093\/bioinformatics\/btq559","article-title":"Robust relationship inference in genome-wide association studies","volume":"26","author":"Manichaikul","year":"2010","journal-title":"Bioinformatics"},{"key":"2023063012483456800_btx145-B17","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1038\/nrg2626","article-title":"Sequencing technologies \u2013 the next generation","volume":"11","author":"Metzker","year":"2010","journal-title":"Nat. Rev. Genet"},{"key":"2023063012483456800_btx145-B18","doi-asserted-by":"crossref","first-page":"2076","DOI":"10.1093\/bioinformatics\/btu168","article-title":"VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants","volume":"30","author":"Obenchain","year":"2014","journal-title":"Bioinformatics"},{"key":"2023063012483456800_btx145-B19","doi-asserted-by":"crossref","first-page":"349","DOI":"10.14778\/3025111.3025117","article-title":"The tiledb array data storage manager","volume":"10","author":"Papadopoulos","year":"2016","journal-title":"Proc. VLDB Endow"},{"key":"2023063012483456800_btx145-B20","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pgen.0020190","article-title":"Population structure and eigenanalysis","volume":"2","author":"Patterson","year":"2006","journal-title":"PLoS Genet"},{"key":"2023063012483456800_btx145-B21","doi-asserted-by":"crossref","first-page":"559","DOI":"10.1086\/519795","article-title":"PLINK: a tool set for whole-genome association and population-based linkage analyses","volume":"81","author":"Purcell","year":"2007","journal-title":"Am. J. Hum. Genet"},{"key":"2023063012483456800_btx145-B22","author":"R Core Team","year":"2016"},{"key":"2023063012483456800_btx145-B23","doi-asserted-by":"crossref","first-page":"399","DOI":"10.1198\/106186007X178979","article-title":"Simple parallel statistical computing in R","volume":"16","author":"Rossini","year":"2007","journal-title":"J. Comput. Graph. Stat"},{"key":"2023063012483456800_btx145-B24","first-page":"1358","article-title":"Estimating F-statistics for the analysis of population structure","volume":"38","author":"Weir","year":"1984","journal-title":"Evolution"},{"key":"2023063012483456800_btx145-B25","doi-asserted-by":"crossref","first-page":"721","DOI":"10.1146\/annurev.genet.36.050802.093940","article-title":"Estimating F-statistics","volume":"36","author":"Weir","year":"2002","journal-title":"Annu. Rev. Genet"},{"key":"2023063012483456800_btx145-B26","first-page":"e267","article-title":"SNPs and SNVs in forensic science","volume":"5","author":"Weir","year":"2015","journal-title":"Forensic Sci. Int"},{"key":"2023063012483456800_btx145-B27","doi-asserted-by":"crossref","first-page":"1468","DOI":"10.1101\/gr.4398405","article-title":"Measures of human population structure show heterogeneity among genomic regions","volume":"15","author":"Weir","year":"2005","journal-title":"Genome Res"},{"key":"2023063012483456800_btx145-B28","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1016\/j.ajhg.2010.11.011","article-title":"GCTA: a tool for genome-wide complex trait analysis","volume":"88","author":"Yang","year":"2011","journal-title":"Am. J. Hum. Genet"},{"key":"2023063012483456800_btx145-B29","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1016\/j.tpb.2015.09.004","article-title":"Eigenanalysis of SNP data with an identity by descent interpretation","volume":"107","author":"Zheng","year":"2016","journal-title":"Theor. Popul. Biol"},{"key":"2023063012483456800_btx145-B30","doi-asserted-by":"crossref","first-page":"3326","DOI":"10.1093\/bioinformatics\/bts606","article-title":"A high-performance computing toolset for relatedness and principal component analysis of SNP data","volume":"28","author":"Zheng","year":"2012","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/33\/15\/2251\/50756394\/bioinformatics_33_15_2251.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/33\/15\/2251\/50756394\/bioinformatics_33_15_2251.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,30]],"date-time":"2023-06-30T08:48:59Z","timestamp":1688114939000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/33\/15\/2251\/3072873"}},"subtitle":[],"editor":[{"given":"Inanc","family":"Birol","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2017,3,16]]},"references-count":30,"journal-issue":{"issue":"15","published-print":{"date-parts":[[2017,8,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btx145","relation":{"has-review":[{"id-type":"doi","id":"10.3410\/f.727443196.793535014","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2017,8,1]]},"published":{"date-parts":[[2017,3,16]]}}}