{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T17:34:39Z","timestamp":1755797679772,"version":"3.37.3"},"reference-count":35,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2021,1,8]],"date-time":"2021-01-08T00:00:00Z","timestamp":1610064000000},"content-version":"vor","delay-in-days":7,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000066","name":"National Institute of Environmental Health Sciences","doi-asserted-by":"publisher","award":["K01 ES028064"],"award-info":[{"award-number":["K01 ES028064"]}],"id":[{"id":"10.13039\/100000066","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["#1705197"],"award-info":[{"award-number":["#1705197"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000065","name":"National Institute of Neurological Disorders and Stroke","doi-asserted-by":"publisher","award":["R01 NS102371"],"award-info":[{"award-number":["R01 NS102371"]}],"id":[{"id":"10.13039\/100000065","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,4,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Since the first human genome was sequenced in 2001, there has been a rapid growth in the number of bioinformatic methods to process and analyze next-generation sequencing (NGS) data for research and clinical studies that aim to identify genetic variants influencing diseases and traits. To achieve this goal, one first needs to call genetic variants from NGS data, which requires multiple computationally intensive analysis steps. Unfortunately, there is a lack of an open-source pipeline that can perform all these steps on NGS data in a manner, which is fully automated, efficient, rapid, scalable, modular, user-friendly and fault tolerant. To address this, we introduce xGAP, an extensible Genome Analysis Pipeline, which implements modified GATK best practice to analyze DNA-seq data with the aforementioned functionalities.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>xGAP implements massive parallelization of the modified GATK best practice pipeline by splitting a genome into many smaller regions with efficient load-balancing to achieve high scalability. It can process 30\u00d7 coverage whole-genome sequencing (WGS) data in \u223c90\u00a0min. In terms of accuracy of discovered variants, xGAP achieves average F1 scores of 99.37% for single nucleotide variants and 99.20% for insertion\/deletions across seven benchmark WGS datasets. We achieve highly consistent results across multiple on-premises (SGE &amp; SLURM) high-performance clusters. Compared to the Churchill pipeline, with similar parallelization, xGAP is 20% faster when analyzing 50\u00d7 coverage WGS on Amazon Web Service. Finally, xGAP is user-friendly and fault tolerant where it can automatically re-initiate failed processes to minimize required user intervention.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>xGAP is available at https:\/\/github.com\/Adigorla\/xgap.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa1097","type":"journal-article","created":{"date-parts":[[2021,1,5]],"date-time":"2021-01-05T04:21:19Z","timestamp":1609820479000},"page":"9-16","source":"Crossref","is-referenced-by-count":1,"title":["xGAP: a python based efficient, modular, extensible and fault tolerant genomic analysis pipeline for variant discovery"],"prefix":"10.1093","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0849-7894","authenticated-orcid":false,"given":"Aditya","family":"Gorla","sequence":"first","affiliation":[{"name":"Department of Bioengineering, University of California , Los Angeles, CA 90095, USA"}]},{"given":"Brandon","family":"Jew","sequence":"additional","affiliation":[{"name":"Bioinformatics Interdepartmental Program, University of California , Los Angeles, CA 90095, USA"}]},{"given":"Luke","family":"Zhang","sequence":"additional","affiliation":[{"name":"Undergraduate Neuroscience Interdepartmental Program, University of California , Los Angeles, CA 90095, USA"}]},{"given":"Jae Hoon","family":"Sul","sequence":"additional","affiliation":[{"name":"Department of Psychiatry and Biobehavioral Sciences, University of California , Los Angeles, CA 90095, USA"}]}],"member":"286","published-online":{"date-parts":[[2021,1,8]]},"reference":[{"key":"2023051601060129400_btaa1097-B1","doi-asserted-by":"crossref","first-page":"974","DOI":"10.1101\/gr.114876.110","article-title":"CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing","volume":"21","author":"Abyzov","year":"2011","journal-title":"Genome Res"},{"key":"2023051601060129400_btaa1097-B2","doi-asserted-by":"crossref","first-page":"64","DOI":"10.1016\/j.csbj.2014.11.001","article-title":"A case study for cloud based high throughput analysis of NGS data using the globus genomics system","volume":"13","author":"Bhuvaneshwar","year":"2015","journal-title":"Comput. Struct. Biotechnol. J"},{"key":"2023051601060129400_btaa1097-B3","doi-asserted-by":"crossref","first-page":"9345","DOI":"10.1038\/s41598-019-45835-3","article-title":"Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers","volume":"9","author":"Chen","year":"2019","journal-title":"Sci. Rep"},{"key":"2023051601060129400_btaa1097-B4","doi-asserted-by":"crossref","first-page":"628","DOI":"10.1038\/nrg3046","article-title":"Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data","volume":"12","author":"Cooper","year":"2011","journal-title":"Nat. Rev. Genet"},{"key":"2023051601060129400_btaa1097-B5","doi-asserted-by":"crossref","first-page":"2482","DOI":"10.1093\/bioinformatics\/btv179","article-title":"Halvade: scalable sequence analysis with mapReduce","volume":"31","author":"Decap","year":"2015","journal-title":"Bioinformatics"},{"year":"2017","author":"Deng","key":"2023051601060129400_btaa1097-B6"},{"key":"2023051601060129400_btaa1097-B7","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1038\/ng.806","article-title":"A framework for variation discovery and genotyping using next-generation DNA sequencing data","volume":"43","author":"Depristo","year":"2011","journal-title":"Nat. Genet"},{"key":"2023051601060129400_btaa1097-B8","doi-asserted-by":"crossref","first-page":"316","DOI":"10.1038\/nbt.3820","article-title":"Nextflow enables reproducible computational workflows","volume":"35","author":"Di Tommaso","year":"2017","journal-title":"Nat. Biotechnol"},{"year":"2017","author":"Emeras","key":"2023051601060129400_btaa1097-B9"},{"key":"2023051601060129400_btaa1097-B10","doi-asserted-by":"crossref","first-page":"557","DOI":"10.1186\/s12859-019-3169-7","article-title":"Recommendations for performance optimizations when using GATK3.8 and GATK4","volume":"20","author":"Heldenbrand","year":"2019","journal-title":"BMC Bioinformatics"},{"key":"2023051601060129400_btaa1097-B11","doi-asserted-by":"crossref","first-page":"e0132868","DOI":"10.1371\/journal.pone.0132868","article-title":"ElPrep: high-performance preparation of sequence alignment\/map files for variant calling","volume":"10","author":"Herzeel","year":"2015","journal-title":"PLoS One"},{"key":"2023051601060129400_btaa1097-B12","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1186\/s13059-014-0577-x","article-title":"Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics","volume":"16","author":"Kelly","year":"2015","journal-title":"Genome Biol"},{"key":"2023051601060129400_btaa1097-B13","doi-asserted-by":"crossref","first-page":"2520","DOI":"10.1093\/bioinformatics\/bts480","article-title":"Snakemake-a scalable bioinformatics workflow engine","volume":"28","author":"K\u00f6ster","year":"2012","journal-title":"Bioinformatics"},{"key":"2023051601060129400_btaa1097-B14","doi-asserted-by":"crossref","first-page":"555","DOI":"10.1038\/s41587-019-0054-x","article-title":"Best practices for benchmarking germline small-variant calls in human genomes","volume":"37","author":"Krusche","year":"2019","journal-title":"Nat. Biotechnol"},{"key":"2023051601060129400_btaa1097-B15","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1097\/PAT.0000000000000235","article-title":"Whole genome sequencing in clinical and public health microbiology","volume":"47","author":"Kwong","year":"2015","journal-title":"Pathology"},{"key":"2023051601060129400_btaa1097-B16","doi-asserted-by":"crossref","first-page":"R84","DOI":"10.1186\/gb-2014-15-6-r84","article-title":"LUMPY: a probabilistic framework for structural variant discovery","volume":"15","author":"Layer","year":"2014","journal-title":"Genome Biol"},{"key":"2023051601060129400_btaa1097-B17","doi-asserted-by":"crossref","first-page":"1985","DOI":"10.1093\/bioinformatics\/btz216","article-title":"VCPA: genomic variant calling pipeline and data management tool for Alzheimer\u2019s Disease Sequencing Project","volume":"35","author":"Leung","year":"2019","journal-title":"Bioinformatics"},{"key":"2023051601060129400_btaa1097-B18","doi-asserted-by":"crossref","first-page":"2987","DOI":"10.1093\/bioinformatics\/btr509","article-title":"A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data","volume":"27","author":"Li","year":"2011","journal-title":"Bioinformatics"},{"key":"2023051601060129400_btaa1097-B19","doi-asserted-by":"crossref","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with Burrows-Wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023051601060129400_btaa1097-B20","doi-asserted-by":"crossref","first-page":"802","DOI":"10.1016\/j.ajhg.2019.03.002","article-title":"Dynamic scan procedure for detecting rare-variant association regions in whole-genome sequencing studies","volume":"104","author":"Li","year":"2019","journal-title":"Am. J. Hum. Genet"},{"key":"2023051601060129400_btaa1097-B21","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1186\/s13059-019-1649-8","article-title":"Improving the usability and archival stability of bioinformatics software","volume":"20","author":"Mangul","year":"2019","journal-title":"Genome Biol"},{"key":"2023051601060129400_btaa1097-B22","doi-asserted-by":"crossref","first-page":"1297","DOI":"10.1101\/gr.107524.110","article-title":"The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data","volume":"20","author":"McKenna","year":"2010","journal-title":"Genome Res"},{"key":"2023051601060129400_btaa1097-B23","doi-asserted-by":"crossref","first-page":"2741","DOI":"10.1093\/bioinformatics\/btv204","article-title":"MetaSV: an accurate and integrative structural-variant caller for next generation sequencing","volume":"31","author":"Mohiyuddin","year":"2015","journal-title":"Bioinformatics"},{"key":"2023051601060129400_btaa1097-B24","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1016\/j.ygeno.2008.07.001","article-title":"Applications of next-generation sequencing technologies in functional genomics","volume":"92","author":"Morozova","year":"2008","journal-title":"Genomics"},{"year":"2015","author":"Mushtaq","key":"2023051601060129400_btaa1097-B25"},{"key":"2023051601060129400_btaa1097-B26","doi-asserted-by":"crossref","first-page":"e0224784","DOI":"10.1371\/journal.pone.0224784","article-title":"SparkGA2: production-quality memory-efficient Apache Spark based genome analysis framework","volume":"14","author":"Mushtaq","year":"2019","journal-title":"PLoS One"},{"key":"2023051601060129400_btaa1097-B27","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1038\/nrg2986","article-title":"Genotype and SNP calling from next-generation sequencing data","volume":"12","author":"Nielsen","year":"2011","journal-title":"Nat. Rev. Genet"},{"key":"2023051601060129400_btaa1097-B28","article-title":"Scaling accurate genetic variant discovery to tens of thousands of samples","author":"Poplin","year":"2017","journal-title":"DOI: 10.1101\/201178."},{"key":"2023051601060129400_btaa1097-B29","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1016\/j.jpdc.2017.06.009","article-title":"Scalable system scheduling for HPC and big data","volume":"111","author":"Reuther","year":"2018","journal-title":"J. Parallel Distrib. Comput"},{"key":"2023051601060129400_btaa1097-B30","doi-asserted-by":"crossref","first-page":"215","DOI":"10.3389\/fgene.2015.00215","article-title":"Clinical applications of next generation sequencing in cancer: from panels, to exomes, to genomes","volume":"6","author":"Shen","year":"2015","journal-title":"Front. Genet"},{"key":"2023051601060129400_btaa1097-B31","doi-asserted-by":"crossref","first-page":"17851","DOI":"10.1038\/s41598-018-36177-7","article-title":"Comparison of three variant callers for human whole genome sequencing","volume":"8","author":"Supernat","year":"2018","journal-title":"Sci. Rep"},{"key":"2023051601060129400_btaa1097-B32","doi-asserted-by":"crossref","first-page":"2032","DOI":"10.1093\/bioinformatics\/btv098","article-title":"Sambamba: fast processing of NGS alignment formats","volume":"31","author":"Tarasov","year":"2015","journal-title":"Bioinformatics"},{"year":"2003","author":"Yoo","key":"2023051601060129400_btaa1097-B33"},{"key":"2023051601060129400_btaa1097-B34","doi-asserted-by":"crossref","first-page":"561","DOI":"10.1038\/s41587-019-0074-6","article-title":"An open resource for accurately benchmarking small variant and reference calls","volume":"37","author":"Zook","year":"2019","journal-title":"Nat. Biotechnol"},{"key":"2023051601060129400_btaa1097-B35","doi-asserted-by":"crossref","first-page":"160025","DOI":"10.1038\/sdata.2016.25","article-title":"Extensive sequencing of seven human genomes to characterize benchmark reference materials","volume":"3","author":"Zook","year":"2016","journal-title":"Sci. Data"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaa1097\/36237963\/btaa1097.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/1\/9\/50321763\/btaa1097.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/1\/9\/50321763\/btaa1097.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,16]],"date-time":"2023-05-16T01:07:23Z","timestamp":1684199243000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/1\/9\/6069565"}},"subtitle":[],"editor":[{"given":"Pier Luigi","family":"Martelli","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,1,1]]},"references-count":35,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,4,9]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa1097","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2021,1,1]]},"published":{"date-parts":[[2021,1,1]]}}}