{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,28]],"date-time":"2025-07-28T21:13:36Z","timestamp":1753737216470,"version":"3.37.3"},"reference-count":40,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2016,9,10]],"date-time":"2016-09-10T00:00:00Z","timestamp":1473465600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2016,9,10]],"date-time":"2016-09-10T00:00:00Z","timestamp":1473465600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01 HG008115"],"award-info":[{"award-number":["R01 HG008115"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000015","name":"U.S. Department of Energy","doi-asserted-by":"publisher","award":["DE-AC05-00OR22725"],"award-info":[{"award-number":["DE-AC05-00OR22725"]}],"id":[{"id":"10.13039\/100000015","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>The decreasing costs of sequencing are driving the need for cost effective and real time variant calling of whole genome sequencing data. The scale of these projects are far beyond the capacity of typical computing resources available with most research labs. 
Other infrastructures like the cloud AWS environment and supercomputers also have limitations due to which large scale joint variant calling becomes infeasible, and infrastructure specific variant calling strategies either fail to scale up to large datasets or abandon joint calling strategies.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>We present a high throughput framework including multiple variant callers for single nucleotide variant (SNV) calling, which leverages hybrid computing infrastructure consisting of cloud AWS, supercomputers and local high performance computing infrastructures. We present a novel binning approach for large scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset in which joint calling, imputation and phasing of over 5300 whole genome samples was produced in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180\u00a0TB of BAM files, and ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. 
The entire operation used 5.2 million core hours and only transferred a total of 6\u00a0TB of data across the platforms.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low coverage data can be accomplished in a scalable, cost effective and fast manner by using heterogeneous computing platforms without compromising on the quality of variants.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s12859-016-1211-6","type":"journal-article","created":{"date-parts":[[2016,9,10]],"date-time":"2016-09-10T01:52:53Z","timestamp":1473472373000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["A hybrid computational strategy to address WGS variant analysis in &gt;5000 samples"],"prefix":"10.1186","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9149-295X","authenticated-orcid":false,"given":"Zhuoyi","family":"Huang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Navin","family":"Rustagi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Narayanan","family":"Veeraraghavan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andrew","family":"Carroll","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Richard","family":"Gibbs","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Eric","family":"Boerwinkle","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Manjunath 
Gorentla","family":"Venkata","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fuli","family":"Yu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2016,9,10]]},"reference":[{"issue":"7571","key":"1211_CR1","doi-asserted-by":"publisher","first-page":"68","DOI":"10.1038\/nature15393","volume":"526","author":"1000 Genomes Project Consortium","year":"2015","unstructured":"1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68\u201374.","journal-title":"Nature"},{"issue":"7571","key":"1211_CR2","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1038\/nature14962","volume":"526","author":"UK10K Consortium","year":"2015","unstructured":"UK10K Consortium. The UK10K project identifies rare variants in health and disease. Nature. 2015;526(7571):82\u201390.","journal-title":"Nature"},{"issue":"1","key":"1211_CR3","doi-asserted-by":"publisher","first-page":"73","DOI":"10.1161\/CIRCGENETICS.108.829747","volume":"2","author":"BM Psaty","year":"2009","unstructured":"Psaty BM, O\u2019Donnell CJ, Gudnason V, Lunetta KL, Folsom AR, Rotter JI, Uitterlinden AG, Harris TB, Witteman JC, Boerwinkle E, CHARGE Consortium. Cohorts for heart and aging research in genomic epidemiology (CHARGE) consortium design of prospective meta-analyses of genome-wide association studies from 5 cohorts. Circ Cardiovasc Genet. 2009;2(1):73\u201380.","journal-title":"Circ Cardiovasc Genet"},{"key":"1211_CR4","unstructured":"CHARGE Consortium. http:\/\/www.chargeconsortium.com\/. Accessed 25 Oct 2015."},{"issue":"7","key":"1211_CR5","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pbio.1002195","volume":"13","author":"ZD Stephens","year":"2015","unstructured":"Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. Big data: astronomical or genomical? 
PLoS Biol. 2015;13(7), e1002195.","journal-title":"PLoS Biol"},{"issue":"9","key":"1211_CR6","doi-asserted-by":"publisher","first-page":"1297","DOI":"10.1101\/gr.107524.110","volume":"20","author":"A McKenna","year":"2010","unstructured":"McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297\u2013303.","journal-title":"Genome Res"},{"issue":"6","key":"1211_CR7","doi-asserted-by":"publisher","first-page":"918","DOI":"10.1101\/gr.176552.114","volume":"25","author":"G Jun","year":"2015","unstructured":"Jun G, Wing MK, Abecasis GR, Kang HM. An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data. Genome Res. 2015;25(6):918\u201325.","journal-title":"Genome Res"},{"issue":"5","key":"1211_CR8","doi-asserted-by":"publisher","first-page":"833","DOI":"10.1101\/gr.146084.112","volume":"23","author":"Y Wang","year":"2013","unstructured":"Wang Y, Lu J, Yu J, Gibbs RA, Yu F. An integrative variant analysis pipeline for accurate genotype\/haplotype inference in population NGS data. Genome Res. 2013;23(5):833\u201342.","journal-title":"Genome Res"},{"key":"1211_CR9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1471-2105-14-274","volume":"14","author":"X Yu","year":"2013","unstructured":"Yu X, Sun S. Comparing a few SNP calling algorithms using low-coverage sequencing data. BMC Bioinf. 2013;14:1.","journal-title":"BMC Bioinf"},{"key":"1211_CR10","doi-asserted-by":"publisher","first-page":"17875","DOI":"10.1038\/srep17875","volume":"5","author":"S Hwang","year":"2015","unstructured":"Hwang S, Kim E, Lee I, Marcotte EM. Systematic comparison of variant calling pipelines using gold standard personal exome variants. Sci Rep. 
2015;5:17875.","journal-title":"Sci Rep"},{"issue":"6","key":"1211_CR11","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1038\/nrg2986","volume":"12","author":"R Nielsen","year":"2011","unstructured":"Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011;12(6):443\u201351.","journal-title":"Nat Rev Genet"},{"issue":"8","key":"1211_CR12","first-page":"1","volume":"13","author":"Q Liu","year":"2012","unstructured":"Liu Q, Guo Y, Li J, Long J, Zhang B, Shyr Y. Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data. BMC Genomics. 2012;13(8):1.","journal-title":"BMC Genomics"},{"key":"1211_CR13","doi-asserted-by":"publisher","first-page":"940","DOI":"10.1101\/gr.117259.110","volume":"21","author":"Y Li","year":"2011","unstructured":"Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: implications for design of complex trait association studies. Genome Res. 2011;21:940\u201351.","journal-title":"Genome Res"},{"key":"1211_CR14","first-page":"49","volume-title":"Global Conference on Signal and Information Processing (GlobalSIP)","author":"Z Huang","year":"2013","unstructured":"Huang Z, Yu J, Yu F. Cloud processing of 1000 genomes sequencing data using Amazon Web Service. In: Global Conference on Signal and Information Processing (GlobalSIP). Washington: IEEE; 2013. p. 49\u201352."},{"key":"1211_CR15","unstructured":"Amazon Web Services(AWS)-Cloud Computing Services. https:\/\/aws.amazon.com. Accessed 25 Oct 2015."},{"issue":"11","key":"1211_CR16","doi-asserted-by":"publisher","first-page":"1363","DOI":"10.1093\/bioinformatics\/btp236","volume":"25","author":"MC Schatz","year":"2009","unstructured":"Schatz MC. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics. 
2009;25(11):1363\u20139.","journal-title":"Bioinformatics"},{"issue":"11","key":"1211_CR17","doi-asserted-by":"publisher","first-page":"R134","DOI":"10.1186\/gb-2009-10-11-r134","volume":"10","author":"B Langmead","year":"2009","unstructured":"Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Searching for SNPs with cloud computing. Genome Biol. 2009;10(11):R134.","journal-title":"Genome Biol"},{"issue":"12","key":"1211_CR18","first-page":"1","volume":"11","author":"E Afgan","year":"2010","unstructured":"Afgan E, Baker D, Coraor N, Chapman B, Nekrutenko A, Taylor J. Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinf. 2010;11(12):1.","journal-title":"BMC Bioinf"},{"issue":"6","key":"1211_CR19","first-page":"1","volume":"13","author":"US Evani","year":"2012","unstructured":"Evani US, Challis D, Yu J, Jackson AR, Paithankar S, Bainbridge MN, Jakkamsetti A, Pham P, Coarfa C, Milosavljevic A, Yu F. Atlas2 Cloud: a framework for personal genome analysis in the cloud. BMC Genomics. 2012;13(6):1.","journal-title":"BMC Genomics"},{"issue":"6","key":"1211_CR20","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0129277","volume":"10","author":"SS Shringarpure","year":"2015","unstructured":"Shringarpure SS, Carroll A, Francisco M, Bustamante CD. Inexpensive and Highly Reproducible Cloud-Based Variant Calling of 2,535 Human Genomes. PLoS ONE. 2015;10(6), e0129277.","journal-title":"PLoS ONE"},{"key":"1211_CR21","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1471-2105-15-1","volume":"15","author":"JG Reid","year":"2014","unstructured":"Reid JG, Carroll A, Veeraraghavan N, Dahdouli M, Sundquist A, English A, Bainbridge M, White S, Salerno W, Buhay C, Yu F. Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline. BMC Bioinf. 
2014;15:1.","journal-title":"BMC Bioinf"},{"key":"1211_CR22","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1471-2105-13-8","volume":"13","author":"D Challis","year":"2012","unstructured":"Challis D, Yu J, Evani US, Jackson AR, Paithankar S, Coarfa C, Milosavljevic A, Gibbs RA, Yu F. An integrative variant analysis suite for whole exome next-generation sequencing data. BMC Bioinf. 2012;13:1.","journal-title":"BMC Bioinf"},{"key":"1211_CR23","unstructured":"The Cloud VS HPC Conundrum. http:\/\/www.nextplatform.com\/2015\/06\/03\/the-hpc-cloud-versus-cluster-cost-conundrum\/. Accessed 25 Oct 2015."},{"issue":"3","key":"1211_CR24","doi-asserted-by":"publisher","first-page":"704","DOI":"10.1016\/j.future.2012.08.014","volume":"29","author":"C De Alfonso","year":"2013","unstructured":"De Alfonso C, Caballer M, Alvarruiz F, Molt\u00f3 G. An economic and energy-aware analysis of the viability of outsourcing cluster computing to a cloud. Futur Gener Comput Syst. 2013;29(3):704\u201312.","journal-title":"Futur Gener Comput Syst"},{"key":"1211_CR25","unstructured":"Oak Ridge Leadership Computing Facilities. https:\/\/www.olcf.ornl.gov\/computing-resources\/. Accessed 25 Oct 2015."},{"key":"1211_CR26","unstructured":"Blue BioU | Center for Research Computing. https:\/\/www.rcsg.rice.edu\/tag\/blue-biou\/. Accessed 25 Oct 2015."},{"key":"1211_CR27","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12859-015-0736-4","volume":"16","author":"KA Standish","year":"2015","unstructured":"Standish KA, Carland TM, Lockwood GK, Pfeiffer W, Tatineni M, Huang CC, Lamberth S, Cherkas Y, Brodmerkel C, Jaeger E, Smith L. Group-based variant calling leveraging next-generation supercomputing for large-scale whole-genome sequencing studies. BMC Bioinf. 
2015;16:1.","journal-title":"BMC Bioinf"},{"key":"1211_CR28","first-page":"78","volume-title":"Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis","author":"D Fiala","year":"2012","unstructured":"Fiala D, Mueller F, Engelmann C, Riesen R, Ferreira K, Brightwell R. Detection and correction of silent data corruption for large-scale high-performance computing. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. Salt Lake City: IEEE Computer Society Press; 2012. p. 78."},{"issue":"3","key":"1211_CR29","doi-asserted-by":"publisher","first-page":"723","DOI":"10.1093\/molbev\/mst229","volume":"31","author":"E Han","year":"2014","unstructured":"Han E, Sinsheimer JS, Novembre J. Characterizing bias in population genetic inferences from low-coverage sequencing data. Mol Biol Evol. 2014;31(3):723\u201335.","journal-title":"Mol Biol Evol"},{"issue":"3","key":"1211_CR30","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0121644","volume":"10","author":"F Yu","year":"2015","unstructured":"Yu F, Lu J, Liu X, Gazave E, Chang D, Raj S, Hunter-Zinck H, Blekhman R, Arbiza L, Van Hout C, Morrison A. Population Genomic Analysis of 962 Whole Genome Sequences of Humans Reveals Natural Selection in Non-Coding Regions. PLoS ONE. 2015;10(3), e0121644.","journal-title":"PLoS ONE"},{"key":"1211_CR31","unstructured":"EC2 Instance Types-Amazon Web Services (AWS). https:\/\/aws.amazon.com\/ec2\/instance-types\/. Accessed 25 Oct 2015."},{"key":"1211_CR32","doi-asserted-by":"publisher","first-page":"239","DOI":"10.1145\/2493123.2462919","volume-title":"Proceedings of the 22nd international symposium on High-performance parallel and distributed computing","author":"A Marathe","year":"2013","unstructured":"Marathe A, Harris R, Lowenthal DK, De Supinski BR, Rountree B, Schulz M, Yuan X. A comparative study of high-performance computing on the cloud. 
In: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing. New York: ACM; 2013. p. 239\u201350."},{"key":"1211_CR33","unstructured":"Oakridge Leadership compute Facility. https:\/\/www.olcf.ornl.gov\/summit\/. Accessed 25 Oct 2015."},{"key":"1211_CR34","unstructured":"AWS Public Datasets. http:\/\/aws.amazon.com\/datasets\/ Accessed on 25 Oct 2015."},{"issue":"3","key":"1211_CR35","doi-asserted-by":"publisher","first-page":"263","DOI":"10.1016\/1047-2797(91)90005-W","volume":"1","author":"LP Fried","year":"1991","unstructured":"Fried LP, Borhani NO, Enright P, Furberg CD, Gardin JM, Kronmal RA, Kuller LH, Manolio TA, Mittelmark MB, Newman A, O\u2019Leary DH. The cardiovascular health study: design and rationale. Ann Epidemiol. 1991;1(3):263\u201376.","journal-title":"Ann Epidemiol"},{"issue":"3","key":"1211_CR36","doi-asserted-by":"publisher","first-page":"279","DOI":"10.2105\/AJPH.41.3.279","volume":"41","author":"TR Dawber","year":"1951","unstructured":"Dawber TR, Meadors GF, Moore Jr FE. Epidemiological Approaches to Heart Disease: The Framingham Study*. Am J Public Health Nations Health. 1951;41(3):279\u201386.","journal-title":"Am J Public Health Nations Health"},{"issue":"4","key":"1211_CR37","doi-asserted-by":"publisher","first-page":"687","DOI":"10.1093\/oxfordjournals.aje.a115184","volume":"129","author":"A Investigators","year":"1989","unstructured":"Investigators A. The atherosclerosis risk in communities (ARIC) study: design and objectives. Am J Epidemiol. 1989;129(4):687\u2013702.","journal-title":"Am J Epidemiol"},{"key":"1211_CR38","doi-asserted-by":"publisher","first-page":"e68095","DOI":"10.1371\/journal.pone.0068095","volume":"8.7","author":"ML Grove","year":"2013","unstructured":"Grove ML, et al. Best practices and joint calling of the HumanExome BeadChip: the CHARGE Consortium. PLoS ONE. 2013;8.7:e68095.","journal-title":"PLoS ONE"},{"key":"1211_CR39","unstructured":"DNAnexus. 
https:\/\/www.dnanexus.com\/. Accessed 25 Oct 2015."},{"key":"1211_CR40","unstructured":"UnifiedGenotyper error: Somehow the requested coordinate is not covered by the read. http:\/\/gatkforums.broadinstitute.org\/discussion\/3141\/unifiedgenotyper-error-somehow-the-requested-coordinate-is-not-covered-by-the-read. Accessed 25 Oct 2015."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-1211-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-016-1211-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-1211-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,1]],"date-time":"2024-02-01T17:59:43Z","timestamp":1706810383000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-016-1211-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,9,10]]},"references-count":40,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2016,12]]}},"alternative-id":["1211"],"URL":"https:\/\/doi.org\/10.1186\/s12859-016-1211-6","relation":{},"ISSN":["1471-2105"],"issn-type":[{"type":"electronic","value":"1471-2105"}],"subject":[],"published":{"date-parts":[[2016,9,10]]},"assertion":[{"value":"24 February 2016","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 August 2016","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 September 2016","order":3,"name":"first_online","label":"First 
Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"361"}}