{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T07:30:49Z","timestamp":1768289449357,"version":"3.49.0"},"reference-count":24,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2019,10,11]],"date-time":"2019-10-11T00:00:00Z","timestamp":1570752000000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"},{"start":{"date-parts":[[2019,10,11]],"date-time":"2019-10-11T00:00:00Z","timestamp":1570752000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Science Foundation","award":["CCF-1139158"],"award-info":[{"award-number":["CCF-1139158"]}]},{"DOI":"10.13039\/100006235","name":"Lawrence Berkeley National Laboratory","doi-asserted-by":"crossref","award":["7076018"],"award-info":[{"award-number":["7076018"]}],"id":[{"id":"10.13039\/100006235","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000185","name":"Defense Advanced Research Projects Agency","doi-asserted-by":"publisher","award":["FA8750-12-2-0331"],"award-info":[{"award-number":["FA8750-12-2-0331"]}],"id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000051","name":"National Human Genome Research Institute","doi-asserted-by":"publisher","award":["U54HG007990-01"],"award-info":[{"award-number":["U54HG007990-01"]}],"id":[{"id":"10.13039\/100000051","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["HHSN261201400006C"],"award-info":[{"award-number":["HHSN261201400006C"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2019,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n              <jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results.<\/jats:p>\n              <\/jats:sec>\n              <jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3\u2009min on a 16-core workstation (35.3\u00d7 speedup vs. XHMM), 12.7\u2009min using 10 executor cores on a Spark cluster (18.8\u00d7 speedup vs. XHMM), and 9.8\u2009min using 32 executor cores on Amazon AWS\u2019 Elastic MapReduce. We performed CNV discovery from the original BAM files in 292\u2009min using 640 executor cores on a Spark cluster.<\/jats:p>\n              <\/jats:sec>\n              <jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>We describe DECA\u2019s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark\u2019s configuration parameters.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s12859-019-3108-7","type":"journal-article","created":{"date-parts":[[2019,10,11]],"date-time":"2019-10-11T07:11:42Z","timestamp":1570777902000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark"],"prefix":"10.1186","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9643-7148","authenticated-orcid":false,"given":"Michael D.","family":"Linderman","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Davin","family":"Chia","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Forrest","family":"Wallace","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Frank A.","family":"Nothaft","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2019,10,11]]},"reference":[{"key":"3108_CR1","doi-asserted-by":"publisher","first-page":"597","DOI":"10.1016\/j.ajhg.2012.08.005","volume":"91","author":"M Fromer","year":"2012","unstructured":"Fromer M, Moran JL, Chambert K, Banks E, Bergen SE, Ruderfer DM, et al. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. Am J Hum Genet. 2012;91:597\u2013607. \n                    https:\/\/doi.org\/10.1016\/j.ajhg.2012.08.005\n                    \n                  .","journal-title":"Am J Hum Genet"},{"key":"3108_CR2","doi-asserted-by":"publisher","first-page":"1107","DOI":"10.1038\/ng.3638","volume":"48","author":"DM Ruderfer","year":"2016","unstructured":"Ruderfer DM, Hamamsy T, Lek M, Karczewski KJ, Kavanagh D, Samocha KE, et al. Patterns of genic intolerance of rare copy number variation in 59,898 human exomes. Nat Genet. 2016;48:1107\u201311. \n                    https:\/\/doi.org\/10.1038\/ng.3638\n                    \n                  .","journal-title":"Nat Genet"},{"issue":"Suppl 11","key":"3108_CR3","doi-asserted-by":"publisher","first-page":"S1","DOI":"10.1186\/1471-2105-14-S11-S1","volume":"14","author":"M Zhao","year":"2013","unstructured":"Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics. 2013;14(Suppl 11):S1. \n                    https:\/\/doi.org\/10.1186\/1471-2105-14-S11-S1\n                    \n                  .","journal-title":"BMC Bioinformatics"},{"key":"3108_CR4","doi-asserted-by":"publisher","first-page":"btv547","DOI":"10.1093\/bioinformatics\/btv547","volume":"32","author":"JS Packer","year":"2015","unstructured":"Packer JS, Maxwell EK, O\u2019Dushlaine C, Lopez AE, Dewey FE, Chernomorsky R, et al. CLAMMS: a scalable algorithm for calling common and rare copy number variants from exome sequencing data. Bioinformatics. 2015;32:btv547. \n                    https:\/\/doi.org\/10.1093\/bioinformatics\/btv547\n                    \n                  .","journal-title":"Bioinformatics"},{"key":"3108_CR5","doi-asserted-by":"publisher","first-page":"631","DOI":"10.1145\/2723372.2742787","volume-title":"Proceedings of the 2015 ACM SIGMOD international conference on Management of Data","author":"FA Nothaft","year":"2015","unstructured":"Nothaft FA, Massie M, Danford T, Zhang Z, Laserson U, Yeksigian C, et al. Rethinking data-intensive science using scalable analytics systems. In: Proceedings of the 2015 ACM SIGMOD international conference on Management of Data. Melbourne: ACM; 2015. p. 631\u201346. \n                    https:\/\/doi.org\/10.1145\/2723372.2742787\n                    \n                  ."},{"key":"3108_CR6","volume-title":"ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing","author":"M Massie","year":"2013","unstructured":"Massie M, Nothaft F, Hartl C, Kozanitis C, Schumacher A, Joseph AD, et al. ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing. 2013. \n                    http:\/\/www.eecs.berkeley.edu\/Pubs\/TechRpts\/2013\/EECS-2013-207.html\n                    \n                  ."},{"key":"3108_CR7","doi-asserted-by":"publisher","first-page":"2652","DOI":"10.1093\/bioinformatics\/btu343","volume":"30","author":"MS Wiewi\u00f3rka","year":"2014","unstructured":"Wiewi\u00f3rka MS, Messina A, Pacholewska A, Maffioletti S, Gawrysiak P, Okoniewski MJ. SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics. 2014;30:2652\u20133. \n                    https:\/\/doi.org\/10.1093\/bioinformatics\/btu343\n                    \n                  .","journal-title":"Bioinformatics."},{"key":"3108_CR8","doi-asserted-by":"publisher","first-page":"1052","DOI":"10.1186\/s12864-015-2269-7","volume":"16","author":"AR O\u2019Brien","year":"2015","unstructured":"O\u2019Brien AR, Saunders NFW, Guo Y, Buske FA, Scott RJ, Bauer DC. VariantSpark: population scale clustering of genotype information. BMC Genomics. 2015;16:1052. \n                    https:\/\/doi.org\/10.1186\/s12864-015-2269-7\n                    \n                  .","journal-title":"BMC Genomics"},{"key":"3108_CR9","doi-asserted-by":"publisher","unstructured":"Bahmani A, Sibley AB, Parsian M, Owzar K, Mueller F. SparkScore: Leveraging Apache Spark for Distributed Genomic Inference. In: 2016 IEEE international parallel and distributed processing symposium workshops (IPDPSW), vol. 2016: IEEE. p. 435\u201342. \n                    https:\/\/doi.org\/10.1109\/IPDPSW.2016.6\n                    \n                  .","DOI":"10.1109\/IPDPSW.2016.6"},{"key":"3108_CR10","doi-asserted-by":"publisher","unstructured":"Li X, Tan G, Zhang C, Xu L, Zhang Z, Sun N. Accelerating large-scale genomic analysis with Spark. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM): IEEE; 2016. p. 747\u201351. \n                    https:\/\/doi.org\/10.1109\/BIBM.2016.7822614\n                    \n                  .","DOI":"10.1109\/BIBM.2016.7822614"},{"key":"3108_CR11","unstructured":"Hail. \n                    https:\/\/github.com\/hail-is\/hail\n                    \n                  . Accessed 8 Jun 2018."},{"key":"3108_CR12","doi-asserted-by":"publisher","first-page":"115","DOI":"10.1016\/j.ajhg.2017.05.017","volume":"101","author":"D Zhang","year":"2017","unstructured":"Zhang D, Zhao L, Li B, He Z, Wang GT, Liu DJ, et al. SEQSpark: a complete analysis tool for large-scale rare variant association studies using whole-genome and exome sequence data. Am J Hum Genet. 2017;101:115\u201322. \n                    https:\/\/doi.org\/10.1016\/j.ajhg.2017.05.017\n                    \n                  .","journal-title":"Am J Hum Genet"},{"key":"3108_CR13","doi-asserted-by":"publisher","first-page":"303","DOI":"10.1093\/bioinformatics\/btw614","volume":"33","author":"M Klein","year":"2017","unstructured":"Klein M, Sharma R, Bohrer CH, Avelis CM, Roberts E. Biospark: scalable analysis of large numerical datasets from biological simulations and experiments using Hadoop and spark. Bioinformatics. 2017;33:303\u20135. \n                    https:\/\/doi.org\/10.1093\/bioinformatics\/btw614\n                    \n                  .","journal-title":"Bioinformatics."},{"issue":"13 Supplement","key":"3108_CR14","doi-asserted-by":"publisher","first-page":"3580","DOI":"10.1158\/1538-7445.AM2017-3580","volume":"77","author":"M Babadi","year":"2017","unstructured":"Babadi M, Benjamin DI, Lee SK, Smirnov A, Chevalier A, Lichtenstein L, et al. Abstract 3580: GATK CNV: copy-number variation discovery from coverage data. Cancer Res. 2017;77(13 Supplement):3580 LP \u2013 3580. \n                    https:\/\/doi.org\/10.1158\/1538-7445.AM2017-3580\n                    \n                  .","journal-title":"Cancer Res"},{"key":"3108_CR15","doi-asserted-by":"publisher","unstructured":"Guo R, Zhao Y, Zou Q, Fang X, Peng S. Bioinformatics applications on apache spark. Gigascience. 2018;7. \n                    https:\/\/doi.org\/10.1093\/gigascience\/giy098\n                    \n                  .","DOI":"10.1093\/gigascience\/giy098"},{"key":"3108_CR16","first-page":"2","volume-title":"Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation","author":"M Zaharia","year":"2012","unstructured":"Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation; 2012. p. 2. \n                    http:\/\/dl.acm.org\/citation.cfm?id=2228301\n                    \n                  . Accessed 7 Aug 2017."},{"key":"3108_CR17","doi-asserted-by":"publisher","first-page":"876","DOI":"10.1093\/bioinformatics\/bts054","volume":"28","author":"M Niemenmaa","year":"2012","unstructured":"Niemenmaa M, Kallio A, Schumacher A, Klemel\u00e4 P, Korpelainen E, Heljanko K. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics. 2012;28:876\u20137. \n                    https:\/\/doi.org\/10.1093\/bioinformatics\/bts054\n                    \n                  .","journal-title":"Bioinformatics."},{"key":"3108_CR18","first-page":"1","volume":"17","author":"X Meng","year":"2016","unstructured":"Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, et al. MLlib: machine learning in apache spark. J Mach Learn Res. 2016;17:1\u20137 \n                    http:\/\/www.jmlr.org\/papers\/v17\/15-237.html\n                    \n                  . Accessed 7 Aug 2017.","journal-title":"J Mach Learn Res"},{"key":"3108_CR19","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1109\/5.18626","volume":"77","author":"LR Rabiner","year":"1989","unstructured":"Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989;77:257\u201386. \n                    https:\/\/doi.org\/10.1109\/5.18626\n                    \n                  .","journal-title":"Proc IEEE"},{"key":"3108_CR20","unstructured":"Fromer M, Purcell SM. XHMM. \n                    https:\/\/atgu.mgh.harvard.edu\/xhmm\/index.shtml\n                    \n                  . Accessed 8 May 2019."},{"key":"3108_CR21","doi-asserted-by":"publisher","first-page":"68","DOI":"10.1038\/nature15393","volume":"526","author":"A Auton","year":"2015","unstructured":"Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, et al. A global reference for human genetic variation. Nature. 2015;526:68\u201374. \n                    https:\/\/doi.org\/10.1038\/nature15393\n                    \n                  .","journal-title":"Nature."},{"key":"3108_CR22","doi-asserted-by":"publisher","first-page":"1297","DOI":"10.1101\/gr.107524.110","volume":"20","author":"A McKenna","year":"2010","unstructured":"McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297\u2013303. \n                    https:\/\/doi.org\/10.1101\/gr.107524.110\n                    \n                  .","journal-title":"Genome Res"},{"key":"3108_CR23","doi-asserted-by":"publisher","first-page":"7.23.1","DOI":"10.1002\/0471142905.hg0723s81","volume":"81","author":"M Fromer","year":"2014","unstructured":"Fromer M, Purcell SM. Using XHMM software to detect copy number variation in whole-exome sequencing data. Curr Protoc Hum Genet. 2014;81:7.23.1\u20137.23.21. \n                    https:\/\/doi.org\/10.1002\/0471142905.hg0723s81\n                    \n                  .","journal-title":"Curr Protoc Hum Genet"},{"key":"3108_CR24","unstructured":"Databricks Inc. Databricks. \n                    https:\/\/databricks.com\n                    \n                  . Accessed 8 Jun 2018."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-019-3108-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s12859-019-3108-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-019-3108-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,10,9]],"date-time":"2020-10-09T23:06:18Z","timestamp":1602284778000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-019-3108-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,11]]},"references-count":24,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,12]]}},"alternative-id":["3108"],"URL":"https:\/\/doi.org\/10.1186\/s12859-019-3108-7","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,10,11]]},"assertion":[{"value":"24 June 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 September 2019","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 October 2019","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Not applicable.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"FAN was a consultant for and is now employed by Databricks, Inc.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"493"}}