{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T03:11:10Z","timestamp":1768273870054,"version":"3.49.0"},"reference-count":29,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2016,2,4]],"date-time":"2016-02-04T00:00:00Z","timestamp":1454544000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2016,2,4]],"date-time":"2016-02-04T00:00:00Z","timestamp":1454544000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000098","name":"NIH Clinical Center","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000098","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100006235","name":"LBNL","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100006235","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000185","name":"Defense Advanced Research Projects Agency","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>The impressively low cost and improved quality of genome sequencing provides to researchers of genetic diseases, such as cancer, a powerful tool to better understand the 
underlying genetic mechanisms of those diseases and treat them with effective targeted therapies. Thus, a number of projects today sequence the DNA of large patient populations, each of which produces at least hundreds of terabytes of data. Now the challenge is to provide the produced data on demand to interested parties.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>In this paper, we show that the response to this challenge is a modified version of Spark SQL, a distributed SQL execution engine, that efficiently handles joins that use genomic intervals as keys. With this modification, Spark SQL serves such joins more than 50\u00d7 faster than its existing brute-force approach and 8\u00d7 faster than similar distributed implementations. Thus, Spark SQL can replace existing practices to retrieve genomic data and, as we show, allow users to reduce the number of lines of software code that need to be developed to query such data by an order of magnitude.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s12859-016-0904-1","type":"journal-article","created":{"date-parts":[[2016,2,4]],"date-time":"2016-02-04T01:01:29Z","timestamp":1454547689000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["GenAp: a distributed SQL interface for genomic data"],"prefix":"10.1186","volume":"17","author":[{"given":"Christos","family":"Kozanitis","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"David A.","family":"Patterson","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2016,2,4]]},"reference":[{"key":"904_CR1","unstructured":"BeatAML Project. 
http:\/\/www.ohsu.edu\/xd\/health\/services\/cancer\/about-us\/druker\/upload\/beat-aml-flyer-v5.pdf."},{"key":"904_CR2","unstructured":"MMRF CoMMpass Project. https:\/\/research.themmrf.org\/."},{"key":"904_CR3","unstructured":"ICGC Cancer Genome Projects. https:\/\/icgc.org\/."},{"key":"904_CR4","unstructured":"Sequence Read Archive (SRA). http:\/\/www.ncbi.nlm.nih.gov\/sra\/."},{"key":"904_CR5","doi-asserted-by":"publisher","first-page":"093","DOI":"10.1093\/database\/bau093","volume":"2014","author":"C Wilks","year":"2014","unstructured":"Wilks C, Cline MS, Weiler E, Diehkans M, Craft B, Martin C, et al.The cancer genomics hub (cghub): overcoming cancer through the power of torrential data. Database. 2014; 2014:093.","journal-title":"Database"},{"key":"904_CR6","unstructured":"Annai\u2019s Gene Torrent. A High Speed File Transfer Protocol. https:\/\/annaisystems.zendesk.com\/hc\/en-us\/articles\/204184548-What-is-GNOS-."},{"key":"904_CR7","doi-asserted-by":"crossref","unstructured":"Paten B, Diekhans M, Druker BJ, Friend S, Guinney J, Gassner N, et al.The nih bd2k center for big data in translational genomics. J Am Med Inf Assoc. 2015; 047.","DOI":"10.1093\/jamia\/ocv047"},{"key":"904_CR8","unstructured":"Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, et al.Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association: 2012. p. 2\u20132."},{"key":"904_CR9","unstructured":"Massie M, Nothaft F, Hartl C, Kozanitis C, Schumacher A, Joseph AD, et al.Adam: Genomics formats and processing patterns for cloud scale computing. 2013. EECS Department, University of California, Berkeley, Tech. Rep. UCB\/EECS-2013-207."},{"key":"904_CR10","doi-asserted-by":"crossref","unstructured":"Nothaft FA, Massie M, Danford T, Zhang Z, Laserson U, Yeksigian C, et al.Rethinking data-intensive science using scalable analytics systems. 
In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM: 2015. p. 631\u201346.","DOI":"10.1145\/2723372.2742787"},{"key":"904_CR11","unstructured":"Apache Avro. https:\/\/avro.apache.org\/."},{"key":"904_CR12","unstructured":"Apache Parquet. https:\/\/parquet.apache.org\/."},{"issue":"1\u20132","key":"904_CR13","doi-asserted-by":"publisher","first-page":"330","DOI":"10.14778\/1920841.1920886","volume":"3","author":"S Melnik","year":"2010","unstructured":"Melnik S, Gubarev A, Long JJ, Romer G, Shivakumar S, Tolton M, et al.Dremel: interactive analysis of web-scale datasets. Proc VLDB Endowment. 2010; 3(1\u20132):330\u20139.","journal-title":"Proc VLDB Endowment"},{"issue":"16","key":"904_CR14","doi-asserted-by":"publisher","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","volume":"25","author":"H Li","year":"2009","unstructured":"Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al.The sequence alignment\/map format and samtools. Bioinforma. 2009; 25(16):2078\u20139.","journal-title":"Bioinforma"},{"issue":"15","key":"904_CR15","doi-asserted-by":"publisher","first-page":"2156","DOI":"10.1093\/bioinformatics\/btr330","volume":"27","author":"P Danecek","year":"2011","unstructured":"Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al.The variant call format and vcftools. Bioinforma. 2011; 27(15):2156\u20138.","journal-title":"Bioinforma"},{"issue":"12","key":"904_CR16","doi-asserted-by":"publisher","first-page":"1691","DOI":"10.1093\/bioinformatics\/btr174","volume":"27","author":"DW Barnett","year":"2011","unstructured":"Barnett DW, Garrison EK, Quinlan AR, Stromberg MP, Marth GT. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 
2011; 27(12):1691\u201392.","journal-title":"Bioinformatics"},{"issue":"6","key":"904_CR17","doi-asserted-by":"publisher","first-page":"841","DOI":"10.1093\/bioinformatics\/btq033","volume":"26","author":"AR Quinlan","year":"2010","unstructured":"Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinforma. 2010; 26(6):841\u20132.","journal-title":"Bioinforma"},{"issue":"1","key":"904_CR18","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/bioinformatics\/btt250","volume":"30","author":"C Kozanitis","year":"2014","unstructured":"Kozanitis C, Heiberg A, Varghese G, Bafna V. Using genome query language to uncover genetic variation. Bioinforma. 2014; 30(1):1\u20138.","journal-title":"Bioinforma"},{"key":"904_CR19","doi-asserted-by":"crossref","unstructured":"Masseroli M, Pinoli P, Venco F, Kaitoua A, Jalili V, Palluzzi F, Muller H, et al. Genometric query language: A novel approach to large-scale genomic data management. Bioinforma. 2015; 048.","DOI":"10.1093\/bioinformatics\/btv048"},{"key":"904_CR20","unstructured":"Nextbio\u2019s Scalable SAAS Platform for Big Data. http:\/\/www.nextbio.com\/b\/corp\/products.nb."},{"key":"904_CR21","unstructured":"mongoDB. https:\/\/www.mongodb.org\/."},{"key":"904_CR22","unstructured":"Apache HBase. http:\/\/hbase.apache.org\/."},{"key":"904_CR23","unstructured":"Apache Cassandra. http:\/\/cassandra.apache.org\/."},{"issue":"1","key":"904_CR24","doi-asserted-by":"publisher","first-page":"83","DOI":"10.1145\/2398356.2398376","volume":"56","author":"V Bafna","year":"2013","unstructured":"Bafna V, Deutsch A, Heiberg A, Kozanitis C, Ohno-Machado L, Varghese G, et al.Abstractions for genomics. Communications of the ACM. 
2013; 56(1):83\u201393.","journal-title":"Communications of the ACM"},{"issue":"2","key":"904_CR25","doi-asserted-by":"publisher","first-page":"1626","DOI":"10.14778\/1687553.1687609","volume":"2","author":"A Thusoo","year":"2009","unstructured":"Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, et al.Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment. 2009; 2(2):1626\u20139.","journal-title":"Proc VLDB Endowment"},{"key":"904_CR26","unstructured":"Cloudera Impala. http:\/\/impala.io\/."},{"key":"904_CR27","doi-asserted-by":"crossref","unstructured":"Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, et al.Spark sql: Relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM: 2015. p. 1383\u201394.","DOI":"10.1145\/2723372.2742797"},{"key":"904_CR28","doi-asserted-by":"crossref","unstructured":"Xin RS, Rosen J, Zaharia M, Franklin MJ, Shenker S, Stoica I. Shark: Sql and rich analytics at scale. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM: 2013. p. 13\u201324.","DOI":"10.21236\/ADA570737"},{"key":"904_CR29","doi-asserted-by":"crossref","unstructured":"Yadwadkar NJ, Ananthanarayanan G, Katz R. Wrangler: Predictable and faster jobs using fewer resources. In: Proceedings of the ACM Symposium on Cloud Computing. ACM: 2014. p. 
1\u201314.","DOI":"10.1145\/2670979.2671005"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-0904-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-016-0904-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-0904-1","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-0904-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,1]],"date-time":"2024-02-01T18:11:22Z","timestamp":1706811082000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-016-0904-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,2,4]]},"references-count":29,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2016,12]]}},"alternative-id":["904"],"URL":"https:\/\/doi.org\/10.1186\/s12859-016-0904-1","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,2,4]]},"assertion":[{"value":"16 August 2015","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 January 2016","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 February 2016","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"63"}}