{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T18:20:19Z","timestamp":1772907619441,"version":"3.50.1"},"reference-count":22,"publisher":"Oxford University Press (OUP)","issue":"23","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2013,12,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most of the current bioinformatics tools become obsolete as they fail to scale with data. To tackle this \u2018data deluge\u2019, here we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation.<\/jats:p>\n               <jats:p>Results: We built BioPig on the Apache\u2019s Hadoop MapReduce system and the Pig data flow language. Compared with traditional serial and MPI-based algorithms, BioPig has three major advantages: first, BioPig\u2019s programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb sequences demonstrates that it scales automatically with size of data; and finally, BioPig can be ported without modification on many Hadoop infrastructures, as tested with Magellan system at National Energy Research Scientific Computing Center and the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel program framework with the potential to greatly accelerate data-intensive bioinformatics analysis.<\/jats:p>\n               <jats:p>Availability and implementation: BioPig is released as open-source software under the BSD license at https:\/\/sites.google.com\/a\/lbl.gov\/biopig\/<\/jats:p>\n               <jats:p>Contact: \u00a0ZhongWang@lbl.gov<\/jats:p>","DOI":"10.1093\/bioinformatics\/btt528","type":"journal-article","created":{"date-parts":[[2013,9,11]],"date-time":"2013-09-11T09:53:20Z","timestamp":1378893200000},"page":"3014-3019","source":"Crossref","is-referenced-by-count":83,"title":["BioPig: a Hadoop-based analytic toolkit for large-scale sequence data"],"prefix":"10.1093","volume":"29","author":[{"given":"Henrik","family":"Nordberg","sequence":"first","affiliation":[{"name":"1 Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA and 2Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA"},{"name":"1 Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA and 2Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA"}]},{"given":"Karan","family":"Bhatia","sequence":"additional","affiliation":[{"name":"1 Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA and 2Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA"}]},{"given":"Kai","family":"Wang","sequence":"additional","affiliation":[{"name":"1 Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA and 2Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA"}]},{"given":"Zhong","family":"Wang","sequence":"additional","affiliation":[{"name":"1 Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA and 2Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA"},{"name":"1 Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA and 2Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA"}]}],"member":"286","published-online":{"date-parts":[[2013,9,10]]},"reference":[{"key":"2023012810484222400_btt528-B1","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1145\/1327452.1327492","article-title":"MapReduce: simplified data processing on large clusters","volume":"51","author":"Dean","year":"2008","journal-title":"Commun. ACM"},{"key":"2023012810484222400_btt528-B2","doi-asserted-by":"crossref","first-page":"1061","DOI":"10.1038\/nature09534","article-title":"A map of human genome variation from population-scale sequencing","volume":"467","author":"1000 Genomes Project Consortium. et al.","year":"2010","journal-title":"Nature"},{"key":"2023012810484222400_btt528-B3","first-page":"147","article-title":"Error detecting and error correcting codes","volume":"29","author":"Hamming","year":"1950","journal-title":"AT&T Tech. J."},{"key":"2023012810484222400_btt528-B4","doi-asserted-by":"crossref","first-page":"463","DOI":"10.1126\/science.1200387","article-title":"Metagenomic discovery of biomass-degrading genes and genomes from cow rumen","volume":"331","author":"Hess","year":"2011","journal-title":"Science"},{"key":"2023012810484222400_btt528-B5","doi-asserted-by":"crossref","first-page":"1542","DOI":"10.1093\/bioinformatics\/bts165","article-title":"Eoulsan: a cloud computing-based framework facilitating high throughput sequencing analyses","volume":"28","author":"Jourdren","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012810484222400_btt528-B6","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1089\/omi.2011.0101","article-title":"Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins","volume":"15","author":"Kolker","year":"2011","journal-title":"Omics"},{"key":"2023012810484222400_btt528-B7","doi-asserted-by":"crossref","first-page":"R134","DOI":"10.1186\/gb-2009-10-11-r134","article-title":"Searching for SNPs with cloud computing","volume":"10","author":"Langmead","year":"2009","journal-title":"Genome Biol."},{"key":"2023012810484222400_btt528-B8","doi-asserted-by":"crossref","first-page":"R83","DOI":"10.1186\/gb-2010-11-8-r83","article-title":"Cloud-scale RNA-sequencing differential expression analysis with Myrna","volume":"11","author":"Langmead","year":"2010","journal-title":"Genome Biol."},{"key":"2023012810484222400_btt528-B9","doi-asserted-by":"crossref","first-page":"415","DOI":"10.1109\/ICPPW.2009.37","article-title":"Biodoop: bioinformatics on hadoop","volume-title":"Parallel Processing Workshops, 2009. ICPPW'09. International Conference on IEEE","author":"Leo","year":"2009"},{"key":"2023012810484222400_btt528-B10","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The sequence alignment\/map format and SAMtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012810484222400_btt528-B11","doi-asserted-by":"crossref","first-page":"1297","DOI":"10.1101\/gr.107524.110","article-title":"The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data","volume":"20","author":"McKenna","year":"2010","journal-title":"Genome Res."},{"key":"2023012810484222400_btt528-B12","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1038\/nrg2626","article-title":"Sequencing technologies - the next generation","volume":"11","author":"Metzker","year":"2010","journal-title":"Nat. Rev. Genet."},{"key":"2023012810484222400_btt528-B13","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1016\/0076-6879(87)55023-6","article-title":"Specific synthesis of DNA in vitro via a polymerase-catalyzed chain-reaction","volume":"155","author":"Mullis","year":"1987","journal-title":"Method Enzymol."},{"key":"2023012810484222400_btt528-B14","doi-asserted-by":"crossref","first-page":"171","DOI":"10.1186\/1756-0500-4-171","article-title":"CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping","volume":"4","author":"Nguyen","year":"2011","journal-title":"BMC Res. Notes"},{"key":"2023012810484222400_btt528-B15","doi-asserted-by":"crossref","first-page":"876","DOI":"10.1093\/bioinformatics\/bts054","article-title":"Hadoop-BAM: directly manipulating next generation sequencing data in the cloud","volume":"28","author":"Niemenmaa","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012810484222400_btt528-B16","doi-asserted-by":"crossref","first-page":"14793","DOI":"10.1073\/pnas.1005297107","article-title":"Adaptation to herbivory by the Tammar wallaby includes bacterial and glycoside hydrolase profiles different from other herbivores","volume":"107","author":"Pope","year":"2010","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012810484222400_btt528-B17","doi-asserted-by":"crossref","first-page":"1363","DOI":"10.1093\/bioinformatics\/btp236","article-title":"CloudBurst: highly sensitive read mapping with MapReduce","volume":"25","author":"Schatz","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012810484222400_btt528-B18","first-page":"1471","article-title":"A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes","volume":"9","author":"Stefan","year":"2008","journal-title":"BMC Genomics"},{"key":"2023012810484222400_btt528-B19","doi-asserted-by":"crossref","first-page":"S1","DOI":"10.1186\/1471-2105-11-S12-S1","article-title":"An overview of the Hadoop\/MapReduce\/HBase framework and its current applications in bioinformatics","volume":"11","author":"Taylor","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023012810484222400_btt528-B20","doi-asserted-by":"crossref","first-page":"560","DOI":"10.1038\/nature06269","article-title":"Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite","volume":"450","author":"Warnecke","year":"2007","journal-title":"Nature"},{"key":"2023012810484222400_btt528-B21","first-page":"2","article-title":"Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing","volume-title":"Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation","author":"Zaharia","year":"2012"},{"key":"2023012810484222400_btt528-B22","doi-asserted-by":"crossref","first-page":"821","DOI":"10.1101\/gr.074492.107","article-title":"Velvet: algorithms for de novo short read assembly using de Bruijn graphs","volume":"18","author":"Zerbino","year":"2008","journal-title":"Genome Res."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/29\/23\/3014\/48894652\/bioinformatics_29_23_3014.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/29\/23\/3014\/48894652\/bioinformatics_29_23_3014.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,28]],"date-time":"2023-01-28T12:49:08Z","timestamp":1674910148000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/29\/23\/3014\/247182"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,9,10]]},"references-count":22,"journal-issue":{"issue":"23","published-print":{"date-parts":[[2013,12,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btt528","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2013,12,1]]},"published":{"date-parts":[[2013,9,10]]}}}