{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T18:20:47Z","timestamp":1772907647096,"version":"3.50.1"},"reference-count":13,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":1076,"URL":"http:\/\/creativecommons.org\/licenses\/by\/3.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2014,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. 
We demonstrate SeqPig\u2019s scalability over many computing nodes and illustrate its use with example scripts.<\/jats:p>\n               <jats:p>Availability and Implementation: Available under the open source MIT license at http:\/\/sourceforge.net\/projects\/seqpig\/<\/jats:p>\n               <jats:p>Contact: \u00a0andre.schumacher@yahoo.com<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btt601","type":"journal-article","created":{"date-parts":[[2013,10,23]],"date-time":"2013-10-23T01:14:23Z","timestamp":1382490863000},"page":"119-120","source":"Crossref","is-referenced-by-count":75,"title":["SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop"],"prefix":"10.1093","volume":"30","author":[{"given":"Andr\u00e9","family":"Schumacher","sequence":"first","affiliation":[{"name":"1 Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, 2International Computer Science Institute, Berkeley, CA, USA, 3CRS4\u2014Center for Advanced Studies, Research and Development in Sardinia, Italy and 4CSC\u2014IT Center for Science, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Luca","family":"Pireddu","sequence":"additional","affiliation":[{"name":"1 Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, 2International Computer Science Institute, Berkeley, CA, USA, 3CRS4\u2014Center for Advanced Studies, Research and Development in Sardinia, Italy and 4CSC\u2014IT Center for Science, 
Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Matti","family":"Niemenmaa","sequence":"additional","affiliation":[{"name":"1 Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, 2International Computer Science Institute, Berkeley, CA, USA, 3CRS4\u2014Center for Advanced Studies, Research and Development in Sardinia, Italy and 4CSC\u2014IT Center for Science, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Aleksi","family":"Kallio","sequence":"additional","affiliation":[{"name":"1 Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, 2International Computer Science Institute, Berkeley, CA, USA, 3CRS4\u2014Center for Advanced Studies, Research and Development in Sardinia, Italy and 4CSC\u2014IT Center for Science, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Eija","family":"Korpelainen","sequence":"additional","affiliation":[{"name":"1 Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, 2International Computer Science Institute, Berkeley, CA, USA, 3CRS4\u2014Center for Advanced Studies, Research and Development in Sardinia, Italy and 4CSC\u2014IT Center for Science, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gianluigi","family":"Zanetti","sequence":"additional","affiliation":[{"name":"1 Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, 2International Computer Science Institute, Berkeley, CA, USA, 3CRS4\u2014Center for Advanced Studies, Research and Development in Sardinia, Italy and 4CSC\u2014IT Center for Science, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Keijo","family":"Heljanko","sequence":"additional","affiliation":[{"name":"1 Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, 
2International Computer Science Institute, Berkeley, CA, USA, 3CRS4\u2014Center for Advanced Studies, Research and Development in Sardinia, Italy and 4CSC\u2014IT Center for Science, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2013,10,22]]},"reference":[{"key":"2023012710380934000_btt601-B1","unstructured":"Andrews\n              S\n            \n          \n          Fastqc. a quality control tool for high throughput sequence data\n          2010\n          \n            http:\/\/www.bioinformatics.babraham.ac.uk\/projects\/fastqc (8 November 2013, date last accessed)"},{"key":"2023012710380934000_btt601-B2","doi-asserted-by":"crossref","DOI":"10.14778\/2367502.2367519","article-title":"Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads","volume-title":"Proceedings of the VLDB Endowment","author":"Chen","year":"2012"},{"key":"2023012710380934000_btt601-B3","doi-asserted-by":"crossref","first-page":"R134","DOI":"10.1186\/gb-2009-10-11-r134","article-title":"Searching for SNPs with cloud computing","volume":"10","author":"Langmead","year":"2009","journal-title":"Genome Biol."},{"key":"2023012710380934000_btt601-B4","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1038\/498255a","article-title":"Biology: the big challenges of big data","volume":"498","author":"Marx","year":"2013","journal-title":"Nature"},{"key":"2023012710380934000_btt601-B5","doi-asserted-by":"crossref","first-page":"876","DOI":"10.1093\/bioinformatics\/bts054","article-title":"Hadoop-BAM: directly manipulating next generation sequencing data in the cloud","volume":"28","author":"Niemenmaa","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012710380934000_btt601-B6","doi-asserted-by":"crossref","DOI":"10.1093\/bioinformatics\/btt528","article-title":"BioPig: a Hadoop-based analytic toolkit for large-scale sequence 
data","author":"Nordberg","year":"2013","journal-title":"Bioinformatics"},{"key":"2023012710380934000_btt601-B7","doi-asserted-by":"crossref","first-page":"S2","DOI":"10.1186\/1471-2105-11-S12-S2","article-title":"SeqWare query engine: storing and searching sequence data in the cloud","volume":"11","author":"O\u2019Connor","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023012710380934000_btt601-B8","doi-asserted-by":"crossref","first-page":"2159","DOI":"10.1093\/bioinformatics\/btr325","article-title":"SEAL: a distributed short read mapping and duplicate removal tool","volume":"27","author":"Pireddu","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012710380934000_btt601-B9","doi-asserted-by":"crossref","first-page":"419","DOI":"10.1186\/1471-2164-12-419","article-title":"SAMQA: error classification and validation of high-throughput sequenced read data","volume":"12","author":"Robinson","year":"2011","journal-title":"BMC Genomics"},{"key":"2023012710380934000_btt601-B10","doi-asserted-by":"crossref","first-page":"200","DOI":"10.1186\/1471-2105-13-200","article-title":"Cloudgene: a graphical execution platform for MapReduce programs on private and public clouds","volume":"13","author":"Sch\u00f6nherr","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2023012710380934000_btt601-B11","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1186\/gb-2010-11-5-207","article-title":"The case for cloud computing in genome informatics","volume":"11","author":"Stein","year":"2010","journal-title":"Genome Biol."},{"key":"2023012710380934000_btt601-B12","doi-asserted-by":"crossref","first-page":"S1","DOI":"10.1186\/1471-2105-11-S12-S1","article-title":"An overview of the Hadoop\/MapReduce\/HBase framework and its current applications in bioinformatics","volume":"11","author":"Taylor","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023012710380934000_btt601-B13","article-title":"Cloudbreak: accurate and scalable genomic 
structural variation detection in the cloud with MapReduce","author":"Whelan","year":"2013","journal-title":"arXiv:1307.2331"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/30\/1\/119\/48913395\/bioinformatics_30_1_119.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/30\/1\/119\/48913395\/bioinformatics_30_1_119.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T10:42:10Z","timestamp":1674816130000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/30\/1\/119\/237052"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,10,22]]},"references-count":13,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2014,1,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btt601","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2014,1,1]]},"published":{"date-parts":[[2013,10,22]]}}}