{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T16:13:40Z","timestamp":1761581620375,"version":"3.37.3"},"reference-count":38,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2017,11,2]],"date-time":"2017-11-02T00:00:00Z","timestamp":1509580800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/about_us\/legal\/notices"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,3,15]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Next Generation Sequencing (NGS) technology enables identification of microbial genomes from massive amount of human microbiomes more rapidly and cheaper than ever before. However, the traditional sequential genome analysis algorithms, tools, and platforms are inefficient for performing large-scale metagenomic studies on ever-growing sample data volumes. Currently, there is an urgent need for scalable analysis pipelines that enable harnessing all the power of parallel computation in computing clusters and in cloud computing environments. We propose ViraPipe, a scalable metagenome analysis pipeline that is able to analyze thousands of human microbiomes in parallel in tolerable time. The pipeline is tuned for analyzing viral metagenomes and the software is applicable for other metagenomic analyses as well. ViraPipe integrates parallel BWA-MEM read aligner, MegaHit De novo assembler, and BLAST and HMMER3 sequence search tools. We show the scalability of ViraPipe by running experiments on mining virus related genomes from NGS datasets in a distributed Spark computing cluster.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>ViraPipe analyses 768 human samples in 210 minutes on a Spark computing cluster comprising 23 nodes and 1288 cores in total. The speedup of ViraPipe executed on 23 nodes was 11x compared to the sequential analysis pipeline executed on a single node. The whole process includes parallel decompression, read interleaving, BWA-MEM read alignment, filtering and normalizing of non-human reads, De novo contigs assembling, and searching of sequences with BLAST and HMMER3 tools.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>https:\/\/github.com\/NGSeq\/ViraPipe<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btx702","type":"journal-article","created":{"date-parts":[[2017,11,1]],"date-time":"2017-11-01T12:10:24Z","timestamp":1509538224000},"page":"928-935","source":"Crossref","is-referenced-by-count":12,"title":["ViraPipe: scalable parallel pipeline for viral metagenome analysis from next generation sequencing reads"],"prefix":"10.1093","volume":"34","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8851-4265","authenticated-orcid":false,"given":"Altti Ilari","family":"Maarala","sequence":"first","affiliation":[{"name":"Department of Computer Science, Aalto University, Espoo, Finland"},{"name":"Helsinki Institute for Information Technology HIIT, Espoo, Finland"}]},{"given":"Zurab","family":"Bzhalava","sequence":"additional","affiliation":[{"name":"Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden"}]},{"given":"Joakim","family":"Dillner","sequence":"additional","affiliation":[{"name":"Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden"}]},{"given":"Keijo","family":"Heljanko","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Aalto University, Espoo, Finland"},{"name":"Helsinki Institute for Information Technology HIIT, Espoo, Finland"}]},{"given":"Davit","family":"Bzhalava","sequence":"additional","affiliation":[{"name":"Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden"}]}],"member":"286","published-online":{"date-parts":[[2017,11,2]]},"reference":[{"year":"2015","author":"Abu-Doleh","key":"2023012712464898300_btx702-B1"},{"key":"2023012712464898300_btx702-B2","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1016\/j.virol.2015.07.023","article-title":"Does human papillomavirus-negative condylomata exist?","volume":"485","author":"Arroyo M\u00fchr","year":"2015","journal-title":"Virology"},{"key":"2023012712464898300_btx702-B3","doi-asserted-by":"crossref","first-page":"2546","DOI":"10.1002\/ijc.29325","article-title":"Human papillomavirus type 197 is commonly present in skin tumors","volume":"136","author":"Arroyo M\u00fchr","year":"2015","journal-title":"Int. J. Cancer"},{"key":"2023012712464898300_btx702-B4","doi-asserted-by":"crossref","first-page":"e0172308.","DOI":"10.1371\/journal.pone.0172308","article-title":"Viruses in case series of tumors: consistent presence in different cancers in the same subject","volume":"12","author":"Arroyo M\u00fchr","year":"2017","journal-title":"PLoS One"},{"year":"2012","author":"Brown","key":"2023012712464898300_btx702-B5"},{"key":"2023012712464898300_btx702-B6","doi-asserted-by":"crossref","first-page":"427","DOI":"10.1016\/j.virol.2012.06.022","article-title":"Phylogenetically diverse TT virus viremia among pregnant women","volume":"432","author":"Bzhalava","year":"2012","journal-title":"Virology"},{"key":"2023012712464898300_btx702-B7","doi-asserted-by":"crossref","first-page":"e65953.","DOI":"10.1371\/journal.pone.0065953","article-title":"Unbiased approach for virus detection in skin lesions","volume":"8","author":"Bzhalava","year":"2013","journal-title":"PLoS One"},{"key":"2023012712464898300_btx702-B8","doi-asserted-by":"crossref","first-page":"5807.","DOI":"10.1038\/srep05807","article-title":"Deep sequencing extends the diversity of human papillomaviruses in human skin","volume":"4","author":"Bzhalava","year":"2014","journal-title":"Sci. Rep"},{"key":"2023012712464898300_btx702-B9","doi-asserted-by":"crossref","first-page":"S28.","DOI":"10.1186\/1471-2164-13-S7-S28","article-title":"A de novo next generation genomic sequence assembler based on string graph and mapreduce cloud computing framework","volume":"13","author":"Chang","year":"2012","journal-title":"BMC Genomics"},{"key":"2023012712464898300_btx702-B10","doi-asserted-by":"crossref","first-page":"2482","DOI":"10.1093\/bioinformatics\/btv179","article-title":"Halvade: scalable sequence analysis with mapreduce","volume":"31","author":"Decap","year":"2015","journal-title":"Bioinformatics"},{"key":"2023012712464898300_btx702-B11","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1371\/journal.pcbi.1002195","article-title":"Accelerated profile hmm searches","volume":"7","author":"Eddy","year":"2011","journal-title":"PLOS Comput. Biol"},{"key":"2023012712464898300_btx702-B12","doi-asserted-by":"crossref","first-page":"e0145490.","DOI":"10.1371\/journal.pone.0145490","article-title":"Parallel and scalable short-read alignment on multi-core clusters using upc\u2009++","volume":"11","author":"Gonzalez-Dom\u00ednguez","year":"2016","journal-title":"PloS One"},{"key":"2023012712464898300_btx702-B13","doi-asserted-by":"crossref","DOI":"10.1038\/nrg.2017.63","article-title":"Human genetic variation and the gut microbiome in disease","author":"Hall","year":"2017","journal-title":"Nat. Rev. Genet"},{"year":"2004","author":"Jeffrey","key":"2023012712464898300_btx702-B14"},{"key":"2023012712464898300_btx702-B15","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1016\/j.ymeth.2016.02.020","article-title":"MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices","volume":"102","author":"Li","year":"2016","journal-title":"Methods"},{"key":"2023012712464898300_btx702-B16","doi-asserted-by":"crossref","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with burrows-wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012712464898300_btx702-B17","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The sequence alignment\/map format and samtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012712464898300_btx702-B18","doi-asserted-by":"crossref","first-page":"2031","DOI":"10.1093\/bioinformatics\/btr319","article-title":"Comparative studies of de novo assembly tools for next-generation sequencing technologies","volume":"27","author":"Lin","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012712464898300_btx702-B20","doi-asserted-by":"crossref","first-page":"1297","DOI":"10.1101\/gr.107524.110","article-title":"The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data","volume":"20","author":"McKenna","year":"2010","journal-title":"Genome Res"},{"key":"2023012712464898300_btx702-B21","doi-asserted-by":"crossref","first-page":"e121.","DOI":"10.1093\/nar\/gkt263","article-title":"Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions","volume":"41","author":"Mistry","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2023012712464898300_btx702-B22","doi-asserted-by":"crossref","first-page":"D595","DOI":"10.1093\/nar\/gkv1195","article-title":"Ebi metagenomics in 2016 \u2013 an expanding and evolving resource for the analysis and archiving of metagenomic data","volume":"44","author":"Mitchell","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023012712464898300_btx702-B23","doi-asserted-by":"crossref","first-page":"876.","DOI":"10.1093\/bioinformatics\/bts054","article-title":"Hadoop-bam: directly manipulating next generation sequencing data in the cloud","volume":"28","author":"Niemenmaa","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012712464898300_btx702-B24","first-page":"168","article-title":"Microbial induction of immunity, inflammation, and cancer","volume":"1","author":"O\u2019keefe","year":"2011","journal-title":"Front. Physiol"},{"key":"2023012712464898300_btx702-B25","doi-asserted-by":"crossref","first-page":"2159","DOI":"10.1093\/bioinformatics\/btr325","article-title":"Seal: a distributed short read mapping and duplicate removal tool","volume":"27","author":"Pireddu","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012712464898300_btx702-B26","doi-asserted-by":"crossref","first-page":"1508","DOI":"10.1093\/bioinformatics\/btu071","article-title":"Supercomputing for the parallelization of whole genome analysis","volume":"30","author":"Puckelwartz","year":"2014","journal-title":"Bioinformatics"},{"key":"2023012712464898300_btx702-B27","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1146\/annurev-virology-031413-085550","article-title":"Viruses and the microbiota","volume":"1","author":"Robinson","year":"2014","journal-title":"Annu. Rev. Virol"},{"key":"2023012712464898300_btx702-B28","doi-asserted-by":"crossref","first-page":"e00408","DOI":"10.1128\/mBio.00408-12","article-title":"Exploring the parallel development of microbial systems in neonates with cystic fibrosis","volume":"3","author":"Rogers","year":"2012","journal-title":"MBio"},{"key":"2023012712464898300_btx702-B29","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1093\/bioinformatics\/btt601","article-title":"Seqpig: simple and scalable scripting for large sequencing data sets in hadoop","volume":"30","author":"Schumacher","year":"2014","journal-title":"Bioinformatics"},{"key":"2023012712464898300_btx702-B30","doi-asserted-by":"crossref","first-page":"e105067.","DOI":"10.1371\/journal.pone.0105067","article-title":"Profile hidden Markov models for the detection of viruses within metagenomic sequence data","volume":"9","author":"Skewes-Cox","year":"2014","journal-title":"PLoS One"},{"key":"2023012712464898300_btx702-B31","doi-asserted-by":"crossref","first-page":"25235.","DOI":"10.1038\/srep25235","article-title":"Detection of DNA viruses in prostate cancer","volume":"6","author":"Smelov","year":"2016","journal-title":"Sci. Rep"},{"key":"2023012712464898300_btx702-B32","doi-asserted-by":"crossref","first-page":"e1002195","DOI":"10.1371\/journal.pbio.1002195","article-title":"Big data: astronomical or genomical?","volume":"13","author":"Stephens","year":"2015","journal-title":"PloS Biol"},{"key":"2023012712464898300_btx702-B33","doi-asserted-by":"crossref","first-page":"3.","DOI":"10.1186\/2042-5783-2-3","article-title":"Metagenomics \u2013 a guide from sampling to data analysis","volume":"2","author":"Thomas","year":"2012","journal-title":"Microb. Inform. Exp"},{"key":"2023012712464898300_btx702-B34","doi-asserted-by":"crossref","first-page":"1863","DOI":"10.1093\/bioinformatics\/btg244","article-title":"Soap-HT-BLAST: high throughput BLAST based on Web services","volume":"19","author":"Wang","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012712464898300_btx702-B35","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1016\/j.trsl.2012.03.006","article-title":"Emerging view of the human virome","volume":"160","author":"Wylie","year":"2012","journal-title":"Transl. Res"},{"key":"2023012712464898300_btx702-B36","doi-asserted-by":"crossref","first-page":"e27735.","DOI":"10.1371\/journal.pone.0027735","article-title":"Sequence analysis of the human virome in febrile and afebrile children","volume":"7","author":"Wylie","year":"2012","journal-title":"PLoS One"},{"year":"2010","author":"Zaharia","key":"2023012712464898300_btx702-B37"},{"year":"2012","author":"Zaharia","key":"2023012712464898300_btx702-B19"},{"key":"2023012712464898300_btx702-B38","doi-asserted-by":"crossref","first-page":"1090.","DOI":"10.1093\/bioinformatics\/btw750","article-title":"Metaspark: a spark-based distributed processing tool to recruit metagenomic reads to reference genomes","volume":"33","author":"Zhou","year":"2017","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/6\/928\/48913596\/bioinformatics_34_6_928.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/6\/928\/48913596\/bioinformatics_34_6_928.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,8,27]],"date-time":"2023-08-27T21:08:35Z","timestamp":1693170515000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/34\/6\/928\/4587582"}},"subtitle":[],"editor":[{"given":"Bonnie","family":"Berger","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2017,11,2]]},"references-count":38,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2018,3,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btx702","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2018,3,15]]},"published":{"date-parts":[[2017,11,2]]}}}