{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:50Z","timestamp":1772138090521,"version":"3.50.1"},"reference-count":39,"publisher":"Oxford University Press (OUP)","issue":"5","funder":[{"DOI":"10.13039\/100006206","name":"Biological and Environmental Research","doi-asserted-by":"publisher","award":["DE-AC02-05CH11231"],"award-info":[{"award-number":["DE-AC02-05CH11231"]}],"id":[{"id":"10.13039\/100006206","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2019,3,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100\u20131000\u00a0GB sequence data derived from tens of thousands of different genes or microbial species. Assembly of these data sets requires tradeoffs between scalability and accuracy. Current assembly methods optimized for scalability often sacrifice accuracy and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>Here we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomes and metagenomes from both short and long read sequencing technologies. It achieves near-linear scalability with input data size and number of compute nodes. SpaRC can run on both cloud computing and HPC environments without modification while delivering similar performance. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development\/deployment cycles for similar large-scale sequence data analysis problems.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>https:\/\/bitbucket.org\/berkeleylab\/jgi-sparc<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/bty733","type":"journal-article","created":{"date-parts":[[2018,8,22]],"date-time":"2018-08-22T15:19:20Z","timestamp":1534951160000},"page":"760-768","source":"Crossref","is-referenced-by-count":17,"title":["SpaRC: scalable sequence clustering using Apache Spark"],"prefix":"10.1093","volume":"35","author":[{"given":"Lizhen","family":"Shi","sequence":"first","affiliation":[{"name":"Department of Computer Science, School of Computer Science, Florida State University, Tallahassee, FL, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiandong","family":"Meng","sequence":"additional","affiliation":[{"name":"US Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA"},{"name":"Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Elizabeth","family":"Tseng","sequence":"additional","affiliation":[{"name":"Pacific Biosciences Inc, Menlo Park, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Mascagni","sequence":"additional","affiliation":[{"name":"Department of Computer Science, School of Computer Science, Florida State University, Tallahassee, FL, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6307-0458","authenticated-orcid":false,"given":"Zhong","family":"Wang","sequence":"additional","affiliation":[{"name":"US Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA"},{"name":"Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA"},{"name":"School of Natural Sciences, University of California at Merced, Merced, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2018,8,23]]},"reference":[{"key":"2023013107255375200_bty733-B1","first-page":"1013","author":"Abu-Doleh","year":"2015"},{"key":"2023013107255375200_bty733-B2","doi-asserted-by":"crossref","first-page":"1498","DOI":"10.1101\/gr.123638.111","article-title":"Accurate and comprehensive sequencing of personal genomes","volume":"21","author":"Ajay","year":"2011","journal-title":"Genome Res"},{"key":"2023013107255375200_bty733-B3","doi-asserted-by":"crossref","first-page":"59.","DOI":"10.1186\/s12859-017-1466-6","article-title":"A framework for space-efficient read clustering in metagenomic samples","volume":"18","author":"Alanko","year":"2017","journal-title":"BMC Bioinformatics"},{"key":"2023013107255375200_bty733-B4","first-page":"1383","author":"Armbrust","year":"2015"},{"key":"2023013107255375200_bty733-B5","first-page":"435","author":"Bahmani","year":"2016"},{"key":"2023013107255375200_bty733-B6","doi-asserted-by":"crossref","first-page":"1053.","DOI":"10.1038\/nbt.3329","article-title":"Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning","volume":"33","author":"Cleary","year":"2015","journal-title":"Nature Biotechnol"},{"key":"2023013107255375200_bty733-B7","first-page":"2","author":"Dave","year":"2016"},{"key":"2023013107255375200_bty733-B8","doi-asserted-by":"crossref","first-page":"318.","DOI":"10.1186\/s12859-017-1723-8","article-title":"Sparkblast: scalable blast processing using in-memory operations","volume":"18","author":"de Castro","year":"2017","journal-title":"BMC Bioinformatics"},{"key":"2023013107255375200_bty733-B9","doi-asserted-by":"crossref","first-page":"1569","DOI":"10.1093\/bioinformatics\/btv022","article-title":"Kmc 2: fast and resource-frugal k-mer counting","volume":"31","author":"Deorowicz","year":"2015","journal-title":"Bioinformatics"},{"key":"2023013107255375200_bty733-B10","first-page":"1","author":"Georganas","year":"2015"},{"key":"2023013107255375200_bty733-B11","doi-asserted-by":"crossref","first-page":"e0132628.","DOI":"10.1371\/journal.pone.0132628","article-title":"Widespread polycistronic transcripts in fungi revealed by single-molecule mrna sequencing","volume":"10","author":"Gordon","year":"2015","journal-title":"PLoS One"},{"key":"2023013107255375200_bty733-B12","doi-asserted-by":"crossref","first-page":"159","DOI":"10.1089\/cmb.2014.0251","article-title":"Dime: a novel framework for de novo metagenomic sequence assembly","volume":"22","author":"Guo","year":"2015","journal-title":"J. Comput. Biol"},{"key":"2023013107255375200_bty733-B13","doi-asserted-by":"crossref","first-page":"463","DOI":"10.1126\/science.1200387","article-title":"Metagenomic discovery of biomass-degrading genes and genomes from cow rumen","volume":"331","author":"Hess","year":"2011","journal-title":"Science"},{"key":"2023013107255375200_bty733-B14","doi-asserted-by":"crossref","first-page":"4904","DOI":"10.1073\/pnas.1402564111","article-title":"Tackling soil diversity with the assembly of large, complex metagenomes","volume":"111","author":"Howe","year":"2014","journal-title":"Proc. Natl. Acad. Sci.USA"},{"key":"2023013107255375200_bty733-B15","doi-asserted-by":"crossref","first-page":"4399","DOI":"10.1128\/AEM.67.10.4399-4406.2001","article-title":"Counting the uncountable: statistical approaches to estimating microbial diversity","volume":"67","author":"Hughes","year":"2001","journal-title":"Appl. Environ. Microbiol"},{"key":"2023013107255375200_bty733-B16","doi-asserted-by":"crossref","first-page":"303","DOI":"10.1093\/bioinformatics\/btw614","article-title":"Biospark: scalable analysis of large numerical datasets from biological simulations and experiments using hadoop and spark","volume":"33","author":"Klein","year":"2017","journal-title":"Bioinformatics"},{"key":"2023013107255375200_bty733-B17","doi-asserted-by":"crossref","first-page":"1674","DOI":"10.1093\/bioinformatics\/btv033","article-title":"Megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph","volume":"31","author":"Li","year":"2015","journal-title":"Bioinformatics"},{"key":"2023013107255375200_bty733-B18","doi-asserted-by":"crossref","first-page":"2103","DOI":"10.1093\/bioinformatics\/btw152","article-title":"Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences","volume":"32","author":"Li","year":"2016","journal-title":"Bioinformatics"},{"key":"2023013107255375200_bty733-B19","first-page":"135","author":"Malewicz","year":"2010"},{"key":"2023013107255375200_bty733-B20","doi-asserted-by":"crossref","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","article-title":"A fast, lock-free approach for efficient parallel counting of occurrences of k-mers","volume":"27","author":"Mar\u00e7ais","year":"2011","journal-title":"Bioinformatics"},{"key":"2023013107255375200_bty733-B21","doi-asserted-by":"crossref","first-page":"671","DOI":"10.1038\/nrg3068","article-title":"Next-generation transcriptome assembly","volume":"12","author":"Martin","year":"2011","journal-title":"Nat. Rev. Genet"},{"key":"2023013107255375200_bty733-B22","doi-asserted-by":"crossref","first-page":"4519","DOI":"10.1038\/srep04519","article-title":"A near complete snapshot of the zea mays seedling transcriptome revealed from ultra-deep sequencing","volume":"4","author":"Martin","year":"2014","journal-title":"Sci. Rep"},{"key":"2023013107255375200_bty733-B23","author":"Massie","year":"2013"},{"key":"2023013107255375200_bty733-B24","doi-asserted-by":"crossref","first-page":"315","DOI":"10.1016\/j.ygeno.2010.03.001","article-title":"Assembly algorithms for next-generation sequencing data","volume":"95","author":"Miller","year":"2010","journal-title":"Genomics"},{"key":"2023013107255375200_bty733-B25","doi-asserted-by":"crossref","first-page":"824","DOI":"10.1101\/gr.213959.116","article-title":"metaSPAdes: a new versatile metagenomic assembler","volume":"27","author":"Nurk","year":"2017","journal-title":"Genome Res"},{"key":"2023013107255375200_bty733-B26","first-page":"30","author":"Nystrom","year":"2015"},{"key":"2023013107255375200_bty733-B27","doi-asserted-by":"crossref","first-page":"036106.","DOI":"10.1103\/PhysRevE.76.036106","article-title":"Near linear time algorithm to detect community structures in large-scale networks","volume":"76","author":"Raghavan","year":"2007","journal-title":"Phys. Rev. E"},{"key":"2023013107255375200_bty733-B28","first-page":"549","author":"Rasheed","year":"2013"},{"key":"2023013107255375200_bty733-B29","doi-asserted-by":"crossref","first-page":"652","DOI":"10.1093\/bioinformatics\/btt020","article-title":"Dsk: k-mer counting with very low memory usage","volume":"29","author":"Rizk","year":"2013","journal-title":"Bioinformatics"},{"key":"2023013107255375200_bty733-B30","doi-asserted-by":"crossref","first-page":"1063","DOI":"10.1038\/nmeth.4458","article-title":"Critical assessment of metagenome interpretation\u2014a benchmark of metagenomics software","volume":"14","author":"Sczyrba","year":"2017","journal-title":"Nat. Methods"},{"key":"2023013107255375200_bty733-B31","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1016\/j.parco.2016.10.002","article-title":"A case study of tuning mapreduce for efficient bioinformatics in the cloud","volume":"61","author":"Shi","year":"2017","journal-title":"Parallel Comput"},{"key":"2023013107255375200_bty733-B32","doi-asserted-by":"crossref","first-page":"1517","DOI":"10.1101\/gr.168245.113","article-title":"Methane yield phenotypes linked to differential gene expression in the sheep rumen microbiome","volume":"24","author":"Shi","year":"2014","journal-title":"Genome Res"},{"key":"2023013107255375200_bty733-B33","doi-asserted-by":"crossref","first-page":"160081.","DOI":"10.1038\/sdata.2016.81","article-title":"Next generation sequencing data of a defined microbial mock community","volume":"3","author":"Singer","year":"2016","journal-title":"Sci. Data"},{"key":"2023013107255375200_bty733-B34","doi-asserted-by":"crossref","first-page":"1261359.","DOI":"10.1126\/science.1261359","article-title":"Structure and function of the global ocean microbiome","volume":"348","author":"Sunagawa","year":"2015","journal-title":"Science"},{"key":"2023013107255375200_bty733-B35","doi-asserted-by":"crossref","first-page":"805","DOI":"10.1038\/nrg1709","article-title":"Metagenomics: dna sequencing of environmental samples","volume":"6","author":"Tringe","year":"2005","journal-title":"Nat. Rev. Genet"},{"key":"2023013107255375200_bty733-B36","doi-asserted-by":"crossref","first-page":"i356","DOI":"10.1093\/bioinformatics\/bts397","article-title":"Metacluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample","volume":"28","author":"Wang","year":"2012","journal-title":"Bioinformatics"},{"key":"2023013107255375200_bty733-B37","first-page":"2","author":"Xin","year":"2013"},{"key":"2023013107255375200_bty733-B38","doi-asserted-by":"crossref","first-page":"438","DOI":"10.1093\/bioinformatics\/btw645","article-title":"Cloudphylo: a fast and scalable tool for phylogeny reconstruction","volume":"33","author":"Xu","year":"2016","journal-title":"Bioinformatics"},{"key":"2023013107255375200_bty733-B39","first-page":"2","author":"Zaharia","year":"2012"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/5\/760\/48966093\/bioinformatics_35_5_760.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/5\/760\/48966093\/bioinformatics_35_5_760.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,6]],"date-time":"2025-07-06T09:40:01Z","timestamp":1751794801000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/35\/5\/760\/5078476"}},"subtitle":[],"editor":[{"given":"Inanc","family":"Birol","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2018,8,23]]},"references-count":39,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2019,3,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bty733","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/246496","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2019,3,1]]},"published":{"date-parts":[[2018,8,23]]}}}