{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,30]],"date-time":"2025-10-30T07:10:07Z","timestamp":1761808207529,"version":"3.37.3"},"reference-count":28,"publisher":"Oxford University Press (OUP)","issue":"3","license":[{"start":{"date-parts":[[2018,7,13]],"date-time":"2018-07-13T00:00:00Z","timestamp":1531440000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Science Foundation of China","doi-asserted-by":"publisher","award":["1R01AI125982","1R01DE024523","# 11471313"],"award-info":[{"award-number":["1R01AI125982","1R01DE024523","# 11471313"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2019,2,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step to be performed in sequence analysis. However, existing methods scale poorly with respect to the unprecedented growth of input data size. As high-performance computing systems are becoming widely accessible, it is highly desired that a clustering method can easily scale to handle large-scale sequence datasets by leveraging the power of parallel computing.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and comes with theoretical guarantees on both accuracy and efficiency. 
The proposed framework was implemented on Apache Spark, which allows for easy and efficient utilization of parallel computing resources. Experiments performed on various datasets demonstrated that SLAD can significantly speed up a number of popular de novo OTU picking methods and meanwhile maintains the same level of accuracy. In particular, the experiment on the Earth Microbiome Project dataset (\u223c2.2B reads, 437 GB) demonstrated the excellent scalability of the proposed method.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>Open-source software for the proposed method is freely available at https:\/\/www.acsu.buffalo.edu\/~yijunsun\/lab\/SLAD.html.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/bty617","type":"journal-article","created":{"date-parts":[[2018,7,13]],"date-time":"2018-07-13T11:24:02Z","timestamp":1531481042000},"page":"380-388","source":"Crossref","is-referenced-by-count":8,"title":["A parallel computational framework for ultra-large-scale sequence clustering analysis"],"prefix":"10.1093","volume":"35","author":[{"given":"Wei","family":"Zheng","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, The State University of New York, NY, Buffalo, NY, USA"}]},{"given":"Qi","family":"Mao","sequence":"additional","affiliation":[{"name":"Department of Microbiology and Immunology, The State University of New York, NY, Buffalo, NY, USA"}]},{"given":"Robert J","family":"Genco","sequence":"additional","affiliation":[{"name":"Department of Oral Biology, The State University of New York, NY, Buffalo, NY, USA"}]},{"given":"Jean","family":"Wactawski-Wende","sequence":"additional","affiliation":[{"name":"Department of Epidemiology and Environmental Health, The State University of New York, NY, Buffalo, NY, 
USA"}]},{"given":"Michael","family":"Buck","sequence":"additional","affiliation":[{"name":"Department of Biochemistry, University at Buffalo, The State University of New York, NY, Buffalo, NY, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8797-4243","authenticated-orcid":false,"given":"Yunpeng","family":"Cai","sequence":"additional","affiliation":[{"name":"Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China"}]},{"given":"Yijun","family":"Sun","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, The State University of New York, NY, Buffalo, NY, USA"},{"name":"Department of Microbiology and Immunology, The State University of New York, NY, Buffalo, NY, USA"}]}],"member":"286","published-online":{"date-parts":[[2018,7,13]]},"reference":[{"key":"2023013107235710600_bty617-B1","first-page":"1068","volume-title":"Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms","author":"Balcan","year":"2009"},{"key":"2023013107235710600_bty617-B2","doi-asserted-by":"crossref","first-page":"e95.","DOI":"10.1093\/nar\/gkr349","article-title":"ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time","volume":"39","author":"Cai","year":"2011","journal-title":"Nucleic Acids Res"},{"key":"2023013107235710600_bty617-B3","doi-asserted-by":"crossref","first-page":"e1005518.","DOI":"10.1371\/journal.pcbi.1005518","article-title":"ESPRIT-Forest: parallel clustering of massive amplicon sequence data in subquadratic time","volume":"13","author":"Cai","year":"2017","journal-title":"PLoS Comput. Biol"},{"key":"2023013107235710600_bty617-B4","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1038\/nmeth.f.303","article-title":"QIIME allows analysis of high-throughput community sequencing data","volume":"7","author":"Caporaso","year":"2010","journal-title":"Nat. 
Methods"},{"key":"2023013107235710600_bty617-B5","doi-asserted-by":"crossref","first-page":"baq013.","DOI":"10.1093\/database\/baq013","article-title":"The Human Oral Microbiome Database: a web accessible resource for investigating oral microbe taxonomic and genomic information","volume":"2010","author":"Chen","year":"2010","journal-title":"Database"},{"key":"2023013107235710600_bty617-B6","doi-asserted-by":"crossref","first-page":"347","DOI":"10.1016\/j.mimet.2013.07.004","article-title":"MSClust: a multi-seeds based clustering algorithm for microbiome profiling using 16S rRNA sequence","volume":"94","author":"Chen","year":"2013","journal-title":"J. Microbiol. Methods"},{"key":"2023013107235710600_bty617-B7","doi-asserted-by":"crossref","first-page":"401","DOI":"10.1016\/j.mimet.2013.08.011","article-title":"High throughput sequencing methods and analysis for microbiome research","volume":"95","author":"Di Bella","year":"2013","journal-title":"J. Microbiol. Methods"},{"key":"2023013107235710600_bty617-B8","doi-asserted-by":"crossref","first-page":"2460","DOI":"10.1093\/bioinformatics\/btq461","article-title":"Search and clustering orders of magnitude faster than BLAST","volume":"26","author":"Edgar","year":"2010","journal-title":"Bioinformatics"},{"key":"2023013107235710600_bty617-B9","doi-asserted-by":"crossref","first-page":"1440","DOI":"10.1126\/science.342.6165.1440-b","article-title":"Your microbes, your health","volume":"342","author":"Editorial","year":"2013","journal-title":"Science"},{"key":"2023013107235710600_bty617-B10","doi-asserted-by":"crossref","first-page":"69","DOI":"10.1186\/s12915-014-0069-1","article-title":"The Earth Microbiome project: successes and aspirations","volume":"12","author":"Gilbert","year":"2014","journal-title":"BMC Biol"},{"key":"2023013107235710600_bty617-B11","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1016\/j.watres.2014.05.008","article-title":"Replicating the microbial community and water quality performance 
of full-scale slow sand filters in laboratory-scale filters","volume":"61","author":"Haig","year":"2014","journal-title":"Water Res"},{"key":"2023013107235710600_bty617-B12","doi-asserted-by":"crossref","first-page":"834","DOI":"10.1093\/bioinformatics\/btw722","article-title":"DACE: a scalable DP-means algorithm for clustering extremely large sequence data","volume":"33","author":"Jiang","year":"2017","journal-title":"Bioinformatics"},{"key":"2023013107235710600_bty617-B13","first-page":"887","volume-title":"Proc. 29th International Conference on Machine Learning","author":"Krishnamurthy","year":"2012"},{"key":"2023013107235710600_bty617-B14","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.1093\/bioinformatics\/btl158","article-title":"Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences","volume":"22","author":"Li","year":"2006","journal-title":"Bioinformatics"},{"key":"2023013107235710600_bty617-B15","doi-asserted-by":"crossref","first-page":"287","DOI":"10.1093\/bioinformatics\/btt657","article-title":"HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences","volume":"30","author":"Matias Rodrigues","year":"2014","journal-title":"Bioinformatics"},{"key":"2023013107235710600_bty617-B16","doi-asserted-by":"crossref","first-page":"669","DOI":"10.1093\/bib\/bbs054","article-title":"Classification of metagenomic sequences: methods and challenges","volume":"13","author":"Mande","year":"2012","journal-title":"Brief. 
Bioinform"},{"key":"2023013107235710600_bty617-B17","doi-asserted-by":"crossref","first-page":"610.","DOI":"10.1038\/ismej.2011.139","article-title":"An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea","volume":"6","author":"McDonald","year":"2012","journal-title":"ISME J"},{"key":"2023013107235710600_bty617-B18","first-page":"1235","article-title":"MLlib: machine learning in Apache Spark","volume":"17","author":"Meng","year":"2016","journal-title":"J. Mach. Learn. Res"},{"key":"2023013107235710600_bty617-B19","doi-asserted-by":"crossref","first-page":"e545.","DOI":"10.7717\/peerj.545","article-title":"Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences","volume":"2","author":"Rideout","year":"2014","journal-title":"PeerJ"},{"key":"2023013107235710600_bty617-B20","doi-asserted-by":"crossref","first-page":"1501","DOI":"10.1128\/AEM.71.3.1501-1506.2005","article-title":"Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness","volume":"71","author":"Schloss","year":"2005","journal-title":"Appl. Environ. Microbiol"},{"key":"2023013107235710600_bty617-B21","doi-asserted-by":"crossref","first-page":"3219","DOI":"10.1128\/AEM.02810-10","article-title":"Assessing and improving methods used in operational taxonomic unit-based approaches for 16S rRNA gene sequence analysis","volume":"77","author":"Schloss","year":"2011","journal-title":"Appl. Environ. Microbiol"},{"key":"2023013107235710600_bty617-B22","doi-asserted-by":"crossref","first-page":"128","DOI":"10.1109\/MSP.2007.914237","article-title":"Locality-sensitive hashing for finding nearest neighbors","volume":"25","author":"Slaney","year":"2008","journal-title":"IEEE Signal Process. 
Mag"},{"key":"2023013107235710600_bty617-B23","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1093\/bib\/bbr009","article-title":"A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis","volume":"13","author":"Sun","year":"2012","journal-title":"Brief. Bioinf"},{"key":"2023013107235710600_bty617-B24","doi-asserted-by":"crossref","first-page":"e76.","DOI":"10.1093\/nar\/gkp285","article-title":"ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences","volume":"37","author":"Sun","year":"2009","journal-title":"Nucleic Acids Res"},{"key":"2023013107235710600_bty617-B25","doi-asserted-by":"crossref","first-page":"e205.","DOI":"10.1093\/nar\/gkq872","article-title":"Advanced computational algorithms for microbial community analysis using massive 16S rRNA sequence data","volume":"38","author":"Sun","year":"2010","journal-title":"Nucleic Acids Res"},{"key":"2023013107235710600_bty617-B26","first-page":"203","article-title":"Active clustering of biological sequences","volume":"13","author":"Voevodski","year":"2012","journal-title":"J. Mach. Learn. Res"},{"key":"2023013107235710600_bty617-B27","doi-asserted-by":"crossref","first-page":"395","DOI":"10.1007\/s11222-007-9033-z","article-title":"A tutorial on spectral clustering","volume":"17","author":"Von Luxburg","year":"2007","journal-title":"Stat. Comput"},{"key":"2023013107235710600_bty617-B28","first-page":"153","volume-title":"Proc. 
2010 IEEE International Conference on Bioinformatics and Biomedicine","author":"Ye","year":"2011"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/3\/380\/48964796\/bioinformatics_35_3_380.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/3\/380\/48964796\/bioinformatics_35_3_380.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,3]],"date-time":"2023-09-03T18:46:52Z","timestamp":1693766812000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/35\/3\/380\/5053310"}},"subtitle":[],"editor":[{"given":"Inanc","family":"Birol","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2018,7,13]]},"references-count":28,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2019,2,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bty617","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2019,2,1]]},"published":{"date-parts":[[2018,7,13]]}}}