{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,31]],"date-time":"2026-01-31T03:55:25Z","timestamp":1769831725577,"version":"3.49.0"},"reference-count":45,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2016,6,2]],"date-time":"2016-06-02T00:00:00Z","timestamp":1464825600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/about_us\/legal\/notices"}],"funder":[{"name":"NSF","award":["DMS-1518001"],"award-info":[{"award-number":["DMS-1518001"]}]},{"name":"OCE","award":["1136818"],"award-info":[{"award-number":["1136818"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2017,3,15]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>The advent of next-generation sequencing technologies enables researchers to sequence complex microbial communities directly from the environment. Because assembly typically produces only genome fragments, also known as contigs, instead of an entire genome, it is crucial to group them into operational taxonomic units (OTUs) for further taxonomic profiling and downstream functional analysis. OTU clustering is also referred to as binning. We present COCACOLA, a general framework that automatically bins contigs into OTUs based on sequence composition and coverage across multiple samples.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>The effectiveness of COCACOLA is demonstrated in both simulated and real datasets in comparison with state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two aspects. One is using L1 distance instead of Euclidean distance for better taxonomic identification during initialization. 
More importantly, COCACOLA takes advantage of both hard clustering and soft clustering by sparsity regularization. In addition, the COCACOLA framework seamlessly embraces customized knowledge to facilitate binning accuracy. In our study, we have investigated two types of additional knowledge, the co-alignment to reference genomes and linkage of contigs provided by paired-end reads, as well as the ensemble of both. We find that both co-alignment and linkage information further improve binning in the majority of cases. COCACOLA is scalable and faster than CONCOCT, GroopM, MaxBin and MetaBAT.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>The software is available at https:\/\/github.com\/younglululu\/COCACOLA.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btw290","type":"journal-article","created":{"date-parts":[[2016,6,3]],"date-time":"2016-06-03T01:12:51Z","timestamp":1464916371000},"page":"791-798","source":"Crossref","is-referenced-by-count":132,"title":["COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge"],"prefix":"10.1093","volume":"33","author":[{"given":"Yang Young","family":"Lu","sequence":"first","affiliation":[{"name":"Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA"}]},{"given":"Ting","family":"Chen","sequence":"additional","affiliation":[{"name":"Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA"},{"name":"Center for Synthetic and Systems Biology, TNLIST, Beijing, China"}]},{"given":"Jed A","family":"Fuhrman","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences and Wrigley 
Institute for Environmental Studies, University of Southern California, Los Angeles, CA, USA"}]},{"given":"Fengzhu","family":"Sun","sequence":"additional","affiliation":[{"name":"Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA"},{"name":"Center for Computational Systems Biology, Fudan University, Shanghai, China"}]}],"member":"286","published-online":{"date-parts":[[2016,6,2]]},"reference":[{"key":"2023020204511568800_btw290-B1","doi-asserted-by":"crossref","first-page":"533","DOI":"10.1038\/nbt.2579","article-title":"Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes","volume":"31","author":"Albertsen","year":"2013","journal-title":"Nat. Biotechnol"},{"key":"2023020204511568800_btw290-B2","doi-asserted-by":"crossref","first-page":"1144","DOI":"10.1038\/nmeth.3103","article-title":"Binning metagenomic contigs by coverage and composition","volume":"11","author":"Alneberg","year":"2014","journal-title":"Nat. Methods"},{"key":"2023020204511568800_btw290-B3","doi-asserted-by":"crossref","first-page":"e1002373.","DOI":"10.1371\/journal.pcbi.1002373","article-title":"Joint analysis of multiple metagenomic samples","volume":"8","author":"Baran","year":"2012","journal-title":"PLoS Comput. Biol"},{"key":"2023020204511568800_btw290-B4","doi-asserted-by":"crossref","DOI":"10.1201\/9781584889977","volume-title":"Constrained Clustering: Advances in Algorithms, Theory, and Applications","author":"Basu","year":"2008"},{"key":"2023020204511568800_btw290-B5","doi-asserted-by":"crossref","first-page":"673","DOI":"10.1038\/nmeth.1358","article-title":"Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models","volume":"6","author":"Brady","year":"2009","journal-title":"Nat. 
Methods"},{"key":"2023020204511568800_btw290-B6","author":"Cai","year":"2010"},{"key":"2023020204511568800_btw290-B7","doi-asserted-by":"crossref","first-page":"1548","DOI":"10.1109\/TPAMI.2010.231","article-title":"Graph regularized nonnegative matrix factorization for data representation","volume":"33","author":"Cai","year":"2011","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},{"key":"2023020204511568800_btw290-B8","doi-asserted-by":"crossref","first-page":"e1003292.","DOI":"10.1371\/journal.pcbi.1003292","article-title":"Reconstructing the genomic content of microbiome taxa through shotgun metagenomic deconvolution","volume":"9","author":"Carr","year":"2013","journal-title":"PLoS Comput. Biol"},{"key":"2023020204511568800_btw290-B9","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1007\/978-3-540-78839-3_3","article-title":"Compostbin: a DNA composition-based algorithm for binning environmental shotgun reads","author":"Chatterji","year":"2008","journal-title":"Res. Comput. Mol. Biol"},{"key":"2023020204511568800_btw290-B10","volume-title":"Spectral Graph Theory","author":"Chung","year":"1997"},{"key":"2023020204511568800_btw290-B11","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1038\/nature11234","article-title":"Structure, function and diversity of the healthy human microbiome","volume":"486","author":"Consortium","year":"2012","journal-title":"Nature"},{"key":"2023020204511568800_btw290-B12","first-page":"27","article-title":"Variational Bayesian model selection for mixture distributions","author":"Corduneanu","year":"2001","journal-title":"Artificial intelligence and Statistics 2001"},{"key":"2023020204511568800_btw290-B13","doi-asserted-by":"crossref","first-page":"224","DOI":"10.1109\/TPAMI.1979.4766909","article-title":"A cluster separation measure","volume":"1","author":"Davies","year":"1979","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell"},{"key":"2023020204511568800_btw290-B14","doi-asserted-by":"crossref","first-page":"377","DOI":"10.1101\/gr.5969107","article-title":"MEGAN analysis of metagenomic data","volume":"17","author":"Huson","year":"2007","journal-title":"Genome Res"},{"key":"2023020204511568800_btw290-B15","author":"Ijaz","year":"2013"},{"key":"2023020204511568800_btw290-B16","doi-asserted-by":"crossref","first-page":"e603.","DOI":"10.7717\/peerj.603","article-title":"GroopM: an automated tool for the recovery of population genomes from related metagenomes","volume":"2","author":"Imelfort","year":"2014","journal-title":"PeerJ"},{"key":"2023020204511568800_btw290-B17","doi-asserted-by":"crossref","first-page":"281","DOI":"10.1007\/978-3-642-40837-3_9","article-title":"A clustering approach to constrained binary matrix factorization","author":"Jiang","year":"2014","journal-title":"Data Mining and Knowledge Discovery for Big Data"},{"key":"2023020204511568800_btw290-B18","doi-asserted-by":"crossref","first-page":"e1165.","DOI":"10.7717\/peerj.1165","article-title":"MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities","volume":"3","author":"Kang","year":"2015","journal-title":"PeerJ"},{"key":"2023020204511568800_btw290-B19","doi-asserted-by":"crossref","first-page":"544.","DOI":"10.1186\/1471-2105-11-544","article-title":"Clustering metagenomic sequences with interpolated Markov models","volume":"11","author":"Kelley","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023020204511568800_btw290-B20","author":"Kim","year":"2008"},{"key":"2023020204511568800_btw290-B21","author":"Langville","year":"2006"},{"key":"2023020204511568800_btw290-B22","doi-asserted-by":"crossref","first-page":"788","DOI":"10.1038\/44565","article-title":"Learning the parts of objects by non-negative matrix 
factorization","volume":"401","author":"Lee","year":"1999","journal-title":"Nature"},{"key":"2023020204511568800_btw290-B23","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1109\/TCBB.2013.137","article-title":"A new unsupervised binning approach for metagenomic sequences based on n-grams and automatic feature weighting","volume":"11","author":"Liao","year":"2014","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform"},{"key":"2023020204511568800_btw290-B24","doi-asserted-by":"crossref","first-page":"982","DOI":"10.1109\/TSMCB.2012.2220543","article-title":"Understanding and enhancement of internal clustering validation measures","volume":"43","author":"Liu","year":"2013","journal-title":"IEEE Trans. Cybern"},{"key":"2023020204511568800_btw290-B25","doi-asserted-by":"crossref","first-page":"669","DOI":"10.1093\/bib\/bbs054","article-title":"Classification of metagenomic sequences: methods and challenges","volume":"13","author":"Mande","year":"2012","journal-title":"Brief. Bioinform"},{"key":"2023020204511568800_btw290-B26","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1038\/nmeth976","article-title":"Accurate phylogenetic classification of variable-length DNA fragments","volume":"4","author":"McHardy","year":"2007","journal-title":"Nat. Methods"},{"key":"2023020204511568800_btw290-B27","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1093\/bioinformatics\/btq608","article-title":"SPHINX\u2014an algorithm for taxonomic binning of metagenomic sequences","volume":"27","author":"Mohammed","year":"2011","journal-title":"Bioinformatics"},{"key":"2023020204511568800_btw290-B28","doi-asserted-by":"crossref","first-page":"822","DOI":"10.1038\/nbt.2939","article-title":"Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes","volume":"32","author":"Nielsen","year":"2014","journal-title":"Nat. 
Biotechnol"},{"key":"2023020204511568800_btw290-B29","first-page":"2326","article-title":"Analysis of a data matrix and a graph: metagenomic data and the phylogenetic tree","author":"Purdom","year":"2011","journal-title":"Ann. Appl. Stat"},{"key":"2023020204511568800_btw290-B30","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1038\/nature08821","article-title":"A human gut microbial gene catalogue established by metagenomic sequencing","volume":"464","author":"Qin","year":"2010","journal-title":"Nature"},{"key":"2023020204511568800_btw290-B31","doi-asserted-by":"crossref","first-page":"525","DOI":"10.1146\/annurev.genet.38.072902.091216","article-title":"Metagenomics: genomic analysis of microbial communities","volume":"38","author":"Riesenfeld","year":"2004","journal-title":"Annu. Rev. Genet"},{"key":"2023020204511568800_btw290-B32","doi-asserted-by":"crossref","first-page":"127","DOI":"10.1093\/bioinformatics\/btq619","article-title":"NBC: the naive bayes classification tool webserver for taxonomic classification of metagenomic reads","volume":"27","author":"Rosen","year":"2011","journal-title":"Bioinformatics"},{"key":"2023020204511568800_btw290-B33","author":"Salvador","year":"2004"},{"key":"2023020204511568800_btw290-B34","doi-asserted-by":"crossref","first-page":"111","DOI":"10.1101\/gr.142315.112","article-title":"Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization","volume":"23","author":"Sharon","year":"2013","journal-title":"Genome Res"},{"key":"2023020204511568800_btw290-B35","doi-asserted-by":"crossref","first-page":"619","DOI":"10.1109\/TCBB.2011.111","article-title":"The impact of normalization and phylogenetic information on estimating the distance for metagenomes","volume":"9","author":"Su","year":"2012","journal-title":"IEEE\/ACM Trans. Comput. Biol. 
Bioinform"},{"key":"2023020204511568800_btw290-B36","author":"Tang","year":"2005"},{"key":"2023020204511568800_btw290-B37","doi-asserted-by":"crossref","first-page":"ii59","DOI":"10.1093\/bioinformatics\/bti1110","article-title":"Fast protein classification with multiple networks","volume":"21","author":"Tsuda","year":"2005","journal-title":"Bioinformatics"},{"key":"2023020204511568800_btw290-B38","doi-asserted-by":"crossref","first-page":"1467","DOI":"10.1089\/cmb.2010.0056","article-title":"Alignment-free sequence comparison (ii): theoretical power of comparison statistics","volume":"17","author":"Wan","year":"2010","journal-title":"J. Comput. Biol"},{"key":"2023020204511568800_btw290-B39","doi-asserted-by":"crossref","DOI":"10.1038\/nmeth.3583","article-title":"Comparing the performance of biomedical clustering methods","author":"Wiwie","year":"2015","journal-title":"Nat. Methods"},{"key":"2023020204511568800_btw290-B40","doi-asserted-by":"crossref","first-page":"R46.","DOI":"10.1186\/gb-2014-15-3-r46","article-title":"Kraken: ultrafast metagenomic sequence classification using exact alignments","volume":"15","author":"Wood","year":"2014","journal-title":"Genome Biol"},{"key":"2023020204511568800_btw290-B41","doi-asserted-by":"crossref","first-page":"523","DOI":"10.1089\/cmb.2010.0245","article-title":"A novel abundance-based algorithm for binning metagenomic sequences using l-tuples","volume":"18","author":"Wu","year":"2011","journal-title":"J. Comput. 
Biol"},{"key":"2023020204511568800_btw290-B42","doi-asserted-by":"crossref","first-page":"605","DOI":"10.1093\/bioinformatics\/btv638","article-title":"MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets","volume":"32","author":"Wu","year":"2016","journal-title":"Bioinformatics"},{"key":"2023020204511568800_btw290-B43","first-page":"S5.","article-title":"Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers","volume":"11","author":"Yang","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023020204511568800_btw290-B44","doi-asserted-by":"crossref","first-page":"455","DOI":"10.1142\/S0219720009004151","article-title":"An ORFome assembly approach to metagenomics sequences analysis","volume":"7","author":"Ye","year":"2009","journal-title":"J. Bioinform. Comput. Biol"},{"key":"2023020204511568800_btw290-B45","author":"Zhao","year":"2007"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/33\/6\/791\/49038237\/bioinformatics_33_6_791.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/33\/6\/791\/49038237\/bioinformatics_33_6_791.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,8,18]],"date-time":"2023-08-18T18:50:09Z","timestamp":1692384609000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/33\/6\/791\/2525584"}},"subtitle":[],"editor":[{"given":"Cenk","family":"Sahinalp","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2016,6,2]]},"references-count":45,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2017,3,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btw290","relation
":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2017,3,15]]},"published":{"date-parts":[[2016,6,2]]}}}