{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,17]],"date-time":"2026-05-17T02:50:24Z","timestamp":1778986224593,"version":"3.51.4"},"reference-count":20,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2016,8,19]],"date-time":"2016-08-19T00:00:00Z","timestamp":1471564800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2016,8,19]],"date-time":"2016-08-19T00:00:00Z","timestamp":1471564800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100006489","name":"Commissariat \u00e0 l'\u00c9nergie Atomique et aux \u00c9nergies Alternatives","doi-asserted-by":"crossref","award":["Programme \"Technologies pour la Sant\u00e9\" \/ Projet Meta-Target"],"award-info":[{"award-number":["Programme \"Technologies pour la Sant\u00e9\" \/ Projet Meta-Target"]}],"id":[{"id":"10.13039\/501100006489","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100006489","name":"Commissariat \u00e0 l'\u00c9nergie Atomique et aux \u00c9nergies Alternatives","doi-asserted-by":"publisher","award":["Programme \"Technologies pour la Sant\u00e9\" \/ Projet Meta-Target"],"award-info":[{"award-number":["Programme \"Technologies pour la Sant\u00e9\" \/ Projet Meta-Target"]}],"id":[{"id":"10.13039\/501100006489","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>Metagenomics holds great promises for deepening our knowledge of key bacterial driven processes, but metagenome assembly remains problematic, typically resulting in representation biases and discarding significant amounts of non-redundant sequence information. In order to alleviate constraints assembly can impose on downstream analyses, and\/or to increase the fraction of raw reads assembled via targeted assemblies relying on pre-assembly binning steps, we developed a set of binning modules and evaluated their combination in a new \u201cassembly-free\u201d binning protocol.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>We describe a scalable multi-tiered binning algorithm that combines frequency and compositional features to cluster unassembled reads, and demonstrate i) significant runtime performance gains of the developed modules against state of the art software, obtained through parallelization and the efficient use of large lock-free concurrent hash maps, ii) its relevance for clustering unassembled reads from high complexity (e.g., harboring 700 distinct genomes) samples, iii) its relevance to experimental setups involving multiple samples, through a use case consisting in the \u201cde novo\u201d identification of sequences from a target genome (e.g., a pathogenic strain) segregating at low levels in a cohort of 50 complex microbiomes (harboring 100 distinct genomes each), in the background of closely related strains and the absence of reference genomes, iv) its ability to correctly identify clusters of sequences from the <jats:italic>E. coli O104:H4<\/jats:italic> genome as the most strongly correlated to the infection status in 53 microbiomes sampled from the 2011 STEC outbreak in Germany, and to accurately cluster contigs of this pathogenic strain from a cross-assembly of these 53 microbiomes.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>We present a set of sequence clustering (\u201cbinning\u201d) modules and their application to biomarker (e.g., genomes of pathogenic organisms) discovery from large synthetic and real metagenomics datasets. Initially designed for the \u201cassembly-free\u201d analysis of individual metagenomic samples, we demonstrate their extension to setups involving multiple samples via the usage of the \u201calignment-free\u201d d<jats:sub>2<\/jats:sub>S statistic to relate clusters across samples, and illustrate how the clustering modules can otherwise be leveraged for <jats:italic>de novo<\/jats:italic> \u201cpre-assembly\u201d tasks by segregating sequences into biologically meaningful partitions.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s12859-016-1186-3","type":"journal-article","created":{"date-parts":[[2016,8,19]],"date-time":"2016-08-19T12:30:36Z","timestamp":1471609836000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["A scalable assembly-free variable selection algorithm for biomarker discovery from metagenomes"],"prefix":"10.1186","volume":"17","author":[{"given":"Anestis","family":"Gkanogiannis","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"St\u00e9phane","family":"Gazut","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Marcel","family":"Salanoubat","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sawsan","family":"Kanj","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Thomas","family":"Br\u00fcls","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2016,8,19]]},"reference":[{"key":"1186_CR1","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1038\/nature11450","volume":"490","author":"J Qin","year":"2012","unstructured":"Qin J, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012;490:55\u201360.","journal-title":"Nature"},{"key":"1186_CR2","doi-asserted-by":"publisher","first-page":"R122","DOI":"10.1186\/gb-2012-13-12-r122","volume":"13","author":"S Boisvert","year":"2012","unstructured":"Boisvert S, et al. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol. 2012;13:R122.","journal-title":"Genome Biol"},{"key":"1186_CR3","doi-asserted-by":"publisher","first-page":"4904","DOI":"10.1073\/pnas.1402564111","volume":"111","author":"AC Howe","year":"2014","unstructured":"Howe AC, et al. Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci U S A. 2014;111:4904\u20139.","journal-title":"Proc Natl Acad Sci U S A"},{"key":"1186_CR4","doi-asserted-by":"publisher","first-page":"174","DOI":"10.1016\/j.copbio.2016.04.011","volume":"39","author":"D Turaev","year":"2016","unstructured":"Turaev D, Rattei T. High definition for systems biology of microbial communities: metagenomics gets genome-centric and strain-resolved. Curr Opin Biotechnol. 2016;39:174\u201381.","journal-title":"Curr Opin Biotechnol"},{"key":"1186_CR5","doi-asserted-by":"publisher","first-page":"1144","DOI":"10.1038\/nmeth.3103","volume":"11","author":"J Alneberg","year":"2014","unstructured":"Alneberg J, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11:1144\u20136.","journal-title":"Nat Methods"},{"key":"1186_CR6","doi-asserted-by":"publisher","first-page":"646","DOI":"10.1093\/bib\/bbs031","volume":"13","author":"J Dr\u00f6ge","year":"2012","unstructured":"Dr\u00f6ge J, McHardy AC. Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Brief Bioinform. 2012;13:646\u201355.","journal-title":"Brief Bioinform"},{"key":"1186_CR7","doi-asserted-by":"publisher","first-page":"26","DOI":"10.1186\/2049-2618-2-26","volume":"2","author":"YW Wu","year":"2014","unstructured":"Wu YW, et al. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2:26.","journal-title":"Microbiome"},{"key":"1186_CR8","doi-asserted-by":"publisher","first-page":"523","DOI":"10.1089\/cmb.2010.0245","volume":"18","author":"YW Wu","year":"2011","unstructured":"Wu YW, Ye Y. A novel abundance-based algorithm for binning metagenomics sequences using l-tuples. J Comput Biol. 2011;18:523\u201334.","journal-title":"J Comput Biol"},{"key":"1186_CR9","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1038\/nature08821","volume":"464","author":"J Qin","year":"2010","unstructured":"Qin J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59\u201365.","journal-title":"Nature"},{"key":"1186_CR10","unstructured":"Holtgrewe, M. (2010) Mason \u2013 a read simulator for second generation sequencing data. Technical Report TR-B-10-06, Institut f\u00fcr Mathematik und Informatik, Freie Universit\u00e4t Berlin."},{"key":"1186_CR11","unstructured":"Rosenberg, A. and Hirschberg, J. (2007) V-Measure: A conditional entropy based external cluster evaluation measure. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 410\u2013420."},{"key":"1186_CR12","doi-asserted-by":"publisher","first-page":"343","DOI":"10.1093\/bib\/bbt067","volume":"15","author":"K Song","year":"2014","unstructured":"Song K, et al. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform. 2014;15:343\u201353.","journal-title":"Brief Bioinform"},{"key":"1186_CR13","doi-asserted-by":"publisher","first-page":"719","DOI":"10.1093\/bioinformatics\/btm563","volume":"24","author":"P Langfelder","year":"2008","unstructured":"Langfelder P, et al. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics. 2008;24:719\u201320.","journal-title":"Bioinformatics"},{"key":"1186_CR14","doi-asserted-by":"publisher","first-page":"i356","DOI":"10.1093\/bioinformatics\/bts397","volume":"28","author":"Y Wang","year":"2012","unstructured":"Wang Y, et al. MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics. 2012;28:i356\u201362.","journal-title":"Bioinformatics"},{"key":"1186_CR15","doi-asserted-by":"publisher","first-page":"822","DOI":"10.1038\/nbt.2939","volume":"32","author":"HB Nielsen","year":"2014","unstructured":"Nielsen HB, et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol. 2014;32:822\u20138.","journal-title":"Nat Biotechnol"},{"key":"1186_CR16","doi-asserted-by":"publisher","first-page":"1053","DOI":"10.1038\/nbt.3329","volume":"33","author":"B Cleary","year":"2015","unstructured":"Cleary B, et al. Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning. Nat Biotechnol. 2015;33:1053\u201360.","journal-title":"Nat Biotechnol"},{"key":"1186_CR17","doi-asserted-by":"publisher","first-page":"R46","DOI":"10.1186\/gb-2014-15-3-r46","volume":"15","author":"DE Wood","year":"2014","unstructured":"Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15:R46.","journal-title":"Genome Biol"},{"key":"1186_CR18","doi-asserted-by":"publisher","first-page":"436","DOI":"10.1038\/ismej.2007.48","volume":"5","author":"IA Davidova","year":"2007","unstructured":"Davidova IA, et al. Anaerobic phenanthrene mineralization by a carboxylating sulfate-reducing bacterial enrichment. ISME J. 2007;5:436\u201342.","journal-title":"ISME J"},{"key":"1186_CR19","doi-asserted-by":"publisher","first-page":"1043","DOI":"10.1101\/gr.186072.114","volume":"25","author":"DH Parks","year":"2015","unstructured":"Parks DH, et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25:1043\u201355.","journal-title":"Genome Res"},{"key":"1186_CR20","doi-asserted-by":"publisher","first-page":"1513","DOI":"10.1073\/pnas.1017351108","volume":"108","author":"S Gnerre","year":"2011","unstructured":"Gnerre S, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A. 2011;108:1513\u20138.","journal-title":"Proc Natl Acad Sci U S A"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-1186-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-016-1186-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-1186-3","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-1186-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,1]],"date-time":"2024-02-01T18:14:42Z","timestamp":1706811282000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-016-1186-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,8,19]]},"references-count":20,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2016,12]]}},"alternative-id":["1186"],"URL":"https:\/\/doi.org\/10.1186\/s12859-016-1186-3","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,8,19]]},"assertion":[{"value":"19 December 2015","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 August 2016","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 August 2016","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"311"}}