{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,26]],"date-time":"2025-10-26T21:17:45Z","timestamp":1761513465619},"reference-count":31,"publisher":"Oxford University Press (OUP)","issue":"17","license":[{"start":{"date-parts":[[2017,1,27]],"date-time":"2017-01-27T00:00:00Z","timestamp":1485475200000},"content-version":"vor","delay-in-days":984,"URL":"http:\/\/creativecommons.org\/licenses\/by\/3.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2014,9,1]]},"abstract":"<jats:p>Motivation: Over the recent years, the field of whole-metagenome shotgun sequencing has witnessed significant growth owing to the high-throughput sequencing technologies that allow sequencing genomic samples cheaper, faster and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the human microbiome project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation.<\/jats:p>\n               <jats:p>Results: In this article, we develop a content-based exploration and retrieval method for whole-metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome datasets as well as human microbiome project metagenomic samples. We observe significant enrichment for diseased gut samples in results of queries with another diseased sample and high accuracy in discriminating between different body sites even though the method is unsupervised.<\/jats:p>\n               <jats:p>Availability and implementation: A software implementation of the DSM framework is available at https:\/\/github.com\/HIITMetagenomics\/dsm-framework.<\/jats:p>\n               <jats:p>Contact: \u00a0sohan.seth@hiit.fi or antti.honkela@hiit.fi<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary Data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btu340","type":"journal-article","created":{"date-parts":[[2014,5,21]],"date-time":"2014-05-21T02:15:06Z","timestamp":1400638506000},"page":"2471-2479","source":"Crossref","is-referenced-by-count":26,"title":["Exploration and retrieval of whole-metagenome sequencing samples"],"prefix":"10.1093","volume":"30","author":[{"given":"Sohan","family":"Seth","sequence":"first","affiliation":[{"name":"1 \u00a01Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, 2Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and 3Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland"}]},{"given":"Niko","family":"V\u00e4lim\u00e4ki","sequence":"additional","affiliation":[{"name":"1 \u00a01Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, 2Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and 3Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland"}]},{"given":"Samuel","family":"Kaski","sequence":"additional","affiliation":[{"name":"1 \u00a01Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, 2Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and 3Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland"}]},{"given":"Antti","family":"Honkela","sequence":"additional","affiliation":[{"name":"1 \u00a01Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, 2Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and 3Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland"}]}],"member":"286","published-online":{"date-parts":[[2014,5,19]]},"reference":[{"key":"2023012711525008800_btu340-B1","doi-asserted-by":"crossref","first-page":"e1002373","DOI":"10.1371\/journal.pcbi.1002373","article-title":"Joint analysis of multiple metagenomic samples","volume":"8","author":"Baran","year":"2012","journal-title":"PLoS Comput. Biol."},{"key":"2023012711525008800_btu340-B2","doi-asserted-by":"crossref","first-page":"i145","DOI":"10.1093\/bioinformatics\/btp215","article-title":"Probabilistic retrieval and visualization of biologically relevant microarray experiments","volume":"25","author":"Caldas","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012711525008800_btu340-B3","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1093\/bioinformatics\/btr634","article-title":"Data-driven information retrieval in heterogeneous collections of transcriptomics data links SIM2s to malignant pleural mesothelioma","volume":"28","author":"Caldas","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012711525008800_btu340-B4","doi-asserted-by":"crossref","first-page":"3316","DOI":"10.1093\/bioinformatics\/bts599","article-title":"Real time metagenomics: using k-mers to annotate metagenomes","volume":"28","author":"Edwards","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012711525008800_btu340-B5","doi-asserted-by":"crossref","first-page":"594","DOI":"10.1073\/pnas.1116053109","article-title":"Metagenomic systems biology of the human gut microbiome reveals topological shifts associated with obesity and inflammatory bowel disease","volume":"109","author":"Greenblum","year":"2012","journal-title":"Proc. Natl Acad. Sci. USA"},{"key":"2023012711525008800_btu340-B6","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1038\/nature11234","article-title":"Structure, function and diversity of the healthy human microbiome","volume":"486","author":"Human Microbiome Project Consortium","year":"2012","journal-title":"Nature"},{"key":"2023012711525008800_btu340-B7","doi-asserted-by":"crossref","first-page":"730","DOI":"10.1186\/1471-2164-13-730","article-title":"Comparison of metagenomic samples using sequence signatures","volume":"13","author":"Jiang","year":"2012","journal-title":"BMC Genomics"},{"key":"2023012711525008800_btu340-B8","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-642-02441-2_17","article-title":"Permuted longest common prefix array","volume-title":"Proceedings of Combinatorial Pattern Matching","author":"K\u00e4rkk\u00e4inen","year":"2009"},{"key":"2023012711525008800_btu340-B9","doi-asserted-by":"crossref","DOI":"10.1145\/1982185.1982389","article-title":"Entropy based feature selection for text categorization","volume-title":"Proceedings of the 2011 ACM Symposium on Applied Computing - SAC 11","author":"Largeron","year":"2011"},{"key":"2023012711525008800_btu340-B10","doi-asserted-by":"crossref","first-page":"e32118","DOI":"10.1371\/journal.pone.0032118","article-title":"Analyses of the microbial diversity across the human microbiome","volume":"7","author":"Li","year":"2012","journal-title":"PLoS One"},{"key":"2023012711525008800_btu340-B11","doi-asserted-by":"crossref","first-page":"3242","DOI":"10.1093\/bioinformatics\/btr547","article-title":"Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data","volume":"27","author":"Liu","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012711525008800_btu340-B12","doi-asserted-by":"crossref","first-page":"S10","DOI":"10.1186\/1471-2105-13-S19-S10","article-title":"Compareads: comparing huge metagenomic experiments","volume":"13","author":"Maillet","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2023012711525008800_btu340-B13","doi-asserted-by":"crossref","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","article-title":"A fast, lock-free approach for efficient parallel counting of occurrences of k-mers","volume":"27","author":"Marais","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012711525008800_btu340-B14","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-540-78646-7_38","article-title":"Computing information retrieval performance measures efficiently in the presence of tied scores","volume-title":"Proceedings of the IR research, 30th European conference on Advances in information retrieval","author":"McSherry","year":"2008"},{"key":"2023012711525008800_btu340-B15","doi-asserted-by":"crossref","first-page":"386","DOI":"10.1186\/1471-2105-9-386","article-title":"The metagenomics RAST server a public resource for the automatic phylogenetic and functional analysis of metagenomes","volume":"9","author":"Meyer","year":"2008","journal-title":"BMC Bioinformatics"},{"key":"2023012711525008800_btu340-B16","doi-asserted-by":"crossref","first-page":"6643","DOI":"10.1093\/nar\/gkp698","article-title":"FIGfams: yet another set of protein families","volume":"37","author":"Meyer","year":"2009","journal-title":"Nucleic Acids Res."},{"key":"2023012711525008800_btu340-B17","doi-asserted-by":"crossref","first-page":"1849","DOI":"10.1093\/bioinformatics\/btp341","article-title":"Visual and statistical comparison of metagenomes","volume":"25","author":"Mitra","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012711525008800_btu340-B18","doi-asserted-by":"crossref","first-page":"715","DOI":"10.1093\/bioinformatics\/btq041","article-title":"Identifying biologically relevant differences between metagenomic communities","volume":"26","author":"Parks","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012711525008800_btu340-B19","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1038\/nature08821","article-title":"A human gut microbial gene catalogue established by metagenomic sequencing","volume":"464","author":"Qin","year":"2010","journal-title":"Nature"},{"key":"2023012711525008800_btu340-B20","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1038\/nature11450","article-title":"A metagenome-wide association study of gut microbiota in type 2 diabetes","volume":"490","author":"Qin","year":"2012","journal-title":"Nature"},{"key":"2023012711525008800_btu340-B21","doi-asserted-by":"crossref","first-page":"e3373","DOI":"10.1371\/journal.pone.0003373","article-title":"MetaSim: a sequencing simulator for genomics and metagenomics","volume":"3","author":"Richter","year":"2008","journal-title":"PLoS One"},{"key":"2023012711525008800_btu340-B22","doi-asserted-by":"crossref","first-page":"652","DOI":"10.1093\/bioinformatics\/btt020","article-title":"DSK: k-mer counting with very low memory usage","volume":"29","author":"Rizk","year":"2013","journal-title":"Bioinformatics"},{"key":"2023012711525008800_btu340-B23","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1038\/nature11711","article-title":"Genomic variation landscape of the human gut microbiome","volume":"493","author":"Schloissnig","year":"2013","journal-title":"Nature"},{"key":"2023012711525008800_btu340-B24","doi-asserted-by":"crossref","first-page":"R60","DOI":"10.1186\/gb-2011-12-6-r60","article-title":"Metagenomic biomarker discovery and explanation","volume":"12","author":"Segata","year":"2011","journal-title":"Genome Biol."},{"key":"2023012711525008800_btu340-B25","doi-asserted-by":"crossref","first-page":"811","DOI":"10.1038\/nmeth.2066","article-title":"Metagenomic microbial community profiling using unique clade-specific marker genes","volume":"9","author":"Segata","year":"2012","journal-title":"Nat. Methods"},{"key":"2023012711525008800_btu340-B26","doi-asserted-by":"crossref","first-page":"623","DOI":"10.1145\/1321440.1321528","article-title":"A comparison of statistical significance tests for information retrieval evaluation","volume-title":"Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management. CIKM\u201907","author":"Smucker","year":"2007"},{"key":"2023012711525008800_btu340-B27","doi-asserted-by":"crossref","first-page":"2493","DOI":"10.1093\/bioinformatics\/bts470","article-title":"Meta-Storms: efficient search for similar microbial communities based on a novel indexing scheme and similarity score for metagenomic data","volume":"28","author":"Su","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012711525008800_btu340-B28","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1038\/nature02340","article-title":"Community structure and metabolism through reconstruction of microbial genomes from the environment","volume":"428","author":"Tyson","year":"2004","journal-title":"Nature"},{"key":"2023012711525008800_btu340-B29","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-642-33122-0_35","article-title":"Distributed string mining for high-throughput sequencing data","volume-title":"12th Workshop on Algorithms in Bioinformatics (WABI)","author":"V\u00e4lim\u00e4ki","year":"2012"},{"key":"2023012711525008800_btu340-B30","doi-asserted-by":"crossref","first-page":"e1000352","DOI":"10.1371\/journal.pcbi.1000352","article-title":"Statistical methods for detecting differentially abundant features in clinical metagenomic samples","volume":"5","author":"White","year":"2009","journal-title":"PLoS Comput. Biol."},{"key":"2023012711525008800_btu340-B31","first-page":"412","article-title":"A comparative study on feature selection in text categorization","volume-title":"Proceedings of the Fourteenth International Conference on Machine Learning (ICML\u201997)","author":"Yang","year":"1997"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/30\/17\/2471\/48927030\/bioinformatics_30_17_2471.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/30\/17\/2471\/48927030\/bioinformatics_30_17_2471.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T12:15:23Z","timestamp":1674821723000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/30\/17\/2471\/2748241"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,5,19]]},"references-count":31,"journal-issue":{"issue":"17","published-print":{"date-parts":[[2014,9,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btu340","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published":{"date-parts":[[2014,5,19]]}}}