{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T04:10:09Z","timestamp":1777349409841,"version":"3.51.4"},"reference-count":53,"publisher":"Public Library of Science (PLoS)","issue":"5","license":[{"start":{"date-parts":[[2025,5,7]],"date-time":"2025-05-07T00:00:00Z","timestamp":1746576000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100014895","name":"Open Philanthropy Project","doi-asserted-by":"publisher","award":["NA"],"award-info":[{"award-number":["NA"]}],"id":[{"id":"10.13039\/100014895","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000156","name":"Division of Emerging Frontiers","doi-asserted-by":"publisher","award":["2025457"],"award-info":[{"award-number":["2025457"]}],"id":[{"id":"10.13039\/100000156","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["www.ploscompbiol.org"],"crossmark-restriction":false},"short-container-title":["PLoS Comput Biol"],"abstract":"<jats:p>We use open source human gut microbiome data to learn a microbial \u201clanguage\u201d model by adapting techniques from Natural Language Processing (NLP). Our microbial \u201clanguage\u201d model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting different microbial taxa in the sample and their interactions as a whole. 
We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting Inflammatory Bowel Disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from independent studies, even in the presence of substantial distribution shifts. Through a variety of analyses, we further show that the pre-trained, context-sensitive embedding captures meaningful biological information, including taxonomic relationships, correlations with biological pathways, and relevance to IBD expression, despite the model never being explicitly exposed to such signals.<\/jats:p>","DOI":"10.1371\/journal.pcbi.1011353","type":"journal-article","created":{"date-parts":[[2025,5,7]],"date-time":"2025-05-07T16:07:41Z","timestamp":1746634061000},"page":"e1011353","update-policy":"https:\/\/doi.org\/10.1371\/journal.pcbi.corrections_policy","source":"Crossref","is-referenced-by-count":14,"title":["Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data"],"prefix":"10.1371","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-6014-9643","authenticated-orcid":true,"given":"Quintin","family":"Pope","sequence":"first","affiliation":[]},{"given":"Rohan","family":"Varma","sequence":"additional","affiliation":[]},{"given":"Christine","family":"Tataru","sequence":"additional","affiliation":[]},{"given":"Maude M","family":"David","sequence":"additional","affiliation":[]},{"given":"Xiaoli","family":"Fern","sequence":"additional","affiliation":[]}],"member":"340","published-online":{"date-parts":[[2025,5,7]]},"reference":[{"issue":"2","key":"pcbi.1011353.ref001","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1016\/j.chom.2013.07.007","article-title":"Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment","volume":"14","author":"AD Kostic","year":"2013","journal-title":"Cell Host 
Microbe"},{"key":"pcbi.1011353.ref002","doi-asserted-by":"crossref","first-page":"186","DOI":"10.1016\/j.bbi.2015.03.016","article-title":"Altered fecal microbiota composition in patients with major depressive disorder","volume":"48","author":"H Jiang","year":"2015","journal-title":"Brain Behav Immun"},{"issue":"6","key":"pcbi.1011353.ref003","doi-asserted-by":"crossref","first-page":"786","DOI":"10.1038\/mp.2016.44","article-title":"Gut microbiome remodeling induces depressive-like behaviors through a pathway mediated by the host\u2019s metabolism","volume":"21","author":"P Zheng","year":"2016","journal-title":"Mol Psychiatry"},{"issue":"34","key":"pcbi.1011353.ref004","doi-asserted-by":"crossref","first-page":"13780","DOI":"10.1073\/pnas.0706625104","article-title":"Molecular-phylogenetic characterization of microbial community imbalances in human inflammatory bowel diseases","volume":"104","author":"DN Frank","year":"2007","journal-title":"Proc Natl Acad Sci U S A"},{"issue":"3","key":"pcbi.1011353.ref005","doi-asserted-by":"crossref","first-page":"382","DOI":"10.1016\/j.chom.2014.02.005","article-title":"The treatment-naive microbiome in new-onset Crohn\u2019s disease","volume":"15","author":"D Gevers","year":"2014","journal-title":"Cell Host Microbe"},{"issue":"416","key":"pcbi.1011353.ref006","doi-asserted-by":"crossref","first-page":"eaah6888","DOI":"10.1126\/scitranslmed.aah6888","article-title":"A role for bacterial urease in gut dysbiosis and Crohn\u2019s disease","volume":"9","author":"J Ni","year":"2017","journal-title":"Sci Transl Med"},{"issue":"4","key":"pcbi.1011353.ref007","doi-asserted-by":"crossref","first-page":"392","DOI":"10.1038\/nm.4517","article-title":"Current understanding of the human microbiome","volume":"24","author":"JA Gilbert","year":"2018","journal-title":"Nat Med"},{"issue":"3","key":"pcbi.1011353.ref008","doi-asserted-by":"crossref","DOI":"10.1128\/mSystems.00031-18","article-title":"American gut: an open platform for citizen 
science microbiome research","volume":"3","author":"D McDonald","year":"2018","journal-title":"mSystems"},{"issue":"4","key":"pcbi.1011353.ref009","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0061217","article-title":"phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data","volume":"8","author":"PJ McMurdie","year":"2013","journal-title":"PLoS One"},{"issue":"12","key":"pcbi.1011353.ref010","doi-asserted-by":"crossref","first-page":"e2006842","DOI":"10.1371\/journal.pbio.2006842","article-title":"Gut microbiota diversity across ethnicities in the United States","volume":"16","author":"AW Brooks","year":"2018","journal-title":"PLoS Biol"},{"issue":"6","key":"pcbi.1011353.ref011","doi-asserted-by":"crossref","first-page":"652","DOI":"10.1038\/s41564-018-0156-0","article-title":"Methods for phylogenetic analysis of microbiome data","volume":"3","author":"AD Washburne","year":"2018","journal-title":"Nat Microbiol"},{"issue":"2","key":"pcbi.1011353.ref012","doi-asserted-by":"crossref","first-page":"e1006721","DOI":"10.1371\/journal.pcbi.1006721","article-title":"16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses","volume":"15","author":"S Woloszynek","year":"2019","journal-title":"PLoS Comput Biol"},{"issue":"5","key":"pcbi.1011353.ref013","doi-asserted-by":"crossref","first-page":"e1007859","DOI":"10.1371\/journal.pcbi.1007859","article-title":"Decoding the language of microbiomes using word-embedding techniques, and applications in inflammatory bowel disease","volume":"16","author":"CA Tataru","year":"2020","journal-title":"PLoS Comput Biol"},{"key":"pcbi.1011353.ref014","doi-asserted-by":"crossref","unstructured":"Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP); 2014. p. 1532\u201343. 
Available from: http:\/\/www.aclweb.org\/anthology\/D14-1162","DOI":"10.3115\/v1\/D14-1162"},{"issue":"1","key":"pcbi.1011353.ref015","doi-asserted-by":"crossref","DOI":"10.1016\/j.cmi.2015.09.004","article-title":"Composition of human faecal microbiota in resistance to Campylobacter infection","volume":"22","author":"C Kampmann","year":"2016","journal-title":"Clin Microbiol Infect"},{"key":"pcbi.1011353.ref016","article-title":"Attention is all you need.","author":"A Vaswani","year":"2017"},{"issue":"10","key":"pcbi.1011353.ref017","doi-asserted-by":"crossref","first-page":"7112","DOI":"10.1109\/TPAMI.2021.3095381","article-title":"ProtTrans: toward understanding the language of life through self-supervised learning","volume":"44","author":"A Elnaggar","year":"2022","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"1","key":"pcbi.1011353.ref018","doi-asserted-by":"crossref","first-page":"723","DOI":"10.1186\/s12859-019-3220-8","article-title":"Modeling aspects of the language of life through transfer-learning protein sequences","volume":"20","author":"M Heinzinger","year":"2019","journal-title":"BMC Bioinformatics"},{"issue":"8","key":"pcbi.1011353.ref019","doi-asserted-by":"crossref","first-page":"2102","DOI":"10.1093\/bioinformatics\/btac020","article-title":"ProteinBERT: a universal deep-learning model of protein sequence and function","volume":"38","author":"N Brandes","year":"2022","journal-title":"Bioinformatics"},{"issue":"15","key":"pcbi.1011353.ref020","doi-asserted-by":"crossref","first-page":"e2016239118","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"A Rives","year":"2021","journal-title":"Proc Natl Acad Sci U S A"},{"issue":"15","key":"pcbi.1011353.ref021","doi-asserted-by":"crossref","first-page":"2112","DOI":"10.1093\/bioinformatics\/btab083","article-title":"DNABERT: pre-trained bidirectional 
encoder representations from transformers model for DNA-language in genome","volume":"37","author":"Y Ji","year":"2021","journal-title":"Bioinformatics"},{"key":"pcbi.1011353.ref022","doi-asserted-by":"crossref","unstructured":"Shaw P, Uszkoreit J, Vaswani A. Self-attention with relative position representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics; 2018.","DOI":"10.18653\/v1\/N18-2074"},{"key":"pcbi.1011353.ref023","doi-asserted-by":"crossref","unstructured":"Huang Z, Liang D, Xu P, Xiang B. Improve transformer models with better relative position embeddings. In: Findings of the Association for Computational Linguistics: EMNLP 2020; 2020. p. 3327\u201335.","DOI":"10.18653\/v1\/2020.findings-emnlp.298"},{"issue":"3","key":"pcbi.1011353.ref024","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1109\/MSP.2021.3134634","article-title":"Self-supervised representation learning: introduction, advances, and challenges","volume":"39","author":"L Ericsson","year":"2022","journal-title":"IEEE Signal Process Mag"},{"key":"pcbi.1011353.ref025","article-title":"ELECTRA: pre-training text encoders as discriminators rather than generators.","author":"K Clark","year":"2020"},{"key":"pcbi.1011353.ref026","article-title":"BERT: pre-training of deep bidirectional transformers for language understanding.","author":"J Devlin","year":"2019"},{"issue":"56","key":"pcbi.1011353.ref027","first-page":"1929","article-title":"Dropout: a simple way to prevent neural networks from overfitting","volume":"15","author":"N Srivastava","year":"2014","journal-title":"J Mach Learn Res"},{"key":"pcbi.1011353.ref028","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-319-10590-1_53","article-title":"Visualizing and understanding convolutional networks.","volume-title":"Computer Vision \u2013 ECCV 2014","author":"MD 
Zeiler","year":"2014"},{"key":"pcbi.1011353.ref029","doi-asserted-by":"crossref","first-page":"17004","DOI":"10.1038\/nmicrobiol.2017.4","article-title":"Dynamics of the human gut microbiome in inflammatory bowel disease","volume":"2","author":"J Halfvarson","year":"2017","journal-title":"Nat Microbiol"},{"issue":"7758","key":"pcbi.1011353.ref030","doi-asserted-by":"crossref","first-page":"655","DOI":"10.1038\/s41586-019-1237-9","article-title":"Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases","volume":"569","author":"J Lloyd-Price","year":"2019","journal-title":"Nature"},{"key":"pcbi.1011353.ref031","article-title":"Data and code from: learning a deep language model for microbiomes: the power of large scale unlabeled microbiome data","author":"Q Pope","year":"2024"},{"issue":"1","key":"pcbi.1011353.ref032","doi-asserted-by":"crossref","first-page":"6026","DOI":"10.1038\/s41598-020-63159-5","article-title":"DeepMicro: deep representation learning for disease prediction based on microbiome data","volume":"10","author":"M Oh","year":"2020","journal-title":"Sci Rep"},{"issue":"1","key":"pcbi.1011353.ref033","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1002\/rsa.10073","article-title":"An elementary proof of a theorem of Johnson and Lindenstrauss","volume":"22","author":"S Dasgupta","year":"2002","journal-title":"Random Struct Algorithms"},{"key":"pcbi.1011353.ref034","first-page":"2825","article-title":"Scikit-learn: machine learning in python","volume":"12","author":"F Pedregosa","year":"2011","journal-title":"J Mach Learn Res"},{"issue":"2","key":"pcbi.1011353.ref035","doi-asserted-by":"crossref","first-page":"233","DOI":"10.1002\/aic.690370209","article-title":"Nonlinear principal component analysis using autoassociative neural networks","volume":"37","author":"MA Kramer","year":"1991","journal-title":"AIChE 
J"},{"key":"pcbi.1011353.ref036","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1016\/j.patcog.2018.05.019","article-title":"Discriminatively boosted image clustering with fully convolutional auto-encoders","volume":"83","author":"F Li","year":"2018","journal-title":"Pattern Recognit"},{"key":"pcbi.1011353.ref037","doi-asserted-by":"crossref","first-page":"1261889","DOI":"10.3389\/fmicb.2023.1261889","article-title":"Machine learning approaches in microbiome research: challenges and best practices","volume":"14","author":"G Papoutsoglou","year":"2023","journal-title":"Front Microbiol"},{"issue":"1","key":"pcbi.1011353.ref038","doi-asserted-by":"crossref","first-page":"6818","DOI":"10.1038\/s41467-022-34405-3","article-title":"Faecal microbiome-based machine learning for multi-class disease diagnosis","volume":"13","author":"Q Su","year":"2022","journal-title":"Nat Commun"},{"key":"pcbi.1011353.ref039","doi-asserted-by":"crossref","DOI":"10.1101\/2024.09.16.613342","article-title":"Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear baselines","author":"C Ahlmann-Eltze","year":"2024"},{"issue":"86","key":"pcbi.1011353.ref040","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"L van der Maaten","year":"2008","journal-title":"J Mach Learn Res"},{"issue":"7","key":"pcbi.1011353.ref041","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1038\/nmeth.3869","article-title":"DADA2: high-resolution sample inference from illumina amplicon data","volume":"13","author":"BJ Callahan","year":"2016","journal-title":"Nat Methods"},{"issue":"1","key":"pcbi.1011353.ref042","doi-asserted-by":"crossref","first-page":"5416","DOI":"10.1038\/s41467-019-13056-x","article-title":"The art of using t-SNE for single-cell transcriptomics","volume":"10","author":"D Kobak","year":"2019","journal-title":"Nat 
Commun"},{"issue":"1","key":"pcbi.1011353.ref043","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1093\/nar\/28.1.27","article-title":"KEGG: Kyoto encyclopedia of genes and genomes","volume":"28","author":"M Kanehisa","year":"2000","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"pcbi.1011353.ref044","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1186\/s12864-019-6427-1","article-title":"Piphillin predicts metagenomic composition and dynamics from DADA2-corrected 16S rDNA sequences","volume":"21","author":"NR Narayan","year":"2020","journal-title":"BMC Genomics"},{"key":"pcbi.1011353.ref045","unstructured":"Tenenbaum D, Maintainer B. KEGGREST: client-side REST access to the Kyoto Encyclopedia of Genes and Genomes (KEGG). R package. In: KEGGREST: client-side REST access to the Kyoto Encyclopedia of Genes and Genomes (KEGG). R package; 2018."},{"key":"pcbi.1011353.ref046","first-page":"83","article-title":"Sulla determinazione empirica di una legge di distribuzione.","volume":"4","author":"L Ka","year":"1933","journal-title":"G Ist Ital Attuari"},{"issue":"2","key":"pcbi.1011353.ref047","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1214\/aoms\/1177730256","article-title":"Table for estimating the goodness of fit of empirical distributions","volume":"19","author":"N Smirnov","year":"1948","journal-title":"Ann Math Statist"},{"key":"pcbi.1011353.ref048","doi-asserted-by":"crossref","first-page":"177","DOI":"10.1080\/00949658608810963","article-title":"An omnibus test for the two-sample problem using the empirical characteristic function","volume":"26","author":"TW Epps","year":"1986","journal-title":"J Statist Comput Simulat"},{"issue":"3","key":"pcbi.1011353.ref049","doi-asserted-by":"crossref","first-page":"261","DOI":"10.1038\/s41592-019-0686-2","article-title":"SciPy 1.0: fundamental algorithms for scientific computing in Python","volume":"17","author":"P Virtanen","year":"2020","journal-title":"Nat 
Methods"},{"issue":"3","key":"pcbi.1011353.ref050","doi-asserted-by":"crossref","first-page":"494","DOI":"10.1037\/0033-2909.114.3.494","article-title":"Dominance statistics: ordinal analyses to answer ordinal questions.","volume":"114","author":"N Cliff","year":"1993","journal-title":"Psychol Bullet"},{"key":"pcbi.1011353.ref051","article-title":"GMrepo v2: a curated human gut microbiome database with special focus on disease markers and cross-dataset comparison","volume":"50","author":"D Dai","year":"2022","journal-title":"Nucleic Acids Res"},{"key":"pcbi.1011353.ref052","unstructured":"Hernandez D, Kaplan J, Henighan T, McCandlish S. Scaling laws for transfer; 2021."},{"issue":"7","key":"pcbi.1011353.ref053","doi-asserted-by":"crossref","first-page":"7628","DOI":"10.1609\/aaai.v36i7.20729","article-title":"Frozen pretrained transformers as universal computation engines","volume":"36","author":"K Lu","year":"2022","journal-title":"AAAI"}],"container-title":["PLOS Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1011353","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,7]],"date-time":"2025-05-07T16:07:56Z","timestamp":1746634076000},"score":1,"resource":{"primary":{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1011353"}},"subtitle":[],"editor":[{"given":"Stacey D.","family":"Finley","sequence":"first","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2025,5,7]]},"references-count":53,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2025,5,7]]}},"URL":"https:\/\/doi.org\/10.1371\/journal.pcbi.1011353","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2023.07.17.549267","asserted-by":"object"}]},"ISSN":["1553-7358"],"issn-type":[{"value":"1553-7358","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,7]]}}}