{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T06:08:59Z","timestamp":1778220539192,"version":"3.51.4"},"reference-count":71,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T00:00:00Z","timestamp":1764115200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T00:00:00Z","timestamp":1764115200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100005370","name":"Gates Cambridge Trust","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100005370","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100004460","name":"Rotary Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100004460","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Background<\/jats:title>\n                    <jats:p>The functional annotation of uncharacterized microbial enzymes from metagenomic data remains a significant challenge, limiting our understanding of microbial metabolic dynamics. Traditional annotation methods often rely on sequence homology, which can fail to identify remote homologs or enzymes with structural rather than sequence conservation. 
To address this gap, we developed CAZyLingua, the first annotation tool to use protein language models (pLMs) for the accurate classification of carbohydrate-active enzyme (CAZyme) families and subfamilies.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>CAZyLingua demonstrated high performance, maintaining precision and recall comparable to state-of-the-art hidden Markov model-based methods while outperforming purely sequence-based approaches. When applied to a metagenomic gene catalog from mother\/infant pairs, CAZyLingua identified over 27,000 putative CAZymes missed by other tools, including horizontally-transferred enzymes implicated in infant microbiome development. In datasets from patients with Crohn\u2019s disease and IgG4-related disease, CAZyLingua uncovered disease-associated CAZymes, highlighting an expansion of carbohydrate esterases (CEs) in IgG4-related disease. A CE17 enzyme predicted to be overabundant in Crohn\u2019s disease was functionally validated, confirming its catalytic activity on acetylated manno-oligosaccharides.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Conclusions<\/jats:title>\n                    <jats:p>CAZyLingua is a powerful tool that effectively augments existing functional annotation pipelines for CAZymes. 
By leveraging the deep contextual information captured by pLMs, our method can uncover novel CAZyme diversity and reveal enzymatic functions relevant to health and disease, contributing to a further understanding of biological processes related to host health and nutrition.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1186\/s12859-025-06286-y","type":"journal-article","created":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T15:29:25Z","timestamp":1764170965000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Protein language models uncover carbohydrate-active enzyme function in metagenomics"],"prefix":"10.1186","volume":"26","author":[{"given":"Kumar","family":"Thurimella","sequence":"first","affiliation":[]},{"given":"Ahmed M. T.","family":"Mohamed","sequence":"additional","affiliation":[]},{"given":"Chenhao","family":"Li","sequence":"additional","affiliation":[]},{"given":"Tommi","family":"Vatanen","sequence":"additional","affiliation":[]},{"given":"Daniel B.","family":"Graham","sequence":"additional","affiliation":[]},{"given":"R\u00f3is\u00edn M.","family":"Owens","sequence":"additional","affiliation":[]},{"given":"Sabina Leanti","family":"La Rosa","sequence":"additional","affiliation":[]},{"given":"Damian R.","family":"Plichta","sequence":"additional","affiliation":[]},{"given":"Sergio","family":"Bacallado","sequence":"additional","affiliation":[]},{"given":"Ramnik J.","family":"Xavier","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,11,26]]},"reference":[{"issue":"7285","key":"6286_CR1","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1038\/nature08821","volume":"464","author":"J Qin","year":"2010","unstructured":"Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 
2010;464(7285):59\u201365.","journal-title":"Nature"},{"issue":"1","key":"6286_CR2","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1038\/s41587-020-0603-3","volume":"39","author":"A Almeida","year":"2021","unstructured":"Almeida A, Nayfach S, Boland M, Strozzi F, Beracochea M, Shi ZJ, et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. 2021;39(1):105\u201314.","journal-title":"Nat Biotechnol"},{"issue":"3","key":"6286_CR3","doi-asserted-by":"publisher","first-page":"649","DOI":"10.1016\/j.cell.2019.01.001","volume":"176","author":"E Pasolli","year":"2019","unstructured":"Pasolli E, Asnicar F, Manara S, Zolfo M, Karcher N, Armanini F, et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell. 2019;176(3):649-662.e20.","journal-title":"Cell"},{"issue":"7459","key":"6286_CR4","doi-asserted-by":"publisher","first-page":"431","DOI":"10.1038\/nature12352","volume":"499","author":"C Rinke","year":"2013","unstructured":"Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng JF, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature. 2013;499(7459):431\u20137.","journal-title":"Nature"},{"issue":"52","key":"6286_CR5","doi-asserted-by":"publisher","first-page":"15898","DOI":"10.1073\/pnas.1508380112","volume":"112","author":"N Perdig\u00e3o","year":"2015","unstructured":"Perdig\u00e3o N, Heinrich J, Stolte C, Sabir KS, Buckley MJ, Tabor B, et al. Unexpected features of the dark proteome. Proc Natl Acad Sci USA. 2015;112(52):15898\u2013903.","journal-title":"Proc Natl Acad Sci USA"},{"issue":"7915","key":"6286_CR6","doi-asserted-by":"publisher","first-page":"754","DOI":"10.1038\/s41586-022-04648-7","volume":"606","author":"Y Zhang","year":"2022","unstructured":"Zhang Y, Bhosle A, Bae S, McIver LJ, Pishchany G, Accorsi EK, et al. 
Discovery of bioactive microbial gene products in inflammatory bowel disease. Nature. 2022;606(7915):754\u201360.","journal-title":"Nature"},{"issue":"15","key":"6286_CR7","doi-asserted-by":"publisher","first-page":"508","DOI":"10.1186\/1471-2164-15-508","volume":"21","author":"B Ma","year":"2014","unstructured":"Ma B, Charkowski AO, Glasner JD, Perna NT. Identification of host-microbe interaction factors in the genomes of soft rot-associated pathogens Dickeya dadantii 3937 and Pectobacterium carotovorum WPP14 with supervised machine learning. BMC Genom. 2014;21(15):508.","journal-title":"BMC Genom"},{"key":"6286_CR8","doi-asserted-by":"publisher","DOI":"10.1128\/mSystems.00183-17","author":"CA Lozupone","year":"2018","unstructured":"Lozupone CA. Unraveling interactions between the microbiome and the host immune system to decipher mechanisms of disease. mSystems. 2018. https:\/\/doi.org\/10.1128\/mSystems.00183-17.","journal-title":"mSystems"},{"issue":"1","key":"6286_CR9","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1186\/s40168-020-00875-0","volume":"8","author":"G Berg","year":"2020","unstructured":"Berg G, Rybakova D, Fischer D, Cernava T, Verg\u00e8s MCC, Charles T, et al. Microbiome definition re-visited: old concepts and new challenges. Microbiome. 2020;8(1):103.","journal-title":"Microbiome"},{"issue":"6639","key":"6286_CR10","doi-asserted-by":"publisher","first-page":"1358","DOI":"10.1126\/science.adf2465","volume":"379","author":"T Yu","year":"2023","unstructured":"Yu T, Cui H, Li JC, Luo Y, Jiang G, Zhao H. Enzyme function prediction using contrastive learning. Science. 2023;379(6639):1358\u201363.","journal-title":"Science"},{"issue":"7949","key":"6286_CR11","doi-asserted-by":"publisher","first-page":"774","DOI":"10.1038\/s41586-023-05696-3","volume":"614","author":"AHW Yeh","year":"2023","unstructured":"Yeh AHW, Norn C, Kipnis Y, Tischer D, Pellock SJ, Evans D, et al. De novo design of luciferases using deep learning. Nature. 
2023;614(7949):774\u201380.","journal-title":"Nature"},{"key":"6286_CR12","doi-asserted-by":"publisher","first-page":"89802","DOI":"10.1109\/ACCESS.2020.2992468","volume":"8","author":"Z Tao","year":"2020","unstructured":"Tao Z, Dong B, Teng Z, Zhao Y. The classification of enzymes by deep learning. IEEE Access. 2020;8:89802\u201311.","journal-title":"IEEE Access"},{"issue":"5","key":"6286_CR13","doi-asserted-by":"publisher","first-page":"760","DOI":"10.1093\/bioinformatics\/btx680","volume":"34","author":"Y Li","year":"2018","unstructured":"Li Y, Wang S, Umarov R, Xie B, Fan M, Li L, et al. DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics. 2018;34(5):760\u20139.","journal-title":"Bioinformatics"},{"issue":"6","key":"6286_CR14","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0028742","volume":"7","author":"BL Cantarel","year":"2012","unstructured":"Cantarel BL, Lombard V, Henrissat B. Complex carbohydrate utilization by the healthy human microbiome. PLoS ONE. 2012;7(6):e28742.","journal-title":"PLoS ONE"},{"key":"6286_CR15","doi-asserted-by":"publisher","first-page":"D490","DOI":"10.1093\/nar\/gkt1178","volume":"42","author":"V Lombard","year":"2014","unstructured":"Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B. The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res. 2014;42:D490\u20135.","journal-title":"Nucleic Acids Res"},{"key":"6286_CR16","doi-asserted-by":"publisher","first-page":"D233","DOI":"10.1093\/nar\/gkn663","volume":"37","author":"BL Cantarel","year":"2009","unstructured":"Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, Henrissat B. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res. 
2009;37:D233\u20138.","journal-title":"Nucleic Acids Res"},{"issue":"5","key":"6286_CR17","doi-asserted-by":"publisher","first-page":"859","DOI":"10.1016\/j.cell.2016.01.024","volume":"164","author":"MR Charbonneau","year":"2016","unstructured":"Charbonneau MR, O\u2019Donnell D, Blanton LV, Totten SM, Davis JCC, Barratt MJ, et al. Sialylated milk oligosaccharides promote microbiota-dependent growth in models of infant undernutrition. Cell. 2016;164(5):859\u201371.","journal-title":"Cell"},{"issue":"9","key":"6286_CR18","doi-asserted-by":"publisher","first-page":"542","DOI":"10.1038\/s41579-022-00712-1","volume":"20","author":"JF Wardman","year":"2022","unstructured":"Wardman JF, Bains RK, Rahfeld P, Withers SG. Carbohydrate-active enzymes (CAZymes) in the gut microbiome. Nat Rev Microbiol. 2022;20(9):542\u201356.","journal-title":"Nat Rev Microbiol"},{"issue":"6","key":"6286_CR19","doi-asserted-by":"publisher","first-page":"e0220122","DOI":"10.1128\/mbio.02201-22","volume":"13","author":"AM Porras","year":"2022","unstructured":"Porras AM, Zhou H, Shi Q, Xiao X, JRI Live Cell Bank, Longman R, et al. Inflammatory bowel disease-associated gut commensals degrade components of the extracellular Matrix. MBio. 2022;13(6):e0220122.","journal-title":"MBio"},{"issue":"1","key":"6286_CR20","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1186\/s13073-021-00853-7","volume":"13","author":"DR Plichta","year":"2021","unstructured":"Plichta DR, Somani J, Pichaud M, Wallace ZS, Fernandes AD, Perugino CA, et al. Congruent microbiome signatures in fibrosis-prone autoimmune diseases: IgG4-related disease and systemic sclerosis. Genome Med. 2021;13(1):35.","journal-title":"Genome Med"},{"issue":"7","key":"6286_CR21","doi-asserted-by":"publisher","first-page":"951","DOI":"10.1093\/bioinformatics\/bti125","volume":"21","author":"J S\u00f6ding","year":"2004","unstructured":"S\u00f6ding J. Protein homology detection by HMM\u2013HMM comparison. Bioinformatics. 
2004;21(7):951\u201360.","journal-title":"Bioinformatics"},{"key":"6286_CR22","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1093\/nar\/gkr367","volume":"39","author":"RD Finn","year":"2011","unstructured":"Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011;39:29\u201337.","journal-title":"Nucleic Acids Res"},{"issue":"3","key":"6286_CR23","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","volume":"215","author":"SF Altschul","year":"1990","unstructured":"Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403\u201310.","journal-title":"J Mol Biol"},{"issue":"17","key":"6286_CR24","doi-asserted-by":"publisher","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","volume":"25","author":"SF Altschul","year":"1997","unstructured":"Altschul SF, Madden TL, Sch\u00e4ffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389\u2013402.","journal-title":"Nucleic Acids Res"},{"issue":"W1","key":"6286_CR25","doi-asserted-by":"publisher","first-page":"W95","DOI":"10.1093\/nar\/gky418","volume":"46","author":"H Zhang","year":"2018","unstructured":"Zhang H, Yohe T, Huang L, Entwistle S, Wu P, Yang Z, et al. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 2018;46(W1):W95-101.","journal-title":"Nucleic Acids Res"},{"issue":"7706","key":"6286_CR26","doi-asserted-by":"publisher","first-page":"503","DOI":"10.1038\/s41586-018-0124-0","volume":"557","author":"MN Price","year":"2018","unstructured":"Price MN, Wetmore KM, Waters RJ, Callaghan M, Ray J, Liu H, et al. Mutant phenotypes for thousands of bacterial genes of unknown function. Nature. 
2018;557(7706):503\u20139.","journal-title":"Nature"},{"key":"6286_CR27","doi-asserted-by":"publisher","DOI":"10.1038\/s41587-021-01179-w","author":"ML Bileschi","year":"2022","unstructured":"Bileschi ML, Belanger D, Bryant DH, Sanderson T, Carter B, Sculley D, et al. Using deep learning to annotate the protein universe. Nat Biotechnol. 2022. https:\/\/doi.org\/10.1038\/s41587-021-01179-w.","journal-title":"Nat Biotechnol"},{"issue":"1","key":"6286_CR28","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41467-021-23303-9","volume":"12","author":"V Gligorijevi\u0107","year":"2021","unstructured":"Gligorijevi\u0107 V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun. 2021;12(1):1\u201314.","journal-title":"Nat Commun"},{"issue":"11","key":"6286_CR29","doi-asserted-by":"publisher","first-page":"1617","DOI":"10.1038\/s41587-022-01432-w","volume":"40","author":"R Chowdhury","year":"2022","unstructured":"Chowdhury R, Bouatta N, Biswas S, Floristean C, Kharkar A, Roy K, et al. Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol. 2022;40(11):1617\u201323.","journal-title":"Nat Biotechnol"},{"key":"6286_CR30","doi-asserted-by":"publisher","first-page":"1099","DOI":"10.1038\/s41587-022-01618-2","volume":"41","author":"A Madani","year":"2023","unstructured":"Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, et al. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023;41:1099\u2013106.","journal-title":"Nat Biotechnol"},{"issue":"7873","key":"6286_CR31","doi-asserted-by":"publisher","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","volume":"596","author":"J Jumper","year":"2021","unstructured":"Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 
2021;596(7873):583\u20139.","journal-title":"Nature"},{"issue":"1","key":"6286_CR32","doi-asserted-by":"publisher","first-page":"2351","DOI":"10.1038\/s41467-023-37896-w","volume":"14","author":"J Koehler Leman","year":"2023","unstructured":"Koehler Leman J, Szczerbiak P, Renfrew PD, Gligorijevic V, Berenberg D, Vatanen T, et al. Sequence-structure-function relationships in the microbial protein universe. Nat Commun. 2023;14(1):2351.","journal-title":"Nat Commun"},{"key":"6286_CR33","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.2016239118","author":"A Rives","year":"2021","unstructured":"Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA. 2021. https:\/\/doi.org\/10.1073\/pnas.2016239118.","journal-title":"Proc Natl Acad Sci USA"},{"key":"6286_CR34","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btad579","author":"K Kaminski","year":"2023","unstructured":"Kaminski K, Ludwiczak J, Pawlicki K, Alva V, Dunin-Horkawicz S. pLM-BLAST: distant homology detection based on direct comparison of sequence representations from protein language models. Bioinformatics. 2023. https:\/\/doi.org\/10.1093\/bioinformatics\/btad579.","journal-title":"Bioinformatics"},{"issue":"2","key":"6286_CR35","doi-asserted-by":"publisher","DOI":"10.1093\/nargab\/lqac043","volume":"4","author":"M Heinzinger","year":"2022","unstructured":"Heinzinger M, Littmann M, Sillitoe I, Bordin N, Orengo C, Rost B. Contrastive learning on protein embeddings enlightens midnight zone. NAR Genom Bioinform. 2022;4(2):lqac043.","journal-title":"NAR Genom Bioinform"},{"issue":"6637","key":"6286_CR36","doi-asserted-by":"publisher","first-page":"1123","DOI":"10.1126\/science.ade2574","volume":"379","author":"Z Lin","year":"2023","unstructured":"Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. 
Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123\u201330.","journal-title":"Science"},{"key":"6286_CR37","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3095381","author":"A Elnaggar","year":"2021","unstructured":"Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. ProtTrans: towards cracking the language of lifes code through self-supervised deep learning and high performance computing. IEEE Trans Pattern Anal Mach Intell. 2021. https:\/\/doi.org\/10.1109\/TPAMI.2021.3095381.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"6","key":"6286_CR38","doi-asserted-by":"publisher","first-page":"654","DOI":"10.1016\/j.cels.2021.05.017","volume":"12","author":"T Bepler","year":"2021","unstructured":"Bepler T, Berger B. Learning the protein language: Evolution, structure, and function. Cell Syst. 2021;12(6):654-669.e3.","journal-title":"Cell Syst"},{"issue":"26","key":"6286_CR39","doi-asserted-by":"publisher","first-page":"4921","DOI":"10.1016\/j.cell.2022.11.023","volume":"185","author":"T Vatanen","year":"2022","unstructured":"Vatanen T, Jabbar KS, Ruohtula T, Honkanen J, Avila-Pacheco J, Siljander H, et al. Mobile genetic elements from the maternal microbiome shape infant gut microbial assembly and metabolism. Cell. 2022;185(26):4921-4936.e15.","journal-title":"Cell"},{"issue":"9","key":"6286_CR40","doi-asserted-by":"publisher","DOI":"10.1016\/j.xcrm.2021.100393","volume":"2","author":"YC Lou","year":"2021","unstructured":"Lou YC, Olm MR, Diamond S, Crits-Christoph A, Firek BA, Baker R, et al. Infant gut strain persistence is associated with maternal origin, phylogeny, and traits including surface adhesion and iron acquisition. Cell Rep Med. 
2021;2(9):100393.","journal-title":"Cell Rep Med"},{"issue":"23","key":"6286_CR41","doi-asserted-by":"publisher","first-page":"3150","DOI":"10.1093\/bioinformatics\/bts565","volume":"28","author":"L Fu","year":"2012","unstructured":"Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150\u20132.","journal-title":"Bioinformatics"},{"issue":"5","key":"6286_CR42","doi-asserted-by":"publisher","DOI":"10.1002\/cpz1.113","volume":"1","author":"C Dallago","year":"2021","unstructured":"Dallago C, Sch\u00fctze K, Heinzinger M, Olenyi T, Littmann M, Lu AX, et al. Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols. 2021;1(5):e113.","journal-title":"Current Protocols"},{"issue":"4","key":"6286_CR43","doi-asserted-by":"publisher","first-page":"366","DOI":"10.1038\/s41592-021-01101-x","volume":"18","author":"B Buchfink","year":"2021","unstructured":"Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18(4):366\u20138.","journal-title":"Nat Methods"},{"key":"6286_CR44","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825\u201330.","journal-title":"J Mach Learn Res"},{"key":"6286_CR45","unstructured":"lightning: Deep learning framework to train, deploy, and ship AI products Lightning fast [Internet]. Github; [cited 2023 Aug 26]. Available from: https:\/\/github.com\/Lightning-AI\/lightning"},{"key":"6286_CR46","unstructured":"Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library [Internet]. arXiv [cs.LG]. 2019. 
Available from: http:\/\/arxiv.org\/abs\/1912.01703"},{"key":"6286_CR47","unstructured":"Liaw R, Liang E, Nishihara R, Moritz P, Gonzalez JE, Stoica I. Tune: A Research Platform for Distributed Model Selection and Training [Internet]. arXiv [cs.LG]. 2018. Available from: http:\/\/arxiv.org\/abs\/1807.05118"},{"key":"6286_CR48","unstructured":"Li L, Jamieson K, Rostamizadeh A, Gonina K, Hardt M, Recht B, et al. Massively Parallel Hyperparameter Tuning [Internet]. 2018 [cited 2023 Aug 27]. Available from: https:\/\/openreview.net\/pdf?id=S1Y7OOlRZ"},{"key":"6286_CR49","unstructured":"Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems [Internet]. arXiv [cs.DC]. 2016. Available from: http:\/\/arxiv.org\/abs\/1603.04467"},{"issue":"W1","key":"6286_CR50","doi-asserted-by":"publisher","first-page":"W110","DOI":"10.1093\/nar\/gkaa375","volume":"48","author":"K Barrett","year":"2020","unstructured":"Barrett K, Hunt CJ, Lange L, Meyer AS. Conserved unique peptide patterns (CUPP) online platform: peptide-based functional annotation of carbohydrate active enzymes. Nucleic Acids Res. 2020;48(W1):W110\u20135.","journal-title":"Nucleic Acids Res"},{"issue":"2","key":"6286_CR51","doi-asserted-by":"publisher","first-page":"293","DOI":"10.1038\/s41564-018-0306-4","volume":"4","author":"EA Franzosa","year":"2019","unstructured":"Franzosa EA, Sirota-Madi A, Avila-Pacheco J, Fornelos N, Haiser HJ, Reinker S, et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat Microbiol. 2019;4(2):293\u2013305.","journal-title":"Nat Microbiol"},{"issue":"1","key":"6286_CR52","doi-asserted-by":"publisher","first-page":"10","DOI":"10.14806\/ej.17.1.200","volume":"17","author":"M Martin","year":"2011","unstructured":"Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 
2011;17(1):10\u20132.","journal-title":"EMBnet J"},{"issue":"10","key":"6286_CR53","doi-asserted-by":"publisher","first-page":"1674","DOI":"10.1093\/bioinformatics\/btv033","volume":"31","author":"D Li","year":"2015","unstructured":"Li D, Liu CM, Luo R, Sadakane K, Lam TW. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674\u20136.","journal-title":"Bioinformatics"},{"issue":"11","key":"6286_CR54","doi-asserted-by":"publisher","first-page":"119","DOI":"10.1186\/1471-2105-11-119","volume":"8","author":"D Hyatt","year":"2010","unstructured":"Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 2010;8(11):119.","journal-title":"BMC Bioinform"},{"issue":"11","key":"6286_CR55","doi-asserted-by":"publisher","first-page":"1026","DOI":"10.1038\/nbt.3988","volume":"35","author":"M Steinegger","year":"2017","unstructured":"Steinegger M, S\u00f6ding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026\u20138.","journal-title":"Nat Biotechnol"},{"issue":"6","key":"6286_CR56","doi-asserted-by":"publisher","first-page":"679","DOI":"10.1038\/s41592-022-01488-1","volume":"19","author":"M Mirdita","year":"2022","unstructured":"Mirdita M, Sch\u00fctze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022;19(6):679\u201382.","journal-title":"Nat Methods"},{"key":"6286_CR57","doi-asserted-by":"publisher","DOI":"10.1038\/s41587-023-01773-0","author":"M van Kempen","year":"2023","unstructured":"van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2023. 
https:\/\/doi.org\/10.1038\/s41587-023-01773-0.","journal-title":"Nat Biotechnol"},{"key":"6286_CR58","doi-asserted-by":"crossref","unstructured":"Hunter (2007) Matplotlib: A 2D Graphics Environment. 9:90\u20135.","DOI":"10.1109\/MCSE.2007.55"},{"issue":"W1","key":"6286_CR59","doi-asserted-by":"publisher","first-page":"W210","DOI":"10.1093\/nar\/gkac387","volume":"50","author":"L Holm","year":"2022","unstructured":"Holm L. Dali server: structural unification of protein families. Nucleic Acids Res. 2022;50(W1):W210\u20135.","journal-title":"Nucleic Acids Res"},{"issue":"9","key":"6286_CR60","doi-asserted-by":"publisher","first-page":"1109","DOI":"10.1038\/s41592-022-01585-1","volume":"19","author":"C Zhang","year":"2022","unstructured":"Zhang C, Shine M, Pyle AM, Zhang Y. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods. 2022;19(9):1109\u201315.","journal-title":"Nat Methods"},{"issue":"3","key":"6286_CR61","doi-asserted-by":"publisher","first-page":"261","DOI":"10.1038\/s41592-019-0686-2","volume":"17","author":"P Virtanen","year":"2020","unstructured":"Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261\u201372.","journal-title":"Nat Methods"},{"key":"6286_CR62","unstructured":"Seabold S, Perktold J. Statsmodels: Econometric and statistical modeling with python. In: Proceedings of the 9th Python in Science Conference [Internet]. SciPy; 2010 [cited 2023 Aug 18]. Available from: https:\/\/pdfs.semanticscholar.org\/3a27\/6417e5350e29cb6bf04ea5a4785601d5a215.pdf"},{"issue":"1","key":"6286_CR63","doi-asserted-by":"publisher","first-page":"905","DOI":"10.1038\/s41467-019-08812-y","volume":"10","author":"SL La Rosa","year":"2019","unstructured":"La Rosa SL, Leth ML, Michalak L, Hansen ME, Pudlo NA, Glowacki R, et al. 
The human gut Firmicute Roseburia intestinalis is a primary degrader of dietary \u03b2-mannans. Nat Commun. 2019;10(1):905.","journal-title":"Nat Commun"},{"issue":"3","key":"6286_CR64","doi-asserted-by":"publisher","DOI":"10.1128\/mBio.03628-20","volume":"12","author":"LJ Lindstad","year":"2021","unstructured":"Lindstad LJ, Lo G, Leivers S, Lu Z, Michalak L, Pereira GV, et al. Human gut faecalibacterium prausnitzii deploys a highly efficient conserved system to cross-feed on \u03b2-mannan-derived oligosaccharides. MBio. 2021;12(3):e0362820.","journal-title":"MBio"},{"issue":"7","key":"6286_CR65","doi-asserted-by":"publisher","first-page":"1023","DOI":"10.1038\/s41587-021-01156-3","volume":"40","author":"F Teufel","year":"2022","unstructured":"Teufel F, Almagro Armenteros JJ, Johansen AR, G\u00edslason MH, Pihl SI, Tsirigos KD, et al. Signalp 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol. 2022;40(7):1023\u20135.","journal-title":"Nat Biotechnol"},{"issue":"11","key":"6286_CR66","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1009442","volume":"17","author":"H Mallick","year":"2021","unstructured":"Mallick H, Rahnavard A, McIver LJ, Ma S, Zhang Y, Nguyen LH, et al. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput Biol. 2021;17(11):e1009442.","journal-title":"PLoS Comput Biol"},{"issue":"10","key":"6286_CR67","first-page":"826","volume":"32","author":"AC Anderson","year":"2022","unstructured":"Anderson AC, Stangherlin S, Pimentel KN, Weadge JT, Clarke AJ. The SGNH hydrolase family: a template for carbohydrate diversity. Glycobiology. 2022;32(10):826\u201348.","journal-title":"Glycobiology"},{"issue":"12","key":"6286_CR68","doi-asserted-by":"publisher","first-page":"1183","DOI":"10.1016\/S0969-2126(01)00684-0","volume":"9","author":"JA Prates","year":"2001","unstructured":"Prates JA, Tarbouriech N, Charnock SJ, Fontes CM, Ferreira LM, Davies GJ. 
The structure of the feruloyl esterase module of xylanase 10B from Clostridium thermocellum provides insights into substrate recognition. Structure. 2001;9(12):1183\u201390.","journal-title":"Structure"},{"issue":"2","key":"6286_CR69","doi-asserted-by":"publisher","first-page":"181","DOI":"10.1016\/j.chom.2023.12.014","volume":"32","author":"E Buzun","year":"2024","unstructured":"Buzun E, Hsu CY, Sejane K, Oles RE, Vasquez Ayala A, Loomis LR, et al. A bacterial sialidase mediates early-life colonization by a pioneering gut commensal. Cell Host Microbe. 2024;32(2):181-190.e9.","journal-title":"Cell Host Microbe"},{"issue":"3","key":"6286_CR70","doi-asserted-by":"publisher","first-page":"382","DOI":"10.1016\/j.chom.2014.02.005","volume":"15","author":"D Gevers","year":"2014","unstructured":"Gevers D, Kugathasan S, Denson LA, V\u00e1zquez-Baeza Y, Van Treuren W, Ren B, et al. The treatment-naive microbiome in new-onset Crohn\u2019s disease. Cell Host Microbe. 2014;15(3):382\u201392.","journal-title":"Cell Host Microbe"},{"issue":"7758","key":"6286_CR71","doi-asserted-by":"publisher","first-page":"655","DOI":"10.1038\/s41586-019-1237-9","volume":"569","author":"J Lloyd-Price","year":"2019","unstructured":"Lloyd-Price J, Arze C, Ananthakrishnan AN, Schirmer M, Avila-Pacheco J, Poon TW, et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 
2019;569(7758):655\u201362.","journal-title":"Nature"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-025-06286-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-025-06286-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-025-06286-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T17:05:34Z","timestamp":1764176734000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-025-06286-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,26]]},"references-count":71,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["6286"],"URL":"https:\/\/doi.org\/10.1186\/s12859-025-06286-y","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,26]]},"assertion":[{"value":"9 September 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 September 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 November 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not 
applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"R.J.X. is a co-founder of Jnana Therapeutics and Convergence Bio, scientific advisory board member at Nestl\u00e9, Magnet BioMedicine, and Arena BioWorks, and board director at MoonLake Immunotherapeutics. D.R.P. is an employee of Novozymes A\/S, Denmark. These organizations had no roles in this study.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"285"}}