{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T15:51:54Z","timestamp":1759333914579,"version":"build-2065373602"},"reference-count":35,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,9,30]],"date-time":"2025-09-30T00:00:00Z","timestamp":1759190400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004587","name":"Instituto de Salud Carlos III","doi-asserted-by":"publisher","award":["PI24\/00222"],"award-info":[{"award-number":["PI24\/00222"]}],"id":[{"id":"10.13039\/501100004587","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100008425","name":"Conseller\u00eda de Cultura, Educaci\u00f3n e Ordenaci\u00f3n Universitaria, Xunta de Galicia","doi-asserted-by":"publisher","award":["ED431G-2019\/04 GRC2021\/48 GPC2020\/27 ED481A-2021 IN606B-2023\/005"],"award-info":[{"award-number":["ED431G-2019\/04 GRC2021\/48 GPC2020\/27 ED481A-2021 IN606B-2023\/005"]}],"id":[{"id":"10.13039\/501100008425","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100008530","name":"European Regional Development Fund","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100008530","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Bioinform."],"abstract":"<jats:p>One area of bioinformatics that is currently attracting particular interest is the classification of polymicrobial diseases using machine learning (ML), with data obtained from high-throughput amplicon sequencing of the 16S rRNA gene in human microbiome samples. The microbial dysbiosis underlying these types of diseases is particularly challenging to classify, as the data is highly dimensional, with potentially hundreds or even thousands of predictive features. 
In addition, the imbalance in the composition of the microbial community is highly heterogeneous across samples. In this paper, we propose EPheClass, a curated pipeline for binary phenotype classification based on a count table of 16S rRNA gene amplicons, which can be applied to any microbiome. To evaluate our proposal, raw 16S rRNA gene sequences from samples of healthy and periodontally affected oral microbiomes that met certain quality criteria were downloaded from public repositories. In total, 2,581 samples were analysed. In our approach, we first reduced the dimensionality of the data using feature selection methods. After tuning and evaluating different ML models and ensembles created using Dynamic Ensemble Selection (DES) techniques, we found that all DES models performed similarly and were more robust than individual models. Although the margin over other methods was minimal, DES-P achieved the highest AUC and was therefore selected as the representative technique in our analysis. When diagnosing periodontal disease with saliva samples, it achieved, with only 13 features, an F1 score of 0.913, a precision of 0.881, a recall (sensitivity) of 0.947, an accuracy of 0.929, and an AUC of 0.973. In addition, we used EPheClass to diagnose inflammatory bowel disease (IBD) and obtained better results than other works in the literature using the same dataset. We also evaluated its effectiveness in detecting antibiotic exposure, where it again demonstrated competitive results. This highlights the importance and generalisability of our classification approach, which is applicable to different phenotypes, study niches, and sample types. 
The code is available at <jats:ext-link>https:\/\/gitlab.citius.usc.es\/lara.vazquez\/epheclass<\/jats:ext-link>.<\/jats:p>","DOI":"10.3389\/fbinf.2025.1514880","type":"journal-article","created":{"date-parts":[[2025,9,30]],"date-time":"2025-09-30T05:27:48Z","timestamp":1759210068000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["EPheClass: ensemble-based phenotype classifier from 16S rRNA gene sequences"],"prefix":"10.3389","volume":"5","author":[{"given":"Lara","family":"V\u00e1zquez-Gonz\u00e1lez","sequence":"first","affiliation":[]},{"given":"Carlos","family":"Pe\u00f1a-Reyes","sequence":"additional","affiliation":[]},{"given":"Alba","family":"Regueira-Iglesias","sequence":"additional","affiliation":[]},{"given":"Carlos","family":"Balsa-Castro","sequence":"additional","affiliation":[]},{"given":"Inmaculada","family":"Tom\u00e1s","sequence":"additional","affiliation":[]},{"given":"Mar\u00eda J.","family":"Carreira","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,9,30]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1111\/j.2517-6161.1982.tb01195.x","article-title":"The statistical analysis of compositional data","volume":"44","author":"Aitchison","year":"1982","journal-title":"J. R. Stat. Soc. Ser. 
B Methodol."},{"key":"B2","doi-asserted-by":"publisher","first-page":"i32","DOI":"10.1093\/bioinformatics\/bty296","article-title":"MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples","volume":"34","author":"Asgari","year":"2018","journal-title":"Bioinformatics"},{"key":"B3","doi-asserted-by":"publisher","first-page":"2639","DOI":"10.1038\/ismej.2017.119","article-title":"Exact sequence variants should replace operational taxonomic units in marker-gene data analysis","volume":"11","author":"Callahan","year":"2017","journal-title":"ISME J."},{"key":"B4","doi-asserted-by":"publisher","first-page":"331","DOI":"10.1177\/00220345211035775","article-title":"SMDI: an index for measuring subgingival microbial dysbiosis","volume":"101","author":"Chen","year":"2021","journal-title":"J. Dent. Res."},{"key":"B5","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1802.04967","article-title":"Deslib: a dynamic ensemble selection library in python","author":"Cruz","year":"","journal-title":"arXiv"},{"key":"B6","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1016\/j.inffus.2017.09.010","article-title":"Dynamic classifier selection: recent advances and perspectives","volume":"41","author":"Cruz","year":"","journal-title":"Inf. 
Fusion"},{"key":"B7","doi-asserted-by":"publisher","first-page":"2460","DOI":"10.1093\/bioinformatics\/btq461","article-title":"Search and clustering orders of magnitude faster than BLAST","volume":"26","author":"Edgar","year":"2010","journal-title":"Bioinformatics"},{"key":"B8","doi-asserted-by":"publisher","first-page":"3476","DOI":"10.1093\/bioinformatics\/btv401","article-title":"Error filtering, pair assembly and error correction for next-generation sequencing reads","volume":"31","author":"Edgar","year":"2015","journal-title":"Bioinformatics"},{"key":"B9","doi-asserted-by":"publisher","first-page":"382","DOI":"10.1016\/j.chom.2014.02.005","article-title":"The treatment-naive microbiome in new-onset Crohn\u2019s disease","volume":"15","author":"Gevers","year":"2014","journal-title":"Cell Host Microbe"},{"key":"B10","doi-asserted-by":"publisher","first-page":"2224","DOI":"10.3389\/fmicb.2017.02224","article-title":"Microbiome datasets are compositional: and this is not optional","volume":"8","author":"Gloor","year":"2017","journal-title":"Front. 
Microbiol."},{"key":"B11","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2205.09906","article-title":"Data augmentation for compositional data: advancing predictive models of the microbiome","author":"Gordon-Rodriguez","year":"2022","journal-title":"arXiv"},{"key":"B12","doi-asserted-by":"publisher","first-page":"1155","DOI":"10.1111\/biom.13481","article-title":"Utilizing stability criteria in choosing feature selection methods yields reproducible results in microbiome data","volume":"78","author":"Jiang","year":"2021","journal-title":"Biometrics"},{"key":"B13","doi-asserted-by":"crossref","DOI":"10.1007\/978-1-4614-6849-3","volume-title":"Applied predictive modeling","author":"Kuhn","year":"2013"},{"key":"B14","doi-asserted-by":"publisher","first-page":"216","DOI":"10.3389\/fcimb.2019.00216","article-title":"Identification of salivary microbiota and its association with host inflammatory mediators in periodontitis","volume":"9","author":"Lundmark","year":"2019","journal-title":"Front. Cell. Infect. Microbiol."},{"key":"B15","doi-asserted-by":"publisher","first-page":"W1","DOI":"10.7326\/m14-0698","article-title":"Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (tripod): explanation and elaboration","volume":"162","author":"Moons","year":"2015","journal-title":"Ann. Intern. Med."},{"key":"B16","doi-asserted-by":"publisher","first-page":"669","DOI":"10.3390\/life12050669","article-title":"Megad: deep learning for rapid and accurate disease status prediction of metagenomic samples","volume":"12","author":"Mreyoud","year":"2022","journal-title":"Life"},{"key":"B17","doi-asserted-by":"publisher","first-page":"1549","DOI":"10.3390\/jcm9051549","article-title":"Identification of potential oral microbial biomarkers for the diagnosis of periodontitis","volume":"9","author":"Na","year":"2020","journal-title":"J. Clin. 
Med."},{"key":"B18","doi-asserted-by":"publisher","first-page":"4995","DOI":"10.1007\/s00784-022-04468-z","article-title":"Identification of the specific microbial community compositions in saliva associated with periodontitis during pregnancy","volume":"26","author":"Narita","year":"2022","journal-title":"Clin. Oral Investig."},{"key":"B19","doi-asserted-by":"publisher","first-page":"1165295","DOI":"10.3389\/fcimb.2023.1165295","article-title":"The effect of low-abundance otu filtering methods on the reliability and variability of microbial composition assessed by 16s rrna amplicon sequencing","volume":"13","author":"Nikodemova","year":"2023","journal-title":"Front. Cell. Infect. Microbiol."},{"key":"B20","doi-asserted-by":"publisher","first-page":"6026","DOI":"10.1038\/s41598-020-63159-5","article-title":"Deepmicro: deep representation learning for disease prediction based on microbiome data","volume":"10","author":"Oh","year":"2020","journal-title":"Sci. Rep."},{"key":"B21","first-page":"2825","article-title":"Scikit-learn: machine learning in python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"B22","doi-asserted-by":"publisher","first-page":"giz107","DOI":"10.1093\/gigascience\/giz107","article-title":"A field guide for the compositional analysis of any-omics data","volume":"8","author":"Quinn","year":"2019","journal-title":"Gigascience"},{"key":"B23","doi-asserted-by":"publisher","first-page":"99","DOI":"10.1016\/j.micres.2010.02.003","article-title":"Microbial phylogeny and diversity: small subunit ribosomal RNA sequence analysis and beyond","volume":"166","author":"Rajendhran","year":"2011","journal-title":"Microbiol. 
Res."},{"key":"B24","doi-asserted-by":"publisher","first-page":"929","DOI":"10.1038\/s41598-020-79875-x","article-title":"Relationship between dental and periodontal health status and the salivary microbiome: bacterial diversity, co-occurrence networks and predictive models","volume":"11","author":"Relvas","year":"2021","journal-title":"Sci. Rep."},{"key":"B25","doi-asserted-by":"publisher","first-page":"7537","DOI":"10.1128\/AEM.01541-09","article-title":"Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities","volume":"75","author":"Schloss","year":"2009","journal-title":"Appl. Environ. Microbiol."},{"key":"B26","doi-asserted-by":"crossref","DOI":"10.1007\/978-0-387-77244-8","volume-title":"Clinical prediction models","author":"Steyerberg","year":"2009"},{"key":"B27","doi-asserted-by":"publisher","first-page":"126","DOI":"10.1186\/s12859-023-05251-x","article-title":"Automatic disease prediction from human gut metagenomic data using boosting GraphSAGE","volume":"24","author":"Syama","year":"2023","journal-title":"BMC Bioinforma."},{"key":"B28","doi-asserted-by":"publisher","first-page":"281","DOI":"10.1186\/s12911-019-1004-8","article-title":"Comparing different supervised machine learning algorithms for disease prediction","volume":"19","author":"Uddin","year":"2019","journal-title":"BMC Med. Inf. Decis. 
Mak."},{"key":"B29","doi-asserted-by":"publisher","first-page":"2835","DOI":"10.3390\/diagnostics13172835","article-title":"Crohn\u2019s disease prediction using sequence based machine learning analysis of human microbiome","volume":"13","author":"Unal","year":"2023","journal-title":"Diagnostics"},{"key":"B30","doi-asserted-by":"publisher","first-page":"557","DOI":"10.1007\/978-3-031-36616-1_44","article-title":"An ensemble-based phenotype classifier to diagnose crohn\u2019s disease from 16s rRNA gene sequences","volume":"14062","author":"V\u00e1zquez-Gonz\u00e1lez","year":"2023","journal-title":"Proc. IbPRIA 2023. Lect. Notes Comput. Sci."},{"key":"B31","doi-asserted-by":"publisher","first-page":"835","DOI":"10.1093\/biomet\/83.4.835","article-title":"A distribution-free procedure for comparing receiver operating characteristic curves for a paired experiment","volume":"83","author":"Venkatraman","year":"1996","journal-title":"Biometrika"},{"key":"B32","doi-asserted-by":"publisher","first-page":"1134","DOI":"10.1111\/j.0006-341x.2000.01134.x","article-title":"A permutation test to compare receiver operating characteristic curves","volume":"56","author":"Venkatraman","year":"2000","journal-title":"Biometrics"},{"key":"B33","doi-asserted-by":"publisher","first-page":"343ra81","DOI":"10.1126\/scitranslmed.aad0917","article-title":"Natural history of the infant gut microbiome and impact of antibiotic treatment on bacterial strain diversity and stability","volume":"8","author":"Yassour","year":"2016","journal-title":"Sci. Transl. 
Med."},{"key":"B34","first-page":"1","article-title":"Popular deep learning algorithms for disease prediction","volume-title":"A review","author":"Yu","year":"2022"},{"key":"B35","doi-asserted-by":"publisher","first-page":"e1009345","DOI":"10.1371\/journal.pcbi.1009345","article-title":"Learning, visualizing and exploring 16S rRNA structure using an attention-based deep neural network","volume":"17","author":"Zhao","year":"2021","journal-title":"PLOS Comput. Biol."}],"container-title":["Frontiers in Bioinformatics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2025.1514880\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,30]],"date-time":"2025-09-30T05:27:50Z","timestamp":1759210070000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fbinf.2025.1514880\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,30]]},"references-count":35,"alternative-id":["10.3389\/fbinf.2025.1514880"],"URL":"https:\/\/doi.org\/10.3389\/fbinf.2025.1514880","relation":{},"ISSN":["2673-7647"],"issn-type":[{"type":"electronic","value":"2673-7647"}],"subject":[],"published":{"date-parts":[[2025,9,30]]},"article-number":"1514880"}}