{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T04:23:18Z","timestamp":1772252598122,"version":"3.50.1"},"reference-count":42,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2016,12,13]],"date-time":"2016-12-13T00:00:00Z","timestamp":1481587200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01GM100387"],"award-info":[{"award-number":["R01GM100387"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Human microbiome data from genomic sequencing technologies is fast accumulating, giving us insights into bacterial taxa that contribute to health and disease. The predictive modeling of such microbiota count data for the classification of human infection from parasitic worms, such as helminths, can help in the detection and management across global populations. Real-world datasets of microbiome experiments are typically sparse, containing hundreds of measurements for bacterial species, of which only a few are detected in the bio-specimens that are analyzed. This feature of microbiome data produces the challenge of needing more observations for accurate predictive modeling and has been dealt with previously, using different methods of feature reduction. To our knowledge, integrative methods, such as transfer learning, have not yet been explored in the microbiome domain as a way to deal with data sparsity by incorporating knowledge of different but related datasets. One way of incorporating this knowledge is by using a meaningful mapping among features of these datasets. 
In this paper, we claim that such a mapping exists among the members of each cluster, where clusters are formed from the phylogenetic dependency among taxa and their association with the phenotype. We validate our claim by showing that models incorporating associations in such a grouped feature space result in no performance deterioration for the given classification task. We test our hypothesis using classification models that detect helminth infection in the microbiota of human fecal samples obtained from Indonesia and Liberia. In our experiments, we first learn binary classifiers for helminth infection detection using Naive Bayes, Support Vector Machines, Multilayer Perceptrons, and Random Forest methods. In the next step, we add taxonomic modeling by using the SMART-scan module to group the data, and learn classifiers with the same four methods to test the validity of the resulting groupings. We observed performance improvements of 6% to 23% in the Area Under the receiver operating characteristic (ROC) Curve (AUC) and 7% to 26% in Balanced Accuracy (Bacc), over 10 runs of 10-fold cross-validation. These results show that grouping our microbiota data by phylogenetic dependency yields a noticeable improvement in classification performance for helminth infection detection. 
These promising results from this feasibility study demonstrate that methods such as SMART-scan can be utilized in the future for knowledge transfer from different but related microbiome datasets by phylogenetically-related functional mapping, to enable novel integrative biomarker discovery.<\/jats:p>","DOI":"10.3390\/data1030019","type":"journal-article","created":{"date-parts":[[2016,12,13]],"date-time":"2016-12-13T10:15:52Z","timestamp":1481624152000},"page":"19","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Application of Taxonomic Modeling to Microbiota Data Mining for Detection of Helminth Infection in Global Populations"],"prefix":"10.3390","volume":"1","author":[{"given":"Mahbaneh","family":"Eshaghzadeh Torbati","sequence":"first","affiliation":[{"name":"Department of Computer Science, University of Pittsburgh, 6135 Sennott Square, 210 S Bouquet St, Pittsburgh, PA 15260-9161, USA"}]},{"given":"Makedonka","family":"Mitreva","sequence":"additional","affiliation":[{"name":"Department of Medicine, Washington University School of Medicine, 660 S Euclid Ave, St. Louis, MO 63110, USA"}]},{"given":"Vanathi","family":"Gopalakrishnan","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, University of Pittsburgh, 5607 Baum Boulevard, Suite 500, Pittsburgh, PA 15206-3701, USA"}]}],"member":"1968","published-online":{"date-parts":[[2016,12,13]]},"reference":[{"key":"ref_1","unstructured":"World Health Organization (2004). Estimated Incidence, Prevalence and TB Mortality, WHO. Available online: http:\/\/www. who. int\/mediacentre\/factsheets\/fs104\/en."},{"key":"ref_2","first-page":"ftv020","article-title":"Fine-scale analysis of 16S rRNA sequences reveals a high level of taxonomic diversity among vaginal Atopobium spp.","volume":"73","author":"Krishnan","year":"2015","journal-title":"Pathog. 
Dis."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1691","DOI":"10.1111\/jam.13111","article-title":"Study of duodenal bacterial communities by 16s rrna gene analysis in adults with active celiac disease versus non-celiac disease controls","volume":"120","author":"Nistal","year":"2016","journal-title":"J. Appl. Microbiol."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1141","DOI":"10.1007\/s00285-012-0586-x","article-title":"Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens\u2019 theorem","volume":"67","author":"Wendl","year":"2013","journal-title":"J. Math. Biol."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Jumpstart Consortium Human Microbiome Project Data Generation Working Group (2012). Evaluation of 16S rDNA-based community profiling for human microbiome research. PLoS ONE, 7.","DOI":"10.1371\/journal.pone.0039315"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1111\/j.1574-6941.2003.tb01040.x","article-title":"Using ecological diversity measures with bacterial communities","volume":"43","author":"Hill","year":"2003","journal-title":"FEMS Microbiol. Ecol."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1607","DOI":"10.1093\/bioinformatics\/btu855","article-title":"Selection of models for the analysis of risk-factor trees: Leveraging biological knowledge to mine large sets of risk factors with application to microbiome data","volume":"31","author":"Zhang","year":"2015","journal-title":"Bioinformatics"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"White, J.R. (2009). Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput. 
Biol., 5.","DOI":"10.1371\/journal.pcbi.1000352"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/gb-2011-12-6-r60","article-title":"Metagenomic biomarker discovery and explanation","volume":"12","author":"Segata","year":"2011","journal-title":"Genome Biol."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Holmes, I., Harris, K., and Quince, C. (2012). Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLoS ONE, 7.","DOI":"10.1371\/journal.pone.0030126"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"La Rosa, P.S., Brooks, J.P., Deych, E., Boone, E.L., Edwards, D.J., Wang, Q., Sodergren, E., Weinstock, G., and Shannon, W.D. (2012). Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS ONE, 7.","DOI":"10.1371\/journal.pone.0052078"},{"key":"ref_12","first-page":"32","article-title":"A new method for nonparametric multivariate analysis of variance","volume":"26","author":"Anderson","year":"2001","journal-title":"Austral Ecol."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"2106","DOI":"10.1093\/bioinformatics\/bts342","article-title":"Associating microbiome composition with environmental covariates using generalized UniFrac distances","volume":"28","author":"Chen","year":"2012","journal-title":"Bioinformatics"},{"key":"ref_14","first-page":"209","article-title":"The detection of disease clustering and a generalized regression approach","volume":"27","author":"Mantel","year":"1976","journal-title":"Cancer Res."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"8228","DOI":"10.1128\/AEM.71.12.8228-8235.2005","article-title":"UniFrac: A new phylogenetic method for comparing microbial communities","volume":"71","author":"Lozupone","year":"2005","journal-title":"Appl. Environ. Microbiol."},{"key":"ref_16","unstructured":"Tobias, R.D. (1995, January 2). An introduction to partial least squares regression. 
Proceedings of the Twentieth Annual SAS Users Group International Conference, Orlando, FL, USA."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"166","DOI":"10.1002\/cem.785","article-title":"Partial least squares for discrimination","volume":"17","author":"Barker","year":"2003","journal-title":"J. Chemom."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1093\/bioinformatics\/18.1.39","article-title":"Tumor classification by partial least squares using microarray gene expression data","volume":"18","author":"Nguyen","year":"2002","journal-title":"Bioinformatics"},{"key":"ref_19","first-page":"1544","article-title":"A sparse PLS for variable selection when integrating omics data","volume":"7","author":"Rossouw","year":"2008","journal-title":"Stat. Appl. Genet. Mol. Biol."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"L\u00ea Cao, K.A., Martin, P.G., Robert-Grani\u00e9, C., and Besse, P. (2009). Sparse canonical methods for biological data integration: Application to a cross-platform study. BMC Bioinform., 10.","DOI":"10.1186\/1471-2105-10-34"},{"key":"ref_21","first-page":"1","article-title":"Antibiotic perturbation of the murine gut microbiome enhances the adiposity, insulin resistance, and liver disease associated with high-fat diet","volume":"8","author":"Mahana","year":"2011","journal-title":"Genome Med."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"L\u00ea Cao, K.A., Boitard, S., and Besse, P. (2011). Sparse PLS discriminant analysis: Biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinform., 12.","DOI":"10.1186\/1471-2105-12-253"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"L\u00ea Cao, K.A., Costello, M.E., Lakis, V.A., Bartolo, F., Chua, X.Y., Brazeilles, R., and Rondeau, P. (2016). mixMC: A multivariate statistical framework to gain insight into Microbial Communities. bioRxiv, 044206. 
doi:http:\/\/dx.doi.org\/10.1101\/044206.","DOI":"10.1101\/044206"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Sun, Y., Cai, Y., Mai, V., Farmerie, W., Yu, F., Li, J., and Goodison, S. (2011). Advanced computational algorithms for microbial community analysis using massive 16S rRNA sequence data. Nucleic Acids Res.","DOI":"10.1093\/nar\/gkq872"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1111\/j.1467-9868.2011.00771.x","article-title":"Regression shrinkage and selection via the lasso: A retrospective","volume":"73","author":"Tibshirani","year":"2011","journal-title":"J. R. Stat. Soc. Ser. B (Stat. Methodol.)"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"14","DOI":"10.1002\/widm.8","article-title":"Classification and regression trees","volume":"1","author":"Loh","year":"2011","journal-title":"Wiley Interdiscip. Rev. Data Min. Know. Dis."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Ogoe, H.A., Visweswaran, S., Lu, X., and Gopalakrishnan, V. (2015). Knowledge transfer via classification rules using functional mapping for integrative modeling of gene expression data. BMC Bioinform., 16.","DOI":"10.1186\/s12859-015-0643-8"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s40168-015-0102-9","article-title":"The effect of dietary resistant starch type 2 on the microbiota and markers of gut inflammation in rural Malawi children","volume":"3","author":"Ordiz","year":"2015","journal-title":"Microbiome"},{"key":"ref_29","unstructured":"Dietterich, T., Bishop, C., Heckerman, D., Jordan, M., and Kearns, M. (2010). 
Introduction to Machine Learning, The MIT Press."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"633","DOI":"10.1093\/nar\/gkt1244","article-title":"Ribosomal Database Project: Data and tools for high throughput rRNA analysis","volume":"42","author":"Cole","year":"2013","journal-title":"Nucleic Acids Res."},{"key":"ref_31","unstructured":"Bellman, R.E. (1957). Dynamic Programming, Princeton University Press."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"10312","DOI":"10.1038\/srep10312","article-title":"Application of high-dimensional feature selection: Evaluation for genomic prediction in man","volume":"5","author":"Bermingham","year":"2012","journal-title":"Sci. Rep."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"2323","DOI":"10.1126\/science.290.5500.2323","article-title":"Nonlinear dimensionality reduction by locally linear embedding","volume":"290","author":"Roweis","year":"2000","journal-title":"Science"},{"key":"ref_34","first-page":"41","article-title":"An empirical study of the naive Bayes classifier","volume":"3","author":"Rish","year":"2001","journal-title":"IJCAI"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1023\/A:1009715923555","article-title":"A tutorial on support vector machines for pattern recognition","volume":"2","author":"Burges","year":"1998","journal-title":"Data Min. Know. Dis."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"332","DOI":"10.7763\/IJCTE.2011.V3.328","article-title":"Behaviour analysis of multilayer perceptrons with multiple hidden neurons and hidden layers","volume":"3","author":"Panchal","year":"2011","journal-title":"Int. J. Comput. Theory Eng."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. 
Learn."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1007\/BF00116251","article-title":"Induction of decision trees","volume":"1","author":"Quinlan","year":"1986","journal-title":"Mach. Learn."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"163","DOI":"10.1093\/biomet\/70.1.163","article-title":"Information gain and a general measure of correlation","volume":"70","author":"Kent","year":"1983","journal-title":"Biometrika"},{"key":"ref_40","unstructured":"Pompili, M., and Chavez, S. (1995). Artificial Intelligence: A Modern Approach, Prentice Hall."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1145\/1656274.1656278","article-title":"The WEKA data mining software: An update","volume":"11","author":"Hall","year":"2009","journal-title":"ACM SIGKDD Explor."},{"key":"ref_42","unstructured":"Zhang, Q. Implemented Code for SMARTscan, 2015. Available online: https:\/\/dsgweb.wustl.edu\/qunyuan\/software\/smartscan\/."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/1\/3\/19\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T19:28:27Z","timestamp":1760210907000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/1\/3\/19"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,12,13]]},"references-count":42,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2016,12]]}},"alternative-id":["data1030019"],"URL":"https:\/\/doi.org\/10.3390\/data1030019","relation":{"has-preprint":[{"id-type":"doi","id":"10.20944\/preprints201612.0031.v1","asserted-by":"object"}]},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,12,13]]}}}