{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,23]],"date-time":"2026-04-23T19:47:27Z","timestamp":1776973647428,"version":"3.51.4"},"reference-count":32,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2025,6,26]],"date-time":"2025-06-26T00:00:00Z","timestamp":1750896000000},"content-version":"vor","delay-in-days":1,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["#1933925"],"award-info":[{"award-number":["#1933925"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["#1937255"],"award-info":[{"award-number":["#1937255"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["#1937232"],"award-info":[{"award-number":["#1937232"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["#1933925"],"award-info":[{"award-number":["#1933925"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>High-throughput sequencing (HTS) is a modern sequencing technology used to profile microbiomes by sequencing thousands of short genomic fragments from the microorganisms within a given sample. This technology presents a unique opportunity for artificial intelligence to comprehend the underlying functional relationships of microbial communities. However, due to the unstructured nature of HTS data, nearly all computational models are limited to processing DNA sequences individually. This limitation causes them to miss out on key interactions between microorganisms, significantly hindering our understanding of how these interactions influence the microbial communities as a whole. Furthermore, most computational methods rely on post-processing of samples which could inadvertently introduce unintentional protocol-specific bias.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>Addressing these concerns, we present SetBERT, a robust pre-training methodology for creating generalized deep learning models for processing HTS data to produce contextualized embeddings and be fine-tuned for downstream tasks with explainable predictions. By leveraging sequence interactions, we show that SetBERT significantly outperforms other models in taxonomic classification with genus-level classification accuracy of 95%. Furthermore, we demonstrate that SetBERT is able to accurately explain its predictions autonomously by confirming the biological-relevance of taxa identified by the model.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>All source code is available at https:\/\/github.com\/DLii-Research\/setbert. SetBERT may be used through the q2-deepdna QIIME 2 plugin whose source code is available at https:\/\/github.com\/DLii-Research\/q2-deepdna.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf370","type":"journal-article","created":{"date-parts":[[2025,6,26]],"date-time":"2025-06-26T00:43:13Z","timestamp":1750898593000},"source":"Crossref","is-referenced-by-count":1,"title":["SetBERT: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing"],"prefix":"10.1093","volume":"41","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-5851-7964","authenticated-orcid":false,"suffix":"II","given":"David W","family":"Ludwig","sequence":"first","affiliation":[{"name":"Department of Computer Science, Middle Tennessee State University , Murfreesboro, TN 37132,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Christopher","family":"Guptil","sequence":"additional","affiliation":[{"name":"Department of Mathematics and Computer Science, Miami University , Oxford, OH 45056,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-7271-5520","authenticated-orcid":false,"given":"Nicholas R","family":"Alexander","sequence":"additional","affiliation":[{"name":"Department of Biology, Middle Tennessee State University , Murfreesboro, TN 37132,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1893-1888","authenticated-orcid":false,"given":"Kateryna","family":"Zhalnina","sequence":"additional","affiliation":[{"name":"Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory , Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1025-9310","authenticated-orcid":false,"given":"Edi M -L","family":"Wipf","sequence":"additional","affiliation":[{"name":"Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory , Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6652-012X","authenticated-orcid":false,"given":"Albina","family":"Khasanova","sequence":"additional","affiliation":[{"name":"Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory , Berkeley, CA 94720,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1653-0009","authenticated-orcid":false,"given":"Nicholas A","family":"Barber","sequence":"additional","affiliation":[{"name":"Department of Biology, San Diego State University , San Diego, CA 92182,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7734-5179","authenticated-orcid":false,"given":"Wesley","family":"Swingley","sequence":"additional","affiliation":[{"name":"Department of Biological Sciences, Northern Illinois University , DeKalb, IL 60115,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3119-8809","authenticated-orcid":false,"given":"Donald M","family":"Walker","sequence":"additional","affiliation":[{"name":"Department of Biology, Middle Tennessee State University , Murfreesboro, TN 37132,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4619-6083","authenticated-orcid":false,"given":"Joshua L","family":"Phillips","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Middle Tennessee State University , Murfreesboro, TN 37132,","place":["United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2025,6,25]]},"reference":[{"key":"2025071019563120200_btaf370-B1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1128\/MMBR.00019-15","article-title":"Taxonomy, physiology, and natural products of actinobacteria","volume":"80","author":"Barka","year":"2016","journal-title":"Microbiol Mol Biol Rev"},{"key":"2025071019563120200_btaf370-B2","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1186\/s40168-018-0470-z","article-title":"Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2\u2019s q2-feature-classifier plugin","volume":"6","author":"Bokulich","year":"2018","journal-title":"Microbiome"},{"key":"2025071019563120200_btaf370-B3","author":"Cornman","year":"2024"},{"key":"2025071019563120200_btaf370-B4","first-page":"4171","author":"Devlin"},{"key":"2025071019563120200_btaf370-B5","author":"Dosovitskiy"},{"key":"2025071019563120200_btaf370-B6","doi-asserted-by":"crossref","first-page":"3111","DOI":"10.1038\/s41396-021-01027-4","article-title":"Open challenges for microbial network construction and analysis","volume":"15","author":"Faust","year":"2021","journal-title":"ISME J"},{"key":"2025071019563120200_btaf370-B7","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1186\/s40168-019-0633-6","article-title":"CAMISIM: simulating metagenomes and microbial communities","volume":"7","author":"Fritz","year":"2019","journal-title":"Microbiome"},{"key":"2025071019563120200_btaf370-B8","author":"Hao"},{"key":"2025071019563120200_btaf370-B9","doi-asserted-by":"crossref","first-page":"2112","DOI":"10.1093\/bioinformatics\/btab083","article-title":"DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome","volume":"37","author":"Ji","year":"2021","journal-title":"Bioinformatics"},{"key":"2025071019563120200_btaf370-B10","doi-asserted-by":"crossref","first-page":"105095","DOI":"10.1016\/j.biosystems.2023.105095","article-title":"SetQuence & SetOmic: deep set transformers for whole genome and exome tumour analysis","volume":"235","author":"Jurenaite","year":"2024","journal-title":"Biosystems"},{"key":"2025071019563120200_btaf370-B11","doi-asserted-by":"crossref","first-page":"4643","DOI":"10.1038\/s41467-019-12669-6","article-title":"Species abundance information improves sequence taxonomy classification accuracy","volume":"10","author":"Kaehler","year":"2019","journal-title":"Nat Commun"},{"key":"2025071019563120200_btaf370-B12","doi-asserted-by":"crossref","first-page":"5112","DOI":"10.1128\/AEM.01043-13","article-title":"Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq illumina sequencing platform","volume":"79","author":"Kozich","year":"2013","journal-title":"Appl Environ Microbiol"},{"key":"2025071019563120200_btaf370-B13","first-page":"3744","author":"Lee"},{"key":"2025071019563120200_btaf370-B14","doi-asserted-by":"crossref","first-page":"836","DOI":"10.1038\/s41467-022-28448-9","article-title":"Rhizosphere bacteriome structure and functions","volume":"13","author":"Ling","year":"2022","journal-title":"Nat Commun"},{"key":"2025071019563120200_btaf370-B15","doi-asserted-by":"crossref","first-page":"2687","DOI":"10.1016\/j.csbj.2021.05.001","article-title":"Network analysis methods for studying microbial communities: a mini review","volume":"19","author":"Matchado","year":"2021","journal-title":"Comput Struct Biotechnol J"},{"key":"2025071019563120200_btaf370-B16","doi-asserted-by":"crossref","first-page":"634","DOI":"10.1111\/1574-6976.12028","article-title":"The rhizosphere microbiome: significance of plant beneficial, plant pathogenic, and human pathogenic microorganisms","volume":"37","author":"Mendes","year":"2013","journal-title":"FEMS Microbiol Rev"},{"key":"2025071019563120200_btaf370-B17","doi-asserted-by":"publisher","author":"Mock","DOI":"10.1073\/pnas.2122636119"},{"key":"2025071019563120200_btaf370-B18","doi-asserted-by":"crossref","first-page":"165","DOI":"10.1186\/s13059-018-1554-6","article-title":"RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification","volume":"19","author":"Nasko","year":"2018","journal-title":"Genome Biol"},{"key":"2025071019563120200_btaf370-B19","author":"Nguyen"},{"key":"2025071019563120200_btaf370-B20","doi-asserted-by":"crossref","first-page":"7188","DOI":"10.1093\/nar\/gkm864","article-title":"SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB","volume":"35","author":"Pruesse","year":"2007","journal-title":"Nucleic Acids Res"},{"key":"2025071019563120200_btaf370-B21","doi-asserted-by":"crossref","first-page":"D590","DOI":"10.1093\/nar\/gks1219","article-title":"The SILVA ribosomal RNA gene database project: improved data processing and web-based tools","volume":"41","author":"Quast","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2025071019563120200_btaf370-B22","first-page":"1","volume-title":"2023 IEEE Signal Processing in Medicine and Biology Symposium (SPMB)","author":"Refahi","year":"2023"},{"key":"2025071019563120200_btaf370-B23","author":"Refahi","year":"2024"},{"key":"2025071019563120200_btaf370-B24","doi-asserted-by":"crossref","first-page":"e1009581","DOI":"10.1371\/journal.pcbi.1009581","article-title":"RESCRIPt: reproducible sequence taxonomy reference database management","volume":"17","author":"Robeson","year":"2021","journal-title":"PLoS Comput Biol"},{"key":"2025071019563120200_btaf370-B25","author":"Ruffolo"},{"key":"2025071019563120200_btaf370-B26","doi-asserted-by":"publisher","first-page":"e00746","DOI":"10.1128\/mbio.00746-15","article-title":"Successional trajectories of rhizosphere bacterial communities over consecutive seasons","volume":"6","author":"Shi","year":"2015","journal-title":"mBio"},{"key":"2025071019563120200_btaf370-B27","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2025071019563120200_btaf370-B28","doi-asserted-by":"crossref","first-page":"3017","DOI":"10.1093\/nar\/gkad055","article-title":"DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis","volume":"51","author":"Wang","year":"2023","journal-title":"Nucleic Acids Res"},{"key":"2025071019563120200_btaf370-B29","doi-asserted-by":"crossref","first-page":"470","DOI":"10.1038\/s41564-018-0129-3","article-title":"Dynamic root exudate chemistry and microbial substrate preferences drive patterns in rhizosphere microbial community assembly","volume":"3","author":"Zhalnina","year":"2018","journal-title":"Nat Microbiol"},{"key":"2025071019563120200_btaf370-B30","author":"Zhalnina","year":"2022"},{"key":"2025071019563120200_btaf370-B31","author":"Zhou"},{"key":"2025071019563120200_btaf370-B32","author":"Zhou","year":"2024"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaf370\/63582081\/btaf370.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/7\/btaf370\/63582081\/btaf370.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/41\/7\/btaf370\/63582081\/btaf370.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,10]],"date-time":"2025-07-10T23:56:39Z","timestamp":1752191799000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btaf370\/8173948"}},"subtitle":[],"editor":[{"given":"Inanc","family":"Birol","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2025,6,25]]},"references-count":32,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2025,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf370","relation":{},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,7]]},"published":{"date-parts":[[2025,6,25]]},"article-number":"btaf370"}}