{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T06:37:55Z","timestamp":1774334275249,"version":"3.50.1"},"reference-count":38,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2024,5,22]],"date-time":"2024-05-22T00:00:00Z","timestamp":1716336000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000060","name":"National Institute of Allergy and Infectious Diseases","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000060","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Institutes of Health, Department of Health and Human Services","award":["U19AI110818"],"award-info":[{"award-number":["U19AI110818"]}]},{"DOI":"10.13039\/100013114","name":"Broad Institute","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100013114","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,6,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. 
Recently, transformer-based language models\u2014adopted from the natural language processing field\u2014have been used to obtain new representations of proteins, to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications on bacterial genomes.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. 
Using SAFPred to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>https:\/\/github.com\/AbeelLab\/safpred.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae328","type":"journal-article","created":{"date-parts":[[2024,5,22]],"date-time":"2024-05-22T16:30:53Z","timestamp":1716395453000},"source":"Crossref","is-referenced-by-count":6,"title":["SAFPred: synteny-aware gene function prediction for bacteria using protein embeddings"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8584-4736","authenticated-orcid":false,"given":"Aysun","family":"Urhan","sequence":"first","affiliation":[{"name":"Delft Bioinformatics Lab, Delft University of Technology Van Mourik , Delft XE 2628, The Netherlands"},{"name":"Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard , Cambridge, MA 02142, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-1447-2701","authenticated-orcid":false,"given":"Bianca-Maria","family":"Cosma","sequence":"additional","affiliation":[{"name":"Delft Bioinformatics Lab, Delft University of Technology Van Mourik , Delft XE 2628, The Netherlands"}]},{"given":"Ashlee M","family":"Earl","sequence":"additional","affiliation":[{"name":"Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard , Cambridge, MA 02142, United States"}]},{"given":"Abigail L","family":"Manson","sequence":"additional","affiliation":[{"name":"Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard , Cambridge, MA 02142, United 
States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7205-7431","authenticated-orcid":false,"given":"Thomas","family":"Abeel","sequence":"additional","affiliation":[{"name":"Delft Bioinformatics Lab, Delft University of Technology Van Mourik , Delft XE 2628, The Netherlands"},{"name":"Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard , Cambridge, MA 02142, United States"}]}],"member":"286","published-online":{"date-parts":[[2024,5,22]]},"reference":[{"key":"2024060402530657700_btae328-B1","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J Mol Biol"},{"key":"2024060402530657700_btae328-B2","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene ontology: tool for the unification of biology","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat Genet"},{"key":"2024060402530657700_btae328-B3","doi-asserted-by":"crossref","first-page":"e113","DOI":"10.1002\/cpz1.113","article-title":"Learned embeddings from deep learning to visualize and predict protein sets","volume":"1","author":"Dallago","year":"2021","journal-title":"Curr Protoc"},{"key":"2024060402530657700_btae328-B4","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s00239-002-2317-1","article-title":"Analysis of the cellular functions of escherichia coli operons and their conservation in bacillus subtilis","volume":"55","author":"de Daruvar","year":"2002","journal-title":"J Mol Evol"},{"key":"2024060402530657700_btae328-B5","doi-asserted-by":"crossref","first-page":"e1002195","DOI":"10.1371\/journal.pcbi.1002195","article-title":"Accelerated profile hmm searches","volume":"7","author":"Eddy","year":"2011","journal-title":"PLoS Comput 
Biol"},{"key":"2024060402530657700_btae328-B6","doi-asserted-by":"publisher","first-page":"7112","DOI":"10.1109\/TPAMI.2021.3095381","article-title":"ProtTrans: Toward understanding the language of life through self-supervised learning","volume":"44","author":"Elnaggar","year":"2022","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2024060402530657700_btae328-B7","doi-asserted-by":"crossref","first-page":"723","DOI":"10.1186\/s12859-019-3220-8","article-title":"Modeling aspects of the language of life through transfer-learning protein sequences","volume":"20","author":"Heinzinger","year":"2019","journal-title":"BMC Bioinformatics"},{"key":"2024060402530657700_btae328-B8","doi-asserted-by":"crossref","first-page":"lqac043","DOI":"10.1093\/nargab\/lqac043","article-title":"Contrastive learning on protein embeddings enlightens midnight zone","volume":"4","author":"Heinzinger","year":"2022","journal-title":"NAR Genom Bioinform"},{"key":"2024060402530657700_btae328-B9","doi-asserted-by":"crossref","first-page":"2606","DOI":"10.1038\/s41467-022-30070-8","article-title":"Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter","volume":"13","author":"Hoarfrost","year":"2022","journal-title":"Nat Commun"},{"key":"2024060402530657700_btae328-B10","doi-asserted-by":"crossref","first-page":"D309","DOI":"10.1093\/nar\/gky1085","article-title":"eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses","volume":"47","author":"Huerta-Cepas","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2024060402530657700_btae328-B11","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with 
AlphaFold","volume":"596","author":"Jumper","year":"2021","journal-title":"Nature"},{"key":"2024060402530657700_btae328-B12","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1093\/bioinformatics\/btz595","article-title":"DeepGOPlus: improved protein function prediction from sequence","volume":"36","author":"Kulmanov","year":"2019","journal-title":"Bioinformatics"},{"key":"2024060402530657700_btae328-B13","doi-asserted-by":"crossref","first-page":"849","DOI":"10.1016\/j.cell.2017.04.027","article-title":"Tracing the enterococci from paleozoic origins to the hospital","volume":"169","author":"Lebreton","year":"2017","journal-title":"Cell"},{"key":"2024060402530657700_btae328-B14","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.1093\/bioinformatics\/btl158","article-title":"CD-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences","volume":"22","author":"Li","year":"2006","journal-title":"Bioinformatics"},{"key":"2024060402530657700_btae328-B15","first-page":"119","article-title":"Gene function prediction with gene interaction networks: a context graph kernel approach","volume":"14","author":"Li","year":"2009","journal-title":"IEEE Trans Inf Technol Biomed"},{"key":"2024060402530657700_btae328-B16","doi-asserted-by":"crossref","first-page":"1123","DOI":"10.1126\/science.ade2574","article-title":"Evolutionary-scale prediction of atomic-level protein structure with a language model","volume":"379","author":"Lin","year":"2023","journal-title":"Science"},{"key":"2024060402530657700_btae328-B17","doi-asserted-by":"crossref","first-page":"1160","DOI":"10.1038\/s41598-020-80786-0","article-title":"Embeddings from deep learning transfer go annotations beyond homology","volume":"11","author":"Littmann","year":"2021","journal-title":"Sci Rep"},{"key":"2024060402530657700_btae328-B18","doi-asserted-by":"crossref","first-page":"i304","DOI":"10.1093\/bioinformatics\/bty262","article-title":"HFSP: high speed 
homology-driven function annotation of proteins","volume":"34","author":"Mahlich","year":"2018","journal-title":"Bioinformatics"},{"key":"2024060402530657700_btae328-B19","doi-asserted-by":"crossref","first-page":"10162","DOI":"10.1093\/nar\/gkad757","article-title":"Learning from the unknown: exploring the range of bacterial functionality","volume":"51","author":"Mahlich","year":"2023","journal-title":"Nucleic Acids Res"},{"key":"2024060402530657700_btae328-B20","doi-asserted-by":"crossref","first-page":"e0242723","DOI":"10.1371\/journal.pone.0242723","article-title":"A thorough analysis of the contribution of experimental, derived and sequence-based predicted protein-protein interactions for functional annotation of proteins","volume":"15","author":"Makrodimitris","year":"2020","journal-title":"PLoS One"},{"key":"2024060402530657700_btae328-B21","doi-asserted-by":"crossref","first-page":"D412","DOI":"10.1093\/nar\/gkaa913","article-title":"Pfam: the protein families database in 2021","volume":"49","author":"Mistry","year":"2021","journal-title":"Nucleic Acids Res"},{"key":"2024060402530657700_btae328-B22","doi-asserted-by":"crossref","first-page":"D213","DOI":"10.1093\/nar\/gku1243","article-title":"The interpro protein families database: the classification resource after 15 years","volume":"43","author":"Mitchell","year":"2015","journal-title":"Nucleic Acids Res"},{"key":"2024060402530657700_btae328-B23","doi-asserted-by":"crossref","first-page":"D552","DOI":"10.1093\/nar\/gkq1090","article-title":"ODB: a database for operon organizations, 2011 update","volume":"39","author":"Okuda","year":"2010","journal-title":"Nucleic Acids Res"},{"key":"2024060402530657700_btae328-B24","doi-asserted-by":"crossref","first-page":"D785","DOI":"10.1093\/nar\/gkab776","article-title":"GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based 
taxonomy","volume":"50","author":"Parks","year":"2021","journal-title":"Nucleic Acids Res"},{"key":"2024060402530657700_btae328-B25","doi-asserted-by":"crossref","first-page":"D418","DOI":"10.1093\/nar\/gkac993","article-title":"InterPro in 2022","volume":"51","author":"Paysan-Lafosse","year":"2023","journal-title":"Nucleic Acids Res"},{"key":"2024060402530657700_btae328-B26","first-page":"3.1.1","article-title":"An introduction to sequence similarity (\u201chomology\u201d) searching","author":"Pearson","year":"2013","journal-title":"Curr Protoc Bioinformatics"},{"key":"2024060402530657700_btae328-B27","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1038\/nmeth.2340","article-title":"A large-scale evaluation of computational protein function prediction","volume":"10","author":"Radivojac","year":"2013","journal-title":"Nat Methods"},{"key":"2024060402530657700_btae328-B28","doi-asserted-by":"crossref","first-page":"e2016239118","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proc Natl Acad Sci USA"},{"key":"2024060402530657700_btae328-B29","doi-asserted-by":"crossref","first-page":"5857","DOI":"10.1073\/pnas.95.11.5857","article-title":"SMART, a simple modular architecture research tool: identification of signaling domains","volume":"95","author":"Schultz","year":"1998","journal-title":"Proc Natl Acad Sci USA"},{"key":"2024060402530657700_btae328-B30","doi-asserted-by":"publisher","first-page":"e2310852121","DOI":"10.1073\/pnas.2310852121","article-title":"Global diversity of enterococci and description of 18 previously unknown species","volume":"121","author":"Schwartzman","year":"2024","journal-title":"Proc Natl Acad Sci U S A"},{"key":"2024060402530657700_btae328-B31","doi-asserted-by":"crossref","first-page":"2068","DOI":"10.1093\/bioinformatics\/btu153","article-title":"Prokka: 
rapid prokaryotic genome annotation","volume":"30","author":"Seemann","year":"2014","journal-title":"Bioinformatics"},{"key":"2024060402530657700_btae328-B32","doi-asserted-by":"crossref","first-page":"D506","DOI":"10.1093\/nar\/gky1049","article-title":"UniProt: a worldwide hub of protein knowledge","volume":"47","author":"The UniProt Consortium","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2024060402530657700_btae328-B33","doi-asserted-by":"publisher","first-page":"243","DOI":"10.1038\/s41587-023-01773-0","article-title":"Fast and accurate protein structure search with foldseek","volume":"42","author":"Van Kempen","year":"2024","journal-title":"Nat Biotechnol"},{"key":"2024060402530657700_btae328-B34","doi-asserted-by":"crossref","first-page":"1157","DOI":"10.1016\/j.cell.2022.02.002","article-title":"Emerging enterococcus pore-forming toxins with MHC\/HLA-I as receptors","volume":"185","author":"Xiong","year":"2022","journal-title":"Cell"},{"key":"2024060402530657700_btae328-B35","doi-asserted-by":"crossref","first-page":"W469","DOI":"10.1093\/nar\/gkab398","article-title":"NetGO 2.0: improving large-scale protein function prediction with massive sequence, text, domain, family and network information","volume":"49","author":"Yao","year":"2021","journal-title":"Nucleic Acids Res"},{"key":"2024060402530657700_btae328-B36","doi-asserted-by":"crossref","first-page":"2465","DOI":"10.1093\/bioinformatics\/bty130","article-title":"GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank","volume":"34","author":"You","year":"2018","journal-title":"Bioinformatics"},{"key":"2024060402530657700_btae328-B37","doi-asserted-by":"crossref","first-page":"169","DOI":"10.1016\/j.chom.2017.12.018","article-title":"Identification of a botulinum neurotoxin-like toxin in a commensal strain of enterococcus faecium","volume":"23","author":"Zhang","year":"2018","journal-title":"Cell Host 
Microbe"},{"key":"2024060402530657700_btae328-B38","doi-asserted-by":"crossref","first-page":"244","DOI":"10.1186\/s13059-019-1835-8","article-title":"The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens","volume":"20","author":"Zhou","year":"2019","journal-title":"Genome Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae328\/57826417\/btae328.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/6\/btae328\/58079633\/btae328.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/6\/btae328\/58079633\/btae328.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,4]],"date-time":"2024-06-04T03:28:10Z","timestamp":1717471690000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae328\/7679689"}},"subtitle":[],"editor":[{"given":"Lenore","family":"Cowen","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,5,22]]},"references-count":38,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,6,3]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae328","relation":{},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,6,1]]},"published":{"date-parts":[[2024,5,22]]},"article-number":"btae328"}}