{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,21]],"date-time":"2026-04-21T08:37:40Z","timestamp":1776760660433,"version":"3.51.2"},"reference-count":32,"publisher":"Oxford University Press (OUP)","issue":"18","license":[{"start":{"date-parts":[[2021,3,23]],"date-time":"2021-03-23T00:00:00Z","timestamp":1616457600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000057","name":"National Institute of General Medical Sciences","doi-asserted-by":"publisher","award":["R35GM124952"],"award-info":[{"award-number":["R35GM124952"]}],"id":[{"id":"10.13039\/100000057","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,9,29]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on protein data besides sequences, or lack generalizability to novel sequences, species and functions.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>To overcome aforementioned barriers in applicability and generalizability, we propose a novel deep learning model using only sequence information for proteins, named Transformer-based protein function Annotation through joint sequence\u2013Label Embedding (TALE). For generalizability to novel sequences we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions (tail labels), we embed protein function labels (hierarchical GO terms on directed graphs) together with inputs\/features (1D sequences) in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low similarity, new species, or rarely annotated functions compared to training data, revealing deep insights into the protein sequence\u2013function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability; and a GO term-centric analysis was also provided.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>The data, source codes and models are available at https:\/\/github.com\/Shen-Lab\/TALE.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab198","type":"journal-article","created":{"date-parts":[[2021,3,22]],"date-time":"2021-03-22T16:14:27Z","timestamp":1616429667000},"page":"2825-2833","source":"Crossref","is-referenced-by-count":137,"title":["TALE: Transformer-based protein function Annotation with joint sequence\u2013Label Embedding"],"prefix":"10.1093","volume":"37","author":[{"given":"Yue","family":"Cao","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, Texas A&M University , College Station, TX 77843, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1703-7796","authenticated-orcid":false,"given":"Yang","family":"Shen","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Texas A&M University , College Station, TX 77843, USA"}]}],"member":"286","published-online":{"date-parts":[[2021,3,23]]},"reference":[{"key":"2023061310574606500_btab198-B1","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene ontology: tool for the unification of biology","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat. Genet"},{"key":"2023061310574606500_btab198-B2","article-title":"Initializing neural networks for hierarchical multi-label text classification","author":"Baker","year":"2017","journal-title":"Assoc. Comput. Ling"},{"key":"2023061310574606500_btab198-B3","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1038\/nmeth.3176","article-title":"Fast and sensitive protein alignment using diamond","volume":"12","author":"Buchfink","year":"2015","journal-title":"Nat. Methods"},{"key":"2023061310574606500_btab198-B4","doi-asserted-by":"crossref","first-page":"i53","DOI":"10.1093\/bioinformatics\/btt228","article-title":"Information-theoretic evaluation of predicted ontological annotations","volume":"29","author":"Clark","year":"2013","journal-title":"Bioinformatics"},{"key":"2023061310574606500_btab198-B5","first-page":"248","author":"Deng","year":"2009"},{"key":"2023061310574606500_btab198-B6","author":"Duong","year":"2020"},{"key":"2023061310574606500_btab198-B7","author":"Elnaggar","year":"2020"},{"key":"2023061310574606500_btab198-B8","doi-asserted-by":"crossref","first-page":"e0198216","DOI":"10.1371\/journal.pone.0198216","article-title":"Predicting human protein function with multi-task deep neural networks","volume":"13","author":"Fa","year":"2018","journal-title":"PLoS One"},{"key":"2023061310574606500_btab198-B9","doi-asserted-by":"crossref","first-page":"giaa081","DOI":"10.1093\/gigascience\/giaa081","article-title":"Graph2GO: a multi-modal attributed network embedding method for inferring protein functions","volume":"9","author":"Fan","year":"2020","journal-title":"GigaScience"},{"key":"2023061310574606500_btab198-B10","doi-asserted-by":"crossref","first-page":"133","DOI":"10.1007\/978-1-4939-3743-1_10","article-title":"Community-wide evaluation of computational function prediction","volume":"1446","author":"Friedberg","year":"2017","journal-title":"Methods Mol. Biol"},{"key":"2023061310574606500_btab198-B11","doi-asserted-by":"crossref","first-page":"3873","DOI":"10.1093\/bioinformatics\/bty440","article-title":"deepNF: deep network fusion for protein function prediction","volume":"34","author":"Gligorijevi\u0107","year":"2018","journal-title":"Bioinformatics"},{"key":"2023061310574606500_btab198-B12","doi-asserted-by":"crossref","first-page":"184","DOI":"10.1186\/s13059-016-1037-6","article-title":"An expanded evaluation of protein function prediction methods shows an improvement in accuracy","volume":"17","author":"Jiang","year":"2016","journal-title":"Genome Biol"},{"key":"2023061310574606500_btab198-B13","doi-asserted-by":"crossref","first-page":"272","DOI":"10.1186\/1471-2105-6-272","article-title":"Automated methods of predicting the function of biological sequences using go and blast","volume":"6","author":"Jones","year":"2005","journal-title":"BMC Bioinformatics"},{"key":"2023061310574606500_btab198-B14","first-page":"60","author":"Kahanda","year":"2017"},{"key":"2023061310574606500_btab198-B15","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1093\/bioinformatics\/btz595","article-title":"DeepGOPlus: improved protein function prediction from sequence","volume":"36","author":"Kulmanov","year":"2020","journal-title":"Bioinformatics"},{"key":"2023061310574606500_btab198-B16","doi-asserted-by":"crossref","first-page":"660","DOI":"10.1093\/bioinformatics\/btx624","article-title":"DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier","volume":"34","author":"Kulmanov","year":"2018","journal-title":"Bioinformatics"},{"key":"2023061310574606500_btab198-B17","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","article-title":"A general method applicable to the search for similarities in the amino acid sequence of two proteins","volume":"48","author":"Needleman","year":"1970","journal-title":"J. Mol. Biol"},{"key":"2023061310574606500_btab198-B18","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1038\/nmeth.2340","article-title":"A large-scale evaluation of computational protein function prediction","volume":"10","author":"Radivojac","year":"2013","journal-title":"Nat. Methods"},{"key":"2023061310574606500_btab198-B19","first-page":"1","article-title":"DEEPred: automated protein function prediction with multi-task feed-forward deep neural networks","volume":"9","author":"Rifaioglu","year":"2019","journal-title":"Sci. Rep"},{"key":"2023061310574606500_btab198-B20","first-page":"622803","author":"Rives","year":"2020"},{"key":"2023061310574606500_btab198-B21","doi-asserted-by":"crossref","first-page":"635","DOI":"10.1006\/jmbi.1997.1602","article-title":"Structural basis for molecular recognition between nuclear transport factor 2 (NTF2) and the GDP-bound form of the Ras-family GTPase Ran","volume":"277","author":"Stewart","year":"1998","journal-title":"J. Mol. Biol"},{"key":"2023061310574606500_btab198-B22","first-page":"45, D362\u2013D368","article-title":"The STRING database in 2017: quality-controlled protein\u2013protein association networks, made broadly accessible","author":"Szklarczyk","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023061310574606500_btab198-B23","doi-asserted-by":"crossref","first-page":"D506","DOI":"10.1093\/nar\/gky1049","article-title":"UniProt: a worldwide hub of protein knowledge","volume":"47","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2023061310574606500_btab198-B24","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"2023061310574606500_btab198-B25","doi-asserted-by":"crossref","first-page":"1260","DOI":"10.1126\/science.abb2507","article-title":"Cryo-EM structure of the 2019-ncov spike in the prefusion conformation","volume":"367","author":"Wrapp","year":"2020","journal-title":"Science"},{"key":"2023061310574606500_btab198-B26","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1038\/nmeth.3213","article-title":"The I-TASSER suite: protein structure and function prediction","volume":"12","author":"Yang","year":"2015","journal-title":"Nat. Methods"},{"key":"2023061310574606500_btab198-B27","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1016\/j.ymeth.2018.05.026","article-title":"DeepText2Go: improving large-scale protein function prediction with deep semantic text representation","volume":"145","author":"You","year":"2018","journal-title":"Methods"},{"key":"2023061310574606500_btab198-B28","doi-asserted-by":"crossref","first-page":"2465","DOI":"10.1093\/bioinformatics\/bty130","article-title":"GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank","volume":"34","author":"You","year":"2018","journal-title":"Bioinformatics"},{"key":"2023061310574606500_btab198-B29","doi-asserted-by":"crossref","first-page":"W379","DOI":"10.1093\/nar\/gkz388","article-title":"NetGO: improving large-scale protein function prediction with massive network information","volume":"47","author":"You","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2023061310574606500_btab198-B30","doi-asserted-by":"crossref","first-page":"1900019","DOI":"10.1002\/pmic.201900019","article-title":"DeepFunc: a deep learning framework for accurate prediction of protein functions from protein sequences and interactions","volume":"19","author":"Zhang","year":"2019","journal-title":"Proteomics"},{"key":"2023061310574606500_btab198-B31","first-page":"1836","author":"Zhou","year":"2019"},{"key":"2023061310574606500_btab198-B32","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-019-1835-8","article-title":"The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens","volume":"20","author":"Zhou","year":"2019","journal-title":"Genome Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab198\/39510874\/btab198.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/18\/2825\/50579661\/btab198.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/18\/2825\/50579661\/btab198.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,13]],"date-time":"2023-06-13T07:00:07Z","timestamp":1686639607000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/18\/2825\/6182677"}},"subtitle":[],"editor":[{"given":"Alfonso","family":"Valencia","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,3,23]]},"references-count":32,"journal-issue":{"issue":"18","published-print":{"date-parts":[[2021,9,29]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab198","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.09.27.315937","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,9,15]]},"published":{"date-parts":[[2021,3,23]]}}}