{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,2]],"date-time":"2026-01-02T18:53:57Z","timestamp":1767380037318,"version":"3.48.0"},"reference-count":40,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2025,12,7]],"date-time":"2025-12-07T00:00:00Z","timestamp":1765065600000},"content-version":"vor","delay-in-days":1,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62302311"],"award-info":[{"award-number":["62302311"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62302316"],"award-info":[{"award-number":["62302316"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62406199"],"award-info":[{"award-number":["62406199"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62471310"],"award-info":[{"award-number":["62471310"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026,1,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Protein language models (PLMs) have emerged as pivotal tools for protein representation, enabling significant advances in structure-function prediction and computational biology. However, current PLMs predominantly rely on fine-grained amino acid sequences as input, treating individual residues as tokens. While this approach facilitates semantic learning at the residue level, it struggles to capture molecular-level semantics, particularly for large proteins, where sequence truncation and inefficient local pattern extraction hinder holistic understanding. The spatial structure of a protein determines its function. Despite the critical role of protein function analysis, coarse-grained protein language frameworks that bridge sequence and structural semantics remain underdeveloped.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>To fill this gap, we introduce a novel structure-aware coarse-grained protein language that discretizes proteins into local structural patterns derived from their secondary structures. By constructing a vocabulary of these patterns as \u201cwords,\u201d we represent proteins as compact, structure-aware \u201csentences\u201d significantly shorter than raw amino acid sequences. We benchmark the proposed coarse-grained language against three state-of-the-art fine-grained protein languages and a classical language modeling method in natural language processing, using two architectures: a lightweight Doc2Vec model and a Transformer-based BERT model, and evaluating performance across diverse downstream tasks, including function prediction, enzyme classification, and interaction identification. The proposed method achieves stable performance across three tasks, especially for long proteins. These results demonstrate that the proposed coarse-grained protein language preserves critical structural and functional semantics and improves molecular-level analysis, offering a promising direction for decoding higher-order biological insights.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>The data and source code of the proposed method are available at GitHub (https:\/\/github.com\/bug-0x3f\/coarse-grained-protein-language) and Zenodo (DOI: 10.5281\/zenodo.17674298).<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaf654","type":"journal-article","created":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T12:51:43Z","timestamp":1764679903000},"source":"Crossref","is-referenced-by-count":0,"title":["Molecular-level protein semantic learning via structure-aware coarse-grained language modeling"],"prefix":"10.1093","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0484-0966","authenticated-orcid":false,"given":"Jun","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Artificial Intelligence, Shenzhen University , Shenzhen, Guangdong 518060,","place":["China"]},{"name":"National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University , Shenzhen, Guangdong 518060,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xueer","family":"Weng","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Shenzhen University , Shenzhen, Guangdong 518060,","place":["China"]},{"name":"National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University , Shenzhen, Guangdong 518060,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tiantian","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Shenzhen University , Shenzhen, Guangdong 518060,","place":["China"]},{"name":"National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University , Shenzhen, Guangdong 518060,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-0888-6575","authenticated-orcid":false,"given":"Yumeng","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Shenzhen Technology University , Shenzhen, Guangdong 518118,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8479-6904","authenticated-orcid":false,"given":"Zexuan","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Shenzhen University , Shenzhen, Guangdong 518060,","place":["China"]},{"name":"National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University , Shenzhen, Guangdong 518060,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2025,12,6]]},"reference":[{"key":"2026010213513209900_btaf654-B1","doi-asserted-by":"crossref","first-page":"493","DOI":"10.1038\/s41586-024-07487-w","article-title":"Accurate structure prediction of biomolecular interactions with alphafold 3","volume":"630","author":"Abramson","year":"2024","journal-title":"Nature"},{"key":"2026010213513209900_btaf654-B2","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res"},{"key":"2026010213513209900_btaf654-B3","doi-asserted-by":"crossref","first-page":"4143","DOI":"10.1002\/anie.201708408","article-title":"Directed evolution: bringing new chemistry to life","volume":"57","author":"Arnold","year":"2018","journal-title":"Angew Chem Int Ed Engl"},{"key":"2026010213513209900_btaf654-B4","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene ontology: tool for the unification of biology","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat Genet"},{"year":"2013","author":"Bengio","key":"2026010213513209900_btaf654-B5","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1308.3432"},{"key":"2026010213513209900_btaf654-B6","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1093\/nar\/28.1.235","article-title":"The protein data bank","volume":"28","author":"Berman","year":"2000","journal-title":"Nucleic Acids Res"},{"key":"2026010213513209900_btaf654-B7","doi-asserted-by":"crossref","first-page":"2102","DOI":"10.1093\/bioinformatics\/btac020","article-title":"ProteinBERT: a universal deep-learning model of protein sequence and function","volume":"38","author":"Brandes","year":"2022","journal-title":"Bioinformatics"},{"key":"2026010213513209900_btaf654-B8","doi-asserted-by":"crossref","first-page":"e1002195","DOI":"10.1371\/journal.pcbi.1002195","article-title":"Accelerated profile hmm searches","volume":"7","author":"Eddy","year":"2011","journal-title":"PLoS Comput Biol"},{"key":"2026010213513209900_btaf654-B9","doi-asserted-by":"crossref","first-page":"7112","DOI":"10.1109\/TPAMI.2021.3095381","article-title":"Prottrans: toward understanding the language of life through self-supervised learning","volume":"44","author":"Elnaggar","year":"2022","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2026010213513209900_btaf654-B10","doi-asserted-by":"crossref","first-page":"D279","DOI":"10.1093\/nar\/gkv1344","article-title":"The pfam protein families database: towards a more sustainable future","volume":"44","author":"Finn","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2026010213513209900_btaf654-B11","doi-asserted-by":"crossref","first-page":"3150","DOI":"10.1093\/bioinformatics\/bts565","article-title":"CD-HIT: accelerated for clustering the next-generation sequencing data","volume":"28","author":"Fu","year":"2012","journal-title":"Bioinformatics"},{"key":"2026010213513209900_btaf654-B12","doi-asserted-by":"crossref","first-page":"3168","DOI":"10.1038\/s41467-021-23303-9","article-title":"Structure-based protein function prediction using graph convolutional networks","volume":"12","author":"Gligorijevi\u0107","year":"2021","journal-title":"Nat Commun"},{"key":"2026010213513209900_btaf654-B13","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1158\/2159-8290.CD-21-1059","article-title":"Hallmarks of cancer: new dimensions","volume":"12","author":"Hanahan","year":"2022","journal-title":"Cancer Discov"},{"key":"2026010213513209900_btaf654-B14","doi-asserted-by":"crossref","first-page":"lqae150","DOI":"10.1093\/nargab\/lqae150","article-title":"Bilingual language model for protein sequence and structure","volume":"6","author":"Heinzinger","year":"2024","journal-title":"NAR Genom Bioinform"},{"key":"2026010213513209900_btaf654-B15","first-page":"137673","volume-title":"Advances in Neural Information Processing Systems","author":"Hu","year":"2024"},{"key":"2026010213513209900_btaf654-B16","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with AlphaFold","volume":"596","author":"Jumper","year":"2021","journal-title":"Nature"},{"key":"2026010213513209900_btaf654-B17","doi-asserted-by":"crossref","first-page":"2577","DOI":"10.1002\/bip.360221211","article-title":"Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features","volume":"22","author":"Kabsch","year":"1983","journal-title":"Biopolymers"},{"first-page":"1","year":"2020","author":"Lan","key":"2026010213513209900_btaf654-B18"},{"key":"2026010213513209900_btaf654-B19","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/IJCNN48605.2020.9207145","author":"\u0141a\u0144cucki","year":"2020"},{"first-page":"1188","year":"2014","author":"Le","key":"2026010213513209900_btaf654-B20"},{"key":"2026010213513209900_btaf654-B21","first-page":"35700","volume-title":"Advances in Neural Information Processing Systems","author":"Li","year":"2024"},{"key":"2026010213513209900_btaf654-B22","doi-asserted-by":"crossref","first-page":"1123","DOI":"10.1126\/science.ade2574","article-title":"Evolutionary-scale prediction of atomic-level protein structure with a language model","volume":"379","author":"Lin","year":"2023","journal-title":"Science"},{"key":"2026010213513209900_btaf654-B23","first-page":"29287","volume-title":"Advances in Neural Information Processing Systems","author":"Meier","year":"2021"},{"key":"2026010213513209900_btaf654-B24","doi-asserted-by":"crossref","first-page":"452","DOI":"10.1093\/nar\/gkg062","article-title":"The cath database: an extended protein family resource for structural and functional genomics","volume":"31","author":"Orengo","year":"2003","journal-title":"Nucleic Acids Res"},{"key":"2026010213513209900_btaf654-B25","doi-asserted-by":"crossref","first-page":"173","DOI":"10.1038\/nmeth.1818","article-title":"HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment","volume":"9","author":"Remmert","year":"2012","journal-title":"Nat Methods"},{"key":"2026010213513209900_btaf654-B26","doi-asserted-by":"crossref","first-page":"e2016239118","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proc Natl Acad Sci USA"},{"key":"2026010213513209900_btaf654-B27","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1038\/nrd.2016.230","article-title":"A comprehensive map of molecular drug targets","volume":"16","author":"Santos","year":"2017","journal-title":"Nat Rev Drug Discov"},{"key":"2026010213513209900_btaf654-B28","doi-asserted-by":"crossref","first-page":"D5","DOI":"10.1093\/nar\/gkn741","article-title":"Database resources of the national center for biotechnology information","volume":"37","author":"Sayers","year":"2009","journal-title":"Nucleic Acids Res"},{"key":"2026010213513209900_btaf654-B29","first-page":"572","article-title":"Small molecules, big targets: drug discovery faces the protein-protein interaction challenge","volume":"11","author":"Scott","year":"2012","journal-title":"Nat Rev Drug Discov"},{"key":"2026010213513209900_btaf654-B30","doi-asserted-by":"crossref","first-page":"706","DOI":"10.1038\/s41586-019-1923-7","article-title":"Improved protein structure prediction using potentials from deep learning","volume":"577","author":"Senior","year":"2020","journal-title":"Nature"},{"first-page":"1715","year":"2016","author":"Sennrich","key":"2026010213513209900_btaf654-B31"},{"first-page":"6987","year":"2023","author":"Su","key":"2026010213513209900_btaf654-B32"},{"key":"2026010213513209900_btaf654-B33","doi-asserted-by":"crossref","first-page":"926","DOI":"10.1093\/bioinformatics\/btu739","article-title":"Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches","volume":"31","author":"Suzek","year":"2015","journal-title":"Bioinformatics"},{"key":"2026010213513209900_btaf654-B34","doi-asserted-by":"crossref","first-page":"92","DOI":"10.1186\/s13321-024-00884-3","article-title":"PETA: evaluating the impact of protein transfer learning with Sub-word tokenization on downstream applications","volume":"16","author":"Tan","year":"2024","journal-title":"J Cheminform"},{"key":"2026010213513209900_btaf654-B35","first-page":"114147","volume-title":"Advances in Neural Information Processing Systems","author":"Tao","year":"2024"},{"key":"2026010213513209900_btaf654-B36","doi-asserted-by":"crossref","first-page":"D523","DOI":"10.1093\/nar\/gkac1052","article-title":"Uniprot: the universal protein knowledgebase in 2023","volume":"51","author":"Uniprot Consortium","year":"2022","journal-title":"Nucleic Acids Res"},{"key":"2026010213513209900_btaf654-B37","first-page":"6309","volume-title":"Advances in Neural Information Processing Systems","author":"van den Oord","year":"2017"},{"key":"2026010213513209900_btaf654-B38","doi-asserted-by":"crossref","first-page":"243","DOI":"10.1038\/s41587-023-01773-0","article-title":"Fast and accurate protein structure search with foldseek","volume":"42","author":"Van Kempen","year":"2024","journal-title":"Nat Biotechnol"},{"key":"2026010213513209900_btaf654-B39","doi-asserted-by":"crossref","first-page":"395","DOI":"10.1002\/prot.26626","article-title":"iNucRes-ASSH: identifying nucleic acid-binding residues in proteins by using self-attention-based structure-sequence hybrid neural network","volume":"92","author":"Zhang","year":"2024","journal-title":"Prot Struct Funct Bioinfo"},{"year":"2023","author":"Zhang","key":"2026010213513209900_btaf654-B40"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaf654\/65790122\/btaf654.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/1\/btaf654\/65790122\/btaf654.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/1\/btaf654\/65790122\/btaf654.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,2]],"date-time":"2026-01-02T18:51:41Z","timestamp":1767379901000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btaf654\/8373459"}},"subtitle":[],"editor":[{"given":"Lenore","family":"Cowen","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2025,12,6]]},"references-count":40,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,2]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaf654","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2026,1]]},"published":{"date-parts":[[2025,12,6]]},"article-number":"btaf654"}}