{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T16:08:39Z","timestamp":1777651719449,"version":"3.51.4"},"reference-count":44,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2022,9,22]],"date-time":"2022-09-22T00:00:00Z","timestamp":1663804800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,11,19]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Well understanding protein function and structure in computational biology helps in the understanding of human beings. To face the limited proteins that are annotated structurally and functionally, the scientific community embraces the self-supervised pre-training methods from large amounts of unlabeled protein sequences for protein embedding learning. However, the protein is usually represented by individual amino acids with limited vocabulary size (e.g. 20 type proteins), without considering the strong local semantics existing in protein sequences. In this work, we propose a novel pre-training modeling approach SPRoBERTa. We first present an unsupervised protein tokenizer to learn protein representations with local fragment pattern. Then, a novel framework for deep pre-training model is introduced to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction task (e.g. secondary structure prediction), amino acid pair-level prediction task (e.g. contact prediction) and also protein-level prediction task (remote homology prediction, protein function prediction). Experiments show that our approach achieves significant improvements in all tasks and outperforms the previous methods. We also provide detailed ablation studies and analysis for our protein tokenizer and training framework.<\/jats:p>","DOI":"10.1093\/bib\/bbac401","type":"journal-article","created":{"date-parts":[[2022,9,22]],"date-time":"2022-09-22T16:33:05Z","timestamp":1663864385000},"source":"Crossref","is-referenced-by-count":9,"title":["SPRoBERTa: protein embedding learning with local fragment modeling"],"prefix":"10.1093","volume":"23","author":[{"given":"Lijun","family":"Wu","sequence":"first","affiliation":[{"name":"Microsoft Research Asia, No. 5 Dan Ling Street , Haidian District, 100080, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chengcan","family":"Yin","sequence":"additional","affiliation":[{"name":"National Key Laboratory for Novel Software Technology, Nanjing University , 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jinhua","family":"Zhu","sequence":"additional","affiliation":[{"name":"CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China , No.96, JinZhai Road Baohe District, 230026, Hefei, Anhui Province, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhen","family":"Wu","sequence":"additional","affiliation":[{"name":"National Key Laboratory for Novel Software Technology, Nanjing University , 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Liang","family":"He","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, No. 5 Dan Ling Street , Haidian District, 100080, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yingce","family":"Xia","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, No. 5 Dan Ling Street , Haidian District, 100080, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shufang","family":"Xie","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, No. 5 Dan Ling Street , Haidian District, 100080, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tao","family":"Qin","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, No. 5 Dan Ling Street , Haidian District, 100080, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tie-Yan","family":"Liu","sequence":"additional","affiliation":[{"name":"Microsoft Research Asia, No. 5 Dan Ling Street , Haidian District, 100080, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2022,9,22]]},"reference":[{"key":"2022112111112184000_ref1","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2022112111112184000_ref2","article-title":"Roberta: A robustly optimized bert pretraining approach","volume-title":"ArXiv","author":"Liu","year":"2019"},{"key":"2022112111112184000_ref3","first-page":"5998","volume-title":"Advances in neural information processing systems","author":"Vaswani","year":"2017"},{"issue":"1","key":"2022112111112184000_ref4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/1471-2105-12-333","article-title":"Efficient counting of k-mers in dna sequences using a bloom filter","volume":"12","author":"Melsted","year":"2011","journal-title":"BMC bioinformatics"},{"key":"2022112111112184000_ref5","doi-asserted-by":"crossref","first-page":"66","DOI":"10.18653\/v1\/D18-2012","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Kudo","year":"2018"},{"issue":"12","key":"2022112111112184000_ref6","doi-asserted-by":"crossref","first-page":"1315","DOI":"10.1038\/s41592-019-0598-1","article-title":"Unified rational protein engineering with sequence-based deep representation learning","volume":"16","author":"Alley","year":"2019","journal-title":"Nat Methods"},{"key":"2022112111112184000_ref7","article-title":"Learning protein sequence embeddings using information from structure","volume-title":"International Conference on Learning Representations","author":"Bepler","year":"2018"},{"key":"2022112111112184000_ref8","first-page":"2227","volume-title":"Proceedings of NAACL-HLT","author":"Peters","year":"2018"},{"key":"2022112111112184000_ref9","doi-asserted-by":"crossref","first-page":"328","DOI":"10.18653\/v1\/P18-1031","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Howard","year":"2018"},{"issue":"8","key":"2022112111112184000_ref10","doi-asserted-by":"crossref","first-page":"2401","DOI":"10.1093\/bioinformatics\/btaa003","article-title":"Udsmprot: universal deep sequence models for protein classification","volume":"36","author":"Strodthoff","year":"2020","journal-title":"Bioinformatics"},{"issue":"1","key":"2022112111112184000_ref11","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12859-019-3220-8","article-title":"Modeling aspects of the language of life through transfer-learning protein sequences","volume":"20","author":"Heinzinger","year":"2019","journal-title":"BMC bioinformatics"},{"key":"2022112111112184000_ref12","first-page":"9689","article-title":"Evaluating protein transfer learning with tape","volume":"32","author":"Rao","year":"2019","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"15","key":"2022112111112184000_ref13","doi-asserted-by":"crossref","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proc Natl Acad Sci"},{"key":"2022112111112184000_ref14","volume-title":"International Conference on Learning Representations","author":"Rao","year":"2020"},{"key":"2022112111112184000_ref15","article-title":"Language models enable zero-shot prediction of the effects of mutations on protein function","volume":"34","author":"Meier","year":"2021","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2022112111112184000_ref16","doi-asserted-by":"crossref","first-page":"123912","DOI":"10.1109\/ACCESS.2021.3110269","article-title":"Pre-training of deep bidirectional protein sequence representations with structural information","volume":"9","author":"Min","year":"2021","journal-title":"IEEE Access"},{"key":"2022112111112184000_ref17","doi-asserted-by":"crossref","DOI":"10.1109\/TPAMI.2021.3095381","article-title":"Prottrans: towards cracking the language of life\u2019s code through self-supervised deep learning and high performance computing","volume-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence","author":"Elnaggar","year":"2021"},{"key":"2022112111112184000_ref18","article-title":"Rethinking attention with performers","volume-title":"International Conference on Learning Representations","author":"Choromanski","year":"2020"},{"key":"2022112111112184000_ref19","doi-asserted-by":"crossref","DOI":"10.1101\/2021.02.12.430858","article-title":"MSA transformer","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"Rao","year":"2021"},{"key":"2022112111112184000_ref20","doi-asserted-by":"crossref","first-page":"1715","DOI":"10.18653\/v1\/P16-1162","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Sennrich","year":"2016"},{"key":"2022112111112184000_ref21","doi-asserted-by":"crossref","first-page":"66","DOI":"10.18653\/v1\/P18-1007","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Kudo","year":"2018"},{"key":"2022112111112184000_ref22","doi-asserted-by":"crossref","first-page":"5149","DOI":"10.1109\/ICASSP.2012.6289079","volume-title":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Schuster","year":"2012"},{"key":"2022112111112184000_ref23","first-page":"1","volume-title":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","author":"Nambiar","year":"2020"},{"key":"2022112111112184000_ref24","article-title":"Pre-training protein language models with label-agnostic binding pairs enhances performance in downstream tasks","volume-title":"Annual Conference on Neural Information Processing Systems","author":"Filipavicius","year":"2020"},{"key":"2022112111112184000_ref25","first-page":"770","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"He","year":"2016"},{"issue":"D1","key":"2022112111112184000_ref26","doi-asserted-by":"crossref","first-page":"D427","DOI":"10.1093\/nar\/gky995","article-title":"The pfam protein families database in 2019","volume":"47","author":"El-Gebali","year":"2019","journal-title":"Nucleic Acids Res"},{"issue":"10","key":"2022112111112184000_ref27","doi-asserted-by":"crossref","first-page":"1282","DOI":"10.1093\/bioinformatics\/btm098","article-title":"Uniref: comprehensive and non-redundant uniprot reference clusters","volume":"23","author":"Suzek","year":"2007","journal-title":"Bioinformatics"},{"issue":"6","key":"2022112111112184000_ref28","doi-asserted-by":"crossref","first-page":"520","DOI":"10.1002\/prot.25674","article-title":"Netsurfp-2.0: Improved prediction of protein structural features by integrated deep learning","volume":"87","author":"Klausen","year":"2019","journal-title":"Proteins: Structure, Function, and Bioinformatics"},{"issue":"4","key":"2022112111112184000_ref29","doi-asserted-by":"crossref","first-page":"508","DOI":"10.1002\/(SICI)1097-0134(19990301)34:4<508::AID-PROT10>3.0.CO;2-4","article-title":"Evaluation and improvement of multiple sequence methods for protein secondary structure prediction","volume":"34","author":"Cuff","year":"1999","journal-title":"Proteins: Structure, Function, and Bioinformatics"},{"issue":"1","key":"2022112111112184000_ref30","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1005324","article-title":"Accurate de novo prediction of protein contact map by ultra-deep learning model","volume":"13","author":"Wang","year":"2017","journal-title":"PLoS Comput Biol"},{"issue":"8","key":"2022112111112184000_ref31","doi-asserted-by":"crossref","first-page":"1295","DOI":"10.1093\/bioinformatics\/btx780","article-title":"Deepsf: deep convolutional neural network for mapping protein sequences to folds","volume":"34","author":"Hou","year":"2018","journal-title":"Bioinformatics"},{"issue":"1","key":"2022112111112184000_ref32","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41467-021-23303-9","article-title":"Structure-based protein function prediction using graph convolutional networks","volume":"12","author":"Gligorijevi\u0107","year":"2021","journal-title":"Nat Commun"},{"key":"2022112111112184000_ref33","article-title":"Adam: A method for stochastic optimization","volume-title":"ICLR","author":"Kingma","year":"2015"},{"key":"2022112111112184000_ref34","doi-asserted-by":"crossref","DOI":"10.1101\/2020.09.04.283929","article-title":"Self-supervised contrastive learning of protein representations by mutual information maximization","author":"Lu","year":"2020"},{"key":"2022112111112184000_ref35","article-title":"Profile prediction: An alignment-based pre-training task for protein sequence models","volume-title":"ArXiv","author":"Sturmfels","year":"2020"},{"issue":"3","key":"2022112111112184000_ref36","doi-asserted-by":"crossref","first-page":"368","DOI":"10.1016\/j.sbi.2006.04.004","article-title":"Multiple sequence alignment","volume":"16","author":"Edgar","year":"2006","journal-title":"Curr Opin Struct Biol"},{"issue":"8","key":"2022112111112184000_ref37","doi-asserted-by":"crossref","first-page":"2102","DOI":"10.1093\/bioinformatics\/btac020","article-title":"Proteinbert: A universal deep-learning model of protein sequence and function","volume":"38","author":"Brandes","year":"2022","journal-title":"Bioinformatics"},{"key":"2022112111112184000_ref38","volume-title":"International Conference on Learning Representations","author":"Zhang","year":"2021"},{"issue":"1","key":"2022112111112184000_ref39","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41467-018-04964-5","article-title":"Clustering huge protein sequence sets in linear time","volume":"9","author":"Steinegger","year":"2018","journal-title":"Nat Commun"},{"issue":"1","key":"2022112111112184000_ref40","first-page":"1","article-title":"Lm-gvp: an extensible sequence and structure informed deep learning framework for protein property prediction","volume":"12","author":"Wang","year":"2022","journal-title":"Sci Rep"},{"key":"2022112111112184000_ref41","article-title":"Learning from protein structure with geometric vector perceptrons","volume-title":"International Conference on Learning Representations","author":"Jing","year":"2021"},{"key":"2022112111112184000_ref42","article-title":"Protein representation learning by geometric structure pretraining","volume-title":"ArXiv","author":"Zhang","year":"2022"},{"issue":"3","key":"2022112111112184000_ref43","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0018093","article-title":"A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives","volume":"6","author":"Thompson","year":"2011","journal-title":"PloS one"},{"issue":"7873","key":"2022112111112184000_ref44","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with alphafold","volume":"596","author":"Jumper","year":"2021","journal-title":"Nature"}],"container-title":["Briefings in Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/23\/6\/bbac401\/47144233\/bbac401.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/23\/6\/bbac401\/47144233\/bbac401.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,11,21]],"date-time":"2022-11-21T11:15:13Z","timestamp":1669029313000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/doi\/10.1093\/bib\/bbac401\/6711410"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,22]]},"references-count":44,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2022,11,19]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbac401","relation":{},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,11]]},"published":{"date-parts":[[2022,9,22]]},"article-number":"bbac401"}}