{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T11:27:29Z","timestamp":1763810849485,"version":"3.37.3"},"reference-count":33,"publisher":"Oxford University Press (OUP)","issue":"14","license":[{"start":{"date-parts":[[2022,6,2]],"date-time":"2022-06-02T00:00:00Z","timestamp":1654128000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"NSFC-Youth","award":["61902335"],"award-info":[{"award-number":["61902335"]}]},{"name":"Key Area R&D Program of Guangdong Province","award":["2018B030338001"],"award-info":[{"award-number":["2018B030338001"]}]},{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"publisher","award":["2018YFB1800800"],"award-info":[{"award-number":["2018YFB1800800"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Shenzhen Outstanding Talents Training Fund"},{"name":"Guangdong Research Project","award":["2017ZT07X152"],"award-info":[{"award-number":["2017ZT07X152"]}]},{"name":"Guangdong Regional Joint Fund-Key Projects","award":["2019B1515120039"],"award-info":[{"award-number":["2019B1515120039"]}]},{"DOI":"10.13039\/501100001809","name":"NSFC","doi-asserted-by":"publisher","award":["61931024&81922046"],"award-info":[{"award-number":["61931024&81922046"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Zelixir Biotechnology Company Fund"},{"name":"High-Performance Computing Portal"},{"name":"Information Technology Services Office"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,7,11]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Protein secondary structure prediction 
(PSSP) is one of the fundamental and challenging problems in the field of computational biology. Accurate PSSP relies on sufficient homologous protein sequences to build the multiple sequence alignment (MSA). Unfortunately, many proteins lack homologous sequences, which results in the low quality of MSA and poor performance. In this article, we propose the novel dynamic scoring matrix (DSM)-Distil to tackle this issue, which takes advantage of the pretrained BERT and exploits the knowledge distillation on the newly designed DSM features. Specifically, we propose the DSM to replace the widely used profile and PSSM (position-specific scoring matrix) features. DSM could automatically dig for the suitable feature for each residue, based on the original profile. Namely, DSM-Distil not only could adapt to the low homologous proteins but also is compatible with high homologous ones. Thanks to the dynamic property, DSM could adapt to the input data much better and achieve higher performance. Moreover, to compensate for low-quality MSA, we propose to generate the pseudo-DSM from a pretrained BERT model and aggregate it with the original DSM by adaptive residue-wise fusion, which helps to build richer and more complete input features. In addition, we propose to supervise the learning of low-quality DSM features using high-quality ones. To achieve this, a novel teacher\u2013student model is designed to distill the knowledge from proteins with high homologous sequences to that of low ones. Combining all the proposed methods, our model achieves the new state-of-the-art performance for low homologous proteins.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Compared with the previous state-of-the-art method \u2018Bagging\u2019, DSM-Distil achieves improvements of about 5% and 7.3% for proteins with MSA count \u226430 and extremely low homologous cases, respectively. 
We also compare DSM-Distil with Alphafold2 which is a state-of-the-art framework for protein structure prediction. DSM-Distil outperforms Alphafold2 by 4.1% on extremely low-quality MSA on 8-state secondary structure prediction. Moreover, we release a large-scale up-to-date test dataset BC40 for low-quality MSA structure prediction evaluation.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>BC40 dataset: https:\/\/drive.google.com\/drive\/folders\/15vwRoOjAkhhwfjDk6-YoKGf4JzZXIMC. HardCase dataset: https:\/\/drive.google.com\/drive\/folders\/1BvduOr2b7cObUHy6GuEWk-aUkKJgzTUv. Code: https:\/\/github.com\/qinwang-ai\/DSM-Distil.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac351","type":"journal-article","created":{"date-parts":[[2022,6,2]],"date-time":"2022-06-02T13:34:16Z","timestamp":1654176856000},"page":"3574-3581","source":"Crossref","is-referenced-by-count":10,"title":["Prior knowledge facilitates low homologous protein secondary structure prediction with DSM distillation"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5442-4210","authenticated-orcid":false,"given":"Qin","family":"Wang","sequence":"first","affiliation":[{"name":"The Chinese University of Hong Kong (Shenzhen) , Shenzhen 51800, China"},{"name":"The Future Network of Intelligence Institute , Shenzhen 51800, China"},{"name":"Shenzhen Research Institute of Big Data , Shenzhen 51800, China"}]},{"given":"Jun","family":"Wei","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong (Shenzhen) , Shenzhen 51800, China"},{"name":"The Future Network of Intelligence Institute , Shenzhen 51800, China"},{"name":"Shenzhen Research Institute of Big Data , Shenzhen 51800, China"}]},{"given":"Yuzhe","family":"Zhou","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong (Shenzhen) , Shenzhen 51800, China"},{"name":"The Future Network of Intelligence Institute , 
Shenzhen 51800, China"},{"name":"Shenzhen Research Institute of Big Data , Shenzhen 51800, China"}]},{"given":"Mingzhi","family":"Lin","sequence":"additional","affiliation":[{"name":"Zelixir Biotech , Shanghai 200030, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4517-7216","authenticated-orcid":false,"given":"Ruobing","family":"Ren","sequence":"additional","affiliation":[{"name":"Shanghai Key Laboratory of Metabolic Remodeling and Health, Institute of Metabolism and Integrative Biology, Fudan University , Shanghai 200000, China"}]},{"given":"Sheng","family":"Wang","sequence":"additional","affiliation":[{"name":"Zelixir Biotech , Shanghai 200030, China"}]},{"given":"Shuguang","family":"Cui","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong (Shenzhen) , Shenzhen 51800, China"},{"name":"The Future Network of Intelligence Institute , Shenzhen 51800, China"},{"name":"Shenzhen Research Institute of Big Data , Shenzhen 51800, China"}]},{"given":"Zhen","family":"Li","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong (Shenzhen) , Shenzhen 51800, China"},{"name":"The Future Network of Intelligence Institute , Shenzhen 51800, China"},{"name":"Shenzhen Research Institute of Big Data , Shenzhen 51800, China"}]}],"member":"286","published-online":{"date-parts":[[2022,6,2]]},"reference":[{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"1315","DOI":"10.1038\/s41592-019-0598-1","article-title":"Unified rational protein engineering with sequence-based deep representation learning","volume":"16","author":"Alley","year":"2019","journal-title":"Nat. Methods"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J. Mol. 
Biol"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped blast and psi-blast: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res"},{"key":"2023041405371330100_","article-title":"Learning protein sequence embeddings using information from structure","volume-title":"International Conference on Learning Representations","author":"Bepler","year":"2018"},{"key":"2023041405371330100_","first-page":"535","volume-title":"Philadelphia, PA, USA,","author":"Bucilu\u01ce","year":"2006"},{"first-page":"742","year":"2017","author":"Chen","key":"2023041405371330100_"},{"key":"2023041405371330100_","first-page":"755","article-title":"Profile hidden Markov models","volume":"14","author":"Eddy","year":"1998","journal-title":"Bioinformatics (Oxford, England)"},{"key":"2023041405371330100_","first-page":"88","volume-title":"International Conference on Research in Computational Molecular Biology, Padua, Italy","author":"Guo","year":"2020"},{"year":"2019","author":"Heinzinger","key":"2023041405371330100_"},{"key":"2023041405371330100_","first-page":"9","article-title":"Distilling the Knowledge in a Neural Network","volume-title":"Statistics","author":"Hinton","year":"2015"},{"year":"2015","author":"Huang","key":"2023041405371330100_"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with alphafold","volume":"596","author":"Jumper","year":"2021","journal-title":"Nature"},{"key":"2023041405371330100_","first-page":"2577","article-title":"Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features","volume":"22","author":"Kabsch","year":"1983","journal-title":"Biopolym. Original Res. 
Biomol"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"112","DOI":"10.1002\/prot.24347","article-title":"Assessment of the assessment: evaluation of the model quality estimates in casp10","volume":"82","author":"Kryshtafovych","year":"2014","journal-title":"Proteins"},{"year":"2016","author":"Li","key":"2023041405371330100_"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"797","DOI":"10.1038\/nchembio.251","article-title":"Computer-aided design of functional protein interactions","volume":"5","author":"Mandell","year":"2009","journal-title":"Nat. Chem. Biol"},{"key":"2023041405371330100_","first-page":"5191","article-title":"Improved knowledge distillation via teacher assistant","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Mirzadeh","year":"2020"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"1800","DOI":"10.1126\/science.1095920","article-title":"Protein kinase inhibitors: insights into drug design from structure","volume":"303","author":"Noble","year":"2004","journal-title":"Science"},{"key":"2023041405371330100_","first-page":"114135","article-title":"Detecting formal thought disorder by deep contextualized word representations","volume-title":"Psychiatry Res","author":"Sarzynska-Wawer","year":"2021"},{"key":"2023041405371330100_","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI Blog"},{"first-page":"9686","year":"2019","author":"Rao","key":"2023041405371330100_"},{"key":"2023041405371330100_","first-page":"8844","volume-title":"International Conference on Machine Learning","author":"Rao","year":"2021"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein 
sequences","volume-title":"Proceedings of the National Academy of Sciences","author":"Rives","year":"2021"},{"year":"2018","author":"Schmitt","key":"2023041405371330100_"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"1026","DOI":"10.1038\/nbt.3988","article-title":"Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets","volume":"35","author":"Steinegger","year":"2017","journal-title":"Nat. Biotechnol"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"926","DOI":"10.1093\/bioinformatics\/btu739","article-title":"Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches","volume":"31","author":"Suzek","year":"2015","journal-title":"Bioinformatics"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"1589","DOI":"10.1093\/bioinformatics\/btg224","article-title":"Pisces: a protein sequence culling server","volume":"19","author":"Wang","year":"2003","journal-title":"Bioinformatics"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"18962","DOI":"10.1038\/srep18962","article-title":"Protein secondary structure prediction using deep convolutional neural fields","volume":"6","author":"Wang","year":"2016","journal-title":"Sci. 
Rep"},{"first-page":"5754","year":"2019","author":"Yang","key":"2023041405371330100_"},{"key":"2023041405371330100_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/1471-2105-15-S8-S3","article-title":"Template-based c8-scorpion: a protein 8-state secondary structure prediction method using structural information and context-based features","volume":"15","author":"Yaseen","year":"2014","journal-title":"BMC Bioinformatics"},{"key":"2023041405371330100_","first-page":"4133","article-title":"A gift from knowledge distillation: Fast optimization, network minimization and transfer learning","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA","author":"Yim","year":"2017"},{"first-page":"1974","year":"2017","author":"Yu","key":"2023041405371330100_"},{"key":"2023041405371330100_","first-page":"745","article-title":"Deep supervised and convolutional generative stochastic network for protein secondary structure prediction","volume-title":"International conference on machine 
learning","author":"Zhou","year":"2014"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac351\/43966102\/btac351.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/14\/3574\/49884387\/btac351.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/14\/3574\/49884387\/btac351.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,22]],"date-time":"2023-11-22T15:38:34Z","timestamp":1700667514000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/14\/3574\/6598795"}},"subtitle":[],"editor":[{"given":"Lenore","family":"Cowen","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,6,2]]},"references-count":33,"journal-issue":{"issue":"14","published-print":{"date-parts":[[2022,7,11]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac351","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2022,7,15]]},"published":{"date-parts":[[2022,6,2]]}}}