{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:33:34Z","timestamp":1772138014332,"version":"3.50.1"},"reference-count":26,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T00:00:00Z","timestamp":1672185600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["K99HG011490"],"award-info":[{"award-number":["K99HG011490"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01GM120609"],"award-info":[{"award-number":["R01GM120609"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Columbia University Precision Medicine Joint Pilot Grants Program"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,1,19]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Accurate variant pathogenicity predictions are important in genetic studies of human diseases. Inframe insertion and deletion variants (indels) alter protein sequence and length, but are not as deleterious as frameshift indels. Inframe indel interpretation is challenging due to limitations in the available number of known pathogenic variants for training. Existing prediction methods largely use manually encoded features including conservation, protein structure and function, and allele frequency to infer variant pathogenicity. 
Recent advances in deep learning modeling of protein sequences and structures provide an opportunity to improve the representation of salient features based on large numbers of protein sequences. We developed a new pathogenicity predictor for SHort Inframe iNsertion and dEletion (SHINE). SHINE uses pretrained protein language models to construct a latent representation of an indel and its protein context from protein sequences and multiple protein sequence alignments, and feeds the latent representation into supervised machine learning models for pathogenicity prediction. We curated training data from ClinVar and gnomAD, and created two test datasets from different sources. SHINE achieved better prediction performance than existing methods for both deletion and insertion variants in these two test datasets. Our work suggests that unsupervised protein language models can provide valuable information about proteins, and new methods based on these models can improve variant interpretation in genetic analyses.<\/jats:p>","DOI":"10.1093\/bib\/bbac584","type":"journal-article","created":{"date-parts":[[2022,12,3]],"date-time":"2022-12-03T02:02:54Z","timestamp":1670032974000},"source":"Crossref","is-referenced-by-count":9,"title":["SHINE: protein language model-based pathogenicity prediction for short inframe insertion and deletion variants"],"prefix":"10.1093","volume":"24","author":[{"given":"Xiao","family":"Fan","sequence":"first","affiliation":[{"name":"Department of Pediatrics, Columbia University , New York, NY , USA"},{"name":"Department of Systems Biology, Columbia University , New York, NY , USA"}]},{"given":"Hongbing","family":"Pan","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, Columbia University , New York, NY , USA"}]},{"given":"Alan","family":"Tian","sequence":"additional","affiliation":[{"name":"Lynbrook High School , San Jose, CA , USA"}]},{"given":"Wendy 
K","family":"Chung","sequence":"additional","affiliation":[{"name":"Department of Pediatrics, Columbia University , New York, NY , USA"},{"name":"Department of Medicine, Columbia University , New York, NY , USA"}]},{"given":"Yufeng","family":"Shen","sequence":"additional","affiliation":[{"name":"Department of Systems Biology, Columbia University , New York, NY , USA"},{"name":"Department of Biomedical Informatics, Columbia University , New York, NY , USA"},{"name":"JP Sulzberger Columbia Genome Center, Columbia University , New York, NY , USA"}]}],"member":"286","published-online":{"date-parts":[[2022,12,28]]},"reference":[{"key":"2023011917101972400_ref1","doi-asserted-by":"crossref","first-page":"628","DOI":"10.1038\/s41586-021-04103-z","article-title":"Exome sequencing and analysis of 454,787 UK biobank participants","volume":"599","author":"Backman","year":"2021","journal-title":"Nature"},{"key":"2023011917101972400_ref2","doi-asserted-by":"crossref","DOI":"10.1101\/2022.06.10.22276179","article-title":"Saturation genome editing of DDX3X clarifies pathogenicity of germline and somatic variation","author":"Radford","year":"2022"},{"key":"2023011917101972400_ref3","doi-asserted-by":"crossref","first-page":"1182","DOI":"10.1101\/gr.4565806","article-title":"An initial map of insertion and deletion (INDEL) variation in the human genome","volume":"16","author":"Mills","year":"2006","journal-title":"Genome Res"},{"key":"2023011917101972400_ref4","doi-asserted-by":"crossref","first-page":"2563","DOI":"10.1093\/gbe\/evz180","article-title":"In-frame indel mutations in the genome of the blind Mexican Cavefish","volume":"11","author":"Berning","year":"2019","journal-title":"Astyanax mexicanus, Genome Biol Evol"},{"key":"2023011917101972400_ref5","doi-asserted-by":"crossref","first-page":"125","DOI":"10.1186\/s13023-016-0505-0","article-title":"The role of small in-frame insertions\/deletions in inherited eye disorders and how structural modelling can help estimate their 
pathogenicity","volume":"11","author":"Sergouniotis","year":"2016","journal-title":"Orphanet J Rare Dis"},{"key":"2023011917101972400_ref6","author":"ClinVar","year":"2021"},{"key":"2023011917101972400_ref7","doi-asserted-by":"crossref","first-page":"D1062","DOI":"10.1093\/nar\/gkx1153","article-title":"ClinVar: improving access to variant interpretations and supporting evidence","volume":"46","author":"Landrum","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2023011917101972400_ref8","article-title":"SIFT Indel: predictions for the functional effects of amino acid insertions\/deletions in proteins","volume":"8","author":"Hu","year":"2013","journal-title":"PLoS One"},{"key":"2023011917101972400_ref9","doi-asserted-by":"crossref","first-page":"R23","DOI":"10.1186\/gb-2013-14-3-r23","article-title":"DDIG-in: discriminating between disease-associated and neutral non-frameshifting micro-indels","volume":"14","author":"Zhao","year":"2013","journal-title":"Genome Biol"},{"key":"2023011917101972400_ref10","doi-asserted-by":"crossref","first-page":"111","DOI":"10.1186\/1471-2105-15-111","article-title":"A comprehensive study of small non-frameshift insertions\/deletions in proteins and prediction of their phenotypic effects by a machine learning method (KD4i)","volume":"15","author":"Bermejo-Das-Neves","year":"2014","journal-title":"BMC Bioinf"},{"key":"2023011917101972400_ref11","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1007\/s00438-014-0922-5","article-title":"Discriminating between deleterious and neutral non-frameshifting indels based on protein interaction networks and hybrid properties","volume":"290","author":"Zhang","year":"2015","journal-title":"Mol Genet Genomics"},{"key":"2023011917101972400_ref12","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1002\/humu.22911","article-title":"Assessing the pathogenicity of insertion and deletion variants with the variant effect scoring tool 
(VEST-Indel)","volume":"37","author":"Douville","year":"2016","journal-title":"Hum Mutat"},{"key":"2023011917101972400_ref13","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1186\/s13059-016-1141-7","article-title":"GAVIN: gene-aware variant INterpretation for medical sequencing","volume":"18","author":"Velde","year":"2017","journal-title":"Genome Biol"},{"key":"2023011917101972400_ref14","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1007112","article-title":"Pathogenicity and functional impact of non-frameshifting insertion\/deletion variation in the human genome","volume":"15","author":"Pagel","year":"2019","journal-title":"PLoS Comput Biol"},{"key":"2023011917101972400_ref15","doi-asserted-by":"crossref","first-page":"D886","DOI":"10.1093\/nar\/gky1016","article-title":"CADD: predicting the deleteriousness of variants throughout the human genome","volume":"47","author":"Rentzsch","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2023011917101972400_ref16","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1186\/s13073-020-00775-w","article-title":"CAPICE: a computational method for consequence-agnostic pathogenicity interpretation of clinical exome variations","volume":"12","author":"Li","year":"2020","journal-title":"Genome Med"},{"key":"2023011917101972400_ref17","doi-asserted-by":"crossref","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proc Natl Acad Sci"},{"key":"2023011917101972400_ref18","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with AlphaFold","volume":"596","author":"Jumper","year":"2021","journal-title":"Nature"},{"key":"2023011917101972400_ref19","doi-asserted-by":"crossref","DOI":"10.1101\/2021.07.09.450648","article-title":"Language models 
enable zero-shot prediction of the effects of mutations on protein function","volume-title":"NeurIPS","author":"Meier","year":"2021"},{"key":"2023011917101972400_ref20","doi-asserted-by":"crossref","DOI":"10.1038\/s41588-022-01148-2","article-title":"Integrating de novo and inherited variants in 42,607 autism cases identifies mutations in new moderate-risk genes","volume-title":"Nat Genet","author":"Zhou"},{"key":"2023011917101972400_ref21","doi-asserted-by":"crossref","first-page":"757","DOI":"10.1038\/s41586-020-2832-5","article-title":"Evidence for 28 genetic disorders discovered by combining healthcare and research data","volume":"586","author":"Kaplanis","year":"2020","journal-title":"Nature"},{"key":"2023011917101972400_ref22","doi-asserted-by":"crossref","first-page":"488","DOI":"10.1016\/j.neuron.2018.01.015","article-title":"SPARK: a US cohort of 50,000 families to accelerate autism research","volume":"97","author":"pfeliciano@simonsfoundation.org SCEa, Consortium S","year":"2018","journal-title":"Neuron"},{"key":"2023011917101972400_ref23","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1038\/s41525-019-0093-8","article-title":"Exome sequencing of 457 autism families recruited online provides evidence for autism risk genes","volume":"4","author":"Feliciano","year":"2019","journal-title":"NPJ Genom Med"},{"key":"2023011917101972400_ref24","doi-asserted-by":"crossref","first-page":"174","DOI":"10.1158\/2159-8290.CD-17-0321","article-title":"Accelerating discovery of functional mutant alleles in cancer","volume":"8","author":"Chang","year":"2018","journal-title":"Cancer Discov"},{"key":"2023011917101972400_ref25","first-page":"8844","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"Rao","year":"2021"},{"key":"2023011917101972400_ref26","first-page":"2825","article-title":"Scikit-learn: machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J Mach Learn 
Res"}],"container-title":["Briefings in Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/24\/1\/bbac584\/48781923\/bbac584.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/24\/1\/bbac584\/48781923\/bbac584.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,19]],"date-time":"2023-01-19T12:25:53Z","timestamp":1674131153000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/doi\/10.1093\/bib\/bbac584\/6961792"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,28]]},"references-count":26,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1,19]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbac584","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2022.08.30.505840","asserted-by":"object"}]},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023,1]]},"published":{"date-parts":[[2022,12,28]]},"article-number":"bbac584"}}