{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,13]],"date-time":"2026-03-13T07:57:55Z","timestamp":1773388675894,"version":"3.50.1"},"reference-count":35,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2025,2,23]],"date-time":"2025-02-23T00:00:00Z","timestamp":1740268800000},"content-version":"vor","delay-in-days":93,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,11,22]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans, and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model tends to outperform other pLMs (including the multimodal ESM3) by recovering phylogenetic relationships among homologous protein sequences in both low- and high-gap settings. pLMs agree with conventional phylogenetic methods in general, but more so for protein families with fewer implied indels, highlighting indels as a key factor differentiating classical phylogenetics from pLMs. We find that pLMs preferentially capture broader as opposed to finer evolutionary relationships within a specific protein family, where ESM2 has a sweet spot for highly divergent sequences, at remote distance. 
Less than 10% of neurons are sufficient to broadly recapitulate classical phylogenetic distances; when used in isolation, the difference between the paradigms is further diminished. We show these neurons are polysemantic, shared among different homologous families but never fully overlapping. We highlight the potential of ESM2 as a complementary tool for phylogenetic analysis, especially when extending to remote homologs that are difficult to align and imply complex histories of insertions and deletions. Implementations of analyses are available at https:\/\/github.com\/santule\/pLMEvo.<\/jats:p>","DOI":"10.1093\/bib\/bbaf047","type":"journal-article","created":{"date-parts":[[2025,2,20]],"date-time":"2025-02-20T23:20:19Z","timestamp":1740093619000},"source":"Crossref","is-referenced-by-count":10,"title":["Do protein language models learn phylogeny?"],"prefix":"10.1093","volume":"26","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5630-5792","authenticated-orcid":false,"given":"Sanjana","family":"Tule","sequence":"first","affiliation":[{"name":"School of Chemistry and Molecular Biosciences, The University of Queensland , Brisbane, QLD 4072 ,","place":["Australia"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0487-2629","authenticated-orcid":false,"given":"Gabriel","family":"Foley","sequence":"additional","affiliation":[{"name":"School of Chemistry and Molecular Biosciences, The University of Queensland , Brisbane, QLD 4072 ,","place":["Australia"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3548-268X","authenticated-orcid":false,"given":"Mikael","family":"Bod\u00e9n","sequence":"additional","affiliation":[{"name":"School of Chemistry and Molecular Biosciences, The University of Queensland , Brisbane, QLD 4072 ,","place":["Australia"]}]}],"member":"286","published-online":{"date-parts":[[2025,2,23]]},"reference":[{"key":"2025022310134613100_ref1","doi-asserted-by":"publisher","first-page":"1530","DOI":"10.1093\/molbev\/msaa015","article-title":"IQ-TREE 2: new 
models and efficient methods for phylogenetic inference in the genomic era","volume":"37","author":"Minh","year":"2020","journal-title":"Mol Biol Evol"},{"key":"2025022310134613100_ref2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pone.0009490","article-title":"FastTree 2 \u2013 approximately maximum-likelihood trees for large alignments","volume":"5","author":"Dehal","year":"2010","journal-title":"PloS One"},{"key":"2025022310134613100_ref3","doi-asserted-by":"publisher","first-page":"1307","DOI":"10.1093\/molbev\/msn067","article-title":"An improved general amino acid replacement matrix","volume":"25","author":"Le","year":"2008","journal-title":"Mol Biol Evol"},{"key":"2025022310134613100_ref4","doi-asserted-by":"publisher","first-page":"S4","DOI":"10.1186\/1471-2105-16-S5-S4","article-title":"Improving multiple sequence alignment by using better guide trees","volume":"16","author":"Zhan","year":"2015","journal-title":"BMC Bioinform"},{"key":"2025022310134613100_ref5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pone.0056143","article-title":"Iteratively refined guide trees help improving alignment and phylogenetic inference in the mushroom family bolbitiaceae","volume":"8","author":"T\u00f3th","year":"2013","journal-title":"PloS One"},{"key":"2025022310134613100_ref6","doi-asserted-by":"publisher","first-page":"e77","DOI":"10.1371\/journal.pcbi.0020077","article-title":"Functional classification using phylogenomic inference","volume":"2","author":"Brown","year":"2006","journal-title":"PLoS Comput Biol"},{"key":"2025022310134613100_ref7","doi-asserted-by":"publisher","first-page":"233","DOI":"10.1080\/07388550802512633","article-title":"Protein function predictions based on the phylogenetic profile method","volume":"28","author":"Jiang","year":"2008","journal-title":"Crit Rev 
Biotechnol"},{"key":"2025022310134613100_ref8","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1093\/bib\/bbaa337","article-title":"FireProtASR: a web server for fully automated ancestral sequence reconstruction","volume":"22","author":"Musil","year":"2020","journal-title":"Brief Bioinform"},{"key":"2025022310134613100_ref9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1371\/journal.pcbi.1010633","article-title":"Engineering indel and substitution variants of diverse and ancient enzymes using graphical representation of ancestral sequence predictions","volume":"18","author":"Foley","year":"2022","journal-title":"PLoS Comput Biol"},{"key":"2025022310134613100_ref10","doi-asserted-by":"publisher","first-page":"e2016239118","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proc Natl Acad Sci"},{"key":"2025022310134613100_ref11","doi-asserted-by":"publisher","first-page":"09","DOI":"10.1093\/bioinformatics\/btad579","article-title":"pLM-BLAST: distant homology detection based on direct comparison of sequence representations from protein language models","volume":"39","author":"Kaminski","year":"2023","journal-title":"Bioinformatics"},{"key":"2025022310134613100_ref12","doi-asserted-by":"publisher","first-page":"284","DOI":"10.1126\/science.abd7331","article-title":"Learning the language of viral evolution and escape","volume":"371","author":"Hie","year":"2021","journal-title":"Science"},{"key":"2025022310134613100_ref13","doi-asserted-by":"publisher","first-page":"274","DOI":"10.1016\/j.cels.2022.01.003","article-title":"Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins","volume":"13","author":"Hie","year":"2022","journal-title":"Cell 
Syst"},{"key":"2025022310134613100_ref14","doi-asserted-by":"publisher","first-page":"85","DOI":"10.1186\/s12859-024-05699-5","article-title":"Protein embedding based alignment","volume":"25","author":"Giovanni Iovino","year":"2024","journal-title":"BMC Bioinform"},{"key":"2025022310134613100_ref15","doi-asserted-by":"crossref","first-page":"btad786","DOI":"10.1093\/bioinformatics\/btad786","article-title":"Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone","volume":"40","author":"Pantolini","year":"2024","journal-title":"Bioinformatics"},{"key":"2025022310134613100_ref16","doi-asserted-by":"crossref","DOI":"10.1101\/2021.07.09.450648","article-title":"Language models enable zero-shot prediction of the effects of mutations on protein function","volume-title":"Advances in Neural Information Processing Systems","author":"Meier","year":"2021"},{"key":"2025022310134613100_ref17","doi-asserted-by":"publisher","first-page":"809","DOI":"10.1038\/s43588-021-00168-y","article-title":"Cluster learning-assisted directed evolution","volume":"1","author":"Qiu","year":"2021","journal-title":"Nat Comput Sci"},{"key":"2025022310134613100_ref18","doi-asserted-by":"publisher","first-page":"1026","DOI":"10.1016\/j.cels.2021.07.008","article-title":"Informed training set design enables efficient machine learning-assisted directed protein evolution","volume":"12","author":"Wittmann","year":"2021","journal-title":"Cell Syst"},{"key":"2025022310134613100_ref19","doi-asserted-by":"publisher","first-page":"6298","DOI":"10.1038\/s41467-022-34032-y","article-title":"Protein language models trained on multiple sequence alignments learn phylogenetic relationships","volume":"13","author":"Lupo","year":"2022","journal-title":"Nat Commun"},{"key":"2025022310134613100_ref20","article-title":"Msa transformer","volume-title":"Proceedings of the 38th International Conference on Machine 
Learning","author":"Rao","year":"2021"},{"key":"2025022310134613100_ref21","doi-asserted-by":"publisher","first-page":"1123","DOI":"10.1126\/science.ade2574","article-title":"Evolutionary-scale prediction of atomic-level protein structure with a language model","volume":"379","author":"Lin","year":"2023","journal-title":"Science"},{"key":"2025022310134613100_ref22","doi-asserted-by":"publisher","first-page":"7112","DOI":"10.1109\/TPAMI.2021.3095381","article-title":"ProtTrans: toward understanding the language of life through self-supervised learning","volume":"44","author":"Elnaggar","year":"2022","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2025022310134613100_ref23","doi-asserted-by":"publisher","DOI":"10.1101\/2024.07.01.600583","article-title":"Simulating 500 million years of evolution with a language model","author":"Hayes","journal-title":"bioRxiv"},{"key":"2025022310134613100_ref24","doi-asserted-by":"publisher","DOI":"10.1093\/nargab\/lqae150","article-title":"ProstT5: bilingual language model for protein sequence and structure","volume-title":"NAR Genomics and Bioinformatics","author":"Heinzinger","year":"2024"},{"key":"2025022310134613100_ref25","doi-asserted-by":"publisher","first-page":"D412","DOI":"10.1093\/nar\/gkaa913","article-title":"Pfam: the protein families database in 2021","volume":"49","author":"Mistry","year":"2020","journal-title":"Nucleic Acids Res"},{"key":"2025022310134613100_ref26","doi-asserted-by":"publisher","DOI":"10.3389\/neuro.06.004.2008","article-title":"Representational similarity analysis - connecting the branches of systems neuroscience","volume":"2","author":"Kriegeskorte","year":"2008","journal-title":"Front Syst Neurosci"},{"key":"2025022310134613100_ref27","doi-asserted-by":"publisher","volume-title":"Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for 
NLP","author":"Abnar","DOI":"10.18653\/v1\/W19-4820"},{"key":"2025022310134613100_ref28","doi-asserted-by":"publisher","first-page":"1635","DOI":"10.1093\/molbev\/msw046","article-title":"ETE 3: reconstruction, analysis, and visualization of phylogenomic data","volume":"33","author":"Huerta-Cepas","year":"2016","journal-title":"Mol Biol Evol"},{"key":"2025022310134613100_ref29","doi-asserted-by":"publisher","article-title":"The geometry of hidden representations of protein language models","author":"Valeriani","DOI":"10.1101\/2022.10.24.513504"},{"key":"2025022310134613100_ref30","doi-asserted-by":"publisher","first-page":"469","DOI":"10.1007\/BF02589501","article-title":"The significance probability of the smirnov two-sample test","volume":"3","author":"Hodges","year":"1958","journal-title":"Ark Mat"},{"key":"2025022310134613100_ref31","doi-asserted-by":"publisher","first-page":"6309","DOI":"10.1609\/aaai.v33i01.33016309","article-title":"What is one grain of sand in the desert? Analyzing individual neurons in deep nlp models","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Dalvi","year":"2018"},{"key":"2025022310134613100_ref32","article-title":"Toy models of superposition","author":"Elhage"},{"key":"2025022310134613100_ref33","doi-asserted-by":"publisher","first-page":"499","DOI":"10.1002\/prot.22458","article-title":"Structure is three to ten times more conserved than sequence\u2013a study of structural response in protein cores","volume":"77","author":"Illerg\u00e5rd","year":"2009","journal-title":"Proteins"},{"key":"2025022310134613100_ref34","doi-asserted-by":"publisher","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with alphafold","volume":"596","author":"Jumper","year":"2021","journal-title":"Nature"},{"key":"2025022310134613100_ref35","doi-asserted-by":"publisher","article-title":"Protein language models are biased by unequal sequence sampling 
across the tree of life","author":"Ding","DOI":"10.1101\/2024.03.07.584001"}],"container-title":["Briefings in Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/26\/1\/bbaf047\/62054908\/bbaf047.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/26\/1\/bbaf047\/62054908\/bbaf047.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,23]],"date-time":"2025-02-23T05:14:02Z","timestamp":1740287642000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/doi\/10.1093\/bib\/bbaf047\/8030578"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,22]]},"references-count":35,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,11,22]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbaf047","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2024.09.23.614642","asserted-by":"object"}]},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,1]]},"published":{"date-parts":[[2024,11,22]]},"article-number":"bbaf047"}}