{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,21]],"date-time":"2026-05-21T18:17:32Z","timestamp":1779387452808,"version":"3.53.1"},"reference-count":46,"publisher":"Oxford University Press (OUP)","issue":"9","license":[{"start":{"date-parts":[[2024,8,30]],"date-time":"2024-08-30T00:00:00Z","timestamp":1724976000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,9,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, due to few tasks pairing proteins with the coding DNA sequences (CDS) that can be processed by gLMs.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive and even outperform their pLMs counterparts on some tasks. The best performance was achieved using the retrieved CDS compared to sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that they capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data, and in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>We make our inference code, 3mer pre-trained model weights and datasets available.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae529","type":"journal-article","created":{"date-parts":[[2024,8,30]],"date-time":"2024-08-30T10:46:24Z","timestamp":1725014784000},"source":"Crossref","is-referenced-by-count":15,"title":["Are genomic language models all you need? Exploring genomic language models on protein downstream tasks"],"prefix":"10.1093","volume":"40","author":[{"given":"Sam","family":"Boshar","sequence":"first","affiliation":[{"name":"InstaDeep , Cambridge, MA 02142, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Evan","family":"Trop","sequence":"additional","affiliation":[{"name":"InstaDeep , Cambridge, MA 02142, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6084-6775","authenticated-orcid":false,"given":"Bernardo P","family":"de Almeida","sequence":"additional","affiliation":[{"name":"InstaDeep , Paris 75010, France"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Liviu","family":"Copoiu","sequence":"additional","affiliation":[{"name":"InstaDeep , London W2 1AY, United Kingdom"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Thomas","family":"Pierrot","sequence":"additional","affiliation":[{"name":"InstaDeep , Cambridge, MA 02142, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"286","published-online":{"date-parts":[[2024,8,30]]},"reference":[{"key":"2024091405020394700_btae529-B1","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1002\/prot.25423","article-title":"Assessment of hard target modeling in casp12 reveals an emerging role of alignment-based contact prediction methods","volume":"86","author":"Abriata","year":"2018","journal-title":"Proteins"},{"key":"2024091405020394700_btae529-B2","doi-asserted-by":"crossref","first-page":"1196","DOI":"10.1038\/s41592-021-01252-x","article-title":"Effective gene expression prediction from sequence by integrating long-range interactions","volume":"18","author":"Avsec","year":"2021","journal-title":"Nat Methods"},{"key":"2024091405020394700_btae529-B3","author":"Benegas","year":"2023"},{"key":"2024091405020394700_btae529-B4","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1093\/nar\/28.1.235","article-title":"The protein data bank","volume":"28","author":"Berman","year":"2000","journal-title":"Nucleic Acids Res"},{"key":"2024091405020394700_btae529-B5","author":"Brown","year":"2020"},{"key":"2024091405020394700_btae529-B6","doi-asserted-by":"crossref","first-page":"W402","DOI":"10.1093\/nar\/gkz297","article-title":"The PSIPRED protein analysis workbench: 20 years on","volume":"47","author":"Buchan","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2024091405020394700_btae529-B7","doi-asserted-by":"crossref","first-page":"508","DOI":"10.1002\/(SICI)1097-0134(19990301)34:4<508::AID-PROT10>3.0.CO;2-4","article-title":"Evaluation and improvement of multiple sequence methods for protein secondary structure prediction","volume":"34","author":"Cuff","year":"1999","journal-title":"Proteins"},{"key":"2024091405020394700_btae529-B8","author":"Dalla-Torre","year":"2023"},{"key":"2024091405020394700_btae529-B9","author":"Dallago","year":"2021"},{"key":"2024091405020394700_btae529-B10","author":"de Almeida","year":"2024"},{"key":"2024091405020394700_btae529-B11","author":"Devlin","year":"2018"},{"key":"2024091405020394700_btae529-B12","doi-asserted-by":"crossref","first-page":"7112","DOI":"10.1109\/TPAMI.2021.3095381","article-title":"ProtTrans: toward understanding the language of life through self-supervised learning","volume":"44","author":"Elnaggar","year":"2021","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2024091405020394700_btae529-B13","doi-asserted-by":"crossref","first-page":"1581","DOI":"10.1093\/molbev\/msu081","article-title":"A comprehensive, high-resolution map of a gene\u2019s fitness landscape","volume":"31","author":"Firnberg","year":"2014","journal-title":"Mol Biol Evol"},{"key":"2024091405020394700_btae529-B14","author":"Hallee","year":"2023"},{"key":"2024091405020394700_btae529-B15","doi-asserted-by":"crossref","first-page":"W510","DOI":"10.1093\/nar\/gkac439","article-title":"Netsurfp-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning","volume":"50","author":"H\u00f8ie","year":"2022","journal-title":"Nucleic Acids Res"},{"key":"2024091405020394700_btae529-B16","doi-asserted-by":"crossref","first-page":"495","DOI":"10.1038\/s41592-020-0801-4","article-title":"Meltome atlas\u2014thermal proteome stability across the tree of life","volume":"17","author":"Jarzab","year":"2020","journal-title":"Nat Methods"},{"key":"2024091405020394700_btae529-B17","doi-asserted-by":"crossref","first-page":"2112","DOI":"10.1093\/bioinformatics\/btab083","article-title":"DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome","volume":"37","author":"Ji","year":"2021","journal-title":"Bioinformatics"},{"key":"2024091405020394700_btae529-B18","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1038\/s41586-021-03819-2","article-title":"Highly accurate protein structure prediction with alphafold","volume":"596","author":"Jumper","year":"2021","journal-title":"Nature"},{"key":"2024091405020394700_btae529-B19","doi-asserted-by":"crossref","first-page":"D29","DOI":"10.1093\/nar\/gki098","article-title":"The EMBL nucleotide sequence database","volume":"33","author":"Kanz","year":"2005","journal-title":"Nucleic Acids Res"},{"key":"2024091405020394700_btae529-B20","author":"Kingma","year":"2015"},{"key":"2024091405020394700_btae529-B21","doi-asserted-by":"crossref","first-page":"520","DOI":"10.1002\/prot.25674","article-title":"Netsurfp-2.0: improved prediction of protein structural features by integrated deep learning","volume":"87","author":"Klausen","year":"2019","journal-title":"Proteins Struct Funct Bioinf"},{"key":"2024091405020394700_btae529-B22","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1038\/nature09792","article-title":"Initial impact of the sequencing of the human genome","volume":"470","author":"Lander","year":"2011","journal-title":"Nature"},{"key":"2024091405020394700_btae529-B23","first-page":"1027","author":"Li","year":"2024"},{"key":"2024091405020394700_btae529-B25","doi-asserted-by":"crossref","first-page":"1123","DOI":"10.1126\/science.ade2574","article-title":"Evolutionary-scale prediction of atomic-level protein structure with a language model","volume":"379","author":"Lin","year":"2023","journal-title":"Science"},{"key":"2024091405020394700_btae529-B26","first-page":"1950","article-title":"Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning","volume":"35","author":"Liu","year":"2022","journal-title":"Adv Neural Inf Process Syst"},{"key":"2024091405020394700_btae529-B27","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1186\/s12964-020-00642-6","article-title":"A code within the genetic code: codon usage regulates co-translational protein folding","volume":"18","author":"Liu","year":"2020","journal-title":"Cell Commun Signal"},{"key":"2024091405020394700_btae529-B28","doi-asserted-by":"crossref","first-page":"3744","DOI":"10.1093\/bioinformatics\/btab491","article-title":"Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework","volume":"37","author":"Moffat","year":"2021","journal-title":"Bioinformatics"},{"key":"2024091405020394700_btae529-B29","author":"Nguyen","year":"2023"},{"key":"2024091405020394700_btae529-B30","author":"Nguyen","year":"2024"},{"key":"2024091405020394700_btae529-B31","doi-asserted-by":"crossref","first-page":"170","DOI":"10.1038\/s42256-024-00791-0","article-title":"Codon language embeddings provide strong signals for use in protein engineering","volume":"6","author":"Outeiral","year":"2024","journal-title":"Nat Mach Intell"},{"key":"2024091405020394700_btae529-B32","doi-asserted-by":"crossref","first-page":"539","DOI":"10.1007\/s11033-021-06749-4","article-title":"Codon usage bias","volume":"49","author":"Parvathy","year":"2022","journal-title":"Mol Biol Rep"},{"key":"2024091405020394700_btae529-B33","doi-asserted-by":"crossref","DOI":"10.1126\/science.aay2784","article-title":"Parallel molecular mechanisms for enzyme temperature adaptation","volume":"371","author":"Pinney","year":"2021","journal-title":"Science"},{"key":"2024091405020394700_btae529-B34","author":"Press","year":"2022"},{"key":"2024091405020394700_btae529-B35","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"J Mach Learn Res"},{"key":"2024091405020394700_btae529-B36","first-page":"9689","author":"Rao","year":"2019"},{"key":"2024091405020394700_btae529-B37","doi-asserted-by":"crossref","first-page":"e2016239118","DOI":"10.1073\/pnas.2016239118","article-title":"Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences","volume":"118","author":"Rives","year":"2021","journal-title":"Proc Natl Acad Sci USA"},{"key":"2024091405020394700_btae529-B38","doi-asserted-by":"crossref","first-page":"168","DOI":"10.1126\/science.aan0693","article-title":"Global analysis of protein folding using massively parallel design, synthesis, and testing","volume":"357","author":"Rocklin","year":"2017","journal-title":"Science"},{"key":"2024091405020394700_btae529-B39","doi-asserted-by":"crossref","first-page":"397","DOI":"10.1038\/nature17995","article-title":"Local fitness landscape of the green fluorescent protein","volume":"533","author":"Sarkisyan","year":"2016","journal-title":"Nature"},{"key":"2024091405020394700_btae529-B40","doi-asserted-by":"crossref","first-page":"6719","DOI":"10.1093\/nar\/gkq495","article-title":"Synonymous codon usage influences the local protein structure observed","volume":"38","author":"Saunders","year":"2010","journal-title":"Nucleic Acids Res"},{"key":"2024091405020394700_btae529-B41","doi-asserted-by":"crossref","first-page":"1203","DOI":"10.1098\/rstb.2009.0305","article-title":"Forces that influence the evolution of codon bias","volume":"365","author":"Sharp","year":"2010","journal-title":"Philos Trans R Soc Lond B Biol Sci"},{"key":"2024091405020394700_btae529-B42","author":"Steck","year":"2024"},{"key":"2024091405020394700_btae529-B43","author":"Su","year":"2021"},{"key":"2024091405020394700_btae529-B44","doi-asserted-by":"crossref","first-page":"D523","DOI":"10.1093\/nar\/gkac1052","article-title":"Uniprot: the universal protein knowledgebase in 2023","volume":"51","author":"Uniprot Consortium","year":"2023","journal-title":"Nucleic Acids Res"},{"key":"2024091405020394700_btae529-B45","author":"Xu","year":"2022"},{"key":"2024091405020394700_btae529-B46","first-page":"482","article-title":"Sixty-five years of the long march in protein secondary structure prediction: the final stretch?","volume":"19","author":"Yang","year":"2018","journal-title":"Brief Bioinf"},{"key":"2024091405020394700_btae529-B47","author":"Zhou","year":"2023"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae529\/58972224\/btae529.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/9\/btae529\/59117017\/btae529.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/9\/btae529\/59117017\/btae529.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,14]],"date-time":"2024-09-14T06:14:47Z","timestamp":1726294487000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae529\/7745814"}},"subtitle":[],"editor":[{"given":"Janet","family":"Kelso","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"editor"}]}],"short-title":[],"issued":{"date-parts":[[2024,8,30]]},"references-count":46,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2024,9,2]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae529","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2024.05.20.594989","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,9]]},"published":{"date-parts":[[2024,8,30]]},"article-number":"btae529"}}