{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T01:53:39Z","timestamp":1780624419043,"version":"3.54.1"},"reference-count":29,"publisher":"Springer Science and Business Media LLC","issue":"8","license":[{"start":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T00:00:00Z","timestamp":1721692800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T00:00:00Z","timestamp":1721692800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Nat Mach Intell"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Deep-learning models that learn a sense of language on DNA have achieved a high level of performance on genome biological tasks. Genome sequences follow rules similar to natural language but are distinct in the absence of a concept of words. We established byte-pair encoding on the human genome and trained a foundation language model called GROVER (Genome Rules Obtained Via Extracted Representations) with the vocabulary selected via a custom task, next-<jats:italic>k<\/jats:italic>-mer prediction. The defined dictionary of tokens in the human genome carries best the information content for GROVER. Analysing learned representations, we observed that trained token embeddings primarily encode information related to frequency, sequence content and length. Some tokens are primarily localized in repeats, whereas the majority widely distribute over the genome. GROVER also learns context and lexical ambiguity. Average trained embeddings of genomic regions relate to functional genomics annotation and thus indicate learning of these structures purely from the contextual relationships of tokens. This highlights the extent of information content encoded by the sequence that can be grasped by GROVER. On fine-tuning tasks addressing genome biology with questions of genome element identification and protein\u2013DNA binding, GROVER exceeds other models\u2019 performance. GROVER learns sequence context, a sense for structure and language rules. Extracting this knowledge can be used to compose a grammar book for the code of life.<\/jats:p>","DOI":"10.1038\/s42256-024-00872-0","type":"journal-article","created":{"date-parts":[[2024,7,23]],"date-time":"2024-07-23T10:10:03Z","timestamp":1721729403000},"page":"911-923","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":86,"title":["DNA language model GROVER learns sequence context in the human genome"],"prefix":"10.1038","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4345-1074","authenticated-orcid":false,"given":"Melissa","family":"Sanabria","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jonas","family":"Hirsch","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Pierre M.","family":"Joubert","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3056-4360","authenticated-orcid":false,"given":"Anna R.","family":"Poetsch","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2024,7,23]]},"reference":[{"key":"872_CR1","doi-asserted-by":"publisher","first-page":"860","DOI":"10.1038\/35057062","volume":"409","author":"ES Lander","year":"2001","unstructured":"Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860\u2013921 (2001).","journal-title":"Nature"},{"key":"872_CR2","doi-asserted-by":"publisher","first-page":"1227","DOI":"10.1038\/1921227a0","volume":"192","author":"FH Crick","year":"1961","unstructured":"Crick, F. H., Barnett, L., Brenner, S. & Watts-Tobin, R. J. General nature of the genetic code for proteins. Nature 192, 1227\u20131232 (1961).","journal-title":"Nature"},{"key":"872_CR3","unstructured":"Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems (eds Guyon, I. et al.) (IEEE, 2017); https:\/\/proceedings.neurips.cc\/paper\/7181-attention-is-all"},{"key":"872_CR4","unstructured":"Brown, T. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems (eds Larochelle, H. et al.) 1877\u20131901 (IEEE, 2020)."},{"key":"872_CR5","doi-asserted-by":"publisher","first-page":"1196","DOI":"10.1038\/s41592-021-01252-x","volume":"18","author":"\u017d Avsec","year":"2021","unstructured":"Avsec, \u017d. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196\u20131203 (2021).","journal-title":"Nat. Methods"},{"key":"872_CR6","doi-asserted-by":"publisher","first-page":"e81","DOI":"10.1093\/nar\/gkac326","volume":"50","author":"M Yang","year":"2022","unstructured":"Yang, M. et al. Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Res. 50, e81 (2022).","journal-title":"Nucleic Acids Res."},{"key":"872_CR7","doi-asserted-by":"publisher","first-page":"2112","DOI":"10.1093\/bioinformatics\/btab083","volume":"37","author":"Y Ji","year":"2021","unstructured":"Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 37, 2112\u20132120 (2021).","journal-title":"Bioinformatics"},{"key":"872_CR8","unstructured":"Dalla-Torre, H. et al. The Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Preprint at bioRxiv https:\/\/www.biorxiv.org\/content\/10.1101\/2023.01.11.523679.abstract (2023)."},{"key":"872_CR9","doi-asserted-by":"publisher","unstructured":"Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint at https:\/\/doi.org\/10.48550\/arXiv.1810.04805 (2018).","DOI":"10.48550\/arXiv.1810.04805"},{"key":"872_CR10","doi-asserted-by":"crossref","unstructured":"Sanabria, M., Hirsch, J. & Poetsch, A. R. Distinguishing word identity and sequence context in DNA language models. Preprint at bioRxiv https:\/\/www.biorxiv.org\/content\/10.1101\/2023.07.11.548593 (2023).","DOI":"10.1101\/2023.07.11.548593"},{"key":"872_CR11","unstructured":"Mo, S. et al. Multi-modal self-supervised pre-training for large-scale genome data. Poster at NeurIPS 2021 AI for Science Workshop. OpenReview.net https:\/\/openreview.net\/forum?id=fdV-GZ4LPfn (2021)."},{"key":"872_CR12","unstructured":"Nguyen, E. et al. Hyenadna: long-range genomic sequence modeling at single nucleotide resolution. Preprint at https:\/\/arxiv.org\/pdf\/2306.15794 (2023)."},{"key":"872_CR13","unstructured":"Zhou, Z. et al. Dnabert-2: efficient foundation model and benchmark for multi-species genome. Preprint at https:\/\/arxiv.org\/pdf\/2306.15006 (2023)."},{"key":"872_CR14","doi-asserted-by":"publisher","first-page":"337","DOI":"10.1109\/TIT.1977.1055714","volume":"23","author":"J Ziv","year":"1977","unstructured":"Ziv, J. & Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 337\u2013343 (1977).","journal-title":"IEEE Trans. Inf. Theory"},{"key":"872_CR15","doi-asserted-by":"publisher","first-page":"151","DOI":"10.1007\/BF00278187","volume":"78","author":"DN Cooper","year":"1988","unstructured":"Cooper, D. N. & Youssoufian, H. The CpG dinucleotide and human genetic disease. Hum. Genet. 78, 151\u2013155 (1988).","journal-title":"Hum. Genet."},{"key":"872_CR16","doi-asserted-by":"publisher","first-page":"445","DOI":"10.1016\/S0021-9258(18)65663-7","volume":"208","author":"RL Sinsheimer","year":"1954","unstructured":"Sinsheimer, R. L. The action of pancreatic desoxyribonuclease. I. Isolation of mono- and dinucleotides. J. Biol. Chem. 208, 445\u2013459 (1954).","journal-title":"J. Biol. Chem."},{"key":"872_CR17","doi-asserted-by":"publisher","first-page":"226","DOI":"10.1126\/science.187.4173.226","volume":"187","author":"R Holliday","year":"1975","unstructured":"Holliday, R. & Pugh, J. E. DNA modification mechanisms and gene activity during development. Science 187, 226\u2013232 (1975).","journal-title":"Science"},{"key":"872_CR18","doi-asserted-by":"publisher","first-page":"S8","DOI":"10.1016\/j.ctrv.2011.04.010","volume":"37","author":"AR Poetsch","year":"2011","unstructured":"Poetsch, A. R. & Plass, C. Transcriptional regulation by DNA methylation. Cancer Treat. Rev. 37, S8\u2013S12 (2011).","journal-title":"Cancer Treat. Rev."},{"key":"872_CR19","doi-asserted-by":"publisher","first-page":"335","DOI":"10.1016\/S0168-9525(97)01181-5","volume":"13","author":"JA Yoder","year":"1997","unstructured":"Yoder, J. A., Walsh, C. P. & Bestor, T. H. Cytosine methylation and the ecology of intragenomic parasites. Trends Genet. 13, 335\u2013340 (1997).","journal-title":"Trends Genet."},{"key":"872_CR20","unstructured":"Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at https:\/\/arxiv.org\/pdf\/1301.3781.pdf (2013)."},{"key":"872_CR21","doi-asserted-by":"crossref","unstructured":"Ethayarajh, K. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. Preprint at https:\/\/arxiv.org\/pdf\/1909.00512 (2019).","DOI":"10.18653\/v1\/D19-1006"},{"key":"872_CR22","doi-asserted-by":"publisher","first-page":"555","DOI":"10.1016\/j.molcel.2019.02.036","volume":"74","author":"T Sultana","year":"2019","unstructured":"Sultana, T. et al. The landscape of L1 retrotransposons in the human genome is shaped by pre-insertion sequence biases and post-insertion selection. Mol. Cell 74, 555\u2013570.e7 (2019).","journal-title":"Mol. Cell"},{"key":"872_CR23","doi-asserted-by":"crossref","unstructured":"The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57\u201374 (2012).","DOI":"10.1038\/nature11247"},{"key":"872_CR24","doi-asserted-by":"crossref","unstructured":"Sennrich, R., Haddow, B. & Birch, A. Neural machine translation of rare words with subword units. Preprint at https:\/\/arxiv.org\/pdf\/1508.07909.pdf (2015).","DOI":"10.18653\/v1\/P16-1162"},{"key":"872_CR25","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1186\/s13072-020-00343-x","volume":"13","author":"LS Pongor","year":"2020","unstructured":"Pongor, L. S. et al. BAMscale: quantification of next-generation sequencing peaks and generation of scaled coverage tracks. Epigenetics Chromatin 13, 21 (2020).","journal-title":"Epigenetics Chromatin"},{"key":"872_CR26","doi-asserted-by":"publisher","unstructured":"Sanabria, M., Hirsch, J. & Poetsch, A. R. GROVER pretrained DNA language model of the human genome. Zenodo https:\/\/doi.org\/10.5281\/zenodo.8373117 (2023).","DOI":"10.5281\/zenodo.8373117"},{"key":"872_CR27","doi-asserted-by":"publisher","unstructured":"Sanabria, M., Hirsch, J. & Poetsch, A. R. GROVER tokenized Human Genome hg19 data set. Zenodo https:\/\/doi.org\/10.5281\/zenodo.8373053 (2023).","DOI":"10.5281\/zenodo.8373053"},{"key":"872_CR28","doi-asserted-by":"publisher","unstructured":"Sanabria, M., Hirsch, J., Joubert, P. & Poetsch, A. R. The human genome\u2019s vocabulary as proposed by the DNA language model GROVER - the code to the paper. Zenodo https:\/\/doi.org\/10.5281\/zenodo.8373202 (2023).","DOI":"10.5281\/zenodo.8373202"},{"key":"872_CR29","doi-asserted-by":"publisher","unstructured":"Sanabria, M., Hirsch, J. & Poetsch, A. R. GROVER DNA language model tutorial. Zenodo https:\/\/doi.org\/10.5281\/zenodo.8373158 (2023).","DOI":"10.5281\/zenodo.8373158"}],"container-title":["Nature Machine Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00872-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00872-0","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00872-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,29]],"date-time":"2024-08-29T18:11:12Z","timestamp":1724955072000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00872-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,23]]},"references-count":29,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2024,8]]}},"alternative-id":["872"],"URL":"https:\/\/doi.org\/10.1038\/s42256-024-00872-0","relation":{},"ISSN":["2522-5839"],"issn-type":[{"value":"2522-5839","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,23]]},"assertion":[{"value":"31 August 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 June 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 July 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}