{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T10:37:24Z","timestamp":1776335844252,"version":"3.51.2"},"reference-count":57,"publisher":"Oxford University Press (OUP)","issue":"15","license":[{"start":{"date-parts":[[2021,2,4]],"date-time":"2021-02-04T00:00:00Z","timestamp":1612396800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000092","name":"National Library of Medicine","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01LM011297"],"award-info":[{"award-number":["R01LM011297"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,8,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fined tuned to many other sequence analyses tasks.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>The source code, pretrained and finetuned model for DNABERT are available at GitHub (https:\/\/github.com\/jerryji1993\/DNABERT).<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab083","type":"journal-article","created":{"date-parts":[[2021,2,3]],"date-time":"2021-02-03T03:49:59Z","timestamp":1612324199000},"page":"2112-2120","source":"Crossref","is-referenced-by-count":1054,"title":["DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome"],"prefix":"10.1093","volume":"37","author":[{"given":"Yanrong","family":"Ji","sequence":"first","affiliation":[{"name":"Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine , Chicago, IL 60611, USA"}]},{"given":"Zhihan","family":"Zhou","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Northwestern University , Evanston, IL 60208, USA"}]},{"given":"Han","family":"Liu","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Northwestern University , Evanston, IL 60208, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7053-1064","authenticated-orcid":false,"given":"Ramana V","family":"Davuluri","sequence":"additional","affiliation":[{"name":"Department of Biomedical Informatics, Stony Brook University , Stony Brook, NY 11794, USA"}]}],"member":"286","published-online":{"date-parts":[[2021,2,4]]},"reference":[{"key":"2024041009302593200_btab083-B1","doi-asserted-by":"crossref","first-page":"831","DOI":"10.1038\/nbt.3300","article-title":"Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning","volume":"33","author":"Alipanahi","year":"2015","journal-title":"Nat. Biotechnol"},{"key":"2024041009302593200_btab083-B2","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1038\/s41576-019-0173-8","article-title":"Determinants of enhancer and promoter activities of regulatory elements","volume":"21","author":"Andersson","year":"2020","journal-title":"Nat. Rev. Genet"},{"key":"2024041009302593200_btab083-B3","doi-asserted-by":"crossref","first-page":"1659","DOI":"10.1038\/nprot.2017.055","article-title":"Mapping genome-wide transcription-factor binding sites using DAP-seq","volume":"12","author":"Bartlett","year":"2017","journal-title":"Nat. Protoc"},{"key":"2024041009302593200_btab083-B4","doi-asserted-by":"crossref","first-page":"1798","DOI":"10.1109\/TPAMI.2013.50","article-title":"Representation learning: a review and new perspectives","volume":"35","author":"Bengio","year":"2013","journal-title":"IEEE Trans. Pattern Anal"},{"key":"2024041009302593200_btab083-B5","doi-asserted-by":"crossref","first-page":"2561","DOI":"10.1093\/nar\/12.5.2561","article-title":"Genome structure described by formal languages","volume":"12","author":"Brendel","year":"1984","journal-title":"Nucleic Acids Res"},{"key":"2024041009302593200_btab083-B6","doi-asserted-by":"crossref","first-page":"1213","DOI":"10.1038\/nmeth.2688","article-title":"Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position","volume":"10","author":"Buenrostro","year":"2013","journal-title":"Nat. Methods"},{"key":"2024041009302593200_btab083-B7","doi-asserted-by":"crossref","first-page":"D1005","DOI":"10.1093\/nar\/gky1120","article-title":"The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019","volume":"47","author":"Buniello","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2024041009302593200_btab083-B8","author":"Cho","year":"2014"},{"key":"2024041009302593200_btab083-B9","author":"Clauwaert","year":"2020"},{"key":"2024041009302593200_btab083-B10","doi-asserted-by":"crossref","first-page":"445","DOI":"10.1016\/S0092-8674(03)00348-9","article-title":"The multiple sulfatase deficiency gene encodes an essential and limiting factor for the activity of sulfatases","volume":"113","author":"Cosma","year":"2003","journal-title":"Cell"},{"key":"2024041009302593200_btab083-B11","first-page":"412","article-title":"Application of FirstEF to find promoters and first exons in the human genome","volume":"29","author":"Davuluri","year":"2003","journal-title":"Curr.Protoc.Bioinf"},{"key":"2024041009302593200_btab083-B12","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1016\/j.tig.2008.01.008","article-title":"The functional consequences of alternative promoter use in mammalian genomes","volume":"24","author":"Davuluri","year":"2008","journal-title":"Trends Genet"},{"key":"2024041009302593200_btab083-B13","author":"Devlin","year":"2018"},{"key":"2024041009302593200_btab083-B14","doi-asserted-by":"crossref","first-page":"D157","DOI":"10.1093\/nar\/gks1233","article-title":"EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era","volume":"41","author":"Dreos","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2024041009302593200_btab083-B15","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1038\/nature11247","article-title":"An integrated encyclopedia of DNA elements in the human genome","volume":"489","author":"Dunham","year":"2012","journal-title":"Nature"},{"key":"2024041009302593200_btab083-B16","doi-asserted-by":"crossref","first-page":"829","DOI":"10.1038\/nrg3813","article-title":"A census of human RNA-binding proteins","volume":"15","author":"Gerstberger","year":"2014","journal-title":"Nat. Rev. Genet"},{"key":"2024041009302593200_btab083-B17","doi-asserted-by":"crossref","first-page":"8","DOI":"10.3410\/B4-8","article-title":"The context of gene expression regulation","volume":"4","author":"Gibcus","year":"2012","journal-title":"F1000 Biol. Rep"},{"key":"2024041009302593200_btab083-B18","doi-asserted-by":"crossref","first-page":"R24","DOI":"10.1186\/gb-2007-8-2-r24","article-title":"Quantifying similarity between motifs","volume":"8","author":"Gupta","year":"2007","journal-title":"Genome Biol"},{"key":"2024041009302593200_btab083-B19","first-page":"178","author":"Hassanzadeh","year":"2016"},{"key":"2024041009302593200_btab083-B20","doi-asserted-by":"crossref","first-page":"737","DOI":"10.1016\/S0092-8240(87)90018-8","article-title":"Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviors","volume":"49","author":"Head","year":"1987","journal-title":"Bull. Math. Biol"},{"key":"2024041009302593200_btab083-B21","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput"},{"key":"2024041009302593200_btab083-B22","doi-asserted-by":"crossref","first-page":"e71","DOI":"10.1136\/jmg.2006.045377","article-title":"MYO7A mutation screening in Usher syndrome type I patients from diverse origins","volume":"44","author":"Jaijo","year":"2006","journal-title":"J. Med. Genet"},{"key":"2024041009302593200_btab083-B23","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1111\/j.1749-6632.1999.tb08916.x","article-title":"The linguistics of DNA: words, sentences, grammar, phonetics, and semantics","volume":"870","author":"Ji","year":"1999","journal-title":"Ann. N. Y. Acad. Sci. Paper Ed"},{"key":"2024041009302593200_btab083-B24","doi-asserted-by":"crossref","first-page":"134","DOI":"10.1038\/s41598-019-56894-x","article-title":"In silico analysis of alternative splicing on drug\u2013target gene interactions","volume":"10","author":"Ji","year":"2020","journal-title":"Sci. Rep"},{"key":"2024041009302593200_btab083-B25","doi-asserted-by":"crossref","first-page":"990","DOI":"10.1101\/gr.200535.115","article-title":"Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks","volume":"26","author":"Kelley","year":"2016","journal-title":"Genome Res"},{"key":"2024041009302593200_btab083-B26","doi-asserted-by":"crossref","first-page":"e72","DOI":"10.1093\/nar\/gky237","article-title":"A novel method for improved accuracy of transcription factor binding site prediction","volume":"46","author":"Khamis","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2024041009302593200_btab083-B27","doi-asserted-by":"crossref","first-page":"6069","DOI":"10.1093\/nar\/gkr028","article-title":"Crosstalk between c-Jun and TAp73alpha\/beta contributes to the apoptosis-survival balance","volume":"39","author":"Koeppel","year":"2011","journal-title":"Nucleic Acids Res"},{"key":"2024041009302593200_btab083-B28","doi-asserted-by":"crossref","first-page":"D980","DOI":"10.1093\/nar\/gkt1113","article-title":"ClinVar: public archive of relationships among sequence variation and human phenotype","volume":"42","author":"Landrum","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2024041009302593200_btab083-B29","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1038\/nature14539","article-title":"Deep learning","volume":"521","author":"LeCun","year":"2015","journal-title":"Nature"},{"key":"2024041009302593200_btab083-B30","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"2024041009302593200_btab083-B31","doi-asserted-by":"crossref","first-page":"i185","DOI":"10.1093\/bioinformatics\/btu273","article-title":"GRASP: analysis of genotype\u2013phenotype results from 1390 genome-wide association studies and corresponding open access database","volume":"30","author":"Leslie","year":"2014","journal-title":"Bioinformatics"},{"key":"2024041009302593200_btab083-B32","doi-asserted-by":"crossref","first-page":"e14830","DOI":"10.2196\/14830","article-title":"Fine-tuning bidirectional encoder representations from transformers (BERT)-based models on large-scale electronic health record notes: an empirical study","volume":"7","author":"Li","year":"2019","journal-title":"JMIR Med. Inform"},{"key":"2024041009302593200_btab083-B33","doi-asserted-by":"crossref","first-page":"2729","DOI":"10.1093\/bioinformatics\/btw288","article-title":"Predicting regulatory variants with composite statistic","volume":"32","author":"Li","year":"2016","journal-title":"Bioinformatics"},{"key":"2024041009302593200_btab083-B34","first-page":"5631","article-title":"Interaction of polymorphisms in xerodermapigmentosum group C with cigarette smoking and pancreatic cancer risk","volume":"16","author":"Liang","year":"2018","journal-title":"OncolLett"},{"key":"2024041009302593200_btab083-B35","author":"Liu","year":"2019"},{"key":"2024041009302593200_btab083-B36","doi-asserted-by":"crossref","first-page":"3169","DOI":"10.1103\/PhysRevLett.73.3169","article-title":"Linguistic features of noncoding DNA sequences","volume":"73","author":"Mantegna","year":"1994","journal-title":"Phys. Rev. Lett"},{"key":"2024041009302593200_btab083-B37","author":"Min","year":"2019"},{"key":"2024041009302593200_btab083-B38","doi-asserted-by":"crossref","first-page":"418","DOI":"10.1186\/gb-2012-13-8-418","article-title":"An encyclopedia of mouse DNA elements (Mouse ENCODE)","volume":"13","author":"Mouse","year":"2012","journal-title":"Genome Biol"},{"key":"2024041009302593200_btab083-B39","doi-asserted-by":"crossref","first-page":"1161","DOI":"10.1073\/pnas.53.5.1161","article-title":"RNA codewords and protein synthesis, VII. On the general nature of the RNA code","volume":"53","author":"Nirenberg","year":"1965","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2024041009302593200_btab083-B40","doi-asserted-by":"crossref","first-page":"286","DOI":"10.3389\/fgene.2019.00286","article-title":"DeePromoter: robust promoter predictor using deep learning","volume":"10","author":"Oubounyt","year":"2019","journal-title":"Front. Genet"},{"key":"2024041009302593200_btab083-B41","doi-asserted-by":"crossref","first-page":"e107","DOI":"10.1093\/nar\/gkw226","article-title":"DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences","volume":"44","author":"Quang","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2024041009302593200_btab083-B42","first-page":"579","article-title":"The linguistics of DNA","volume":"80","author":"Searls","year":"1992","journal-title":"Am. Sci"},{"key":"2024041009302593200_btab083-B43","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1038\/nature01255","article-title":"The language of genes","volume":"420","author":"Searls","year":"2002","journal-title":"Nature"},{"key":"2024041009302593200_btab083-B44","first-page":"1","article-title":"Recurrent neural network for predicting transcription factor binding sites","volume":"8","author":"Shen","year":"2018","journal-title":"Sci. Rep. UK"},{"key":"2024041009302593200_btab083-B45","doi-asserted-by":"crossref","first-page":"308","DOI":"10.1093\/nar\/29.1.308","article-title":"dbSNP: the NCBI database of genetic variation","volume":"29","author":"Sherry","year":"2001","journal-title":"Nucleic Acids Res"},{"key":"2024041009302593200_btab083-B46","doi-asserted-by":"crossref","first-page":"S10","DOI":"10.1186\/gb-2006-7-s1-s10","article-title":"Automatic annotation of eukaryotic genes, pseudogenes and promoters","volume":"7","author":"Solovyev","year":"2006","journal-title":"Genome Biol"},{"key":"2024041009302593200_btab083-B47","doi-asserted-by":"crossref","first-page":"2730","DOI":"10.1093\/bioinformatics\/bty1068","article-title":"Promoter analysis and prediction in the human genome using sequence-based deep learning models","volume":"35","author":"Umarov","year":"2019","journal-title":"Bioinformatics"},{"key":"2024041009302593200_btab083-B48","first-page":"6000","author":"Vaswani","year":"2017"},{"key":"2024041009302593200_btab083-B49","doi-asserted-by":"crossref","first-page":"1206","DOI":"10.1158\/1541-7786.MCR-16-0459","article-title":"The landscape of isoform switches in human cancers","volume":"15","author":"Vitting-Seerup","year":"2017","journal-title":"Mol. Cancer Res"},{"key":"2024041009302593200_btab083-B50","doi-asserted-by":"crossref","first-page":"652","DOI":"10.1186\/s12859-019-3306-3","article-title":"SpliceFinder: ab initio prediction of splice sites using convolutional neural network","volume":"20","author":"Wang","year":"2019","journal-title":"BMC Bioinformatics"},{"key":"2024041009302593200_btab083-B51","doi-asserted-by":"crossref","first-page":"802","DOI":"10.1261\/rna.876308","article-title":"Splicing regulation: from a parts list of regulatory elements to an integrated splicing code","volume":"14","author":"Wang","year":"2008","journal-title":"RNA"},{"key":"2024041009302593200_btab083-B52","doi-asserted-by":"crossref","first-page":"520","DOI":"10.1038\/nature01262","article-title":"Initial sequencing and comparative analysis of the mouse genome","volume":"420","author":"Waterston","year":"2002","journal-title":"Nature"},{"key":"2024041009302593200_btab083-B53","first-page":"pp. 5754","author":"Yang","year":"2019"},{"key":"2024041009302593200_btab083-B54","doi-asserted-by":"crossref","first-page":"15632","DOI":"10.1073\/pnas.242597299","article-title":"Gene expression profiling of isogenic cells with different TP53 gene dosage reveals numerous genes that are affected by TP53 dosage and identifies CSPG2 as a direct target of p53","volume":"99","author":"Yoon","year":"2002","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2024041009302593200_btab083-B55","doi-asserted-by":"crossref","first-page":"841","DOI":"10.1007\/s13042-019-00990-x","article-title":"DeepSite: bidirectional LSTM and CNN models for predicting DNA-protein binding","volume":"11","author":"Zhang","year":"2020","journal-title":"Int. J. Mach. Learn. Cyb"},{"key":"2024041009302593200_btab083-B56","doi-asserted-by":"crossref","first-page":"931","DOI":"10.1038\/nmeth.3547","article-title":"Predicting effects of noncoding variants with deep learning-based sequence model","volume":"12","author":"Zhou","year":"2015","journal-title":"Nat. Methods"},{"key":"2024041009302593200_btab083-B57","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1038\/s41588-018-0295-5","article-title":"A primer on deep learning in genomics","volume":"51","author":"Zou","year":"2019","journal-title":"Nat. Genet"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab083\/36253031\/btab083.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/15\/2112\/57195892\/btab083.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/15\/2112\/57195892\/btab083.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,4,10]],"date-time":"2024-04-10T05:38:52Z","timestamp":1712727532000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/15\/2112\/6128680"}},"subtitle":[],"editor":[{"given":"Janet","family":"Kelso","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,2,4]]},"references-count":57,"journal-issue":{"issue":"15","published-print":{"date-parts":[[2021,8,9]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab083","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.09.17.301879","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,8,1]]},"published":{"date-parts":[[2021,2,4]]}}}