{
  "status": "ok",
  "message-type": "work",
  "message-version": "1.0.0",
  "message": {
    "indexed": {"date-parts": [[2026,1,19]], "date-time": "2026-01-19T14:08:07Z", "timestamp": 1768831687576, "version": "3.49.0"},
    "reference-count": 16,
    "publisher": "Oxford University Press (OUP)",
    "issue": "10",
    "license": [{"start": {"date-parts": [[2023,10,10]], "date-time": "2023-10-10T00:00:00Z", "timestamp": 1696896000000}, "content-version": "vor", "delay-in-days": 9, "URL": "https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],
    "funder": [{"name": "Grant-in-Aid for Scientific Research", "award": ["JSPS KAKENHI"], "award-info": [{"award-number": ["JSPS KAKENHI"]}]}],
    "content-domain": {"domain": [], "crossmark-restriction": false},
    "short-container-title": [],
    "published-print": {"date-parts": [[2023,10,3]]},
    "abstract": "<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>In recent years, pre-training with the transformer architecture has gained significant attention. While this approach has led to notable performance improvements across a variety of downstream tasks, the underlying mechanisms by which pre-training models influence these tasks, particularly in the context of biological data, are not yet fully elucidated.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>In this study, focusing on the pre-training on nucleotide sequences, we decompose a pre-training model of Bidirectional Encoder Representations from Transformers (BERT) into its embedding and encoding modules to analyze what a pre-trained model learns from nucleotide sequences. Through a comparative study of non-standard pre-training at both the data and model levels, we find that a typical BERT model learns to capture overlapping-consistent k-mer embeddings for its token representation within its embedding module. Interestingly, using the k-mer embeddings pre-trained on random data can yield similar performance in downstream tasks, when compared with those using the k-mer embeddings pre-trained on real biological sequences. We further compare the learned k-mer embeddings with other established k-mer representations in downstream tasks of sequence-based functional prediction. Our experimental results demonstrate that the dense representation of k-mers learned from pre-training can be used as a viable alternative to one-hot encoding for representing nucleotide sequences. Furthermore, integrating the pre-trained k-mer embeddings with simpler models can achieve competitive performance in two typical downstream tasks.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The source code and associated data can be accessed at https:\/\/github.com\/yaozhong\/bert_investigation.<\/jats:p>\n               <\/jats:sec>",
    "DOI": "10.1093\/bioinformatics\/btad617",
    "type": "journal-article",
    "created": {"date-parts": [[2023,10,10]], "date-time": "2023-10-10T15:31:46Z", "timestamp": 1696951906000},
    "source": "Crossref",
    "is-referenced-by-count": 9,
    "title": ["Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings"],
    "prefix": "10.1093",
    "volume": "39",
    "author": [
      {"ORCID": "https:\/\/orcid.org\/0000-0001-5598-2521", "authenticated-orcid": false, "given": "Yao-zhong", "family": "Zhang", "sequence": "first", "affiliation": [{"name": "Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo , Minato-ku, Tokyo 108-8639, Japan"}]},
      {"ORCID": "https:\/\/orcid.org\/0000-0002-4921-0936", "authenticated-orcid": false, "given": "Zeheng", "family": "Bai", "sequence": "additional", "affiliation": [{"name": "Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo , Minato-ku, Tokyo 108-8639, Japan"}]},
      {"ORCID": "https:\/\/orcid.org\/0000-0002-2989-308X", "authenticated-orcid": false, "given": "Seiya", "family": "Imoto", "sequence": "additional", "affiliation": [{"name": "Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo , Minato-ku, Tokyo 108-8639, Japan"}]}
    ],
    "member": "286",
    "published-online": {"date-parts": [[2023,10,10]]},
    "reference": [
      {"key": "2023102913435425800_btad617-B1", "author": "Devlin"},
      {"key": "2023102913435425800_btad617-B2", "author": "Dosovitskiy"},
      {"key": "2023102913435425800_btad617-B55802988", "doi-asserted-by": "crossref", "first-page": "D51", "DOI": "10.1093\/nar\/gkw1069", "article-title": "The eukaryotic promoter database in its 30th year: focus on non-vertebrate organisms", "volume": "45", "author": "Dreos", "year": "2017", "journal-title": "Nucleic Acids Res"},
      {"key": "2023102913435425800_btad617-B6410094", "doi-asserted-by": "crossref", "first-page": "7112", "DOI": "10.1109\/TPAMI.2021.3095381", "article-title": "ProtTrans: Toward understanding the language of life through self-supervised learning", "volume": "44", "author": "Elnaggar", "year": "2022", "journal-title": "IEEE Trans Pattern Anal Mach Intell"},
      {"key": "2023102913435425800_btad617-B3", "first-page": "15", "author": "Hinton", "year": "2002"},
      {"key": "2023102913435425800_btad617-B4", "doi-asserted-by": "crossref", "first-page": "2112", "DOI": "10.1093\/bioinformatics\/btab083", "article-title": "DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome", "volume": "37", "author": "Ji", "year": "2021", "journal-title": "Bioinformatics"},
      {"key": "2023102913435425800_btad617-B5", "doi-asserted-by": "crossref", "first-page": "583", "DOI": "10.1038\/s41586-021-03819-2", "article-title": "Highly accurate protein structure prediction with alphafold", "volume": "596", "author": "Jumper", "year": "2021", "journal-title": "Nature"},
      {"key": "2023102913435425800_btad617-B6", "first-page": "1", "article-title": "Context dependency of nucleotide probabilities and variants in human DNA", "volume": "23", "author": "Liang", "year": "2022", "journal-title": "BMC Genomics"},
      {"key": "2023102913435425800_btad617-B7", "author": "Ng", "year": "2017"},
      {"key": "2023102913435425800_btad617-B8", "doi-asserted-by": "crossref", "first-page": "125", "DOI": "10.1038\/s41576-022-00532-2", "article-title": "Obtaining genetics insights from deep learning via explainable artificial intelligence", "volume": "24", "author": "Novakovsky", "year": "2023", "journal-title": "Nat Rev Genet"},
      {"key": "2023102913435425800_btad617-B9", "doi-asserted-by": "crossref", "first-page": "286", "DOI": "10.3389\/fgene.2019.00286", "article-title": "DeePromoter: robust promoter predictor using deep learning", "volume": "10", "author": "Oubounyt", "year": "2019", "journal-title": "Front Genet"},
      {"key": "2023102913435425800_btad617-B10", "first-page": "2825", "article-title": "Scikit-learn: machine learning in python", "volume": "12", "author": "Pedregosa", "year": "2011", "journal-title": "J Mach Learn Res"},
      {"key": "2023102913435425800_btad617-B11", "author": "Rao"},
      {"key": "2023102913435425800_btad617-B12", "first-page": "30", "author": "Vaswani", "year": "2017"},
      {"key": "2023102913435425800_btad617-B13", "doi-asserted-by": "crossref", "first-page": "2642", "DOI": "10.1093\/bioinformatics\/bty178", "article-title": "Learned protein embeddings for machine learning", "volume": "34", "author": "Yang", "year": "2018", "journal-title": "Bioinformatics"},
      {"key": "2023102913435425800_btad617-B14", "doi-asserted-by": "crossref", "first-page": "i121", "DOI": "10.1093\/bioinformatics\/btw255", "article-title": "Convolutional neural network architectures for predicting DNA\u2013protein binding", "volume": "32", "author": "Zeng", "year": "2016", "journal-title": "Bioinformatics"}
    ],
    "container-title": ["Bioinformatics"],
    "original-title": [],
    "language": "en",
    "link": [
      {"URL": "https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btad617\/51990559\/btad617.pdf", "content-type": "application\/pdf", "content-version": "am", "intended-application": "syndication"},
      {"URL": "https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/10\/btad617\/52673534\/btad617.pdf", "content-type": "application\/pdf", "content-version": "vor", "intended-application": "syndication"},
      {"URL": "https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/10\/btad617\/52673534\/btad617.pdf", "content-type": "unspecified", "content-version": "vor", "intended-application": "similarity-checking"}
    ],
    "deposited": {"date-parts": [[2023,10,29]], "date-time": "2023-10-29T14:10:07Z", "timestamp": 1698588607000},
    "score": 1,
    "resource": {"primary": {"URL": "https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btad617\/7303863"}},
    "subtitle": [],
    "editor": [{"given": "Valentina", "family": "Boeva", "sequence": "additional", "affiliation": []}],
    "short-title": [],
    "issued": {"date-parts": [[2023,10,1]]},
    "references-count": 16,
    "journal-issue": {"issue": "10", "published-print": {"date-parts": [[2023,10,3]]}},
    "URL": "https:\/\/doi.org\/10.1093\/bioinformatics\/btad617",
    "relation": {},
    "ISSN": ["1367-4811"],
    "issn-type": [{"value": "1367-4811", "type": "electronic"}],
    "subject": [],
    "published-other": {"date-parts": [[2023,10,1]]},
    "published": {"date-parts": [[2023,10,1]]},
    "article-number": "btad617"
  }
}