{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,23]],"date-time":"2026-03-23T15:10:57Z","timestamp":1774278657287,"version":"3.50.1"},"reference-count":44,"publisher":"Oxford University Press (OUP)","issue":"4","license":[{"start":{"date-parts":[[2025,2,25]],"date-time":"2025-02-25T00:00:00Z","timestamp":1740441600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/pages\/standard-publication-reuse-rights"}],"funder":[{"DOI":"10.13039\/100000092","name":"National Library of Medicine","doi-asserted-by":"publisher","award":["R01LM014306"],"award-info":[{"award-number":["R01LM014306"]}],"id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objectives<\/jats:title>\n                  <jats:p>The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of natural language processing (NLP) techniques, particularly large language models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. The goal of this review is to assess data and model accessibility in the most recent literature, gaining a better understanding of the existing capabilities and constraints of these tools in processing genomic sequencing data.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Discussion<\/jats:title>\n                  <jats:p>The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while also providing a better understanding of its complex structures. It has the potential to drive advancements in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is also needed to discuss and overcome current limitations, enhancing model transparency and applicability.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Conclusion<\/jats:title>\n                  <jats:p>This review highlights the growing role of NLP, particularly LLMs, in genomic sequencing data analysis. While these models improve data processing and regulatory annotation prediction, challenges remain in accessibility and interpretability. Further research is needed to refine their application in genomics.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocaf029","type":"journal-article","created":{"date-parts":[[2025,2,25]],"date-time":"2025-02-25T17:17:22Z","timestamp":1740503842000},"page":"761-772","source":"Crossref","is-referenced-by-count":6,"title":["Deciphering genomic codes using advanced natural language processing techniques: a scoping review"],"prefix":"10.1093","volume":"32","author":[{"given":"Shuyan","family":"Cheng","sequence":"first","affiliation":[{"name":"Department of Population Health Sciences, Weill Cornell Medicine , New York, NY 10065,","place":["United States"]}]},{"given":"Yishu","family":"Wei","sequence":"additional","affiliation":[{"name":"Department of Population Health Sciences, Weill Cornell Medicine , New York, NY 10065,","place":["United States"]}]},{"given":"Yiliang","family":"Zhou","sequence":"additional","affiliation":[{"name":"Department of Population Health Sciences, Weill Cornell Medicine , New York, NY 10065,","place":["United States"]}]},{"given":"Zihan","family":"Xu","sequence":"additional","affiliation":[{"name":"Department of Population Health Sciences, Weill Cornell Medicine , New York, NY 10065,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1776-5427","authenticated-orcid":false,"given":"Drew N","family":"Wright","sequence":"additional","affiliation":[{"name":"Samuel J. Wood Library & C.V. Starr Biomedical Information Center, Weill Cornell Medicine , New York, NY 10065,","place":["United States"]}]},{"given":"Jinze","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Public Health, Virginia Commonwealth University , Richmond, VA 23219,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9309-8331","authenticated-orcid":false,"given":"Yifan","family":"Peng","sequence":"additional","affiliation":[{"name":"Department of Population Health Sciences, Weill Cornell Medicine , New York, NY 10065,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2025,2,25]]},"reference":[{"key":"2025041716421567000_ocaf029-B1","doi-asserted-by":"publisher","first-page":"3198","DOI":"10.1016\/j.csbj.2021.05.039","article-title":"Representation learning applications in biological sequence analysis","volume":"19","author":"Iuchi","year":"2021","journal-title":"Comput Struct Biotechnol J"},{"issue":"12","key":"2025041716421567000_ocaf029-B2","doi-asserted-by":"publisher","first-page":"1868","DOI":"10.1038\/s41592-023-02105-5","article-title":"Large models for genomics","volume":"20","author":"Tang","year":"2023","journal-title":"Nat Methods"},{"key":"2025041716421567000_ocaf029-B3","author":"ScienceDirect Topics. Human Genome. Accessed","year":"2024"},{"key":"2025041716421567000_ocaf029-B4","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1038\/nrg.2016.49","article-title":"Coming of age: ten years of next-generation sequencing technologies","volume":"17","author":"Goodwin","year":"2016","journal-title":"Nat Rev Genet"},{"key":"2025041716421567000_ocaf029-B5","doi-asserted-by":"publisher","author":"Dotan","DOI":"10.1093\/bioinformatics\/btae196"},{"key":"2025041716421567000_ocaf029-B6","doi-asserted-by":"publisher","author":"Jiang","DOI":"10.1002\/wcms.1725"},{"key":"2025041716421567000_ocaf029-B7","doi-asserted-by":"publisher","author":"Consens","year":"2023","DOI":"10.48550\/arXiv.2311.07621"},{"issue":"7","key":"2025041716421567000_ocaf029-B8","doi-asserted-by":"publisher","first-page":"1033","DOI":"10.3390\/biology12071033","article-title":"Transformer architecture and attention mechanisms in genome data analysis: a comprehensive review","volume":"12","author":"Choi","year":"2023","journal-title":"Biology"},{"key":"2025041716421567000_ocaf029-B9","author":"Covidence systematic review software, Veritas Health Innovation"},{"key":"2025041716421567000_ocaf029-B10","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1109\/TCBB.2020.3035021","article-title":"Novel transformer networks for improved sequence labeling in genomics","volume":"19","author":"Clauwaert","year":"2022","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"key":"2025041716421567000_ocaf029-B11","doi-asserted-by":"publisher","author":"Hossain","year":"2022","DOI":"10.1109\/IBSSC56953.2022.10037492"},{"key":"2025041716421567000_ocaf029-B12","doi-asserted-by":"publisher","first-page":"2112","DOI":"10.1093\/bioinformatics\/btab083","article-title":"DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome","volume":"37","author":"Ji","year":"2021","journal-title":"Bioinformatics"},{"key":"2025041716421567000_ocaf029-B13","doi-asserted-by":"publisher","first-page":"bbab005","DOI":"10.1093\/bib\/bbab005","article-title":"A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information","volume":"22","author":"Le","year":"2021","journal-title":"Brief Bioinform"},{"key":"2025041716421567000_ocaf029-B14","doi-asserted-by":"publisher","first-page":"107732","DOI":"10.1016\/j.compbiolchem.2022.107732","article-title":"BERT-promoter: an improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection","volume":"99","author":"Le","year":"2022","journal-title":"Comput Biol Chem"},{"key":"2025041716421567000_ocaf029-B15","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1007\/s12539-022-00537-9","article-title":"Improving language model of human genome for DNA\u2013protein binding prediction based on task-specific pre-training","volume":"15","author":"Luo","year":"2023","journal-title":"Interdiscip Sci"},{"key":"2025041716421567000_ocaf029-B16","doi-asserted-by":"publisher","author":"Rajkumar","year":"2022","DOI":"10.1145\/3535508.3545551"},{"key":"2025041716421567000_ocaf029-B17","doi-asserted-by":"publisher","author":"Roy","year":"2023","DOI":"10.3233\/FAIA230492"},{"key":"2025041716421567000_ocaf029-B18","doi-asserted-by":"publisher","first-page":"e16600","DOI":"10.7717\/peerj.16600","article-title":"BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT","volume":"11","author":"Wang","year":"2023","journal-title":"PeerJ"},{"key":"2025041716421567000_ocaf029-B19","doi-asserted-by":"publisher","first-page":"e1010779","DOI":"10.1371\/journal.pcbi.1010779","article-title":"Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework","volume":"18","author":"Wang","year":"2022","journal-title":"PLoS Comput Biol"},{"key":"2025041716421567000_ocaf029-B20","doi-asserted-by":"publisher","first-page":"663","DOI":"10.1007\/978-3-031-13829-4_57","author":"Zhang","year":"2022"},{"key":"2025041716421567000_ocaf029-B21","doi-asserted-by":"publisher","first-page":"568","DOI":"10.3390\/genes13040568","article-title":"SemanticCAP: chromatin accessibility prediction enhanced by features learning from a language model","volume":"13","author":"Zhang","year":"2022","journal-title":"Genes (Basel)"},{"key":"2025041716421567000_ocaf029-B22","doi-asserted-by":"publisher","author":"An","year":"2022","DOI":"10.1145\/3535508.3545512"},{"key":"2025041716421567000_ocaf029-B23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/ICCIT60459.2023.10441209","author":"Hossain","year":"2023"},{"key":"2025041716421567000_ocaf029-B24","doi-asserted-by":"publisher","first-page":"107035","DOI":"10.1016\/j.cmpb.2022.107035","article-title":"Predicting gene expression levels from DNA sequences and post-transcriptional information with transformers","volume":"225","author":"Pipoli","year":"2022","journal-title":"Comput Methods Programs Biomed"},{"key":"2025041716421567000_ocaf029-B25","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1093\/gigascience\/giad054","article-title":"MuLan-Methyl\u2014multiple transformer-based language models for accurate DNA methylation prediction","volume":"12","author":"Zeng","year":"2022","journal-title":"Gigascience"},{"key":"2025041716421567000_ocaf029-B26","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1186\/s12859-024-05649-1","article-title":"MSCAN: multi-scale self- and cross-attention network for RNA methylation site prediction","volume":"25","author":"Wang","year":"2024","journal-title":"BMC Bioinformatics"},{"key":"2025041716421567000_ocaf029-B27","doi-asserted-by":"publisher","first-page":"272","DOI":"10.3934\/mbe.2024013","article-title":"MTTLm6A: a multi-task transfer learning approach for base-resolution mRNA m6A site prediction based on an improved transformer","volume":"21","author":"Wang","year":"2024","journal-title":"Math Biosci Eng"},{"key":"2025041716421567000_ocaf029-B28","doi-asserted-by":"publisher","first-page":"5384","DOI":"10.1021\/acs.jcim.3c00952","article-title":"BCMCMI: a fusion model for predicting circRNA-miRNA interactions combining semantic and meta-path","volume":"63","author":"Wei","year":"2023","journal-title":"J Chem Inf Model"},{"key":"2025041716421567000_ocaf029-B29","doi-asserted-by":"publisher","first-page":"107421","DOI":"10.1016\/j.compbiomed.2023.107421","article-title":"An efficient circRNA-miRNA interaction prediction model by combining biological text mining and wavelet diffusion-based sparse network structure embedding","volume":"165","author":"Wang","year":"2023","journal-title":"Comput Biol Med"},{"key":"2025041716421567000_ocaf029-B30","doi-asserted-by":"publisher","first-page":"65","DOI":"10.1016\/j.ymeth.2023.107733","article-title":"miTDS: uncovering miRNA-mRNA interactions with deep learning for functional target prediction","volume":"223","author":"Zhang","year":"2024","journal-title":"Methods"},{"key":"2025041716421567000_ocaf029-B31","doi-asserted-by":"publisher","first-page":"129","DOI":"10.1016\/j.csbj.2023.11.011","article-title":"TMSC-m7G: a transformer architecture based on multi-sense-scaled embedding features and convolutional neural network to identify RNA N7-methylguanosine sites","volume":"23","author":"Zhang","year":"2024","journal-title":"Comput Struct Biotechnol J"},{"key":"2025041716421567000_ocaf029-B32","doi-asserted-by":"publisher","first-page":"287","DOI":"10.15302\/J-QB-022-0323","article-title":"Transformer-based DNA methylation detection on ionic signals from Oxford Nanopore sequencing data","volume":"11","author":"Wang","year":"2023","journal-title":"Quant Biol"},{"key":"2025041716421567000_ocaf029-B33","doi-asserted-by":"publisher","first-page":"1535","DOI":"10.1109\/BIBM55023.2023.10237492","author":"Jhee","year":"2023"},{"key":"2025041716421567000_ocaf029-B34","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/CIBCB52886.2022.9863058","author":"Jurenaite","year":"2022"},{"key":"2025041716421567000_ocaf029-B35","doi-asserted-by":"publisher","first-page":"846638","DOI":"10.3389\/fnins.2022.846638","article-title":"Alzheimer\u2019s disease classification through imaging genetic data with IGnet","volume":"16","author":"Wang","year":"2022","journal-title":"Front Neurosci"},{"key":"2025041716421567000_ocaf029-B36","doi-asserted-by":"publisher","first-page":"150","DOI":"10.1016\/j.synbio.2019.09.001","article-title":"The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer","volume":"4","author":"Huang","year":"2019","journal-title":"Synth Syst Biotechnol"},{"key":"2025041716421567000_ocaf029-B37","doi-asserted-by":"publisher","first-page":"4171","DOI":"10.18653\/v1\/N19","author":"Devlin","year":"2019"},{"key":"2025041716421567000_ocaf029-B38","author":"Face"},{"key":"2025041716421567000_ocaf029-B39","first-page":"282","author":"Lafferty","year":"2001"},{"key":"2025041716421567000_ocaf029-B40","doi-asserted-by":"publisher","first-page":"1470","DOI":"10.1038\/s41592-024-02201-0","article-title":"scGPT: toward building a foundation model for single-cell multi-omics using generative AI","volume":"21","author":"Cui","year":"2024","journal-title":"Nat Methods"},{"key":"2025041716421567000_ocaf029-B41","doi-asserted-by":"publisher","author":"Benegas","DOI":"10.1016\/j.tig.2024.11.013"},{"key":"2025041716421567000_ocaf029-B42","doi-asserted-by":"publisher","first-page":"860","DOI":"10.1093\/cid\/ciad633","article-title":"Black box warning: large language models and the future of infectious diseases consultation","volume":"78","author":"Schwartz","year":"2024","journal-title":"Clin Infect Dis"},{"key":"2025041716421567000_ocaf029-B43","doi-asserted-by":"crossref","first-page":"btae196","DOI":"10.1093\/bioinformatics\/btae196","article-title":"Effect of tokenization on transformers for biological sequences","volume":"40","author":"Dotan","year":"2024","journal-title":"Bioinformatics"},{"key":"2025041716421567000_ocaf029-B44","doi-asserted-by":"publisher","first-page":"3231","DOI":"10.1038\/ncomms4231","article-title":"Gene co-expression network analysis reveals common system-level properties of prognostic genes across cancer types","volume":"5","author":"Yang","year":"2014","journal-title":"Nat Commun"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/32\/4\/761\/62167460\/ocaf029.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/32\/4\/761\/62167460\/ocaf029.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,17]],"date-time":"2025-04-17T20:42:23Z","timestamp":1744922543000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/32\/4\/761\/8042189"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,25]]},"references-count":44,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,2,25]]},"published-print":{"date-parts":[[2025,4,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocaf029","relation":{},"ISSN":["1067-5027","1527-974X"],"issn-type":[{"value":"1067-5027","type":"print"},{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,4]]},"published":{"date-parts":[[2025,2,25]]}}}