{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T23:27:29Z","timestamp":1780356449748,"version":"3.54.1"},"reference-count":24,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2025,5,22]],"date-time":"2025-05-22T00:00:00Z","timestamp":1747872000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"University of Technology and Applied Sciences","award":["IRFP-IBRI-24-15"],"award-info":[{"award-number":["IRFP-IBRI-24-15"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>The classification of genomic sequences is a crucial area of research in the field of virology. This is due to the increasing number of outbreaks we have faced in recent times. We have a vast repository of genomic sequences from various species, including humans, animals, plants, bacteria, and viruses, which tend to mutate and form new variants or strains. In the realm of machine learning, several models are employed for genome sequence classification. Among these are traditional algorithms such as Random Forest (RF), K-nearest neighbors (KNNs), Decision Tree (DT), and Naive Bayes (NB), each offering unique advantages in handling genetic data. Additionally, deep learning models like Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Bi-Directional LSTM networks are utilized for their robust capabilities in capturing complex patterns and dependencies within genomic sequences. In this study, we explored the application of Natural Language Processing (NLP) techniques to classify the genomic sequences. The focus of our research involves utilizing advanced large language models (LLMs) such as DNABERT, DNAGPT, and GENA LM, which are fine-tuned explicitly on the language of DNA. In this research, after a detailed analysis, we found that DNAGPT achieved an accuracy of 96%, which exceeds the performance of state-of-the-art machine learning and deep learning models.<\/jats:p>","DOI":"10.3390\/a18060302","type":"journal-article","created":{"date-parts":[[2025,5,22]],"date-time":"2025-05-22T10:24:45Z","timestamp":1747909485000},"page":"302","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Enhanced Viral Genome Classification Using Large Language Models"],"prefix":"10.3390","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5768-6072","authenticated-orcid":false,"given":"Hemalatha","family":"Gunasekaran","sequence":"first","affiliation":[{"name":"College of Computing and Information Sciences, University of Technology and Applied Sciences, Ibri 516, Oman"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Nesaian Reginal","family":"Wilfred Blessing","sequence":"additional","affiliation":[{"name":"College of Computing and Information Sciences, University of Technology and Applied Sciences, Ibri 516, Oman"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Umar","family":"Sathic","sequence":"additional","affiliation":[{"name":"College of Computing and Information Sciences, University of Technology and Applied Sciences, Ibri 516, Oman"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4864-9485","authenticated-orcid":false,"given":"Mohammad Shahid","family":"Husain","sequence":"additional","affiliation":[{"name":"College of Computing and Information Sciences, University of Technology and Applied Sciences, Ibri 516, Oman"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2025,5,22]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"100147","DOI":"10.1016\/j.slast.2024.100147","article-title":"Assessment and classification of COVID-19 DNA sequence using pairwise features concatenation from multi-transformer and deep features with machine learning models","volume":"29","author":"Qayyum","year":"2024","journal-title":"J. Assoc. Lab. Autom."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"11220","DOI":"10.1073\/pnas.2005335117","article-title":"Evidence from internet search data shows information-seeking responses to news of local COVID-19 cases","volume":"117","author":"Bento","year":"2020","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"344","DOI":"10.1093\/bioinformatics\/btab672","article-title":"Tiara: Deep learning-based classification system for eukaryotic sequences","volume":"38","author":"Karlicki","year":"2022","journal-title":"Bioinformatics"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Klapproth, C., Sen, R., Stadler, P.F., Findei\u00df, S., and Fallmann, J. (2021). Common Features in lncRNA Annotation and Classification: A Survey. Non-Coding RNA, 7.","DOI":"10.3390\/ncrna7040077"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Yang, A., Zhang, W., Wang, J., Yang, K., Han, Y., and Zhang, L. (2020). Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA. Front. Bioeng. Biotechnol., 8.","DOI":"10.3389\/fbioe.2020.01032"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"124","DOI":"10.1186\/s40537-023-00804-6","article-title":"Optimizing classification efficiency with machine learning techniques for pattern matching","volume":"10","author":"Hamed","year":"2023","journal-title":"J. Big Data"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Rahman, A., Zaman, S., and Das, D. (2024, January 8\u20139). Cracking the Genetic Codes: Exploring DNA Sequence Classification with Machine Learning Algorithms and Voting Ensemble Strategies. Proceedings of the 2024 International Conference on Advances in Computing, Communication, Electrical, and Smart Systems (iCACCESS), Dhaka, Bangladesh.","DOI":"10.1109\/iCACCESS61735.2024.10499483"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"280","DOI":"10.4236\/jbise.2016.95021","article-title":"DNA Sequence Classification by Convolutional Neural Network","volume":"9","author":"Nguyen","year":"2016","journal-title":"J. Biomed. Sci. Eng."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Gomes, J.C., Masood, A.I., Silva, L.H.d.S., Ferreira, J.R.B.d.C., J\u00fanior, A.A.F., Rocha, A.L.d.S., de Oliveira, L.C.P., da Silva, N.R.C., Fernandes, B.J.T., and dos Santos, W.P. (2021). Covid-19 diagnosis by combining RT-PCR and pseudo-convolutional machines to characterize virus sequences. Sci. Rep., 11.","DOI":"10.1038\/s41598-021-90766-7"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1835056","DOI":"10.1155\/2021\/1835056","article-title":"Analysis of DNA Sequence Classification Using CNN and Hybrid Models","volume":"2021","author":"Gunasekaran","year":"2021","journal-title":"Comput. Math. Methods Med."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Choi, S.R., and Lee, M. (2023). Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review. Biology, 12.","DOI":"10.3390\/biology12071033"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"2112","DOI":"10.1093\/bioinformatics\/btab083","article-title":"DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome","volume":"37","author":"Ji","year":"2021","journal-title":"Bioinformatics"},{"key":"ref_13","unstructured":"Zhou, Z., Ji, Y., Li, W., Dutta, P., Davuluri, R., and Liu, H. (2023). DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"e91","DOI":"10.1093\/nar\/gkae783","article-title":"DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors","volume":"52","author":"Kabir","year":"2024","journal-title":"Nucleic Acids Res."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"gkae1310","DOI":"10.1093\/nar\/gkae1310","article-title":"GENA-LM: A family of open-source foundational DNA language models for long sequences","volume":"53","author":"Fishman","year":"2025","journal-title":"Nucleic Acids Res."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Zhang, D., Zhang, W., Zhao, Y., Zhang, J., He, B., Qin, C., and Yao, J. (2023). DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks. arXiv.","DOI":"10.1101\/2023.07.11.548628"},{"key":"ref_17","first-page":"43177","article-title":"HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution","volume":"36","author":"Nguyen","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"3205","DOI":"10.1021\/acssynbio.3c00154","article-title":"Nucleic Transformer: Classifying DNA Sequences with Self-Attention and Convolutions","volume":"12","author":"He","year":"2023","journal-title":"ACS Synth. Biol."},{"key":"ref_19","unstructured":"Wang, Z., Wang, Z., Jiang, J., Chen, P., Shi, X., and Li, Y. (2025). Large Language Models in Bioinformatics: A Survey. arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"3498","DOI":"10.1016\/j.csbj.2024.09.031","article-title":"Large language models and their applications in bioinformatics","volume":"23","author":"Sarumi","year":"2024","journal-title":"Comput. Struct. Biotechnol. J."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1613\/jair.953","article-title":"SMOTE: Synthetic Minority Over-sampling Technique","volume":"16","author":"Chawla","year":"2002","journal-title":"J. Artif. Intell. Res."},{"key":"ref_22","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention Is All You Need. arXiv."},{"key":"ref_23","unstructured":"Zhang, X., Beinke, B., Kindhi, B.A., and Wiering, M. (2020). Comparing Machine Learning Algorithms with or without Feature Extraction for DNA Classification. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"323","DOI":"10.1016\/j.aej.2024.03.066","article-title":"U-Net for genomic sequencing: A novel approach to DNA sequence classification","volume":"96","author":"Mohammed","year":"2024","journal-title":"Alex. Eng. J."}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/18\/6\/302\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:38:39Z","timestamp":1760031519000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/18\/6\/302"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,22]]},"references-count":24,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2025,6]]}},"alternative-id":["a18060302"],"URL":"https:\/\/doi.org\/10.3390\/a18060302","relation":{},"ISSN":["1999-4893"],"issn-type":[{"value":"1999-4893","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,22]]}}}