{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,6]],"date-time":"2026-02-06T00:34:57Z","timestamp":1770338097472,"version":"3.49.0"},"reference-count":20,"publisher":"Springer Science and Business Media LLC","issue":"5","license":[{"start":{"date-parts":[[2025,8,27]],"date-time":"2025-08-27T00:00:00Z","timestamp":1756252800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,8,27]],"date-time":"2025-08-27T00:00:00Z","timestamp":1756252800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100022402","name":"Instituto Polit\u00e9cnico de Lisboa","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100022402","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int. J. Inf. Secur."],"published-print":{"date-parts":[[2025,10]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>The increasing sophistication of cybersecurity threats demands innovative solutions for network defense and security education. This study evaluates the potential of Large Language Models (LLMs) in automating cybersecurity tasks, specifically focusing on three critical areas: Honeypot generation, Malware Detection, and Capture The Flag (CTF) exercise creation. We introduce two novel evaluation frameworks: the Cybersecurity Language Understanding (CSLU) benchmark, which assesses model knowledge through domain-specific multiple-choice questions, and an automated evaluation system that measures models\u2019 ability to generate functional security artifacts. Using these frameworks, we evaluated seven state-of-the-art LLMs, including GPT-4, Gemini Pro, and Claude 3 Opus. 
Results demonstrate that current LLMs exhibit strong capabilities in Malware analysis, with four models achieving perfect scores. However, performance varied significantly in CTF exercise generation, indicating areas for improvement. GPT-4, Gemini Pro, and Claude 3 Opus consistently outperformed other models across all tasks. Performance patterns suggest that model size correlates with task effectiveness, though architecture-specific optimizations also play a significant role. Our findings indicate that LLMs can effectively automate certain cybersecurity tasks, particularly in Malware Detection and analysis. However, their capabilities vary across different security domains, suggesting the need for specialized training or domain-specific adaptations for optimal performance.<\/jats:p>","DOI":"10.1007\/s10207-025-01112-1","type":"journal-article","created":{"date-parts":[[2025,8,27]],"date-time":"2025-08-27T11:43:49Z","timestamp":1756295029000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Evaluation of the maturity of LLMs in the cybersecurity domain"],"prefix":"10.1007","volume":"24","author":[{"given":"Tiago","family":"Concei\u00e7\u00e3o","sequence":"first","affiliation":[]},{"given":"Nuno","family":"Cruz","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,8,27]]},"reference":[{"key":"1112_CR1","doi-asserted-by":"crossref","unstructured":"Bhambri, S., Chauhan, P., Araujo, F., Doup\u00e9, A., Kambhampati, S.: Using deception in markov game to understand adversarial behaviors through a capture-the-flag environment, (2022)","DOI":"10.1007\/978-3-031-26369-9_5"},{"key":"1112_CR2","doi-asserted-by":"crossref","unstructured":"Bhusal, D., Alam, M.\u00a0T., Nguyen, L., Mahara, A., Lightcap, Z., Frazier, R., Fieblinger, R., Torales, G.\u00a0L., Blakely, B.\u00a0A., Rastogi, N.: Secure: Benchmarking large language models for cybersecurity 
advisory, (2024). arxiv:2405.20441","DOI":"10.1109\/ACSAC63791.2024.00019"},{"key":"1112_CR3","unstructured":"Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A.\u00a0N., Li, T., et\u00a0al.: Chatbot arena: An open platform for evaluating llms by human preference, (2024). arxiv:2403.04132"},{"key":"1112_CR4","doi-asserted-by":"publisher","unstructured":"Fadziso, T., Thaduri, U.\u00a0R., Dekkati, S., Ballamudi, V.\u00a0K.\u00a0R., Desamsetti, H.: Evolution of the Cyber Security Threat: An Overview of the Scale of Cyber Threat. 9 (2023). https:\/\/doi.org\/10.6084\/m9.figshare.24189921.v1. https:\/\/figshare.com\/articles\/journal_contribution\/_b_Evolution_of_the_Cyber_Security_Threat_An_Overview_of_the_Scale_of_Cyber_Threat_b_\/24189921","DOI":"10.6084\/m9.figshare.24189921.v1"},{"key":"1112_CR5","unstructured":"Fang, C., Miao, N., Srivastav, S., Liu, J., Zhang, R., Fang, R., Asmita, A., Tsang, R., Nazari, N., Wang, H., Homayoun, H.: Large language models for code analysis: Do llms really do their job?, (2023)"},{"key":"1112_CR6","unstructured":"Gao, Z., Wang, H., Zhou, Y., Zhu, W., Zhang, C.: How far have we gone in vulnerability detection using large language models, (2023)"},{"key":"1112_CR7","doi-asserted-by":"publisher","first-page":"901","DOI":"10.1007\/s11416-024-00536-y","volume":"20","author":"A Guerra-Manzanares","year":"2024","unstructured":"Guerra-Manzanares, A., Bahsi, H.: Experts still needed: boosting long-term android malware detection with active learning. J. Comput. Virol. Hack. Tech. 20, 901\u2013918 (2024). https:\/\/doi.org\/10.1007\/s11416-024-00536-y","journal-title":"J. Comput. Virol. Hack. Tech."},{"key":"1112_CR8","unstructured":"Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding, (2021). 
arxiv:2009.03300"},{"key":"1112_CR9","unstructured":"Liang, W., Zhang, Y., Wu, Z., Lepp, H., Ji, W., Zhao, X., Cao, H., Liu, S., He, S., Huang, Z., Yang, D., Potts, C., Manning, C.\u00a0D., Zou, J.\u00a0Y.: Mapping the increasing use of LLMs in scientific papers, (2024)"},{"key":"1112_CR10","doi-asserted-by":"crossref","unstructured":"Liu, J., An, H., Li, J., Liang, H.: Detecting exploit primitives automatically for heap vulnerabilities on binary programs, (2022)","DOI":"10.1145\/3573428.3573550"},{"key":"1112_CR11","unstructured":"Liu, Z., Shi, J., Buford, J.: Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity. 02 (2024)"},{"key":"1112_CR12","doi-asserted-by":"crossref","unstructured":"McKee, F., Noever, D.: Chatbots in a honeypot world, (2023)","DOI":"10.5121\/ijci.2023.120207"},{"key":"1112_CR13","unstructured":"Shao, M., Chen, B., Jancheska, S., Dolan-Gavitt, B., Garg, S., Karri, R., Shafique, M.: An empirical evaluation of llms for solving offensive security challenges, (2024)"},{"key":"1112_CR14","doi-asserted-by":"crossref","unstructured":"Sladi\u0107, M., Valeros, V., Catania, C., Garcia, S.: LLM in the shell: Generative honeypots, (2024). arxiv:2309.00155","DOI":"10.1109\/EuroSPW61312.2024.00054"},{"key":"1112_CR15","doi-asserted-by":"publisher","unstructured":"Stewart\u00a0Kirubakaran, S., Ebenezer, V., Santhiya, P., Manojkumar, G., Sophia, S., Snowlin\u00a0Preethi, J.\u00a0A.\u00a0S.: An effective study on different levels of honeypot with applications and design of real time honeypot. In 2023 2nd International Conference on Edge Computing and Applications (ICECAA), pages 1209\u20131212, 2023. 
https:\/\/doi.org\/10.1109\/ICECAA58104.2023.10212345","DOI":"10.1109\/ICECAA58104.2023.10212345"},{"key":"1112_CR16","unstructured":"Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models, (2023)"},{"key":"1112_CR17","unstructured":"Yang, A.\u00a0Z.\u00a0H., Tian, H., Ye, H., Martins, R., Goues, C.\u00a0L.: Security vulnerability detection with multitask self-instructed fine-tuning of large language models, (2024)"},{"key":"1112_CR18","doi-asserted-by":"publisher","unstructured":"Zhang, J., Bu, H., Wen, H., Liu, Y., Fei, H., Xi, R., Li, L., Yang, Y., Zhu, H., Meng, D.: When llms meet cybersecurity: a systematic literature review. Cybersecurity 8(1), 55, ISSN 2523-3246 (2025). https:\/\/doi.org\/10.1186\/s42400-025-00361-w","DOI":"10.1186\/s42400-025-00361-w"},{"key":"1112_CR19","unstructured":"Zhang, Y., Song, W., Ji, Z., Danfeng, Yao, Meng, N.: How well does llm generate security tests?, (2023)"},{"key":"1112_CR20","unstructured":"Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., et\u00a0al.: Judging llm-as-a-judge with mt-bench and chatbot arena. In A.\u00a0Oh, T.\u00a0Naumann, A.\u00a0Globerson, K.\u00a0Saenko, M.\u00a0Hardt, and S.\u00a0Levine, editors, Advances in Neural Information Processing Systems, volume\u00a036, pages 46595\u201346623. 
Curran Associates, Inc., (2023)"}],"container-title":["International Journal of Information Security"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10207-025-01112-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10207-025-01112-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10207-025-01112-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T11:39:27Z","timestamp":1760614767000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10207-025-01112-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,27]]},"references-count":20,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,10]]}},"alternative-id":["1112"],"URL":"https:\/\/doi.org\/10.1007\/s10207-025-01112-1","relation":{},"ISSN":["1615-5262","1615-5270"],"issn-type":[{"value":"1615-5262","type":"print"},{"value":"1615-5270","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,27]]},"assertion":[{"value":"10 August 2025","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 August 2025","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interest as defined by Springer, or other interests that might be perceived to influence the results and\/or discussion reported in this paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing 
interest"}},{"value":"The authors declare no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"197"}}