{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,20]],"date-time":"2026-05-20T12:07:11Z","timestamp":1779278831342,"version":"3.51.4"},"reference-count":215,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2026,5,20]],"date-time":"2026-05-20T00:00:00Z","timestamp":1779235200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,5,20]],"date-time":"2026-05-20T00:00:00Z","timestamp":1779235200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001230","name":"Macquarie University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001230","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2026,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their deployment raises critical safety concerns as potential misuse poses significant societal risks. This survey reviews the end-to-end security and safety pipeline of LLMs, focusing on the interaction between users and model responses. We categorize the system into five key components: attacks, defenses, safety alignment, metrics, and guarding mechanisms. Attacks involve crafting adversarial inputs to exploit model vulnerabilities. Defenses act as countermeasures, aiming to detect and prevent such inputs before processing. Safety alignment ensures that, even when attacks reach the model, its responses remain consistent with ethical and policy-aligned behavior. Guarding mechanisms operate post-response to flag, filter, or block unsafe outputs. 
Each of these stages is subject to rigorous evaluation metrics to assess their effectiveness, robustness, and limitations. As LLMs evolve toward more general-purpose intelligence, these safety considerations become increasingly critical for the development of robust and trustworthy AI systems. Finally, we highlight open challenges and future research directions to advance the security and alignment of LLMs.<\/jats:p>","DOI":"10.1007\/s10994-026-07060-8","type":"journal-article","created":{"date-parts":[[2026,5,20]],"date-time":"2026-05-20T11:24:28Z","timestamp":1779276268000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Survey on LLM Safety: Attacks, Defenses, Alignment, Metrics, and Guardrails"],"prefix":"10.1007","volume":"115","author":[{"given":"Pratik","family":"Jalan","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vadivel","family":"Abishethvarman","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bhavik","family":"Chandna","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Usman","family":"Naseem","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2026,5,20]]},"reference":[{"key":"7060_CR1","doi-asserted-by":"crossref","unstructured":"Abdelkader, H., Abdelrazek, M., Barnett, S., Schneider, J.-G., Rani, P., & Vasa, R. (2024). Ml-on-rails: Safeguarding machine learning models in software systems\u2013A case study. In: Proceedings of the IEEE\/ACM 3rd international conference on ai engineering-software engineering for AI, pp. 178\u2013183","DOI":"10.1145\/3644815.3644958"},{"key":"7060_CR2","doi-asserted-by":"crossref","unstructured":"Abiri, G. (2024). Public constitutional AI. 
arXiv Article 2406.16696.","DOI":"10.2139\/ssrn.4874670"},{"key":"7060_CR3","unstructured":"Abishethvarman, V., Chandna, B., Jalan, P., & Naseem, U. (2025). XGUARD: A graded benchmark for evaluating safety failures of large language models on extremist content. arXiv Article 2506.00973."},{"key":"7060_CR4","doi-asserted-by":"crossref","unstructured":"Abishethvarman, V., Sabrina, F., & Kwan, P. (2025). Knowledge integrity in large language models: A state-of-the-art review. Information, 16(12), Article 1076.","DOI":"10.3390\/info16121076"},{"key":"7060_CR5","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et\u00a0al. (2023). Gpt-4 technical report. arXiv:2303.08774"},{"key":"7060_CR6","doi-asserted-by":"crossref","unstructured":"Aghakhani, H., Dai, W., Manoel, A., Fernandes, X., Kharkar, A., Kruegel, C., Vigna, G., Evans, D., Zorn, B., & Sim, R. (2024). TrojanPuzzle: Covertly poisoning code-suggestion models. arXiv Article 2301.02344.","DOI":"10.1109\/SP54263.2024.00140"},{"key":"7060_CR7","doi-asserted-by":"publisher","DOI":"10.51219\/jaimld\/syed-arham-akheel\/536","author":"SA Akheel","year":"2025","unstructured":"Akheel, S. A. (2025). Guardrails for large language models: A review of techniques and challenges. Journal of Artificial Intelligence, Machine Learning and Data Science. https:\/\/doi.org\/10.51219\/jaimld\/syed-arham-akheel\/536","journal-title":"Journal of Artificial Intelligence, Machine Learning and Data Science"},{"key":"7060_CR8","unstructured":"Alami, R., Almansoori, A. K., Alzubaidi, A., Seddik, M. E. A., Farooq, M., & Hacid, H. (2024). Alignment with preference optimization is all you need for LLM safety. arXiv Article 2409.07772."},{"key":"7060_CR9","unstructured":"Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., Goffinet, \u00c9., Hesslow, D., Launay, J., Malartic, Q., et\u00a0al. (2023). 
The falcon series of open language models. arXiv:2311.16867"},{"key":"7060_CR10","unstructured":"Alon, G., & Kamfonas, M. (2023). Detecting language model attacks with perplexity. arXiv Article 2308.14132."},{"key":"7060_CR11","unstructured":"Alvisi, L., Tardelli, S., & Tesconi, M. (2025). Mapping the Italian Telegram ecosystem: Communities, toxicity, and hate speech. arXiv Article 2504.19594."},{"key":"7060_CR215","unstructured":"Anthropic, S. (2024). Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet. URL semanticscholar. org\/CorpusID, 273639283, 24 https:\/\/www.semanticscholar.org\/paper\/Claude-3.5-Sonnet-Model-Card-Addendum\/fed9cc193a14b84131812372d8d5857f8f304c52"},{"key":"7060_CR12","unstructured":"Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., Lubana, E.S., Jenner, E., Casper, S., Sourbut, O., et\u00a0al. (2024). Foundational challenges in assuring alignment and safety of large language models. arXiv:2404.09932"},{"key":"7060_CR13","unstructured":"Avinash, K., Pareek, N., & Hada, R. (2025). Protect: Towards robust guardrailing stack for trustworthy enterprise LLM systems. arXiv: 2510.13351"},{"key":"7060_CR14","unstructured":"Ayyamperumal, S. G., & Ge, L. (2024). Current state of LLM risks and AI guardrails. arXiv: 2406.12934"},{"key":"7060_CR15","unstructured":"Ayyamperumal, S. G., & Ge, L. (2024). Current state of llm risks and ai guardrails. arXiv:2406.12934"},{"key":"7060_CR16","unstructured":"Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et\u00a0al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862"},{"key":"7060_CR17","unstructured":"Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et\u00a0al. (2022). Constitutional ai: Harmlessness from ai feedback. 
arXiv:2212.08073"},{"key":"7060_CR18","unstructured":"Banerjee, A., Maity, A., Kamboj, P., & Gupta, S. K. (2024). Cps-llm: large language model based safe usage plan generator for human-in-the-loop human-in-the-plant cyber-physical system. arXiv:2405.11458"},{"key":"7060_CR19","doi-asserted-by":"publisher","unstructured":"Basani, A. R., & Zhang, X. (2024). Gasp: Efficient black-box generation of adversarial suffixes for jailbreaking llms. https:\/\/doi.org\/10.48550\/arxiv.2411.14133","DOI":"10.48550\/arxiv.2411.14133"},{"key":"7060_CR20","doi-asserted-by":"crossref","unstructured":"Bassani, E., & Sanchez, I. (2024). GuardBench: A large-scale benchmark for guardrail models. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Proceedings of the 2024 conference on empirical methods in natural language processing, pp. 18393\u201318409. Association for Computational Linguistics, Miami, Florida, USA. https:\/\/doi.org\/10.18653\/v1\/2024.emnlp-main.1022. https:\/\/aclanthology.org\/2024.emnlp-main.1022\/","DOI":"10.18653\/v1\/2024.emnlp-main.1022"},{"key":"7060_CR21","doi-asserted-by":"publisher","unstructured":"Benjamin, V., Braca, E., Carter, I., Kanchwala, H., Khojasteh, N., Landow, C., Luo, Y., Ma, C., Magarelli, A., Mirin, R., Moyer, A., Simpson, K., Skawinski, A., & Heverin, T. (2024). Systematically analyzing prompt injection vulnerabilities in diverse llm architectures. https:\/\/doi.org\/10.48550\/arxiv.2410.23308","DOI":"10.48550\/arxiv.2410.23308"},{"key":"7060_CR22","unstructured":"B\u00f8e, V. S. (2023). Llm self-critique for task-oriented text generation."},{"key":"7060_CR23","unstructured":"Brown, H., Lin, L., Kawaguchi, K., & Shieh, M. (2024). Self-evaluation as a defense against adversarial attacks on LLMs. arXiv: 2407.03234"},{"key":"7060_CR24","first-page":"1877","volume":"33","author":"T Brown","year":"2020","unstructured":"Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., & Askell, A. 
(2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877\u20131901.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"7060_CR25","doi-asserted-by":"crossref","unstructured":"Cao, C., Zhu, H., Ji, J., Sun, Q., Zhu, Z., Wu, Y., Dai, J., Yang, Y., Han, S., Guo, Y. (2025). Safelawbench: Towards safe alignment of large language models. arXiv:2506.06636","DOI":"10.18653\/v1\/2025.findings-acl.721"},{"key":"7060_CR26","unstructured":"Cao, N. D., Aziz, W., & Titov, I. (2021). Editing factual knowledge in language models. arXiv: 2104.08164"},{"key":"7060_CR27","unstructured":"Carlini, N., Athalye, A., Papernot, N., Brendel, W., Rauber, J., Tsipras, D., Goodfellow, I., Madry, A., & Kurakin, A. (2019). On evaluating adversarial robustness. arXiv: 1902.06705"},{"key":"7060_CR28","unstructured":"Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., & Raffel, C. (2021). Extracting training data from large language models. arXiv: 2012.07805"},{"key":"7060_CR29","doi-asserted-by":"crossref","unstructured":"Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G.J., & Wong, E. (2024). Jailbreaking black box large language models in twenty queries. arXiv: 2310.08419","DOI":"10.1109\/SaTML64287.2025.00010"},{"key":"7060_CR30","doi-asserted-by":"publisher","unstructured":"Chen, T., Wang, D., Liang, X., Risius, M., Demartini, G., & Yin, H. (2024). Hate speech detection with generalizable target-aware fairness. In: Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining. KDD \u201924, pp. 365\u2013375. Association for Computing Machinery, New York, NY, USA. https:\/\/doi.org\/10.1145\/3637528.3671821 .","DOI":"10.1145\/3637528.3671821"},{"key":"7060_CR31","doi-asserted-by":"publisher","unstructured":"Chen, Y., Li, Z., You, S., Chen, Z., Chang, J., Zhang, Y., Dai, W., Guo, Q., & Xiao, Y. (2025). 
Attributive reasoning for hallucination diagnosis of large language models, pp. 23660\u201323668 https:\/\/doi.org\/10.1609\/aaai.v39i22.34536","DOI":"10.1609\/aaai.v39i22.34536"},{"key":"7060_CR32","unstructured":"Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., & Gonzalez, J. E. (2023). Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See,2 (3), 6."},{"issue":"240","key":"7060_CR33","first-page":"1","volume":"24","author":"A Chowdhery","year":"2023","unstructured":"Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., & Gehrmann, S. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24 (240), 1\u2013113.","journal-title":"Journal of Machine Learning Research"},{"key":"7060_CR35","unstructured":"Chua, G., Chan, S. Y., & Khoo, S. (2025). A flexible large language models guardrail development methodology applied to off-topic prompt detection. arXiv Article 2411.12946."},{"key":"7060_CR36","unstructured":"Chua, J., Li, Y., Yang, S., Wang, C., & Yao, L. (2024). Ai safety in generative ai large language models: A survey. arXiv Article 2407.18369."},{"key":"7060_CR34","doi-asserted-by":"crossref","unstructured":"Chu, J., Liu, Y., Yang, Z., Shen, X., Backes, M., & Zhang, Y. (2025). JailbreakRadar: Comprehensive assessment of jailbreak attacks against LLMs. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd annual meeting of the association for computational linguistics (Volume 1: Long Papers), pp. 21538\u201321566. Association for Computational Linguistics, Vienna, Austria . https:\/\/doi.org\/10.18653\/v1\/2025.acl-long.1045 . https:\/\/aclanthology.org\/2025.acl-long.1045\/","DOI":"10.18653\/v1\/2025.acl-long.1045"},{"key":"7060_CR37","unstructured":"Claude 3.5 sonnet model card addendum. 
https:\/\/api.semanticscholar.org\/CorpusID:270667923"},{"key":"7060_CR38","doi-asserted-by":"crossref","unstructured":"Crisan, A., & Fiore-Gartland, B. (2021). Fits and starts: Enterprise use of AutoML and the role of humans in the loop. arXiv Article 2101.04296.","DOI":"10.1145\/3411764.3445775"},{"key":"7060_CR39","unstructured":"Dada, M. Y. (2024). On the biases, privacy implications and guardrail viability of large language models. Master\u2019s thesis, Queen\u2019s University (Canada)"},{"key":"7060_CR40","doi-asserted-by":"publisher","unstructured":"Daquigan, J., Marbella, G., Dioses, R., Co, J. D., Centeno, C., & Mata, K. (2025). Enhancement of profanity filtering and hate speech detection algorithm applied in Minecraft chats. Technoarete Transactions on Advances in Computer Applications. https:\/\/doi.org\/10.36647\/TTACA\/04.02.A001","DOI":"10.36647\/TTACA\/04.02.A001"},{"key":"7060_CR41","doi-asserted-by":"publisher","first-page":"11","DOI":"10.25172\/smustlr.27.1.3","volume":"27","author":"AG Dawson","year":"2024","unstructured":"Dawson, A. G. (2024). Algorithmic adjudication and constitutional ai-the promise of a better ai decision making future? SMU Science and Technology Law Review, 27, 11.","journal-title":"SMU Science and Technology Law Review"},{"key":"7060_CR42","unstructured":"Debenedetti, E., Zhang, J., Balunovic, M., Beurer-Kellner, L., Fischer, M., & Tram\u00e8r, F. (2024). Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in neural information processing systems, vol. 37, pp. 82895\u201382920. Curran Associates, Inc., ??? (2024). https:\/\/doi.org\/10.52202\/079017-2636 . 
https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2024\/file\/97091a5177d8dc64b1da8bf3e1f6fb54-Paper-Datasets_and_Benchmarks_Track.pdf"},{"key":"7060_CR43","unstructured":"Deng, Y., Yang, Y., Zhang, J., Wang, W., & Li, B. (2025a). DuoGuard: A two-player RL-driven framework for multilingual LLM guardrails. arXiv Article 2502.05163."},{"key":"7060_CR44","unstructured":"Deng, Y., Yang, Y., Zhang, J., Wang, W., & Li, B. (2025b). DuoGuard: A two-player rl-driven framework for multilingual llm guardrails. arXiv Article 2502.05163."},{"key":"7060_CR45","doi-asserted-by":"publisher","unstructured":"Donato, J. (2025). Benchmarking llm robustness against prompt-based adversarial attacks. In: 2025 20th European dependable computing conference companion proceedings (EDCC-C), pp. 60\u201363 . https:\/\/doi.org\/10.1109\/EDCC-C66476.2025.00031","DOI":"10.1109\/EDCC-C66476.2025.00031"},{"key":"7060_CR46","unstructured":"Dong, Y., Mu, R., Jin, G., Qi, Y., Hu, J., Zhao, X., Meng, J., Ruan, W., & Huang, X. (2024a). Building guardrails for large language models. arXiv Article 2402.01822."},{"key":"7060_CR47","doi-asserted-by":"crossref","unstructured":"Dong, Y., Mu, R., Zhang, Y., Sun, S., Zhang, T., Wu, C., Jin, G., Qi, Y., Hu, J., Meng, J., Bensalem, S., & Huang, X. (2024b). Safeguarding large language models: A survey. arXiv Article 2406.02622.","DOI":"10.1007\/s10462-025-11389-2"},{"key":"7060_CR48","doi-asserted-by":"crossref","unstructured":"Dong, Y., Mu, R., Zhang, Y., Sun, S., Zhang, T., Wu, C., Jin, G., Qi, Y., Hu, J., Meng, J., et\u00a0al. (2024c). Safeguarding large language models: A survey. arXiv:2406.02622","DOI":"10.1007\/s10462-025-11389-2"},{"key":"7060_CR49","unstructured":"Dong, Z., Zhou, Z., Yang, C., Shao, J., & Qiao, Y. (2024d). Attacks, defenses and evaluations for llm conversation safety: A survey. arXiv Article 2402.09283."},{"key":"7060_CR50","unstructured":"Downey-Webb, T., Jogunola, O., & Ajao, O. (2025). 
Safeguarding efficacy in large language models: Evaluating resistance to human-written and algorithmic adversarial prompts. arXiv Article 2510.15973."},{"key":"7060_CR51","unstructured":"Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et\u00a0al. (2022). Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv:2209.07858"},{"key":"7060_CR52","doi-asserted-by":"crossref","unstructured":"Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020a). RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv Article 2009.11462.","DOI":"10.18653\/v1\/2020.findings-emnlp.301"},{"key":"7060_CR53","doi-asserted-by":"publisher","unstructured":"Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the association for computational linguistics: EMNLP 2020 (pp. 3356\u20133369). Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2020.findings-emnlp.301","DOI":"10.18653\/v1\/2020.findings-emnlp.301"},{"key":"7060_CR54","doi-asserted-by":"crossref","unstructured":"Ghosal, S. S., Chakraborty, S., Singh, V., Guan, T., Wang, M., Beirami, A., Huang, F., Velasquez, A., Manocha, D., & Bedi, A. S. (2024). Immune: Improving safety against jailbreaks in multi-modal LLMs via inference-time alignment. arXiv Article 2411.18688.","DOI":"10.1109\/CVPR52734.2025.02331"},{"key":"7060_CR55","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v39i22.34568","author":"Y Gong","year":"2025","unstructured":"Gong, Y., Ran, D., Liu, J., Wang, C., Cong, T., Wang, A., Duan, S., & Wang, X. (2025). FigStep: Jailbreaking large vision-language models via typographic visual prompts. Proceedings of the AAAI Conference on Artificial Intelligence. 
https:\/\/doi.org\/10.1609\/aaai.v39i22.34568","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"7060_CR56","doi-asserted-by":"crossref","unstructured":"Goyal, S., Hira, M., Mishra, S., Goyal, S., Goel, A., Dadu, N., DB, K., Mehta, S., & Madaan, N. (2024). Llmguard: guarding against unsafe llm behavior. In: Proceedings of the AAAI conference on artificial intelligence, vol. 38, pp. 23790\u201323792","DOI":"10.1609\/aaai.v38i21.30566"},{"key":"7060_CR57","unstructured":"Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et\u00a0al. (2024). The llama 3 herd of models. arXiv:2407.21783"},{"key":"7060_CR58","doi-asserted-by":"publisher","unstructured":"Gumaan, E. (2025). Theoretical foundations and mitigation of hallucination in large language models. https:\/\/doi.org\/10.48550\/arxiv.2507.22915","DOI":"10.48550\/arxiv.2507.22915"},{"key":"7060_CR59","unstructured":"Hackmann, S. (2024). HM3: Heterogeneous multi-class model merging. arXiv Article 2409.19173."},{"key":"7060_CR60","doi-asserted-by":"crossref","unstructured":"Han, X., & Tsvetkov, Y. (2020). Fortifying toxic speech detectors against veiled toxicity. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp. 7732\u20137739. Association for Computational Linguistics, Online. https:\/\/doi.org\/10.18653\/v1\/2020.emnlp-main.622 . https:\/\/aclanthology.org\/2020.emnlp-main.622\/","DOI":"10.18653\/v1\/2020.emnlp-main.622"},{"key":"7060_CR61","doi-asserted-by":"crossref","unstructured":"Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., & Kamar, E. (2022). Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv Article 2203.09509.","DOI":"10.18653\/v1\/2022.acl-long.234"},{"key":"7060_CR62","unstructured":"Hasan, M. 
M., Rahman, Z., Mostafiz, R., & Hossain, M. A. (2025). Sentra-guard: A multilingual human-AI framework for real-time defense against adversarial LLM jailbreaks. arXiv Article 2510.22628."},{"key":"7060_CR63","doi-asserted-by":"crossref","unstructured":"Havrilla, A., Zhuravinskyi, M., Phung, D., Tiwari, A., Tow, J., Biderman, S., Anthony, Q., & Castricato, L. (2023). trlx: A framework for large scale reinforcement learning from human feedback. In: Proceedings of the 2023 conference on empirical methods in natural language processing, pp. 8578\u20138595","DOI":"10.18653\/v1\/2023.emnlp-main.530"},{"key":"7060_CR64","unstructured":"He, F., Zhu, T., Ye, D., Liu, B., Zhou, W., & Yu, P. S. (2024). The emerged security and privacy of LLM agent: A survey with case studies. arXiv Article 2407.19354."},{"key":"7060_CR65","unstructured":"Henneking, C.-L., & Beger, C. (2025). Unlocking transparent alignment through enhanced inverse constitutional AI for principle extraction. arXiv Article 2501.17112."},{"key":"7060_CR67","doi-asserted-by":"crossref","unstructured":"Huang, S., Siddarth, D., Lovitt, L., Liao, T. I., Durmus, E., Tamkin, A., Ganguli, D.: Collective constitutional ai: Aligning a language model with public input. In: Proceedings of the 2024 ACM conference on fairness, accountability, and transparency, pp. 1395\u20131417 (2024)","DOI":"10.1145\/3630106.3658979"},{"key":"7060_CR68","doi-asserted-by":"crossref","unstructured":"Huang, Y., Gupta, S., Zhong, Z., Li, K., & Chen, D. (2023). Privacy implications of retrieval-based language models.","DOI":"10.18653\/v1\/2023.emnlp-main.921"},{"key":"7060_CR69","unstructured":"Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et\u00a0al. (2024). Gpt-4o system card. arXiv:2410.21276"},{"key":"7060_CR66","doi-asserted-by":"crossref","unstructured":"Hu, X., Chen, P.-Y., & Ho, T.-Y. (2024). 
Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes. arXiv Article 2403.00867.","DOI":"10.52202\/079017-4011"},{"key":"7060_CR70","doi-asserted-by":"crossref","unstructured":"Irtiza, S., Akbar, K.A., Yasmeen, A., Khan, L., Daescu, O., & Thuraisingham, B. (2024). Llm-sentry: A model-agnostic human-in-the-loop framework for securing large language models. In: 2024 IEEE 6th international conference on trust, privacy and security in intelligent systems, and applications (TPS-ISA), pp. 245\u2013254. IEEE","DOI":"10.1109\/TPS-ISA62245.2024.00036"},{"key":"7060_CR71","unstructured":"Irving, G., Christiano, P., & Amodei, D. (2018). AI safety via debate. arXiv Article 1805.00899."},{"key":"7060_CR72","unstructured":"Jain, N., Schwarzschild, A., Wen, Y., Somepalli, G., Kirchenbauer, J., Chiang, P.-y, Goldblum, M., Saha, A., Geiping, J., & Goldstein, T. (2023). Baseline defenses for adversarial attacks against aligned language models. arXiv Article 2309.00614."},{"key":"7060_CR74","unstructured":"Jiang, D., Liu, Y., Liu, S., Zhao, J., Zhang, H., Gao, Z., Zhang, X., Li, J., & Xiong, H. (2023). From CLIP to DINO: Visual encoders shout in multi-modal large language models. arXiv Article 2310.08825."},{"key":"7060_CR75","unstructured":"Jiao, R., Xie, S., Yue, J., Sato, T., Wang, L., Wang, Y., Chen, Q. A., & Zhu, Q. (2025). Can we trust embodied agents? Exploring backdoor attacks against embodied LLM-based decision-making systems. arXiv Article 2405.20774."},{"key":"7060_CR73","doi-asserted-by":"publisher","unstructured":"Jia, Y., Shao, Z., Liu, Y., Jia, J., Song, D., & Gong, N. (2025). A critical evaluation of defenses against prompt injection attacks. arXiv. https:\/\/doi.org\/10.48550\/arxiv.2505.18333","DOI":"10.48550\/arxiv.2505.18333"},{"key":"7060_CR76","doi-asserted-by":"crossref","unstructured":"Kamath, U., Keenan, K., Somers, G., & Sorenson, S. (2024). Tuning for llm alignment. 
In: Large language models: A deep dive: bridging theory and practice, pp. 177\u2013218. Springer","DOI":"10.1007\/978-3-031-65647-7_5"},{"key":"7060_CR77","doi-asserted-by":"publisher","unstructured":"Kikkisetti, D., Mustafa, R., Melillo, W., Corizzo, R., Boukouvalas, Z., Gill, J., & Japkowicz, N. (2024). Coded term discovery for online hate speech detection. In: 2024 IEEE 11th international conference on data science and advanced analytics (DSAA), pp. 1\u201310 . https:\/\/doi.org\/10.1109\/DSAA61799.2024.10722816","DOI":"10.1109\/DSAA61799.2024.10722816"},{"key":"7060_CR78","unstructured":"Koide, T., Nakano, H., & Chiba, D. (2026). Clouding the mirror: Stealthy prompt injection attacks targeting LLM-based phishing detection. arXiv: 2602.05484"},{"key":"7060_CR79","doi-asserted-by":"publisher","unstructured":"Koo, H., Kim, M., & Kim, J. (2025). Align to misalign: Automatic llm jailbreak with meta-optimized llm judges. https:\/\/doi.org\/10.48550\/arxiv.2511.01375","DOI":"10.48550\/arxiv.2511.01375"},{"issue":"3","key":"7060_CR80","doi-asserted-by":"publisher","first-page":"26","DOI":"10.1007\/s13735-024-00334-8","volume":"13","author":"P Kumar","year":"2024","unstructured":"Kumar, P. (2024). Adversarial attacks and defenses for large language models (LLMs): Methods, frameworks & challenges. International Journal of Multimedia Information Retrieval, 13(3), 26.","journal-title":"International Journal of Multimedia Information Retrieval"},{"key":"7060_CR81","unstructured":"Kundu, S., Bai, Y., Kadavath, S., Askell, A., Callahan, A., Chen, A., Goldie, A., Balwit, A., Mirhoseini, A., & McLean, B., et\u00a0al. (2023). Specific versus general principles for constitutional ai. arXiv:2310.13798"},{"key":"7060_CR89","first-page":"13872","volume":"37","author":"J Liang","year":"2024","unstructured":"Liang, J., Cai, Z., Zhu, J., Huang, H., Zong, K., An, B., Alharthi, M., He, J., Zhang, L., & Li, H. (2024). Alignment at pre-training! Towards native alignment for Arabic LLMs. 
Advances in Neural Information Processing Systems, 37, 13872\u201313896.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"7060_CR90","unstructured":"Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et\u00a0al. (2022). Holistic evaluation of language models."},{"key":"7060_CR82","unstructured":"Li, H., Chen, Y., Zeng, J., Peng, H., Jing, H., Hu, W., Yang, X., Zeng, Z., Han, S., & Song, Y. (2025b). GSPR: Aligning LLM safeguards as generalizable safety policy reasoners. arXiv: 2509.24418"},{"key":"7060_CR83","unstructured":"Li, J., Li, R., & Liu, Q. (2023). Beyond static datasets: A deep interaction approach to llm evaluation. arXiv:2309.04369"},{"key":"7060_CR84","first-page":"124292","volume":"37","author":"J Li","year":"2024","unstructured":"Li, J., Zeng, S., Wai, H.-T., Li, C., Garcia, A., & Hong, M. (2024a). Getting more juice out of the sft data: Reward learning from human demonstration improves sft for llm alignment. Advances in Neural Information Processing Systems, 37, 124292\u2013124318.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"7060_CR85","doi-asserted-by":"crossref","unstructured":"Li, L., Song, D., Li, X., Zeng, J., Ma, R., & Qiu, X. (2021). Backdoor attacks on pre-trained models by layerwise weight poisoning. arXiv: 2108.13888","DOI":"10.18653\/v1\/2021.emnlp-main.241"},{"key":"7060_CR91","doi-asserted-by":"crossref","unstructured":"Lin, S., Hilton, J., Evans, O. (2021). Truthfulqa: Measuring how models mimic human falsehoods. arXiv:2109.07958","DOI":"10.18653\/v1\/2022.acl-long.229"},{"key":"7060_CR92","doi-asserted-by":"crossref","unstructured":"Lin, S., Hilton, J., & Evans, O. (2022). Truthfulqa: Measuring how models mimic human falsehoods. In: Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long Papers), pp. 
3214\u20133252","DOI":"10.18653\/v1\/2022.acl-long.229"},{"key":"7060_CR93","unstructured":"Lin, S., Li, R., Wang, X., Lin, C., Xing, W., & Han, M. (2024). Figure it out: Analyzing-based jailbreak attack on large language models. arXiv:2407.16205"},{"key":"7060_CR94","unstructured":"Lin, S., Yang, H., Li, R., Wang, X., Lin, C., Xing, W., & Han, M. (2025). LLMs can be dangerous reasoners: Analyzing-based jailbreak attack on large language models. arXiv: 2407.16205"},{"key":"7060_CR95","unstructured":"Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et\u00a0al. (2024a). Deepseek-v3 technical report. arXiv:2412.19437"},{"key":"7060_CR96","doi-asserted-by":"crossref","unstructured":"Liu, A., Sheng, Q., & Hu, X. (2024c). Preventing and detecting misinformation generated by large language models. In: Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, pp. 3001\u20133004","DOI":"10.1145\/3626772.3661377"},{"key":"7060_CR97","doi-asserted-by":"publisher","unstructured":"Liu, C., Wu, H., Yang, X., Zhang, K., Wu, C., Zhang, W., Yu, N.H., Zhang, T., Guo, Q. & Zhang, J. (2025). Exploiting vulnerabilities in speech translation systems through targeted adversarial attacks. https:\/\/doi.org\/10.48550\/arxiv.2503.00957","DOI":"10.48550\/arxiv.2503.00957"},{"key":"7060_CR98","unstructured":"Liu, F., Xu, Z., & Liu, H. (2024d). Adversarial tuning: Defending against jailbreak attacks for llms. arXiv Article 2406.06622."},{"key":"7060_CR99","unstructured":"Liu, Y., Chen, L., Wang, J., Mei, Q., & Xie, X. (2023a). Meta semantic template for evaluation of large language models. arXiv Article 2310.01448."},{"key":"7060_CR100","doi-asserted-by":"publisher","unstructured":"Liu, Y., Jia, Y., Geng, R., Jia, J., & Gong, N. (2023b). Formalizing and benchmarking prompt injection attacks and defenses. 
https:\/\/doi.org\/10.48550\/arxiv.2310.12815","DOI":"10.48550\/arxiv.2310.12815"},{"key":"7060_CR101","unstructured":"Liu, Y., Liu, G., Zhang, R., Niyato, D., Xiong, Z., Kim, D. I., Huang, K., & Du, H. (2024b). Hallucination-aware optimization for large language model-empowered communications. arXiv Article 2412.06007."},{"key":"7060_CR102","unstructured":"Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Guo, R., Cheng, H., Klochkov, Y., Taufiq, M. F., & Li, H. (2023c). Trustworthy llms: A survey and guideline for evaluating large language models\u2019 alignment. arXiv Article 2308.05374."},{"key":"7060_CR86","unstructured":"Li, X., Zhou, Z., Zhu, J., Yao, J., Liu, T., & Han, B. (2024b) DeepInception: Hypnotize large language model to be jailbreaker. arXiv: 2311.03191"},{"key":"7060_CR87","unstructured":"Li, Y., Ahn, S., Jiang, H., Abdi, A.H., Yang, Y., & Qiu, L. (2025a). SecurityLingua: Efficient defense of LLM jailbreak attacks via security-aware prompt compression. arXiv: 2506.12707"},{"key":"7060_CR88","doi-asserted-by":"publisher","unstructured":"Li, Y., Xiong, Y., Zhong, J., Zhang, J., Zhou, J., & Zou, L. (2025c). Exploiting prefix-tree in structured output interfaces for enhancing jailbreak attacking. https:\/\/doi.org\/10.48550\/arxiv.2502.13527","DOI":"10.48550\/arxiv.2502.13527"},{"key":"7060_CR103","unstructured":"Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H.W., Tay, Y., Zhou, D., Le, Q.V., Zoph, B., & Wei, J. (2023). The flan collection: Designing data and methods for effective instruction tuning. In: International conference on machine learning, pp. 22631\u201322648 . PMLR"},{"key":"7060_CR104","doi-asserted-by":"crossref","unstructured":"Ma, X., Gao, Y., Wang, Y., Wang, R., Wang, X., Sun, Y., Ding, Y., Xu, H., Chen, Y., Zhao, Y., et\u00a0al. (2025). Safety at scale: A comprehensive survey of large model safety. 
arXiv:2502.05206","DOI":"10.1561\/3300000051"},{"key":"7060_CR105","doi-asserted-by":"publisher","unstructured":"Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., & Karbasi, A. (2023). Tree of attacks: Jailbreaking black-box llms automatically. https:\/\/doi.org\/10.48550\/arxiv.2312.02119","DOI":"10.48550\/arxiv.2312.02119"},{"key":"7060_CR106","unstructured":"Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2023). Locating and editing factual associations in GPT. arXiv Article 2202.05262."},{"key":"7060_CR108","doi-asserted-by":"crossref","unstructured":"Mostafazadeh\u00a0Davani, A., Omrani, A., Kennedy, B., Atari, M., Ren, X., & Dehghani, M. (2021). Improving counterfactual generation for fair hate speech detection. In: Mostafazadeh\u00a0Davani, A., Kiela, D., Lambert, M., Vidgen, B., Prabhakaran, V., Waseem, Z. (eds.) Proceedings of the 5th workshop on online abuse and harms (WOAH 2021), pp. 92\u2013101. Association for Computational Linguistics, Online. https:\/\/doi.org\/10.18653\/v1\/2021.woah-1.10 . https:\/\/aclanthology.org\/2021.woah-1.10\/","DOI":"10.18653\/v1\/2021.woah-1.10"},{"key":"7060_CR107","doi-asserted-by":"publisher","unstructured":"Mo, T., Lam, J. C. K., Li, V. O. K., & Cheung, L. Y. L. (2025). DECT: Harnessing LLM-assisted fine-grained linguistic knowledge and label-switched and label-preserved data generation for diagnosis of Alzheimer\u2019s disease. Proceedings of the AAAI Conference on Artificial Intelligence. https:\/\/doi.org\/10.1609\/aaai.v39i23.34671","DOI":"10.1609\/aaai.v39i23.34671"},{"key":"7060_CR109","doi-asserted-by":"crossref","unstructured":"Mu, H., He, H., Zhou, Y., Feng, Y., Xu, Y., Qin, L., Shi, X., Liu, Z., Han, X., Shi, Q., Zhu, Q., & Che, W. (2025). Stealthy jailbreak attacks on large language models via benign data mirroring. arXiv Article 2410.21083.","DOI":"10.18653\/v1\/2025.naacl-long.88"},{"key":"7060_CR110","unstructured":"Nadeau, D., Kroutikov, M., McNeil, K., & Baribeau, S. 
(2024). Benchmarking llama2, Mistral, Gemma and GPT for factuality, toxicity, bias and propensity for hallucinations. arXiv Article 2404.09785."},{"key":"7060_CR111","unstructured":"Nagireddy, M., Padhi, I., Ghosh, S., & Sattigeri, P. (2024). When in doubt, cascade: Towards building efficient and capable guardrails. arXiv Article 2407.06323."},{"key":"7060_CR112","doi-asserted-by":"crossref","unstructured":"Nagireddy, M., Padhi, I., Ghosh, S., & Sattigeri, P. (2025). When in doubt, cascade: Towards building efficient and capable guardrails. Proceedings of the AAAI\/ACM Conference on AI, Ethics, and Society,8, 1812\u20131821.","DOI":"10.1609\/aies.v8i2.36676"},{"key":"7060_CR113","doi-asserted-by":"publisher","unstructured":"Ogundoyin, S., Ikram, M., Asghar, H., Zhao, B., & Kaafar, D. (2025). A large-scale empirical analysis of custom gpts\u2019 vulnerabilities in the openai ecosystem. https:\/\/doi.org\/10.48550\/arxiv.2505.08148","DOI":"10.48550\/arxiv.2505.08148"},{"key":"7060_CR114","unstructured":"O\u2019Neill, J., Subramanian, S., Lin, E., Satish, A., & Mugunthan, V. (2024a). Guardformer: Guardrail instruction pretraining for efficient safeguarding. In: Neurips safe generative AI workshop 2024."},{"key":"7060_CR115","unstructured":"O\u2019Neill, J., Subramanian, S., Lin, E., Satish, A., & Mugunthan, V. (2024b). Guardformer: Guardrail instruction pretraining for efficient safeguarding. In: Neurips safe generative AI workshop."},{"key":"7060_CR116","first-page":"27730","volume":"35","author":"L Ouyang","year":"2022","unstructured":"Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., & Ray, A. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730\u201327744.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"7060_CR117","doi-asserted-by":"publisher","unstructured":"Pan, J., Liu, X., & Xiao, C. 
(2025). Oet: Optimization-based prompt injection evaluation toolkit. https:\/\/doi.org\/10.48550\/arxiv.2505.00843","DOI":"10.48550\/arxiv.2505.00843"},{"key":"7060_CR118","unstructured":"Pantha, N., Ramasubramanian, M., Gurung, I., Maskey, M., & Ramachandran, R. (2024). Challenges in guardrailing large language models for science. arXiv Article 2411.08181."},{"key":"7060_CR119","doi-asserted-by":"crossref","unstructured":"Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., & Irving, G. (2022). Red teaming language models with language models. arXiv Article 2202.03286.","DOI":"10.18653\/v1\/2022.emnlp-main.225"},{"key":"7060_CR120","first-page":"13387","volume":"2023","author":"E Perez","year":"2023","unstructured":"Perez, E., Ringer, S., Lukosiute, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., & Kadavath, S. (2023). Discovering language model behaviors with model-written evaluations. Findings of the Association for Computational Linguistics: ACL, 2023, 13387\u201313434.","journal-title":"Findings of the Association for Computational Linguistics: ACL"},{"key":"7060_CR121","unstructured":"Perez, F., & Ribeiro, I. (2022). Ignore previous prompt: Attack techniques for language models. arXiv Article 2211.09527."},{"key":"7060_CR122","unstructured":"Phute, M., Helbling, A., Hull, M., Peng, S., Szyller, S., Cornelius, C., & Chau, D. H. (2023). Llm self defense: By self examination, llms know they are being tricked. arXiv Article 2308.07308."},{"key":"7060_CR123","doi-asserted-by":"publisher","unstructured":"Qi, W., Shao, S., Gu, W., Zheng, T., Zhao, P., Qin, Z., & Ren, K. (2025). Majic: Markovian adaptive jailbreaking via iterative composition of diverse innovative strategies. https:\/\/doi.org\/10.48550\/arxiv.2508.13048","DOI":"10.48550\/arxiv.2508.13048"},{"key":"7060_CR124","doi-asserted-by":"publisher","unstructured":"Qi, X., Huang, K., Panda, A., Wang, M., & Mittal, P. (2023). 
Visual adversarial examples jailbreak large language models. https:\/\/doi.org\/10.48550\/arxiv.2306.13213","DOI":"10.48550\/arxiv.2306.13213"},{"key":"7060_CR125","unstructured":"Rad, M. K., Nghiem, H., Luo, A., Wadhwa, S., Sorower, M., & Rawls, S. (2025). Refining input guardrails: Enhancing LLM-as-a-judge efficiency through chain-of-thought fine-tuning and alignment. arXiv Article 2501.13080."},{"key":"7060_CR126","first-page":"53728","volume":"36","author":"R Rafailov","year":"2023","unstructured":"Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 53728\u201353741.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"7060_CR127","unstructured":"Ravichandran, Z., Hounie, I., Cladera, F., Ribeiro, A., Pappas, G. J., & Kumar, V. (2025). Distilling on-device language models for robot planning with minimal human intervention. arXiv Article 2506.17486."},{"key":"7060_CR128","doi-asserted-by":"crossref","unstructured":"Rebedea, T., Dinu, R., Sreedhar, M., Parisien, C., & Cohen, J. (2023). NeMo guardrails: A toolkit for controllable and safe LLM applications with programmable rails. arXiv Article 2310.10501.","DOI":"10.18653\/v1\/2023.emnlp-demo.40"},{"key":"7060_CR129","doi-asserted-by":"publisher","unstructured":"Reddy, A., Zagula, A., & Saban, N. (2025). Autoadv: Automated adversarial prompting for multi-turn jailbreaking of large language models. https:\/\/doi.org\/10.48550\/arxiv.2507.01020","DOI":"10.48550\/arxiv.2507.01020"},{"key":"7060_CR130","doi-asserted-by":"publisher","unstructured":"Ren, J., Dras, M., & Naseem, U. (2025). Seeing the threat: Vulnerabilities in vision-language models to adversarial attack. 
https:\/\/doi.org\/10.48550\/arxiv.2505.21967","DOI":"10.48550\/arxiv.2505.21967"},{"key":"7060_CR131","unstructured":"Sarkar, P., Ebrahimi, S., Etemad, A., Beirami, A., Ar\u0131k, S. \u00d6., & Pfister, T. (2025). Mitigating object hallucination in MLLMs via data-augmented phrase-level alignment. arXiv Article 2405.18654."},{"key":"7060_CR132","unstructured":"Shamsujjoha, M., Lu, Q., Zhao, D., & Zhu, L. (2025). Swiss cheese model for AI safety: A taxonomy and reference architecture for multi-layered guardrails of foundation model based agents. arXiv Article 2408.02205."},{"key":"7060_CR134","unstructured":"Shayegani, E., Mamun, M. A. A., Fu, Y., Zaree, P., Dong, Y., & Abu-Ghazaleh, N. (2023b). Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv Article 2310.10844."},{"key":"7060_CR133","doi-asserted-by":"publisher","unstructured":"Shayegani, E., Mamun, M. A. A., Fu, Y., Zaree, P., Dong, Y., & Abu-Ghazaleh, N. B. (2023a). Survey of vulnerabilities in large language models revealed by adversarial attacks. https:\/\/doi.org\/10.48550\/arxiv.2310.10844","DOI":"10.48550\/arxiv.2310.10844"},{"key":"7060_CR135","unstructured":"Shi, D., Shen, T., Huang, Y., Li, Z., Leng, Y., Jin, R., Liu, C., Wu, X., Guo, Z., Yu, L., et al. (2024). Large language model safety: A holistic survey. arXiv:2412.17686"},{"key":"7060_CR137","doi-asserted-by":"crossref","unstructured":"Siddiq, M. L., Zhang, J., & Santos, J. C. D. S. (2024). Understanding regular expression denial of service (redos): Insights from llm-generated regexes and developer forums. In: Proceedings of the 32nd IEEE\/ACM international conference on program comprehension, pp. 190\u2013201","DOI":"10.1145\/3643916.3644424"},{"key":"7060_CR136","doi-asserted-by":"publisher","unstructured":"Si, S., Zhao, H., Chen, G., Gao, C., Bai, Y., Wang, Z., An, K., Luo, K., Qian, C., Qi, F., Chang, B., & Sun, M. (2025). 
Aligning large language models to follow instructions and hallucinate less via effective data filtering, pp. 16469\u201316488 https:\/\/doi.org\/10.48550\/arxiv.2502.07340","DOI":"10.48550\/arxiv.2502.07340"},{"key":"7060_CR138","unstructured":"Song, X., Wang, Z., He, K., Dong, G., Mou, Y., Zhao, J., & Xu, W. (2024). Knowledge editing on black-box large language models. arXiv Article 2402.08631."},{"key":"7060_CR139","doi-asserted-by":"crossref","unstructured":"Sreedhar, M. N., Rebedea, T., & Parisien, C. (2025). Safety through reasoning: An empirical study of reasoning guardrail models. arXiv Article 2505.20087.","DOI":"10.18653\/v1\/2025.findings-emnlp.1193"},{"key":"7060_CR140","doi-asserted-by":"publisher","unstructured":"Su, J. (2024). Enhancing adversarial attacks through chain of thought. https:\/\/doi.org\/10.48550\/arxiv.2410.21791","DOI":"10.48550\/arxiv.2410.21791"},{"key":"7060_CR141","unstructured":"Sun, H., & Schaar, M. (2024). Inverse-rlignment: Inverse reinforcement learning from demonstrations for llm alignment. arXiv Article 2405.15624."},{"key":"7060_CR142","unstructured":"Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T. B. (2023). Stanford alpaca: An instruction-following LLaMA model. GitHub"},{"key":"7060_CR143","unstructured":"Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et\u00a0al. (2023). Gemini: a family of highly capable multimodal models. arXiv:2312.11805"},{"key":"7060_CR144","unstructured":"Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et\u00a0al. (2024a). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530"},{"key":"7060_CR145","unstructured":"Team, G., Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivi\u00e8re, M., Kale, M.S., Love, J., et\u00a0al. (2024b). 
Gemma: Open models based on gemini research and technology. arXiv:2403.08295"},{"key":"7060_CR146","doi-asserted-by":"crossref","unstructured":"Tong, T., Xu, J., Liu, Q., & Chen, M. (2024). Securing multi-turn conversational language models from distributed backdoor triggers. arXiv: 2407.04151","DOI":"10.18653\/v1\/2024.findings-emnlp.750"},{"key":"7060_CR147","doi-asserted-by":"publisher","unstructured":"Tonmoy, S., Zaman, S.M.M., Jain, V., Rani, A., Rawte, V., Chadha, A., & Das, A. (2024). A comprehensive survey of hallucination mitigation techniques in large language models. https:\/\/doi.org\/10.48550\/arxiv.2401.01313","DOI":"10.48550\/arxiv.2401.01313"},{"key":"7060_CR148","unstructured":"Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et\u00a0al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288"},{"key":"7060_CR149","doi-asserted-by":"crossref","unstructured":"Varshney, N., Dolin, P., Seth, A., & Baral, C. (2023). The art of defending: A systematic evaluation and analysis of llm defense strategies on safety and over-defensiveness. arXiv:2401.00287","DOI":"10.18653\/v1\/2024.findings-acl.776"},{"key":"7060_CR150","doi-asserted-by":"crossref","unstructured":"Wallace, E., Zhao, T.Z., Feng, S., & Singh, S. (2021). Concealed data poisoning attacks on NLP models. arXiv: 2010.12563","DOI":"10.18653\/v1\/2021.naacl-main.13"},{"key":"7060_CR151","unstructured":"Wan, A., Wallace, E., Shen, S., & Klein, D. (2023). Poisoning language models during instruction tuning. arXiv: 2305.00944"},{"key":"7060_CR152","doi-asserted-by":"publisher","unstructured":"Wan, F., Huang, X., Cui, L., Quan, X., Bi, W., & Shi, S. (2024). Mitigating hallucinations of large language models via knowledge consistent alignment. 
https:\/\/doi.org\/10.48550\/arxiv.2401.10768","DOI":"10.48550\/arxiv.2401.10768"},{"key":"7060_CR153","unstructured":"Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu, C., Xiong, Z., Dutta, R., & Schaeffer, R. (2023a). Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. In: NeurIPS"},{"key":"7060_CR154","unstructured":"Wang, H., Fu, W., Tang, Y., Chen, Z., Huang, Y., Piao, J., Gao, C., Xu, F., Jiang, T., & Li, Y. (2025a). A survey on responsible llms: Inherent risk, malicious use, and mitigation strategy. arXiv:2501.09431"},{"key":"7060_CR155","doi-asserted-by":"publisher","unstructured":"Wang, H., Wu, B., Bian, Y., Chang, Y., Wang, X., & Zhao, P. (2024c). Probing the safety response boundary of large language models via unsafe decoding path generation. https:\/\/doi.org\/10.48550\/arxiv.2408.10668","DOI":"10.48550\/arxiv.2408.10668"},{"key":"7060_CR156","unstructured":"Wang, K., Zhang, G., Zhou, Z., Wu, J., Yu, M., Zhao, S., Yin, C., Fu, J., Yan, Y., Luo, H., et al. (2025d). A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment. arXiv:2504.15585"},{"key":"7060_CR157","unstructured":"Wang, M., Ren, K., Jalan, P., Ashraf, A., Vu, T.V., Seetharaman, R., Nawaz, S., & Naseem, U. (2026). From native memes to global moderation: Cross-cultural evaluation of vision-language models for hateful meme detection. https:\/\/api.semanticscholar.org\/CorpusID:285451876"},{"key":"7060_CR158","unstructured":"Wang, T., & Li, H. (2025). OpenGuardrails: A configurable, unified, and scalable guardrails platform for large language models. arXiv: 2510.19169"},{"key":"7060_CR159","unstructured":"Wang, X., Ji, Z., Wang, W., Li, Z., Wu, D., & Wang, S. (2025b). SoK: Evaluating jailbreak guardrails for large language models. arXiv: 2506.10597"},{"key":"7060_CR161","doi-asserted-by":"crossref","unstructured":"Wang, Y., Li, H., Han, X., Nakov, P., & Baldwin, T. (2023b). 
Do-not-answer: A dataset for evaluating safeguards in llms. arXiv:2308.13387","DOI":"10.18653\/v1\/2024.findings-eacl.61"},{"key":"7060_CR160","unstructured":"Wang, Y.-S., & Chang, Y. (2022). Toxicity detection with generative prompt-based inference. arXiv: 2205.12390"},{"key":"7060_CR162","doi-asserted-by":"crossref","unstructured":"Wang, Y., Weng, F., Yang, S., Qin, Z., Huang, M., & Wang, W. (2025c). Delman: Dynamic defense against large language model jailbreaking with model editing. arXiv:2502.11647","DOI":"10.18653\/v1\/2025.findings-acl.598"},{"key":"7060_CR163","unstructured":"Wang, Z., Bi, B., Pentyala, S.K., Ramnath, K., Chaudhuri, S., Mehrotra, S., Mao, X.-B., Asur, S., et\u00a0al. (2024a). A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more. arXiv:2407.16216"},{"key":"7060_CR164","unstructured":"Wang, Z., Tu, H., Mei, J., Zhao, B., Wang, Y., & Xie, C. (2024b). AttnGCG: Enhancing jailbreaking attacks on LLMs with attention manipulation. arXiv Article 2410.09040."},{"key":"7060_CR165","unstructured":"Wang, Z., Yang, F., Wang, L., Zhao, P., Wang, H., Chen, L., Lin, Q., & Wong, K.-F. (2024d). Self-guard: Empower the LLM to safeguard itself. arXiv Article 2310.15851."},{"key":"7060_CR166","first-page":"80079","volume":"36","author":"A Wei","year":"2023","unstructured":"Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 80079\u201380110.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"7060_CR167","unstructured":"Wei, J., Abdulrazzag, A., Zhang, T., Muursepp, A., & Saileshwar, G. (2026). When speculation spills secrets: Side channels via speculative decoding in LLMs. arXiv Article 2411.01076."},{"key":"7060_CR168","unstructured":"Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., Le, & Q.V. (2021). Finetuned language models are zero-shot learners. 
arXiv:2109.01652"},{"key":"7060_CR169","doi-asserted-by":"crossref","unstructured":"Welbl, J., Glaese, A., Uesato, J., Dathathri, S., Mellor, J., Hendricks, L.A., Anderson, K., Kohli, P., Coppin, B., Huang, P.-S. (2021) Challenges in detoxifying language models. In: Moens, M.-F., Huang, X., Specia, L., Yih, S.W.-t. (eds.) Findings of the association for computational linguistics: EMNLP 2021, pp. 2447\u20132469. Association for Computational Linguistics, Punta Cana, Dominican Republic (2021). https:\/\/doi.org\/10.18653\/v1\/2021.findings-emnlp.210. https:\/\/aclanthology.org\/2021.findings-emnlp.210\/","DOI":"10.18653\/v1\/2021.findings-emnlp.210"},{"key":"7060_CR170","unstructured":"Wen, M., Wan, Z., Wang, J., Zhang, W., & Wen, Y. (2024). Reinforcing llm agents via policy optimization with action decomposition. In: The thirty-eighth annual conference on neural information processing systems."},{"key":"7060_CR171","unstructured":"Wen, X., Mo, W. J., Xie, Y., Qi, P., & Chen, M. (2025). Towards policy-compliant agents: Learning efficient guardrails for policy violation detection. arXiv Article 2510.03485."},{"key":"7060_CR172","doi-asserted-by":"crossref","unstructured":"Wester, J., Schrills, T., Pohl, H., & Berkel, N. (2024). \u201dAs an ai language model, i cannot\u201d: Investigating llm denials of user requests. In: Proceedings of the 2024 CHI conference on human factors in computing systems, pp. 1\u201314","DOI":"10.1145\/3613904.3642135"},{"key":"7060_CR174","unstructured":"Wu, T., Ni, J., Hooi, B., Zhang, J., Ash, E., Ng, S.-K., Sachan, M., & Leippold, M. (2025b). Balancing truthfulness and informativeness with uncertainty-aware instruction fine-tuning. arXiv: 2502.11962"},{"key":"7060_CR175","doi-asserted-by":"crossref","unstructured":"Wu, T., Yuan, W., Golovneva, O., Xu, J., Tian, Y., Jiao, J., Weston, J., & Sukhbaatar, S. (2024). Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge. 
arXiv:2407.19594","DOI":"10.18653\/v1\/2025.emnlp-main.583"},{"key":"7060_CR176","unstructured":"Wu, T., Zhu, B., Zhang, R., Wen, Z., Ramchandran, K., & Jiao, J. (2023). Pairwise proximal policy optimization: Harnessing relative feedback for llm alignment. arXiv:2310.00212"},{"key":"7060_CR173","unstructured":"Wu, Y., Guo, J., Li, D., Zou, H. P., Huang, W.-C., Chen, Y., Wang, Z., Zhang, W., Li, Y., Zhang, M., Jiang, R., & Yu, P. S. (2025a). PSG-agent: Personality-aware safety guardrail for LLM-based agents. arXiv Article 2509.23614."},{"key":"7060_CR177","doi-asserted-by":"publisher","unstructured":"Wu, Y.-H., Xiong, Y., Zhang, H., Zhang, J.-C., & Zhou, Z. (2025c). Sugar-coated poison: Benign generation unlocks jailbreaking. In: Findings of the association for computational linguistics: EMNLP 2025. https:\/\/doi.org\/10.18653\/v1\/2025.findings-emnlp.512","DOI":"10.18653\/v1\/2025.findings-emnlp.512"},{"key":"7060_CR178","doi-asserted-by":"crossref","unstructured":"Xiong, C., Qi, X., Chen, P.-Y., & Ho, T.-Y. (2024). Defensive prompt patch: A robust and interpretable defense of llms against jailbreak attacks. arXiv:2405.20099","DOI":"10.18653\/v1\/2025.findings-acl.23"},{"key":"7060_CR179","unstructured":"Xu, X., Kong, K., Liu, N., Cui, L., Wang, D., Zhang, J., & Kankanhalli, M. (2023). An llm can fool itself: A prompt-based adversarial attack. arXiv:2310.13345"},{"key":"7060_CR180","doi-asserted-by":"crossref","unstructured":"Xu, Z., Liu, Y., Deng, G., Li, Y., & Picek, S. (2024a). A comprehensive study of jailbreak attack versus defense for large language models. arXiv:2402.13457","DOI":"10.18653\/v1\/2024.findings-acl.443"},{"key":"7060_CR181","doi-asserted-by":"publisher","unstructured":"Xu, Z., Liu, Y., Deng, G., Li, Y., & Picek, S. (2024). A comprehensive study of jailbreak attack versus defense for large language models. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Findings of the association for computational linguistics: ACL 2024 (pp. 7432\u20137449). 
Association for Computational Linguistics. https:\/\/doi.org\/10.18653\/v1\/2024.findings-acl.443","DOI":"10.18653\/v1\/2024.findings-acl.443"},{"key":"7060_CR182","doi-asserted-by":"crossref","unstructured":"Yadav, N., Masud, S., Goyal, V., Goyal, V., Akhtar, M.S., & Chakraborty, T. (2024). Tox-BART: leveraging toxicity attributes for explanation generation of implicit hate speech. arXiv: 2406.03953","DOI":"10.18653\/v1\/2024.findings-acl.831"},{"key":"7060_CR183","doi-asserted-by":"publisher","unstructured":"Yamaguchi, K., Etheridge, B., & Arditi, A. (2025). Adversarial manipulation of reasoning models using internal representations. https:\/\/doi.org\/10.48550\/arxiv.2507.03167","DOI":"10.48550\/arxiv.2507.03167"},{"key":"7060_CR184","unstructured":"Yang, A. X., Robeyns, M., Coste, T., Shi, Z., Wang, J., Bou-Ammar, H., & Aitchison, L. (2024a). Bayesian reward models for llm alignment. arXiv Article 2402.13210."},{"key":"7060_CR185","unstructured":"Yang, Y., Dan, S., Roth, D., & Lee, I. (2024b). Benchmarking llm guardrails in handling multilingual toxicity. arXiv Article 2410.22153."},{"key":"7060_CR186","doi-asserted-by":"publisher","unstructured":"Yang, Y., Xiao, Z., Lu, X., Wang, H., Wei, X., Huang, H., Chen, G., & Chen, Y. (2024c). Seqar: Jailbreak llms with sequential auto-generated characters, pp. 912\u2013931 https:\/\/doi.org\/10.18653\/v1\/2025.naacl-long.42","DOI":"10.18653\/v1\/2025.naacl-long.42"},{"key":"7060_CR187","doi-asserted-by":"crossref","unstructured":"Yao, Y., Duan, J., Xu, K., Cai, Y., Sun, Z., & Zhang, Y. (2024). A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, p. 100211","DOI":"10.1016\/j.hcc.2024.100211"},{"key":"7060_CR188","doi-asserted-by":"publisher","unstructured":"Yao, Y., Tong, X., Wang, R., Wang, Y., Li, L., Liu, L., Teng, Y., & Wang, Y. (2025). A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos. 
https:\/\/doi.org\/10.48550\/arxiv.2502.15806","DOI":"10.48550\/arxiv.2502.15806"},{"key":"7060_CR189","doi-asserted-by":"crossref","unstructured":"Ying, Z., Liu, A., Liang, S., Huang, L., Guo, J., Zhou, W., Liu, X., & Tao, D. (2024). Safebench: A safety evaluation framework for multimodal large language models. arXiv Article 2410.18927.","DOI":"10.1007\/s11263-025-02613-1"},{"key":"7060_CR190","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/csde59766.2023.10487667","volume":"2023","author":"DW Yip","year":"2023","unstructured":"Yip, D. W., Esmradi, A., & Chan, C. (2023). A novel evaluation framework for assessing resilience against prompt injection attacks in large language models. IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), 2023, 1\u20135. https:\/\/doi.org\/10.1109\/csde59766.2023.10487667","journal-title":"IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)"},{"key":"7060_CR191","doi-asserted-by":"publisher","unstructured":"Yu, J., Wu, Y., Shu, D., Jin, M., & Xing, X. (2023). Assessing prompt injection risks in 200+ custom gpts. https:\/\/doi.org\/10.48550\/arxiv.2311.11538","DOI":"10.48550\/arxiv.2311.11538"},{"key":"7060_CR192","doi-asserted-by":"crossref","unstructured":"Yu, Q., Li, J., Wei, L., Pang, L., Ye, W., Qin, B., Tang, S., Tian, Q., & Zhuang, Y. (2024a). HalluciDoctor: Mitigating hallucinatory toxicity in visual instruction data. arXiv Article 2311.13614.","DOI":"10.1109\/CVPR52733.2024.01230"},{"key":"7060_CR193","doi-asserted-by":"publisher","unstructured":"Yu, Z., Liu, X., Liang, S., Cameron, Z., Xiao, C., & Zhang, N. (2024b). Don\u2019t listen to me: Understanding and exploring jailbreak prompts of large language models. https:\/\/doi.org\/10.48550\/arxiv.2403.17336","DOI":"10.48550\/arxiv.2403.17336"},{"key":"7060_CR194","unstructured":"Zeng, Y., Wu, Y., Zhang, X., Wang, H., & Wu, Q. (2024). Autodefense: Multi-agent llm defense against jailbreak attacks. 
arXiv:2403.04783"},{"key":"7060_CR195","unstructured":"Zhang, H., Huang, J., Mei, K., Yao, Y., Wang, Z., Zhan, C., Wang, H., & Zhang, Y. (2024a) Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. arXiv:2410.02644"},{"issue":"1","key":"7060_CR196","doi-asserted-by":"publisher","DOI":"10.1016\/j.jnlest.2025.100301","volume":"23","author":"R Zhang","year":"2025","unstructured":"Zhang, R., Li, H.-W., Qian, X.-Y., Jiang, W.-B., & Chen, H.-X. (2025b). On large language models safety, security, and privacy: A survey. Journal of Electronic Science and Technology, 23(1), Article Article 100301.","journal-title":"Journal of Electronic Science and Technology"},{"key":"7060_CR197","unstructured":"Zhang, S., Zhao, J., Dong, H., Xu, R., Li, Z., Zhang, Y., Li, S., Wen, Y., Xia, C., Wang, Z., Feng, X., & Cui, H. (2026). Beyond prompts: Space-time decoupling control-plane jailbreaks in LLM structured output. arXiv: 2503.24191"},{"key":"7060_CR198","unstructured":"Zhang, X., Zhang, C., Li, T., Huang, Y., Jia, X., Hu, M., Zhang, J., Liu, Y., Ma, S., & Shen, C. (2023c). Jailguard: A universal detection framework for llm prompt-based attacks. arXiv:2312.10766"},{"key":"7060_CR199","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Li, M., Han, W., Yao, Y., Cen, Z., & Zhao, D. (2025a). Safety is not only about refusal: Reasoning-enhanced fine-tuning for interpretable llm safety. arXiv:2503.05021","DOI":"10.18653\/v1\/2025.findings-acl.960"},{"key":"7060_CR200","unstructured":"Zhang, Y., Rando, J., Evtimov, I., Chi, J., Smith, E.M., Carlini, N., Tram\u00e8r, F., & Ippolito, D. (2024c). Persistent pre-training poisoning of LLMs. arXiv: 2410.13722"},{"key":"7060_CR201","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Lei, L., Wu, L., Sun, R., Huang, Y., Long, C., Liu, X., Lei, X., Tang, J., & Huang, M. (2023a). Safetybench: Evaluating the safety of large language models. 
arXiv:2309.07045","DOI":"10.18653\/v1\/2024.acl-long.830"},{"key":"7060_CR202","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Lei, L., Wu, L., Sun, R., Huang, Y., Long, C., Liu, X., Lei, X., Tang, J., & Huang, M. (2024b). Safetybench: Evaluating the safety of large language models. In: Proceedings of the 62nd annual meeting of the association for computational linguistics (Volume 1: Long Papers), pp. 15537\u201315553","DOI":"10.18653\/v1\/2024.acl-long.830"},{"key":"7060_CR203","unstructured":"Zhang, Z., Yang, J., Ke, P., Mi, F., Wang, H., & Huang, M. (2023b). Defending large language models against jailbreaking attacks through goal prioritization. arXiv Article 2311.09096."},{"key":"7060_CR204","doi-asserted-by":"crossref","unstructured":"Zhao, W., Li, Z., Li, Y., Zhang, Y., & Sun, J. (2024). Defending large language models against jailbreak attacks via layer-specific editing. arXiv Article 2405.18166.","DOI":"10.18653\/v1\/2024.findings-emnlp.293"},{"key":"7060_CR205","doi-asserted-by":"crossref","unstructured":"Zhao, Y., Zhu, J., Xu, C., Liu, Y., & Li, X. (2025). Enhancing LLM-based hatred and toxicity detection with meta-toxic knowledge graph. arXiv Article 2412.15268.","DOI":"10.18653\/v1\/2025.findings-acl.1270"},{"key":"7060_CR206","unstructured":"Zheng, A., Rana, M., & Stolcke, A. (2024a). Lightweight safety guardrails using fine-tuned BERT embeddings. arXiv Article 2411.14398."},{"key":"7060_CR207","unstructured":"Zheng, A., Rana, M., & Stolcke, A. (2024b). Lightweight safety guardrails using fine-tuned BERT embeddings. arXiv Article 2411.14398."},{"key":"7060_CR208","first-page":"46595","volume":"36","author":"L Zheng","year":"2023","unstructured":"Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., & Xing, E. (2023). Judging LLM-as-a-judge with MT-bench and chatbot arena. 
Advances in Neural Information Processing Systems, 36, 46595\u201346623.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"7060_CR209","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Ni, T., Lee, W.-B., & Zhao, Q. (2025). A survey on backdoor threats in large language models (LLMs): Attacks, defenses, and evaluations. arXiv Article 2502.05224.","DOI":"10.53941\/tai.2025.100003"},{"key":"7060_CR212","doi-asserted-by":"crossref","unstructured":"Zhuang, J., Jin, H., Zhang, Y., Kang, Z., Zhang, W., Dagher, G. G., & Wang, H. (2025). Exploring the vulnerability of the content moderation guardrail in large language models via intent manipulation. arXiv Article 2505.18556.","DOI":"10.18653\/v1\/2025.findings-emnlp.114"},{"key":"7060_CR210","doi-asserted-by":"publisher","unstructured":"Zhu, R., Jiang, Z., Wu, J., Ma, Z., Song, J., Bai, F., Lin, D., Wu, L., & He, C. (2025). Grait: Gradient-driven refusal-aware instruction tuning for effective hallucination mitigation, pp. 4006\u20134021 https:\/\/doi.org\/10.48550\/arxiv.2502.05911","DOI":"10.48550\/arxiv.2502.05911"},{"key":"7060_CR211","unstructured":"Zhu, S., Zhang, R., An, B., Wu, G., Barrow, J., Wang, Z., Huang, F., Nenkova, A., & Sun, T. (2023). Autodan: Interpretable gradient-based adversarial attacks on large language models. arXiv Article 2310.15140."},{"key":"7060_CR213","unstructured":"Zizzo, G., Cornacchia, G., Fraser, K., Hameed, M. Z., Rawat, A., Buesser, B., Purcell, M., Chen, P.-Y., Sattigeri, P., & Varshney, K. (2025). Adversarial prompt evaluation: Systematic benchmarking of guardrails against prompt input attacks on LLMs. arXiv Article 2502.15427."},{"key":"7060_CR214","unstructured":"Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. 
arXiv Article 2307.15043."}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-026-07060-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-026-07060-8","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-026-07060-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,20]],"date-time":"2026-05-20T11:25:49Z","timestamp":1779276349000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-026-07060-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,5,20]]},"references-count":215,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2026,6]]}},"alternative-id":["7060"],"URL":"https:\/\/doi.org\/10.1007\/s10994-026-07060-8","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"value":"0885-6125","type":"print"},{"value":"1573-0565","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,5,20]]},"assertion":[{"value":"16 August 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 April 2026","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 April 2026","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 May 2026","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no conflict of 
interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"130"}}