{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,30]],"date-time":"2025-12-30T11:55:14Z","timestamp":1767095714977,"version":"3.48.0"},"publisher-location":"New York, NY, USA","reference-count":47,"publisher":"ACM","funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["IIS-2229876"],"award-info":[{"award-number":["IIS-2229876"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,13]]},"DOI":"10.1145\/3733799.3762966","type":"proceedings-article","created":{"date-parts":[[2025,12,30]],"date-time":"2025-12-30T11:38:49Z","timestamp":1767094729000},"page":"52-63","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Rethinking How to Evaluate Language Model Jailbreak"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9280-8493","authenticated-orcid":false,"given":"Hongyu","family":"Cai","sequence":"first","affiliation":[{"name":"Purdue University, West Lafayette, IN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-1631-6064","authenticated-orcid":false,"given":"Arjun","family":"Arunasalam","sequence":"additional","affiliation":[{"name":"Florida International University, Miami, FL, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-1561-0729","authenticated-orcid":false,"given":"Leo Y.","family":"Lin","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, IN, 
USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2862-5286","authenticated-orcid":false,"given":"Antonio","family":"Bianchi","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, IN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7362-8905","authenticated-orcid":false,"given":"Z. Berkay","family":"Celik","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, IN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,12,30]]},"reference":[{"key":"e_1_3_3_1_2_2","unstructured":"2018. doccano: Text Annotation Tool for Human. https:\/\/bit.ly\/3TRQD7P."},{"key":"e_1_3_3_1_3_2","unstructured":"2023. BLACKMAMBA: USING AI TO GENERATE POLYMORPHIC MALWARE. https:\/\/bit.ly\/3VgUCu7."},{"key":"e_1_3_3_1_4_2","unstructured":"2023. Generative AI Prohibited Use Policy. https:\/\/bit.ly\/3ITv1BK."},{"key":"e_1_3_3_1_5_2","unstructured":"2024. FBI Warns of Increasing Threat of Cyber Criminals Utilizing Artificial Intelligence. https:\/\/bit.ly\/3xhcFIt."},{"key":"e_1_3_3_1_6_2","unstructured":"2024. Gemma: Introducing new state-of-the-art open models. https:\/\/bit.ly\/3Rg2FGi."},{"key":"e_1_3_3_1_7_2","unstructured":"2024. Introducing Meta Llama 3: The most capable openly available LLM to date. https:\/\/bit.ly\/3XdF7FR."},{"key":"e_1_3_3_1_8_2","unstructured":"2024. Submission Replication. https:\/\/github.com\/purseclab\/jailbreak-evaluation\/."},{"key":"e_1_3_3_1_9_2","unstructured":"2024. Usage policies \u2014 openai.com. https:\/\/bit.ly\/3vrCYLc."},{"key":"e_1_3_3_1_10_2","unstructured":"Yuntao Bai Saurav Kadavath Sandipan Kundu Amanda Askell and Jackson et\u00a0al. Kernion. 2022. Constitutional ai: Harmlessness from ai feedback. 
arXiv (2022)."},{"key":"e_1_3_3_1_11_2","volume-title":"Natural language processing with Python: analyzing text with the natural language toolkit","author":"Bird Steven","year":"2009","unstructured":"Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit."},{"key":"e_1_3_3_1_12_2","unstructured":"Tom\u00a0B. Brown Benjamin Mann Nick Ryder Melanie Subbiah and Jared et\u00a0al. Kaplan. 2020. Language models are few-shot learners. International Conference on Neural Information Processing Systems (2020)."},{"key":"e_1_3_3_1_13_2","unstructured":"Patrick Chao Edoardo Debenedetti Alexander Robey Maksym Andriushchenko and Francesco\u00a0Croce et al.2024. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. International Conference on Neural Information Processing Systems (2024)."},{"key":"e_1_3_3_1_14_2","unstructured":"Patrick Chao Alexander Robey Edgar Dobriban Hamed Hassani and George J et\u00a0al. Pappas. 2023. Jailbreaking black box large language models in twenty queries. IEEE Conference on Secure and Trustworthy Machine Learning (2023)."},{"key":"e_1_3_3_1_15_2","unstructured":"Paul\u00a0Francis Christiano Jan Leike Tom\u00a0B. Brown Miljan Martic and Shane\u00a0Legg et al.2017. Deep Reinforcement Learning from Human Preferences. International Conference on Neural Information Processing Systems (2017)."},{"key":"e_1_3_3_1_16_2","unstructured":"Dena De\u00a0Angelo. 2024. The dark side of AI in cybersecurity \u2014 AI-Generated Malware. https:\/\/bit.ly\/3V7DOWi. Palo Alto Networks Blog (2024)."},{"key":"e_1_3_3_1_17_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics (2019)."},{"key":"e_1_3_3_1_18_2","unstructured":"Julius Endert. 2024. 
Generative AI is the ultimate disinformation amplifier. https:\/\/bit.ly\/3VdAn0f. DW.COM (2024)."},{"key":"e_1_3_3_1_19_2","unstructured":"Fredrik Heiding. 2024. AI will increase the quantity \u2014 and quality \u2014 of phishing scams. https:\/\/bit.ly\/3xdNiqY. Harvard Business Review (2024)."},{"key":"e_1_3_3_1_20_2","unstructured":"Yangsibo Huang Samyak Gupta Mengzhou Xia Kai Li and Danqi Chen. 2023. Catastrophic jailbreak of open-source llms via exploiting generation. International Conference on Learning Representations (2023)."},{"key":"e_1_3_3_1_21_2","unstructured":"Neel Jain Avi Schwarzschild Yuxin Wen Gowthami Somepalli and John\u00a0Kirchenbauer et al.2023. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv (2023)."},{"key":"e_1_3_3_1_22_2","doi-asserted-by":"crossref","unstructured":"Klaus Krippendorff. 2004. Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Human Communication Research (2004).","DOI":"10.1093\/hcr\/30.3.411"},{"key":"e_1_3_3_1_23_2","unstructured":"Christoph Leiter Piyawat Lertvittayakumjorn Marina Fomicheva Wei Zhao and Yang et\u00a0al. Gao. 2022. Towards explainable evaluation metrics for natural language generation. arXiv (2022)."},{"key":"e_1_3_3_1_24_2","unstructured":"Stephanie Lin Jacob Hilton and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Association for Computational Linguistics (2022)."},{"key":"e_1_3_3_1_25_2","unstructured":"Yi Liu Gelei Deng Zhengzi Xu Yuekang Li and Yaowen\u00a0Zheng et al.2023. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv (2023)."},{"key":"e_1_3_3_1_26_2","unstructured":"Mantas Mazeika Long Phan Xuwang Yin Andy Zou and Zifan et\u00a0al. Wang. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. 
International Conference on Machine Learning (2024)."},{"key":"e_1_3_3_1_27_2","unstructured":"Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll\u00a0L. Wainwright and Pamela\u00a0Mishkin et al.2022. Training language models to follow instructions with human feedback. International Conference on Neural Information Processing Systems (2022)."},{"key":"e_1_3_3_1_28_2","unstructured":"Jordan Pearson. 2024. Google research shows the fast rise of AI-generated misinformation. https:\/\/bit.ly\/3VgqCym. CBC (2024)."},{"key":"e_1_3_3_1_29_2","unstructured":"Mansi Phute Alec Helbling Matthew Hull ShengYun Peng and Sebastian\u00a0Szyller et al.2024. LLM Self Defense: By Self Examination LLMs Know They Are Being Tricked. Tiny Papers Track at International Conference on Learning Representations (2024)."},{"key":"e_1_3_3_1_30_2","unstructured":"Xiangyu Qi Yi Zeng Tinghao Xie Pin-Yu Chen and Ruoxi\u00a0Jia et al.2023. Fine-tuning Aligned Language Models Compromises Safety Even When Users Do Not Intend To! International Conference on Learning Representations (2023)."},{"key":"e_1_3_3_1_31_2","unstructured":"Delong Ran Jinyuan Liu Yichen Gong Jingyi Zheng and Xinlei\u00a0He et al.2024. JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models. arXiv (2024)."},{"key":"e_1_3_3_1_32_2","doi-asserted-by":"crossref","unstructured":"Traian Rebedea Razvan Dinu Makesh Sreedhar Christopher Parisien and Jonathan Cohen. 2023. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. Conference on Empirical Methods in Natural Language Processing: System Demonstrations (2023).","DOI":"10.18653\/v1\/2023.emnlp-demo.40"},{"key":"e_1_3_3_1_33_2","unstructured":"Alexander Robey Eric Wong Hamed Hassani and George\u00a0J. Pappas. 2023. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. 
Transactions on Machine Learning Research (2023)."},{"key":"e_1_3_3_1_34_2","doi-asserted-by":"crossref","unstructured":"Xinyue Shen Zeyuan Chen Michael Backes Yun Shen and Yang Zhang. 2024. \u201cDo Anything Now\u201d: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. ACM SIGSAC Conference on Computer and Communications Security (2024).","DOI":"10.1145\/3658644.3670388"},{"key":"e_1_3_3_1_35_2","doi-asserted-by":"crossref","unstructured":"Dong Shu Chong Zhang Mingyu Jin Zihao Zhou and Lingyao Li. 2025. AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models. SIGKDD Explor. Newsl. (2025).","DOI":"10.1145\/3748239.3748242"},{"key":"e_1_3_3_1_36_2","unstructured":"Alexandra Souly Qingyuan Lu Dillon Bowen Tu Trinh and Elvis\u00a0Hsieh et al.2024. A StrongREJECT for Empty Jailbreaks. International Conference on Neural Information Processing Systems (2024)."},{"key":"e_1_3_3_1_37_2","unstructured":"Hugo Touvron Louis Martin Kevin\u00a0R. Stone Peter Albert and Amjad\u00a0Almahairi et al.2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv (2023)."},{"key":"e_1_3_3_1_38_2","unstructured":"Pranshu Verma. 2023. The rise of AI fake news is creating a \u2018misinformation superspreader\u2019. https:\/\/wapo.st\/3VzEZiU. Washington Post (2023)."},{"key":"e_1_3_3_1_39_2","unstructured":"Bob Violino. 2023. AI tools such as ChatGPT are generating a mammoth increase in malicious phishing emails. https:\/\/cnb.cx\/3VlhcBY. CNBC (2023)."},{"key":"e_1_3_3_1_40_2","doi-asserted-by":"crossref","unstructured":"Eric Wallace Tony Zhao Shi Feng and Sameer Singh. 2021. Concealed Data Poisoning Attacks on NLP Models. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021).","DOI":"10.18653\/v1\/2021.naacl-main.13"},{"key":"e_1_3_3_1_41_2","unstructured":"Alexander Wan Eric Wallace Sheng Shen and Dan Klein. 2023. 
Poisoning Language Models During Instruction Tuning. International Conference on Machine Learning (2023)."},{"key":"e_1_3_3_1_42_2","unstructured":"Cunxiang Wang Xiaoze Liu Yuanhao Yue Xiangru Tang and Tianhang\u00a0Zhang et al. 2023. Survey on Factuality in Large Language Models: Knowledge Retrieval and Domain-Specificity. arXiv (2023)."},{"key":"e_1_3_3_1_43_2","unstructured":"Alexander Wei Nika Haghtalab and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail? International Conference on Neural Information Processing Systems (2023)."},{"key":"e_1_3_3_1_44_2","unstructured":"Zhangchen Xu Fengqing Jiang Luyao Niu Jinyuan Jia and Bill Yuchen\u00a0Lin et al. 2024. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding. Annual Meeting of the Association for Computational Linguistics (2024)."},{"key":"e_1_3_3_1_45_2","unstructured":"Jiahao Yu Xingwei Lin Zheng Yu and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv (2023)."},{"key":"e_1_3_3_1_46_2","unstructured":"Jiahao Yu Xingwei Lin Zheng Yu and Xinyu Xing. 2024. LLM-Fuzzer: Scaling Assessment of Large Language Model Jailbreaks. USENIX Security Symposium (2024)."},{"key":"e_1_3_3_1_47_2","unstructured":"Daniel\u00a0M. Ziegler Nisan Stiennon Jeff Wu Tom\u00a0B. Brown and Alec\u00a0Radford et al. 2019. Fine-Tuning Language Models from Human Preferences. arXiv (2019)."},{"key":"e_1_3_3_1_48_2","unstructured":"Andy Zou Zifan Wang J\u00a0Zico Kolter and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. 
arXiv (2023)."}],"event":{"name":"AISec '25: Proceedings of the 2025 Workshop on Artificial Intelligence and Security","sponsor":["SIGSAC ACM Special Interest Group on Security, Audit, and Control"],"location":"Taipei, Taiwan","acronym":"AISec '25"},"container-title":["Proceedings of the 18th ACM Workshop on Artificial Intelligence and Security"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3733799.3762966","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,30]],"date-time":"2025-12-30T11:52:46Z","timestamp":1767095566000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3733799.3762966"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,13]]},"references-count":47,"alternative-id":["10.1145\/3733799.3762966","10.1145\/3733799"],"URL":"https:\/\/doi.org\/10.1145\/3733799.3762966","relation":{},"subject":[],"published":{"date-parts":[[2025,10,13]]},"assertion":[{"value":"2025-12-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}