{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,13]],"date-time":"2025-10-13T00:47:35Z","timestamp":1760316455675,"version":"build-2065373602"},"publisher-location":"Cham","reference-count":61,"publisher":"Springer Nature Switzerland","isbn-type":[{"value":"9783032078834","type":"print"},{"value":"9783032078841","type":"electronic"}],"license":[{"start":{"date-parts":[[2025,10,13]],"date-time":"2025-10-13T00:00:00Z","timestamp":1760313600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"},{"start":{"date-parts":[[2025,10,13]],"date-time":"2025-10-13T00:00:00Z","timestamp":1760313600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026]]},"DOI":"10.1007\/978-3-032-07884-1_13","type":"book-chapter","created":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T16:22:48Z","timestamp":1760286168000},"page":"249-268","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Countering Jailbreak Attacks with\u00a0Two-Axis Pre-detection and\u00a0Conditional Warning 
Wrappers"],"prefix":"10.1007","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7123-4467","authenticated-orcid":false,"given":"Hyunsik","family":"Na","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0009-0003-2093-9824","authenticated-orcid":false,"given":"Hajun","family":"Kim","sequence":"additional","affiliation":[]},{"given":"Dooshik","family":"Yoon","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1438-0265","authenticated-orcid":false,"given":"Daeseon","family":"Choi","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,10,13]]},"reference":[{"key":"13_CR1","unstructured":"Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)"},{"key":"13_CR2","unstructured":"Andriushchenko, M., Croce, F., Flammarion, N.: Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151 (2024)"},{"key":"13_CR3","doi-asserted-by":"crossref","unstructured":"Arora, A., et al.: Detecting harmful content on online platforms: what platforms need vs. where research efforts go. ACM Comput. Surv. 56(3), 1\u201317 (2023)","DOI":"10.1145\/3603399"},{"key":"13_CR4","unstructured":"Bajcsy, A., Fisac, J.F.: Human-AI safety: a descendant of generative AI and control systems safety. arXiv preprint arXiv:2405.09794 (2024)"},{"key":"13_CR5","doi-asserted-by":"crossref","unstructured":"Bengesi, S., El-Sayed, H., Sarker, M.K., Houkpati, Y., Irungu, J., Oladunni, T.: Advancements in generative AI: a comprehensive review of GANs, GPT, autoencoders, diffusion model, and transformers. IEEE Access (2024)","DOI":"10.1109\/ACCESS.2024.3397775"},{"key":"13_CR6","doi-asserted-by":"crossref","unstructured":"Bhardwaj, R., Anh, D.D., Poria, S.: Language models are homer simpson! 
safety re-alignment of fine-tuned language models through task arithmetic (2024)","DOI":"10.18653\/v1\/2024.acl-long.762"},{"key":"13_CR7","doi-asserted-by":"crossref","unstructured":"Chang, Z., Li, M., Liu, Y., Wang, J., Wang, Q., Liu, Y.: Play guessing game with LLM: indirect jailbreak attack with implicit clues. In: Findings of the Association for Computational Linguistics ACL 2024, pp. 5135\u20135147 (2024)","DOI":"10.18653\/v1\/2024.findings-acl.304"},{"key":"13_CR8","unstructured":"DeepSeek-AI: Deepseek-r1: incentivizing reasoning capability in LLMs via reinforcement learning (2025). https:\/\/arxiv.org\/abs\/2501.12948"},{"key":"13_CR9","unstructured":"Deepset.ai: Deepset (2023). https:\/\/huggingface.co\/deepset\/deberta-v3-base-injection"},{"key":"13_CR10","unstructured":"Devlin, J.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)"},{"key":"13_CR11","doi-asserted-by":"crossref","unstructured":"Ding, N., et al.: Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233 (2023)","DOI":"10.18653\/v1\/2023.emnlp-main.183"},{"key":"13_CR12","doi-asserted-by":"publisher","unstructured":"Epivolis: Hyperion (revision e661b91) (2023). https:\/\/doi.org\/10.57967\/hf\/1108. https:\/\/huggingface.co\/Epivolis\/Hyperion","DOI":"10.57967\/hf\/1108"},{"issue":"1","key":"13_CR13","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1007\/s12599-023-00834-7","volume":"66","author":"S Feuerriegel","year":"2024","unstructured":"Feuerriegel, S., Hartmann, J., Janiesch, C., Zschech, P.: Generative AI. Bus. Inf. Syst. Eng. 66(1), 111\u2013126 (2024)","journal-title":"Bus. Inf. Syst. Eng."},{"key":"13_CR14","unstructured":"Fmops.ai: Fmops (2023). 
https:\/\/huggingface.co\/fmops\/distilbert-prompt-injection"},{"key":"13_CR15","doi-asserted-by":"crossref","unstructured":"Gehman, S., Gururangan, S., Sap, M., Choi, Y., Smith, N.A.: Realtoxicityprompts: evaluating neural toxic degeneration in language models. In: Findings of the Association for Computational Linguistics: EMNLP 2020 (2020)","DOI":"10.18653\/v1\/2020.findings-emnlp.301"},{"key":"13_CR16","unstructured":"He, P., Gao, J., Chen, W.: Debertav3: improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543 (2021)"},{"key":"13_CR17","unstructured":"Hosseini, H., Kannan, S., Zhang, B., Poovendran, R.: Deceiving Google\u2019s perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138 (2017)"},{"key":"13_CR18","unstructured":"Hurst, A., et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)"},{"key":"13_CR19","doi-asserted-by":"crossref","unstructured":"Jacob, D., Alzahrani, H., Hu, Z., Alomair, B., Wagner, D.: Promptshield: deployable detection for prompt injection attacks. arXiv preprint arXiv:2501.15145 (2025)","DOI":"10.1145\/3714393.3726501"},{"key":"13_CR20","first-page":"24678","volume":"36","author":"J Ji","year":"2023","unstructured":"Ji, J., et al.: Beavertails: towards improved safety alignment of LLM via a human-preference dataset. Adv. Neural. Inf. Process. Syst. 36, 24678\u201324704 (2023)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"13_CR21","first-page":"47094","volume":"37","author":"L Jiang","year":"2024","unstructured":"Jiang, L., et al.: Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. Adv. Neural. Inf. Process. Syst. 37, 47094\u201347165 (2024)","journal-title":"Adv. Neural. Inf. Process. 
Syst."},{"key":"13_CR22","first-page":"47094","volume":"37","author":"L Jiang","year":"2025","unstructured":"Jiang, L., et al.: Wildteaming at scale: from in-the-wild jailbreaks to (adversarially) safer language models. Adv. Neural. Inf. Process. Syst. 37, 47094\u201347165 (2025)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"13_CR23","unstructured":"Kang, H., et al.: Toxicity detection towards adaptability to changing perturbations. arXiv preprint arXiv:2412.15267 (2024)"},{"key":"13_CR24","doi-asserted-by":"crossref","unstructured":"Kim, S., Lee, G.: Adversarial DPO: harnessing harmful data for reducing toxicity with minimal impact on coherence and evasiveness in dialogue agents. In: Findings of the Association for Computational Linguistics: NAACL 2024, pp. 1821\u20131835 (2024)","DOI":"10.18653\/v1\/2024.findings-naacl.118"},{"key":"13_CR25","doi-asserted-by":"crossref","unstructured":"Lee, H., et al.: Square: a large-scale dataset of sensitive questions and acceptable responses created through human-machine collaboration. In: The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), pp. 6692\u20136712. Association for Computational Linguistics (2023)","DOI":"10.18653\/v1\/2023.acl-long.370"},{"key":"13_CR26","doi-asserted-by":"crossref","unstructured":"Li, L., et al.: Salad-bench: a hierarchical and comprehensive safety benchmark for large language models. In: Findings of the Association for Computational Linguistics: ACL 2024, pp. 3923\u20133954 (2024)","DOI":"10.18653\/v1\/2024.findings-acl.235"},{"key":"13_CR27","unstructured":"Li, R., Chen, M., Hu, C., Chen, H., Xing, W., Han, M.: Gentel-safe: A unified benchmark and shielding framework for defending against prompt injection attacks. arXiv preprint arXiv:2409.19521 (2024)"},{"key":"13_CR28","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., Doll\u00e1r, P.: Focal loss for dense object detection. 
In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980\u20132988 (2017)","DOI":"10.1109\/ICCV.2017.324"},{"key":"13_CR29","unstructured":"Liu, X., Xu, N., Chen, M., Xiao, C.: Autodan: generating stealthy jailbreak prompts on aligned large language models. In: The Twelfth International Conference on Learning Representations (2023)"},{"key":"13_CR30","unstructured":"Liu, Y., et al.: Prompt injection attack against LLM-integrated applications. arXiv preprint arXiv:2306.05499 (2023)"},{"key":"13_CR31","unstructured":"Liu, Y., et al.: Jailbreaking chatgpt via prompt engineering: an empirical study. arXiv preprint arXiv:2305.13860 (2023)"},{"key":"13_CR32","unstructured":"Liu, Y.: Roberta: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)"},{"key":"13_CR33","unstructured":"The llama 3 herd of models (2024). https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"13_CR34","doi-asserted-by":"crossref","unstructured":"Markov, T., et al.: A holistic approach to undesired content detection in the real world. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.\u00a037, pp. 15009\u201315018 (2023)","DOI":"10.1609\/aaai.v37i12.26752"},{"key":"13_CR35","first-page":"35181","volume":"235","author":"M Mazeika","year":"2024","unstructured":"Mazeika, M., et al.: Harmbench: a standardized evaluation framework for automated red teaming and robust refusal. Proc. Mach. Learn. Res. 235, 35181\u201335224 (2024)","journal-title":"Proc. Mach. Learn. Res."},{"key":"13_CR36","first-page":"61065","volume":"37","author":"A Mehrotra","year":"2024","unstructured":"Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., Karbasi, A.: Tree of attacks: jailbreaking black-box LLMs automatically. Adv. Neural. Inf. Process. Syst. 37, 61065\u201361105 (2024)","journal-title":"Adv. Neural. Inf. Process. 
Syst."},{"key":"13_CR37","unstructured":"Niwattanakul, S., Singthongchai, J., Naenudorn, E., Wanapu, S.: Using of jaccard coefficient for keywords similarity. In: Proceedings of the International Multiconference of Engineers and Computer Scientists, vol.\u00a01, pp. 380\u2013384 (2013)"},{"key":"13_CR38","unstructured":"ProtectAI.com: Fine-tuned deberta-v3 for prompt injection detection (2023). https:\/\/huggingface.co\/ProtectAI\/deberta-v3-base-prompt-injection"},{"key":"13_CR39","unstructured":"Ran, D., et al.: Jailbreakeval: an integrated toolkit for evaluating jailbreak attempts against large language models. arXiv preprint arXiv:2406.09321 (2024)"},{"key":"13_CR40","unstructured":"Rando, J., Tram\u00e8r, F.: Universal jailbreak backdoors from poisoned human feedback. arXiv preprint arXiv:2311.14455 (2023)"},{"key":"13_CR41","first-page":"24720","volume":"35","author":"M Rauh","year":"2022","unstructured":"Rauh, M., et al.: Characteristics of harmful text: towards rigorous benchmarking of language models. Adv. Neural. Inf. Process. Syst. 35, 24720\u201324739 (2022)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"13_CR42","unstructured":"Schulhoff, S., et al.: The prompt report: a systematic survey of prompting techniques. arXiv preprint arXiv:2406.06608 (2024)"},{"key":"13_CR43","doi-asserted-by":"crossref","unstructured":"Shen, X., Chen, Z., Backes, M., Shen, Y., Zhang, Y.: \u201cDo anything now\u201d: characterizing and evaluating in-the-wild jailbreak prompts on large language models. In: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 1671\u20131685 (2024)","DOI":"10.1145\/3658644.3670388"},{"key":"13_CR44","doi-asserted-by":"crossref","unstructured":"Shi, J., et al.: Optimization-based prompt injection attack to LLM-as-a-judge. In: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 
660\u2013674 (2024)","DOI":"10.1145\/3658644.3690291"},{"issue":"1","key":"13_CR45","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1108\/eb026526","volume":"28","author":"K Sparck Jones","year":"1972","unstructured":"Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 28(1), 11\u201321 (1972)","journal-title":"J. Doc."},{"key":"13_CR46","unstructured":"Taori, R., et al.: Stanford alpaca: an instruction-following llama model (2023). https:\/\/github.com\/tatsu-lab\/stanford_alpaca"},{"key":"13_CR47","unstructured":"Tedeschi, S., et al.: Alert: a comprehensive benchmark for assessing large language models\u2019 safety through red teaming. arXiv preprint arXiv:2404.08676 (2024)"},{"key":"13_CR48","unstructured":"Wan, S., et al.: Cyberseceval 3: advancing the evaluation of cybersecurity risks and capabilities in large language models. arXiv preprint arXiv:2408.01605 (2024)"},{"key":"13_CR49","unstructured":"Wang, L., et al.: Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022)"},{"key":"13_CR50","unstructured":"Wang, Y., Li, H., Han, X., Nakov, P., Baldwin, T.: Do-not-answer: a dataset for evaluating safeguards in LLMs. arXiv preprint arXiv:2308.13387 (2023)"},{"key":"13_CR51","unstructured":"Wei, A., Haghtalab, N., Steinhardt, J.: Jailbroken: how does LLM safety training fail? In: Advances in Neural Information Processing Systems, vol. 36 (2024)"},{"issue":"12","key":"13_CR52","doi-asserted-by":"publisher","first-page":"1486","DOI":"10.1038\/s42256-023-00765-8","volume":"5","author":"Y Xie","year":"2023","unstructured":"Xie, Y., et al.: Defending chatgpt against jailbreak attack via self-reminders. Nat. Mach. Intell. 5(12), 1486\u20131496 (2023)","journal-title":"Nat. Mach. Intell."},{"key":"13_CR53","unstructured":"Xu, Z., Liu, Y., Deng, G., Li, Y., Picek, S.: LLM jailbreak attack versus defense techniques\u2013a comprehensive study. 
arXiv preprint arXiv:2402.13457 (2024)"},{"key":"13_CR54","doi-asserted-by":"crossref","unstructured":"Yang, A., Yang, T.A.: Social dangers of generative artificial intelligence: review and guidelines. In: Proceedings of the 25th Annual International Conference on Digital Government Research, pp. 654\u2013658 (2024)","DOI":"10.1145\/3657054.3664243"},{"key":"13_CR55","unstructured":"Yu, J., Lin, X., Yu, Z., Xing, X.: Gptfuzzer: red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253 (2023)"},{"key":"13_CR56","unstructured":"Yu, J., Lin, X., Yu, Z., Xing, X.: $$\\{$$LLM-Fuzzer$$\\}$$: scaling assessment of large language model jailbreaks. In: 33rd USENIX Security Symposium (USENIX Security 2024), pp. 4657\u20134674 (2024)"},{"key":"13_CR57","doi-asserted-by":"crossref","unstructured":"Yu, S., Choi, J., Kim, Y.: Don\u2019t be a fool: pooling strategies in offensive language detection from user-intended adversarial attacks. In: Findings of the Association for Computational Linguistics: NAACL 2024, pp. 3456\u20133467 (2024)","DOI":"10.18653\/v1\/2024.findings-naacl.219"},{"key":"13_CR58","unstructured":"Yuan, X., et al.: S-eval: automatic and adaptive test generation for benchmarking safety evaluation of large language models. arXiv preprint arXiv:2405.14191 (2024)"},{"key":"13_CR59","doi-asserted-by":"crossref","unstructured":"Zeng, Y., Lin, H., Zhang, J., Yang, D., Jia, R., Shi, W.: How johnny can persuade LLMs to jailbreak them: rethinking persuasion to challenge AI safety by humanizing LLMs. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14322\u201314350 (2024)","DOI":"10.18653\/v1\/2024.acl-long.773"},{"key":"13_CR60","doi-asserted-by":"crossref","unstructured":"Zhan, Q., Fang, R., Bindu, R., Gupta, A., Hashimoto, T.B., Kang, D.: Removing RLHF protections in GPT-4 via fine-tuning. 
In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 681\u2013687 (2024)","DOI":"10.18653\/v1\/2024.naacl-short.59"},{"key":"13_CR61","unstructured":"Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)"}],"container-title":["Lecture Notes in Computer Science","Computer Security \u2013 ESORICS 2025"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-032-07884-1_13","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T16:23:03Z","timestamp":1760286183000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-032-07884-1_13"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,13]]},"ISBN":["9783032078834","9783032078841"],"references-count":61,"URL":"https:\/\/doi.org\/10.1007\/978-3-032-07884-1_13","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"value":"0302-9743","type":"print"},{"value":"1611-3349","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,13]]},"assertion":[{"value":"13 October 2025","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"The authors declare no competing financial or personal interests relevant to the content of this work.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Disclosure of Interests"}},{"value":"The TAPD model proposed in this study was developed at Soongsil University under a technology transfer agreement with eRoun&Company Co., Ltd. 
The model has been transferred and is currently being used for commercial purposes by eRoun&Company Co., Ltd. Due to contractual obligations, the trained model cannot be publicly released. For academic or collaborative inquiries, please contact the corresponding author.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Model Availability"}},{"value":"ESORICS","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"European Symposium on Research in Computer Security","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Toulouse","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"France","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2025","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"22 September 2025","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"24 September 2025","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"30","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"esorics2025","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/www.esorics2025.org\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}