{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,9]],"date-time":"2025-09-09T21:41:38Z","timestamp":1757454098716,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":66,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,6,3]],"date-time":"2024-06-03T00:00:00Z","timestamp":1717372800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,6,3]]},"DOI":"10.1145\/3630106.3659042","type":"proceedings-article","created":{"date-parts":[[2024,6,5]],"date-time":"2024-06-05T09:14:21Z","timestamp":1717578861000},"page":"2362-2373","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Analyzing And Editing Inner Mechanisms of Backdoored Language Models"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6405-513X","authenticated-orcid":false,"given":"Max","family":"Lamparth","sequence":"first","affiliation":[{"name":"Stanford University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7913-9296","authenticated-orcid":false,"given":"Anka","family":"Reuel","sequence":"additional","affiliation":[{"name":"Stanford University, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,6,5]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Sniper Backdoor: Single Client Targeted Backdoor Attack in Federated Learning. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)","author":"Abad Gorka","year":"2023","unstructured":"Gorka Abad, Servio Paguada, O\u011fuzhan Ersoy, Stjepan Picek, V\u00edctor\u00a0Julio Ram\u00edrez-Dur\u00e1n, and Aitor Urbieta. 2023. Sniper Backdoor: Single Client Targeted Backdoor Attack in Federated Learning. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) (Raleigh, NC, USA, 2023-02). IEEE, 377\u2013391."},{"key":"e_1_3_2_1_2_1","volume-title":"Venomave: Targeted Poisoning Against Speech Recognition. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)","author":"Aghakhani Hojjat","year":"2023","unstructured":"Hojjat Aghakhani, Lea Sch\u00f6nherr, Thorsten Eisenhofer, Dorothea Kolossa, Thorsten Holz, Christopher Kruegel, and Giovanni Vigna. 2023. Venomave: Targeted Poisoning Against Speech Recognition. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) (Raleigh, NC, USA, 2023-02). IEEE, 404\u2013417."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330701"},{"key":"e_1_3_2_1_4_1","volume-title":"Spinning Sequence-to-Sequence Models with Meta-Backdoors. arXiv preprint arXiv:2107.10443","author":"Bagdasaryan Eugene","year":"2021","unstructured":"Eugene Bagdasaryan and Vitaly Shmatikov. 2021. Spinning Sequence-to-Sequence Models with Meta-Backdoors. arXiv preprint arXiv:2107.10443 (2021)."},{"key":"e_1_3_2_1_5_1","volume-title":"Identifying and Mitigating the Security Risks of Generative AI. arXiv preprint arXiv:2308.14840","author":"Barrett Clark","year":"2023","unstructured":"Clark Barrett, Brad Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita\u00a0Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, Kathleen Fisher, Tatsunori Hashimoto, Dan Hendrycks, Somesh Jha, Daniel Kang, Florian Kerschbaum, Eric Mitchell, John Mitchell, Zulfikar Ramzan, Khawaja Shams, Dawn Song, Ankur Taly, and Diyi Yang. 2023. Identifying and Mitigating the Security Risks of Generative AI. arXiv preprint arXiv:2308.14840 (2023)."},{"volume-title":"Advances in Neural Information Processing Systems, D.\u00a0Lee, M.\u00a0Sugiyama, U.\u00a0Luxburg, I.\u00a0Guyon, and R.\u00a0Garnett (Eds.). Vol.\u00a029. Curran Associates","author":"Bolukbasi Tolga","key":"e_1_3_2_1_6_1","unstructured":"Tolga Bolukbasi, Kai-Wei Chang, James\u00a0Y Zou, Venkatesh Saligrama, and Adam\u00a0T Kalai. 2016. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems, D.\u00a0Lee, M.\u00a0Sugiyama, U.\u00a0Luxburg, I.\u00a0Guyon, and R.\u00a0Garnett (Eds.). Vol.\u00a029. Curran Associates, Inc."},{"key":"e_1_3_2_1_7_1","volume-title":"Discovering Latent Knowledge in Language Models Without Supervision. In The Eleventh International Conference on Learning Representations (2022-09-29)","author":"Burns Collin","year":"2022","unstructured":"Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2022. Discovering Latent Knowledge in Language Models Without Supervision. In The Eleventh International Conference on Learning Representations (2022-09-29)."},{"key":"e_1_3_2_1_8_1","volume-title":"Semantics derived automatically from language corpora contain human-like biases. Science 356, 6334","author":"Caliskan Aylin","year":"2017","unstructured":"Aylin Caliskan, Joanna\u00a0J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356, 6334 (2017), 183\u2013186."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.23915\/distill.00024.001"},{"key":"e_1_3_2_1_10_1","volume-title":"Poisoning Web-Scale Training Datasets is Practical. arXiv preprint arXiv:2302.10149","author":"Carlini Nicholas","year":"2023","unstructured":"Nicholas Carlini, Matthew Jagielski, Christopher\u00a0A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tram\u00e8r. 2023. Poisoning Web-Scale Training Datasets is Practical. arXiv preprint arXiv:2302.10149 (2023)."},{"key":"e_1_3_2_1_11_1","volume-title":"Thirty-seventh Conference on Neural Information Processing Systems.","author":"Carlini Nicholas","year":"2023","unstructured":"Nicholas Carlini, Milad Nasr, Christopher\u00a0A. Choquette-Choo, Matthew Jagielski, Irena Gao, Pang\u00a0Wei Koh, Daphne Ippolito, Florian Tram\u00e8r, and Ludwig Schmidt. 2023. Are aligned neural networks adversarially aligned?. In Thirty-seventh Conference on Neural Information Processing Systems."},{"key":"e_1_3_2_1_12_1","volume-title":"SoK: Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. In First IEEE Conference on Secure and Trustworthy Machine Learning.","author":"Casper Stephen","year":"2023","unstructured":"Stephen Casper, Tilman Rauker, Anson Ho, and Dylan Hadfield-Menell. 2023. SoK: Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. In First IEEE Conference on Secure and Trustworthy Machine Learning."},{"key":"e_1_3_2_1_13_1","volume-title":"nithum, and Will Cukierski","author":"Sorensen Jeffrey","year":"2017","unstructured":"Cjadams, Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, nithum, and Will Cukierski. 2017. Toxic Comment Classification Challenge."},{"key":"e_1_3_2_1_14_1","volume-title":"A mathematical framework for transformer circuits. Transformer Circuits Thread","author":"Elhage Nelson","year":"2021","unstructured":"Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread (2021)."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.naacl-main.214"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.1720347115"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.emnlp-main.3"},{"key":"e_1_3_2_1_18_1","unstructured":"Aaron Gokaslan Vanya Cohen Ellie Pavlick and Stefanie Tellex. 2019. OpenWebText Corpus. http:\/\/skylion007.github.io\/OpenWebTextCorpus"},{"key":"e_1_3_2_1_19_1","volume-title":"Planting Undetectable Backdoors in Machine Learning Models. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 931\u2013942","author":"Goldwasser Shafi","year":"2022","unstructured":"Shafi Goldwasser, Michael\u00a0P. Kim, Vinod Vaikuntanathan, and Or Zamir. 2022. Planting Undetectable Backdoors in Machine Learning Models. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 931\u2013942."},{"key":"e_1_3_2_1_20_1","volume-title":"Risks from Language Models for Automated Mental Healthcare: Ethics and Structure for Implementation. medRxiv: 10.1101\/2024.04.07.24305462","author":"Grabb Declan","year":"2024","unstructured":"Declan Grabb, Max Lamparth, and Nina Vasan. 2024. Risks from Language Models for Automated Mental Healthcare: Ethics and Structure for Implementation. medRxiv: 10.1101\/2024.04.07.24305462 (2024)."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/SaTML54575.2023.00040"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.eacl-main.19"},{"key":"e_1_3_2_1_23_1","volume-title":"Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916","author":"Hendrycks Dan","year":"2021","unstructured":"Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. 2021. Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916 (2021)."},{"key":"e_1_3_2_1_24_1","volume-title":"LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (2021-10-06)","author":"Hu J.","year":"2021","unstructured":"Edward\u00a0J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (2021-10-06)."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1014052.1014073"},{"key":"e_1_3_2_1_26_1","volume-title":"Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv preprint arXiv:2309.00614","author":"Jain Neel","year":"2023","unstructured":"Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv preprint arXiv:2309.00614 (2023)."},{"key":"e_1_3_2_1_27_1","volume-title":"Backdoor Attacks on Time Series: A Generative Approach. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)","author":"Jiang Yujing","year":"2023","unstructured":"Yujing Jiang, Xingjun Ma, Sarah\u00a0Monazam Erfani, and James Bailey. 2023. Backdoor Attacks on Time Series: A Generative Approach. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) (Raleigh, NC, USA, 2023-02). IEEE, 392\u2013403."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.249"},{"key":"e_1_3_2_1_29_1","volume-title":"Human vs. Machine: Language Models and Wargames. arXiv preprint arXiv:2403.03407","author":"M. Lamparth","year":"2024","unstructured":"M. Lamparth 2024. Human vs. Machine: Language Models and Wargames. arXiv preprint arXiv:2403.03407 (2024)."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S19-1010"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-demo.21"},{"volume-title":"Security and Artificial Intelligence","author":"Li Shaofeng","key":"e_1_3_2_1_32_1","unstructured":"Shaofeng Li, Shiqing Ma, Minhui Xue, and Benjamin Zi\u00a0Hao Zhao. 2022. Deep learning backdoors. In Security and Artificial Intelligence. Springer, 313\u2013334."},{"key":"e_1_3_2_1_33_1","volume-title":"Backdoor Learning: A Survey","author":"Li Yiming","year":"2022","unstructured":"Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. 2022. Backdoor Learning: A Survey. IEEE Transactions on Neural Networks and Learning Systems (2022), 1\u201318."},{"volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Liang Paul\u00a0Pu","key":"e_1_3_2_1_34_1","unstructured":"Paul\u00a0Pu Liang, Irene\u00a0Mengze Li, Emily Zheng, Yao\u00a0Chong Lim, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Towards Debiasing Sentence Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 5502\u20135515."},{"key":"e_1_3_2_1_35_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol.\u00a0139)","author":"Liang Paul\u00a0Pu","year":"2021","unstructured":"Paul\u00a0Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2021. Towards Understanding and Mitigating Social Biases in Language Models. In Proceedings of the 38th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol.\u00a0139), Marina Meila and Tong Zhang (Eds.). PMLR, 6565\u20136576."},{"volume-title":"Research in Attacks","author":"Liu Kang","key":"e_1_3_2_1_36_1","unstructured":"Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. 2018. Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks. In Research in Attacks, Intrusions, and Defenses, Michael Bailey, Thorsten Holz, Manolis Stamatogiannakis, and Sotiris Ioannidis (Eds.). Springer International Publishing, Cham, 273\u2013294."},{"key":"e_1_3_2_1_37_1","volume-title":"Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv preprint arXiv:2305.13860","author":"Liu Yi","year":"2023","unstructured":"Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. 2023. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv preprint arXiv:2305.13860 (2023)."},{"key":"e_1_3_2_1_38_1","volume-title":"Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR).","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_1_39_1","unstructured":"Niru Maheswaranathan Alex Williams Matthew Golub Surya Ganguli and David Sussillo. 2019. Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics. In Advances in Neural Information Processing Systems Vol.\u00a032."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1062"},{"key":"e_1_3_2_1_41_1","unstructured":"Kevin Meng David Bau Alex\u00a0J Andonian and Yonatan Belinkov. 2022. Locating and Editing Factual Associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_2_1_42_1","volume-title":"The Eleventh International Conference on Learning Representations (2022-09-29)","author":"Nanda Neel","year":"2022","unstructured":"Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2022. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations (2022-09-29)."},{"key":"e_1_3_2_1_43_1","unstructured":"Nostalgebraist. 2020. Interpreting GPT: The Logit Lens. https:\/\/www.lesswrong.com\/posts\/AcKRB8wDpdaN6v6ru\/interpreting-gpt-the-logit-lens"},{"key":"e_1_3_2_1_44_1","volume-title":"In-context learning and induction heads. arXiv preprint arXiv:2209.11895","author":"Olsson Catherine","year":"2022","unstructured":"Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, 2022. In-context learning and induction heads. arXiv preprint arXiv:2209.11895 (2022)."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.emnlp-main.225"},{"key":"e_1_3_2_1_47_1","volume-title":"Learning to Generate Reviews and Discovering Sentiment. arXiv preprint arXiv:1704.01444","author":"Radford Alec","year":"2017","unstructured":"Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to Generate Reviews and Discovering Sentiment. arXiv preprint arXiv:1704.01444 (2017)."},{"key":"e_1_3_2_1_48_1","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans Ilya Sutskever 2018. Improving Language Understanding by Generative Pre-Training. (2018). https:\/\/cdn.openai.com\/research-covers\/language-unsupervised\/language_understanding_paper.pdf"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-4328"},{"volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Ravfogel Shauli","key":"e_1_3_2_1_50_1","unstructured":"Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. 2020. Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 7237\u20137256."},{"key":"e_1_3_2_1_51_1","volume-title":"Escalation Risks from Language Models in Military and Diplomatic Decision-Making. arXiv preprint arXiv:2401.03408","author":"P. Rivera","year":"2024","unstructured":"J.\u00a0P. Rivera 2024. Escalation Risks from Language Models in Military and Diplomatic Decision-Making. arXiv preprint arXiv:2401.03408 (2024)."},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01298"},{"key":"e_1_3_2_1_53_1","volume-title":"Adversarial Machine Learning-Industry Perspectives. In 2020 IEEE Security and Privacy Workshops (SPW). 69\u201375","author":"Siva\u00a0Kumar Ram\u00a0Shankar","year":"2020","unstructured":"Ram\u00a0Shankar Siva\u00a0Kumar, Magnus Nystr\u00f6m, John Lambert, Andrew Marshall, Mario Goertzel, Andi Comissoneru, Matt Swann, and Sharon Xia. 2020. Adversarial Machine Learning-Industry Perspectives. In 2020 IEEE Security and Privacy Workshops (SPW). 69\u201375."},{"key":"e_1_3_2_1_54_1","volume-title":"Activation Addition: Steering Language Models Without Optimization. arXiv preprint arXiv:2308.10248","author":"Turner Alexander\u00a0Matt","year":"2023","unstructured":"Alexander\u00a0Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. Activation Addition: Steering Language Models Without Optimization. arXiv preprint arXiv:2308.10248 (2023)."},{"key":"e_1_3_2_1_55_1","volume-title":"Attention is all you need. Advances in Neural Information Processing Systems (Neurips) 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan\u00a0N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems (Neurips) 30 (2017)."},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.13"},{"key":"e_1_3_2_1_57_1","volume-title":"Poisoning Language Models During Instruction Tuning. arXiv preprint arXiv:2305.00944","author":"Wan Alexander","year":"2023","unstructured":"Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning Language Models During Instruction Tuning. arXiv preprint arXiv:2305.00944 (2023)."},{"key":"e_1_3_2_1_58_1","volume-title":"The Eleventh International Conference on Learning Representations (2022-09-29)","author":"Wang Kevin\u00a0Ro","year":"2022","unstructured":"Kevin\u00a0Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2022. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small. In The Eleventh International Conference on Learning Representations (2022-09-29)."},{"key":"e_1_3_2_1_59_1","volume-title":"Jailbroken: How Does LLM Safety Training Fail?arXiv preprint arXiv:2107.10443","author":"Wei Alexander","year":"2023","unstructured":"Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail?arXiv preprint arXiv:2107.10443 (2023)."},{"key":"e_1_3_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_1_61_1","unstructured":"Dongxian Wu and Yisen Wang. 2021. Adversarial Neuron Pruning Purifies Backdoored Deep Models. In Advances in Neural Information Processing Systems A.\u00a0Beygelzimer Y.\u00a0Dauphin P.\u00a0Liang and J.\u00a0Wortman Vaughan (Eds.)."},{"key":"e_1_3_2_1_62_1","volume-title":"Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models. arXiv preprint arXiv:2305.14710","author":"Xu Jiashu","year":"2021","unstructured":"Jiashu Xu, Mingyu\u00a0Derek Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. 2021. Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models. arXiv preprint arXiv:2305.14710 (2021)."},{"key":"e_1_3_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/EuroSP51992.2021.00022"},{"volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun\u2019ichi Tsujii (Eds.)","author":"Zhao Jieyu","key":"e_1_3_2_1_64_1","unstructured":"Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning Gender-Neutral Word Embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun\u2019ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 4847\u20134853."},{"key":"e_1_3_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.11"},{"key":"e_1_3_2_1_66_1","volume-title":"Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043","author":"Zou Andy","year":"2023","unstructured":"Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J.\u00a0Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043 (2023)."}],"event":{"name":"FAccT '24: The 2024 ACM Conference on Fairness, Accountability, and Transparency","acronym":"FAccT '24","location":"Rio de Janeiro Brazil"},"container-title":["The 2024 ACM Conference on Fairness Accountability and Transparency"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3630106.3659042","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3630106.3659042","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T17:28:15Z","timestamp":1755883695000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3630106.3659042"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,3]]},"references-count":66,"alternative-id":["10.1145\/3630106.3659042","10.1145\/3630106"],"URL":"https:\/\/doi.org\/10.1145\/3630106.3659042","relation":{},"subject":[],"published":{"date-parts":[[2024,6,3]]},"assertion":[{"value":"2024-06-05","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}