{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T19:00:11Z","timestamp":1771614011984,"version":"3.50.1"},"reference-count":225,"publisher":"Association for Computing Machinery (ACM)","issue":"5","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2026,4,30]]},"abstract":"<jats:p>\n                    AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems grow more capable, so do risks from misalignment. To provide a comprehensive and up-to-date overview of the alignment field, in this survey, we delve into the core concepts, methodology, and practice of alignment. First, we identify four principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (\n                    <jats:bold>RICE<\/jats:bold>\n                    ). Guided by these four principles, we outline the landscape of current alignment research and decompose them into two key components:\n                    <jats:bold>forward alignment<\/jats:bold>\n                    and\n                    <jats:bold>backward alignment<\/jats:bold>\n                    . The former aims to make AI systems aligned via alignment training, while the latter aims to gain evidence about the systems\u2019 alignment and govern them appropriately to avoid exacerbating misalignment risks. On forward alignment, we discuss techniques for learning from feedback and learning under the distribution shift. Specifically, we survey traditional preference modeling methods and reinforcement learning from human feedback and further discuss potential frameworks to reach scalable oversight for tasks where effective human oversight is hard to obtain. Within learning under distribution shift, we also cover data distribution interventions such as adversarial training that helps expand the distribution of training data and algorithmic interventions to combat goal misgeneralization. On backward alignment, we discuss assurance techniques and governance practices. Specifically, we survey assurance methods of AI systems throughout their lifecycle, covering safety evaluation, interpretability, and human value compliance. We discuss current and prospective governance practices adopted by governments, industry actors, and other third parties, aimed at managing existing and future AI risks. This survey aims to provide a comprehensive yet beginner-friendly review of alignment research topics. 
Based on this, we also release and continually update the website\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/www.alignmentsurvey.com\">www.alignmentsurvey.com<\/jats:ext-link>\n                    which features tutorials, collections of papers, blog posts, and other resources.\n                  <\/jats:p>\n                  <jats:p\/>","DOI":"10.1145\/3770749","type":"journal-article","created":{"date-parts":[[2025,10,15]],"date-time":"2025-10-15T10:14:57Z","timestamp":1760523297000},"page":"1-38","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["AI Alignment: A Contemporary Survey"],"prefix":"10.1145","volume":"58","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3769-2077","authenticated-orcid":false,"given":"Jiaming","family":"Ji","sequence":"first","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-4554-3201","authenticated-orcid":false,"given":"Tianyi","family":"Qiu","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-7726-1142","authenticated-orcid":false,"given":"Boyuan","family":"Chen","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-7982-2439","authenticated-orcid":false,"given":"Jiayi","family":"Zhou","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-6470-4450","authenticated-orcid":false,"given":"Borong","family":"Zhang","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-7290-0495","authenticated-orcid":false,"given":"Donghai","family":"Hong","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-6797-4128","authenticated-orcid":false,"given":"Hantao","family":"Lou","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-8505-6745","authenticated-orcid":false,"given":"Kaile","family":"Wang","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5124-1192","authenticated-orcid":false,"given":"Yawen","family":"Duan","sequence":"additional","affiliation":[{"name":"University of Cambridge","place":["Cambridge, United Kingdom of Great Britain and Northern Ireland"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-5831-8885","authenticated-orcid":false,"given":"Zhonghao","family":"He","sequence":"additional","affiliation":[{"name":"University of Cambridge","place":["Cambridge, United Kingdom of Great Britain and Northern Ireland"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-2328-5867","authenticated-orcid":false,"given":"Lukas","family":"Vierling","sequence":"additional","affiliation":[{"name":"University of Oxford","place":["Oxford, United Kingdom of Great Britain and Northern Ireland"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-6138-2125","authenticated-orcid":false,"given":"Zhaowei","family":"Zhang","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, 
China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-9395-7819","authenticated-orcid":false,"given":"Fanzhi","family":"Zeng","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2315-573X","authenticated-orcid":false,"given":"Juntao","family":"Dai","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-1890-660X","authenticated-orcid":false,"given":"Xuehai","family":"Pan","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-4243-7943","authenticated-orcid":false,"given":"Hua","family":"Xu","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9471-4930","authenticated-orcid":false,"given":"Aidan","family":"O'Gara","sequence":"additional","affiliation":[{"name":"University of Southern California","place":["Los Angeles, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-1794-2882","authenticated-orcid":false,"given":"Kwan","family":"Ng","sequence":"additional","affiliation":[{"name":"Concordia AI","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-8639-4287","authenticated-orcid":false,"given":"Brian","family":"Tse","sequence":"additional","affiliation":[{"name":"Concordia AI","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4494-843X","authenticated-orcid":false,"given":"Jie","family":"Fu","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology","place":["Hong Kong, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3148-7646","authenticated-orcid":false,"given":"Stephen","family":"Mcaleer","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University","place":["Pittsburgh, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3196-2347","authenticated-orcid":false,"given":"Yanfeng","family":"Wang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University","place":["Shanghai, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-0391-9246","authenticated-orcid":false,"given":"Mingchuan","family":"Yang","sequence":"additional","affiliation":[{"name":"China Telecommunications Corporation","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1180-8078","authenticated-orcid":false,"given":"Yunhuai","family":"Liu","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9888-6409","authenticated-orcid":false,"given":"Yizhou","family":"Wang","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-9458-5583","authenticated-orcid":false,"given":"Song-Chun","family":"Zhu","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8401-282X","authenticated-orcid":false,"given":"Yike","family":"Guo","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology","place":["Hong Kong, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8132-5613","authenticated-orcid":false,"given":"Yaodong","family":"Yang","sequence":"additional","affiliation":[{"name":"Peking University","place":["Beijing, 
China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8894-1806","authenticated-orcid":false,"given":"Wen","family":"Gao","sequence":"additional","affiliation":[{"name":"Computer Science, Peking University","place":["Beijing, China"]}]}],"member":"320","published-online":{"date-parts":[[2025,11,21]]},"reference":[{"issue":"2","key":"e_1_3_3_2_2","doi-asserted-by":"crossref","first-page":"159","DOI":"10.1007\/s12559-018-9619-0","article-title":"Social integration of artificial intelligence: Functions, automation allocation logic and human-autonomy trust","volume":"11","author":"Abbass Hussein A.","year":"2019","unstructured":"Hussein A. Abbass. 2019. Social integration of artificial intelligence: Functions, automation allocation logic and human-autonomy trust. Cognitive Computation 11, 2 (2019), 159\u2013171.","journal-title":"Cognitive Computation"},{"key":"e_1_3_3_3_2","first-page":"1","article-title":"Apprenticeship learning via inverse reinforcement learning","volume":"1","author":"Abbeel Pieter","year":"2004","unstructured":"Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. Proceedings of the Twenty-First International Conference on Machine Learning 1 (2004), 1.","journal-title":"Proceedings of the Twenty-First International Conference on Machine Learning"},{"key":"e_1_3_3_4_2","first-page":"197","volume-title":"Proceedings of the Economics of Artificial Intelligence: An Agenda","author":"Acemoglu Daron","year":"2018","unstructured":"Daron Acemoglu and Pascual Restrepo. 2018. Artificial intelligence, automation, and work. In Proceedings of the Economics of Artificial Intelligence: An Agenda. University of Chicago Press, 197\u2013236."},{"key":"e_1_3_3_5_2","article-title":"Understanding intermediate layers using linear classifier probes","author":"Alain Guillaume","year":"2017","unstructured":"Guillaume Alain and Yoshua Bengio. 2017. Understanding intermediate layers using linear classifier probes. ICLR 2017.","journal-title":"ICLR 2017"},{"key":"e_1_3_3_6_2","unstructured":"David Alvarez Melis and Tommi Jaakkola. 2018. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems S. Bengio H. Wallach H. Larochelle K. Grauman N. Cesa-Bianchi and R. Garnett (Eds.). Vol. 31. Curran Associates Inc."},{"key":"e_1_3_3_7_2","unstructured":"Dario Amodei Chris Olah Jacob Steinhardt Paul Christiano John Schulman and Dan Man\u00e9. 2016. Concrete problems in AI safety. arXiv:1606.06565. Retrieved from https:\/\/arxiv.org\/abs\/1606.06565"},{"key":"e_1_3_3_8_2","unstructured":"Markus Anderljung Joslyn Barnhart Anton Korinek Jade Leung Cullen O\u2019Keefe Jess Whittlestone Shahar Avin Miles Brundage Justin Bullock Duncan Cass-Beggs Ben Chang Tantum Collins Tim Fist Gillian Hadfield Alan Hayes Lewis Ho Sara Hooker Eric Horvitz Noam Kolt Jonas Schuett Yonadav Shavit Divya Siddarth Robert Trager and Kevin Wolf. 2023. Frontier AI regulation: Managing emerging risks to public safety. arXiv:2307.03718. Retrieved from https:\/\/arxiv.org\/abs\/2307.03718"},{"key":"e_1_3_3_9_2","unstructured":"Anthropic. 2022. Softmax Linear Units. https:\/\/transformer-circuits.pub\/2022\/solu\/index.html. [Accessed: October 27 2025]."},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i8.16826"},{"key":"e_1_3_3_11_2","unstructured":"Martin Arjovsky L\u00e9on Bottou Ishaan Gulrajani and David Lopez-Paz. 2019. Invariant risk minimization. arXiv:1907.02893. 
Retrieved from https:\/\/arxiv.org\/abs\/1907.02893"},{"key":"e_1_3_3_12_2","unstructured":"Stuart Armstrong and S\u00f6ren Mindermann. 2017. Impossibility of deducing preferences and rationality from human policy. arXiv preprint arXiv:1712.05812 (2017)."},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2021.103500"},{"key":"e_1_3_3_14_2","unstructured":"Yuntao Bai Andy Jones Kamal Ndousse Amanda Askell Anna Chen Nova DasSarma Dawn Drain Stanislav Fort Deep Ganguli Tom Henighan Nicholas Joseph Saurav Kadavath Jackson Kernion Tom Conerly Sheer El-Showk Nelson Elhage Zac Hatfield-Dodds Danny Hernandez Tristan Hume Scott Johnston Shauna Kravec Liane Lovitt Neel Nanda Catherine Olsson Dario Amodei Tom Brown Jack Clark Sam McCandlish Chris Olah Ben Mann and Jared Kaplan. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862. Retrieved from https:\/\/arxiv.org\/abs\/2204.05862"},{"key":"e_1_3_3_15_2","unstructured":"Yuntao Bai Saurav Kadavath Sandipan Kundu Amanda Askell Jackson Kernion Andy Jones Anna Chen Anna Goldie Azalia Mirhoseini Cameron McKinnon Carol Chen Catherine Olsson Christopher Olah Danny Hernandez Dawn Drain Deep Ganguli Dustin Li Eli Tran-Johnson Ethan Perez Jamie Kerr Jared Mueller Jeffrey Ladish Joshua Landau Kamal Ndousse Kamile Lukosuite Liane Lovitt Michael Sellitto Nelson Elhage Nicholas Schiefer Noemi Mercado Nova DasSarma Robert Lasenby Robin Larson Sam Ringer Scott Johnston Shauna Kravec Sheer El Showk Stanislav Fort Tamera Lanham Timothy Telleen-Lawton Tom Conerly Tom Henighan Tristan Hume Samuel R. Bowman Zac Hatfield-Dodds Ben Mann Dario Amodei Nicholas Joseph Sam McCandlish Tom Brown and Jared Kaplan. 2022. Constitutional AI: Harmlessness from ai feedback. arXiv:2212.08073. Retrieved from https:\/\/arxiv.org\/abs\/2212.08073"},{"key":"e_1_3_3_16_2","first-page":"24639","article-title":"Video pretraining (vpt): Learning to act by watching unlabeled online videos","volume":"35","year":"2022","unstructured":"Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. 2022. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems 35 (2022), 24639\u201324654.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1162\/coli_a_00422"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.adn0117"},{"key":"e_1_3_3_19_2","unstructured":"Leonard Bereska and Efstratios Gavves. 2024. Mechanistic interpretability for AI safety\u2013a review. (2024)."},{"key":"e_1_3_3_20_2","unstructured":"Lukas Berglund Asa Cooper Stickland Mikita Balesni Max Kaufmann Meg Tong Tomasz Korbak Daniel Kokotajlo and Owain Evans. 2023. Taken out of context: On measuring situational awareness in LLMs. arXiv:2309.00667 (2023)."},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.5555\/3091125.3091145"},{"key":"e_1_3_3_22_2","unstructured":"Omar Besbes Will Ma and Omar Mouchtaki. 2022. Beyond IID: Data-driven decision-making in heterogeneous environments. In Advances in Neural Information Processing Systems S. Koyejo S. Mohamed A. Agarwal D. Belgrave K. Cho and A. Oh (Eds.). Vol. 35. Curran Associates Inc. 23979\u201323991. 
https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/file\/974ff7b5bf08dbf9400b5d599a39c77f-Paper-Conference.pdf"},{"key":"e_1_3_3_23_2","unstructured":"Richard Blumenthal and Josh Hawley. 2023. Bipartisan Framework for U.S. AI Act. Retrieved from https:\/\/www.blumenthal.senate.gov\/imo\/media\/doc\/09072023bipartisanaiframework.pdf"},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.082080899"},{"key":"e_1_3_3_25_2","first-page":"12","article-title":"Ethical issues in advanced artificial intelligence","author":"Bostrom Nick","year":"2003","unstructured":"Nick Bostrom. 2003. Ethical issues in advanced artificial intelligence. Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence (2003), 12\u201317.","journal-title":"Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence"},{"key":"e_1_3_3_26_2","volume-title":"Global Catastrophic Risks","author":"Bostrom Nick","year":"2011","unstructured":"Nick Bostrom and Milan M. Cirkovic. 2011. Global Catastrophic Risks. Oxford University Press, USA."},{"key":"e_1_3_3_27_2","unstructured":"Samuel R. Bowman Jeeyoon Hyun Ethan Perez Edwin Chen Craig Pettit Scott Heiner Kamil\u0117 Luko\u0161i\u016bt\u0117 Amanda Askell Andy Jones Anna Chen Anna Goldie Azalia Mirhoseini Cameron McKinnon Christopher Olah Daniela Amodei Dario Amodei Dawn Drain Dustin Li Eli Tran-Johnson Jackson Kernion Jamie Kerr Jared Mueller Jeffrey Ladish Joshua Landau Kamal Ndousse Liane Lovitt Nelson Elhage Nicholas Schiefer Nicholas Joseph Noem\u00ed Mercado Nova DasSarma Robin Larson Sam McCandlish Sandipan Kundu Scott Johnston Shauna Kravec Sheer El Showk Stanislav Fort Timothy Telleen-Lawton Tom Brown Tom Henighan Tristan Hume Yuntao Bai Zac Hatfield-Dodds Ben Mann and Jared Kaplan. 2022. Measuring progress on scalable oversight for large language models. arXiv:2211.03540. Retrieved from https:\/\/arxiv.org\/abs\/2211.03540"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.2307\/2334029"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781107446984"},{"key":"e_1_3_3_30_2","unstructured":"Greg Brockman Vicki Cheung Ludwig Pettersson Jonas Schneider John Schulman Jie Tang and Wojciech Zaremba. 2016. OpenAI Gym. arXiv:1606.01540. Retrieved from https:\/\/arxiv.org\/abs\/1606.01540"},{"key":"e_1_3_3_31_2","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel Ziegler Jeffrey Wu Clemens Winter Chris Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems H. Larochelle M. Ranzato R. Hadsell M. F. Balcan and H. Lin (Eds.). Vol. 33. Curran Associates Inc. 1877\u20131901. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/file\/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf"},{"key":"e_1_3_3_32_2","unstructured":"S\u00e9bastien Bubeck Varun Chandrasekaran Ronen Eldan Johannes Gehrke Eric Horvitz Ece Kamar Peter Lee Yin Tat Lee Yuanzhi Li Scott Lundberg Harsha Nori Hamid Palangi Marco Tulio Ribeiro and Yi Zhang. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712. 
Retrieved from https:\/\/arxiv.org\/abs\/2303.12712"},{"key":"e_1_3_3_33_2","unstructured":"Collin Burns Pavel Izmailov Jan Hendrik Kirchner Bowen Baker Leo Gao Leopold Aschenbrenner Yining Chen Adrien Ecoffet Manas Joglekar Jan Leike Ilya Sutskever and Jeff Wu. 2023. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv:2312.09390 [cs.CL]. https:\/\/arxiv.org\/abs\/2312.09390"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.aal4230"},{"key":"e_1_3_3_35_2","unstructured":"Nicholas Carlini Milad Nasr Christopher A. Choquette-Choo Matthew Jagielski Irena Gao Pang Wei W. Koh Daphne Ippolito Florian Tramer and Ludwig Schmidt. 2023. Are aligned neural networks adversarially aligned? In Advances in Neural Information Processing Systems A. Oh T. Naumann A. Globerson K. Saenko M. Hardt and S. Levine (Eds.). Vol. 36. Curran Associates Inc. 61478\u201361500."},{"key":"e_1_3_3_36_2","unstructured":"Joseph Carlsmith. 2022. Is power-seeking AI an existential risk? arXiv:2206.13353. Retrieved from https:\/\/arxiv.org\/abs\/2206.13353"},{"key":"e_1_3_3_37_2","unstructured":"Andres Carranza Dhruv Pai Rylan Schaeffer Arnuv Tandon and Sanmi Koyejo. 2023. Deceptive Alignment Monitoring. arXiv:2307.10569. Retrieved from https:\/\/arxiv.org\/abs\/2307.10569"},{"key":"e_1_3_3_38_2","first-page":"2686","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Carroll Micah D.","year":"2022","unstructured":"Micah D. Carroll, Anca Dragan, Stuart Russell, and Dylan Hadfield-Menell. 2022. Estimating and penalizing induced preference shifts in recommender systems. In Proceedings of the International Conference on Machine Learning. PMLR, 2686\u20132708."},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","DOI":"10.3390\/electronics8080832"},{"key":"e_1_3_3_40_2","article-title":"Open problems and fundamental limitations of reinforcement learning from human feedback","author":"Casper Stephen","year":"2023","unstructured":"Stephen Casper, Xander Davies, Claudia Shi, et\u00a0al. 2023. Open problems and fundamental limitations of reinforcement learning from human feedback. Transactions on Machine Learning Research (2023). Survey Certification.","journal-title":"Transactions on Machine Learning Research"},{"key":"e_1_3_3_41_2","unstructured":"Stephen Casper Jason Lin Joe Kwon Gatlen Culp and Dylan Hadfield-Menell. 2023. Explore establish exploit: Red teaming language models from scratch. arXiv:2306.09442. Retrieved from https:\/\/arxiv.org\/abs\/2306.09442"},{"key":"e_1_3_3_42_2","unstructured":"Zhaoyu Chen Bo Li Shuang Wu Kaixun Jiang Shouhong Ding and Wenqiang Zhang. 2023. Content-based Unrestricted adversarial attack. In Advances in Neural Information Processing Systems A. Oh T. Naumann A. Globerson K. Saenko M. Hardt and S. Levine (Eds.). Vol. 36. Curran Associates Inc. 51719\u201351733. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2023\/file\/a24cd16bc361afa78e57d31d34f3d936-Paper-Conference.pdf"},{"key":"e_1_3_3_43_2","first-page":"223","volume-title":"Proceedings of the 27th International Conference on Machine Learning (ICML-10)","author":"Cheng Weiwei","year":"2010","unstructured":"Weiwei Cheng, Eyke H\u00fcllermeier, and Krzysztof J Dembczynski. 2010. Graded multilabel classification: The ordinal case. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 
223\u2013230."},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.5555\/3104322.3104351"},{"key":"e_1_3_3_45_2","first-page":"215","volume-title":"Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference","year":"2010","unstructured":"Weiwei Cheng, Micha\u00ebl Rademaker, Bernard De Baets, and Eyke H\u00fcllermeier. 2010. Predicting partial orders: Ranking with abstention. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference. Springer, 215\u2013230."},{"key":"e_1_3_3_46_2","unstructured":"Paul F. Christiano Jan Leike Tom Brown Miljan Martic Shane Legg and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems I. Guyon U. Von Luxburg S. Bengio H. Wallach R. Fergus S. Vishwanathan and R. Garnett (Eds.). Vol. 30. Curran Associates Inc. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf"},{"key":"e_1_3_3_47_2","unstructured":"Collective Intelligence Project. 2023. Introducing the Collective Intelligence Project. Retrieved from https:\/\/cip.org\/whitepaper. [Accessed: October 27 2025]."},{"key":"e_1_3_3_48_2","unstructured":"Andrew Critch and David Krueger. 2020. AI research considerations for human existential safety (ARCHES). (2020)."},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pdig.0000651"},{"issue":"11","key":"e_1_3_3_50_2","article-title":"AI-enhanced collective intelligence","volume":"5","author":"Cui Hao","year":"2024","unstructured":"Hao Cui and Taha Yasseri. 2024. AI-enhanced collective intelligence. Patterns 5, 11 (2024), 101074.","journal-title":"Patterns"},{"key":"e_1_3_3_51_2","unstructured":"Robert Huben Hoagy Cunningham Logan Riggs Smith Aidan Ewart and Lee Sharkey. 2024. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=F76bwRSLeK"},{"key":"e_1_3_3_52_2","doi-asserted-by":"publisher","DOI":"10.1038\/d41586-021-01170-0"},{"key":"e_1_3_3_53_2","unstructured":"Allan Dafoe Edward Hughes Yoram Bachrach Tantum Collins Kevin R. McKee Joel Z. Leibo Kate Larson and Thore Graepel. 2020. Open problems in cooperative AI. arXiv:2012.08630. Retrieved from https:\/\/arxiv.org\/abs\/2012.08630"},{"key":"e_1_3_3_54_2","unstructured":"David \u201cdavidad\u201d Dalrymple Joar Skalse Yoshua Bengio Stuart Russell Max Tegmark Sanjit Seshia Steve Omohundro Christian Szegedy Ben Goldhaber Nora Ammann Alessandro Abate Joe Halpern Clark Barrett Ding Zhao Tan Zhi-Xuan Jeannette Wing and Joshua Tenenbaum. 2024. Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv:2405.06624. Retrieved from https:\/\/arxiv.org\/abs\/2405.06624"},{"key":"e_1_3_3_55_2","volume-title":"Proceedings of the International Conference on Learning Representations","year":"2019","unstructured":"Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.1016\/0004-3702(77)90003-0"},{"key":"e_1_3_3_57_2","unstructured":"Pim de Haan Dinesh Jayaraman and Sergey Levine. 2019. Causal confusion in lmitation learning. 
In Advances in Neural Information Processing Systems H. Wallach H. Larochelle A. Beygelzimer F. d\u2019Alch\u00e9-Buc E. Fox and R. Garnett (Eds.). Vol. 32. Curran Associates Inc. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2019\/file\/947018640bf36a2bb609d3557a285329-Paper.pdf"},{"key":"e_1_3_3_58_2","doi-asserted-by":"crossref","unstructured":"Jonas Degrave Federico Felici Jonas Buchli Michael Neunert Brendan Tracey Francesco Carpanese Timo Ewalds Roland Hafner Abbas Abdolmaleki Diego de las Casas Craig Donner Leslie Fritz Cristian Galperti Andrea Huber James Keeling Maria Tsimpoukelli Jackie Kay Antoine Merle Jean-Marc Moret Seb Noury Federico Pesamosca David Pfau Olivier Sauter Cristian Sommariva Stefano Coda Basil Duval Ambrogio Fasoli Pushmeet Kohli Koray Kavukcuoglu Demis Hassabis and Martin Riedmiller. 2022. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 602 7897 (2022) 414\u2013419.","DOI":"10.1038\/s41586-021-04301-9"},{"key":"e_1_3_3_59_2","unstructured":"Carson Denison Monte MacDiarmid Fazl Barez David Duvenaud Shauna Kravec Samuel Marks Nicholas Schiefer Ryan Soklaski Alex Tamkin Jared Kaplan et\u00a0al. 2024. Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv:2406.10162. Retrieved from https:\/\/arxiv.org\/abs\/2406.10162"},{"key":"e_1_3_3_60_2","doi-asserted-by":"crossref","first-page":"4884","DOI":"10.18653\/v1\/P19-1483","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","year":"2019","unstructured":"Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William Cohen. 2019. Handling divergent reference texts when evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4884\u20134895."},{"key":"e_1_3_3_61_2","first-page":"12004","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Langosco Lauro Langosco Di","year":"2022","unstructured":"Lauro Langosco Di Langosco, Jack Koch, Lee D. Sharkey, Jacob Pfau, and David Krueger. 2022. Goal misgeneralization in deep reinforcement learning. In Proceedings of the International Conference on Machine Learning. PMLR, 12004\u201312019."},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.5555\/3692070.3692537"},{"key":"e_1_3_3_63_2","unstructured":"Yuqing Du Stas Tiomkin Emre Kiciman Daniel Polani Pieter Abbeel and Anca Dragan. 2020. AvE: Assistance via Empowerment. In Advances in Neural Information Processing Systems H. Larochelle M. Ranzato R. Hadsell M. F. Balcan and H. Lin (Eds.). Vol. 33. Curran Associates Inc. 4560\u20134571. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/file\/30de9ece7cf3790c8c39ccff1a044209-Paper.pdf"},{"key":"e_1_3_3_64_2","unstructured":"Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer Tom Henighan Shauna Kravec Zac Hatfield-Dodds Robert Lasenby Dawn Drain Carol Chen Roger Grosse Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg and Christopher Olah. 2022. Toy models of superposition. arXiv:2209.10652. Retrieved from https:\/\/arxiv.org\/abs\/2209.10652"},{"key":"e_1_3_3_65_2","unstructured":"Daniel Fabian. 2023. Google\u2019s AI Red Team: the ethical hackers making AI safer. Retrieved from https:\/\/blog.google\/technology\/safety-security\/googles-ai-red-team-the-ethical-hackers-making-ai-safer. 
[Accessed: October 27 2025]."},{"key":"e_1_3_3_66_2","unstructured":"Tanner Fiez Benjamin Chasnov and Lillian Ratliff. 2020. Implicit learning dynamics in stackelberg games: Equilibria characterization convergence analysis and empirical study. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research Vol. 119) Hal Daum\u00e9 III and Aarti Singh (Eds.). PMLR 3133\u20133144. https:\/\/proceedings.mlr.press\/v119\/fiez20a.html"},{"key":"e_1_3_3_67_2","doi-asserted-by":"crossref","unstructured":"Luciano Floridi Josh Cowls Monica Beltrametti Raja Chatila Patrice Chazerand Virginia Dignum Christoph Luetge Robert Madelin Ugo Pagallo Francesca Rossi Burkhard Schafer Peggy Valcke and Effy Vayena. 2018. AI4People\u2014an ethical framework for a good AI society: opportunities risks principles and recommendations. Minds and Machines 28 4 (2018) 689\u2013707.","DOI":"10.1007\/s11023-018-9482-5"},{"key":"e_1_3_3_68_2","article-title":"Learning to communicate with deep multi-agent reinforcement learning","volume":"29","author":"Foerster Jakob","year":"2016","unstructured":"Jakob Foerster, Ioannis Alexandros Assael, Nando De Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. Advances in Neural Information Processing Systems 29 (2016).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_69_2","first-page":"1136","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Foerster Jakob N.","year":"2017","unstructured":"Jakob N. Foerster, Justin Gilmer, Jascha Sohl-Dickstein, Jan Chorowski, and David Sussillo. 2017. Input switched affine networks: An RNN architecture designed for interpretability. In Proceedings of the International Conference on Machine Learning. PMLR, 1136\u20131145."},{"key":"e_1_3_3_70_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11023-020-09539-2"},{"key":"e_1_3_3_71_2","unstructured":"Zhe Gan Yen-Chun Chen Linjie Li Chen Zhu Yu Cheng and Jingjing Liu. 2020. Large-scale adversarial training for vision-and-language representation learning. In Advances in Neural Information Processing Systems H. Larochelle M. Ranzato R. Hadsell M.F. Balcan and H. Lin (Eds.). Vol. 33. Curran Associates Inc. 6616\u20136628. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/file\/49562478de4c54fafd4ec46fdb297de5-Paper.pdf"},{"key":"e_1_3_3_72_2","first-page":"10835","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Gao Leo","year":"2023","unstructured":"Leo Gao, John Schulman, and Jacob Hilton. 2023. Scaling laws for reward model overoptimization. In Proceedings of the International Conference on Machine Learning. PMLR, 10835\u201310866."},{"key":"e_1_3_3_73_2","article-title":"Loss surfaces, mode connectivity, and fast ensembling of dnns","volume":"31","author":"Garipov Timur","year":"2018","unstructured":"Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P. Vetrov, and Andrew G. Wilson. 2018. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in Neural Information Processing Systems 31 (2018).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_74_2","first-page":"3356","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","year":"2020","unstructured":"Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. 
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2020. 3356\u20133369."},{"key":"e_1_3_3_75_2","doi-asserted-by":"publisher","unstructured":"Robert Geirhos Patricia Rubisch Claudio Michaelis Matthias Bethge Felix A. Wichmann and Wieland Brendel. 2018. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv e-prints, Article arXiv:1811.12231 (Nov. 2018). arXiv:1811.12231 [cs.CV]. 10.48550\/arXiv.1811.12231","DOI":"10.48550\/arXiv.1811.12231"},{"key":"e_1_3_3_76_2","first-page":"2280","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Gilmer Justin","year":"2019","unstructured":"Justin Gilmer, Nicolas Ford, Nicholas Carlini, and Ekin Cubuk. 2019. Adversarial examples are a natural consequence of test error in noise. In Proceedings of the International Conference on Machine Learning. PMLR, 2280\u20132289."},{"key":"e_1_3_3_77_2","unstructured":"Ian J. Goodfellow Jonathon Shlens and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv:1412.6572. Retrieved from https:\/\/arxiv.org\/abs\/1412.6572"},{"key":"e_1_3_3_78_2","unstructured":"Government of the United Kingdom. 2021. The roadmap to an effective AI assurance ecosystem\u2014extended version. Retrieved from https:\/\/www.gov.uk\/government\/publications\/the-roadmap-to-an-effective-ai-assurance-ecosystem\/the-roadmap-to-an-effective-ai-assurance-ecosystem-extended-version"},{"key":"e_1_3_3_79_2","unstructured":"Ryan Greenblatt Carson Denison Benjamin Wright Fabien Roger Monte MacDiarmid Sam Marks Johannes Treutlein Tim Belonax Jack Chen David Duvenaud et\u00a0al. 2024. Alignment faking in large language models. arXiv preprint arXiv:2412.14093 (2024)."},{"key":"e_1_3_3_80_2","article-title":"AI control: Improving safety despite intentional subversion","author":"Greenblatt Ryan","year":"2024","unstructured":"Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. 2024. AI control: Improving safety despite intentional subversion. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024).","journal-title":"In Proceedings of the 41st International Conference on Machine Learning (ICML 2024)."},{"key":"e_1_3_3_81_2","doi-asserted-by":"publisher","DOI":"10.1145\/3593013.3594036"},{"key":"e_1_3_3_82_2","unstructured":"Wes Gurnee and Max Tegmark. 2023. Language Models Represent Space and Time. arXiv:2310.02207. Retrieved from https:\/\/arxiv.org\/abs\/2310.02207"},{"key":"e_1_3_3_83_2","unstructured":"Ant\u00f3nio Guterres. 2023. Secretary-General\u2019s remarks to the Security Council on Artificial Intelligence. Retrieved from https:\/\/www.un.org\/sg\/en\/content\/sg\/speeches\/2023-07-18\/secretary-generals-remarks-the-security-council-artificial-intelligence"},{"key":"e_1_3_3_84_2","volume-title":"Proceedings of the Workshops at the 31st AAAI Conference on Artificial Intelligence","author":"Hadfield-Menell Dylan","year":"2017","unstructured":"Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. 2017. The off-switch game. In Proceedings of the Workshops at the 31st AAAI Conference on Artificial Intelligence."},{"key":"e_1_3_3_85_2","article-title":"Cooperative inverse reinforcement learning","volume":"29","author":"Hadfield-Menell Dylan","year":"2016","unstructured":"Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. 2016. 
Cooperative inverse reinforcement learning. Advances in Neural Information Processing Systems 29 (2016), 3916\u20133924.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_86_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10458-022-09552-y"},{"key":"e_1_3_3_87_2","unstructured":"Dan Hendrycks. 2023. Natural selection favors AIs over humans. arXiv:2303.16200 (2023)."},{"key":"e_1_3_3_88_2","unstructured":"Dan Hendrycks Nicholas Carlini John Schulman and Jacob Steinhardt. 2021. Unsolved problems in ML safety. arXiv:2109.13916. Retrieved from https:\/\/arxiv.org\/abs\/2109.13916"},{"key":"e_1_3_3_89_2","unstructured":"Dan Hendrycks and Mantas Mazeika. 2022. X-risk analysis for AI research. arXiv:2206.05862. Retrieved from https:\/\/arxiv.org\/abs\/2206.05862"},{"key":"e_1_3_3_90_2","article-title":"Generative adversarial imitation learning","volume":"29","author":"Ho Jonathan","year":"2016","unstructured":"Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29 (2016), 4572\u20134580.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_91_2","unstructured":"Lewis Ho Joslyn Barnhart Robert Trager Yoshua Bengio Miles Brundage Allison Carnegie Rumman Chowdhury Allan Dafoe Gillian Hadfield Margaret Levi and Duncan Snidal. 2023. International Institutions for Advanced AI. arXiv:2307.04699 (2023)."},{"key":"e_1_3_3_92_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Holtzman Ari","year":"2019","unstructured":"Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_3_93_2","first-page":"4399","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Hu Hengyuan","year":"2020","unstructured":"Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. 2020. \u201cother-play\u201d for zero-shot coordination. In Proceedings of the International Conference on Machine Learning. PMLR, 4399\u20134410."},{"key":"e_1_3_3_94_2","doi-asserted-by":"crossref","unstructured":"Jiaheng Hu Rose Hendrix Ali Farhadi Aniruddha Kembhavi Roberto Martin-Martin Peter Stone Kuo-Hao Zeng and Kiana Ehsani. 2025. FLaRe: Achieving masterful and adaptive robot policies with large-scale reinforcement learning fine-tuning. In IEEE International Conference on Robotics and Automation (ICRA). 3617\u20133624.","DOI":"10.1109\/ICRA55743.2025.11127934"},{"key":"e_1_3_3_95_2","unstructured":"Evan Hubinger Carson Denison Jesse Mu Mike Lambert Meg Tong Monte MacDiarmid Tamera Lanham Daniel M. Ziegler Tim Maxwell Newton Cheng Adam Jermyn Amanda Askell Ansh Radhakrishnan Cem Anil David Duvenaud Deep Ganguli Fazl Barez Jack Clark Kamal Ndousse Kshitij Sachan Michael Sellitto Mrinank Sharma Nova DasSarma Roger Grosse Shauna Kravec Yuntao Bai Zachary Witten Marina Favaro Jan Brauner Holden Karnofsky Paul Christiano Samuel R. Bowman Logan Graham Jared Kaplan S\u00f6ren Mindermann Ryan Greenblatt Buck Shlegeris Nicholas Schiefer and Ethan Perez. 2024. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv:2401.05566 (2024)."},{"key":"e_1_3_3_96_2","unstructured":"Evan Hubinger Chris van Merwijk Vladimir Mikulik Joar Skalse and Scott Garrabrant. 2019. The Inner Alignment Problem. 
Retrieved from https:\/\/www.alignmentforum.org\/posts\/pL56xPoniLvtMDQ4J\/the-inner-alignment-problem"},{"key":"e_1_3_3_97_2","unstructured":"Evan Hubinger Chris van Merwijk Vladimir Mikulik Joar Skalse and Scott Garrabrant. 2019. Risks from learned optimization in advanced machine learning systems. arXiv:1906.01820. Retrieved from https:\/\/arxiv.org\/abs\/1906.01820"},{"key":"e_1_3_3_98_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2008.08.002"},{"key":"e_1_3_3_99_2","doi-asserted-by":"publisher","DOI":"10.1145\/3054912"},{"key":"e_1_3_3_100_2","article-title":"Reward learning from human preferences and demonstrations in atari","volume":"31","author":"Ibarz Borja","year":"2018","unstructured":"Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. 2018. Reward learning from human preferences and demonstrations in atari. Advances in Neural Information Processing Systems 31 (2018).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_101_2","unstructured":"Geoffrey Irving Paul Christiano and Dario Amodei. 2018. AI safety via debate. arXiv:1805.00899. Retrieved from https:\/\/arxiv.org\/abs\/1805.00899"},{"key":"e_1_3_3_102_2","first-page":"90853","article-title":"Aligner: Efficient alignment by learning to correct","volume":"37","author":"Ji Jiaming","year":"2024","unstructured":"Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Tianyi Alex Qiu, Juntao Dai, and Yaodong Yang. 2024. Aligner: Efficient alignment by learning to correct. Advances in Neural Information Processing Systems 37, 1 (2024), 90853\u201390890.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_103_2","article-title":"Beavertails: Towards improved safety alignment of llm via a human-preference dataset","volume":"36","author":"Ji Jiaming","year":"2024","unstructured":"Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, et\u00a0al. 2024. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems 36 (2024).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_104_2","unstructured":"Jiaming Ji Jiayi Zhou Hantao Lou Boyuan Chen Donghai Hong Xuyao Wang Wenqi Chen Kaile Wang Rui Pan Jiahao Li et\u00a0al. 2024. Align anything: Training all-modality models to follow instructions with language feedback. arXiv:2412.15838. Retrieved from https:\/\/arxiv.org\/abs\/2412.15838"},{"key":"e_1_3_3_105_2","doi-asserted-by":"crossref","unstructured":"Ziwei Ji Nayeon Lee Rita Frieske Tiezheng Yu Dan Su Yan Xu Etsuko Ishii Ye Jin Bang Andrea Madotto and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys (CSUR) 55 12 (2023) 1\u201338.","DOI":"10.1145\/3571730"},{"key":"e_1_3_3_106_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1215"},{"key":"e_1_3_3_107_2","series-title":"Proceedings of Machine Learning Research","first-page":"15307","volume-title":"Proceedings of the International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA","volume":"202","author":"Jones Erik","year":"2023","unstructured":"Erik Jones, Anca D. Dragan, Aditi Raghunathan, and Jacob Steinhardt. 2023. Automatically auditing large language models via discrete optimization. In Proceedings of the International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA(Proceedings of Machine Learning Research, Vol. 
202). PMLR, 15307\u201315329."},{"key":"e_1_3_3_108_2","doi-asserted-by":"publisher","DOI":"10.1049\/ccs2.12027"},{"key":"e_1_3_3_109_2","unstructured":"Cameron F. Kerry Joshua P. Meltzer Andrea Renda and Rosanna Fanni. 2021. Strengthening international cooperation on AI Progress report. https:\/\/www.brookings.edu\/articles\/strengthening-international-cooperation-on-ai"},{"key":"e_1_3_3_110_2","doi-asserted-by":"publisher","DOI":"10.5555\/3692070.3693020"},{"key":"e_1_3_3_111_2","unstructured":"Megan Kinniment Lucas Jun Koba Sato Haoxing Du Brian Goodrich Max Hasin Lawrence Chan Luke Harold Miles Tao R. Lin Hjalmar Wijk et\u00a0al. 2023. Evaluating Language-Model Agents on Realistic Autonomous Tasks."},{"key":"e_1_3_3_112_2","doi-asserted-by":"publisher","DOI":"10.1111\/jofi.12498"},{"key":"e_1_3_3_113_2","first-page":"105236","article-title":"The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models","volume":"37","year":"2024","unstructured":"Hannah Rose Kirk, Alexander Whitefield, Paul R\u00f6ttger, Andrew Michael Bean, Katerina Margatina, Rafael Mosquera, Juan Manuel Ciro, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A. Hale. 2024. The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models. Advances in Neural Information Processing Systems 37 (2024), 105236\u2013105344.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_114_2","unstructured":"Leonie Koessler and Jonas Schuett. 2023. Risk assessment at AGI companies: A review of popular risk assessment techniques from other safety-critical industries. arXiv:2307.08823. Retrieved from https:\/\/arxiv.org\/abs\/2307.08823"},{"key":"e_1_3_3_115_2","doi-asserted-by":"crossref","unstructured":"Philipp Koralus. 2025. The philosophic turn for AI agents: Replacing centralized digital rhetoric with decentralized truth-seeking. arXiv:2504.18601. Retrieved from https:\/\/arxiv.org\/abs\/2504.18601","DOI":"10.1007\/s11299-025-00326-z"},{"key":"e_1_3_3_116_2","unstructured":"Vanessa Kosoy. 2017. Forecasting using incomplete models. arXiv:1705.04630. Retrieved from https:\/\/arxiv.org\/abs\/1705.04630"},{"key":"e_1_3_3_117_2","unstructured":"Victoria Krakovna. 2022. Paradigms of AI alignment: components and enablers. Retrieved from https:\/\/vkrakovna.wordpress.com\/2022\/06\/02\/paradigms-of-ai-alignment-components-and-enablers"},{"key":"e_1_3_3_118_2","unstructured":"David Krueger Ethan Caballero J\u00f6rn-Henrik Jacobsen Amy Zhang Jonathan Binas Dinghuai Zhang R\u00e9mi Le Priol and Aaron Courville. 2021. Out-of-Distribution Generalization via Risk Extrapolation (REx). In Proceedings of the 38th International Conference on Machine Learning (ICML) (Proceedings of Machine Learning Research Vol. 139). PMLR 5815\u20135826. http:\/\/proceedings.mlr.press\/v139\/krueger21a.html"},{"issue":"1","key":"e_1_3_3_119_2","doi-asserted-by":"crossref","first-page":"36","DOI":"10.1037\/amp0000972","article-title":"Auditing the AI auditors: A framework for evaluating fairness and bias in high stakes AI predictive models.","volume":"78","author":"Landers Richard N.","year":"2023","unstructured":"Richard N. Landers and Tara S. Behrend. 2023. Auditing the AI auditors: A framework for evaluating fairness and bias in high stakes AI predictive models. 
American Psychologist 78, 1 (2023), 36.","journal-title":"American Psychologist"},{"key":"e_1_3_3_120_2","first-page":"6187","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Leibo Joel Z.","year":"2021","unstructured":"Joel Z. Leibo, Edgar A. Due\u00f1ez-Guzman, Alexander Vezhnevets, John P. Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charlie Beattie, Igor Mordatch, and Thore Graepel. 2021. Scalable evaluation of multi-agent reinforcement learning with melting pot. In Proceedings of the International Conference on Machine Learning. PMLR, 6187\u20136199."},{"key":"e_1_3_3_121_2","unstructured":"Joel Z. Leibo Alexander Sasha Vezhnevets Manfred Diaz John P. Agapiou William A. Cunningham Peter Sunehag Julia Haas Raphael Koster Edgar A. Du\u00e9\u00f1ez-Guzm\u00e1n William S Isaac et\u00a0al. 2024. A theory of appropriateness with applications to generative artificial intelligence. arXiv:2412.19010. Retrieved from https:\/\/arxiv.org\/abs\/2412.19010"},{"key":"e_1_3_3_122_2","volume-title":"Nonparametric General Reinforcement Learning","author":"Leike Jan","year":"2016","unstructured":"Jan Leike. 2016. Nonparametric General Reinforcement Learning. The Australian National University (Australia)."},{"key":"e_1_3_3_123_2","unstructured":"Jan Leike David Krueger Tom Everitt Miljan Martic Vishal Maini and Shane Legg. 2018. Scalable agent alignment via reward modeling: A research direction. arXiv:1811.07871. Retrieved from https:\/\/arxiv.org\/abs\/1811.07871"},{"key":"e_1_3_3_124_2","first-page":"1280","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Li Chao","year":"2022","unstructured":"Chao Li, Kelu Yao, Jin Wang, Boyu Diao, Yongjun Xu, and Quanshi Zhang. 2022. Interpretable generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence. 36, 2 (2022), 1280\u20131288."},{"key":"e_1_3_3_125_2","doi-asserted-by":"publisher","unstructured":"Junyi Li Xiaoxue Cheng Wayne Xin Zhao Jian-Yun Nie and Ji-Rong Wen. 2023. Halueval: A large-scale hallucination evaluation benchmark for large language models. (Dec. 2023) 6449\u20136464. 10.18653\/v1\/2023.emnlpmain.397","DOI":"10.18653\/v1\/2023.emnlpmain.397"},{"key":"e_1_3_3_126_2","unstructured":"Yifan Li Yifan Du Kun Zhou Jinpeng Wang Xin Zhao and Ji-Rong Wen. 2023. Evaluating Object Hallucination in Large Vision-Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 292\u2013305."},{"key":"e_1_3_3_127_2","first-page":"6565","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Liang Paul Pu","year":"2021","unstructured":"Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2021. Towards understanding and mitigating social biases in language models. In Proceedings of the International Conference on Machine Learning. PMLR, 6565\u20136576."},{"key":"e_1_3_3_128_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01835"},{"key":"e_1_3_3_129_2","first-page":"3214","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","year":"2022","unstructured":"Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 
3214\u20133252."},{"key":"e_1_3_3_130_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Liu Ruibo","year":"2024","unstructured":"Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Diyi Yang, and Soroush Vosoughi. 2024. Training socially aligned language models on simulated social interactions. In Proceedings of the 12th International Conference on Learning Representations."},{"key":"e_1_3_3_131_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Liu Xiao","year":"2024","unstructured":"Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, et\u00a0al. 2024. AgentBench: Evaluating LLMs as agents. In Proceedings of the 12th International Conference on Learning Representations."},{"key":"e_1_3_3_132_2","first-page":"22965","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Lubana Ekdeep Singh","year":"2023","unstructured":"Ekdeep Singh Lubana, Eric J. Bigelow, Robert P. Dick, David Krueger, and Hidenori Tanaka. 2023. Mechanistic mode connectivity. In Proceedings of the International Conference on Machine Learning. PMLR, 22965\u201323004."},{"key":"e_1_3_3_133_2","unstructured":"Tambiama Andr\u00e9 Madiega. 2024. Artificial intelligence act. PE 698.792 (September 2024). EU Legislation in Progress. 4th Edition."},{"key":"e_1_3_3_134_2","first-page":"1","article-title":"A future that works: AI, automation, employment, and productivity","volume":"60","author":"Manyika James","year":"2017","unstructured":"James Manyika, Michael Chui, Mehdi Miremadi, Jacques Bughin, Katy George, et\u00a0al. 2017. A future that works: AI, automation, employment, and productivity. McKinsey Global Institute Research, Tech. Rep 60 (2017), 1\u2013135.","journal-title":"McKinsey Global Institute Research, Tech. Rep"},{"key":"e_1_3_3_135_2","doi-asserted-by":"publisher","DOI":"10.1145\/3457607"},{"key":"e_1_3_3_136_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2018.07.007"},{"key":"e_1_3_3_137_2","doi-asserted-by":"publisher","DOI":"10.1007\/s43681-023-00289-2"},{"key":"e_1_3_3_138_2","doi-asserted-by":"crossref","unstructured":"Tim Mulgan. 2016. Superintelligence: Paths dangers strategies.","DOI":"10.1093\/pq\/pqv034"},{"key":"e_1_3_3_139_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009744630224"},{"key":"e_1_3_3_140_2","unstructured":"Andrew Y. Ng and Stuart Russell. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000). 663\u2013670."},{"key":"e_1_3_3_141_2","unstructured":"Richard Ngo. 2020. AGI Safety from First Principles. Retrieved from https:\/\/www.alignmentforum.org\/s\/mzgtmmTKKn5MuCzFJ"},{"key":"e_1_3_3_142_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Ngo Richard","year":"2024","unstructured":"Richard Ngo, Lawrence Chan, and S\u00f6ren Mindermann. 2024. The alignment problem from a deep learning perspective: A position paper. In Proceedings of the 12th International Conference on Learning Representations."},{"key":"e_1_3_3_143_2","doi-asserted-by":"publisher","DOI":"10.1002\/widm.1356"},{"issue":"3","key":"e_1_3_3_144_2","first-page":"e00024\u2013001","article-title":"Zoom in: An introduction to circuits","volume":"5","author":"Olah Chris","year":"2020","unstructured":"Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 
2020. Zoom in: An introduction to circuits. Distill 5, 3 (2020), e00024\u2013001.","journal-title":"Distill"},{"key":"e_1_3_3_145_2","unstructured":"Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova DasSarma Tom Henighan Ben Mann Amanda Askell Yuntao Bai Anna Chen et\u00a0al. 2022. In-context learning and induction heads. arXiv:2209.11895. Retrieved from https:\/\/arxiv.org\/abs\/2209.11895"},{"key":"e_1_3_3_146_2","first-page":"483","volume-title":"Proceedings of the AGI","author":"Omohundro Stephen M.","year":"2008","unstructured":"Stephen M. Omohundro. 2008. The basic AI drives. In Proceedings of the AGI. 483\u2013492."},{"key":"e_1_3_3_147_2","unstructured":"OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_3_148_2","unstructured":"Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright Pamela Mishkin Chong Zhang Sandhini Agarwal Katarina Slama Alex Ray John Schulman Jacob Hilton Fraser Kelton Luke E. Miller Maddie Simens Amanda Askell Peter Welinder Paul Francis Christiano Jan Leike and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS\u201922) 35 (2022) 27730\u201327744."},{"key":"e_1_3_3_149_2","article-title":"Do the rewards justify the means? Measuring tradeoffs between rewards and ethical behavior in the Machiavelli benchmark.","author":"Pan Alexander","year":"2023","unstructured":"Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. 2023. Do the rewards justify the means? Measuring tradeoffs between rewards and ethical behavior in the Machiavelli benchmark. ICML (2023).","journal-title":"ICML"},{"key":"e_1_3_3_150_2","first-page":"2086","volume-title":"Findings of the Association for Computational Linguistics: ACL 2022","year":"2022","unstructured":"Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. 2022. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022. 2086\u20132105."},{"key":"e_1_3_3_151_2","volume-title":"Programming Machine Ethics","year":"2016","unstructured":"Lu\u00eds Moniz Pereira and Ari Saptawijaya. 2016. Programming Machine Ethics. Vol. 26. Springer."},{"key":"e_1_3_3_152_2","doi-asserted-by":"crossref","unstructured":"Ethan Perez Saffron Huang Francis Song Trevor Cai Roman Ring John Aslanides Amelia Glaese Nat McAleese and Geoffrey Irving. 2022. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 
3419\u20133448.","DOI":"10.18653\/v1\/2022.emnlp-main.225"},{"key":"e_1_3_3_153_2","doi-asserted-by":"crossref","unstructured":"Ethan Perez Sam Ringer Kamile Lukosiute Karina Nguyen Edwin Chen Scott Heiner Craig Pettit Catherine Olsson Sandipan Kundu Saurav Kadavath Andy Jones Anna Chen Benjamin Mann Brian Israel Bryan Seethor Cameron McKinnon Christopher Olah Da Yan Daniela Amodei Dario Amodei Dawn Drain Dustin Li Eli Tran-Johnson Guro Khundadze Jackson Kernion James Landis Jamie Kerr Jared Mueller Jeeyoon Hyun Joshua Landau Kamal Ndousse Landon Goldberg Liane Lovitt Martin Lucas Michael Sellitto Miranda Zhang Neerav Kingsland Nelson Elhage Nicholas Joseph Noemi Mercado Nova DasSarma Oliver Rausch Robin Larson Sam McCandlish Scott Johnston Shauna Kravec Sheer El Showk Tamera Lanham Timothy Telleen-Lawton Tom Brown Tom Henighan Tristan Hume Yuntao Bai Zac Hatfield-Dodds Jack Clark Samuel R. Bowman Amanda Askell Roger Grosse Danny Hernandez Deep Ganguli Evan Hubinger Nicholas Schiefer and Jared Kaplan. 2023. Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023. Toronto Canada July 9-14 2023. Association for Computational Linguistics 13387\u201313434.","DOI":"10.18653\/v1\/2023.findings-acl.847"},{"key":"e_1_3_3_154_2","unstructured":"Julien Perolat Joel Z. Leibo Vinicius Zambaldi Charles Beattie Karl Tuyls and Thore Graepel. 2017. A multiagent reinforcement learning model of common-pool resource appropriation. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach California USA) (NIPS\u201917). Curran Associates Inc. Red Hook NY USA 3646\u20133655."},{"key":"e_1_3_3_155_2","unstructured":"Lucas Perry. 2020. Evan Hubinger on Inner Alignment Outer Alignment and Proposals for Building Safe Advanced AI. Retrieved from https:\/\/www.alignmentforum.org\/posts\/qZGoHkRgANQpGHWnu\/evan-hubinger-on-inner-alignment-outer-alignment-and"},{"key":"e_1_3_3_156_2","article-title":"Causal inference using invariant prediction: Identification and confidence intervals. arXiv","author":"Peters J.","year":"2015","unstructured":"J. Peters, Peter Buhlmann, and N. Meinshausen. 2015. Causal inference using invariant prediction: Identification and confidence intervals. arXiv. Methodology (2015).","journal-title":"Methodology"},{"key":"e_1_3_3_157_2","unstructured":"Steve Phelps and Yvan I. Russell. 2023. Investigating emergent goal-like behaviour in large language models using experimental economics. arXiv:2305.07970. Retrieved from https:\/\/arxiv.org\/abs\/2305.07970"},{"key":"e_1_3_3_158_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2010.07.002"},{"key":"e_1_3_3_159_2","unstructured":"Quintin Pope and TurnTrout. 2022. The shard theory of human values. Retrieved from https:\/\/www.alignmentforum.org\/posts\/iCfdcxiyr2Kj8m8mT\/the-shard-theory-of-human-values. [Accessed: October 27 2025]."},{"key":"e_1_3_3_160_2","first-page":"15711","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","year":"2021","unstructured":"Omid Poursaeed, Tianxing Jiang, Harry Yang, Serge Belongie, and Ser-Nam Lim. 2021. Robustness and generalization via generative adversarial training. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 15711\u201315720."},{"key":"e_1_3_3_161_2","doi-asserted-by":"crossref","unstructured":"Tianyi Alex Qiu Yang Zhang Xuchuan Huang Jasmine Li Jiaming Ji and Yaodong Yang. 2024. 
ProgressGym: Alignment with a millennium of moral progress. Advances in Neural Information Processing Systems 37 NeurIPS\u201924. 14570\u201314607.","DOI":"10.52202\/079017-0465"},{"key":"e_1_3_3_162_2","unstructured":"Rafael Rafailov Archit Sharma Eric Mitchell Christopher D. Manning Stefano Ermon and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems (NeurIPS\u201923) 36 (2023) 53728\u201353741."},{"key":"e_1_3_3_163_2","unstructured":"Daking Rai Yilun Zhou Shi Feng Abulhair Saparov and Ziyu Yao. 2024. A practical review of mechanistic interpretability for transformer-based language models. ArXiv abs\/2407.02646 2407.02646 (2024). https:\/\/api.semanticscholar.org\/CorpusID:270924412"},{"key":"e_1_3_3_164_2","doi-asserted-by":"crossref","first-page":"464","DOI":"10.1109\/SaTML54575.2023.00039","volume-title":"Proceedings of the 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)","author":"R\u00e4uker Tilman","year":"2023","unstructured":"Tilman R\u00e4uker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. 2023. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. In Proceedings of the 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 464\u2013483."},{"key":"e_1_3_3_165_2","volume-title":"Proceedings of the Workshop on Transparent and Interpretable Machine Learning in Safety Critical Environments, 31st Conference on Neural Information Processing Systems","author":"Ross Andrew","year":"2017","unstructured":"Andrew Ross, Isaac Lage, and Finale Doshi-Velez. 2017. The neural lasso: Local linear sparsity for interpretable explanations. In Proceedings of the Workshop on Transparent and Interpretable Machine Learning in Safety Critical Environments, 31st Conference on Neural Information Processing Systems, Vol. 4."},{"key":"e_1_3_3_166_2","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-019-0048-x"},{"key":"e_1_3_3_167_2","volume-title":"Human Compatible: AI and the Problem of Control","author":"Russell Stuart","year":"2019","unstructured":"Stuart Russell. 2019. Human Compatible: AI and the Problem of Control. Penguin Uk."},{"key":"e_1_3_3_168_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Sagawa Shiori","year":"2020","unstructured":"Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. 2020. Distributionally robust neural networks. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_3_169_2","series-title":"Proceedings of Machine Learning Research","first-page":"29971","volume-title":"Proceedings of the International Conference on Machine Learning, ICML 2023","volume":"202","author":"Santurkar Shibani","year":"2023","unstructured":"Shibani Santurkar, Esin Durmus, Faisal Ladhak, et\u00a0al. 2023. Whose opinions do language models reflect?. In Proceedings of the International Conference on Machine Learning, ICML 2023(Proceedings of Machine Learning Research, Vol. 202). PMLR, 29971\u201330004."},{"key":"e_1_3_3_170_2","doi-asserted-by":"publisher","DOI":"10.1016\/S1364-6613(99)01327-3"},{"key":"e_1_3_3_171_2","doi-asserted-by":"crossref","DOI":"10.3389\/frai.2020.00036","article-title":"The moral choice machine","author":"Schramowski Patrick","year":"2020","unstructured":"Patrick Schramowski, Cigdem Turan, Sophie Jentzsch, Constantin Rothkopf, and Kristian Kersting. 2020. 
The moral choice machine. Frontiers in Artificial Intelligence 3, 1 (2020), 36.","journal-title":"Frontiers in Artificial Intelligence"},{"key":"e_1_3_3_172_2","unstructured":"Jonas Schuett Noemi Dreksler Markus Anderljung David McCaffary Lennart Heim Emma Bluemke and Ben Garfinkel. 2023. Towards best practices in AGI safety and governance: A survey of expert opinion. arXiv:2305.07153 (2023)."},{"key":"e_1_3_3_173_2","doi-asserted-by":"publisher","DOI":"10.1016\/0022-5193(83)90445-9"},{"key":"e_1_3_3_174_2","unstructured":"Elizabeth Seger Noemi Dreksler Richard Moulange Emily Dardaman Jonas Schuett K. Wei Christoph Winter Mackenzie Arnold Se\u00e1n \u00d3 h\u00c9igeartaigh Anton Korinek Markus Anderljung Ben Bucknall Alan Chan Eoghan Stafford Leonie Koessler Aviv Ovadya Ben Garfinkel Emma Bluemke Michael Aird Patrick Levermore Julian Hazell and Abhishek Gupta. 2023. Open-sourcing highly capable foundation models. arXiv:2311.09227 (2023)."},{"key":"e_1_3_3_175_2","unstructured":"Rohin Shah Pedro Freire Neel Alex Rachel Freedman Dmitrii Krasheninnikov Lawrence Chan Michael D. Dennis Pieter Abbeel Anca Dragan and Stuart Russell. 2020. Benefits of assistance over reward learning. (2020)."},{"key":"e_1_3_3_176_2","volume-title":"Proceedings of the Socially Responsible Language Modelling Research","year":"2023","unstructured":"Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. 2023. Scalable and transferable black-box jailbreaks for language models via persona modulation. In Proceedings of the Socially Responsible Language Modelling Research."},{"key":"e_1_3_3_177_2","unstructured":"Rohin Shah Vikrant Varma Ramana Kumar Mary Phuong Victoria Krakovna Jonathan Uesato and Zac Kenton. 2022. Goal misgeneralization: Why correct specifications aren\u2019t enough for correct goals. ArXiv abs\/2210.01790 2210.01790 (2022). https:\/\/api.semanticscholar.org\/CorpusID:252693373"},{"key":"e_1_3_3_178_2","volume-title":"Proceedings of the 12th International Conference on Learning Representations","year":"2024","unstructured":"Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. 2024. Towards understanding sycophancy in language models. In Proceedings of the 12th International Conference on Learning Representations."},{"key":"e_1_3_3_179_2","first-page":"654","volume-title":"Proceedings of the Conference on Robot Learning","author":"Shaw Kenneth","year":"2023","unstructured":"Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. 2023. Videodex: Learning dexterity from internet videos. In Proceedings of the Conference on Robot Learning. PMLR, 654\u2013665."},{"key":"e_1_3_3_180_2","unstructured":"Toby Shevlane Sebastian Farquhar Ben Garfinkel Mary Phuong Jess Whittlestone Jade Leung Daniel Kokotajlo Nahema Marchal Markus Anderljung Noam Kolt Lewis Ho Divya Siddarth Shahar Avin William T. Hawkins Been Kim Iason Gabriel Vijay Bolina Jack Clark Yoshua Bengio Paul Francis Christiano and Allan Dafoe. 2023. Model evaluation for extreme risks. ArXiv abs\/2305.15324 (2023). https:\/\/api.semanticscholar.org\/CorpusID:258865507"},{"key":"e_1_3_3_181_2","volume-title":"Proceedings of the Alignment Forum","author":"Shlegeris Buck","year":"2023","unstructured":"Buck Shlegeris and Ryan Greenblatt. 2023. Some summaries of agent foundation work. 
In Proceedings of the Alignment Forum. Retrieved from https:\/\/www.alignmentforum.org\/posts\/3vDb6EzBpaHqDqQif\/some-summaries-of-agent-foundations-work-1"},{"key":"e_1_3_3_182_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature16961"},{"key":"e_1_3_3_183_2","first-page":"9460","article-title":"Defining and characterizing reward gaming","volume":"35","author":"Skalse Joar","year":"2022","unstructured":"Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. 2022. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems 35, 1 (2022), 9460\u20139471.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_184_2","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1007\/978-3-662-54033-6_5","article-title":"Agent foundations for aligning machine intelligence with human interests: A technical research agenda","author":"Soares Nate","year":"2017","unstructured":"Nate Soares and Benya Fallenstein. 2017. Agent foundations for aligning machine intelligence with human interests: A technical research agenda. The Technological Singularity: Managing the Journey 1 (2017), 103\u2013125.","journal-title":"The Technological Singularity: Managing the Journey"},{"key":"e_1_3_3_185_2","volume-title":"Proceedings of the Workshops at the 29th AAAI Conference on Artificial Intelligence","author":"Soares Nate","year":"2015","unstructured":"Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. 2015. Corrigibility. In Proceedings of the Workshops at the 29th AAAI Conference on Artificial Intelligence."},{"key":"e_1_3_3_186_2","article-title":"Multi-agent generative adversarial imitation learning","volume":"31","author":"Song Jiaming","year":"2018","unstructured":"Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. 2018. Multi-agent generative adversarial imitation learning. Advances in Neural Information Processing Systems 31 (2018). https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2018\/file\/8cea559c47e4fbdb73b23e0223d04e79-Paper.pdf","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_187_2","article-title":"Constructing unrestricted adversarial examples with generative models","volume":"31","author":"Song Yang","year":"2018","unstructured":"Yang Song, Rui Shu, Nate Kushman, and Stefano Ermon. 2018. Constructing unrestricted adversarial examples with generative models. Advances in Neural Information Processing Systems 31 (2018). https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2018\/file\/8cea559c47e4fbdb73b23e0223d04e79-Paper.pdf","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_188_2","unstructured":"Taylor Sorensen Jared Moore Jillian Fisher Mitchell Gordon Niloofar Mireshghallah Christopher Michael Rytting Andre Ye Liwei Jiang Ximing Lu Nouha Dziri Tim Althoff and Yejin Choi. 2024. A roadmap to pluralistic alignment. arXiv:2402.05070. Retrieved from https:\/\/arxiv.org\/abs\/2402.05070"},{"issue":"4","key":"e_1_3_3_189_2","doi-asserted-by":"crossref","first-page":"1505","DOI":"10.1007\/s00146-021-01229-6","article-title":"What overarching ethical principle should a superintelligent AI follow?","volume":"37","author":"S\u00f8vik Atle Ottesen","year":"2022","unstructured":"Atle Ottesen S\u00f8vik. 2022. What overarching ethical principle should a superintelligent AI follow? 
AI and Society: Knowledge Culture and Communication 37, 4 (2022), 1505\u20131518.","journal-title":"AI and Society: Knowledge Culture and Communication"},{"key":"e_1_3_3_190_2","unstructured":"Stag. 2023. Shallow review of live agendas in alignment and safety. In Alignment Forum. https:\/\/www.lesswrong.com\/posts\/zaaGsFBeDTpCsYHef\/shallow-review-of-live-agendas-in-alignment-and-safety"},{"key":"e_1_3_3_191_2","first-page":"3008","article-title":"Learning to summarize with human feedback","volume":"33","author":"Stiennon Nisan","year":"2020","unstructured":"Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems 33 (2020), 3008\u20133021. https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/file\/1f89885d556929e98d3ef9b86448f951-Paper.pdf","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_192_2","unstructured":"AI Safety Summit. 2023. The Bletchley Declaration by Countries Attending the AI Safety Summit."},{"key":"e_1_3_3_193_2","volume-title":"Reinforcement Learning: An Introduction","author":"Sutton Richard S.","year":"2018","unstructured":"Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. Vol. 1. MIT Press."},{"key":"e_1_3_3_194_2","volume-title":"Reinforcement Learning: An Introduction","year":"1998","unstructured":"Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press Cambridge."},{"key":"e_1_3_3_195_2","unstructured":"Jihoon Tack Jack Lanchantin Jane Yu Andrew Cohen Ilia Kulikov Janice Lan Shibo Hao Yuandong Tian Jason Weston and Xian Li. 2025. LLM Pretraining with Continuous Concepts. ArXiv abs\/2502.08524 2502.08524 (2025). https:\/\/api.semanticscholar.org\/CorpusID:276287841"},{"issue":"3","key":"e_1_3_3_196_2","doi-asserted-by":"crossref","first-page":"viad040","DOI":"10.1093\/isr\/viad040","article-title":"The global governance of artificial intelligence: Next steps for empirical and normative research","volume":"25","year":"2023","unstructured":"Jonas Tallberg, Eva Erman, Markus Furendal, Johannes Geith, Mark Klamberg, and Magnus Lundgren. 2023. The global governance of artificial intelligence: Next steps for empirical and normative research. International Studies Review 25, 3 (2023), viad040.","journal-title":"International Studies Review"},{"key":"e_1_3_3_197_2","unstructured":"The White House. 2023. FACT SHEET: Biden-Harris Administration Secures Voluntary Commitments from Leading Artificial Intelligence Companies to Manage the Risks Posed by AI."},{"key":"e_1_3_3_198_2","first-page":"278","volume-title":"Proceedings of the 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA)","author":"Thulasidasan Sunil","year":"2021","unstructured":"Sunil Thulasidasan, Sushil Thapa, Sayera Dhaubhadel, Gopinath Chennupati, Tanmoy Bhattacharya, and Jeff Bilmes. 2021. An effective baseline for robustness to distributional shift. In Proceedings of the 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 278\u2013285."},{"key":"e_1_3_3_199_2","doi-asserted-by":"publisher","DOI":"10.1145\/3419633"},{"key":"e_1_3_3_200_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et\u00a0al. 2023. 
Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. Retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_3_201_2","unstructured":"Alex Turner. 2022. Inner and outer alignment decompose one hard problem into two extremely hard problems. Retrieved from https:\/\/www.alignmentforum.org\/posts\/gHefoxiznGfsbiAu9\/inner-and-outer-alignment-decompose-one-hard-problem-into"},{"key":"e_1_3_3_202_2","article-title":"Optimal policies tend to seek power","author":"Turner Alex","year":"2021","unstructured":"Alex Turner, Logan Smith, Rohin Shah, Andrew Critch, and Prasad Tadepalli. 2021. Optimal policies tend to seek power. Advances in Neural Information Processing Systems 34, 1 (2021), 23063\u201323074.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_203_2","unstructured":"UNESCO. 2021. Recommendation on the Ethics of Artificial Intelligence. Retrieved from https:\/\/unesdoc.unesco.org\/ark:\/48223\/pf0000381137. [Accessed: October 27 2025]."},{"key":"e_1_3_3_204_2","article-title":"Population of global offline continues steady decline to 2.6 billion people in 2023","author":"ITU United Nations,","year":"2023","unstructured":"United Nations, ITU. 2023. Population of global offline continues steady decline to 2.6 billion people in 2023. ITU Press Release 1, 1 (2023), 1\u20132.","journal-title":"ITU Press Release"},{"key":"e_1_3_3_205_2","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-022-00465-9"},{"key":"e_1_3_3_206_2","article-title":"Principles of risk minimization for learning theory","volume":"4","author":"Vapnik Vladimir","year":"1991","unstructured":"Vladimir Vapnik. 1991. Principles of risk minimization for learning theory. Advances in Neural Information Processing Systems 4 (1991). https:\/\/proceedings.neurips.cc\/paper_files\/paper\/1991\/file\/ff4d5fbbafdf976cfdc032e3bde78de5-Paper.pdf","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_207_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","year":"2022","unstructured":"Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2022. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In Proceedings of the 11th International Conference on Learning Representations."},{"key":"e_1_3_3_208_2","unstructured":"Jason Wei Xuezhi Wang Dale Schuurmans Maarten Bosma Brian Ichter Fei Xia Ed Chi Quoc Le and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 1 (2022) 24824\u201324837."},{"key":"e_1_3_3_209_2","volume-title":"Evolutionary Game Theory","author":"Weibull J\u00f6rgen W.","year":"1997","unstructured":"J\u00f6rgen W. Weibull. 1997. Evolutionary Game Theory. MIT Press."},{"key":"e_1_3_3_210_2","unstructured":"Jiaxin Wen Ruiqi Zhong Akbir Khan Ethan Perez Jacob Steinhardt Minlie Huang Samuel R. Bowman He He and Shi Feng. 2025. Language models learn to mislead humans via RLHF. In The Thirteenth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=xJljiPE6dg"},{"issue":"136","key":"e_1_3_3_211_2","first-page":"1","article-title":"A survey of preference-based reinforcement learning methods","volume":"18","year":"2017","unstructured":"Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes F\u00fcrnkranz. 2017. A survey of preference-based reinforcement learning methods. 
Journal of Machine Learning Research 18, 136 (2017), 1\u201346.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_3_212_2","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Wu Yueh-Hua","year":"2018","unstructured":"Yueh-Hua Wu and Shou-De Lin. 2018. A low-cost ethics shaping approach for designing reinforcement learning agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32."},{"key":"e_1_3_3_213_2","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052591"},{"key":"e_1_3_3_214_2","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-023-00765-8"},{"key":"e_1_3_3_215_2","first-page":"945","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Yoo Jin Yong","year":"2021","unstructured":"Jin Yong Yoo and Yanjun Qi. 2021. Towards improving adversarial training of NLP models. In Findings of the Association for Computational Linguistics: EMNLP 2021. 945\u2013956."},{"key":"e_1_3_3_216_2","doi-asserted-by":"publisher","DOI":"10.5555\/3304652.3304793"},{"key":"e_1_3_3_217_2","article-title":"The AI alignment problem: Why it is hard, and where to start","author":"Yudkowsky Eliezer","year":"2016","unstructured":"Eliezer Yudkowsky. 2016. The AI alignment problem: Why it is hard, and where to start. Symbolic Systems Distinguished Speaker 4, 1 (2016), 1\u201320.","journal-title":"Symbolic Systems Distinguished Speaker"},{"key":"e_1_3_3_218_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1144"},{"key":"e_1_3_3_219_2","article-title":"Democratic inputs to AI","author":"Zaremba Wojciech","year":"2023","unstructured":"Wojciech Zaremba, Arka Dhar, Lama Ahmad, Tyna Eloundou, Shibani Santurkar, Sandhini Agarwal, and Jade Leung. 2023. Democratic inputs to AI. OpenAI Blog 1, 1 (2023).","journal-title":"OpenAI Blog"},{"key":"e_1_3_3_220_2","doi-asserted-by":"publisher","DOI":"10.1145\/3336191.3371790"},{"key":"e_1_3_3_221_2","first-page":"27765","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Zhou Jiayi","year":"2025","unstructured":"Jiayi Zhou, Jiaming Ji, Josef Dai, and Yaodong Yang. 2025. Sequence to sequence reward modeling: Improving RLHF by language feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 27765\u201327773."},{"key":"e_1_3_3_222_2","first-page":"15763","article-title":"Consequences of misaligned AI","volume":"33","author":"Zhuang Simon","year":"2020","unstructured":"Simon Zhuang and Dylan Hadfield-Menell. 2020. Consequences of misaligned AI. Advances in Neural Information Processing Systems 33, 1 (2020), 15763\u201315773.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_223_2","unstructured":"Brian D. Ziebart Andrew L. Maas J. Andrew Bagnell and Anind K. Dey. 2008. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3 (Chicago Illinois) (AAAI\u201908). AAAI Press 1433\u20131438."},{"key":"e_1_3_3_224_2","first-page":"9274","article-title":"Adversarial training for high-stakes reliability","volume":"35","author":"Ziegler Daniel","year":"2022","unstructured":"Daniel Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Benjamin Weinstein-Raun, Daniel de Haas, et\u00a0al. 2022. Adversarial training for high-stakes reliability. 
Advances in Neural Information Processing Systems 35, 1 (2022), 9274\u20139286.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_225_2","unstructured":"Andy Zou Long Phan Sarah Chen James Campbell Phillip Guo Richard Ren Alexander Pan Xuwang Yin Mantas Mazeika Ann-Kathrin Dombrowski Shashwat Goel Nathaniel Li Michael J. Byun Zifan Wang Alex Mallen Steven Basart Sanmi Koyejo Dawn Song Matt Fredrikson J. Zico Kolter and Dan Hendrycks. 2023. Representation engineering: A top-down approach to AI transparency. arXiv:2310.01405. Retrieved from https:\/\/arxiv.org\/abs\/2310.01405"},{"key":"e_1_3_3_226_2","unstructured":"Andy Zou Zifan Wang J. Zico Kolter and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043. Retrieved from https:\/\/arxiv.org\/abs\/2307.15043"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3770749","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,21]],"date-time":"2025-11-21T14:46:07Z","timestamp":1763736367000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3770749"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,21]]},"references-count":225,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2026,4,30]]}},"alternative-id":["10.1145\/3770749"],"URL":"https:\/\/doi.org\/10.1145\/3770749","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,21]]},"assertion":[{"value":"2024-12-24","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-03","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}