{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,27]],"date-time":"2026-04-27T21:37:12Z","timestamp":1777325832785,"version":"3.51.4"},"reference-count":80,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2025,6,1]],"date-time":"2025-06-01T00:00:00Z","timestamp":1748736000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,6,4]],"date-time":"2025-06-04T00:00:00Z","timestamp":1748995200000},"content-version":"vor","delay-in-days":3,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000780","name":"European Commission","doi-asserted-by":"publisher","award":["101120237"],"award-info":[{"award-number":["101120237"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003246","name":"Nederlandse Organisatie voor Wetenschappelijk Onderzoek","doi-asserted-by":"publisher","award":["024.004.022"],"award-info":[{"award-number":["024.004.022"]}],"id":[{"id":"10.13039\/501100003246","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Umea University"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Ethics Inf Technol"],"published-print":{"date-parts":[[2025,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLHF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics, and contributing to AI safety. We highlight tensions inherent in the goals of RLHF, as captured in the HHH principle (helpful, harmless and honest). In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLHF, among which the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We offer an alternative vision for AI safety and ethics which positions RLHF approaches within a broader context of comprehensive design across institutions, processes and technological systems, and suggest the establishment of AI safety as a sociotechnical discipline that is open to the normative and political dimensions of artificial intelligence.<\/jats:p>","DOI":"10.1007\/s10676-025-09837-2","type":"journal-article","created":{"date-parts":[[2025,6,4]],"date-time":"2025-06-04T03:29:38Z","timestamp":1749007778000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":22,"title":["Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback"],"prefix":"10.1007","volume":"27","author":[{"given":"Adam","family":"Dahlgren Lindstr\u00f6m","sequence":"first","affiliation":[]},{"given":"Leila","family":"Methnani","sequence":"additional","affiliation":[]},{"given":"Lea","family":"Krause","sequence":"additional","affiliation":[]},{"given":"Petter","family":"Ericson","sequence":"additional","affiliation":[]},{"given":"\u00cd\u00f1igo Mart\u00ednez","family":"de Rituerto de Troya","sequence":"additional","affiliation":[]},{"given":"Dimitri","family":"Coelho Mollo","sequence":"additional","affiliation":[]},{"given":"Roel","family":"Dobbe","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,6,4]]},"reference":[{"key":"9837_CR1","doi-asserted-by":"publisher","unstructured":"Aler Tubella, A., Coelho Mollo, D., Dahlgren Lindstr\u00f6m, A., Devinney, H., Dignum, V., Ericson, P., Jonsson, A., Kampik, T., Lenaerts, T., Mendez, J.A., & Nieves, J.C. (2023). ACROCPoLis: A descriptive framework for making sense of fairness. In: 2023 ACM Conference on Fairness, Accountability, and Transparency. FAccT \u201923. ACM, Chicago, IL, USA. https:\/\/doi.org\/10.1145\/3593013.3594059.","DOI":"10.1145\/3593013.3594059"},{"key":"9837_CR2","unstructured":"Anderljung, M., Barnhart, J., Korinek, A., Leung, J., & O\u2019Keefe, C. (2023). Frontier AI regulation: Managing emerging risks to public safety. arXiv:2307.03718."},{"key":"9837_CR3","unstructured":"Askell, A., Bai, Y., Chen, A., Drain, D., & Ganguli, D. (2021). A general language assistant as a laboratory for alignment. arXiv:2112.00861."},{"key":"9837_CR4","doi-asserted-by":"crossref","unstructured":"Atari, M., Xue, M.J., Park, P.S., Blasi, D., & Henrich, J. (2023). Which humans? PsyArXiv:5b26t.","DOI":"10.31234\/osf.io\/5b26t"},{"key":"9837_CR5","unstructured":"Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073."},{"key":"9837_CR6","unstructured":"Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862."},{"key":"9837_CR7","unstructured":"Bansal, H., Dang, J., & Grover, A. Peering through preferences: Unraveling feedback acquisition for aligning large language models. In: ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models."},{"key":"9837_CR8","doi-asserted-by":"publisher","unstructured":"Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT \u201921, pp. 610\u2013623. Association for Computing Machinery, New York, NY, USA. https:\/\/doi.org\/10.1145\/3442188.3445922. Accessed 2021-05-14.","DOI":"10.1145\/3442188.3445922"},{"key":"9837_CR9","doi-asserted-by":"publisher","unstructured":"Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., et al. (2025). International AI Safety Report. https:\/\/doi.org\/10.48550\/arXiv.2501.17805. arXiv:2501.17805. Accessed 2025-02-05.","DOI":"10.48550\/arXiv.2501.17805"},{"key":"9837_CR10","doi-asserted-by":"publisher","unstructured":"Casper, S., Davies, X., Shi, C., Krendl Gilbert, T., et al. (2023). Open problems and fundamental limitations of Reinforcement Learning from Human Feedback. Transactions on Machine Learning Research. https:\/\/doi.org\/10.3929\/ethz-b-000651806","DOI":"10.3929\/ethz-b-000651806"},{"key":"9837_CR11","unstructured":"Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems 30."},{"key":"9837_CR12","unstructured":"Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., & Sun, M. (2023). Ultrafeedback: Boosting language models with high-quality feedback. arXiv:2310.01377."},{"key":"9837_CR13","unstructured":"Dinan, E., Abercrombie, G., Bergman, A.S., Spruit, S., Hovy, D., Boureau, Y.-L., & Rieser, V. (2021). Anticipating safety issues in e2e conversational AI: Framework and tooling. arXiv:2107.03451."},{"key":"9837_CR14","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2021.103555","author":"R Dobbe","year":"2021","unstructured":"Dobbe, R., Krendl Gilbert, T., & Mintz, Y. (2021). Hard choices in artificial intelligence. Artificial Intelligence. https:\/\/doi.org\/10.1016\/j.artint.2021.103555","journal-title":"Artificial Intelligence"},{"key":"9837_CR15","doi-asserted-by":"crossref","unstructured":"Dobbe, R. (2022). System safety and artificial intelligence. In: The Oxford Handbook of AI Governance, p. 67. Oxford University Press, Oxford, UK.","DOI":"10.1093\/oxfordhb\/9780197579329.013.67"},{"key":"9837_CR16","unstructured":"Dobbe, R. (2023). \u2018Safety Washing\u2019 at the AI Safety Summit. https:\/\/www.linkedin.com\/pulse\/safety-washing-ai-summit-roel-dobbe-gy4oe. Accessed 2024-03-03."},{"issue":"2","key":"9837_CR17","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1007\/s11023-024-09668-y","volume":"34","author":"R Dobbe","year":"2024","unstructured":"Dobbe, R., & Wolters, A. (2024). Toward sociotechnical AI: Mapping vulnerabilities for machine learning in context. Minds and Machines, 34(2), 12. https:\/\/doi.org\/10.1007\/s11023-024-09668-y","journal-title":"Minds and Machines"},{"key":"9837_CR18","unstructured":"Dzieza, J. (2023). AI is a lot of work. The Verge."},{"key":"9837_CR19","unstructured":"Fricker, M. (2010). Epistemic Injustice (Reprinted). Oxford University Press."},{"key":"9837_CR20","unstructured":"Franceschi-Bicchierai, L. (2023). Jailbreak tricks Discord\u2019s new chatbot into sharing napalm and meth instructions. TechCrunch."},{"key":"9837_CR21","unstructured":"Gabriel, I., Manzini, A., Keeling, G., Hendricks, L.A., Rieser, V., Iqbal, H., Toma\u0161ev, N., Ktena, I., Kenton, Z., Rodriguez, M., El-Sayed, S., Brown, S., Akbulut, C., Trask, A., Hughes, E., Bergman, A.S., Shelby, R., Marchal, N., Griffin, C., et al. (2024). The Ethics of Advanced AI Assistants. arXiv:2404.16244."},{"key":"9837_CR22","unstructured":"Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. (2022). Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv:2209.07858."},{"key":"9837_CR23","doi-asserted-by":"crossref","unstructured":"Gansky, B., & McDonald, S. (2022). CounterFAccTual: How FAccT undermines its organizing principles. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1982\u20131992.","DOI":"10.1145\/3531146.3533241"},{"key":"9837_CR24","unstructured":"Goldberg, Y. (2019). Assessing BERT\u2019s syntactic abilities. arXiv:1901.05287."},{"key":"9837_CR25","doi-asserted-by":"publisher","unstructured":"Goot, M.J., Koubayov\u00e1, N., & Reijmersdal, E.A. (2024). Understanding users\u2019 responses to disclosed vs. undisclosed customer service chatbots: a mixed methods study. AI & SOCIETY. https:\/\/doi.org\/10.1007\/s00146-023-01818-7.","DOI":"10.1007\/s00146-023-01818-7"},{"key":"9837_CR26","unstructured":"Gray, M.L., & Suri, S. (2019). Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Harper Business."},{"key":"9837_CR27","doi-asserted-by":"crossref","unstructured":"Grice, H.P. (1975). Logic and conversation. In: Speech Acts, pp. 41\u201358. Brill, Leiden, the Netherlands.","DOI":"10.1163\/9789004368811_003"},{"key":"9837_CR28","unstructured":"Guo, E. (2025). An AI chatbot told a user how to kill himself-but the company doesn\u2019t want to \u201ccensor\u201d it. MIT Technology Review. https:\/\/www.technologyreview.com\/2025\/02\/06\/1111077\/nomi-ai-chatbot-told-user-to-kill-himself\/. Accessed 2025-02-26."},{"key":"9837_CR29","doi-asserted-by":"publisher","DOI":"10.1007\/s10676-023-09742-6","author":"P Helm","year":"2024","unstructured":"Helm, P., Bella, G., Koch, G., & Giunchiglia, F. (2024). Diversity and language technology: How language modeling bias causes epistemic injustice. Ethics and Information Technology. https:\/\/doi.org\/10.1007\/s10676-023-09742-6","journal-title":"Ethics and Information Technology"},{"key":"9837_CR30","doi-asserted-by":"publisher","unstructured":"Hershcovich, D., Frank, S., Lent, H., Lhoneux, M., Abdou, M., Brandl, S., Bugliarello, E., Cabello Piqueras, L., Chalkidis, I., Cui, R., Fierro, C., Margatina, K., Rust, P., & S\u00f8gaard, A. (2022). Challenges and strategies in cross-cultural NLP. In: Muresan, S., Nakov, P., Villavicencio, A. (eds.) Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6997\u20137013. Association for Computational Linguistics, Dublin, Ireland. https:\/\/doi.org\/10.18653\/v1\/2022.acl-long.482. https:\/\/aclanthology.org\/2022.acl-long.482\/.","DOI":"10.18653\/v1\/2022.acl-long.482"},{"key":"9837_CR31","doi-asserted-by":"crossref","unstructured":"Jawahar, G., Sagot, B., & Seddah, D. (2019). What does BERT learn about the structure of language? In: ACL 2019-57th Annual Meeting of the Association for Computational Linguistics.","DOI":"10.18653\/v1\/P19-1356"},{"key":"9837_CR32","first-page":"24678","volume":"36","author":"J Ji","year":"2024","unstructured":"Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., Chen, B., Sun, R., Wang, Y., & Yang, Y. (2024). Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems, 36, 24678\u201324704.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"9837_CR33","doi-asserted-by":"publisher","unstructured":"Joshi, P., Santy, S., Budhiraja, A., Bali, K., & Choudhury, M. (2020). The state and fate of linguistic diversity and inclusion in the NLP world. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6282\u20136293. Association for Computational Linguistics, Online. https:\/\/doi.org\/10.18653\/v1\/2020.acl-main.560. https:\/\/aclanthology.org\/2020.acl-main.560\/.","DOI":"10.18653\/v1\/2020.acl-main.560"},{"issue":"7815","key":"9837_CR34","doi-asserted-by":"publisher","first-page":"169","DOI":"10.1038\/d41586-020-02003-2","volume":"583","author":"P Kalluri","year":"2020","unstructured":"Kalluri, P. (2020). Don\u2019t ask if artificial intelligence is good or fair, ask how it shifts power. Nature, 583(7815), 169.","journal-title":"Nature"},{"key":"9837_CR35","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-65647-7_5","volume-title":"Tuning for LLM Alignment","author":"U Kamath","year":"2024","unstructured":"Kamath, U., Keenan, K., Somers, G., & Sorenson, S. (2024). Tuning for LLM Alignment. In Large Language Models: A Deep Dive: Bridging Theory and Practice (pp. 177\u2013218). Cham: Springer Nature Switzerland."},{"key":"9837_CR36","unstructured":"Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171\u20134186."},{"issue":"1","key":"9837_CR37","doi-asserted-by":"publisher","first-page":"241","DOI":"10.1016\/j.chb.2011.09.006","volume":"28","author":"Y Kim","year":"2012","unstructured":"Kim, Y., & Sundar, S. S. (2012). Anthropomorphism of computers: Is it mindful or mindless? Computers in Human Behavior, 28(1), 241\u2013250. https:\/\/doi.org\/10.1016\/j.chb.2011.09.006","journal-title":"Computers in Human Behavior"},{"key":"9837_CR38","doi-asserted-by":"publisher","unstructured":"Kirk, H., Bean, A., Vidgen, B., Rottger, P., & Hale, S. (2023a). The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2409\u20132430. Association for Computational Linguistics, Singapore. https:\/\/doi.org\/10.18653\/v1\/2023.emnlp-main.148. https:\/\/aclanthology.org\/2023.emnlp-main.148. Accessed 2024-06-25.","DOI":"10.18653\/v1\/2023.emnlp-main.148"},{"key":"9837_CR39","unstructured":"Kirk, H., Vidgen, B., Rottger, P., & Hale, S. (2023b). The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising \u201cAlignment\u201d in Large Language Models. In: Socially Responsible Language Modelling Research. https:\/\/openreview.net\/forum?id=6mHKQkV8NY."},{"key":"9837_CR40","unstructured":"Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., & Raileanu, R. (2024a). Understanding the effects of RLHF on LLM generalisation and diversity. arXiv:2310.06452."},{"key":"9837_CR41","doi-asserted-by":"publisher","unstructured":"Kirk, H. R., Vidgen, B., R\u00f6ttger, P., & Hale, S. A. (2024b). The benefits, risks and bounds of personalizing the alignment of large language models to individuals. Nature Machine Intelligence, 6(4), 383\u2013392. https:\/\/doi.org\/10.1038\/s42256-024-00820-y","DOI":"10.1038\/s42256-024-00820-y"},{"key":"9837_CR42","first-page":"105236","volume":"37","author":"HR Kirk","year":"2025","unstructured":"Kirk, H. R., Whitefield, A., Rottger, P., Bean, A. M., Margatina, K., Mosquera-Gomez, R., Ciro, J., Bartolo, M., Williams, A., He, H., et al. (2025). The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models. Advances in Neural Information Processing Systems, 37, 105236\u2013105344.","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"3","key":"9837_CR43","doi-asserted-by":"publisher","first-page":"240","DOI":"10.1504\/IJTPM.2005.008406","volume":"5","author":"J Koppenjan","year":"2005","unstructured":"Koppenjan, J., & Groenewegen, J. (2005). Institutional design for complex technological systems. International Journal of Technology, Policy and Management, 5(3), 240\u2013257. https:\/\/doi.org\/10.1504\/IJTPM.2005.008406","journal-title":"International Journal of Technology, Policy and Management"},{"key":"9837_CR44","unstructured":"Krause, L., Tufa, W., Baez Santamaria, S., Daza, A., Khurana, U., & Vossen, P. (2023). Confidently wrong: Exploring the calibration and expression of (un)certainty of large language models in a multilingual setting. In: Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023), Prague, Czech Republic, pp. 1\u20139. https:\/\/aclanthology.org\/2023.mmnlg-1.1."},{"key":"9837_CR45","unstructured":"Lambert, N., & Calandra, R. (2023). The alignment ceiling: Objective mismatch in Reinforcement Learning from Human Feedback. arXiv:2311.00168."},{"key":"9837_CR46","doi-asserted-by":"crossref","unstructured":"Lambert, N., Gilbert, T.K., & Zick, T. (2023). Entangled preferences: The history and risks of reinforcement learning and human feedback. arXiv:2310.13595.","DOI":"10.1145\/3600211.3604698"},{"key":"9837_CR47","unstructured":"Lee, H., Phatale, S., Mansoor, H., Lu, K., Mesnard, T., Bishop, C., Carbune, V., & Rastogi, A. (2023). RLAIF: Scaling Reinforcement Learning from Human Feedback with AI feedback. arXiv:2309.00267."},{"key":"9837_CR48","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/8179.001.0001","volume-title":"Engineering a Safer World: Systems Thinking Applied to Safety","author":"NG Leveson","year":"2012","unstructured":"Leveson, N. G. (2012). Engineering a Safer World: Systems Thinking Applied to Safety. MIT Press."},{"key":"9837_CR49","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2501.08617","author":"K Liang","year":"2025","unstructured":"Liang, K., Hu, H., Liu, R., Griffiths, T. L., & Fisac, J. F. (2025). RLHS: Mitigating misalignment in RLHF with hindsight simulation. arXiv. https:\/\/doi.org\/10.48550\/arXiv.2501.08617","journal-title":"arXiv"},{"key":"9837_CR50","unstructured":"Liu, R., Sumers, T.R., Dasgupta, I., & Griffiths, T.L. (2024). How do large language models navigate conflicts between honesty and helpfulness? arXiv:2402.07282."},{"key":"9837_CR51","unstructured":"Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Guo, R., Cheng, H., Klochkov, Y., Taufiq, M.F., & Li, H. (2023). Trustworthy LLMs: a survey and guideline for evaluating large language models\u2019 alignment. In: Socially Responsible Language Modelling Research. https:\/\/openreview.net\/forum?id=oss9uaPFfB."},{"issue":"CSCW2","key":"9837_CR52","first-page":"1","volume":"6","author":"M Miceli","year":"2022","unstructured":"Miceli, M., & Posada, J. (2022). The data-production dispositif. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2), 1\u201337.","journal-title":"Proceedings of the ACM on Human-Computer Interaction"},{"key":"9837_CR53","doi-asserted-by":"publisher","unstructured":"Milli\u00e8re, R. (2023). The Alignment Problem in Context. arXiv. https:\/\/doi.org\/10.48550\/ARXIV.2311.02147","DOI":"10.48550\/ARXIV.2311.02147"},{"key":"9837_CR54","unstructured":"Mozes, M., He, X., Kleinberg, B., & Griffin, L.D. (2023). Use of LLMs for illicit purposes: Threats, prevention measures, and vulnerabilities. arXiv:2308.12833."},{"key":"9837_CR55","unstructured":"Narayanan, A., Kapoor, S., & Seth, L. (2023). Model alignment protects against accidental harms, not intentional ones. AI Snake Oil Blog. https:\/\/www.aisnakeoil.com\/p\/model-alignment-protects-against."},{"key":"9837_CR56","doi-asserted-by":"crossref","unstructured":"Nouws, S., Mart\u00ednez De Rituerto De Troya, \u00cd., Dobbe, R., & Janssen, M. (2023). Diagnosing and addressing emergent harms in the design process of public AI and algorithmic systems. In: Proceedings of the 24th Annual International Conference on Digital Government Research, pp. 679\u2013681.","DOI":"10.1145\/3598469.3598557"},{"key":"9837_CR57","unstructured":"Nouws, S.J.J., & Dobbe, R.I.J. (2024). The Rule of Law for Artificial Intelligence in Public Administration: A System Safety Perspective. In: Digital Governance: Confronting the Challenges Posed by Artificial Intelligence. TMC Asser Press. https:\/\/surfdrive.surf.nl\/files\/index.php\/s\/gjt6Cg8RgxEVpWE. Accessed 2024-09-23."},{"key":"9837_CR58","first-page":"27730","volume":"35","author":"L Ouyang","year":"2022","unstructured":"Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730\u201327744.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"9837_CR59","doi-asserted-by":"crossref","unstructured":"Park, P.S., Goldstein, S., O\u2019Gara, A., Chen, M., & Hendrycks, D. (2023). AI deception: A survey of examples, risks, and potential solutions. arXiv:2308.14752.","DOI":"10.1016\/j.patter.2024.100988"},{"key":"9837_CR60","volume-title":"Findings of the Association for Computational Linguistics","author":"E Perez","year":"2023","unstructured":"Perez, E., Ringer, S., Lukosiute, K., Nguyen, K., & Chen, E. (2023). Discovering language model behaviors with model-written evaluations. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Findings of the Association for Computational Linguistics. Association for Computational Linguistics."},{"key":"9837_CR61","unstructured":"Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., & Finn, C. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36."},{"key":"9837_CR62","unstructured":"Raji, I.D., & Dobbe, R. (2020). Concrete problems in AI safety, revisited. In: ICLR Workshop on ML in the Real World."},{"key":"9837_CR63","doi-asserted-by":"publisher","DOI":"10.1145\/3593013.3594014","author":"B Rakova","year":"2023","unstructured":"Rakova, B., & Dobbe, R. (2023). Algorithms as social-ecological-technological systems: An environmental justice lens on algorithmic audits. arXiv. https:\/\/doi.org\/10.1145\/3593013.3594014","journal-title":"arXiv"},{"key":"9837_CR64","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2410.22526","author":"S Rismani","year":"2024","unstructured":"Rismani, S., Dobbe, R., & Moon, A. (2024). From silos to systems: Process-oriented hazard analysis for AI systems. arXiv. https:\/\/doi.org\/10.48550\/arXiv.2410.22526","journal-title":"arXiv"},{"issue":"2","key":"9837_CR65","doi-asserted-by":"publisher","first-page":"88","DOI":"10.1080\/21507740.2020.1740350","volume":"11","author":"A Salles","year":"2020","unstructured":"Salles, A., Evers, K., & Farisco, M. (2020). Anthropomorphism in AI. AJOB Neuroscience, 11(2), 88\u201395. https:\/\/doi.org\/10.1080\/21507740.2020.1740350","journal-title":"AJOB Neuroscience"},{"key":"9837_CR66","unstructured":"Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438."},{"key":"9837_CR67","unstructured":"Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S.R., Durmus, E., Hatfield-Dodds, Z., Johnston, S.R., Kravec, S.M., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., & Perez, E. (2024). Towards understanding sycophancy in language models. In: The Twelfth International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=tvhaxkMKAn."},{"key":"9837_CR68","doi-asserted-by":"publisher","unstructured":"Shelby, R., Rismani, S., Henne, K., Moon, A., Rostamzadeh, N., Nicholas, P., Yilla-Akbari, N., Gallegos, J., Smart, A., Garcia, E., & Virk, G. (2023). Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction. In: Proceedings of the 2023 AAAI\/ACM Conference on AI, Ethics, and Society. AIES \u201923, pp. 723\u2013741. Association for Computing Machinery, New York, NY, USA. https:\/\/doi.org\/10.1145\/3600211.3604673.","DOI":"10.1145\/3600211.3604673"},{"key":"9837_CR69","doi-asserted-by":"crossref","unstructured":"Sloane, M., Moss, E., Awomolo, O., & Forlano, L. (2022). Participation is not a design fix for machine learning. In: Equity and Access in Algorithms, Mechanisms, and Optimization, pp. 1\u20136.","DOI":"10.1145\/3551624.3555285"},{"key":"9837_CR70","doi-asserted-by":"crossref","unstructured":"Song, F., Yu, B., Li, M., Yu, H., Huang, F., Li, Y., & Wang, H. (2024). Preference ranking optimization for human alignment. In: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24).","DOI":"10.1609\/aaai.v38i17.29865"},{"key":"9837_CR71","unstructured":"Sorensen, T., Moore, J., Fisher, J., Gordon, M., Mireshghallah, N., Rytting, C.M., Ye, A., Jiang, L., Lu, X., Dziri, N., et al. (2024). Position: a roadmap to pluralistic alignment. In: Proceedings of the 41st International Conference on Machine Learning, pp. 46280\u201346302."},{"key":"9837_CR72","unstructured":"The Guardian (2024). Mother says AI chatbot led her son to kill himself in lawsuit against its maker. The Guardian. Accessed: 2025-02-27."},{"key":"9837_CR73","unstructured":"The Times (2023). AI chatbot blamed for Belgian man\u2019s suicide. The Times. Accessed: 2025-02-27."},{"key":"9837_CR74","first-page":"74952","volume":"36","author":"M Turpin","year":"2023","unstructured":"Turpin, M., Michael, J., Perez, E., & Bowman, S. (2023). Language models don\u2019t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 74952\u201374965.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"9837_CR75","doi-asserted-by":"publisher","DOI":"10.1007\/s13347-022-00577-5","author":"B Vaassen","year":"2022","unstructured":"Vaassen, B. (2022). AI, opacity, and personal autonomy. Philosophy and Technology. https:\/\/doi.org\/10.1007\/s13347-022-00577-5","journal-title":"Philosophy and Technology"},{"key":"9837_CR76","unstructured":"Wei, J., Huang, D., Lu, Y., Zhou, D., & Le, Q.V. (2024). Simple synthetic data reduces sycophancy in large language models. arXiv:2308.03958."},{"key":"9837_CR77","unstructured":"Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L., Hendricks, L.A., Isaac, W., Legassick, S., Irving, G., & Gabriel, I. (2021). Ethical and social risks of harm from language models. arXiv:2112.04359."},{"key":"9837_CR78","doi-asserted-by":"crossref","unstructured":"Weizenbaum, J. (1977). Computer Power and Human Reason: From Judgment to Calculation (1st ed.). W. H. Freeman & Co.","DOI":"10.1063\/1.3037375"},{"key":"9837_CR79","unstructured":"Wu, M., & Aji, A.F. (2025). Style over substance: Evaluation biases for large language models. In: Proceedings of the 31st International Conference on Computational Linguistics, pp. 297\u2013312."},{"key":"9837_CR80","unstructured":"Zhuo, T., Huang, Y., Chen, C., & Xing, Z. (2023). Red teaming ChatGPT via jailbreaking: Bias, robustness, reliability and toxicity. arXiv:2301.12867."}],"container-title":["Ethics and Information Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10676-025-09837-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10676-025-09837-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10676-025-09837-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T17:30:33Z","timestamp":1757179833000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10676-025-09837-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6]]},"references-count":80,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6]]}},"alternative-id":["9837"],"URL":"https:\/\/doi.org\/10.1007\/s10676-025-09837-2","relation":{},"ISSN":["1388-1957","1572-8439"],"issn-type":[{"value":"1388-1957","type":"print"},{"value":"1572-8439","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6]]},"assertion":[{"value":"4 June 2025","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"28"}}