{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,17]],"date-time":"2026-05-17T09:55:10Z","timestamp":1779011710268,"version":"3.51.4"},"reference-count":222,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>A significant challenge in training large language models (LLMs) as effective assistants is aligning them with human preferences. Reinforcement learning from human feedback (RLHF) has emerged as a promising solution. However, our understanding of RLHF is often limited to initial design choices. This article analyzes RLHF through reinforcement learning principles, focusing on the reward model. It examines modeling choices and function approximation caveats, highlighting assumptions about reward expressivity and revealing limitations like incorrect generalization, model misspecification, and sparse feedback. 
A categorical review of current literature provides insights for researchers to understand the challenges of RLHF and build upon existing methods.<\/jats:p>","DOI":"10.1145\/3743127","type":"journal-article","created":{"date-parts":[[2025,6,5]],"date-time":"2025-06-05T07:32:12Z","timestamp":1749108732000},"page":"1-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":49,"title":["RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs"],"prefix":"10.1145","volume":"58","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8826-2253","authenticated-orcid":false,"given":"Shreyas","family":"Chaudhari","sequence":"first","affiliation":[{"name":"University of Massachusetts Amherst","place":["Amherst, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2962-1535","authenticated-orcid":false,"given":"Pranjal","family":"Aggarwal","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University","place":["Pittsburgh, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-9304-0701","authenticated-orcid":false,"given":"Vishvak","family":"Murahari","sequence":"additional","affiliation":[{"name":"Princeton University","place":["Princeton, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9302-4244","authenticated-orcid":false,"given":"Tanmay","family":"Rajpurohit","sequence":"additional","affiliation":[{"name":"Independent Researcher","place":["Seattle, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-3448-3305","authenticated-orcid":false,"given":"Ashwin","family":"Kalyan","sequence":"additional","affiliation":[{"name":"Independent Researcher","place":["Seattle, United 
States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9894-9983","authenticated-orcid":false,"given":"Karthik","family":"Narasimhan","sequence":"additional","affiliation":[{"name":"Princeton University","place":["Princeton, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9885-0385","authenticated-orcid":false,"given":"Ameet","family":"Deshpande","sequence":"additional","affiliation":[{"name":"Princeton University","place":["Princeton, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3708-5728","authenticated-orcid":false,"given":"Bruno","family":"Castro da Silva","sequence":"additional","affiliation":[{"name":"University of Massachusetts Amherst","place":["Amherst, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,9,10]]},"reference":[{"key":"e_1_3_3_2_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Aggarwal Pranjal","year":"2023","unstructured":"Pranjal Aggarwal, A. Deshpande, and Karthik Narasimhan. 2023. SemSup-XC: Semantic supervision for zero and few-shot extreme classification. In Proceedings of the International Conference on Machine Learning. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:256274863"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","unstructured":"Anirudh Ajith Chris Pan Mengzhou Xia Ameet Deshpande and Karthik Narasimhan. 2024. InstructEval: Systematic evaluation of instruction selection methods. In Findings of the Association for Computational Linguistics: NAACL 2024 Kevin Duh Helena Gomez and Steven Bethard (Eds.). Association for Computational Linguistics Mexico City Mexico 4336\u20134350. 
10.18653\/v1\/2024.findings-naacl.270","DOI":"10.18653\/v1\/2024.findings-naacl.270"},{"key":"e_1_3_3_4_2","doi-asserted-by":"crossref","unstructured":"Sina Alemohammad Josue Casco-Rodriguez Lorenzo Luzi Ahmed Imtiaz Humayun Hossein Babaei Daniel LeJeune Ali Siahkoohi and Richard G. Baraniuk. 2023. Self-Consuming generative models go MAD. arxiv:2307.01850 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2307.01850","DOI":"10.52591\/lxai202312101"},{"key":"e_1_3_3_5_2","unstructured":"Marcin Andrychowicz Filip Wolski Alex Ray Jonas Schneider Rachel Fong Peter Welinder Bob McGrew Josh Tobin OpenAI Pieter Abbeel and Wojciech Zaremba. 2017. Hindsight experience replay. Advances in Neural Information Processing Systems 30 (2017) 5048\u20135058."},{"key":"e_1_3_3_6_2","unstructured":"Rohan Anil Andrew M. Dai Orhan Firat Melvin Johnson Dmitry Lepikhin Alexandre Tachard Passos Siamak Shakeri Emanuel Taropa Paige Bailey Z. Chen et\u00a0al. 2023. PaLM 2 Technical Report."},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.aacl-main.39"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","unstructured":"Kai Arulkumaran Marc Peter Deisenroth Miles Brundage and Anil Anthony Bharath. 2017. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine 34 6 (2017) 26\u201338. 10.1109\/MSP.2017.2743240","DOI":"10.1109\/MSP.2017.2743240"},{"key":"e_1_3_3_9_2","unstructured":"Amanda Askell Yuntao Bai Anna Chen Dawn Drain Deep Ganguli T. J. Henighan Andy Jones Nicholas Joseph Benjamin Mann Nova DasSarma et\u00a0al. 2021. A general language assistant as a laboratory for alignment. ArXiv abs\/2112.00861 (2021)."},{"key":"e_1_3_3_10_2","unstructured":"Mohammad Gheshlaghi Azar Mark Rowland Bilal Piot Daniel Guo Daniele Calandriello Michal Valko and R\u00e9mi Munos. 2023. A general theoretical paradigm to understand learning from human preferences. arxiv:2310.12036 [cs.AI]. 
Retrieved from https:\/\/arxiv.org\/abs\/2310.12036"},{"key":"e_1_3_3_11_2","unstructured":"Dzmitry Bahdanau Philemon Brakel Kelvin Xu Anirudh Goyal Ryan Lowe Joelle Pineau Aaron C. Courville and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv:1607.07086. Retrieved from https:\/\/arxiv.org\/abs\/1607.07086"},{"key":"e_1_3_3_12_2","unstructured":"Yuntao Bai Saurav Kadavath Sandipan Kundu Amanda Askell John Kernion Andy Jones Anna Chen Anna Goldie Azalia Mirhoseini Cameron McKinnon et\u00a0al. 2022. Constitutional AI: harmlessness from AI feedback. ArXiv abs\/2212.08073 (2022)."},{"key":"e_1_3_3_13_2","unstructured":"Yuntao Bai Andy Jones Kamal Ndousse Amanda Askell Anna Chen Nova DasSarma Dawn Drain Stanislav Fort Deep Ganguli et\u00a0al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. ArXiv abs\/2204.05862 (2022)."},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cmpb.2021.106504"},{"key":"e_1_3_3_15_2","volume-title":"Proceedings of the Conference on Robot Learning","author":"Bajcsy Andrea V.","year":"2017","unstructured":"Andrea V. Bajcsy, Dylan P. Losey, Marcia Kilchenman O\u2019Malley, and Anca D. Dragan. 2017. Learning robot objectives from physical human interaction. In Proceedings of the Conference on Robot Learning. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:28406224"},{"key":"e_1_3_3_16_2","unstructured":"Peter Barnett Rachel Freedman Justin Svegliato and Stuart Russell. 2023. Active reward learning from multiple teachers. arXiv:2303.00894. Retrieved from https:\/\/arxiv.org\/abs\/2303.00894"},{"key":"e_1_3_3_17_2","doi-asserted-by":"crossref","unstructured":"Andrew G. Barto and Sridhar Mahadevan. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13 4 (2003) 41\u201377. 
https:\/\/api.semanticscholar.org\/CorpusID:386824","DOI":"10.1023\/A:1022140919877"},{"key":"e_1_3_3_18_2","volume-title":"Proceedings of the NIPS","author":"Bellemare Marc G.","year":"2016","unstructured":"Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and R\u00e9mi Munos. 2016. Unifying count-based exploration and intrinsic motivation. In Proceedings of the NIPS. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:8310565"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.5555\/560669"},{"key":"e_1_3_3_20_2","unstructured":"Christopher Bishop. 2006. Pattern recognition and machine learning. Springer."},{"key":"e_1_3_3_21_2","unstructured":"Kevin Black Michael Janner Yilun Du Ilya Kostrikov and Sergey Levine. 2023. Training diffusion models with reinforcement learning. ArXiv abs\/2305.13301 (2023)."},{"key":"e_1_3_3_22_2","doi-asserted-by":"crossref","unstructured":"Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The Method of Paired Comparisons. Biometrika 39 3\/4 (1952) 324.","DOI":"10.2307\/2334029"},{"key":"e_1_3_3_23_2","unstructured":"Daniel S. Brown and Scott Niekum. 2019. Deep Bayesian reward learning from preferences. arXiv:1912.04472. Retrieved from https:\/\/arxiv.org\/abs\/1912.04472"},{"key":"e_1_3_3_24_2","unstructured":"Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et\u00a0al. 2020. Language models are few-shot learners. ArXiv abs\/2005.14165 (2020)."},{"key":"e_1_3_3_25_2","unstructured":"Stephen Casper Xander Davies Claudia Shi Thomas Krendl Gilbert J\u00e9r\u00e9my Scheurer Javier Rando Rachel Freedman Tomasz Korbak David Lindner Pedro Freire et\u00a0al. 2023. Open problems and fundamental limitations of reinforcement learning from human feedback. 
arXiv:2307.15217 [cs.AI]."},{"key":"e_1_3_3_26_2","unstructured":"Angelica Chen. 2023. Improving code generation by training with natural language feedback. arXiv:2303.16749. Retrieved from https:\/\/arxiv.org\/abs\/2303.16749. https:\/\/api.semanticscholar.org\/CorpusID:257804798"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.174"},{"key":"e_1_3_3_28_2","unstructured":"Wei-Lin Chiang Zhuohan Li Zi Lin Ying Sheng Zhanghao Wu Hao Zhang Lianmin Zheng Siyuan Zhuang Yonghao Zhuang Joseph E. Gonzalez Ion Stoica and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. Blog post March 30 (2023). https:\/\/lmsys.org\/blog\/2023-03-30-vicuna\/"},{"key":"e_1_3_3_29_2","unstructured":"Leshem Choshen Lior Fox Zohar Aizenbud and Omri Abend. 2019. On the weaknesses of reinforcement learning for neural machine translation. arXiv:1907.01752. Retrieved from https:\/\/arxiv.org\/abs\/1907.01752"},{"key":"e_1_3_3_30_2","unstructured":"Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra Adam Roberts Paul Barham Hyung Won Chung Charles Sutton Sebastian Gehrmann et\u00a0al. 2022. PaLM: Scaling language modeling with pathways. ArXiv abs\/2204.02311 (2022)."},{"key":"e_1_3_3_31_2","unstructured":"Paul Francis Christiano Jan Leike Tom B. Brown Miljan Martic Shane Legg and Dario Amodei. 2017. Deep reinforcement learning from human preferences. arXiv:1706.03741. Retrieved from https:\/\/arxiv.org\/abs\/1706.03741"},{"key":"e_1_3_3_32_2","unstructured":"Karl Cobbe Vineet Kosaraju Mohammad Bavarian Mark Chen Heewoo Jun Lukasz Kaiser Matthias Plappert Jerry Tworek Jacob Hilton Reiichiro Nakano et al. 2021. Training verifiers to solve math word problems. arXiv:2110.14168. Retrieved from https:\/\/arxiv.org\/abs\/2110.14168"},{"key":"e_1_3_3_33_2","unstructured":"Thomas Coste Usman Anwar Robert Kirk and David Krueger. 2023. Reward model ensembles help mitigate overoptimization. 
arxiv:2310.02743 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2310.02743"},{"key":"e_1_3_3_34_2","unstructured":"Josef Dai Xuehai Pan Ruiyang Sun Jiaming Ji Xinbo Xu Mickel Liu Yizhou Wang and Yaodong Yang. 2023. Safe RLHF: Safe reinforcement learning from human feedback. arxiv:2310.12773 [cs.AI]. Retrieved from https:\/\/arxiv.org\/abs\/2310.12773"},{"key":"e_1_3_3_35_2","unstructured":"DeepSeek-AI Daya Guo Dejian Yang Haowei Zhang Junxiao Song Ruoyu Zhang Runxin Xu Qihao Zhu Shirong Ma Peiyi Wang et\u00a0al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948 [cs.CL]. https:\/\/arxiv.org\/abs\/2501.12948"},{"key":"e_1_3_3_36_2","unstructured":"DeepSeek-AI Aixin Liu Bei Feng Bing Xue Bingxuan Wang Bochao Wu Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang et\u00a0al. 2025. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL]. https:\/\/arxiv.org\/abs\/2412.19437"},{"key":"e_1_3_3_37_2","unstructured":"Haikang Deng and Colin Raffel. 2023. Reward-Augmented decoding: Efficient controlled text generation with a unidirectional reward model. arxiv:2310.09520 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2310.09520"},{"key":"e_1_3_3_38_2","doi-asserted-by":"crossref","unstructured":"A. Deshpande Vishvak Murahari Tanmay Rajpurohit A. Kalyan and Karthik Narasimhan. 2023. Toxicity in ChatGPT: Analyzing persona-assigned language models. arXiv:2304.05335. Retrieved from https:\/\/arxiv.org\/abs\/2304.05335. https:\/\/api.semanticscholar.org\/CorpusID:258060002","DOI":"10.18653\/v1\/2023.findings-emnlp.88"},{"key":"e_1_3_3_39_2","doi-asserted-by":"crossref","unstructured":"Ameet Deshpande Tanmay Rajpurohit Karthik Narasimhan and Ashwin Kalyan. 2023. Anthropomorphization of AI: Opportunities and risks. arXiv:2305.14784. 
Retrieved from https:\/\/arxiv.org\/abs\/2305.14784","DOI":"10.18653\/v1\/2023.nllp-1.1"},{"key":"e_1_3_3_40_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_3_41_2","unstructured":"Thomas G. Dietterich. 1999. Hierarchical reinforcement learning with the MAXQ value function decomposition. arXiv:cs\/9905014. Retrieved from https:\/\/arxiv.org\/abs\/cs\/9905014. https:\/\/api.semanticscholar.org\/CorpusID:57341"},{"key":"e_1_3_3_42_2","unstructured":"Hanze Dong Wei Xiong Deepanshu Goyal Rui Pan Shizhe Diao Jipeng Zhang Kashun Shum and T. Zhang. 2023. RAFT: Reward rAnked FineTuning for generative foundation model alignment. arXiv:2304.06767. Retrieved from https:\/\/arxiv.org\/abs\/2304.06767"},{"key":"e_1_3_3_43_2","unstructured":"Nan Du Yanping Huang Andrew M. Dai Simon Tong Dmitry Lepikhin Yuanzhong Xu Maxim Krikun Yanqi Zhou Adams Wei Yu Orhan Firat et\u00a0al. 2021. GLaM: Efficient scaling of language models with mixture-of-experts. ArXiv abs\/2112.06905 (2021)."},{"key":"e_1_3_3_44_2","unstructured":"Esin Durmus Karina Nguyen Thomas Liao Nicholas Schiefer Amanda Askell Anton Bakhtin Carol Chen Zac Hatfield-Dodds Danny Hernandez Nicholas Joseph et al. 2023. Towards measuring the representation of subjective global opinions in language models. arXiv:2306.16388. Retrieved from https:\/\/arxiv.org\/abs\/2306.16388. https:\/\/api.semanticscholar.org\/CorpusID:259275051"},{"key":"e_1_3_3_45_2","unstructured":"Kawin Ethayarajh Winnie Xu Niklas Muennighoff Dan Jurafsky and Douwe Kiela. 2024. KTO: Model alignment as prospect theoretic optimization. arxiv:2402.01306 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2402.01306"},{"key":"e_1_3_3_46_2","unstructured":"Xiang Fan Yiwei Lyu Paul Pu Liang Ruslan Salakhutdinov and Louis-Philippe Morency. 2022. 
Nano: Nested human-in-the-loop reward learning for few-shot language model control. arXiv:2211.05750. Retrieved from https:\/\/arxiv.org\/abs\/2211.05750"},{"key":"e_1_3_3_47_2","doi-asserted-by":"crossref","unstructured":"Patrick Fernandes Ant\u00f3nio Farinhas Ricardo Rei Jos\u00e9 G. C. de Souza Perez Ogayo Graham Neubig and Andr\u00e9 F. T. Martins. 2022. Quality-aware decoding for neural machine translation. arXiv:2205.00978. Retrieved from https:\/\/arxiv.org\/abs\/2205.00978","DOI":"10.18653\/v1\/2022.naacl-main.100"},{"key":"e_1_3_3_48_2","unstructured":"Patrick Fernandes Aman Madaan Emmy Liu Ant\u00f3nio Farinhas Pedro Henrique Martins Amanda Bertsch Jos\u00e9 G. C. de Souza Shuyan Zhou Tongshuang Sherry Wu Graham Neubig et al. 2023. Bridging the gap: A survey on integrating (human) feedback for natural language generation. arXiv:2305.00955. Retrieved from https:\/\/arxiv.org\/abs\/2305.00955. https:\/\/api.semanticscholar.org\/CorpusID:258426970"},{"key":"e_1_3_3_49_2","doi-asserted-by":"crossref","unstructured":"Emilio Ferrara. 2023. Should ChatGPT be biased? Challenges and risks of bias in large language models. arXiv:2304.03738. Retrieved from https:\/\/arxiv.org\/abs\/2304.03738. https:\/\/api.semanticscholar.org\/CorpusID:258041203","DOI":"10.2139\/ssrn.4627814"},{"key":"e_1_3_3_50_2","volume-title":"Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics","author":"Fried Daniel","year":"2017","unstructured":"Daniel Fried, Jacob Andreas, and Dan Klein. 2017. Unified pragmatic models for generating and following instructions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:21015570"},{"key":"e_1_3_3_51_2","unstructured":"Deep Ganguli Liane Lovitt John Kernion Amanda Askell Yuntao Bai Saurav Kadavath Benjamin Mann Ethan Perez Nicholas Schiefer Kamal Ndousse et\u00a0al. 2022. 
Red teaming language models to reduce harms: Methods scaling behaviors and lessons learned. ArXiv abs\/2209.07858 (2022). https:\/\/api.semanticscholar.org\/CorpusID:252355458"},{"key":"e_1_3_3_52_2","doi-asserted-by":"crossref","unstructured":"Ge Gao Hung-Ting Chen Yoav Artzi and Eunsol Choi. 2023. Continually improving extractive QA via human feedback. arXiv:2305.12473. Retrieved from https:\/\/arxiv.org\/abs\/2305.12473","DOI":"10.18653\/v1\/2023.emnlp-main.27"},{"key":"e_1_3_3_53_2","unstructured":"Leo Gao John Schulman and Jacob Hilton. 2022. Scaling laws for reward model overoptimization. arXiv:2210.10760. Retrieved from https:\/\/arxiv.org\/abs\/2210.10760"},{"key":"e_1_3_3_54_2","unstructured":"Amelia Glaese Nathan McAleese Maja Tr\u0119bacz John Aslanides Vlad Firoiu Timo Ewalds Maribeth Rauh Laura Weidinger Martin Chadwick Phoebe Thacker et\u00a0al. 2022. Improving alignment of dialogue agents via targeted human judgements. ArXiv abs\/2209.14375 (2022)."},{"key":"e_1_3_3_55_2","volume-title":"Proceedings of the LAW@ACL","author":"Graham Yvette","year":"2013","unstructured":"Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the LAW@ACL."},{"key":"e_1_3_3_56_2","unstructured":"Aaron Grattafiori Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Alex Vaughan et\u00a0al. 2024. The Llama 3 herd of models. arXiv:2407.21783 [cs.AI]. https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_3_57_2","unstructured":"Marek Grze\u015b. 2017. Reward shaping in episodic reinforcement learning. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems (S\u00e3o Paulo Brazil) (AAMAS\u201917). 
International Foundation for Autonomous Agents and Multiagent Systems Richland SC 565\u2013573."},{"key":"e_1_3_3_58_2","unstructured":"Arnav Gudibande Eric Wallace Charlie Snell Xinyang Geng Hao Liu Pieter Abbeel Sergey Levine and Dawn Song. 2023. The false promise of imitating proprietary LLMs. arxiv:2305.15717 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2305.15717"},{"key":"e_1_3_3_59_2","unstructured":"Shashank Gupta Vaishnavi Shrivastava Ameet Deshpande Ashwin Kalyan Peter Clark Ashish Sabharwal and Tushar Khot. 2023. Bias runs deep: Implicit reasoning biases in persona-assigned LLMs. arXiv:2311.04892. Retrieved from https:\/\/arxiv.org\/abs\/2311.04892."},{"key":"e_1_3_3_60_2","unstructured":"Dylan Hadfield-Menell Smitha Milli P. Abbeel Stuart J. Russell and Anca D. Dragan. 2017. Inverse reward design. arXiv:1711.02827. Retrieved from https:\/\/arxiv.org\/abs\/1711.02827. https:\/\/api.semanticscholar.org\/CorpusID:3805733"},{"key":"e_1_3_3_61_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-025-92889-7"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1358"},{"key":"e_1_3_3_63_2","unstructured":"Austin W. Hanjie A. Deshpande and Karthik Narasimhan. 2022. SemSup: Semantic supervision for simple and scalable zero-shot generalization. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:255595954"},{"key":"e_1_3_3_64_2","doi-asserted-by":"crossref","unstructured":"Thomas Hartvigsen Saadia Gabriel Hamid Palangi Maarten Sap Dipankar Ray and Ece Kamar. 2022. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arxiv:2203.09509 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2203.09509","DOI":"10.18653\/v1\/2022.acl-long.234"},{"key":"e_1_3_3_65_2","unstructured":"Alex Havrilla Yuqing Du Sharath Chandra Raparthy Christoforos Nalmpantis Jane Dwivedi-Yu Maksym Zhuravinskyi Eric Hambro Sainbayar Sukhbaatar and Roberta Raileanu. 2024. 
Teaching large language models to reason with reinforcement learning. arxiv:2403.04642 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2403.04642"},{"key":"e_1_3_3_66_2","unstructured":"Danny Hernandez Tom B. Brown Tom Conerly Nova DasSarma Dawn Drain Sheer El-Showk Nelson Elhage Zac Hatfield-Dodds T. J. Henighan Tristan Hume et\u00a0al. 2022. Scaling laws and interpretability of learning from repeated data. ArXiv abs\/2205.10487 (2022)."},{"key":"e_1_3_3_67_2","unstructured":"Geoffrey E. Hinton Oriol Vinyals and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. arXiv:1503.02531. Retrieved from https:\/\/arxiv.org\/abs\/1503.02531"},{"key":"e_1_3_3_68_2","unstructured":"Jordan Hoffmann Sebastian Borgeaud Arthur Mensch Elena Buchatskaya Trevor Cai Eliza Rutherford Diego de Las Casas Lisa Anne Hendricks Johannes Welbl Aidan Clark et\u00a0al. 2022. Training compute-optimal large language models. ArXiv abs\/2203.15556 (2022)."},{"key":"e_1_3_3_69_2","unstructured":"Borja Ibarz Jan Leike Tobias Pohlen Geoffrey Irving Shane Legg and Dario Amodei. 2018. Reward learning from human preferences and demonstrations in Atari. arXiv:1811.06521. Retrieved from https:\/\/arxiv.org\/abs\/1811.06521. https:\/\/api.semanticscholar.org\/CorpusID:53424488"},{"key":"e_1_3_3_70_2","unstructured":"Hamish Ivison Yizhong Wang Valentina Pyatkin Nathan Lambert Matthew Peters Pradeep Dasigi Joel Jang David Wadden Noah A. Smith Iz Beltagy et al. 2023. Camels in a changing climate: Enhancing LM adaptation with Tulu 2. arxiv:2311.10702 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2311.10702"},{"key":"e_1_3_3_71_2","doi-asserted-by":"publisher","unstructured":"Ashesh Jain Shikhar Sharma Thorsten Joachims and Ashutosh Saxena. 2015. Learning preferences for manipulation tasks from online coactive feedback. The International Journal of Robotics Research 34 10 (2015) 1296\u20131313. 
10.1177\/0278364915581193","DOI":"10.1177\/0278364915581193"},{"key":"e_1_3_3_72_2","unstructured":"Natasha Jaques Asma Ghandeharioun Judy Hanwen Shen Craig Ferguson \u00c0gata Lapedriza Noah J. Jones Shixiang Shane Gu and Rosalind W. Picard. 2019. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv:1907.00456. Retrieved from https:\/\/arxiv.org\/abs\/1907.00456"},{"key":"e_1_3_3_73_2","doi-asserted-by":"publisher","DOI":"10.5555\/1622737.1622748"},{"key":"e_1_3_3_74_2","unstructured":"Adam Tauman Kalai and Santosh S Vempala. 2023. Calibrated language models must hallucinate. arXiv:2311.14648. Retrieved from https:\/\/arxiv.org\/abs\/2311.14648"},{"key":"e_1_3_3_75_2","unstructured":"Jared Kaplan Sam McCandlish T. J. Henighan Tom B. Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeff Wu and Dario Amodei. 2020. Scaling laws for neural language models. arXiv:2001.08361. Retrieved from https:\/\/arxiv.org\/abs\/2001.08361"},{"key":"e_1_3_3_76_2","unstructured":"Timo Kaufmann Paul Weng Viktor Bengs and Eyke H\u00fcllermeier. 2023. A survey of reinforcement learning from human feedback. arXiv:2312.14925. Retrieved from https:\/\/arxiv.org\/abs\/2312.14925"},{"key":"e_1_3_3_77_2","first-page":"2469","article-title":"Deep reinforcement learning for sequence-to-sequence models","volume":"31","author":"Keneshloo Yaser","year":"2018","unstructured":"Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K. Reddy. 2018. Deep reinforcement learning for sequence-to-sequence models. IEEE Transactions on Neural Networks and Learning Systems 31 (2018), 2469\u20132489.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_3_3_78_2","unstructured":"Urvashi Khandelwal Kevin Clark Dan Jurafsky and Lukasz Kaiser. 2019. Sample efficient text summarization using a single pre-trained transformer. arXiv:1905.08836. 
Retrieved from https:\/\/arxiv.org\/abs\/1905.08836"},{"key":"e_1_3_3_79_2","doi-asserted-by":"crossref","unstructured":"Samuel Kiegeland and Julia Kreutzer. 2021. Revisiting the weaknesses of reinforcement learning for neural machine translation. arXiv:2106.08942. Retrieved from https:\/\/arxiv.org\/abs\/2106.08942","DOI":"10.18653\/v1\/2021.naacl-main.133"},{"key":"e_1_3_3_80_2","unstructured":"Sungdong Kim Sanghwan Bae Jamin Shin Soyoung Kang Donghyun Kwak Kang Min Yoo and Minjoon Seo. 2023. Aligning large language models through synthetic feedback. arXiv:2305.13735. Retrieved from https:\/\/arxiv.org\/abs\/2305.13735"},{"key":"e_1_3_3_81_2","unstructured":"Robert Kirk Ishita Mediratta Christoforos Nalmpantis Jelena Luketina Eric Hambro Edward Grefenstette and Roberta Raileanu. 2023. Understanding the effects of RLHF on LLM generalisation and diversity. arxiv:2310.06452 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2310.06452"},{"key":"e_1_3_3_82_2","unstructured":"W. Bradley Knox Stephane Hatgis-Kessell Serena Booth Scott Niekum Peter Stone and Alessandro Allievi. 2022. Models of human preference for learning reward functions. arXiv:2206.02231. Retrieved from https:\/\/arxiv.org\/abs\/2206.02231"},{"key":"e_1_3_3_83_2","doi-asserted-by":"publisher","DOI":"10.1109\/DEVLRN.2008.4640845"},{"key":"e_1_3_3_84_2","doi-asserted-by":"crossref","unstructured":"Tomasz Korbak Ethan Perez and Christopher L. Buckley. 2022. RL with KL penalties is better viewed as Bayesian inference. arXiv:2205.11275. Retrieved from https:\/\/arxiv.org\/abs\/2205.11275","DOI":"10.18653\/v1\/2022.findings-emnlp.77"},{"key":"e_1_3_3_85_2","unstructured":"Tomasz Korbak Kejian Shi Angelica Chen Rasika Bhalerao Christopher L. Buckley Jason Phang Sam Bowman and Ethan Perez. 2023. Pretraining language models with human preferences. arXiv:2302.08582. 
Retrieved from https:\/\/arxiv.org\/abs\/2302.08582"},{"key":"e_1_3_3_86_2","unstructured":"Julia Kreutzer Shahram Khadivi Evgeny Matusov and Stefan Riezler. 2018. Can neural machine translation be improved with user feedback? arXiv:1804.05958. Retrieved from https:\/\/arxiv.org\/abs\/1804.05958"},{"key":"e_1_3_3_87_2","doi-asserted-by":"crossref","unstructured":"Julia Kreutzer Joshua Uyheng and Stefan Riezler. 2018. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. arXiv:1805.10627. Retrieved from https:\/\/arxiv.org\/abs\/1805.10627","DOI":"10.18653\/v1\/P18-1165"},{"key":"e_1_3_3_88_2","unstructured":"Sandipan Kundu Yuntao Bai Saurav Kadavath Amanda Askell Andrew Callahan Anna Chen Anna Goldie Avital Balwit Azalia Mirhoseini Brayden McLean et\u00a0al. 2023. Specific versus general principles for constitutional AI. arxiv:2310.13798 [cs.CL]."},{"key":"e_1_3_3_89_2","doi-asserted-by":"crossref","unstructured":"Fran\u00e7ois Lagunas Ella Charlaix Victor Sanh and Alexander M. Rush. 2021. Block pruning for faster transformers. arXiv:2109.04838. Retrieved from https:\/\/arxiv.org\/abs\/2109.04838","DOI":"10.18653\/v1\/2021.emnlp-main.829"},{"key":"e_1_3_3_90_2","unstructured":"Nathan Lambert Jacob Morrison Valentina Pyatkin Shengyi Huang Hamish Ivison Faeze Brahman Lester James V. Miranda Alisa Liu Nouha Dziri Shane Lyu et al. 2025. Tulu 3: Pushing frontiers in open language model post-training. arxiv:2411.15124 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2411.15124"},{"key":"e_1_3_3_91_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1169"},{"key":"e_1_3_3_92_2","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics","author":"Lee Katherine","year":"2021","unstructured":"Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. Deduplicating training data makes language models better. 
In Proceedings of the Annual Meeting of the Association for Computational Linguistics."},{"key":"e_1_3_3_93_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1259"},{"key":"e_1_3_3_94_2","unstructured":"Jiwei Li Alexander H. Miller Sumit Chopra Marc\u2019Aurelio Ranzato and Jason Weston. 2016. Dialogue learning with human-in-the-loop. arXiv:1611.09823. Retrieved from https:\/\/arxiv.org\/abs\/1611.09823"},{"key":"e_1_3_3_95_2","unstructured":"Shengzhi Li Rongyu Lin and Shichao Pei. 2024. Multi-modal preference alignment remedies degradation of visual instruction tuning on language models. arXiv:2402.10884. Retrieved from https:\/\/arxiv.org\/abs\/2402.10884"},{"key":"e_1_3_3_96_2","unstructured":"Zichao Li Xin Jiang Lifeng Shang and Hang Li. 2017. Paraphrase generation with deep reinforcement learning. arXiv:1711.00279. Retrieved from https:\/\/arxiv.org\/abs\/1711.00279"},{"key":"e_1_3_3_97_2","unstructured":"Zichao Li Prakhar Sharma Xing Han Lu Jackie Chi Kit Cheung and Siva Reddy. 2022. Using interactive feedback to improve the accuracy and explainability of question answering systems post-deployment. arXiv:2204.03025. Retrieved from https:\/\/arxiv.org\/abs\/2204.03025. https:\/\/api.semanticscholar.org\/CorpusID:248006299"},{"key":"e_1_3_3_98_2","unstructured":"Hunter Lightman Vineet Kosaraju Yura Burda Harrison Edwards Bowen Baker Teddy Lee Jan Leike John Schulman Ilya Sutskever and Karl Cobbe. 2023. Let\u2019s verify step by step. arXiv:2305.20050. Retrieved from https:\/\/arxiv.org\/abs\/2305.20050"},{"key":"e_1_3_3_99_2","unstructured":"Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology 22 140 (1932) 55\u201355."},{"key":"e_1_3_3_100_2","doi-asserted-by":"publisher","DOI":"10.3115\/1218955.1219032"},{"key":"e_1_3_3_101_2","unstructured":"Stephanie Lin Jacob Hilton and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. arxiv:2109.07958 [cs.CL]. 
Retrieved from https:\/\/arxiv.org\/abs\/2109.07958"},{"key":"e_1_3_3_102_2","unstructured":"Yong Lin Hangyu Lin Wei Xiong Shizhe Diao Jianmeng Liu Jipeng Zhang Rui Pan Haoxiang Wang Wenbin Hu Hanning Zhang et al. 2024. Mitigating the alignment tax of RLHF. arxiv:2309.06256 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2309.06256"},{"key":"e_1_3_3_103_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1187"},{"key":"e_1_3_3_104_2","unstructured":"Chia-Wei Liu Ryan Lowe Iulian Serban Michael Noseworthy Laurent Charlin and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv:1603.08023. Retrieved from https:\/\/arxiv.org\/abs\/1603.08023"},{"key":"e_1_3_3_105_2","unstructured":"Hao Liu Carmelo Sferrazza and P. Abbeel. 2023. Chain of hindsight aligns language models with feedback. arXiv:2302.02676. Retrieved from https:\/\/arxiv.org\/abs\/2302.02676"},{"key":"e_1_3_3_106_2","unstructured":"Ruibo Liu Chenyan Jia Ge Zhang Ziyu Zhuang Tony X. Liu and Soroush Vosoughi. 2023. Second thoughts are best: Learning to re-align with human values from text edits. arXiv:2301.00355. Retrieved from https:\/\/arxiv.org\/abs\/2301.00355"},{"key":"e_1_3_3_107_2","unstructured":"Tianqi Liu Yao Zhao Rishabh Joshi Misha Khalman Mohammad Saleh Peter J. Liu and Jialu Liu. 2023. Statistical rejection sampling improves preference optimization. arxiv:2309.06657 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2309.06657"},{"key":"e_1_3_3_108_2","doi-asserted-by":"publisher","unstructured":"Tie-Yan Liu. 2009. Learning to rank for information retrieval. Found. Trends Inf. Retr. 3 3 (March 2009) 225\u2013331. 10.1561\/1500000016","DOI":"10.1561\/1500000016"},{"key":"e_1_3_3_109_2","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. 
arxiv:1907.11692 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_3_3_110_2","unstructured":"R. Duncan Luce. 1979. Individual choice behavior: A theoretical analysis."},{"key":"e_1_3_3_111_2","unstructured":"Aman Madaan Niket Tandon Prakhar Gupta Skyler Hallinan Luyu Gao Sarah Wiegreffe Uri Alon Nouha Dziri Shrimai Prabhumoye Yiming Yang et al. 2023. Self-Refine: Iterative refinement with self-feedback. arXiv:2303.17651. Retrieved from https:\/\/arxiv.org\/abs\/2303.17651"},{"key":"e_1_3_3_112_2","first-page":"3","article-title":"The theory of algorithms","volume":"42","author":"Markov Andrei Andreevich","year":"1954","unstructured":"Andrei Andreevich Markov. 1954. The theory of algorithms. Trudy Matematicheskogo Instituta Imeni VA Steklova 42 (1954), 3\u2013375.","journal-title":"Trudy Matematicheskogo Instituta Imeni VA Steklova"},{"key":"e_1_3_3_113_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Matuszek Cynthia","year":"2012","unstructured":"Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In Proceedings of the International Conference on Machine Learning. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:2408319"},{"key":"e_1_3_3_114_2","doi-asserted-by":"crossref","unstructured":"Joshua Maynez Shashi Narayan Bernd Bohnet and Ryan T. McDonald. 2020. On faithfulness and factuality in abstractive summarization. arXiv:2005.00661. Retrieved from https:\/\/arxiv.org\/abs\/2005.00661","DOI":"10.18653\/v1\/2020.acl-main.173"},{"key":"e_1_3_3_115_2","doi-asserted-by":"crossref","unstructured":"Daniel McFadden. 1981. Econometric models of probabilistic choice.","DOI":"10.1086\/296093"},{"key":"e_1_3_3_116_2","doi-asserted-by":"crossref","unstructured":"Nick McKenna Tianyi Li Liang Cheng Mohammad Javad Hosseini Mark Johnson and Mark Steedman. 2023. 
Sources of hallucination by large language models on inference tasks. arXiv:2305.14552. Retrieved from https:\/\/arxiv.org\/abs\/2305.14552. https:\/\/api.semanticscholar.org\/CorpusID:258865517","DOI":"10.18653\/v1\/2023.findings-emnlp.182"},{"key":"e_1_3_3_117_2","unstructured":"Yu Meng Mengzhou Xia and Danqi Chen. 2024. SimPO: Simple preference optimization with a reference-free reward. arxiv:2405.14734 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2405.14734"},{"key":"e_1_3_3_118_2","unstructured":"Jacob Menick Maja Trebacz Vladimir Mikulik John Aslanides Francis Song Martin Chadwick Mia Glaese Susannah Young Lucy Campbell-Gillingham Geoffrey Irving and Nathan McAleese. 2022. Teaching language models to support answers with verified quotes. ArXiv abs\/2203.11147 (2022)."},{"key":"e_1_3_3_119_2","unstructured":"Yuchun Miao Sen Zhang Liang Ding Rong Bao Lefei Zhang and Dacheng Tao. 2024. InfoRM: Mitigating reward hacking in RLHF via information-theoretic reward modeling. arxiv:2402.09345 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2402.09345"},{"key":"e_1_3_3_120_2","first-page":"1928","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Mnih Volodymyr","year":"2016","unstructured":"Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning. PMLR, 1928\u20131937."},{"key":"e_1_3_3_121_2","unstructured":"R\u00e9mi Munos Michal Valko Daniele Calandriello Mohammad Gheshlaghi Azar Mark Rowland Zhaohan Daniel Guo Yunhao Tang Matthieu Geist Thomas Mesnard Andrea Michi et al. 2023. Nash learning from human feedback. arXiv:2312.00886. Retrieved from https:\/\/arxiv.org\/abs\/2312.00886. https:\/\/api.semanticscholar.org\/CorpusID:265609682"},{"key":"e_1_3_3_122_2","unstructured":"Vishvak Murahari Ameet Deshpande Carlos E. 
Jimenez Izhak Shafran Mingqiu Wang Yuan Cao and Karthik Narasimhan. 2023. MUX-PLMs: Pre-training language models with data multiplexing. arXiv:2302.12441. Retrieved from https:\/\/arxiv.org\/abs\/2302.12441"},{"key":"e_1_3_3_123_2","volume-title":"Proceedings of the 36th Conference on Neural Information Processing Systems","author":"Murahari Vishvak","year":"2022","unstructured":"Vishvak Murahari, Carlos E. Jimenez, Runzhe Yang, and Karthik R. Narasimhan. 2022. DataMUX: Data multiplexing for neural networks. In Proceedings of the 36th Conference on Neural Information Processing Systems. Retrieved from https:\/\/openreview.net\/forum?id=UdgtTVTdswg"},{"key":"e_1_3_3_124_2","unstructured":"Reiichiro Nakano Jacob Hilton S. Arun Balaji Jeff Wu Long Ouyang Christina Kim Christopher Hesse Shantanu Jain Vineet Kosaraju William Saunders et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09332. Retrieved from https:\/\/arxiv.org\/abs\/2112.09332"},{"key":"e_1_3_3_125_2","unstructured":"Preetum Nakkiran Gal Kaplun Yamini Bansal Tristan Yang Boaz Barak and Ilya Sutskever. 2019. Deep double descent: where bigger models and more data hurt. arXiv:1912.02292 [cs, stat]. (2019)."},{"key":"e_1_3_3_126_2","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Ng A.","year":"1999","unstructured":"A. Ng, Daishi Harada, and Stuart J. Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:5730166"},{"key":"e_1_3_3_127_2","unstructured":"Richard Ngo Lawrence Chan and S\u00f6ren Mindermann. 2022. The alignment problem from a deep learning perspective. arXiv:2209.00626. 
Retrieved from https:\/\/arxiv.org\/abs\/2209.00626"},{"key":"e_1_3_3_128_2","doi-asserted-by":"crossref","unstructured":"Duy-Hung Nguyen Nguyen-Viet-Dung Nghiem Bao-Sinh Nguyen Dung Tien Le Shahab Sabahi Minh Le Nguyen and Hung Le. 2022. Make The most of prior data: A solution for interactive text summarization with preference feedback. In NAACL-HLT.","DOI":"10.18653\/v1\/2022.findings-naacl.147"},{"key":"e_1_3_3_129_2","doi-asserted-by":"crossref","unstructured":"Khanh Nguyen Hal Daum\u00e9 and Jordan L. Boyd-Graber. 2017. Reinforcement learning for bandit neural machine translation with simulated human feedback. ArXiv abs\/1707.07402 (2017).","DOI":"10.18653\/v1\/D17-1153"},{"key":"e_1_3_3_130_2","unstructured":"Khanh Nguyen Dipendra Misra Robert Schapire Miro Dud\u00edk and Patrick Shafto. 2021. Interactive learning from activity description. In ICML."},{"key":"e_1_3_3_131_2","unstructured":"Michael Noukhovitch Samuel Lavoie Florian Strub and Aaron Courville. 2023. Language model alignment with elastic reset. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans LA USA) (NIPS\u201923). Curran Associates Inc. Red Hook NY USA Article 152 23 pages."},{"key":"e_1_3_3_132_2","doi-asserted-by":"publisher","unstructured":"Marcus O\u2019Connor. 1989. Models of human behaviour and confidence in judgement: A review. International Journal of Forecasting 5 2 (1989) 159\u2013169. 10.1016\/0169-2070(89)90083-6","DOI":"10.1016\/0169-2070(89)90083-6"},{"key":"e_1_3_3_133_2","unstructured":"OpenAI. 2022. ChatGPT. https:\/\/openai.com\/blog\/chatgpt. (2022)."},{"key":"e_1_3_3_134_2","unstructured":"OpenAI. 2023. GPT-4 Technical Report. ArXiv abs\/2303.08774 (2023)."},{"key":"e_1_3_3_135_2","unstructured":"OpenAI Aaron Jaech Adam Kalai Adam Lerer Adam Richardson Ahmed El-Kishky Aiden Low Alec Helyar Aleksander Madry Alex Beutel et\u00a0al. 2024. OpenAI o1 System Card. arXiv:2412.16720 [cs.AI]. 
https:\/\/arxiv.org\/abs\/2412.16720"},{"key":"e_1_3_3_136_2","doi-asserted-by":"publisher","unstructured":"Pierre-Yves Oudeyer Fr\u00e9d\u00e9ric Kaplan and Verena V. Hafner. 2007. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation 11 2 (2007) 265\u2013286. 10.1109\/TEVC.2006.890271","DOI":"10.1109\/TEVC.2006.890271"},{"key":"e_1_3_3_137_2","unstructured":"Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright Pamela Mishkin Chong Zhang Sandhini Agarwal Katarina Slama Alex Ray John Schulman Jacob Hilton Fraser Kelton Luke E. Miller Maddie Simens Amanda Askell Peter Welinder Paul Francis Christiano Jan Leike and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv abs\/2203.02155 (2022)."},{"key":"e_1_3_3_138_2","unstructured":"Alexander Pan Kush Bhatia and Jacob Steinhardt. 2022. The effects of reward misspecification: Mapping and mitigating misaligned models. arxiv:2201.03544 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2201.03544"},{"key":"e_1_3_3_139_2","unstructured":"Richard Yuanzhe Pang Vishakh Padmakumar Thibault Sellam Ankur P. Parikh and He He. 2022. Reward gaming in conditional text generation. arXiv:2211.08714. Retrieved from https:\/\/arxiv.org\/abs\/2211.08714"},{"key":"e_1_3_3_140_2","volume-title":"Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics","author":"Parmar Mihir","year":"2022","unstructured":"Mihir Parmar, Swaroop Mishra, Mor Geva, and Chitta Baral. 2022. Don\u2019t blame the annotator: Bias already starts in the annotation instructions. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics."},{"key":"e_1_3_3_141_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2017.70"},{"key":"e_1_3_3_142_2","unstructured":"Andi Peng Besmira Nushi Emre Kiciman Kori Inkpen and Ece Kamar. 2022. 
Investigations of performance and bias in human-AI teamwork in hiring. arxiv:2202.11812 [cs.HC]. Retrieved from https:\/\/arxiv.org\/abs\/2202.11812"},{"key":"e_1_3_3_143_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.emnlp-main.225"},{"key":"e_1_3_3_144_2","volume-title":"Proceedings of the ICML 2023 Workshop The Many Facets of Preference-Based Learning","author":"Pitis Silviu","year":"2023","unstructured":"Silviu Pitis. 2023. Failure modes of learning reward models for LLMs and other sequence models. In Proceedings of the ICML 2023 Workshop The Many Facets of Preference-Based Learning. Retrieved from https:\/\/openreview.net\/forum?id=NjOoxFRZA4"},{"key":"e_1_3_3_145_2","doi-asserted-by":"publisher","unstructured":"R. L. Plackett. 1975. The analysis of permutations. Journal of the Royal Statistical Society Series C 24 2 (June 1975) 193\u2013202. 10.2307\/2346567","DOI":"10.2307\/2346567"},{"key":"e_1_3_3_146_2","unstructured":"Dean A. Pomerleau. 1988. Alvinn: An autonomous land vehicle in a neural network. Advances in Neural Information Processing Systems 1 (1988) 305\u2013313."},{"key":"e_1_3_3_147_2","unstructured":"Qwen An Yang Baosong Yang Beichen Zhang Binyuan Hui Bo Zheng Bowen Yu Chengyuan Li Dayiheng Liu Fei Huang et\u00a0al. 2025. Qwen2.5 technical report. arXiv:2412.15115 [cs.CL]. https:\/\/arxiv.org\/abs\/2412.15115"},{"key":"e_1_3_3_148_2","unstructured":"Alec Radford and Karthik Narasimhan. 2018. Improving language understanding by generative pre-training."},{"key":"e_1_3_3_149_2","unstructured":"Jack W. Rae Sebastian Borgeaud Trevor Cai Katie Millican Jordan Hoffmann Francis Song John Aslanides Sarah Henderson Roman Ring Susannah Young et\u00a0al. 2021. Scaling language models: Methods analysis & insights from training gopher. ArXiv abs\/2112.11446 (2021)."},{"key":"e_1_3_3_150_2","unstructured":"Rafael Rafailov Archit Sharma Eric Mitchell Stefano Ermon Christopher D. Manning and Chelsea Finn. 2023. 
Direct preference optimization: Your language model is secretly a reward model. arXiv:2305.18290. Retrieved from https:\/\/arxiv.org\/abs\/2305.18290"},{"key":"e_1_3_3_151_2","unstructured":"Colin Raffel Noam M. Shazeer Adam Roberts Katherine Lee Sharan Narang Michael Matena Yanqi Zhou Wei Li and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683. Retrieved from https:\/\/arxiv.org\/abs\/1910.10683"},{"key":"e_1_3_3_152_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Raichuk Anton","year":"2021","unstructured":"Anton Raichuk, Piotr Stanczyk, Manu Orsini, Sertan Girgin, Rapha\u00ebl Marinier, L\u00e9onard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, and Sylvain Gelly. 2021. What matters for on-policy deep actor-critic methods? A large-scale study. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:233340556"},{"key":"e_1_3_3_153_2","first-page":"2586","volume-title":"Proceedings of the IJCAI","volume":"7","author":"Ramachandran Deepak","year":"2007","unstructured":"Deepak Ramachandran and Eyal Amir. 2007. Bayesian inverse reinforcement learning. In Proceedings of the IJCAI, Vol. 7. 2586\u20132591."},{"key":"e_1_3_3_154_2","unstructured":"Rajkumar Ramamurthy Prithviraj Ammanabrolu Kiant\u00e9 Brantley Jack Hessel Rafet Sifa Christian Bauckhage Hannaneh Hajishirzi and Yejin Choi. 2022. Is reinforcement learning (not) for natural language processing?: Benchmarks baselines and building blocks for natural language policy optimization. arXiv:2210.01241. Retrieved from https:\/\/arxiv.org\/abs\/2210.01241"},{"key":"e_1_3_3_155_2","unstructured":"Alexandre Ram\u00e9 Guillaume Couairon Mustafa Shukor Corentin Dancette Jean-Baptiste Gaya Laure Soulier and Matthieu Cord. 2023. 
Rewarded soups: Towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards."},{"key":"e_1_3_3_156_2","unstructured":"Marc\u2019Aurelio Ranzato Sumit Chopra Michael Auli and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv:1511.06732. Retrieved from https:\/\/arxiv.org\/abs\/1511.06732"},{"key":"e_1_3_3_157_2","unstructured":"Desik Rengarajan Gargi Nikhil Vaidya Akshay Sarvesh Dileep M. Kalathil and Srinivas Shakkottai. 2022. Reinforcement learning with sparse rewards using guidance from offline demonstration. arXiv:2202.04628. Retrieved from https:\/\/arxiv.org\/abs\/2202.04628. https:\/\/api.semanticscholar.org\/CorpusID:246679865"},{"key":"e_1_3_3_158_2","unstructured":"Diederik M. Roijers. 2016. Multi-objective decision-theoretic planning. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:124195290"},{"key":"e_1_3_3_159_2","unstructured":"Diederik M. Roijers Peter Vamplew Shimon Whiteson and Richard Dazeley. 2013. A survey of multi-objective sequential decision-making. arXiv:1402.0590. Retrieved from https:\/\/arxiv.org\/abs\/1402.0590. https:\/\/api.semanticscholar.org\/CorpusID:14478191"},{"key":"e_1_3_3_160_2","unstructured":"Michael Santacroce Yadong Lu Han Yu Yuanzhi Li and Yelong Shen. 2023. Efficient RLHF: Reducing the memory usage of PPO. arxiv:2309.00754 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2309.00754"},{"key":"e_1_3_3_161_2","unstructured":"Shibani Santurkar Esin Durmus Faisal Ladhak Cinoo Lee Percy Liang and Tatsunori Hashimoto. 2023. Whose opinions do language models reflect? arxiv:2303.17548 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2303.17548"},{"key":"e_1_3_3_162_2","unstructured":"Rylan Schaeffer Mikail Khona Zachary Robertson Akhilan Boopathy Kateryna Pistunova Jason W. Rocks Ila Rani Fiete and Oluwasanmi Koyejo. 2023. Double descent demystified: Identifying interpreting & ablating the sources of a deep learning puzzle. arXiv:2303.14151. 
Retrieved from https:\/\/arxiv.org\/abs\/2303.14151"},{"key":"e_1_3_3_163_2","unstructured":"J\u00e9r\u00e9my Scheurer Jon Ander Campos Jun Shern Chan Angelica Chen Kyunghyun Cho and Ethan Perez. 2022. Training language models with language feedback."},{"key":"e_1_3_3_164_2","unstructured":"J\u00e9r\u00e9my Scheurer Jon Ander Campos Tomasz Korbak Jun Shern Chan Angelica Chen Kyunghyun Cho and Ethan Perez. 2023. Training language models with language feedback at scale. arXiv:2303.16755. Retrieved from https:\/\/arxiv.org\/abs\/2303.16755"},{"key":"e_1_3_3_165_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/E17-2007"},{"key":"e_1_3_3_166_2","unstructured":"John Schulman Filip Wolski Prafulla Dhariwal Alec Radford and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv:1707.06347. Retrieved from https:\/\/arxiv.org\/abs\/1707.06347"},{"key":"e_1_3_3_167_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.704"},{"key":"e_1_3_3_168_2","unstructured":"Zhihong Shao Peiyi Wang Qihao Zhu Runxin Xu Junxiao Song Xiao Bi Haowei Zhang Mingchuan Zhang Y. K. Li Y. Wu and Daya Guo. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arxiv:2402.03300 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2402.03300"},{"key":"e_1_3_3_169_2","unstructured":"Shiqi Shen Yong Cheng Zhongjun He W. He Hua Wu Maosong Sun and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv:1512.02433. Retrieved from https:\/\/arxiv.org\/abs\/1512.02433"},{"key":"e_1_3_3_170_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/606"},{"key":"e_1_3_3_171_2","unstructured":"Noah Shinn Federico Cassano Beck Labash Ashwin Gopinath Karthik Narasimhan and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. 
Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:258833055"},{"key":"e_1_3_3_172_2","unstructured":"Ilia Shumailov Zakhar Shumaylov Yiren Zhao Yarin Gal Nicolas Papernot and Ross Anderson. 2023. The curse of recursion: Training on generated data makes models forget. arxiv:2305.17493 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2305.17493"},{"key":"e_1_3_3_173_2","doi-asserted-by":"publisher","unstructured":"David Silver Satinder Singh Doina Precup and Richard S. Sutton. 2021. Reward is enough. Artificial Intelligence 299 (2021) 103535. 10.1016\/j.artint.2021.103535","DOI":"10.1016\/j.artint.2021.103535"},{"key":"e_1_3_3_174_2","doi-asserted-by":"publisher","unstructured":"Karan Singhal Shekoofeh Azizi Tao Tu S. Sara Mahdavi Jason Wei Hyung Won Chung Nathan Scales Ajay Tanwani Heather Cole-Lewis Stephen Pfohl et\u00a0al. 2023. Large language models encode clinical knowledge. Nature 620 7972 (1 Aug 2023) 172\u2013180. 10.1038\/s41586-023-06291-2","DOI":"10.1038\/s41586-023-06291-2"},{"key":"e_1_3_3_175_2","unstructured":"Prasann Singhal Tanya Goyal Jiacheng Xu and Greg Durrett. 2023. A long way to go: Investigating length correlations in RLHF. arxiv:2310.03716 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2310.03716"},{"key":"e_1_3_3_176_2","unstructured":"Artem Sokolov Stefan Riezler and Tanguy Urvoy. 2016. Bandit structured prediction for learning from partial feedback in statistical machine translation. arXiv:1601.04468. Retrieved from https:\/\/arxiv.org\/abs\/1601.04468"},{"key":"e_1_3_3_177_2","unstructured":"Feifan Song Bowen Yu Minghao Li Haiyang Yu Fei Huang Yongbin Li and Houfeng Wang. 2023. Preference ranking optimization for human alignment. arxiv:2306.17492 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2306.17492"},{"key":"e_1_3_3_178_2","unstructured":"Nisan Stiennon Long Ouyang Jeff Wu Daniel M. Ziegler Ryan J. Lowe Chelsea Voss Alec Radford Dario Amodei and Paul Christiano. 2020. Learning to summarize from human feedback. 
arXiv:2009.01325. Retrieved from https:\/\/arxiv.org\/abs\/2009.01325"},{"key":"e_1_3_3_179_2","unstructured":"Yushan Su Vishvak Murahari Karthik Narasimhan and Kai Li. 2023. PruMUX: Augmenting data multiplexing with model compression. arXiv:2305.14706. Retrieved from https:\/\/arxiv.org\/abs\/2305.14706"},{"key":"e_1_3_3_180_2","unstructured":"Zhiqing Sun Yikang Shen Qinhong Zhou Hongxin Zhang Zhenfang Chen David D. Cox Yiming Yang and Chuang Gan. 2023. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv:2305.03047. Retrieved from https:\/\/arxiv.org\/abs\/2305.03047"},{"key":"e_1_3_3_181_2","unstructured":"R. S. Sutton. 2004. The reward hypothesis. Blog post 1 February 2024. (2004). http:\/\/incompleteideas.net\/rlai.cs.ualberta.ca\/RLAI\/rewardhypothesis.html"},{"key":"e_1_3_3_182_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2004.842673"},{"key":"e_1_3_3_183_2","volume-title":"Reinforcement Learning: An Introduction","author":"Sutton Richard S.","year":"2018","unstructured":"Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press."},{"key":"e_1_3_3_184_2","unstructured":"Yingshui Tan Yilei Jiang Yanshi Li Jiaheng Liu Xingyuan Bu Wenbo Su Xiangyu Yue Xiaoyong Zhu and Bo Zheng. 2025. Equilibrate RLHF: Towards balancing helpfulness-safety trade-off in large language models. arxiv:2502.11555 [cs.AI]. Retrieved from https:\/\/arxiv.org\/abs\/2502.11555"},{"key":"e_1_3_3_185_2","unstructured":"Hugo Touvron Louis Martin Kevin R. Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models."},{"key":"e_1_3_3_186_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar et al. 2023. LLaMA: Open and efficient foundation language models. 
arXiv:2302.13971. Retrieved from https:\/\/arxiv.org\/abs\/2302.13971"},{"key":"e_1_3_3_187_2","unstructured":"Jonathan Uesato Nate Kushman Ramana Kumar Francis Song Noah Siegel L. Wang Antonia Creswell Geoffrey Irving and Irina Higgins. 2022. Solving math word problems with process- and outcome-based feedback. arXiv:2211.14275. Retrieved from https:\/\/arxiv.org\/abs\/2211.14275"},{"key":"e_1_3_3_188_2","volume-title":"Proceedings of the NIPS","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the NIPS."},{"key":"e_1_3_3_189_2","unstructured":"Yufei Wang Wanjun Zhong Liangyou Li Fei Mi Xingshan Zeng Wenyong Huang Lifeng Shang Xin Jiang and Qun Liu. 2023. Aligning large language models with human: A survey. arxiv:2307.12966 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2307.12966"},{"key":"e_1_3_3_190_2","unstructured":"Alexander Wei Nika Haghtalab and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? arXiv:2307.02483. Retrieved from https:\/\/arxiv.org\/abs\/2307.02483. https:\/\/api.semanticscholar.org\/CorpusID:259342528"},{"key":"e_1_3_3_191_2","unstructured":"Jason Wei Maarten Bosma Vincent Zhao Kelvin Guu Adams Wei Yu Brian Lester Nan Du Andrew M. Dai and Quoc V. Le. 2021. Finetuned language models are zero-shot learners. arXiv:2109.01652. Retrieved from https:\/\/arxiv.org\/abs\/2109.01652"},{"key":"e_1_3_3_192_2","unstructured":"Jiaxin Wen Ruiqi Zhong Akbir Khan Ethan Perez Jacob Steinhardt Minlie Huang Samuel R. Bowman He He and Shi Feng. 2024. Language models learn to mislead humans via RLHF. arxiv:2409.12822 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2409.12822"},{"key":"e_1_3_3_193_2","unstructured":"Lilian Weng. 2018. Policy gradient algorithms. Blog post April 8 2018. lilianweng.github.io (2018). 
https:\/\/lilianweng.github.io\/posts\/2018-04-08-policy-gradient\/"},{"key":"e_1_3_3_194_2","unstructured":"Lilian Weng. 2024. Reward hacking in reinforcement learning. Blog post November 28 2024. Lil\u2019Log (2024). https:\/\/lilianweng.github.io\/posts\/2024-11-28-reward-hacking\/"},{"key":"e_1_3_3_195_2","unstructured":"Jeff Wu Long Ouyang Daniel M. Ziegler Nisan Stiennon Ryan Lowe Jan Leike and Paul Francis Christiano. 2021. Recursively summarizing books with human feedback. arXiv:2109.10862. Retrieved from https:\/\/arxiv.org\/abs\/2109.10862"},{"key":"e_1_3_3_196_2","unstructured":"Shijie Wu Ozan Irsoy Steven Lu Vadim Dabravolski Mark Dredze Sebastian Gehrmann Prabhanjan Kambadur David Rosenberg and Gideon Mann. 2023. BloombergGPT: A large language model for finance. arXiv:2303.17564. Retrieved from https:\/\/arxiv.org\/abs\/2303.17564. https:\/\/api.semanticscholar.org\/CorpusID:257833842"},{"key":"e_1_3_3_197_2","unstructured":"Tianhao Wu Banghua Zhu Ruoyu Zhang Zhaojin Wen Kannan Ramchandran and Jiantao Jiao. 2023. Pairwise proximal policy optimization: Harnessing relative feedback for LLM alignment. arXiv:2310.00212. Retrieved from https:\/\/arxiv.org\/abs\/2310.00212. https:\/\/api.semanticscholar.org\/CorpusID:263334045"},{"key":"e_1_3_3_198_2","unstructured":"Zeqiu Wu Yushi Hu Weijia Shi Nouha Dziri Alane Suhr Prithviraj Ammanabrolu Noah A. Smith Mari Ostendorf and Hanna Hajishirzi. 2023. Fine-grained human feedback gives better rewards for language model training."},{"key":"e_1_3_3_199_2","unstructured":"Zeqiu Wu Yushi Hu Weijia Shi Nouha Dziri Alane Suhr Prithviraj Ammanabrolu Noah A Smith Mari Ostendorf and Hannaneh Hajishirzi. 2023. Fine-grained human feedback gives better rewards for language model training. arXiv:2306.01693. Retrieved from https:\/\/arxiv.org\/abs\/2306.01693"},{"key":"e_1_3_3_200_2","unstructured":"Mengzhou Xia Tianyu Gao Zhiyuan Zeng and Danqi Chen. [n. d.]. 
Sheared LLaMA: Accelerating language model pre-training via structured pruning. ([n. d.])."},{"key":"e_1_3_3_201_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.107"},{"key":"e_1_3_3_202_2","unstructured":"Sang Michael Xie Hieu Pham Xuanyi Dong Nan Du Hanxiao Liu Yifeng Lu Percy Liang Quoc V. Le Tengyu Ma and Adams Wei Yu. 2023. DoReMi: Optimizing data mixtures speeds up language model pretraining."},{"key":"e_1_3_3_203_2","unstructured":"Jing Xu Megan Ung Mojtaba Komeili Kushal Arora Y-Lan Boureau and Jason Weston. 2022. Learning new skills after deployment: Improving open-domain internet-driven dialogue with human feedback. arXiv:2208.03270. Retrieved from https:\/\/arxiv.org\/abs\/2208.03270"},{"key":"e_1_3_3_204_2","doi-asserted-by":"publisher","unstructured":"Lixiang Yan Lele Sha Linxuan Zhao Yuheng Li Roberto Martinez-Maldonado Guanliang Chen Xinyu Li Yueqiao Jin and Dragan Ga\u0161evi\u0107. 2023. Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology 55 1 (Aug. 2023) 90\u2013112. 10.1111\/bjet.13370","DOI":"10.1111\/bjet.13370"},{"key":"e_1_3_3_205_2","unstructured":"An Yang Anfeng Li Baosong Yang Beichen Zhang Binyuan Hui Bo Zheng Bowen Yu Chang Gao Chengen Huang Chenxu Lv et\u00a0al. 2025. Qwen3 technical report. arXiv:2505.09388 [cs.CL]. https:\/\/arxiv.org\/abs\/2505.09388"},{"key":"e_1_3_3_206_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.emnlp-main.296"},{"key":"e_1_3_3_207_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-demo.4"},{"key":"e_1_3_3_208_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24600-5_47"},{"key":"e_1_3_3_209_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-8608"},{"key":"e_1_3_3_210_2","unstructured":"Yichun Yin Cheng Chen Lifeng Shang Xin Jiang Xiao Chen and Qun Liu. 2021. AutoTinyBERT: Automatic hyper-parameter optimization for efficient pre-trained language models. 
5146\u20135157."},{"key":"e_1_3_3_211_2","unstructured":"Zheng Yuan Hongyi Yuan Chuanqi Tan Wei Wang Songfang Huang and Feiran Huang. 2023. RRHF: Rank responses to align language models with human feedback without tears. arXiv:2304.05302. Retrieved from https:\/\/arxiv.org\/abs\/2304.05302"},{"key":"e_1_3_3_212_2","unstructured":"Jiazheng Zhang Wenqing Jing Zizhuo Zhang Zhiheng Xi Shihan Dou Rongxiang Weng Jiahuan Li Jingang Wang Mingxu Chai Shibo Hong et al. 2025. Two minds better than one: Collaborative reward modeling for LLM alignment. arxiv:2505.10597 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/2505.10597"},{"key":"e_1_3_3_213_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi Victoria Lin et\u00a0al. 2022. OPT: Open Pre-trained transformer language models. ArXiv abs\/2205.01068 (2022)."},{"key":"e_1_3_3_214_2","unstructured":"Tianjun Zhang Fangchen Liu Justin Wong P. Abbeel and Joseph Gonzalez. 2023. The wisdom of hindsight makes language models better instruction followers. arXiv:2302.05206. Retrieved from https:\/\/arxiv.org\/abs\/2302.05206"},{"key":"e_1_3_3_215_2","unstructured":"Yao Zhao Rishabh Joshi Tianqi Liu Misha Khalman Mohammad Saleh and Peter J. Liu. 2023. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv:2305.10425. Retrieved from https:\/\/arxiv.org\/abs\/2305.10425"},{"key":"e_1_3_3_216_2","unstructured":"Yao Zhao Misha Khalman Rishabh Joshi Shashi Narayan Mohammad Saleh and Peter J. Liu. 2022. Calibrating sequence likelihood improves conditional language generation. arXiv:2210.00045. Retrieved from https:\/\/arxiv.org\/abs\/2210.00045"},{"key":"e_1_3_3_217_2","unstructured":"Rui Zheng Shihan Dou Songyang Gao Wei Shen Bing Wang Yan Liu Senjie Jin Qin Liu Limao Xiong Luyao Chen et\u00a0al. 2023. 
Secrets of RLHF in large language models Part I: PPO."},{"key":"e_1_3_3_218_2","unstructured":"Chunting Zhou Pengfei Liu Puxin Xu Srini Iyer Jiao Sun Yuning Mao Xuezhe Ma Avia Efrat Ping Yu Lili Yu et al. 2023. LIMA: Less is more for alignment. arxiv:2305.11206 [cs.CL]. Retrieved from https:\/\/arxiv.org\/abs\/2305.11206"},{"key":"e_1_3_3_219_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6521"},{"key":"e_1_3_3_220_2","unstructured":"Banghua Zhu Jiantao Jiao and M.I. Jordan. 2023. Principled reinforcement learning with human feedback from pairwise or K-wise comparisons. arXiv:2301.11270. Retrieved from https:\/\/arxiv.org\/abs\/2301.11270"},{"key":"e_1_3_3_221_2","unstructured":"Banghua Zhu Hiteshi Sharma Felipe Vieira Frujeri Shi Dong Chenguang Zhu M. I. Jordan and Jiantao Jiao. 2023. Fine-tuning language models with advantage-induced policy alignment. arXiv:2306.02231. Retrieved from https:\/\/arxiv.org\/abs\/2306.02231"},{"key":"e_1_3_3_222_2","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Ziebart Brian D.","year":"2008","unstructured":"Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence."},{"key":"e_1_3_3_223_2","unstructured":"Daniel M. Ziegler Nisan Stiennon Jeff Wu Tom B. Brown Alec Radford Dario Amodei Paul Christiano and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv:1909.08593. 
Retrieved from https:\/\/arxiv.org\/abs\/1909.08593"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3743127","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T13:13:16Z","timestamp":1757509996000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3743127"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,10]]},"references-count":222,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3743127"],"URL":"https:\/\/doi.org\/10.1145\/3743127","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,10]]},"assertion":[{"value":"2024-11-10","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}