{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T14:39:01Z","timestamp":1774449541762,"version":"3.50.1"},"reference-count":234,"publisher":"Association for Computing Machinery (ACM)","issue":"6","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62276077, 62406091, U23B2055, 62350710797"],"award-info":[{"award-number":["62276077, 62406091, U23B2055, 62350710797"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100021171","name":"Guangdong Basic and Applied Basic Research Foundation","doi-asserted-by":"crossref","award":["2024A1515011205"],"award-info":[{"award-number":["2024A1515011205"]}],"id":[{"id":"10.13039\/501100021171","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Shenzhen Science and Technology Program","award":["KQTD20240729102154066, ZDSYS20230626091203008"],"award-info":[{"award-number":["KQTD20240729102154066, ZDSYS20230626091203008"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2026,4,30]]},"abstract":"<jats:p>The recent surge in versatile large language models (LLMs) demonstrates remarkable success across a wide range of contexts. A key factor contributing to this success is LLM alignment, in which human preference learning plays a decisive role in steering the models\u2019 capabilities toward fulfilling human objectives. In this survey, we review the progress in human preference learning within a unified framework, aiming to provide a comprehensive perspective on established methodologies while exploring avenues to further advance LLM alignment. Specifically, we categorize human preference feedback based on data sources and formats, summarize techniques for human preference modeling and usage, and present an overview of prevailing evaluation protocols for LLM alignment. 
Finally, we discuss the existing challenges and identify potential directions for future research, with a particular emphasis on generalizability, transferability, and controllability.<\/jats:p>","DOI":"10.1145\/3773279","type":"journal-article","created":{"date-parts":[[2025,11,4]],"date-time":"2025-11-04T11:16:38Z","timestamp":1762254998000},"page":"1-39","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["A Survey on Human Preference Learning for Aligning Large Language Models"],"prefix":"10.1145","volume":"58","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6126-2317","authenticated-orcid":false,"given":"Ruili","family":"Jiang","sequence":"first","affiliation":[{"name":"Harbin Institute of Technology","place":["Harbin, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4346-7618","authenticated-orcid":false,"given":"Kehai","family":"Chen","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology","place":["Harbin, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7044-0683","authenticated-orcid":false,"given":"Xuefeng","family":"Bai","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology (Shenzhen)","place":["Shenzhen, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-8511-4398","authenticated-orcid":false,"given":"Zhixuan","family":"He","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology (Shenzhen)","place":["Shenzhen, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6286-7529","authenticated-orcid":false,"given":"Juntao","family":"Li","sequence":"additional","affiliation":[{"name":"Soochow University","place":["Suzhou, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5940-0266","authenticated-orcid":false,"given":"Muyun","family":"Yang","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology","place":["Harbin, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4659-4935","authenticated-orcid":false,"given":"Tiejun","family":"Zhao","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology","place":["Harbin, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1476-0273","authenticated-orcid":false,"given":"Liqiang","family":"Nie","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology (Shenzhen)","place":["Shenzhen, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3895-5510","authenticated-orcid":false,"given":"Min","family":"Zhang","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology (Shenzhen)","place":["Shenzhen, China"]}]}],"member":"320","published-online":{"date-parts":[[2025,12,6]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.662"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23780-5_11"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33486-3_8"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.427"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3627673.3679832"},{"key":"e_1_3_3_7_2","volume-title":"Concrete problems in AI safety","author":"Amodei Dario","year":"2016","unstructured":"Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man\u00e9. 2016. Concrete problems in AI safety. arxiv:1606.06565. 
Retrieved from https:\/\/arxiv.org\/abs\/1606.06565"},{"key":"e_1_3_3_8_2","volume-title":"PaLM 2 technical report","year":"2023","unstructured":"Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et\u00a0al. 2023. PaLM 2 technical report. arxiv:2305.10403. Retrieved from https:\/\/arxiv.org\/abs\/2305.10403"},{"key":"e_1_3_3_9_2","volume-title":"A general language assistant as a laboratory for alignment","year":"2021","unstructured":"Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et\u00a0al. 2021. A general language assistant as a laboratory for alignment. arxiv:2112.00861. Retrieved from https:\/\/arxiv.org\/abs\/2112.00861"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3589334.3645404"},{"key":"e_1_3_3_11_2","volume-title":"Proceedings of the ICLR","author":"Baheti Ashutosh","year":"2024","unstructured":"Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, and Mark Riedl. 2024. Leftover lunch: Advantage-based offline reinforcement learning for language models. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=ZDGKPbF0VQ"},{"key":"e_1_3_3_12_2","volume-title":"Qwen technical report","year":"2023","unstructured":"Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et\u00a0al. 2023. Qwen technical report. arxiv:2309.16609. Retrieved from https:\/\/arxiv.org\/abs\/2309.16609"},{"key":"e_1_3_3_13_2","volume-title":"Training a helpful and harmless assistant with reinforcement learning from human feedback","year":"2022","unstructured":"Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et\u00a0al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arxiv:2204.05862. Retrieved from https:\/\/arxiv.org\/abs\/2204.05862"},{"key":"e_1_3_3_14_2","volume-title":"Constitutional AI: Harmlessness from AI feedback","year":"2022","unstructured":"Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et\u00a0al. 2022. Constitutional AI: Harmlessness from AI feedback. arxiv:2212.08073. Retrieved from https:\/\/arxiv.org\/abs\/2212.08073"},{"key":"e_1_3_3_15_2","volume-title":"Proceedings of the NeurIPS","year":"2022","unstructured":"Michiel A. Bakker, Martin J. Chadwick, Hannah R. Sheahan, Michael Henry Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matthew M. Botvinick, et\u00a0al. 2022. Fine-tuning language models to find agreement among humans with diverse preferences. In Proceedings of the NeurIPS. Article 2766, 14 pages."},{"key":"e_1_3_3_16_2","first-page":"999","volume-title":"Proceedings of the AAMAS","author":"Barlier Merwan","year":"2018","unstructured":"Merwan Barlier, Romain Laroche, and Olivier Pietquin. 2018. Training dialogue systems with human advice. In Proceedings of the AAMAS. 999\u20131007."},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1307"},{"key":"e_1_3_3_18_2","volume-title":"On the opportunities and risks of foundation models","year":"2022","unstructured":"Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. 
Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et\u00a0al. 2022. On the opportunities and risks of foundation models. arxiv:2108.07258. Retrieved from https:\/\/arxiv.org\/abs\/2108.07258"},{"key":"e_1_3_3_19_2","volume-title":"Measuring progress on scalable oversight for large language models","year":"2022","unstructured":"Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamil\u0117 Luko\u0161i\u016bt\u0117, Amanda Askell, Andy Jones, Anna Chen, et\u00a0al. 2022. Measuring progress on scalable oversight for large language models. arxiv:2211.03540. Retrieved from https:\/\/arxiv.org\/abs\/2211.03540"},{"issue":"3","key":"e_1_3_3_20_2","first-page":"324","article-title":"Rank analysis of incomplete block designs: I. the method of paired comparisons","volume":"39","author":"Bradley Ralph Allan","year":"1952","unstructured":"Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 39, 3\/4 (1952), 324\u2013345.","journal-title":"Biometrika"},{"key":"e_1_3_3_21_2","first-page":"783","volume-title":"Proceedings of the ICML","volume":"97","author":"Brown Daniel","year":"2019","unstructured":"Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. 2019. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In Proceedings of the ICML, Vol. 97. 783\u2013792. Retrieved from https:\/\/proceedings.mlr.press\/v97\/brown19a.html"},{"key":"e_1_3_3_22_2","volume-title":"Proceedings of the NeurIPS","year":"2020","unstructured":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et\u00a0al. 2020. Language models are few-shot learners. In Proceedings of the NeurIPS. Article 159, 25 pages."},{"key":"e_1_3_3_23_2","first-page":"4971","volume-title":"Proceedings of the ICML","volume":"235","year":"2024","unstructured":"Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, et\u00a0al. 2024. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In Proceedings of the ICML, Vol. 235. 4971\u20135012. Retrieved from https:\/\/proceedings.mlr.press\/v235\/burns24b.html"},{"key":"e_1_3_3_24_2","volume-title":"ULMA: Unified Language Model Alignment with human demonstration and point-wise preference","author":"Cai Tianchi","year":"2024","unstructured":"Tianchi Cai, Xierui Song, Jiyan Jiang, Fei Teng, Jinjie Gu, and Guannan Zhang. 2024. ULMA: Unified Language Model Alignment with human demonstration and point-wise preference. arxiv:2312.02554. Retrieved from https:\/\/arxiv.org\/abs\/2312.02554"},{"key":"e_1_3_3_25_2","first-page":"6116","volume-title":"Proceedings of the ICML","volume":"235","author":"Chakraborty Souradip","year":"2024","unstructured":"Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Dinesh Manocha, Furong Huang, Amrit Bedi, and Mengdi Wang. 2024. MaxMin-RLHF: Alignment with diverse human preferences. In Proceedings of the ICML, Vol. 235. 6116\u20136135. 
Retrieved from https:\/\/proceedings.mlr.press\/v235\/chakraborty24b.html"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3743127"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.52202\/079017-3741"},{"key":"e_1_3_3_28_2","volume-title":"Proceedings of the ICLR","year":"2024","unstructured":"Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et\u00a0al. 2024. AlpaGasus: Training a better alpaca with fewer data. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=FdVXgSJhvz"},{"key":"e_1_3_3_29_2","volume-title":"Proceedings of the NeurIPS","author":"Chen Lili","year":"2021","unstructured":"Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. 2021. Decision transformer: Reinforcement learning via sequence modeling. In Proceedings of the NeurIPS. Article 1156, 14 pages. Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2021\/file\/7f489f642a0ddb10272b5c31057f0663-Paper.pdf"},{"key":"e_1_3_3_30_2","volume-title":"Evaluating large language models trained on code","year":"2021","unstructured":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et\u00a0al. 2021. Evaluating large language models trained on code. arxiv:2107.03374. Retrieved from https:\/\/arxiv.org\/abs\/2107.03374"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-emnlp.411"},{"key":"e_1_3_3_32_2","first-page":"6621","volume-title":"Proceedings of the ICML","volume":"235","author":"Chen Zixiang","year":"2024","unstructured":"Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. 2024. Self-play fine-tuning converts weak language models to strong language models. In Proceedings of the ICML, Vol. 235. 6621\u20136642. Retrieved from https:\/\/proceedings.mlr.press\/v235\/chen24j.html"},{"key":"e_1_3_3_33_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.338"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.221"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23780-5_30"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-2401"},{"issue":"240","key":"e_1_3_3_37_2","first-page":"1","article-title":"PaLM: Scaling language modeling with pathways","volume":"24","year":"2023","unstructured":"Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et\u00a0al. 2023. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1\u2013113. Retrieved from http:\/\/jmlr.org\/papers\/v24\/22-1144.html","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_3_38_2","volume-title":"Supervising strong learners by amplifying weak experts","author":"Christiano Paul","year":"2018","unstructured":"Paul Christiano, Buck Shlegeris, and Dario Amodei. 2018. Supervising strong learners by amplifying weak experts. arxiv:1810.08575. Retrieved from https:\/\/arxiv.org\/abs\/1810.08575"},{"key":"e_1_3_3_39_2","first-page":"4302","volume-title":"Proceedings of the NeurIPS","author":"Christiano Paul F.","year":"2017","unstructured":"Paul F. Christiano, Jan Leike, Tom B. 
Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Proceedings of the NeurIPS. 4302\u20134310."},{"key":"e_1_3_3_40_2","volume-title":"Training verifiers to solve math word problems","year":"2021","unstructured":"Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et\u00a0al. 2021. Training verifiers to solve math word problems. arxiv:2110.14168. Retrieved from https:\/\/arxiv.org\/abs\/2110.14168"},{"key":"e_1_3_3_41_2","volume-title":"Proceedings of the ICLR","author":"Coste Thomas","year":"2024","unstructured":"Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. 2024. Reward model ensembles help mitigate overoptimization. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=dcjtMYkpXx"},{"key":"e_1_3_3_42_2","first-page":"9722","volume-title":"Proceedings of the ICML","volume":"235","year":"2024","unstructured":"Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et\u00a0al. 2024. ULTRAFEEDBACK: Boosting language models with scaled AI feedback. In Proceedings of the ICML, Vol. 235. 9722\u20139744. Retrieved from https:\/\/proceedings.mlr.press\/v235\/cui24f.html"},{"key":"e_1_3_3_43_2","volume-title":"Proceedings of the ICLR","author":"Dai Josef","year":"2024","unstructured":"Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. 2024. Safe RLHF: Safe reinforcement learning from human feedback. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=TyFrPOKYXw"},{"key":"e_1_3_3_44_2","volume-title":"Proceedings of the NeurIPS","author":"Dai Wenliang","year":"2023","unstructured":"Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Proceedings of the NeurIPS. Article 2142, 18 pages."},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10514-015-9454-z"},{"key":"e_1_3_3_46_2","first-page":"31156","volume-title":"Proceedings of the ACL","author":"Deng Qiyuan","year":"2025","unstructured":"Qiyuan Deng, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, and Min Zhang. 2025. Efficient safety alignment of large language models via preference re-ranking and representation-based reward modeling. In Proceedings of the ACL. 31156\u201331171. Retrieved from https:\/\/aclanthology.org\/2025.acl-long.1504\/"},{"key":"e_1_3_3_47_2","article-title":"RAFT: Reward rAnked finetuning for generative foundation model alignment","author":"Dong Hanze","year":"2023","unstructured":"Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. 2023. RAFT: Reward rAnked finetuning for generative foundation model alignment. Transactions on Machine Learning Research 2023. Retrieved from https:\/\/openreview.net\/forum?id=m7p5O7zblY","journal-title":"Transactions on Machine Learning Research 2023"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.754"},{"key":"e_1_3_3_49_2","volume-title":"Proceedings of the NeurIPS","author":"Dubois Yann","year":"2023","unstructured":"Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. 
Hashimoto. 2023. AlpacaFarm: A simulation framework for methods that learn from human feedback. In Proceedings of the NeurIPS. Article 1308, 31 pages."},{"key":"e_1_3_3_50_2","first-page":"457","volume-title":"Proceedings of the AAMAS","author":"Asri Layla El","year":"2016","unstructured":"Layla El Asri, Bilal Piot, Matthieu Geist, Romain Laroche, and Olivier Pietquin. 2016. Score-based inverse reinforcement learning. In Proceedings of the AAMAS. 457\u2013465."},{"key":"e_1_3_3_51_2","first-page":"5988","volume-title":"Proceedings of the ICML","volume":"162","author":"Ethayarajh Kawin","year":"2022","unstructured":"Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. 2022. Understanding dataset difficulty with \\(\\mathcal {V}\\) -usable information. In Proceedings of the ICML, Vol. 162. 5988\u20136008. Retrieved from https:\/\/proceedings.mlr.press\/v162\/ethayarajh22a.html"},{"key":"e_1_3_3_52_2","first-page":"12634","volume-title":"Proceedings of the ICML","volume":"235","author":"Ethayarajh Kawin","year":"2024","unstructured":"Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. Model alignment as prospect theoretic optimization. In Proceedings of the ICML, Vol. 235. 12634\u201312651. Retrieved from https:\/\/proceedings.mlr.press\/v235\/ethayarajh24a.html"},{"key":"e_1_3_3_53_2","doi-asserted-by":"publisher","DOI":"10.1287\/opre.11.3.399"},{"key":"e_1_3_3_54_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00626"},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.52202\/079017-1961"},{"key":"e_1_3_3_56_2","first-page":"13927","volume-title":"Proceedings of the ICML","volume":"235","author":"Frans Kevin","year":"2024","unstructured":"Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. 2024. Unsupervised zero-shot reinforcement learning via functional reward encodings. In Proceedings of the ICML, Vol. 235. 13927\u201313942. Retrieved from https:\/\/proceedings.mlr.press\/v235\/frans24a.html"},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-14125-6"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-012-5313-8"},{"key":"e_1_3_3_59_2","first-page":"10835","volume-title":"Proceedings of the ICML","volume":"202","author":"Gao Leo","year":"2023","unstructured":"Leo Gao, John Schulman, and Jacob Hilton. 2023. Scaling laws for reward model overoptimization. In Proceedings of the ICML, Vol. 202. 10835\u201310866. Retrieved from https:\/\/proceedings.mlr.press\/v202\/gao23h.html"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.301"},{"key":"e_1_3_3_61_2","first-page":"4447","volume-title":"Proceedings of the AISTATS","volume":"238","author":"Azar Mohammad Gheshlaghi","year":"2024","unstructured":"Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. 2024. A general theoretical paradigm to understand learning from human preferences. In Proceedings of the AISTATS, Vol. 238. 4447\u20134455. Retrieved from https:\/\/proceedings.mlr.press\/v238\/gheshlaghi-azar24a.html"},{"key":"e_1_3_3_62_2","first-page":"2672","volume-title":"Proceedings of the NeurIPS","author":"Goodfellow Ian J.","year":"2014","unstructured":"Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the NeurIPS. 
2672\u20132680."},{"key":"e_1_3_3_63_2","doi-asserted-by":"publisher","DOI":"10.1098\/rspa.2021.0068"},{"key":"e_1_3_3_64_2","volume-title":"The Llama 3 herd of models","year":"2024","unstructured":"Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et\u00a0al. 2024. The Llama 3 herd of models. arxiv:2407.21783. Retrieved from https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_3_65_2","volume-title":"Reinforced Self-Training (ReST) for language modeling","year":"2023","unstructured":"Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et\u00a0al. 2023. Reinforced Self-Training (ReST) for language modeling. arxiv:2308.08998. Retrieved from https:\/\/arxiv.org\/abs\/2308.08998"},{"key":"e_1_3_3_66_2","volume-title":"Proceedings of the ICLR","author":"Guo Geyang","year":"2024","unstructured":"Geyang Guo, Ranchi Zhao, Tianyi Tang, Xin Zhao, and Ji-Rong Wen. 2024. Beyond imitation: Leveraging fine-grained quality signals for alignment. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=LNLjU5C5dK"},{"key":"e_1_3_3_67_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1358"},{"key":"e_1_3_3_68_2","volume-title":"Proceedings of the ICLR","author":"Hendrycks Dan","year":"2021","unstructured":"Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=d7KBjmI3GmQ"},{"key":"e_1_3_3_69_2","volume-title":"Proceedings of the NeurIPS","year":"2022","unstructured":"Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et\u00a0al. 2022. Training compute-optimal large language models. In Proceedings of the NeurIPS. Article 2176, 15 pages."},{"key":"e_1_3_3_70_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.626"},{"key":"e_1_3_3_71_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.685"},{"key":"e_1_3_3_72_2","volume-title":"Proceedings of the ICLR","author":"Hu Edward J.","year":"2022","unstructured":"Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=nZeVKeeFYf9"},{"key":"e_1_3_3_73_2","volume-title":"Aligning language models with offline learning from human feedback","author":"Hu Jian","year":"2023","unstructured":"Jian Hu, Li Tao, June Yang, and Chandler Zhou. 2023. Aligning language models with offline learning from human feedback. arxiv:2308.12050. Retrieved from https:\/\/arxiv.org\/abs\/2308.12050"},{"key":"e_1_3_3_74_2","volume-title":"Intuitive fine-tuning: Towards simplifying alignment into a single process","author":"Hua Ermo","year":"2024","unstructured":"Ermo Hua, Biqing Qi, Kaiyan Zhang, Yue Yu, Ning Ding, Xingtai Lv, Kai Tian, and Bowen Zhou. 2024. Intuitive fine-tuning: Towards simplifying alignment into a single process. arxiv:2405.11870. 
Retrieved from https:\/\/arxiv.org\/abs\/2405.11870"},{"key":"e_1_3_3_75_2","volume-title":"Qwen2.5-Coder technical report","year":"2024","unstructured":"Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et\u00a0al. 2024. Qwen2.5-Coder technical report. arxiv:2409.12186. Retrieved from https:\/\/arxiv.org\/abs\/2409.12186"},{"key":"e_1_3_3_76_2","first-page":"8022","volume-title":"Proceedings of the NeurIPS","author":"Ibarz Borja","year":"2018","unstructured":"Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. 2018. Reward learning from human preferences and demonstrations in Atari. In Proceedings of the NeurIPS. 8022\u20138034."},{"key":"e_1_3_3_77_2","volume-title":"AI safety via debate","author":"Irving Geoffrey","year":"2018","unstructured":"Geoffrey Irving, Paul Christiano, and Dario Amodei. 2018. AI safety via debate. arxiv:1805.00899. Retrieved from https:\/\/arxiv.org\/abs\/1805.00899"},{"key":"e_1_3_3_78_2","doi-asserted-by":"publisher","DOI":"10.1145\/375735.376334"},{"key":"e_1_3_3_79_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10458-006-0005-z"},{"key":"e_1_3_3_80_2","first-page":"36","volume-title":"Proceedings of the AAAI","author":"Isbell Charles Lee","year":"2000","unstructured":"Charles Lee Isbell, Michael J. Kearns, David P. Kormann, Satinder Singh, and Peter Stone. 2000. Cobot in LambdaMOO: A social statistics agent. In Proceedings of the AAAI. 36\u201341. Retrieved from http:\/\/www.aaai.org\/Library\/AAAI\/2000\/aaai00-006.php"},{"key":"e_1_3_3_81_2","first-page":"1393","volume-title":"Proceedings of the NeurIPS","author":"Isbell Charles Lee","year":"2001","unstructured":"Charles Lee Isbell, Christian R. Shelton, Michael Kearns, Satinder Singh, and Peter Stone. 2001. Cobot: A social reinforcement learning agent. In Proceedings of the NeurIPS. 1393\u20131400."},{"key":"e_1_3_3_82_2","first-page":"21648","volume-title":"Proceedings of the ICML","volume":"235","author":"Ji Haozhe","year":"2024","unstructured":"Haozhe Ji, Cheng Lu, Yilin Niu, Pei Ke, Hongning Wang, Jun Zhu, Jie Tang, and Minlie Huang. 2024. Towards efficient exact optimization of language model alignment. In Proceedings of the ICML, Vol. 235. 21648\u201321671. Retrieved from https:\/\/proceedings.mlr.press\/v235\/ji24c.html"},{"key":"e_1_3_3_83_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-56063-7_42"},{"key":"e_1_3_3_84_2","volume-title":"Proceedings of the NeurIPS","author":"Ji Jiaming","year":"2023","unstructured":"Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. BEAVERTAILS: Towards improved safety alignment of LLM via a human-preference dataset. In Proceedings of the NeurIPS. Article 1072, 27 pages."},{"key":"e_1_3_3_85_2","volume-title":"AI alignment: A comprehensive survey","year":"2024","unstructured":"Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, et\u00a0al. 2024. AI alignment: A comprehensive survey. arxiv:2310.19852. Retrieved from https:\/\/arxiv.org\/abs\/2310.19852"},{"key":"e_1_3_3_86_2","volume-title":"Mixtral of experts","year":"2024","unstructured":"Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et\u00a0al. 2024. Mixtral of experts. arxiv:2401.04088. 
Retrieved from https:\/\/arxiv.org\/abs\/2401.04088"},{"key":"e_1_3_3_87_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02553"},{"key":"e_1_3_3_88_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.792"},{"key":"e_1_3_3_89_2","volume-title":"Preference as reward, maximum preference optimization with importance sampling","author":"Jiang Zaifan","year":"2024","unstructured":"Zaifan Jiang, Xing Huang, and Chao Wei. 2024. Preference as reward, maximum preference optimization with importance sampling. arxiv:2312.16430. Retrieved from https:\/\/arxiv.org\/abs\/2312.16430"},{"key":"e_1_3_3_90_2","volume-title":"A survey of reinforcement learning from human feedback","author":"Kaufmann Timo","year":"2023","unstructured":"Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke H\u00fcllermeier. 2023. A survey of reinforcement learning from human feedback. arxiv:2312.14925. Retrieved from https:\/\/arxiv.org\/abs\/2312.14925"},{"key":"e_1_3_3_91_2","volume-title":"Alignment of language agents","author":"Kenton Zachary","year":"2021","unstructured":"Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. 2021. Alignment of language agents. arxiv:2103.14659. Retrieved from https:\/\/arxiv.org\/abs\/2103.14659"},{"key":"e_1_3_3_92_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.844"},{"key":"e_1_3_3_93_2","volume-title":"Proceedings of the ICLR","year":"2024","unstructured":"Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et\u00a0al. 2024. Prometheus: Inducing fine-grained evaluation capability in language models. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=8euJaTveKw"},{"key":"e_1_3_3_94_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.emnlp-main.148"},{"key":"e_1_3_3_95_2","volume-title":"Learning from Human-Generated Reward","author":"Knox W. Bradley","year":"2012","unstructured":"W. Bradley Knox. 2012. Learning from Human-Generated Reward. Ph. D. Dissertation. University of Texas at Austin."},{"key":"e_1_3_3_96_2","first-page":"292","volume-title":"Proceedings of the ICDL","author":"Knox W. Bradley","year":"2008","unstructured":"W. Bradley Knox and Peter Stone. 2008. TAMER: Training an agent manually via evaluative reinforcement. In Proceedings of the ICDL. 292\u2013297."},{"key":"e_1_3_3_97_2","doi-asserted-by":"publisher","DOI":"10.1145\/1597735.1597738"},{"key":"e_1_3_3_98_2","first-page":"5","volume-title":"Proceedings of the AAMAS","author":"Knox W. Bradley","year":"2010","unstructured":"W. Bradley Knox and Peter Stone. 2010. Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In Proceedings of the AAMAS. 5\u201312."},{"key":"e_1_3_3_99_2","first-page":"475","volume-title":"Proceedings of the AAMAS","author":"Knox W. Bradley","year":"2012","unstructured":"W. Bradley Knox and Peter Stone. 2012. Reinforcement learning from simultaneous human and MDP reward. In Proceedings of the AAMAS. 475\u2013482."},{"key":"e_1_3_3_100_2","doi-asserted-by":"publisher","DOI":"10.1145\/2449396.2449422"},{"key":"e_1_3_3_101_2","volume-title":"Proceedings of the NeurIPS","author":"K\u00f6pf Andreas","year":"2023","unstructured":"Andreas K\u00f6pf, Yannic Kilcher, Dimitri von R\u00fctte, Sotiris Anagnostidis, et\u00a0al. 2023. OpenAssistant conversations - democratizing large language model alignment. In Proceedings of the NeurIPS. 
Article 2064, 13 pages."},{"key":"e_1_3_3_102_2","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.2405460121"},{"key":"e_1_3_3_103_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1165"},{"key":"e_1_3_3_104_2","doi-asserted-by":"publisher","DOI":"10.3389\/frai.2022.778852"},{"key":"e_1_3_3_105_2","first-page":"26874","volume-title":"Proceedings of the ICML","volume":"235","year":"2024","unstructured":"Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et\u00a0al. 2024. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. In Proceedings of the ICML, Vol. 235. 26874\u201326901. Retrieved from https:\/\/proceedings.mlr.press\/v235\/lee24t.html"},{"key":"e_1_3_3_106_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.577"},{"key":"e_1_3_3_107_2","volume-title":"FairMindSim: Alignment of behavior, emotion, and belief in humans and LLM agents amid ethical dilemmas","author":"Lei Yu","year":"2024","unstructured":"Yu Lei, Hao Liu, Chengxing Xie, Songjia Liu, Zhiyu Yin, Canyu Chen, Guohao Li, Philip Torr, and Zhen Wu. 2024. FairMindSim: Alignment of behavior, emotion, and belief in humans and LLM agents amid ethical dilemmas. arxiv:2410.10398. Retrieved from https:\/\/arxiv.org\/abs\/2410.10398"},{"key":"e_1_3_3_108_2","volume-title":"Scalable agent alignment via reward modeling: A research direction","author":"Leike Jan","year":"2018","unstructured":"Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. 2018. Scalable agent alignment via reward modeling: A research direction. arxiv:1811.07871. Retrieved from https:\/\/arxiv.org\/abs\/1811.07871"},{"key":"e_1_3_3_109_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.1011"},{"key":"e_1_3_3_110_2","volume-title":"Proceedings of the ICLR","author":"Li Junlong","year":"2024","unstructured":"Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. 2024. Generative judge for evaluating alignment. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=gtkFw6sZGS"},{"key":"e_1_3_3_111_2","volume-title":"Proceedings of the ICLR","author":"Li Lei","year":"2024","unstructured":"Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, and Hua Wu. 2024. Tool-augmented reward modeling. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=d94x0gWTUX"},{"key":"e_1_3_3_112_2","first-page":"4715","volume-title":"Proceedings of the ACL","author":"Li Margaret","year":"2020","unstructured":"Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, and Jason Weston. 2020. Don\u2019t say that! Making inconsistent dialogue unlikely with unlikelihood training. In Proceedings of the ACL. 4715\u20134728. Retrieved from https:\/\/aclanthology.org\/2020.acl-main.428"},{"key":"e_1_3_3_113_2","volume-title":"Proceedings of the ICLR","author":"Li Xian","year":"2024","unstructured":"Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Omer Levy, Luke Zettlemoyer, Jason E. Weston, and Mike Lewis. 2024. Self-alignment with instruction backtranslation. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=1oijHJBRsT"},{"key":"e_1_3_3_114_2","first-page":"29128","volume-title":"Proceedings of the ICML","volume":"235","author":"Li Ziniu","year":"2024","unstructured":"Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. 2024. 
ReMax: A simple, effective, and efficient reinforcement learning method for aligning large language models. In Proceedings of the ICML, Vol. 235. 29128\u201329163. Retrieved from https:\/\/proceedings.mlr.press\/v235\/li24cd.html"},{"key":"e_1_3_3_115_2","volume-title":"Proceedings of the ICLR","author":"Lightman Hunter","year":"2024","unstructured":"Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2024. Let\u2019s verify step by step. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=v8L0pN6EOi"},{"key":"e_1_3_3_116_2","first-page":"55","article-title":"A technique for the measurement of attitudes","volume":"22","author":"Likert Rensis","year":"1932","unstructured":"Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology 22 140 (1932), 55\u201355.","journal-title":"Archives of Psychology"},{"key":"e_1_3_3_117_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.229"},{"key":"e_1_3_3_118_2","volume-title":"Proceedings of the NeurIPS","author":"Liu Haotian","year":"2023","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In Proceedings of the NeurIPS. Article 1516, 25 pages."},{"key":"e_1_3_3_119_2","volume-title":"Proceedings of the ICLR","author":"Liu Hao","year":"2024","unstructured":"Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. 2024. Chain of hindsight aligns language models with feedback. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=6xfe4IVcOu"},{"key":"e_1_3_3_120_2","volume-title":"Proceedings of the NeurIPS","author":"Liu Ruibo","year":"2022","unstructured":"Ruibo Liu, Chenyan Jia, Ge Zhang, Ziyu Zhuang, Tony X. Liu, and Soroush Vosoughi. 2022. Second thoughts are best: Learning to re-align with human values from text edits. In Proceedings of the NeurIPS. Article 14, 16 pages."},{"key":"e_1_3_3_121_2","volume-title":"Proceedings of the ICLR","author":"Liu Ruibo","year":"2024","unstructured":"Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Diyi Yang, and Soroush Vosoughi. 2024. Training socially aligned language models on simulated social interactions. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=NddKiWtdUm"},{"key":"e_1_3_3_122_2","volume-title":"LiPO: Listwise preference optimization through learning-to-rank","year":"2024","unstructured":"Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, et\u00a0al. 2024. LiPO: Listwise preference optimization through learning-to-rank. arxiv:2402.01878. Retrieved from https:\/\/arxiv.org\/abs\/2402.01878"},{"key":"e_1_3_3_123_2","volume-title":"Proceedings of the ICLR","author":"Liu Tianqi","year":"2024","unstructured":"Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, and Jialu Liu. 2024. Statistical rejection sampling improves preference optimization. In Proceedings of the ICLR. 
Retrieved from https:\/\/openreview.net\/forum?id=xbjSwwrQOe"},{"key":"e_1_3_3_124_2","doi-asserted-by":"publisher","DOI":"10.1561\/1500000016"},{"key":"e_1_3_3_125_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.572"},{"key":"e_1_3_3_126_2","doi-asserted-by":"publisher","DOI":"10.1145\/3627673.3679596"},{"key":"e_1_3_3_127_2","volume-title":"WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct","author":"Luo Haipeng","year":"2023","unstructured":"Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. arxiv:2308.09583. Retrieved from https:\/\/arxiv.org\/abs\/2308.09583"},{"key":"e_1_3_3_128_2","first-page":"142","volume-title":"Proceedings of the ACL","author":"Maas Andrew L.","year":"2011","unstructured":"Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the ACL. 142\u2013150. Retrieved from https:\/\/aclanthology.org\/P11-1015"},{"key":"e_1_3_3_129_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF00114730"},{"key":"e_1_3_3_130_2","first-page":"17622","volume-title":"Proceedings of the EMNLP","author":"Mao Xin","year":"2024","unstructured":"Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Wang Chen, and Anh Tuan Luu. 2024. Don\u2019t forget your reward values: Language model alignment via value-based calibration. In Proceedings of the EMNLP. 17622\u201317642. Retrieved from https:\/\/aclanthology.org\/2024.emnlp-main.976"},{"key":"e_1_3_3_131_2","volume-title":"Proceedings of the NeurIPS","author":"Meng Yu","year":"2024","unstructured":"Yu Meng, Mengzhou Xia, and Danqi Chen. 2024. SimPO: Simple preference optimization with a reference-free reward. In Proceedings of the NeurIPS. Article 3946, 38 pages. Retrieved from https:\/\/openreview.net\/forum?id=3Tzcot1LKb"},{"key":"e_1_3_3_132_2","volume-title":"Proceedings of the ICLR","author":"Moskovitz Ted","year":"2024","unstructured":"Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca Dragan, and Stephen Marcus McAleer. 2024. Confronting reward model overoptimization with constrained RLHF. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=gkfUvn0fLU"},{"key":"e_1_3_3_133_2","first-page":"15991","volume-title":"Proceedings of the ACL","year":"2023","unstructured":"Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, et\u00a0al. 2023. Crosslingual generalization through multitask finetuning. In Proceedings of the ACL. 15991\u201316111. Retrieved from https:\/\/aclanthology.org\/2023.acl-long.891"},{"key":"e_1_3_3_134_2","volume-title":"Proceedings of the NeurIPS Workshop SoLaR","author":"Mukobi Gabriel","year":"2023","unstructured":"Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, and Silas Alberti. 2023. SuperHF: Supervised iterative learning from human feedback. In Proceedings of the NeurIPS Workshop SoLaR. 
Retrieved from https:\/\/openreview.net\/forum?id=FMINWxrHOJ"},{"key":"e_1_3_3_135_2","first-page":"36743","volume-title":"Proceedings of the ICML","volume":"235","year":"2024","unstructured":"Remi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, C\u00f4me Fiegel, et\u00a0al. 2024. Nash learning from human feedback. In Proceedings of the ICML, Vol. 235. 36743\u201336768. Retrieved from https:\/\/proceedings.mlr.press\/v235\/munos24a.html"},{"key":"e_1_3_3_136_2","volume-title":"WebGPT: Browser-assisted question-answering with human feedback","year":"2022","unstructured":"Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et\u00a0al. 2022. WebGPT: Browser-assisted question-answering with human feedback. arxiv:2112.09332. Retrieved from https:\/\/arxiv.org\/abs\/2112.09332"},{"key":"e_1_3_3_137_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K16-1028"},{"key":"e_1_3_3_138_2","volume-title":"GPT-4 technical report","year":"2024","unstructured":"OpenAI. 2024. GPT-4 technical report. arxiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_3_139_2","volume-title":"Proceedings of the NeurIPS","year":"2022","unstructured":"Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et\u00a0al. 2022. Training language models to follow instructions with human feedback. In Proceedings of the NeurIPS. Article 2011, 15 pages."},{"key":"e_1_3_3_140_2","first-page":"39400","volume-title":"Proceedings of the ICML","volume":"235","author":"Pandey Gaurav","year":"2024","unstructured":"Gaurav Pandey, Yatin Nandwani, Tahira Naseem, Mayank Mishra, Guangxuan Xu, Dinesh Raghu, Sachindra Joshi, Asim Munawar, and Ram\u00f3n Fernandez Astudillo. 2024. BRAIn: Bayesian reward-conditioned amortized inference for natural language generation from feedback. In Proceedings of the ICML, Vol. 235. 39400\u201339415. Retrieved from https:\/\/proceedings.mlr.press\/v235\/pandey24a.html"},{"key":"e_1_3_3_141_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.262"},{"key":"e_1_3_3_142_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICORR.2011.5975338"},{"key":"e_1_3_3_143_2","volume-title":"Scaling language models: Methods, analysis and insights from training Gopher","year":"2022","unstructured":"Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et\u00a0al. 2022. Scaling language models: Methods, analysis and insights from training Gopher. arxiv:2112.11446. Retrieved from https:\/\/arxiv.org\/abs\/2112.11446"},{"key":"e_1_3_3_144_2","volume-title":"Proceedings of the COLM","author":"Rafailov Rafael","year":"2024","unstructured":"Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. 2024. From r to \\(Q^*\\) : Your language model is secretly a Q-function. In Proceedings of the COLM. Retrieved from https:\/\/openreview.net\/forum?id=kEVcNxtqXk"},{"key":"e_1_3_3_145_2","volume-title":"Proceedings of the NeurIPS","author":"Rafailov Rafael","year":"2023","unstructured":"Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the NeurIPS. 
Article 2338, 14 pages."},{"key":"e_1_3_3_146_2","volume-title":"Proceedings of the ICLR","author":"Ramamurthy Rajkumar","year":"2023","unstructured":"Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kiant\u00e9 Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. 2023. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=8aHzds2uUyB"},{"key":"e_1_3_3_147_2","volume-title":"Proceedings of the NeurIPS","author":"Rame Alexandre","year":"2023","unstructured":"Alexandre Rame, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, and Matthieu Cord. 2023. Rewarded soups: Towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Proceedings of the NeurIPS. Article 3114, 40 pages."},{"key":"e_1_3_3_148_2","doi-asserted-by":"publisher","DOI":"10.1145\/3589334.3645458"},{"key":"e_1_3_3_149_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.sigdial-1.27"},{"key":"e_1_3_3_150_2","volume-title":"Offline regularised reinforcement learning for large language models alignment","year":"2024","unstructured":"Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, et\u00a0al. 2024. Offline regularised reinforcement learning for large language models alignment. arxiv:2405.19107. Retrieved from https:\/\/arxiv.org\/abs\/2405.19107"},{"key":"e_1_3_3_151_2","volume-title":"Efficient RLHF: Reducing the memory usage of PPO","author":"Santacroce Michael","year":"2023","unstructured":"Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, and Yelong Shen. 2023. Efficient RLHF: Reducing the memory usage of PPO. arxiv:2309.00754. Retrieved from https:\/\/arxiv.org\/abs\/2309.00754"},{"key":"e_1_3_3_152_2","first-page":"1503","volume-title":"Proceedings of the ICML","author":"Schoenauer Marc","year":"2014","unstructured":"Marc Schoenauer, Riad Akrour, Michele Sebag, and Jean-Christophe Souplet. 2014. Programming by feedback. In Proceedings of the ICML. 1503\u20131511. Retrieved from https:\/\/proceedings.mlr.press\/v32\/schoenauer14.html"},{"key":"e_1_3_3_153_2","volume-title":"Proximal policy optimization algorithms","author":"Schulman John","year":"2017","unstructured":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arxiv:1707.06347. Retrieved from https:\/\/arxiv.org\/abs\/1707.06347"},{"key":"e_1_3_3_154_2","volume-title":"DeepSeekMath: Pushing the limits of mathematical reasoning in open language models","year":"2024","unstructured":"Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, et\u00a0al. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arxiv:2402.03300. Retrieved from https:\/\/arxiv.org\/abs\/2402.03300"},{"key":"e_1_3_3_155_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.539"},{"key":"e_1_3_3_156_2","volume-title":"Large language model alignment: A survey","author":"Shen Tianhao","year":"2023","unstructured":"Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, and Deyi Xiong. 2023. Large language model alignment: A survey. arxiv:2309.15025. 
Retrieved from https:\/\/arxiv.org\/abs\/2309.15025"},{"key":"e_1_3_3_157_2","first-page":"11521","volume-title":"Proceedings of the ACL","year":"2024","unstructured":"Shivalika Singh, Freddie Vargus, Daniel D\u2019souza, B\u00f6rje Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O\u2019Mahony, et\u00a0al. 2024. Aya dataset: An open-access collection for multilingual instruction tuning. In Proceedings of the ACL. 11521\u201311567. Retrieved from https:\/\/aclanthology.org\/2024.acl-long.620"},{"key":"e_1_3_3_158_2","article-title":"Break it, imitate it, fix it: Robustness by generating human-like attacks","author":"Sinha Aradhana","year":"2024","unstructured":"Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, and Alex Beutel. 2024. Break it, imitate it, fix it: Robustness by generating human-like attacks. Transactions on Machine Learning Research 2024. Retrieved from https:\/\/openreview.net\/forum?id=UAT4j3Y7HP. Expert Certification.","journal-title":"Transactions on Machine Learning Research 2024"},{"key":"e_1_3_3_159_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i17.29865"},{"key":"e_1_3_3_160_2","first-page":"46280","volume-title":"Proceedings of the ICML","volume":"235","year":"2024","unstructured":"Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell L Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, et\u00a0al. 2024. Position: A roadmap to pluralistic alignment. In Proceedings of the ICML, Vol. 235. 46280\u201346302. Retrieved from https:\/\/proceedings.mlr.press\/v235\/sorensen24a.html"},{"key":"e_1_3_3_161_2","volume-title":"Proceedings of the NeurIPS","author":"Stiennon Nisan","year":"2020","unstructured":"Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. In Proceedings of the NeurIPS. Article 253, 14 pages."},{"key":"e_1_3_3_162_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41562-024-01882-z"},{"key":"e_1_3_3_163_2","volume-title":"Proceedings of the CHI Workshop ToMinHAI","author":"Street Winnie","year":"2024","unstructured":"Winnie Street. 2024. LLM theory of mind and alignment: Opportunities and risks. In Proceedings of the CHI Workshop ToMinHAI."},{"key":"e_1_3_3_164_2","doi-asserted-by":"publisher","DOI":"10.1109\/ROMAN.2011.6005223"},{"key":"e_1_3_3_165_2","first-page":"13088","volume-title":"Findings of the ACL","year":"2024","unstructured":"Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et\u00a0al. 2024. Aligning large multimodal models with factually augmented RLHF. In Findings of the ACL. 13088\u201313110. Retrieved from https:\/\/aclanthology.org\/2024.findings-acl.775"},{"key":"e_1_3_3_166_2","volume-title":"Proceedings of the ICLR","author":"Sun Zhiqing","year":"2024","unstructured":"Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Daniel Cox, Yiming Yang, and Chuang Gan. 2024. SALMON: Self-alignment with instructable reward models. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=xJbsmB8UMx"},{"key":"e_1_3_3_167_2","volume-title":"Proceedings of the NeurIPS","author":"Sun Zhiqing","year":"2023","unstructured":"Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. 2023. 
Principle-driven self-alignment of language models from scratch with minimal human supervision. In Proceedings of the NeurIPS. Article 115, 55 pages."},{"key":"e_1_3_3_168_2","first-page":"13003","volume-title":"Findings of the ACL","year":"2023","unstructured":"Mirac Suzgun, Nathan Scales, Nathanael Sch\u00e4rli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et\u00a0al. 2023. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the ACL. 13003\u201313051. Retrieved from https:\/\/aclanthology.org\/2023.findings-acl.824"},{"key":"e_1_3_3_169_2","first-page":"47345","volume-title":"Proceedings of the ICML","volume":"235","author":"Swamy Gokul","year":"2024","unstructured":"Gokul Swamy, Christoph Dann, Rahul Kidambi, Steven Wu, and Alekh Agarwal. 2024. A minimaximalist approach to reinforcement learning from human feedback. In Proceedings of the ICML, Vol. 235. 47345\u201347377. Retrieved from https:\/\/proceedings.mlr.press\/v235\/swamy24a.html"},{"key":"e_1_3_3_170_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2025.findings-acl.839"},{"key":"e_1_3_3_171_2","first-page":"47725","volume-title":"Proceedings of the ICML","volume":"235","author":"Tang Yunhao","year":"2024","unstructured":"Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Remi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Avila Pires, and Bilal Piot. 2024. Generalized preference optimization: A unified approach to offline alignment. In Proceedings of the ICML, Vol. 235. 47725\u201347742. Retrieved from https:\/\/proceedings.mlr.press\/v235\/tang24b.html"},{"key":"e_1_3_3_172_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.naacl-long.428"},{"key":"e_1_3_3_173_2","volume-title":"Llama 2: Open foundation and fine-tuned chat models","year":"2023","unstructured":"Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et\u00a0al. 2023. Llama 2: Open foundation and fine-tuned chat models. arxiv:2307.09288. Retrieved from https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_3_3_174_2","volume-title":"Proceedings of the COLM","year":"2024","unstructured":"Lewis Tunstall, Edward Emanuel Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro Von Werra, Cl\u00e9mentine Fourrier, Nathan Habib, et\u00a0al. 2024. Zephyr: Direct distillation of LM alignment. In Proceedings of the COLM. Retrieved from https:\/\/openreview.net\/forum?id=aKkAwZB6JV"},{"key":"e_1_3_3_175_2","volume-title":"Large language models fail on trivial alterations to theory-of-mind tasks","author":"Ullman Tomer","year":"2023","unstructured":"Tomer Ullman. 2023. Large language models fail on trivial alterations to theory-of-mind tasks. arxiv:2302.08399. Retrieved from https:\/\/arxiv.org\/abs\/2302.08399"},{"key":"e_1_3_3_176_2","volume-title":"Representation learning with contrastive predictive coding","author":"Oord Aaron van den","year":"2019","unstructured":"Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2019. Representation learning with contrastive predictive coding. arxiv:1807.03748. Retrieved from https:\/\/arxiv.org\/abs\/1807.03748"},{"key":"e_1_3_3_177_2","first-page":"6000","volume-title":"Proceedings of the NeurIPS","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. 
Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the NeurIPS. 6000\u20136010."},{"key":"e_1_3_3_178_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-4508"},{"key":"e_1_3_3_179_2","first-page":"49890","volume-title":"Proceedings of the ICML","volume":"235","author":"Wan Ziyu","year":"2024","unstructured":"Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 2024. AlphaZero-like tree-search can guide large language model decoding and training. In Proceedings of the ICML, Vol. 235. 49890\u201349920. Retrieved from https:\/\/proceedings.mlr.press\/v235\/wan24c.html"},{"key":"e_1_3_3_180_2","volume-title":"Proceedings of the ICLR","author":"Wang Chaoqi","year":"2024","unstructured":"Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. 2024. Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=2cRzmWXK9N"},{"key":"e_1_3_3_181_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.emnlp-main.460"},{"key":"e_1_3_3_182_2","volume-title":"Proceedings of the ICLR","author":"Wang Guan","year":"2024","unstructured":"Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. 2024. OpenChat: Advancing open-source language models with mixed-quality data. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=AOJyfhWYHf"},{"key":"e_1_3_3_183_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.468"},{"key":"e_1_3_3_184_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-emnlp.620"},{"key":"e_1_3_3_185_2","volume-title":"Proceedings of the NeurIPS","author":"Wang Jiashuo","year":"2023","unstructured":"Jiashuo Wang, Haozhao Wang, Shichao Sun, and Wenjie Li. 2023. Aligning language models with human preferences via a Bayesian approach. In Proceedings of the NeurIPS. Article 2135, 20 pages."},{"key":"e_1_3_3_186_2","first-page":"9440","volume-title":"Proceedings of the ACL","year":"2024","unstructured":"Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et\u00a0al. 2024. Large language models are not fair evaluators. In Proceedings of the ACL. 9440\u20139450. Retrieved from https:\/\/aclanthology.org\/2024.acl-long.511"},{"key":"e_1_3_3_187_2","volume-title":"Making large language models better reasoners with alignment","author":"Wang Peiyi","year":"2023","unstructured":"Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, and Zhifang Sui. 2023. Making large language models better reasoners with alignment. arxiv:2309.02144. Retrieved from https:\/\/arxiv.org\/abs\/2309.02144"},{"key":"e_1_3_3_188_2","volume-title":"ERNIE 3.0 Titan: Exploring larger-scale knowledge enhanced pre-training for language understanding and generation","year":"2021","unstructured":"Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng, Junyuan Shang, Yanbin Zhao, Chao Pang, et\u00a0al. 2021. ERNIE 3.0 Titan: Exploring larger-scale knowledge enhanced pre-training for language understanding and generation. arxiv:2112.12731. 
Retrieved from https:\/\/arxiv.org\/abs\/2112.12731"},{"key":"e_1_3_3_189_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1224"},{"key":"e_1_3_3_190_2","volume-title":"Self-Taught evaluators","author":"Wang Tianlu","year":"2024","unstructured":"Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, and Xian Li. 2024. Self-Taught evaluators. arxiv:2408.02666. Retrieved from https:\/\/arxiv.org\/abs\/2408.02666"},{"key":"e_1_3_3_191_2","volume-title":"Shepherd: A critic for language model generation","author":"Wang Tianlu","year":"2023","unstructured":"Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O\u2019Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2023. Shepherd: A critic for language model generation. arxiv:2308.04592. Retrieved from https:\/\/arxiv.org\/abs\/2308.04592"},{"key":"e_1_3_3_192_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2024\/918"},{"key":"e_1_3_3_193_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.754"},{"key":"e_1_3_3_194_2","first-page":"5085","volume-title":"Proceedings of the EMNLP","year":"2022","unstructured":"Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et\u00a0al. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the EMNLP. 5085\u20135109. Retrieved from https:\/\/aclanthology.org\/2022.emnlp-main.340"},{"key":"e_1_3_3_195_2","volume-title":"Proceedings of the ICLR","year":"2024","unstructured":"Yidong Wang, Zhuohao Yu, Wenjin Yao, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, et\u00a0al. 2024. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=5Nn2BLV7SB"},{"key":"e_1_3_3_196_2","volume-title":"Aligning large language models with human: A survey","author":"Wang Yufei","year":"2023","unstructured":"Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023. Aligning large language models with human: A survey. arxiv:2307.12966. Retrieved from https:\/\/arxiv.org\/abs\/2307.12966"},{"key":"e_1_3_3_197_2","volume-title":"A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and More","year":"2024","unstructured":"Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu Zhu, Xiang-Bo Mao, Sitaram Asur, et\u00a0al. 2024. A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and More. arxiv:2407.16216. Retrieved from https:\/\/arxiv.org\/abs\/2407.16216"},{"key":"e_1_3_3_198_2","volume-title":"Proceedings of the ICLR","author":"Wei Jason","year":"2022","unstructured":"Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned language models are zero-shot learners. In Proceedings of the ICLR.
Retrieved from https:\/\/openreview.net\/forum?id=gEZrGCozdqR"},{"key":"e_1_3_3_199_2","volume-title":"Ethical and social risks of harm from language models","year":"2021","unstructured":"Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et\u00a0al. 2021. Ethical and social risks of harm from language models. arxiv:2112.04359. Retrieved from https:\/\/arxiv.org\/abs\/2112.04359"},{"key":"e_1_3_3_200_2","volume-title":"Proceedings of the ICLR","author":"Welleck Sean","year":"2020","unstructured":"Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural text generation with unlikelihood training. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=SJeYe0NtvH"},{"key":"e_1_3_3_201_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.acl-long.451"},{"key":"e_1_3_3_202_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF00992696"},{"key":"e_1_3_3_203_2","first-page":"1133","volume-title":"Proceedings of the NeurIPS","author":"Wilson Aaron","year":"2012","unstructured":"Aaron Wilson, Alan Fern, and Prasad Tadepalli. 2012. A Bayesian approach for policy learning from trajectory preference queries. In Proceedings of the NeurIPS. 1133\u20131141."},{"issue":"136","key":"e_1_3_3_204_2","first-page":"1","article-title":"A survey of preference-based reinforcement learning methods","volume":"18","author":"Wirth Christian","year":"2017","unstructured":"Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes F\u00fcrnkranz. 2017. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research 18, 136 (2017), 1\u201346. Retrieved from http:\/\/jmlr.org\/papers\/v18\/16-634.html","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_3_205_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-41398-8_37"},{"key":"e_1_3_3_206_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v30i1.10269"},{"key":"e_1_3_3_207_2","volume-title":"Proceedings of the NeurIPS","author":"Wu Junkang","year":"2024","unstructured":"Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. 2024. \\(\\beta\\)-DPO: Direct preference optimization with dynamic \\(\\beta\\). In Proceedings of the NeurIPS. Article 4128, 23 pages. Retrieved from https:\/\/openreview.net\/forum?id=ZfBuhzE556"},{"key":"e_1_3_3_208_2","volume-title":"Thinking LLMs: General instruction following with thought generation","author":"Wu Tianhao","year":"2024","unstructured":"Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. 2024. Thinking LLMs: General instruction following with thought generation. arxiv:2410.10630. Retrieved from https:\/\/arxiv.org\/abs\/2410.10630"},{"key":"e_1_3_3_209_2","volume-title":"Proceedings of the NeurIPS Workshop FMDM","author":"Wu Tianhao","year":"2023","unstructured":"Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran, and Jiantao Jiao. 2023. Pairwise proximal policy optimization: Harnessing relative feedback for LLM alignment. In Proceedings of the NeurIPS Workshop FMDM. Retrieved from https:\/\/openreview.net\/forum?id=yQT406rH72"},{"key":"e_1_3_3_210_2","volume-title":"Proceedings of the NeurIPS","author":"Wu Zeqiu","year":"2023","unstructured":"Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. 2023.
Fine-grained human feedback gives better rewards for language model training. In Proceedings of the NeurIPS. Article 2574, 26 pages."},{"key":"e_1_3_3_211_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-emnlp.775"},{"key":"e_1_3_3_212_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-acl.730"},{"key":"e_1_3_3_213_2","volume-title":"A survey on knowledge distillation of large language models","author":"Xu Xiaohan","year":"2024","unstructured":"Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. 2024. A survey on knowledge distillation of large language models. arxiv:2402.13116. Retrieved from https:\/\/arxiv.org\/abs\/2402.13116"},{"key":"e_1_3_3_214_2","volume-title":"Baichuan 2: Open large-scale language models","year":"2023","unstructured":"Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et\u00a0al. 2023. Baichuan 2: Open large-scale language models. arxiv:2309.10305. Retrieved from https:\/\/arxiv.org\/abs\/2309.10305"},{"key":"e_1_3_3_215_2","volume-title":"Qwen2 technical report","year":"2024","unstructured":"An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et\u00a0al. 2024. Qwen2 technical report. arxiv:2407.10671. Retrieved from https:\/\/arxiv.org\/abs\/2407.10671"},{"key":"e_1_3_3_216_2","volume-title":"Proceedings of the ICLR","author":"Yang Kevin","year":"2024","unstructured":"Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, and Yuandong Tian. 2024. RLCD: Reinforcement learning from contrastive distillation for LM alignment. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=v3XXtxWKi6"},{"key":"e_1_3_3_217_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W19-8608"},{"key":"e_1_3_3_218_2","volume-title":"Proceedings of the NeurIPS","author":"Yuan Hongyi","year":"2023","unstructured":"Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. RRHF: Rank responses to align language models with human feedback. In Proceedings of the NeurIPS. Article 482, 16 pages."},{"key":"e_1_3_3_219_2","volume-title":"Proceedings of the ICLR","year":"2023","unstructured":"Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et\u00a0al. 2023. GLM-130B: An open bilingual pre-trained model. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=-Aw0rrrPUF"},{"key":"e_1_3_3_220_2","volume-title":"Exploring translation mechanism of large language models","author":"Zhang Hongbin","year":"2025","unstructured":"Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Yang Xiang, and Min Zhang. 2025. Exploring translation mechanism of large language models. arxiv:2502.11806. Retrieved from https:\/\/arxiv.org\/abs\/2502.11806"},{"key":"e_1_3_3_221_2","volume-title":"Proceedings of the ICLR","author":"Zhang Han","year":"2024","unstructured":"Han Zhang, Yu Lei, Lin Gui, Min Yang, Yulan He, Hui Wang, and Ruifeng Xu. 2024. CPPO: Continual learning for reinforcement learning with human feedback. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=86zAUE80pP"},{"key":"e_1_3_3_222_2","first-page":"41414","volume-title":"Proceedings of the ICML","volume":"202","author":"Zhang Tianjun","year":"2023","unstructured":"Tianjun Zhang, Fangchen Liu, Justin Wong, Pieter Abbeel, and Joseph E. Gonzalez. 2023. 
The wisdom of hindsight makes language models better instruction followers. In Proceedings of the ICML, Vol. 202. 41414\u201341428. Retrieved from https:\/\/proceedings.mlr.press\/v202\/zhang23ab.html"},{"key":"e_1_3_3_223_2","volume-title":"Proceedings of the ICLR","author":"Zhao Siyan","year":"2024","unstructured":"Siyan Zhao, John Dang, and Aditya Grover. 2024. Group preference optimization: Few-shot alignment of large language models. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=DpFeMH4l8Q"},{"key":"e_1_3_3_224_2","volume-title":"A survey of large language models","year":"2023","unstructured":"Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et\u00a0al. 2023. A survey of large language models. arxiv:2303.18223. Retrieved from https:\/\/arxiv.org\/abs\/2303.18223"},{"key":"e_1_3_3_225_2","volume-title":"SLiC-HF: Sequence likelihood calibration with human feedback","author":"Zhao Yao","year":"2023","unstructured":"Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. 2023. SLiC-HF: Sequence likelihood calibration with human feedback. arxiv:2305.10425. Retrieved from https:\/\/arxiv.org\/abs\/2305.10425"},{"key":"e_1_3_3_226_2","volume-title":"Proceedings of the ICLR","author":"Zhao Yao","year":"2023","unstructured":"Yao Zhao, Mikhail Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J. Liu. 2023. Calibrating sequence likelihood improves conditional language generation. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=0qSOodKmJaN"},{"key":"e_1_3_3_227_2","volume-title":"Proceedings of the NeurIPS","year":"2023","unstructured":"Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et\u00a0al. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proceedings of the NeurIPS. Article 2020, 29 pages."},{"key":"e_1_3_3_228_2","volume-title":"Secrets of RLHF in large language models part I: PPO","year":"2023","unstructured":"Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, et\u00a0al. 2023. Secrets of RLHF in large language models part I: PPO. arxiv:2307.04964. Retrieved from https:\/\/arxiv.org\/abs\/2307.04964"},{"key":"e_1_3_3_229_2","volume-title":"Proceedings of the ICLR","year":"2024","unstructured":"Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Haoran Huang, Tao Gui, et\u00a0al. 2024. Improving generalization of alignment with human preferences through group invariant learning. In Proceedings of the ICLR. Retrieved from https:\/\/openreview.net\/forum?id=fwCoLe3TAX"},{"key":"e_1_3_3_230_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-naacl.149"},{"key":"e_1_3_3_231_2","volume-title":"Proceedings of the NeurIPS","year":"2023","unstructured":"Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et\u00a0al. 2023. LIMA: Less is more for alignment. In Proceedings of the NeurIPS.
Article 2400, 16 pages."},{"key":"e_1_3_3_232_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.624"},{"key":"e_1_3_3_233_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6521"},{"key":"e_1_3_3_234_2","volume-title":"Fine-tuning language models from human preferences","author":"Ziegler Daniel M.","year":"2020","unstructured":"Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2020. Fine-tuning language models from human preferences. arxiv:1909.08593. Retrieved from https:\/\/arxiv.org\/abs\/1909.08593"},{"key":"e_1_3_3_235_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2025.findings-acl.1039"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3773279","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T14:32:03Z","timestamp":1765031523000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3773279"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,6]]},"references-count":234,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2026,4,30]]}},"alternative-id":["10.1145\/3773279"],"URL":"https:\/\/doi.org\/10.1145\/3773279","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,6]]},"assertion":[{"value":"2024-09-06","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-07","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}