{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,21]],"date-time":"2026-01-21T16:24:18Z","timestamp":1769012658653,"version":"3.49.0"},"reference-count":52,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2025,4,28]],"date-time":"2025-04-28T00:00:00Z","timestamp":1745798400000},"content-version":"vor","delay-in-days":117,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,4,17]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Recent advances in large language models (LLMs) focus on aligning models with human values to minimize harmful content. However, existing methods often rely on a single type of feedback, such as preferences, annotated labels, or critiques, which can lead to overfitting and suboptimal performance. In this paper, we propose Diverse AI Feedback (DAIF), a novel approach that integrates three types of feedback\u2014critique, refinement, and preference\u2014tailored to tasks of varying uncertainty levels. Through an analysis of information gain, we show that critique feedback is most effective for low-uncertainty tasks, refinement feedback for medium-uncertainty tasks, and preference feedback for high-uncertainty tasks. Training with this diversified feedback reduces overfitting and improves alignment. 
Experimental results across three tasks\u2014question answering, dialog generation, and text summarization\u2014demonstrate that DAIF outperforms traditional methods relying on a single feedback type.<\/jats:p>","DOI":"10.1162\/tacl_a_00746","type":"journal-article","created":{"date-parts":[[2025,4,28]],"date-time":"2025-04-28T19:04:54Z","timestamp":1745867094000},"page":"392-407","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":1,"title":["Diverse AI Feedback For Large Language Model Alignment"],"prefix":"10.1162","volume":"13","author":[{"given":"Tianshu","family":"Yu","sequence":"first","affiliation":[{"name":"Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China. ts.yu@siat.ac.cn"},{"name":"University of Chinese Academy of Sciences, China"},{"name":"Tongyi Lab, China"}]},{"given":"Ting-En","family":"Lin","sequence":"additional","affiliation":[{"name":"Tongyi Lab, China. ting-en.lte@alibaba-inc.com"}]},{"given":"Yuchuan","family":"Wu","sequence":"additional","affiliation":[{"name":"Tongyi Lab, China"}]},{"given":"Min","family":"Yang","sequence":"additional","affiliation":[{"name":"Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China. min.yang@siat.ac.cn"},{"name":"Shenzhen University of Advanced Technology, China"}]},{"given":"Fei","family":"Huang","sequence":"additional","affiliation":[{"name":"Tongyi Lab, China"}]},{"given":"Yongbin","family":"Li","sequence":"additional","affiliation":[{"name":"Tongyi Lab, China. 
shuide.lyb@alibaba-inc.com"}]}],"member":"281","published-online":{"date-parts":[[2025,4,17]]},"reference":[{"key":"2025042815044865300_bib1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.427","article-title":"Rl4f: Generating natural language feedback with reinforcement learning for repairing model outputs","author":"Aky\u00fcrek","year":"2023","journal-title":"arXiv preprint arXiv:2305.08844"},{"key":"2025042815044865300_bib2","article-title":"Uncertainty in natural language generation: From theory to applications","author":"Baan","year":"2023","journal-title":"arXiv preprint arXiv:2307.15703"},{"key":"2025042815044865300_bib3","article-title":"Training a helpful and harmless assistant with reinforcement learning from human feedback","author":"Bai","year":"2022","journal-title":"arXiv preprint arXiv:2204.05862"},{"key":"2025042815044865300_bib4","article-title":"Constitutional ai: Harmlessness from ai feedback","author":"Bai","year":"2022","journal-title":"arXiv preprint arXiv:2212.08073"},{"issue":"1","key":"2025042815044865300_bib5","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1177\/02783649211041652","article-title":"Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences","volume":"41","author":"B\u0131y\u0131k","year":"2022","journal-title":"The International Journal of Robotics Research"},{"key":"2025042815044865300_bib6","article-title":"Asking easy questions: A user-friendly approach to active reward learning","author":"B\u0131y\u0131k","year":"2019","journal-title":"arXiv preprint arXiv:1910.04365"},{"issue":"3\/4","key":"2025042815044865300_bib7","doi-asserted-by":"publisher","first-page":"324","DOI":"10.1093\/biomet\/39.3-4.324","article-title":"Rank analysis of incomplete block designs: I. 
The method of paired comparisons","volume":"39","author":"Bradley","year":"1952","journal-title":"Biometrika"},{"key":"2025042815044865300_bib8","article-title":"Sparks of artificial general intelligence: Early experiments with gpt-4","author":"Bubeck","year":"2023","journal-title":"arXiv preprint arXiv:2303.12712"},{"key":"2025042815044865300_bib9","article-title":"Self-play fine-tuning converts weak language models to strong language models","author":"Chen","year":"2024","journal-title":"arXiv preprint arXiv:2401.01335"},{"key":"2025042815044865300_bib10","article-title":"Active preference optimization for sample efficient rlhf","volume-title":"ICML 2024 Workshop on Theoretical Foundations of Foundation Models","author":"Das","year":"2024"},{"key":"2025042815044865300_bib11","article-title":"Alpacafarm: A simulation framework for methods that learn from human feedback","author":"Dubois","year":"2023","journal-title":"arXiv preprint arXiv:2305.14387"},{"key":"2025042815044865300_bib12","article-title":"Kto: Model alignment as prospect theoretic optimization","author":"Ethayarajh","year":"2024","journal-title":"arXiv preprint arXiv:2402.01306"},{"key":"2025042815044865300_bib13","doi-asserted-by":"publisher","first-page":"11084","DOI":"10.18653\/v1\/2024.findings-emnlp.648","article-title":"Improving factual consistency of news summarization by contrastive preference optimization","volume-title":"Findings of EMNLP 2024","author":"Feng","year":"2024"},{"key":"2025042815044865300_bib14","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.355","article-title":"Simulating bandit learning from user feedback for extractive question answering","author":"Ge","year":"2022","journal-title":"arXiv preprint arXiv:2203.10079"},{"key":"2025042815044865300_bib15","first-page":"14567","article-title":"Self-explanation prompting improves dialogue understanding in large language models","volume-title":"LREC-COLING 
2024","author":"Gao","year":"2024"},{"key":"2025042815044865300_bib16","first-page":"10835","article-title":"Scaling laws for reward model overoptimization","volume-title":"International Conference on Machine Learning","author":"Gao","year":"2023"},{"key":"2025042815044865300_bib17","doi-asserted-by":"publisher","first-page":"5983","DOI":"10.1609\/aaai.v37i5.25740","article-title":"The effect of modeling human rationality level on learning rewards from multiple feedback types","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Ghosal","year":"2023"},{"key":"2025042815044865300_bib18","article-title":"Uncertainty estimation for language reward models","author":"Gleave","year":"2022","journal-title":"arXiv preprint arXiv:2203.07472"},{"key":"2025042815044865300_bib19","article-title":"Uncertainty in natural language processing: Sources, quantification, and applications","author":"Mengting","year":"2023","journal-title":"arXiv preprint arXiv:2306.04459"},{"key":"2025042815044865300_bib20","first-page":"4415","article-title":"Reward-rational (implicit) choice: A unifying formalism for reward learning","volume":"33","author":"Jeon","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025042815044865300_bib21","article-title":"Binary classifier optimization for large language model alignment","author":"Jung","year":"2024","journal-title":"arXiv preprint arXiv:2404.04656"},{"key":"2025042815044865300_bib22","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/7443.003.0014","article-title":"A tutorial on energy-based learning","author":"LeCun","year":"2006","journal-title":"Predicting Structured Data"},{"key":"2025042815044865300_bib23","article-title":"Api-bank: A benchmark for tool-augmented llms","author":"Li","year":"2023","journal-title":"arXiv preprint arXiv:2304.08244"},{"key":"2025042815044865300_bib24","article-title":"Chain of hindsight aligns language models with 
feedback","volume":"3","author":"Liu","year":"2023","journal-title":"arXiv preprint arXiv:2302.02676v6"},{"key":"2025042815044865300_bib25","article-title":"Training socially aligned language models in simulated human society","author":"Liu","year":"2023","journal-title":"arXiv preprint arXiv:2305.16960"},{"key":"2025042815044865300_bib26","article-title":"Mmevol: Empowering multimodal large language models with evol-instruct","author":"Luo","year":"2024","journal-title":"arXiv preprint arXiv:2409.05840"},{"key":"2025042815044865300_bib27","article-title":"Self-refine: Iterative refinement with self-feedback","author":"Madaan","year":"2023","journal-title":"arXiv preprint arXiv:2303.17651"},{"key":"2025042815044865300_bib28","article-title":"Sample efficient reinforcement learning from human feedback via active exploration","author":"Mehta","year":"2023","journal-title":"arXiv preprint arXiv:2312.00267"},{"key":"2025042815044865300_bib29","article-title":"Deep bayesian active learning for preference modeling in large language models","author":"Melo","year":"2024","journal-title":"arXiv preprint arXiv:2406.10023"},{"key":"2025042815044865300_bib30","article-title":"Scaling data-constrained language models","author":"Muennighoff","year":"2023","journal-title":"arXiv preprint arXiv:2305.16264"},{"key":"2025042815044865300_bib31","article-title":"Webgpt: Browser-assisted question-answering with human feedback","author":"Nakano","year":"2021","journal-title":"arXiv preprint arXiv:2112.09332"},{"key":"2025042815044865300_bib32","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025042815044865300_bib33","doi-asserted-by":"publisher","DOI":"10.15607\/RSS.2019.XV.023","article-title":"Learning reward functions by integrating human demonstrations and 
preferences","author":"Palan","year":"2019","journal-title":"arXiv preprint arXiv:1906.08928"},{"key":"2025042815044865300_bib34","article-title":"Direct preference optimization: Your language model is secretly a reward model","author":"Rafailov","year":"2023","journal-title":"arXiv preprint arXiv:2305.18290"},{"key":"2025042815044865300_bib35","article-title":"Self-critiquing models for assisting human evaluators","author":"Saunders","year":"2022","journal-title":"arXiv preprint arXiv:2206.05802"},{"key":"2025042815044865300_bib36","article-title":"Training language models with natural language feedback","volume":"8","author":"Scheurer","year":"2022","journal-title":"arXiv preprint arXiv:2204.14146"},{"key":"2025042815044865300_bib37","article-title":"Active learning literature survey","author":"Settles","year":"2009"},{"key":"2025042815044865300_bib38","article-title":"When life gives you lemons, make cherryade: Converting feedback from bad responses into good labels","author":"Shi","year":"2022","journal-title":"arXiv preprint arXiv:2210.15893"},{"key":"2025042815044865300_bib39","article-title":"Preference ranking optimization for human alignment","author":"Song","year":"2023","journal-title":"arXiv preprint arXiv:2306.17492"},{"key":"2025042815044865300_bib40","first-page":"3008","article-title":"Learning to summarize with human feedback","volume":"33","author":"Stiennon","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025042815044865300_bib41","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.findings-naacl.26","article-title":"Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback","author":"Tandon","year":"2021","journal-title":"arXiv preprint arXiv:2112.09737"},{"key":"2025042815044865300_bib42","article-title":"A survey on self-evolution of large language models","author":"Tao","year":"2024","journal-title":"arXiv preprint 
arXiv:2404.14387"},{"key":"2025042815044865300_bib43","article-title":"Llama: Open and efficient foundation language models","author":"Touvron","year":"2023","journal-title":"arXiv preprint arXiv:2302.13971"},{"key":"2025042815044865300_bib44","article-title":"Solving math word problems with process-and outcome-based feedback","author":"Uesato","year":"2022","journal-title":"arXiv preprint arXiv:2211.14275"},{"key":"2025042815044865300_bib45","article-title":"Large language models are not fair evaluators","author":"Wang","year":"2023","journal-title":"arXiv preprint arXiv:2305.17926"},{"key":"2025042815044865300_bib46","article-title":"Aligning large language models with human: A survey","author":"Wang","year":"2023","journal-title":"arXiv preprint arXiv:2307.12966"},{"key":"2025042815044865300_bib47","article-title":"Generating sequences by learning to self-correct","author":"Welleck","year":"2022","journal-title":"arXiv preprint arXiv:2211.00053"},{"key":"2025042815044865300_bib48","article-title":"Self-play preference optimization for language model alignment","author":"Yue","year":"2024","journal-title":"arXiv preprint arXiv:2405.00675"},{"key":"2025042815044865300_bib49","article-title":"Fine-grained human feedback gives better rewards for language model training","author":"Zeqiu","year":"2023","journal-title":"arXiv preprint arXiv:2306.01693"},{"key":"2025042815044865300_bib50","article-title":"Judging llm-as-a-judge with mt-bench and chatbot arena","author":"Zheng","year":"2023","journal-title":"arXiv preprint arXiv:2306.05685"},{"key":"2025042815044865300_bib51","article-title":"Principled reinforcement learning with human feedback from pairwise or k-wise comparisons","author":"Zhu","year":"2023","journal-title":"arXiv preprint arXiv:2301.11270"},{"key":"2025042815044865300_bib52","article-title":"Fine-tuning language models from human preferences","author":"Ziegler","year":"2019","journal-title":"arXiv preprint 
arXiv:1909.08593"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00746\/2514577\/tacl_a_00746.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00746\/2514577\/tacl_a_00746.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,28]],"date-time":"2025-04-28T19:05:01Z","timestamp":1745867101000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00746\/128938\/Diverse-AI-Feedback-For-Large-Language-Model"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":52,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00746","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]}}}