{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T03:17:30Z","timestamp":1773803850915,"version":"3.50.1"},"reference-count":0,"publisher":"Association for the Advancement of Artificial Intelligence (AAAI)","issue":"31","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["AAAI"],"abstract":"<jats:p>Evaluating reward models is a fundamental challenge in Reinforcement Learning (RL), particularly in settings where the reward model is learned or manually designed. The standard paradigm for Reward Model Evaluation (RME) involves training an optimal policy via RL on the given reward model and assessing model quality through the performance of the resulting policy. However, this approach conflates the quality of the reward model with the effectiveness of RL training, and is computationally expensive due to the need for policy optimization. Recent RME methods attempt to circumvent this issue by evaluating reward models directly, without RL, but often rely on impractical assumptions such as access to a ground-truth reward or fail to utilize available supervision in a fine-grained manner. To overcome these limitations, we propose the Policy Preference Alignment Coefficient (PPAC), a novel metric for RME that requires neither RL training nor ground-truth rewards. PPAC first generates a sequence of automatically ranked policy preferences that guarantee monotonic improvement in the policy value, and then quantifies the alignment between these generated preferences and those implied by the candidate reward model. Experimental results across gridworld and continuous control tasks demonstrate that PPAC yields preference sequences with consistently increasing policy values and outperforms existing metrics in evaluating reward model quality.<\/jats:p>","DOI":"10.1609\/aaai.v40i31.39815","type":"journal-article","created":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T02:07:32Z","timestamp":1773799652000},"page":"26124-26132","source":"Crossref","is-referenced-by-count":0,"title":["Reward Model Evaluation via Automatically-Ranked Policy Alignment"],"prefix":"10.1609","volume":"40","author":[{"given":"Aoran","family":"Wang","sequence":"first","affiliation":[]},{"given":"Lei","family":"Ou","sequence":"additional","affiliation":[]},{"given":"Yang","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Zongzhang","family":"Zhang","sequence":"additional","affiliation":[]}],"member":"9382","published-online":{"date-parts":[[2026,3,14]]},"container-title":["Proceedings of the AAAI Conference on Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/39815\/43776","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/39815\/43776","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T02:07:32Z","timestamp":1773799652000},"score":1,"resource":{"primary":{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/view\/39815"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,14]]},"references-count":0,"journal-issue":{"issue":"31","published-online":{"date-parts":[[2026,3,17]]}},"URL":"https:\/\/doi.org\/10.1609\/aaai.v40i31.39815","relation":{},"ISSN":["2374-3468","2159-5399"],"issn-type":[{"value":"2374-3468","type":"electronic"},{"value":"2159-5399","type":"print"}],"subject":[],"published":{"date-parts":[[2026,3,14]]}}}