{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T03:04:54Z","timestamp":1773803094467,"version":"3.50.1"},"reference-count":0,"publisher":"Association for the Advancement of Artificial Intelligence (AAAI)","issue":"27","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["AAAI"],"abstract":"<jats:p>We study infinite-horizon average-reward reinforcement learning for continuous-space Lipschitz Markov decision processes (MDPs) in which an agent can play policies from a given set \u03a6. The proposed algorithms efficiently explore the policy space by \u201czooming\u201d into the \u201cpromising regions\u201d of \u03a6, thereby achieving adaptivity gains in performance. We upper bound the regret as O\u0303(T^(1 - 1\/d_eff)), where d_eff = d_z^\u03a6 + 2 for our model-free algorithm PZRL-MF and d_eff = 2d_S + d_z^\u03a6 + 3 for our model-based algorithm PZRL-MB. Here, d_S is the dimension of the state space, and d_z^\u03a6 is the zooming dimension given a set of policies \u03a6. d_z^\u03a6 is an alternative measure of the complexity of the problem; it depends on the underlying MDP as well as on \u03a6. Hence, the proposed algorithms exhibit low regret when the problem instance is benign and\/or the agent competes against a low-complexity \u03a6 (one with a small d_z^\u03a6). When specialized to the case of a finite-dimensional policy space, we obtain that d_eff scales as the dimension of this space under mild technical conditions; we also obtain d_eff = 2, or equivalently O\u0303(\u221aT) regret, for PZRL-MF under a curvature condition on the average-reward function that is commonly used in the multi-armed bandit (MAB) literature.<\/jats:p>","DOI":"10.1609\/aaai.v40i27.39412","type":"journal-article","created":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T01:32:35Z","timestamp":1773797555000},"page":"22527-22535","source":"Crossref","is-referenced-by-count":0,"title":["Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning"],"prefix":"10.1609","volume":"40","author":[{"given":"Avik","family":"Kar","sequence":"first","affiliation":[]},{"given":"Rahul","family":"Singh","sequence":"additional","affiliation":[]}],"member":"9382","published-online":{"date-parts":[[2026,3,14]]},"container-title":["Proceedings of the AAAI Conference on Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/39412\/43373","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/39412\/43373","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T01:32:35Z","timestamp":1773797555000},"score":1,"resource":{"primary":{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/view\/39412"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,14]]},"references-count":0,"journal-issue":{"issue":"27","published-online":{"date-parts":[[2026,3,17]]}},"URL":"https:\/\/doi.org\/10.1609\/aaai.v40i27.39412","relation":{},"ISSN":["2374-3468","2159-5399"],"issn-type":[{"value":"2374-3468","type":"electronic"},{"value":"2159-5399","type":"print"}],"subject":[],"published":{"date-parts":[[2026,3,14]]}}}