{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T03:04:40Z","timestamp":1773803080258,"version":"3.50.1"},"reference-count":0,"publisher":"Association for the Advancement of Artificial Intelligence (AAAI)","issue":"26","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["AAAI"],"abstract":"<jats:p>As a data-driven learning approach, model-based offline reinforcement learning (MORL) aims to learn a policy by exploiting a dynamics model derived from an existing dataset. Applying conservative quantification to the dynamics model, most existing works on MORL generate trajectories that approximate the real data distribution to facilitate policy learning. However, these methods typically overlook the influence of historical information on environmental dynamics, thus generating unreliable trajectories that fail to align with the true data distribution. In this paper, we propose a new MORL algorithm called Reliability-guaranteed and Reward-seeking Transformer (RT). RT can avoid generating unreliable trajectories through the calculation of cumulative reliability of the trajectories, which is a weighted variational distance between the generated trajectory distribution and the true data distribution.\nMoreover, by sampling candidate actions with high rewards, RT can efficiently generate high-reward trajectories from the existing offline data, thereby further facilitating policy learning. We theoretically prove the performance guarantees of RT in policy learning, and empirically demonstrate its effectiveness against state-of-the-art model-based methods on several offline benchmark tasks and a large-scale industrial dataset from an on-demand food delivery platform.<\/jats:p>","DOI":"10.1609\/aaai.v40i26.39315","type":"journal-article","created":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T01:29:50Z","timestamp":1773797390000},"page":"21654-21662","source":"Crossref","is-referenced-by-count":0,"title":["Reliability-Guaranteed and Reward-Seeking Sequence Modeling for Model-Based Offline Reinforcement Learning"],"prefix":"10.1609","volume":"40","author":[{"given":"Shenghong","family":"He","sequence":"first","affiliation":[]},{"given":"Chao","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Qian","family":"Lin","sequence":"additional","affiliation":[]},{"given":"Yile","family":"Liang","sequence":"additional","affiliation":[]},{"given":"Donghui","family":"Li","sequence":"additional","affiliation":[]},{"given":"Xuetao","family":"Ding","sequence":"additional","affiliation":[]}],"member":"9382","published-online":{"date-parts":[[2026,3,14]]},"container-title":["Proceedings of the AAAI Conference on Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/39315\/43276","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/download\/39315\/43276","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T01:29:50Z","timestamp":1773797390000},"score":1,"resource":{"primary":{"URL":"https:\/\/ojs.aaai.org\/index.php\/AAAI\/article\/view\/39315"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,14]]},"references-count":0,"journal-issue":{"issue":"26","published-online":{"date-parts":[[2026,3,17]]}},"URL":"https:\/\/doi.org\/10.1609\/aaai.v40i26.39315","relation":{},"ISSN":["2374-3468","2159-5399"],"issn-type":[{"value":"2374-3468","type":"electronic"},{"value":"2159-5399","type":"print"}],"subject":[],"published":{"date-parts":[[2026,3,14]]}}}