{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T17:23:28Z","timestamp":1772645008836,"version":"3.50.1"},"reference-count":58,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,2,10]],"date-time":"2025-02-10T00:00:00Z","timestamp":1739145600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Robot. AI"],"abstract":"<jats:p>The automatic synthesis of policies for robotics systems through reinforcement learning relies upon, and is intimately guided by, a reward signal. Consequently, this signal should faithfully reflect the designer\u2019s intentions, which are often expressed as a collection of high-level requirements. Several works have been developing automated reward definitions from formal requirements, but they show limitations in producing a signal which is both effective in training and able to fulfill multiple heterogeneous requirements. In this paper, we define a task as a partially ordered set of safety, target, and comfort requirements and introduce an automated methodology to enforce a natural order among requirements into the reward signal. We perform this by automatically translating the requirements into a sum of safety, target, and comfort rewards, where the target reward is a function of the safety reward and the comfort reward is a function of the safety and target rewards. Using a potential-based formulation, we enhance sparse to dense rewards and formally prove this to maintain policy optimality. We call our novel approach hierarchical, potential-based reward shaping (HPRS). Our experiments on eight robotics benchmarks demonstrate that HPRS is able to generate policies satisfying complex hierarchical requirements. Moreover, compared with the state of the art, HPRS achieves faster convergence and superior performance with respect to the rank-preserving policy-assessment metric. By automatically balancing competing requirements, HPRS produces task-satisfying policies with improved comfort and without manual parameter tuning. Through ablation studies, we analyze the impact of individual requirement classes on emergent behavior. Our experiments show that HPRS benefits from comfort requirements when aligned with the target and safety and ignores them when in conflict with the safety or target requirements. Finally, we validate the practical usability of HPRS in real-world robotics applications, including two sim-to-real experiments using F1TENTH vehicles. 
These experiments show that a hierarchical design of task specifications facilitates the sim-to-real transfer without any domain adaptation.<\/jats:p>","DOI":"10.3389\/frobt.2024.1444188","type":"journal-article","created":{"date-parts":[[2025,2,10]],"date-time":"2025-02-10T08:58:26Z","timestamp":1739177906000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["HPRS: hierarchical potential-based reward shaping from task specifications"],"prefix":"10.3389","volume":"11","author":[{"given":"Luigi","family":"Berducci","sequence":"first","affiliation":[]},{"given":"Edgar A.","family":"Aguilar","sequence":"additional","affiliation":[]},{"given":"Dejan","family":"Ni\u010dkovi\u0107","sequence":"additional","affiliation":[]},{"given":"Radu","family":"Grosu","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,2,10]]},"reference":[{"key":"B1","first-page":"11","article-title":"Dynamic weights in multi-objective deep reinforcement learning","volume-title":"International conference on machine learning","author":"Abels","year":"2019"},{"key":"B2","article-title":"Concrete problems in ai safety","author":"Amodei","year":"2016"},{"key":"B3","first-page":"3481","article-title":"Structured reward shaping using signal temporal logic specifications","author":"Balakrishnan","year":"2019"},{"key":"B4","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1145\/1390156.1390162","article-title":"Learning all optimal policies with multiple criteria","volume-title":"Proceedings of the 25th international conference on Machine learning","author":"Barrett","year":"2008"},{"key":"B5","doi-asserted-by":"publisher","first-page":"341","DOI":"10.1023\/a:1025696116075","article-title":"Recent advances in hierarchical reinforcement learning","volume":"13","author":"Barto","year":"2003","journal-title":"Discrete event Dyn. Syst."},{"key":"B6","article-title":"Hierarchical potential-based reward shaping from task specifications","author":"Berducci","year":"2021"},{"key":"B7","doi-asserted-by":"crossref","first-page":"360","DOI":"10.1007\/978-3-031-19849-6_21","article-title":"Safe policy improvement in constrained markov decision processes","volume-title":"Leveraging applications of formal methods, verification and validation. 
Verification principles: 11th international symposium, ISoLA 2022, rhodes, Greece, october 22\u201330, 2022, proceedings, Part I","author":"Berducci","year":"2022"},{"key":"B8","doi-asserted-by":"crossref","first-page":"7513","DOI":"10.1109\/ICRA46639.2022.9811650","article-title":"Latent imagination facilitates zero-shot transfer in autonomous racing","volume-title":"2022 international conference on robotics and automation (ICRA)","author":"Brunnbauer","year":"2022"},{"key":"B9","doi-asserted-by":"crossref","first-page":"2315","DOI":"10.1109\/IJCNN.2014.6889732","article-title":"Multi-objectivization of reinforcement learning problems by reward shaping","volume-title":"2014 international joint conference on neural networks (IJCNN)","author":"Brys","year":"2014"},{"key":"B10","article-title":"Non-markovian rewards expressed in ltl: guiding search via reward shaping","volume-title":"SOCS","author":"Camacho","year":"2017"},{"key":"B11","first-page":"8536","article-title":"Liability, ethics, and culture-aware behavior specification using rulebooks","author":"Censi","year":"2019"},{"key":"B12","article-title":"Deep reinforcement learning from human preferences","volume":"30","author":"Christiano","year":"2017","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B13","article-title":"Robust satisfaction of temporal logic specifications via reinforcement learning","author":"[Dataset] Jones","year":"2015"},{"key":"B14","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.8127026","article-title":"Gymnasium","author":"Towers","year":"2023"},{"key":"B15","first-page":"20118","article-title":"Explicable reward design for reinforcement learning agents","volume":"34","author":"Devidze","year":"2021","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B16","first-page":"1329","article-title":"Benchmarking deep reinforcement learning for continuous control","author":"Duan","year":"2016"},{"key":"B17","article-title":"Pybullet a python module for physics simulation for games","author":"Erwin","year":"2016","journal-title":"PyBullet"},{"key":"B18","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1007\/978-3-030-41188-6_3","article-title":"Reward function design in reinforcement learning","author":"Eschmann","year":"2021","journal-title":"Reinf. Learn. Algorithms Analysis Appl."},{"key":"B19","doi-asserted-by":"crossref","DOI":"10.15607\/RSS.2014.X.039","article-title":"Probably approximately correct MDP learning and control with temporal logic constraints","volume-title":"Robotics: science and systems X","author":"Fu","year":"2014"},{"key":"B20","first-page":"197","article-title":"Multi-criteria reinforcement learning","author":"G\u00e1bor","year":"1998"},{"key":"B21","doi-asserted-by":"publisher","first-page":"26","DOI":"10.1007\/s10458-022-09552-y","article-title":"A practical guide to multi-objective reinforcement learning and planning","volume":"36","author":"Hayes","year":"2022","journal-title":"Aut. Agents Multi-Agent Syst."},{"key":"B22","first-page":"1271","article-title":"Real-time loop closure in 2d lidar slam","author":"Hess","year":"2016"},{"key":"B23","first-page":"1","article-title":"Cleanrl: high-quality single-file implementations of deep reinforcement learning algorithms","volume":"23","author":"Huang","year":"2022","journal-title":"J. Mach. Learn. 
Res."},{"key":"B24","first-page":"2107","article-title":"Using reward machines for high-level task specification and decomposition in reinforcement learning","author":"Icarte","year":"2018"},{"key":"B25","article-title":"A composable specification language for reinforcement learning tasks","volume-title":"Advances in neural information processing systems","author":"Jothimurugan","year":"2019"},{"key":"B26","first-page":"13906","article-title":"Compositional reinforcement learning from logical specifications","author":"Jothimurugan","year":"2021","journal-title":"Corr. abs\/2106"},{"key":"B27","first-page":"440","article-title":"The influence of reward on the speed of reinforcement learning: an analysis of shaping","author":"Laud","year":"2003"},{"key":"B28","doi-asserted-by":"crossref","first-page":"240","DOI":"10.23919\/ACC.2018.8431181","article-title":"A policy search method for temporal logic specified reinforcement learning tasks","volume-title":"2018 annual American control conference (ACC)","author":"Li","year":"2018"},{"key":"B29","first-page":"3834","article-title":"Reinforcement learning with temporal logic rewards","author":"Li","year":"2017"},{"key":"B30","article-title":"Continuous control with deep reinforcement learning","author":"Lillicrap","year":"2016"},{"key":"B31","doi-asserted-by":"publisher","first-page":"385","DOI":"10.1109\/TSMC.2014.2358639","article-title":"Multiobjective reinforcement learning: a comprehensive overview","volume":"45","author":"Liu","year":"2015","journal-title":"IEEE Trans. Syst. Man, Cybern. Syst."},{"key":"B32","doi-asserted-by":"crossref","first-page":"152","DOI":"10.1007\/978-3-540-30206-3_12","article-title":"Monitoring temporal properties of continuous signals","volume-title":"Formal techniques, modelling and Analysis of timed and fault-tolerant systems","author":"Maler","year":"2004"},{"key":"B33","first-page":"1690","article-title":"Arithmetic-geometric mean robustness for control from signal temporal logic specifications","author":"Mehdipour","year":"2019"},{"key":"B34","doi-asserted-by":"publisher","first-page":"2006","DOI":"10.1109\/lcsys.2020.3047362","article-title":"Specifying user preferences using weighted signal temporal logic","volume":"5","author":"Mehdipour","year":"2020","journal-title":"IEEE Control Syst. 
Lett."},{"key":"B35","doi-asserted-by":"publisher","first-page":"3740","DOI":"10.1109\/LRA.2023.3270034","article-title":"Orbit: a unified simulation framework for interactive robot learning environments","volume":"8","author":"Mittal","year":"2023","journal-title":"IEEE Robotics Automation Lett."},{"key":"B36","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","article-title":"Human-level control through deep reinforcement learning","volume":"518","author":"Mnih","year":"2015","journal-title":"nature"},{"key":"B37","first-page":"601","article-title":"Dynamic preferences in multi-criteria reinforcement learning","author":"Natarajan","year":"2005"},{"key":"B38","first-page":"278","article-title":"Policy invariance under reward transformations: theory and application to reward shaping","author":"Ng","year":"1999"},{"key":"B39","doi-asserted-by":"crossref","first-page":"564","DOI":"10.1007\/978-3-030-59152-6_34","article-title":"Rtamt: online robustness monitors from stl","volume-title":"International symposium on automated Technology for verification and Analysis","author":"Ni\u010dkovi\u0107","year":"2020"},{"key":"B40","article-title":"Isaac sim - robotics simulation and synthetic data generation","year":"2023","journal-title":"NVIDIA"},{"key":"B41","article-title":"F1tenth: an open-source evaluation environment for continuous control and reinforcement learning","volume":"123","author":"O\u2019Kelly","year":"2020","journal-title":"Proc. Mach. Learn. Res."},{"key":"B42","doi-asserted-by":"publisher","first-page":"6250","DOI":"10.1109\/LRA.2021.3092676","article-title":"Learning from demonstrations using signal temporal logic in stochastic and continuous domains","volume":"6","author":"Puranic","year":"2021","journal-title":"IEEE Robotics Automation Lett."},{"key":"B43","unstructured":"Rl baselines3 zoo\n          \n          \n            \n              Raffin\n              A.\n            \n          \n          \n          2020"},{"key":"B44","unstructured":"Stable baselines3\n          \n          \n            \n              Raffin\n              A.\n            \n            \n              Hill\n              A.\n            \n            \n              Ernestus\n              M.\n            \n            \n              Gleave\n              A.\n            \n            \n              Kanervisto\n              A.\n            \n            \n              Dormann\n              N.\n            \n          \n          \n          2019"},{"key":"B45","first-page":"11","article-title":"Temporal logic as filtering","author":"Rodionova","year":"2016"},{"key":"B46","doi-asserted-by":"publisher","first-page":"67","DOI":"10.1613\/jair.3987","article-title":"A survey of multi-objective sequential decision-making","volume":"48","author":"Roijers","year":"2013","journal-title":"J. Artif. Int. 
Res."},{"key":"B47","volume-title":"Artificial intelligence: a modern approach","author":"Russell","year":"2020"},{"key":"B48","article-title":"Balancing multiple sources of reward in reinforcement learning","volume-title":"Advances in neural information processing systems","author":"Shelton","year":"2001"},{"key":"B49","doi-asserted-by":"publisher","first-page":"354","DOI":"10.1038\/nature24270","article-title":"Mastering the game of go without human knowledge","volume":"550","author":"Silver","year":"2017","journal-title":"nature"},{"key":"B50","volume-title":"Reinforcement learning: an introduction","author":"Sutton","year":"2018"},{"key":"B51","first-page":"23","volume-title":"Domain randomization for transferring deep neural networks from simulation to the real world","author":"Tobin","year":"2017"},{"key":"B52","first-page":"452","article-title":"Teaching multiple tasks to an rl agent using ltl","author":"Toro Icarte","year":"2018"},{"key":"B53","doi-asserted-by":"publisher","first-page":"9477","DOI":"10.1109\/tpami.2021.3127674","article-title":"Pharmacological, non-pharmacological policies and mutation: an artificial intelligence based multi-dimensional policy making algorithm for controlling the casualties of the pandemic diseases","volume":"44","author":"Tutsoy","year":"2021","journal-title":"IEEE Trans. Pattern Analysis Mach. Intell."},{"key":"B54","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1109\/ADPRL.2013.6615007","article-title":"Scalarized multi-objective reinforcement learning: novel design techniques","volume-title":"2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL)","author":"Van Moffaert","year":"2013"},{"key":"B55","first-page":"1507","article-title":"Receding horizon planning with rule hierarchies for autonomous vehicles","author":"Veer","year":"2023"},{"key":"B56","doi-asserted-by":"publisher","first-page":"205","DOI":"10.1613\/jair.1190","article-title":"Potential-based shaping and q-value initialization are equivalent","volume":"19","author":"Wiewiora","year":"2003","journal-title":"J. Artif. Intell. Res."},{"key":"B57","first-page":"143","article-title":"Rule-based optimal control for autonomous driving","author":"Xiao","year":"2021"},{"key":"B58","doi-asserted-by":"crossref","first-page":"3190","DOI":"10.1109\/WCICA.2010.5553980","article-title":"Multi-objective reinforcement learning algorithm for mosdmp in unknown environment","volume-title":"2010 8th world congress on intelligent control and automation","author":"Zhao","year":"2010"}],"container-title":["Frontiers in Robotics and AI"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frobt.2024.1444188\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,10]],"date-time":"2025-02-10T08:58:37Z","timestamp":1739177917000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frobt.2024.1444188\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,10]]},"references-count":58,"alternative-id":["10.3389\/frobt.2024.1444188"],"URL":"https:\/\/doi.org\/10.3389\/frobt.2024.1444188","relation":{},"ISSN":["2296-9144"],"issn-type":[{"value":"2296-9144","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,10]]},"article-number":"1444188"}}