{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T03:20:56Z","timestamp":1740108056076,"version":"3.37.3"},"reference-count":30,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2021,9,28]],"date-time":"2021-09-28T00:00:00Z","timestamp":1632787200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,9,28]],"date-time":"2021-09-28T00:00:00Z","timestamp":1632787200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100013373","name":"Alberta Machine Intelligence Institute","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100013373","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002790","name":"Canadian Network for Research and Innovation in Machining Technology, Natural Sciences and Engineering Research Council of Canada","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100002790","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100007631","name":"Canadian Institute for Advanced Research","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100007631","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput &amp; Applic"],"published-print":{"date-parts":[[2022,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p><jats:italic>Reinforcement learning<\/jats:italic>(RL) is a powerful learning paradigm in which agents can learn to maximize sparse and delayed reward signals. Although RL has had many impressive successes in complex domains, learning can take hours, days, or even years of training data. A major challenge of contemporary RL research is to discover how to learn with less data. Previous work has shown that domain information can be successfully used to shape the reward; by adding additional reward information, the agent can learn with much less data. Furthermore, if the reward is constructed from a potential function, the optimal policy is guaranteed to be unaltered. While such<jats:italic>potential-based reward shaping<\/jats:italic>(PBRS) holds promise, it is limited by the need for a well-defined potential function. Ideally, we would like to be able to take arbitrary advice from a human or other agent and improve performance without affecting the optimal policy. The recently introduced<jats:italic>dynamic potential-based advice<\/jats:italic>(DPBA) was proposed to tackle this challenge by predicting the potential function values as part of the learning process. However, this article demonstrates theoretically and empirically that, while DPBA can facilitate learning with good advice, it does in fact alter the optimal policy. We further show that when adding the correction term to \u201cfix\u201d DPBA it no longer shows effective shaping with good advice. 
We then present a simple method called<jats:italic>policy invariant explicit shaping<\/jats:italic>(PIES) and show theoretically and empirically that PIES can use arbitrary advice, speed-up learning, and leave the optimal policy unchanged.<\/jats:p>","DOI":"10.1007\/s00521-021-06259-1","type":"journal-article","created":{"date-parts":[[2021,9,28]],"date-time":"2021-09-28T12:03:15Z","timestamp":1632830595000},"page":"1673-1686","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Policy invariant explicit shaping: an efficient alternative to reward shaping"],"prefix":"10.1007","volume":"34","author":[{"given":"Paniz","family":"Behboudian","sequence":"first","affiliation":[]},{"given":"Yash","family":"Satsangi","sequence":"additional","affiliation":[]},{"given":"Matthew E.","family":"Taylor","sequence":"additional","affiliation":[]},{"given":"Anna","family":"Harutyunyan","sequence":"additional","affiliation":[]},{"given":"Michael","family":"Bowling","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,9,28]]},"reference":[{"key":"6259_CR1","doi-asserted-by":"crossref","unstructured":"Bianchi RA, Ribeiro CH, Costa AH (2004) Heuristically accelerated Q\u2013learning: a new approach to speed up reinforcement learning. In: Brazilian symposium on artificial intelligence. Springer, pp. 245\u2013254","DOI":"10.1007\/978-3-540-28645-5_25"},{"key":"6259_CR2","unstructured":"Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, Zaremba W(2016) OpenAI Gym. arXiv preprint arXiv:1606.01540"},{"key":"6259_CR3","doi-asserted-by":"publisher","first-page":"48","DOI":"10.1016\/j.neucom.2017.02.096","volume":"263","author":"T Brys","year":"2017","unstructured":"Brys T, Harutyunyan A, Vrancx P, Now\u00e9 A, Taylor ME (2017) Multi-objectivization and ensembles of shapings in reinforcement learning. Neurocomputing 263:48\u201359","journal-title":"Neurocomputing"},{"key":"6259_CR4","doi-asserted-by":"crossref","unstructured":"Brys T, Harutyunyan A, Vrancx P, Taylor ME, Kudenko D, Now\u00e9 A (2014) Multi-objectivization of reinforcement learning problems by reward shaping. In: 2014 international joint conference on neural networks. IEEE, pp. 2315\u20132322","DOI":"10.1109\/IJCNN.2014.6889732"},{"key":"6259_CR5","unstructured":"Devlin SM, Kudenko D (2012) Dynamic potential-based reward shaping. In: Proceedings of the 11th international conference on autonomous agents and multiagent systems, pp. 433\u2013440"},{"issue":"4","key":"6259_CR6","doi-asserted-by":"publisher","first-page":"541","DOI":"10.1016\/j.neunet.2010.01.001","volume":"23","author":"M Grze\u015b","year":"2010","unstructured":"Grze\u015b M, Kudenko D (2010) Online learning of shaping rewards in reinforcement learning. Neural Netw 23(4):541\u2013550","journal-title":"Neural Netw"},{"key":"6259_CR7","doi-asserted-by":"crossref","unstructured":"Gullapalli V, Barto AG (1992) Shaping as a method for accelerating reinforcement learning. In: Proceedings of the 1992 IEEE international symposium on intelligent control, pp. 554\u2013559","DOI":"10.1109\/ISIC.1992.225046"},{"key":"6259_CR8","doi-asserted-by":"crossref","unstructured":"Harutyunyan A, Devlin S, Vrancx P, Now\u00e9 A (2015) Expressing arbitrary reward functions as potential-based advice. In: The association for the advancement of artificial intelligence, pp. 
2652\u20132658","DOI":"10.1609\/aaai.v29i1.9628"},{"key":"6259_CR9","unstructured":"Knox WB, Fasel IR, Stone P (2009) Design principles for creating human-shapable agents. In: The association for the advancement of artificial intelligence spring symposium: agents that learn from human teachers, pp. 79\u201386"},{"key":"6259_CR10","unstructured":"Knox WB, Stone P (2008) TAMER: Training an agent manually via evaluative reinforcement. In: 2008 7th IEEE international conference on development and learning, pp. 292\u2013297"},{"key":"6259_CR11","doi-asserted-by":"crossref","unstructured":"Knox WB, Stone P (2009) Interactively shaping agents via human reinforcement: the TAMER framework. In: Proceedings of the 5th international conference on knowledge capture. ACM, pp. 9\u201316","DOI":"10.1145\/1597735.1597738"},{"key":"6259_CR12","unstructured":"Knox WB, Stone P (2010) Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In: Proceedings of the 9th international conference on autonomous agents and multiagent systems: Vol. 1, pp. 5\u201312"},{"key":"6259_CR13","unstructured":"Knox WB, Stone P (2012) Reinforcement learning from simultaneous human and MDP reward. In: Proceedings of the 11th international conference on autonomous agents and multiagent systems-vol. 1, pp. 475\u2013482"},{"key":"6259_CR14","doi-asserted-by":"crossref","unstructured":"Marom O, Rosman B (2018) Belief Reward Shaping in Reinforcement Learning. Proc AAAI Conf Artif Intell 32(1)","DOI":"10.1609\/aaai.v32i1.11741"},{"key":"6259_CR15","doi-asserted-by":"crossref","unstructured":"Marthi B (2007) Automatic shaping and decomposition of reward functions. In: Proceedings of the 24th international conference on machine learning. ACM, pp. 601\u2013608","DOI":"10.1145\/1273496.1273572"},{"issue":"2","key":"6259_CR16","first-page":"137","volume":"2","author":"D Michie","year":"1968","unstructured":"Michie D, Chambers RA (1968) Boxes: an experiment in adaptive control. Mach Intell 2(2):137\u2013152","journal-title":"Mach Intell"},{"key":"6259_CR17","unstructured":"Moore A (2002) Efficient memory-based learning for robot control"},{"key":"6259_CR18","first-page":"278","volume":"99","author":"AY Ng","year":"1999","unstructured":"Ng AY, Harada D, Russell S (1999) Policy invariance under reward transformations: theory and application to reward shaping. Int Conf Mach Learn 99:278\u2013287","journal-title":"Int Conf Mach Learn"},{"key":"6259_CR19","unstructured":"Ng AY, Jordan MI (2003) Shaping and policy search in reinforcement learning. Ph.D. thesis, University of California, Berkeley"},{"key":"6259_CR20","unstructured":"OpenAI (2018) OpenAI Five. https:\/\/blog.openai.com\/openai-five\/"},{"key":"6259_CR21","volume-title":"Markov decision processes: discrete stochastic dynamic programming","author":"ML Puterman","year":"2014","unstructured":"Puterman ML (2014) Markov decision processes: discrete stochastic dynamic programming. Wiley, New Jersey"},{"key":"6259_CR22","first-page":"463","volume":"98","author":"J Randl\u00f8v","year":"1998","unstructured":"Randl\u00f8v J, Alstr\u00f8m P (1998) Learning to drive a bicycle using reinforcement learning and shaping. 
Int Conf Mach Learn 98:463\u2013471","journal-title":"Int Conf Mach Learn"},{"issue":"3","key":"6259_CR23","doi-asserted-by":"publisher","first-page":"287","DOI":"10.1023\/A:1007678930559","volume":"38","author":"S Singh","year":"2000","unstructured":"Singh S, Jaakkola T, Littman ML, Szepesv\u00e1ri C (2000) Convergence results for single-step on-policy reinforcement-learning algorithms. Mach Learn 38(3):287\u2013308","journal-title":"Mach Learn"},{"issue":"3","key":"6259_CR24","doi-asserted-by":"publisher","first-page":"94","DOI":"10.1037\/h0049039","volume":"13","author":"BF Skinner","year":"1958","unstructured":"Skinner BF (1958) Reinforcement today. Am Psychol 13(3):94","journal-title":"Am Psychol"},{"key":"6259_CR25","unstructured":"Sutton RS (1996) Generalization in reinforcement learning: successful examples using sparse coarse coding. In: Advances in neural information processing systems, pp. 1038\u20131044"},{"key":"6259_CR26","unstructured":"Sutton RS, Barto AG (2018) Reinforcement learning: an introduction, 2nd edn. The MIT Press. http:\/\/incompleteideas.net\/book\/the-book-2nd.html"},{"key":"6259_CR27","doi-asserted-by":"publisher","first-page":"350","DOI":"10.1038\/s41586-019-1724-z","volume":"575","author":"O Vinyals","year":"2019","unstructured":"Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, Choi DH, Powell R, Ewalds T, Georgiev P, Oh J, Horgan D, Kroiss M, Danihelka I, Huang A, Sifre L, Cai T, Agapiou JP, Jaderberg M, Vezhnevets AS, Leblond R, Pohlen T, Dalibard V, Budden D, Sulsky Y, Molloy J, Paine TL, Gulcehre C, Wang Z, Pfaff T, Wu Y, Ring R, Yogatama D, W\u00fcnsch D, McKinney K, Smith O, Schaul T, Lillicrap T, Kavukcuoglu K, Hassabis D, Apps C, Silver D (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575:350\u2013354","journal-title":"Nature"},{"issue":"3\u20134","key":"6259_CR28","first-page":"279","volume":"8","author":"CJ Watkins","year":"1992","unstructured":"Watkins CJ, Dayan P (1992) Q-learning. Mach Learn 8(3\u20134):279\u2013292","journal-title":"Mach Learn"},{"key":"6259_CR29","doi-asserted-by":"publisher","first-page":"205","DOI":"10.1613\/jair.1190","volume":"19","author":"E Wiewiora","year":"2003","unstructured":"Wiewiora E (2003) Potential-based shaping and Q-value initialization are equivalent. J Artif Intell Res 19:205\u2013208","journal-title":"J Artif Intell Res"},{"key":"6259_CR30","unstructured":"Wiewiora E, Cottrell GW, Elkan C (2003) Principled methods for advising reinforcement learning agents. In: Proceedings of the 20th international conference on machine learning, pp. 
792\u2013799"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-021-06259-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-021-06259-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-021-06259-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,10]],"date-time":"2023-01-10T16:41:34Z","timestamp":1673368894000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-021-06259-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,28]]},"references-count":30,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2022,2]]}},"alternative-id":["6259"],"URL":"https:\/\/doi.org\/10.1007\/s00521-021-06259-1","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"type":"print","value":"0941-0643"},{"type":"electronic","value":"1433-3058"}],"subject":[],"published":{"date-parts":[[2021,9,28]]},"assertion":[{"value":"16 November 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 June 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 September 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}