{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T18:28:30Z","timestamp":1767637710726,"version":"3.48.0"},"reference-count":24,"publisher":"Maximum Academic Press","license":[{"start":{"date-parts":[[2019,1,1]],"date-time":"2019-01-01T00:00:00Z","timestamp":1546300800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["The Knowledge Engineering Review"],"published-print":{"date-parts":[[2019]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent\u2019s performance in early learning episodes. Potential-based reward shaping can help to resolve the aforementioned issue of sparse reward by incorporating an expert\u2019s domain knowledge into the learning through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstration to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that significantly further speeds up the learning. An introspective RL agent records its state\u2013action decisions and experience during learning in a priority queue. Good quality decisions, according to a Monte Carlo estimation, will be kept in the queue, while poorer decisions will be rejected. The queue is then used as demonstration to speed up RL via reward shaping. A human expert\u2019s demonstration can be used to initialize the priority queue before the learning process starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms non-introspective RL and state-of-the-art approaches in RLfD in both domains.<\/jats:p>","DOI":"10.1017\/s0269888919000031","type":"journal-article","created":{"date-parts":[[2019,7,12]],"date-time":"2019-07-12T04:18:43Z","timestamp":1562905123000},"source":"Crossref","is-referenced-by-count":2,"title":["Introspective\n                    <i>Q<\/i>\n                    -learning and learning from demonstration"],"prefix":"10.48130","volume":"34","author":[{"given":"Mao","family":"Li","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tim","family":"Brys","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daniel","family":"Kudenko","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"27968","published-online":{"date-parts":[[2019,1,1]]},"reference":[{"key":"S0269888919000031_ref24","unstructured":"Wiewiora, E. , Cottrell, G. & Elkan, C. 2003. Principled methods for advising reinforcement learning agents. In ICML. 792\u2013799."},{"volume-title":"Learning from Delayed Rewards.","year":"1989","author":"Watkins","key":"S0269888919000031_ref23"},{"key":"S0269888919000031_ref10","first-page":"278","article-title":"Policy invariance under reward transformations: theory and application to reward shaping","volume":"99","author":"Ng","year":"1999","journal-title":"Proceedings of the Sixteenth International Conference on Machine Learning"},{"key":"S0269888919000031_ref7","doi-asserted-by":"publisher","DOI":"10.1109\/TCIAIG.2012.2188528"},{"key":"S0269888919000031_ref5","first-page":"433","volume-title":"Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems","volume":"1","author":"Devlin","year":"2012"},{"key":"S0269888919000031_ref4","unstructured":"Brys, T. , Harutyunyan, A. , Suay, H. B. , Chernova, S. , Taylor, M. E. & Now\u00e9, A. 2015. Reinforcement learning from demonstration through shaping. In IJCAI. 3352\u20133358."},{"key":"S0269888919000031_ref8","unstructured":"Mataric, M. J. 1994. Reward functions for accelerated learning. In Machine Learning: Proceedings of the Eleventh International Conference, 181\u2013189."},{"key":"S0269888919000031_ref6","doi-asserted-by":"crossref","unstructured":"Harutyunyan, A. , Devlin, S. , Vrancx, P. & Now\u00e9, A. 2015. Expressing arbitrary reward functions as potential-based advice. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.","DOI":"10.1609\/aaai.v29i1.9628"},{"key":"S0269888919000031_ref16","doi-asserted-by":"publisher","DOI":"10.1007\/BF00114726"},{"key":"S0269888919000031_ref13","first-page":"1040","article-title":"Learning from demonstration","volume":"9","author":"Schaal","year":"1997","journal-title":"Advances in Neural Information Processing Systems"},{"key":"S0269888919000031_ref2","unstructured":"Bellemare, M. , Srinivasan, S. , Ostrovski, G. , Schaul, T. , Saxton, D. , & Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Proceedings of the 30th Conference on Advances in Neural Information Processing Systems (pp. 1471\u20131479)."},{"key":"S0269888919000031_ref19","volume-title":"Reinforcement Learning: An Introduction","volume":"1","author":"Sutton","year":"1998"},{"key":"S0269888919000031_ref3","unstructured":"Brys, T. 2016. Reinforcement Learning with Heuristic Information. PhD thesis, Vrije Universiteit Brussel."},{"key":"S0269888919000031_ref1","doi-asserted-by":"publisher","DOI":"10.1016\/j.robot.2008.10.024"},{"key":"S0269888919000031_ref9","first-page":"137","article-title":"Boxes: an experiment in adaptive control","volume":"2","author":"Michie","year":"1968","journal-title":"Machine Intelligence"},{"key":"S0269888919000031_ref11","first-page":"663","article-title":"Algorithms for inverse reinforcement learning","volume":"1","author":"Ng","year":"2000","journal-title":"ICML"},{"key":"S0269888919000031_ref15","doi-asserted-by":"publisher","DOI":"10.1038\/nature16961"},{"key":"S0269888919000031_ref12","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2017.70"},{"key":"S0269888919000031_ref21","first-page":"617","volume-title":"The 10th International Conference on Autonomous Agents and Multiagent Systems","volume":"2","author":"Taylor","year":"2011"},{"key":"S0269888919000031_ref14","unstructured":"Schaul, T. , Quan, J. , Antonoglou, I. & Silver, D. 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952."},{"key":"S0269888919000031_ref17","first-page":"3404","article-title":"Effective reinforcement learning for mobile robots","volume":"4","author":"Smart","year":"2002","journal-title":"IEEE International Conference on Robotics and Automation"},{"key":"S0269888919000031_ref18","first-page":"429","volume-title":"Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems","author":"Suay","year":"2016"},{"key":"S0269888919000031_ref20","first-page":"1633","article-title":"Transfer learning for reinforcement learning domains: a survey","volume":"10","author":"Taylor","year":"2009","journal-title":"Journal of Machine Learning Research"},{"key":"S0269888919000031_ref22","doi-asserted-by":"publisher","DOI":"10.1007\/BF00993306"}],"container-title":["The Knowledge Engineering Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S0269888919000031","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,5]],"date-time":"2026-01-05T14:42:15Z","timestamp":1767624135000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S0269888919000031\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019]]},"references-count":24,"alternative-id":["S0269888919000031"],"URL":"https:\/\/doi.org\/10.1017\/s0269888919000031","relation":{},"ISSN":["0269-8889","1469-8005"],"issn-type":[{"type":"print","value":"0269-8889"},{"type":"electronic","value":"1469-8005"}],"subject":[],"published":{"date-parts":[[2019]]},"article-number":"e8"}}