{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,6]],"date-time":"2025-11-06T15:41:35Z","timestamp":1762443695706,"version":"3.28.0"},"reference-count":41,"publisher":"MIT Press","issue":"10","content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,9,8]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Deep reinforcement learning (DRL) provides an agent with an optimal policy so as to maximize the cumulative rewards. The policy defined in DRL mainly depends on the state, historical memory, and policy model parameters. However, we humans usually take actions according to our own intentions, such as moving fast or slow, besides the elements included in the traditional policy models. In order to make the action-choosing mechanism more similar to humans and enable the agent to select actions that incorporate intentions, we propose an intention-aware policy learning method in this letter. To formalize this process, we first define an intention-aware policy by incorporating the intention information into the policy model, which is learned by maximizing the cumulative rewards with the mutual information (MI) between the intention and the action. Then we derive an approximation of the MI objective that can be optimized efficiently. Finally, we demonstrate the effectiveness of the intention-aware policy in the classical MuJoCo control task and the multigoal continuous chain walking task.<\/jats:p>","DOI":"10.1162\/neco_a_01607","type":"journal-article","created":{"date-parts":[[2023,7,31]],"date-time":"2023-07-31T18:06:43Z","timestamp":1690826803000},"page":"1657-1677","update-policy":"http:\/\/dx.doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":1,"title":["Learning Intention-Aware Policies in Deep Reinforcement Learning"],"prefix":"10.1162","volume":"35","author":[{"given":"Tingting","family":"Zhao","sequence":"first","affiliation":[{"name":"College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, P.R.C. tingting@tust.edu.cn"}]},{"given":"Shuai","family":"Wu","sequence":"additional","affiliation":[{"name":"College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, P.R.C. tingting@tust.edu.cn"}]},{"given":"Guixi","family":"Li","sequence":"additional","affiliation":[{"name":"College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, P.R.C. tingting@tust.edu.cn"}]},{"given":"Yarui","family":"Chen","sequence":"additional","affiliation":[{"name":"College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, P.R.C. tingting@tust.edu.cn"}]},{"given":"Gang","family":"Niu","sequence":"additional","affiliation":[{"name":"RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan gang.niu.ml@gmail.com"}]},{"given":"Masashi","family":"Sugiyama","sequence":"additional","affiliation":[{"name":"RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan"},{"name":"Graduate School of Frontier Sciences, University of Tokyo, Tokyo 277-8561, Japan sugi@k.u-tokyo.ac.jp"}]}],"member":"281","published-online":{"date-parts":[[2023,9,8]]},"reference":[{"issue":"3","key":"2023101718042850700_bib1","doi-asserted-by":"publisher","first-page":"299","DOI":"10.1109\/PGEC.1967.264666","article-title":"A theory of adaptive pattern classifiers","volume":"EC-16","author":"Amari","year":"1967","journal-title":"IEEE Transactions on Electronic Computers"},{"issue":"3","key":"2023101718042850700_bib2","doi-asserted-by":"publisher","first-page":"493","DOI":"10.1007\/s10994-019-05845-8","article-title":"Skill-based curiosity for intrinsically motivated reinforcement learning","volume":"109","author":"Bougie","year":"2020","journal-title":"Machine Learning"},{"journal-title":"OpenAI Gym","year":"2016","author":"Brockman","key":"2023101718042850700_bib3"},{"key":"2023101718042850700_bib4","first-page":"2180","article-title":"InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets","volume-title":"Advances in neural information processing systems","author":"Chen","year":"2016"},{"key":"2023101718042850700_bib5","first-page":"1701","article-title":"Deep reinforcement learning for general game playing","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Goldwaser","year":"2020"},{"journal-title":"World models","year":"2018","author":"Ha","key":"2023101718042850700_bib6"},{"issue":"3","key":"2023101718042850700_bib7","doi-asserted-by":"publisher","first-page":"522","DOI":"10.1111\/bmsp.12199","article-title":"Curiosity-driven recommendation strategy for adaptive learning via deep reinforcement learning","volume":"73","author":"Han","year":"2020","journal-title":"British Journal of Mathematical and Statistical Psychology"},{"journal-title":"InfoRL: Interpretable reinforcement learning using information maximization","year":"2019","author":"Hayat","key":"2023101718042850700_bib8"},{"issue":"4","key":"2023101718042850700_bib9","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRev.106.620","article-title":"Information theory and statistical mechanics","volume":"106","author":"Jaynes","year":"1957","journal-title":"Physical Review"},{"key":"2023101718042850700_bib10","doi-asserted-by":"publisher","first-page":"237","DOI":"10.1613\/jair.301","article-title":"Reinforcement learning: A survey","volume":"4","author":"Kaelbling","year":"1996","journal-title":"Journal of Artificial Intelligence Research"},{"journal-title":"A maximum mutual information framework for multi-agent reinforcement learning","year":"2020","author":"Kim","key":"2023101718042850700_bib11"},{"journal-title":"Adam: A method for stochastic optimization","year":"2014","author":"Kingma","key":"2023101718042850700_bib12"},{"issue":"9","key":"2023101718042850700_bib13","doi-asserted-by":"publisher","first-page":"3354","DOI":"10.1073\/pnas.1309933111","article-title":"Equitability, mutual information, and the maximal information coefficient","volume":"111","author":"Kinney","year":"2014","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"2023101718042850700_bib14","first-page":"1008","article-title":"Actor-critic algorithms","volume-title":"Advances in neural information processing systems","author":"Konda","year":"1999"},{"issue":"1","key":"2023101718042850700_bib15","first-page":"1334","article-title":"End-to-end training of deep visuomotor policies","volume":"17","author":"Levine","year":"2016","journal-title":"Journal of Machine Learning Research"},{"key":"2023101718042850700_bib16","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1016\/j.neucom.2020.08.024","article-title":"Random curiosity-driven exploration in deep reinforcement learning","author":"Li","year":"2020","journal-title":"Neurocomputing"},{"journal-title":"Continuous control with deep reinforcement learning","year":"2015","author":"Lillicrap","key":"2023101718042850700_bib17"},{"issue":"3","key":"2023101718042850700_bib18","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1109\/2.36","article-title":"Self-organization in a perceptual network","volume":"21","author":"Linsker","year":"1988","journal-title":"Computer"},{"issue":"1","key":"2023101718042850700_bib19","doi-asserted-by":"publisher","DOI":"10.1037\/0033-2909.116.1.75","article-title":"The psychology of curiosity: A review and reinterpretation","volume":"116","author":"Loewenstein","year":"1994","journal-title":"Psychological Bulletin"},{"key":"2023101718042850700_bib20","first-page":"206","article-title":"Exploration in model-based reinforcement learning by empirically estimating learning progress","volume-title":"Advances in neural information processing systems","author":"Lopes","year":"2012"},{"key":"2023101718042850700_bib21","first-page":"7655","article-title":"Efficient continuous control with double actors and regularized critics","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Lyu","year":"2022"},{"journal-title":"Playing Atari with deep reinforcement learning","year":"2013","author":"Mnih","key":"2023101718042850700_bib22"},{"key":"2023101718042850700_bib23","first-page":"1928","article-title":"Asynchronous methods for deep reinforcement learning","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Mnih","year":"2016"},{"issue":"7540","key":"2023101718042850700_bib24","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","article-title":"Human-level control through deep reinforcement learning","volume":"518","author":"Mnih","year":"2015","journal-title":"Nature"},{"key":"2023101718042850700_bib25","doi-asserted-by":"publisher","first-page":"209320","DOI":"10.1109\/ACCESS.2020.3038605","article-title":"A gentle introduction to reinforcement learning and its application in different fields","volume":"8","author":"Naeem","year":"2020","journal-title":"IEEE Access"},{"journal-title":"Pegasus: A policy search method for large MDPs and POMDPs","year":"2013","author":"Ng","key":"2023101718042850700_bib26"},{"issue":"2","key":"2023101718042850700_bib27","doi-asserted-by":"publisher","first-page":"265","DOI":"10.1109\/TEVC.2006.890271","article-title":"Intrinsic motivation systems for autonomous mental development","volume":"11","author":"Oudeyer","year":"2007","journal-title":"IEEE Transactions on Evolutionary Computation"},{"journal-title":"Prioritized experience replay","year":"2015","author":"Schaul","key":"2023101718042850700_bib28"},{"journal-title":"Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments","year":"1990","author":"Schmidhuber","key":"2023101718042850700_bib29"},{"key":"2023101718042850700_bib30","doi-asserted-by":"crossref","first-page":"222","DOI":"10.7551\/mitpress\/3115.003.0030","article-title":"A possibility for implementing curiosity and boredom in model-building neural controllers","volume-title":"Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats","author":"Schmidhuber","year":"1991"},{"journal-title":"On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models","year":"2015","author":"Schmidhuber","key":"2023101718042850700_bib31"},{"journal-title":"Reinforcement learning upside down: Don\u2019t predict rewards\u2013just map them to actions","year":"2019","author":"Schmidhuber","key":"2023101718042850700_bib32"},{"journal-title":"Learning to generate focus trajectories for attentive vision","year":"1990","author":"Schmidhuber","key":"2023101718042850700_bib33"},{"journal-title":"Proximal policy optimization algorithms","year":"2017","author":"Schulman","key":"2023101718042850700_bib34"},{"key":"2023101718042850700_bib35","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1007\/BF00115009","article-title":"Learning to predict by the methods of temporal differences","volume":"3","author":"Sutton","year":"1988","journal-title":"Machine Learning"},{"key":"2023101718042850700_bib36","first-page":"1057","volume-title":"Advances in neural information processing systems","author":"Sutton","year":"2000"},{"issue":"2","key":"2023101718042850700_bib37","doi-asserted-by":"publisher","first-page":"215","DOI":"10.1162\/neco.1994.6.2.215","article-title":"TD-Gammon, a self-teaching backgammon program, achieves master-level play","volume":"6","author":"Tesauro","year":"1994","journal-title":"Neural Computation"},{"key":"2023101718042850700_bib38","first-page":"5026","article-title":"MuJoCo: A physics engine for model-based control","volume-title":"Proceedings of the 2012 IEEE\/RSJ International Conference on Intelligent Robots and Systems","author":"Todorov","year":"2012"},{"key":"2023101718042850700_bib39","doi-asserted-by":"crossref","first-page":"229","DOI":"10.1007\/BF00992696","article-title":"Simple statistical gradient-following algorithms for connectionist reinforcement learning","volume":"8","author":"Williams","year":"1992","journal-title":"Machine Learning"},{"journal-title":"Mutual information-based state-control for intrinsically motivated reinforcement learning","year":"2020","author":"Zhao","key":"2023101718042850700_bib40"},{"key":"2023101718042850700_bib41","first-page":"1433","article-title":"Maximum entropy inverse reinforcement learning","volume-title":"Proceedings of the 23rd AAAI Conference on Artificial Intelligence","author":"Ziebart","year":"2008"}],"container-title":["Neural Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/neco\/article-pdf\/35\/10\/1657\/2163437\/neco_a_01607.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/neco\/article-pdf\/35\/10\/1657\/2163437\/neco_a_01607.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,25]],"date-time":"2024-10-25T11:45:58Z","timestamp":1729856758000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/neco\/article\/35\/10\/1657\/117017\/Learning-Intention-Aware-Policies-in-Deep"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,8]]},"references-count":41,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2023,9,8]]},"published-print":{"date-parts":[[2023,9,8]]}},"URL":"https:\/\/doi.org\/10.1162\/neco_a_01607","relation":{},"ISSN":["0899-7667","1530-888X"],"issn-type":[{"type":"print","value":"0899-7667"},{"type":"electronic","value":"1530-888X"}],"subject":[],"published-other":{"date-parts":[[2023,10]]},"published":{"date-parts":[[2023,9,8]]}}}