{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,10,31]],"date-time":"2024-10-31T04:34:21Z","timestamp":1730349261011,"version":"3.28.0"},"reference-count":30,"publisher":"MIT Press","issue":"9","license":[{"start":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T00:00:00Z","timestamp":1722902400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,8,19]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>In reinforcement learning (RL), artificial agents are trained to maximize numerical rewards by performing tasks. Exploration is essential in RL because agents must discover information before exploiting it. Two rewards encouraging efficient exploration are the entropy of action policy and curiosity for information gain. Entropy is well established in the literature, promoting randomized action selection. Curiosity is defined in a broad variety of ways in literature, promoting discovery of novel experiences. One example, prediction error curiosity, rewards agents for discovering observations they cannot accurately predict. However, such agents may be distracted by unpredictable observational noises known as curiosity traps. Based on the free energy principle (FEP), this letter proposes hidden state curiosity, which rewards agents by the KL divergence between the predictive prior and posterior probabilities of latent variables. We trained six types of agents to navigate mazes: baseline agents without rewards for entropy or curiosity and agents rewarded for entropy and\/or either prediction error curiosity or hidden state curiosity. We find that entropy and curiosity result in efficient exploration, especially both employed together. Notably, agents with hidden state curiosity demonstrate resilience against curiosity traps, which hinder agents with prediction error curiosity. This suggests implementing the FEP that may enhance the robustness and generalization of RL models, potentially aligning the learning processes of artificial and biological agents.<\/jats:p>","DOI":"10.1162\/neco_a_01690","type":"journal-article","created":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T20:28:27Z","timestamp":1722976107000},"page":"1854-1885","update-policy":"http:\/\/dx.doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":0,"title":["Intrinsic Rewards for Exploration Without Harm From Observational Noise: A Simulation Study Based on the Free Energy Principle"],"prefix":"10.1162","volume":"36","author":[{"given":"Theodore Jerome","family":"Tinker","sequence":"first","affiliation":[{"name":"Cognitive Neurorobotics Research Unit, Okinawa Institute of Science and Technology Graduate University, Onna-san 904-0495, Okinawa, Japan theodore.tinker@oist.jp"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kenji","family":"Doya","sequence":"additional","affiliation":[{"name":"Neural Computation Unit, Okinawa Institute of Science and Technology Graduate University, Onna-san 904-0495, Okinawa, Japan doya@oist.jp"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jun","family":"Tani","sequence":"additional","affiliation":[{"name":"Cognitive Neurorobotics Research Unit, Okinawa Institute of Science and Technology Graduate University, Onna-san 904-0495, Okinawa, Japan jun.tani@oist.jp"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","published-online":{"date-parts":[[2024,8,19]]},"reference":[{"key":"2024103022101675500_bib1","doi-asserted-by":"publisher","first-page":"2025","DOI":"10.1162\/neco_a_01228","article-title":"A novel predictive-coding-inspired variational RNN model for online prediction and recognition","volume":"31","author":"Ahmadi","year":"2019","journal-title":"Neural Computation"},{"key":"2024103022101675500_bib2","doi-asserted-by":"publisher","DOI":"10.3389\/fpsyg.2014.00985","article-title":"Intrinsic motivations and open-ended development in animals, humans, and robots: An overview","volume":"5","author":"Baldassarre","year":"2014","journal-title":"Frontiers in Psychology"},{"issue":"5","key":"2024103022101675500_bib3","doi-asserted-by":"publisher","first-page":"834","DOI":"10.1109\/TSMC.1983.6313077","article-title":"Neuronlike adaptive elements that can solve difficult learning control problems","volume":"SMC-13","author":"Barto","year":"1983","journal-title":"IEEE Transactions on Systems, Man, and Cybernetics"},{"year":"2011","author":"Berger","key":"2024103022101675500_bib4"},{"key":"2024103022101675500_bib5","first-page":"1613","article-title":"Weight uncertainty in neural network","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Blundell","year":"2015"},{"year":"2016","author":"Chung","key":"2024103022101675500_bib6"},{"key":"2024103022101675500_bib7","doi-asserted-by":"crossref","DOI":"10.1016\/j.jmp.2020.102447","article-title":"Active inference on discrete state-spaces: A synthesis","author":"Da Costa","year":"2020"},{"key":"2024103022101675500_bib8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.physrep.2023.07.001","article-title":"The free energy principle made simpler but not too simple","volume":"1024","author":"Friston","year":"2023","journal-title":"Physics Reports"},{"key":"2024103022101675500_bib9","doi-asserted-by":"publisher","first-page":"862","DOI":"10.1016\/j.neubiorev.2016.06.022","article-title":"Active inference and learning","volume":"68","author":"Friston","year":"2016","journal-title":"Neuroscience and Biobehavioral Reviews"},{"year":"2018","author":"Haarnoja","key":"2024103022101675500_bib10"},{"key":"2024103022101675500_bib11","first-page":"55","article-title":"Decision analysis: Applied decision theory","volume-title":"Proceedings of the Fourth International Conference on Operational Research","author":"Howard","year":"1966"},{"key":"2024103022101675500_bib12","doi-asserted-by":"publisher","first-page":"1295","DOI":"10.1016\/j.visres.2008.09.007","article-title":"Bayesian surprise attracts human attention","volume":"49","author":"Itti","year":"2009","journal-title":"Vision Research"},{"issue":"15","key":"2024103022101675500_bib13","first-page":"6176","article-title":"Control of the multi-timescale process using multiple timescale recurrent neural network-based model predictive control","volume":"62","author":"Jian","year":"2023","journal-title":"Industrial and Engineering Chemistry Research"},{"issue":"3","key":"2024103022101675500_bib14","doi-asserted-by":"publisher","first-page":"323","DOI":"10.1007\/s00422-018-0753-2","article-title":"Planning and navigation as active inference","volume":"112","author":"Kaplan","year":"2018","journal-title":"Biological Cybernetics"},{"key":"2024103022101675500_bib15","doi-asserted-by":"crossref","DOI":"10.1109\/SII52469.2022.9708819","article-title":"A curiosity algorithm for robots based on the free energy principle","author":"Kawahara","year":"2022"},{"key":"2024103022101675500_bib16","first-page":"2575","article-title":"Variational dropout and the local reparameterization trick","volume-title":"Advances in neural information processing systems","author":"Kingma","year":"2015"},{"issue":"4","key":"2024103022101675500_bib17","doi-asserted-by":"publisher","first-page":"986","DOI":"10.1214\/aoms\/1177728069","article-title":"On a measure of the information provided by an experiment","volume":"27","author":"Lindley","year":"1956","journal-title":"Annals of Mathematical Statistics"},{"key":"2024103022101675500_bib18","doi-asserted-by":"publisher","first-page":"448","DOI":"10.1162\/neco.1992.4.3.448","article-title":"A practical Bayesian framework for backpropagation networks","volume":"4","author":"MacKay","year":"1992","journal-title":"Neural Computation"},{"year":"2020","author":"Millidge","key":"2024103022101675500_bib19"},{"issue":"7540","key":"2024103022101675500_bib20","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","article-title":"Human-level control through deep reinforcement learning","volume":"518","author":"Mnih","year":"2015","journal-title":"Nature"},{"issue":"6","key":"2024103022101675500_bib21","article-title":"What is intrinsic motivation? A typology of computational approaches","author":"Oudeyer","year":"2007"},{"issue":"5\u20136","key":"2024103022101675500_bib22","doi-asserted-by":"publisher","first-page":"495","DOI":"10.1007\/s00422-019-00805-w","article-title":"Generalised free energy and active inference","volume":"113","author":"Parr","year":"2019","journal-title":"Biological Cybernetics"},{"key":"2024103022101675500_bib23","doi-asserted-by":"crossref","DOI":"10.1109\/CVPRW.2017.70","article-title":"Curiosity-driven exploration by self-supervised prediction","author":"Pathak","year":"2017"},{"issue":"3","key":"2024103022101675500_bib24","doi-asserted-by":"publisher","first-page":"230","DOI":"10.1109\/TAMD.2010.2056368","article-title":"Formal theory of creativity, fun, and intrinsic motivation (1990\u20132010)","volume":"2","author":"Schmidhuber","year":"2010","journal-title":"IEEE Transactions on Autonomous Mental Development"},{"key":"2024103022101675500_bib25","doi-asserted-by":"publisher","first-page":"e41703","DOI":"10.7554\/eLife.41703","article-title":"Computational mechanisms of curiosity and goal-directed exploration","volume":"8","author":"Schwartenbeck","year":"2019","journal-title":"eLife"},{"article-title":"Scaling active inference","year":"2023","author":"Tschantz","key":"2024103022101675500_bib26"},{"year":"2020","author":"Tschantz","key":"2024103022101675500_bib27"},{"key":"2024103022101675500_bib28","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1007\/BF00992698","article-title":"Q-learning","volume":"8","author":"Watkins","year":"1992","journal-title":"Machine Learning"},{"issue":"11","key":"2024103022101675500_bib29","doi-asserted-by":"publisher","first-page":"e1000220","DOI":"10.1371\/journal.pcbi.1000220","article-title":"Emergence of functional hierarchy in a multiple timescale neural network model: A humanoid robot experiment","volume":"4","author":"Yamashita","year":"2008","journal-title":"PLOS Computational Biology"},{"article-title":"Recurrent off-policy baselines for memory-based continuous control","year":"2021","author":"Yang","key":"2024103022101675500_bib30"}],"container-title":["Neural Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/neco\/article-pdf\/36\/9\/1854\/2477511\/neco_a_01690.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/neco\/article-pdf\/36\/9\/1854\/2477511\/neco_a_01690.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,30]],"date-time":"2024-10-30T22:10:24Z","timestamp":1730326224000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/neco\/article\/36\/9\/1854\/123686\/Intrinsic-Rewards-for-Exploration-Without-Harm"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,19]]},"references-count":30,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2024,8,19]]},"published-print":{"date-parts":[[2024,8,19]]}},"URL":"https:\/\/doi.org\/10.1162\/neco_a_01690","relation":{},"ISSN":["0899-7667","1530-888X"],"issn-type":[{"type":"print","value":"0899-7667"},{"type":"electronic","value":"1530-888X"}],"subject":[],"published-other":{"date-parts":[[2024,9]]},"published":{"date-parts":[[2024,8,19]]}}}