{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T17:52:31Z","timestamp":1775325151282,"version":"3.50.1"},"reference-count":172,"publisher":"Emerald","issue":"5-6","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2015,11,26]]},"abstract":"<jats:p>Bayesian methods for machine learning have been widely investigated, yielding principled methods for incorporating prior information into inference algorithms. In this survey, we provide an in-depth review of the role of Bayesian methods for the reinforcement learning (RL) paradigm. The major incentives for incorporating Bayesian reasoning in RL are: 1) it provides an elegant approach to action-selection (exploration\/exploitation) as a function of the uncertainty in learning; and 2) it provides a machinery to incorporate prior knowledge into the algorithms. We first discuss models and methods for Bayesian inference in the simple single-step Bandit model. We then review the extensive recent literature on Bayesian methods for model-based RL, where prior information can be expressed on the parameters of the Markov model. We also present Bayesian methods for model-free RL, where priors are expressed over the value function or policy class. 
The objective of the paper is to provide a comprehensive survey on Bayesian RL algorithms and their theoretical and empirical properties.<\/jats:p>","DOI":"10.1561\/2200000049","type":"journal-article","created":{"date-parts":[[2015,11,26]],"date-time":"2015-11-26T05:05:34Z","timestamp":1448514334000},"page":"359-483","source":"Crossref","is-referenced-by-count":201,"title":["Bayesian Reinforcement Learning: A Survey"],"prefix":"10.1561","volume":"8","author":[{"given":"Mohammad","family":"Ghavamzadeh","sequence":"first","affiliation":[{"name":"Adobe Research & INRIA"}]},{"given":"Shie","family":"Mannor","sequence":"additional","affiliation":[{"name":"Technion"}]},{"given":"Joelle","family":"Pineau","sequence":"additional","affiliation":[{"name":"McGill University"}]},{"given":"Aviv","family":"Tamar","sequence":"additional","affiliation":[{"name":"University of California, Berkeley"}]}],"member":"140","published-online":{"date-parts":[[2015,11,26]]},"reference":[{"key":"2026033014113327000_ref001","article-title":"Bayesian optimal control of smoothly parameterized systems","volume-title":"Proceedings of the Conference on Uncertainty in Artificial Intelligence","author":"Abbasi-Yadkori","year":"2015"},{"key":"2026033014113327000_ref002","doi-asserted-by":"crossref","DOI":"10.1145\/1015330.1015430","article-title":"Apprenticeship learning via inverse reinforcement learning","volume-title":"Proceedings of the 21st International Conference on Machine Learning","author":"Abbeel","year":"2004"},{"key":"2026033014113327000_ref003","first-page":"39.1","article-title":"Analysis of Thompson sampling for the multi-armed bandit problem","volume":"23","author":"Agrawal","year":"2012","journal-title":"Proceedings of the 25th Annual Conference on Learning Theory (COLT), JMLR W&CP"},{"key":"2026033014113327000_ref004","first-page":"99","article-title":"Further optimal regret bounds for Thompson sampling","volume-title":"Proceedings of the 16th International Conference on 
Artificial Intelligence and Statistics","author":"Agrawal","year":"2013"},{"key":"2026033014113327000_ref005","first-page":"127","article-title":"Thompson sampling for contextual bandits with linear payoffs","volume-title":"Proceedings of the 30th International Conference on Machine Learning (ICML-13)","author":"Agrawal","year":"2013"},{"key":"2026033014113327000_ref006","article-title":"Near-optimal BRL using optimistic local transitions","volume-title":"International Conference on Machine Learning","author":"Araya-Lopez","year":"2012"},{"key":"2026033014113327000_ref007","volume-title":"Model-based Bayesian Reinforcement Learning with Generalized Priors","author":"Asmuth","year":"2013"},{"key":"2026033014113327000_ref008","article-title":"Approaching Bayes-optimality using Monte-Carlo tree search","volume-title":"International Conference on Automated Planning and Scheduling (ICAPS)","author":"Asmuth","year":"2011"},{"key":"2026033014113327000_ref009","article-title":"A Bayesian sampling approach to exploration in reinforcement learning","volume-title":"Proceedings of the Conference on Uncertainty in Artificial Intelligence","author":"Asmuth","year":"2009"},{"key":"2026033014113327000_ref010","doi-asserted-by":"crossref","first-page":"174","DOI":"10.1016\/0022-247X(65)90154-X","article-title":"Optimal control of Markov decision processes with incomplete state estimation","volume":"10","author":"Astrom","year":"1965","journal-title":"Journal of Mathematical Analysis and Applications"},{"key":"2026033014113327000_ref011","doi-asserted-by":"crossref","DOI":"10.1109\/ROBOT.1997.606886","article-title":"A comparison of direct and model-based reinforcement learning","volume-title":"International Conference on Robotics and Automation (ICRA)","author":"Atkeson","year":"1997"},{"key":"2026033014113327000_ref012","doi-asserted-by":"crossref","DOI":"10.1145\/1502650.1502700","article-title":"A Bayesian reinforcement learning approach for customizing human-robot 
interfaces","volume-title":"International Conference on Intelligent User Interfaces","author":"Atrash","year":"2009"},{"issue":"2-3","key":"2026033014113327000_ref013","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1023\/A:1013689704352","article-title":"Finite-time analysis of the multi-armed bandit problem","volume":"47","author":"Auer","year":"2002","journal-title":"Machine Learning"},{"key":"2026033014113327000_ref014","first-page":"897","article-title":"Apprenticeship learning about multiple intentions","volume-title":"Proceedings of the 28th International Conference on Machine Learning","author":"Babes","year":"2011"},{"key":"2026033014113327000_ref015","first-page":"835","article-title":"Neuron-like elements that can solve difficult learning control problems","volume":"13","author":"Barto","year":"1983","journal-title":"IEEE Transaction on Systems, Man and Cybernetics"},{"key":"2026033014113327000_ref016","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1613\/jair.731","article-title":"A model of inductive bias learning","volume":"12","author":"Baxter","year":"2000","journal-title":"Journal of Artificial Intelligence Research"},{"key":"2026033014113327000_ref017","doi-asserted-by":"crossref","first-page":"319","DOI":"10.1613\/jair.806","article-title":"Infinite-horizon policy-gradient estimation","volume":"15","author":"Baxter","year":"2001","journal-title":"Journal of Artificial Intelligence Research"},{"key":"2026033014113327000_ref018","first-page":"28","article-title":"Knightcap: A chess program that learns by combining TD(\u03bb) with game-tree search","volume-title":"Proceedings of the 15th International Conference on Machine Learning","author":"Baxter","year":"1998"},{"key":"2026033014113327000_ref019","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1613\/jair.807","article-title":"Experiments with infinite-horizon policy-gradient estimation","volume":"15","author":"Baxter","year":"2001","journal-title":"Journal of 
Artificial Intelligence Research"},{"key":"2026033014113327000_ref020","volume-title":"Dynamic Programming","author":"Bellman","year":"1957"},{"key":"2026033014113327000_ref021","volume-title":"Neuro-Dynamic Programming","author":"Bertsekas","year":"1996"},{"issue":"5","key":"2026033014113327000_ref022","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1109\/MCS.2012.2205478","article-title":"Robust adaptive Markov decision processes: Planning with model uncertainty","volume":"32","author":"Bertuccelli","year":"2012","journal-title":"Control Systems, IEEE"},{"key":"2026033014113327000_ref023","first-page":"105","article-title":"Incremental natural actor-critic algorithms","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Bhatnagar","year":"2007"},{"issue":"11","key":"2026033014113327000_ref024","doi-asserted-by":"crossref","first-page":"2471","DOI":"10.1016\/j.automatica.2009.07.008","article-title":"Natural actor-critic algorithms","volume":"45","author":"Bhatnagar","year":"2009","journal-title":"Automatica"},{"key":"2026033014113327000_ref025","first-page":"49","article-title":"Least-squares temporal difference learning","volume-title":"Proceedings of the 16th International Conference on Machine Learning","author":"Boyan","year":"1999"},{"key":"2026033014113327000_ref026","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1023\/A:1018056104778","article-title":"Linear least-squares algorithms for temporal difference learning","volume":"22","author":"Bradtke","year":"1996","journal-title":"Journal of Machine Learning"},{"key":"2026033014113327000_ref027","first-page":"213","article-title":"R-max - a general polynomial time algorithm for near-optimal reinforcement learning","volume":"3","author":"Brafman","year":"2003","journal-title":"Journal of Machine Learning Research 
(JMLR)"},{"issue":"1","key":"2026033014113327000_ref028","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1561\/2200000024","article-title":"Regret analysis of stochastic and nonstochastic multi-armed bandit problems","volume":"5","author":"Bubeck","year":"2012","journal-title":"Foundations and Trends in Machine Learning"},{"issue":"1","key":"2026033014113327000_ref029","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1023\/A:1007379606734","article-title":"Multitask learning","volume":"28","author":"Caruana","year":"1997","journal-title":"Machine Learning"},{"key":"2026033014113327000_ref030","first-page":"2437","article-title":"Using linear programming for Bayesian exploration in Markov decision processes","volume-title":"International Joint Conference on Artificial Intelligence","author":"Castro","year":"2007"},{"key":"2026033014113327000_ref031","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-642-15880-3_19","article-title":"Smarter sampling in model-based Bayesian reinforcement learning","volume-title":"Machine Learning and Knowledge Discovery in Databases","author":"Castro","year":"2010"},{"key":"2026033014113327000_ref032","article-title":"BBRL: a C++ open-source library used to compare Bayesian reinforcement learning algorithms","author":"Castronovo","year":"2015"},{"key":"2026033014113327000_ref033","author":"Castronovo","year":"2015"},{"key":"2026033014113327000_ref034","article-title":"Coordination in multiagent reinforcement learning: A Bayesian approach","volume-title":"Proceedings of the 2nd International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS)","author":"Chalkiadakis","year":"2013"},{"issue":"1","key":"2026033014113327000_ref035","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1613\/jair.3075","article-title":"Cooperative games with overlapping coalitions","volume":"39","author":"Chalkiadakis","year":"2010","journal-title":"Journal of Artificial Intelligence 
Research"},{"key":"2026033014113327000_ref036","first-page":"2249","article-title":"An empirical evaluation of Thompson sampling","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Chapelle","year":"2011"},{"key":"2026033014113327000_ref037","first-page":"2078","article-title":"Tractable objectives for robust policy optimization","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Chen","year":"2012"},{"key":"2026033014113327000_ref038","first-page":"1989","article-title":"MAP inference for Bayesian inverse reinforcement learning","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Choi","year":"2011"},{"key":"2026033014113327000_ref039","first-page":"305","article-title":"Nonparametric Bayesian inverse reinforcement learning for multiple reward functions","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Choi","year":"2012"},{"key":"2026033014113327000_ref040","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1023\/A:1007518724497","article-title":"Elevator group control using multiple reinforcement learning agents","volume":"33","author":"Crites","year":"1998","journal-title":"Machine Learning"},{"key":"2026033014113327000_ref041","article-title":"Bayesian reinforcement learning in continuous POMDPs with Gaussian processes","volume-title":"IEEE\/RSJ International Conference on Intelligent Robots and Systems","author":"Dallaire","year":"2009"},{"key":"2026033014113327000_ref042","first-page":"761","article-title":"Bayesian Q-learning","volume-title":"AAAI Conference on Artificial Intelligence","author":"Dearden","year":"1998"},{"key":"2026033014113327000_ref043","article-title":"Model-based Bayesian exploration","volume-title":"Proceedings of the Conference on Uncertainty in Artificial 
Intelligence","author":"Dearden","year":"1999"},{"issue":"1","key":"2026033014113327000_ref044","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1287\/opre.1080.0685","article-title":"Percentile optimization for Markov decision processes with parameter uncertainty","volume":"58","author":"Delage","year":"2010","journal-title":"Operations Research"},{"key":"2026033014113327000_ref045","article-title":"Bayesian multi-task inverse reinforcement learning","volume-title":"Proceedings of the European Workshop on Reinforcement Learning","author":"Dimitrakakis","year":"2011"},{"key":"2026033014113327000_ref046","doi-asserted-by":"crossref","DOI":"10.1145\/1390156.1390189","article-title":"Reinforcement learning with limited reinforcement: using Bayes risk for active learning in POMDPs","volume-title":"International Conference on Machine Learning","author":"Doshi","year":"2008"},{"key":"2026033014113327000_ref047","article-title":"The infinite partially observable Markov decision process","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Doshi-Velez","year":"2009"},{"key":"2026033014113327000_ref048","article-title":"Nonparametric Bayesian policy priors for reinforcement learning","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Doshi-Velez","year":"2010"},{"key":"2026033014113327000_ref049","article-title":"Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs","volume-title":"Artificial Intelligence","author":"Doshi-Velez","year":"2011"},{"key":"2026033014113327000_ref050","article-title":"Monte-Carlo algorithms for the improvement of finite-state stochastic controllers: Application to Bayes-adaptive Markov decision processes","volume-title":"International Workshop on Artificial Intelligence and Statistics (AI-STATS)","author":"Duff","year":"2001"},{"key":"2026033014113327000_ref051","volume-title":"Optimal Learning: 
Computational Procedures for Bayes-Adaptive Markov Decision Processes","author":"Duff","year":"2002"},{"key":"2026033014113327000_ref052","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511761362","volume-title":"Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction","author":"Efron","year":"2010"},{"key":"2026033014113327000_ref053","volume-title":"Algorithms and Representations for Reinforcement Learning","author":"Engel","year":"2005"},{"key":"2026033014113327000_ref054","first-page":"84","article-title":"Sparse online greedy support vector regression","volume-title":"Proceedings of the 13th European Conference on Machine Learning","author":"Engel","year":"2002"},{"key":"2026033014113327000_ref055","first-page":"154","article-title":"Bayes meets Bellman: The Gaussian process approach to temporal difference learning","volume-title":"Proceedings of the 20th International Conference on Machine Learning","author":"Engel","year":"2003"},{"key":"2026033014113327000_ref056","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1145\/1102351.1102377","article-title":"Reinforcement learning with Gaussian processes","volume-title":"Proceedings of the 22nd International Conference on Machine Learning","author":"Engel","year":"2005"},{"key":"2026033014113327000_ref057","article-title":"Learning to control an octopus arm with Gaussian process temporal difference methods","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Engel","year":"2005"},{"key":"2026033014113327000_ref058","first-page":"503","article-title":"Tree-based batch mode reinforcement learning","volume":"6","author":"Ernst","year":"2005","journal-title":"Journal of Machine Learning Research"},{"key":"2026033014113327000_ref059","first-page":"441","article-title":"Regularized policy iteration","volume-title":"Proceedings of the Advances in Neural Information Processing 
Systems","author":"Farahmand","year":"2008"},{"key":"2026033014113327000_ref060","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1007\/978-3-540-89722-4_5","article-title":"Regularized fitted Q-iteration: Application to planning","volume-title":"Recent Advances in Reinforcement Learning, 8th European Workshop, EWRL","author":"Farahmand","year":"2008"},{"key":"2026033014113327000_ref061","article-title":"PAC-Bayesian model selection for reinforcement learning","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Fard","year":"2010"},{"key":"2026033014113327000_ref062","article-title":"PAC-Bayesian policy evaluation for reinforcement learning","volume-title":"Proceedings of Conference on Uncertainty in Artificial Intelligence (UAI)","author":"Fard","year":"2011"},{"key":"2026033014113327000_ref063","first-page":"874","article-title":"Dual control theory, Parts I and II","volume":"21","author":"Feldbaum","year":"1961","journal-title":"Automation and Remote Control"},{"key":"2026033014113327000_ref064","doi-asserted-by":"crossref","first-page":"118","DOI":"10.1049\/ip-cta:20000107","article-title":"Survey of adaptive dual control methods","volume":"147","author":"Filatov","year":"2000","journal-title":"IEEE Control Theory and Applications"},{"key":"2026033014113327000_ref065","article-title":"Efficient Bayesian parameter estimation in large discrete domains","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Friedman","year":"1999"},{"issue":"3","key":"2026033014113327000_ref066","doi-asserted-by":"crossref","first-page":"106","DOI":"10.1145\/2093548.2093574","article-title":"The grand challenge of computer Go: Monte Carlo tree search and extensions","volume":"55","author":"Gelly","year":"2012","journal-title":"Communications of the ACM"},{"key":"2026033014113327000_ref067","first-page":"457","article-title":"Bayesian policy gradient algorithms","volume-title":"Proceedings 
of the Advances in Neural Information Processing Systems","author":"Ghavamzadeh","year":"2006"},{"key":"2026033014113327000_ref068","doi-asserted-by":"crossref","first-page":"297","DOI":"10.1145\/1273496.1273534","article-title":"Bayesian Actor-Critic algorithms","volume-title":"Proceedings of the 24th International Conference on Machine Learning","author":"Ghavamzadeh","year":"2007"},{"key":"2026033014113327000_ref069","author":"Ghavamzadeh","year":"2013"},{"key":"2026033014113327000_ref070","doi-asserted-by":"crossref","first-page":"148","DOI":"10.1111\/j.2517-6161.1979.tb01068.x","article-title":"Bandit processes and dynamic allocation indices","author":"Gittins","year":"1979","journal-title":"Journal of the Royal Statistical Society. Series B (Methodological)"},{"key":"2026033014113327000_ref071","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1145\/84537.84552","article-title":"Likelihood ratio gradient estimation for stochastic systems","volume":"33","author":"Glynn","year":"1990","journal-title":"Communications of the ACM"},{"key":"2026033014113327000_ref072","first-page":"861","article-title":"Thompson sampling for learning parameterized Markov decision processes","volume-title":"Proceedings of the 28th Conference on Learning Theory (COLT)","author":"Gopalan","year":"2015"},{"key":"2026033014113327000_ref073","first-page":"100","article-title":"Thompson sampling for complex online problems","volume-title":"Proceedings of the 31st International Conference on Machine Learning","author":"Gopalan","year":"2014"},{"key":"2026033014113327000_ref074","first-page":"13","article-title":"Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft\u2019s Bing search engine","volume-title":"Proceedings of the 27th International Conference on Machine Learning","author":"Graepel","year":"2010"},{"key":"2026033014113327000_ref075","article-title":"Adaptive control of nonlinear stochastic systems by particle 
filtering","volume-title":"International Conference on Control and Automation","author":"Greenfield","year":"2003"},{"key":"2026033014113327000_ref076","article-title":"Efficient Bayes-adaptive reinforcement learning using sample-based search","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Guez","year":"2012"},{"key":"2026033014113327000_ref077","first-page":"317","article-title":"Stochastic regret minimization via Thompson sampling","volume-title":"Proceedings of The 27th Conference on Learning Theory","author":"Guha","year":"2014"},{"key":"2026033014113327000_ref078","article-title":"Exploiting generative models in discriminative classifiers","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Jaakkola","year":"1999"},{"key":"2026033014113327000_ref079","doi-asserted-by":"crossref","DOI":"10.1007\/11564096_59","article-title":"Active learning in partially observable Markov decision processes","volume-title":"European Conference on Machine Learning","author":"Jaulmes","year":"2005"},{"key":"2026033014113327000_ref080","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1016\/S0004-3702(98)00023-X","article-title":"Planning and acting in partially observable stochastic domains","volume":"101","author":"Kaelbling","year":"1998","journal-title":"Artificial Intelligence"},{"key":"2026033014113327000_ref081","first-page":"592","article-title":"On Bayesian upper confidence bounds for bandit problems","volume-title":"International Conference on Artificial Intelligence and Statistics","author":"Kaufmann","year":"2012"},{"key":"2026033014113327000_ref082","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1007\/978-3-642-34106-9_18","article-title":"Thompson sampling: An asymptotically optimal finite-time analysis","volume":"7568","author":"Kaufmann","year":"2012","journal-title":"Algorithmic Learning Theory"},{"key":"2026033014113327000_ref083","article-title":"A 
greedy approximation of Bayesian reinforcement learning with probably optimistic transition model","volume-title":"Adaptive Learning Agents 2013 (a workshop of AAAMAS)","author":"Kawaguchi","year":"2013"},{"key":"2026033014113327000_ref084","first-page":"260","article-title":"Near-optimal reinforcement learning in polynomial time","volume-title":"International Conference on Machine Learning","author":"Kearns","year":"1998"},{"key":"2026033014113327000_ref085","first-page":"1324","article-title":"A sparse sampling algorithm for near-optimal planning in large Markov decision processes","volume-title":"International Joint Conference on Artificial Intelligence","author":"Kearns","year":"1999"},{"key":"2026033014113327000_ref086","doi-asserted-by":"crossref","DOI":"10.1177\/0278364913495721","article-title":"Reinforcement learning in robotics: A survey","volume-title":"International Journal of Robotics Research (IJRR)","author":"Kober","year":"2013"},{"key":"2026033014113327000_ref087","doi-asserted-by":"crossref","DOI":"10.1007\/11871842_29","article-title":"Bandit based Monte-Carlo planning","volume-title":"Proceedings of the European Conference on Machine Learning (ECML)","author":"Kocsis","year":"2006"},{"key":"2026033014113327000_ref088","first-page":"2619","article-title":"Policy gradient reinforcement learning for fast quadrupedal locomotion","volume-title":"Proceedings of IEEE International Conference on Robotics and Automation","author":"Kohl","year":"2004"},{"key":"2026033014113327000_ref089","doi-asserted-by":"crossref","DOI":"10.1145\/1553374.1553441","article-title":"Near-Bayesian exploration in polynomial time","volume-title":"International Conference on Machine Learning","author":"Kolter","year":"2009"},{"key":"2026033014113327000_ref090","first-page":"1008","article-title":"Actor-Critic algorithms","volume-title":"Proceedings of the Advances in Neural Information Processing 
Systems","author":"Konda","year":"2000"},{"key":"2026033014113327000_ref091","first-page":"1107","article-title":"Least-squares policy iteration","volume":"4","author":"Lagoudakis","year":"2003","journal-title":"Journal of Machine Learning Research"},{"issue":"1","key":"2026033014113327000_ref092","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1016\/0196-8858(85)90002-8","article-title":"Asymptotically efficient adaptive allocation rules","volume":"6","author":"Lai","year":"1985","journal-title":"Advances in Applied Mathematics"},{"key":"2026033014113327000_ref093","first-page":"599","article-title":"Bayesian multi-task reinforcement learning","volume-title":"Proceedings of the 27th International Conference on Machine Learning","author":"Lazaric","year":"2010"},{"key":"2026033014113327000_ref094","volume-title":"A unifying framework for computational reinforcement learning theory","author":"Li","year":"2009"},{"key":"2026033014113327000_ref095","article-title":"On the prior sensitivity of Thompson sampling","volume-title":"CoRR","author":"Liu","year":"2015"},{"key":"2026033014113327000_ref096","first-page":"623","article-title":"The sample complexity of exploration in the multi-armed bandit problem","volume":"5","author":"Mannor","year":"2004","journal-title":"The Journal of Machine Learning Research"},{"key":"2026033014113327000_ref097","article-title":"The cross entropy method for fast policy search","volume-title":"International Conference on Machine Learning","author":"Mannor","year":"2003"},{"issue":"2","key":"2026033014113327000_ref098","doi-asserted-by":"crossref","first-page":"308","DOI":"10.1287\/mnsc.1060.0614","article-title":"Bias and variance approximation in value function estimates","volume":"53","author":"Mannor","year":"2007","journal-title":"Management Science"},{"key":"2026033014113327000_ref099","volume-title":"Simulation-Based Methods for Markov Decision 
Processes","author":"Marbach","year":"1998"},{"key":"2026033014113327000_ref100","doi-asserted-by":"crossref","DOI":"10.1023\/A:1007618624809","article-title":"Some PAC-Bayesian theorems","volume":"37","author":"McAllester","year":"1999","journal-title":"Machine Learning"},{"issue":"3","key":"2026033014113327000_ref101","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1007\/s10994-008-5061-y","article-title":"Transfer in variable-reward hierarchical reinforcement learning","volume":"73","author":"Mehta","year":"2008","journal-title":"Machine Learning"},{"key":"2026033014113327000_ref102","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-642-33486-3_10","article-title":"Bayesian nonparametric inverse reinforcement learning","volume-title":"Proceedings of the European Conference on Machine Learning","author":"Michini","year":"2012"},{"key":"2026033014113327000_ref103","first-page":"3651","article-title":"Improving the efficiency of Bayesian inverse reinforcement learning","volume-title":"IEEE International Conference on Robotics and Automation","author":"Michini","year":"2012"},{"key":"2026033014113327000_ref104","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1023\/A:1022635613229","article-title":"Prioritized sweeping: Reinforcement learning with less data and less real time","volume":"13","author":"Moore","year":"1993","journal-title":"Machine Learning"},{"key":"2026033014113327000_ref105","article-title":"Apprenticeship learning using inverse reinforcement learning and gradient methods","volume-title":"Proceedings of Conference on Uncertainty in Artificial Intelligence","author":"Neu","year":"2007"},{"key":"2026033014113327000_ref106","first-page":"663","article-title":"Algorithms for inverse reinforcement learning","volume-title":"Proceedings of 17th International Conference on Machine Learning","author":"Ng","year":"2000"},{"key":"2026033014113327000_ref107","volume-title":"Proceedings of the Advances in Neural Information Processing 
Systems","author":"Ng","year":"2004"},{"issue":"5","key":"2026033014113327000_ref108","doi-asserted-by":"crossref","first-page":"780","DOI":"10.1287\/opre.1050.0216","article-title":"Robust control of Markov decision processes with uncertain transition matrices","volume":"53","author":"Nilim","year":"2005","journal-title":"Operations Research"},{"issue":"2","key":"2026033014113327000_ref109","doi-asserted-by":"crossref","first-page":"254","DOI":"10.1287\/ijoc.1100.0398","article-title":"Computing a classic index for finite-horizon bandits","volume":"23","author":"Ni\u00f1o-Mora","year":"2011","journal-title":"INFORMS Journal on Computing"},{"key":"2026033014113327000_ref110","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1016\/0378-3758(91)90002-V","article-title":"Bayes-Hermite quadrature","volume":"29","author":"O\u2019Hagan","year":"1991","journal-title":"Journal of Statistical Planning and Inference"},{"key":"2026033014113327000_ref111","article-title":"(More) efficient reinforcement learning via posterior sampling","volume-title":"Proceedings of the Advances in Neural Information Processing Systems (NIPS)","author":"Osband","year":"2013"},{"key":"2026033014113327000_ref112","first-page":"970","article-title":"An online POMDP algorithm for complex multiagent environments","volume-title":"International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS)","author":"Paquet","year":"2005"},{"issue":"7-9","key":"2026033014113327000_ref113","doi-asserted-by":"crossref","first-page":"1180","DOI":"10.1016\/j.neucom.2007.11.026","article-title":"Natural actor-critic","volume":"71","author":"Peters","year":"2008","journal-title":"Neurocomputing"},{"key":"2026033014113327000_ref114","first-page":"1025","article-title":"Point-based value iteration: an anytime algorithm for POMDPs","volume-title":"International Joint Conference on Artificial 
Intelligence","author":"Pineau","year":"2003"},{"key":"2026033014113327000_ref115","doi-asserted-by":"crossref","DOI":"10.1109\/ICASSP.2011.5946754","volume-title":"Bayesian Reinforcement Learning for POMDP-based Dialogue Systems","author":"Png","year":"2011"},{"key":"2026033014113327000_ref116","doi-asserted-by":"crossref","DOI":"10.1109\/ICASSP.2011.5946754","article-title":"Bayesian reinforcement learning for POMDP-based dialogue systems","volume-title":"ICASSP","author":"Png","year":"2011"},{"key":"2026033014113327000_ref117","volume-title":"Calcul des Probabilit\u00e9s","author":"Poincar\u00e9","year":"1896"},{"key":"2026033014113327000_ref118","article-title":"Point-based value iteration for continuous POMDPs","volume":"7","author":"Porta","year":"2006","journal-title":"Journal of Machine Learning Research"},{"key":"2026033014113327000_ref119","first-page":"697","article-title":"An analytic solution to discrete Bayesian reinforcement learning","volume-title":"International Conference on Machine learning","author":"Poupart","year":"2006"},{"key":"2026033014113327000_ref120","doi-asserted-by":"crossref","DOI":"10.1002\/9781118029176","volume-title":"Approximate Dynamic Programming: Solving the curses of dimensionality (2nd Edition)","author":"Powell","year":"2011"},{"key":"2026033014113327000_ref121","doi-asserted-by":"crossref","DOI":"10.1002\/9780470316887","volume-title":"Markov Decision Processes","author":"Puterman","year":"1994"},{"key":"2026033014113327000_ref122","volume-title":"IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","author":"Fonteneau","year":"2013"},{"key":"2026033014113327000_ref123","first-page":"2586","article-title":"Bayesian inverse reinforcement learning","volume-title":"Proceedings of the 20th International Joint Conference on Artificial Intelligence","author":"Ramachandran","year":"2007"},{"key":"2026033014113327000_ref124","first-page":"489","volume-title":"Proceedings of the Advances in Neural 
Information Processing Systems","author":"Rasmussen","year":"2003"},{"key":"2026033014113327000_ref125","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Rasmussen","year":"2004"},{"key":"2026033014113327000_ref126","volume-title":"Gaussian Processes for Machine Learning","author":"Rasmussen","year":"2006"},{"key":"2026033014113327000_ref127","doi-asserted-by":"crossref","DOI":"10.1145\/1143844.1143936","article-title":"Maximum margin planning","volume-title":"Proceedings of the 23rd International Conference on Machine Learning","author":"Ratliff","year":"2006"},{"key":"2026033014113327000_ref128","doi-asserted-by":"crossref","DOI":"10.1109\/CDC.1992.371636","article-title":"Bayesian adaptive control of time varying systems","volume-title":"IEEE Conference on Decision and Control","author":"Ravikanth","year":"1992"},{"key":"2026033014113327000_ref129","doi-asserted-by":"crossref","first-page":"816","DOI":"10.1145\/1390156.1390259","article-title":"Online kernel selection for Bayesian reinforcement learning","volume-title":"Proceedings of the 25th International Conference on Machine Learning","author":"Reisinger","year":"2008"},{"key":"2026033014113327000_ref130","article-title":"Model-based Bayesian reinforcement learning in large structured domains","volume-title":"Proceedings of Conference on Uncertainty in Artificial Intelligence","author":"Ross","year":"2008"},{"key":"2026033014113327000_ref131","first-page":"1225","article-title":"Bayes-adaptive POMDPs","volume":"20","author":"Ross","year":"2008","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"2026033014113327000_ref132","doi-asserted-by":"crossref","DOI":"10.1109\/ROBOT.2008.4543641","article-title":"Bayesian reinforcement learning in continuous POMDPs with application to robot navigation","volume-title":"IEEE International Conference on Robotics and 
Automation","author":"Ross","year":"2008"},{"key":"2026033014113327000_ref133","doi-asserted-by":"crossref","first-page":"663","DOI":"10.1613\/jair.2567","article-title":"Online POMDPs","volume":"32","author":"Ross","year":"2008","journal-title":"Journal of Artificial Intelligence Research (JAIR)"},{"key":"2026033014113327000_ref134","article-title":"A Bayesian approach for learning and planning in partially observable Markov decision processes","volume":"12","author":"Ross","year":"2011","journal-title":"Journal of Machine Learning Research"},{"key":"2026033014113327000_ref135","author":"Rummery","year":"1994"},{"issue":"2","key":"2026033014113327000_ref136","doi-asserted-by":"crossref","first-page":"395","DOI":"10.1287\/moor.1100.0446","article-title":"Linearly parameterized bandits","volume":"35","author":"Rusmevichientong","year":"2010","journal-title":"Mathematics of Operations Research"},{"key":"2026033014113327000_ref137","article-title":"Optimal adaptive control of uncertain stochastic discrete linear systems","volume-title":"IEEE International Conference on Systems, Man and Cybernetics","author":"Rusnak","year":"1995"},{"key":"2026033014113327000_ref138","first-page":"101","article-title":"Learning agents for uncertain environments (extended abstract)","volume-title":"Proceedings of the 11th Annual Conference on Computational Learning Theory","author":"Russell","year":"1998"},{"key":"2026033014113327000_ref139","volume-title":"Artificial Intelligence: A Modern Approach (2nd Edition)","author":"Russell","year":"2002"},{"key":"2026033014113327000_ref140","article-title":"An information-theoretic analysis of Thompson sampling","volume-title":"CoRR","author":"Russo","year":"2014"},{"issue":"4","key":"2026033014113327000_ref141","doi-asserted-by":"crossref","first-page":"1221","DOI":"10.1287\/moor.2014.0650","article-title":"Learning to optimize via posterior sampling","volume":"39","author":"Russo","year":"2014","journal-title":"Mathematics of Operations 
Research"},{"key":"2026033014113327000_ref142","volume-title":"Statistical Signal Processing","author":"Scharf","year":"1991"},{"key":"2026033014113327000_ref143","volume-title":"Learning with Kernels","author":"Sch\u00f6lkopf","year":"2002"},{"issue":"6","key":"2026033014113327000_ref144","doi-asserted-by":"crossref","first-page":"639","DOI":"10.1002\/asmb.874","article-title":"A modern Bayesian look at the multi-armed bandit","volume":"26","author":"Scott","year":"2010","journal-title":"Applied Stochastic Models in Business and Industry"},{"key":"2026033014113327000_ref145","article-title":"PAC-Bayesian analysis of contextual bandits","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Seldin","year":"2011"},{"key":"2026033014113327000_ref146","article-title":"PAC-Bayesian analysis of the exploration-exploitation trade-off","volume-title":"ICML Workshop on online trading of exploration and exploitation","author":"Seldin","year":"2011"},{"key":"2026033014113327000_ref147","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511809682","volume-title":"Kernel Methods for Pattern Analysis","author":"Shawe-Taylor","year":"2004"},{"issue":"5","key":"2026033014113327000_ref148","doi-asserted-by":"crossref","first-page":"1071","DOI":"10.1287\/opre.21.5.1071","article-title":"The optimal control of partially observable Markov processes over a finite horizon","volume":"21","author":"Smallwood","year":"1973","journal-title":"Operations Research"},{"key":"2026033014113327000_ref149","article-title":"Variance-based rewards for approximate Bayesian reinforcement learning","volume-title":"Proceedings of Conference on Uncertainty in Artificial Intelligence","author":"Sorg","year":"2010"},{"key":"2026033014113327000_ref150","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1613\/jair.1659","article-title":"Perseus: randomized point-based value iteration for 
POMDPs","volume":"24","author":"Spaan","year":"2005","journal-title":"Journal of Artificial Intelligence Research (JAIR)"},{"key":"2026033014113327000_ref151","doi-asserted-by":"crossref","first-page":"856","DOI":"10.1145\/1102351.1102459","article-title":"A theoretical analysis of model-based interval estimation","volume-title":"International Conference on Machine learning","author":"Strehl","year":"2005"},{"key":"2026033014113327000_ref152","doi-asserted-by":"crossref","first-page":"1209","DOI":"10.1016\/j.jcss.2007.08.009","article-title":"An analysis of model-based interval estimation for Markov decision processes","volume":"74","author":"Strehl","year":"2008","journal-title":"Journal of Computer and System Sciences"},{"key":"2026033014113327000_ref153","article-title":"A Bayesian framework for reinforcement learning","volume-title":"International Conference on Machine Learning","author":"Strens","year":"2000"},{"key":"2026033014113327000_ref154","volume-title":"Temporal credit assignment in reinforcement learning","author":"Sutton","year":"1984"},{"key":"2026033014113327000_ref155","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1023\/A:1022633531479","article-title":"Learning to predict by methods of temporal differences","volume":"3","author":"Sutton","year":"1988","journal-title":"Machine Learning"},{"key":"2026033014113327000_ref156","doi-asserted-by":"crossref","first-page":"160","DOI":"10.1145\/122344.122377","article-title":"DYNA, an integrated architecture for learning, planning, and reacting","volume":"2","author":"Sutton","year":"1991","journal-title":"SIGART Bulletin"},{"key":"2026033014113327000_ref157","doi-asserted-by":"crossref","DOI":"10.1109\/TNN.1998.712192","volume-title":"An Introduction to Reinforcement Learning","author":"Sutton","year":"1998"},{"key":"2026033014113327000_ref158","first-page":"1057","article-title":"Policy gradient methods for reinforcement learning with function approximation","volume-title":"Proceedings of the 
Advances in Neural Information Processing Systems","author":"Sutton","year":"2000"},{"key":"2026033014113327000_ref159","first-page":"1449","article-title":"A game-theoretic approach to apprenticeship learning","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Syed","year":"2008"},{"key":"2026033014113327000_ref160","first-page":"1587","volume-title":"Proceedings of the 22nd ACM international conference on Conference on information & knowledge management","author":"Tang","year":"2013"},{"key":"2026033014113327000_ref161","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1162\/neco.1994.6.2.215","article-title":"TD-Gammon, a self-teaching backgammon program, achieves master-level play","volume":"6","author":"Tesauro","year":"1994","journal-title":"Neural Computation"},{"key":"2026033014113327000_ref162","doi-asserted-by":"crossref","first-page":"285","DOI":"10.1093\/biomet\/25.3-4.285","article-title":"On the likelihood that one unknown probability exceeds another in view of the evidence of two samples","author":"Thompson","year":"1933","journal-title":"Biometrika"},{"key":"2026033014113327000_ref163","first-page":"194","article-title":"A short proof of the Gittins index theorem","volume-title":"The Annals of Applied Probability","author":"Tsitsiklis","year":"1994"},{"issue":"9","key":"2026033014113327000_ref164","doi-asserted-by":"crossref","first-page":"1671","DOI":"10.1016\/j.ins.2011.01.001","article-title":"Hessian matrix distribution for Bayesian policy gradient reinforcement learning","volume":"181","author":"Vien","year":"2011","journal-title":"Information Sciences"},{"key":"2026033014113327000_ref165","doi-asserted-by":"crossref","DOI":"10.1609\/aaai.v24i1.7689","article-title":"Integrating sample-based planning and model-based reinforcement learning","volume-title":"Association for the Advancement of Artificial 
Intelligence","author":"Walsh","year":"2010"},{"key":"2026033014113327000_ref166","doi-asserted-by":"crossref","first-page":"956","DOI":"10.1145\/1102351.1102472","article-title":"Bayesian sparse sampling for on-line reward optimization","volume-title":"International Conference on Machine learning","author":"Wang","year":"2005"},{"key":"2026033014113327000_ref167","volume-title":"Learning from Delayed Rewards","author":"Watkins","year":"1989"},{"key":"2026033014113327000_ref168","doi-asserted-by":"crossref","first-page":"229","DOI":"10.1023\/A:1022672621406","article-title":"Simple statistical gradient-following algorithms for connectionist reinforcement learning","volume":"8","author":"Williams","year":"1992","journal-title":"Machine Learning"},{"key":"2026033014113327000_ref169","doi-asserted-by":"crossref","first-page":"1015","DOI":"10.1145\/1273496.1273624","article-title":"Multi-task reinforcement learning: A hierarchical Bayesian approach","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Wilson","year":"2007"},{"key":"2026033014113327000_ref170","doi-asserted-by":"crossref","DOI":"10.1016\/B978-0-08-042375-3.50010-X","article-title":"Adaptive dual control methods: An overview","volume-title":"5th IFAC symposium on Adaptive Systems in Control and Signal Processing","author":"Wittenmark","year":"1995"},{"key":"2026033014113327000_ref171","article-title":"Discrete-time Bayesian adaptive control problems with complete information","volume-title":"IEEE Conference on Decision and Control","author":"Zane","year":"1992"},{"key":"2026033014113327000_ref172","first-page":"1433","article-title":"Maximum entropy inverse reinforcement learning","volume":"3","author":"Ziebart","year":"2008","journal-title":"Proceedings of the 23rd National Conference on Artificial Intelligence"}],"container-title":["Foundations and Trends\u00ae in Machine 
Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/ftmal\/article-pdf\/8\/5-6\/359\/11155733\/2200000049en.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/www.emerald.com\/ftmal\/article-pdf\/8\/5-6\/359\/11155733\/2200000049en.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,30]],"date-time":"2026-03-30T18:12:09Z","timestamp":1774894329000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.emerald.com\/ftmal\/article\/8\/5-6\/359\/1332416\/Bayesian-Reinforcement-Learning-A-Survey"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,11,26]]},"references-count":172,"journal-issue":{"issue":"5-6","published-print":{"date-parts":[[2015,11,26]]}},"URL":"https:\/\/doi.org\/10.1561\/2200000049","relation":{},"ISSN":["1935-8237","1935-8245"],"issn-type":[{"value":"1935-8237","type":"print"},{"value":"1935-8245","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,11,26]]}}}