{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T10:11:18Z","timestamp":1760609478333,"version":"build-2065373602"},"reference-count":53,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2023,1,18]],"date-time":"2023-01-18T00:00:00Z","timestamp":1674000000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Contextual bandits can solve a huge range of real-world problems. However, current popular algorithms to solve them either rely on linear models or unreliable uncertainty estimation in non-linear models, which are required to deal with the exploration\u2013exploitation trade-off. Inspired by theories of human cognition, we introduce novel techniques that use maximum entropy exploration, relying on neural networks to find optimal policies in settings with both continuous and discrete action spaces. We present two classes of models, one with neural networks as reward estimators, and the other with energy based models, which model the probability of obtaining an optimal reward given an action. We evaluate the performance of these models in static and dynamic contextual bandit simulation environments. We show that both techniques outperform standard baseline algorithms, where energy based models have the best overall performance. This provides practitioners with new techniques that perform well in static and dynamic settings, and are particularly well suited to non-linear scenarios with continuous action spaces.<\/jats:p>","DOI":"10.3390\/e25020188","type":"journal-article","created":{"date-parts":[[2023,1,18]],"date-time":"2023-01-18T01:57:57Z","timestamp":1674007077000},"page":"188","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Maximum Entropy Exploration in Contextual Bandits with Neural Networks and Energy Based Models"],"prefix":"10.3390","volume":"25","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6706-8443","authenticated-orcid":false,"given":"Adam","family":"Elwood","sequence":"first","affiliation":[{"name":"lastminute.com Group, Vicolo de Calvi, 2, 6830 Chiasso, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7989-1162","authenticated-orcid":false,"given":"Marco","family":"Leonardi","sequence":"additional","affiliation":[{"name":"lastminute.com Group, Vicolo de Calvi, 2, 6830 Chiasso, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6753-7254","authenticated-orcid":false,"given":"Ashraf","family":"Mohamed","sequence":"additional","affiliation":[{"name":"lastminute.com Group, Vicolo de Calvi, 2, 6830 Chiasso, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6269-1351","authenticated-orcid":false,"given":"Alessandro","family":"Rozza","sequence":"additional","affiliation":[{"name":"lastminute.com Group, Vicolo de Calvi, 2, 6830 Chiasso, Switzerland"}]}],"member":"1968","published-online":{"date-parts":[[2023,1,18]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"484","DOI":"10.1038\/nature16961","article-title":"Mastering the game of Go with deep neural networks and tree search","volume":"529","author":"Silver","year":"2016","journal-title":"Nature"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1016\/j.eswa.2017.12.020","article-title":"The use of machine learning algorithms in recommender systems: A systematic review","volume":"97","author":"Portugal","year":"2018","journal-title":"Expert Syst. Appl."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"160","DOI":"10.1007\/s42979-021-00592-x","article-title":"Machine Learning: Algorithms, Real-World Applications and Research Directions","volume":"2","author":"Sarker","year":"2021","journal-title":"SN Comput. Sci."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Bouneffouf, D., Rish, I., and Aggarwal, C. (2020, January 19\u201324). Survey on applications of multi-armed and contextual bandits. Proceedings of the 2020 IEEE Congress on Evolutionary Computation (CEC), Glasgow, UK.","DOI":"10.1109\/CEC48606.2020.9185782"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"196","DOI":"10.1016\/j.ijar.2018.04.006","article-title":"Improving multi-armed bandit algorithms in online pricing settings","volume":"98","author":"Paladino","year":"2018","journal-title":"Int. J. Approx. Reason."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Xu, X., Dong, F., Li, Y., He, S., and Li, X. (2020, January 7\u201312). Contextual-bandit based personalized recommendation with time-varying user interests. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i04.6125"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Nuara, A., Trovo, F., Gatti, N., and Restelli, M. (2018, January 2\u20137). A combinatorial-bandit algorithm for the online joint bid\/budget optimization of pay-per-click advertising campaigns. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11888"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Gatti, N., Lazaric, A., and Trovo, F. (2012, January 4\u20138). A truthful learning mechanism for contextual multi-slot sponsored search auctions with externalities. Proceedings of the 13th ACM Conference on Electronic Commerce, Valencia, Spain.","DOI":"10.1145\/2229012.2229057"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Gasparini, M., Nuara, A., Trov\u00f2, F., Gatti, N., and Restelli, M. (2018, January 8\u201313). Targeting optimization for internet advertising by learning from logged bandit feedback. Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil.","DOI":"10.1109\/IJCNN.2018.8489092"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"285","DOI":"10.1093\/biomet\/25.3-4.285","article-title":"On the likelihood that one unknown probability exceeds another in view of the evidence of two samples","volume":"25","author":"Thompson","year":"1933","journal-title":"Biometrika"},{"key":"ref_11","unstructured":"Agrawal, S., and Goyal, N. (2013, January 16\u201321). Thompson sampling for contextual bandits with linear payoffs. Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA."},{"key":"ref_12","first-page":"100","article-title":"Thompson Sampling for Complex Online Problems","volume":"Volume 32","author":"Xing","year":"2014","journal-title":"Proceedings of the 31st International Conference on Machine Learning"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1016\/j.jphysparis.2006.10.001","article-title":"A free energy principle for the brain","volume":"100","author":"Friston","year":"2006","journal-title":"J. Physiol. Paris"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"293","DOI":"10.1016\/j.tics.2009.04.005","article-title":"The free-energy principle: A rough guide to the brain?","volume":"13","author":"Friston","year":"2009","journal-title":"Trends Cogn. Sci."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"127","DOI":"10.1038\/nrn2787","article-title":"The free-energy principle: A unified brain theory?","volume":"11","author":"Friston","year":"2010","journal-title":"Nat. Rev. Neurosci."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"43","DOI":"10.3389\/fpsyg.2012.00043","article-title":"Free-Energy and Illusions: The Cornsweet Effect","volume":"3","author":"Brown","year":"2012","journal-title":"Front. Psychol."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"611","DOI":"10.1007\/s00429-012-0475-5","article-title":"Predictions not commands: Active inference in the motor system","volume":"218","author":"Adams","year":"2013","journal-title":"Brain Struct. Funct."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"710","DOI":"10.3389\/fpsyg.2013.00710","article-title":"Exploration, novelty, surprise, and free energy minimization","volume":"4","author":"Schwartenbeck","year":"2013","journal-title":"Front. Psychol."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"229","DOI":"10.1016\/j.neunet.2021.08.018","article-title":"An empirical evaluation of active inference in multi-armed bandits","volume":"144","author":"Kiebel","year":"2021","journal-title":"Neural Netw."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"102632","DOI":"10.1016\/j.jmp.2021.102632","article-title":"A step-by-step tutorial on active inference and its application to empirical data","volume":"107","author":"Smith","year":"2022","journal-title":"J. Math. Psychol."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Lee, K., Choy, J., Choi, Y., Kee, H., and Oh, S. (January, January 24). No-Regret Shannon Entropy Regularized Neural Contextual Bandit Online Learning for Robotic Grasping. Proceedings of the 2020 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA.","DOI":"10.1109\/IROS45743.2020.9341123"},{"key":"ref_22","unstructured":"Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv."},{"key":"ref_23","unstructured":"Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017, January 6\u201311). Reinforcement learning with deep energy-based policies. Proceedings of the International Conference on Machine Learning, Sydney, Australia."},{"key":"ref_24","unstructured":"Du, Y., Lin, T., and Mordatch, I. (2019). Model Based Planning with Energy Based Models. arXiv."},{"key":"ref_25","first-page":"1","article-title":"A Contextual Bandit Bake-off","volume":"22","author":"Bietti","year":"2021","journal-title":"J. Mach. Learn. Res."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Cavenaghi, E., Sottocornola, G., Stella, F., and Zanker, M. (2021). Non stationary multi-armed bandit: Empirical evaluation of a new concept drift-aware algorithm. Entropy, 23.","DOI":"10.3390\/e23030380"},{"key":"ref_27","unstructured":"Abbasi-Yadkori, Y., P\u00e1l, D., and Szepesv\u00e1ri, C. (2011, January 12\u201315). Improved algorithms for linear stochastic bandits. Proceedings of the Advances in Neural Information Processing Systems, Granada, Spain."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1016\/0196-8858(85)90002-8","article-title":"Asymptotically efficient adaptive allocation rules","volume":"6","author":"Lai","year":"1985","journal-title":"Adv. Appl. Math."},{"key":"ref_29","unstructured":"Riquelme, C., Tucker, G., and Snoek, J. (2018). Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. arXiv."},{"key":"ref_30","unstructured":"Zhou, D., Li, L., and Gu, Q. (2020, January 13\u201318). Neural contextual bandits with ucb-based exploration. Proceedings of the International Conference on Machine Learning, Virtual."},{"key":"ref_31","unstructured":"Zhang, W., Zhou, D., Li, L., and Gu, Q. (2020). Neural thompson sampling. arXiv."},{"key":"ref_32","unstructured":"Kassraie, P., and Krause, A. (2022, January 28\u201330). Neural contextual bandits without regret. Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"237","DOI":"10.1613\/jair.301","article-title":"Reinforcement learning: A survey","volume":"4","author":"Kaelbling","year":"1996","journal-title":"J. Artif. Intell. Res."},{"key":"ref_34","unstructured":"Sutton, R.S., and Barto, A.G. (2018). Reinforcement Learning: An Introduction, MIT Press."},{"key":"ref_35","unstructured":"Kuleshov, V., and Precup, D. (2014). Algorithms for multi-armed bandit problems. arXiv."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. (2006). A tutorial on energy-based learning. Predicting Structured Data, MIT Press.","DOI":"10.7551\/mitpress\/7443.003.0014"},{"key":"ref_37","unstructured":"Grathwohl, W., Wang, K.C., Jacobsen, J.H., Duvenaud, D., Norouzi, M., and Swersky, K. (2019). Your classifier is secretly an energy based model and you should treat it like one. arXiv."},{"key":"ref_38","first-page":"45","article-title":"Actor-Critic Reinforcement Learning with Energy-Based Policies","volume":"Volume 24","author":"Deisenroth","year":"2013","journal-title":"Proceedings of the Tenth European Workshop on Reinforcement Learning"},{"key":"ref_39","unstructured":"Cesa-Bianchi, N., Gentile, C., Lugosi, G., and Neu, G. (2017, January 4\u20139). Boltzmann exploration done right. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Degris, T., Pilarski, P.M., and Sutton, R.S. (2012, January 27\u201329). Model-free reinforcement learning with continuous action in practice. Proceedings of the 2012 American Control Conference (ACC), Montreal, QC, Canada.","DOI":"10.1109\/ACC.2012.6315022"},{"key":"ref_41","first-page":"2","article-title":"MCMC using Hamiltonian dynamics","volume":"2","author":"Neal","year":"2011","journal-title":"Handb. Markov Chain. Monte Carlo"},{"key":"ref_42","first-page":"2","article-title":"Hamiltonian Monte Carlo for hierarchical models","volume":"79","author":"Betancourt","year":"2015","journal-title":"Curr. Trends Bayesian Methodol. Appl."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1214\/aos\/1018031103","article-title":"Convergence of a stochastic approximation version of the EM algorithm","volume":"27","author":"Delyon","year":"1999","journal-title":"Ann. Stat."},{"key":"ref_44","unstructured":"Dillon, J.V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., and Saurous, R.A. (2017). Tensorflow distributions. arXiv."},{"key":"ref_45","unstructured":"Moerland, T.M., Broekens, J., and Jonker, C.M. (2020). Model-based reinforcement learning: A survey. arXiv."},{"key":"ref_46","unstructured":"Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R.H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., and Levine, S. (2019). Model-based reinforcement learning for atari. arXiv."},{"key":"ref_47","unstructured":"Boney, R., Kannala, J., and Ilin, A. (2020, January 16\u201318). Regularizing model-based planning with energy-based models. Proceedings of the Conference on Robot Learning, Virtual."},{"key":"ref_48","unstructured":"Du, Y., and Mordatch, I. (2019). Implicit generation and generalization in energy-based models. arXiv."},{"key":"ref_49","unstructured":"Song, Y., and Ermon, S. (2019, January 8\u201314). Generative modeling by estimating gradients of the data distribution. Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_50","unstructured":"Xie, J., Lu, Y., Zhu, S.C., and Wu, Y. (2016, January 19\u201324). A theory of generative convnet. Proceedings of the International Conference on Machine Learning, New York, NY, USA."},{"key":"ref_51","unstructured":"Lippe, P. (2022, July 22). Tutorial 8: Deep Energy-Based Generative Models. Available online: https:\/\/uvadlc-notebooks.readthedocs.io\/en\/latest\/tutorial_notebooks\/tutorial8\/Deep_Energy_Models.html."},{"key":"ref_52","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_53","unstructured":"Duvenaud, D., Kelly, J., Swersky, K., Hashemi, M., Norouzi, M., and Grathwohl, W. (2021). No MCMC for Me: Amortized Samplers for Fast and Stable Training of Energy-Based Models. arXiv."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/25\/2\/188\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:08:48Z","timestamp":1760119728000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/25\/2\/188"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,18]]},"references-count":53,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2023,2]]}},"alternative-id":["e25020188"],"URL":"https:\/\/doi.org\/10.3390\/e25020188","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2023,1,18]]}}}