{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T13:25:09Z","timestamp":1777728309923,"version":"3.51.4"},"reference-count":51,"publisher":"SAGE Publications","issue":"2","license":[{"start":{"date-parts":[[2022,12,27]],"date-time":"2022-12-27T00:00:00Z","timestamp":1672099200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Intelligenza Artificiale: The international journal of the AIxIA"],"published-print":{"date-parts":[[2022,12,27]]},"abstract":"<jats:p>In this paper, we provide a unified presentation of the Configurable Markov Decision Process (Conf-MDP) framework. A Conf-MDP is an extension of the traditional Markov Decision Process (MDP) that models the possibility to configure some environmental parameters. This configuration activity can be carried out by the learning agent itself or by an external configurator. We introduce a general definition of Conf-MDP, then we particularize it for the cooperative setting, where the configuration is fully functional to the agent\u2019s goals, and non-cooperative setting, in which agent and configurator might have different interests. For both settings, we propose suitable solution concepts. Furthermore, we illustrate how to extend the traditional value functions for MDPs and Bellman operators to this new framework.<\/jats:p>","DOI":"10.3233\/ia-220140","type":"journal-article","created":{"date-parts":[[2022,12,27]],"date-time":"2022-12-27T11:34:05Z","timestamp":1672140845000},"page":"165-184","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":0,"title":["A unified view of configurable Markov Decision Processes: Solution concepts, value functions, and operators"],"prefix":"10.1177","volume":"16","author":[{"given":"Alberto Maria","family":"Metelli","sequence":"first","affiliation":[{"name":"Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci, 32, Milan, Italy"}]}],"member":"179","published-online":{"date-parts":[[2022,12,27]]},"reference":[{"key":"ref001","doi-asserted-by":"publisher","DOI":"10.1023\/A:1013689704352"},{"key":"ref002","doi-asserted-by":"publisher","DOI":"10.1090\/S0002-9904-1954-09848-8"},{"key":"ref003","doi-asserted-by":"publisher","DOI":"10.1023\/A:1004637022496"},{"key":"ref004","doi-asserted-by":"crossref","unstructured":"Krishnendu Chatterjee, Rupak Majumdar and Marcin Jurdzinski, \nOn Nash Equilibria in Stochastic Games. In Computer Science Logic, 18th International Workshop (CSL), pages 26\u201340, 2004.","DOI":"10.1007\/978-3-540-30124-0_6"},{"key":"ref005","unstructured":"Kamil\u00a0Andrzej Ciosek and Shimon Whiteson, OFFER: off-environment reinforcement learning, In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pages 1819\u20131825, AAAI Press, 2017."},{"key":"ref006","doi-asserted-by":"crossref","unstructured":"Vincent Conitzer and Tuomas Sandholm, Computing the optimal strategy to commit to, In Proceedings 7th ACM Conference on Electronic Commerce (EC), pages 82\u201390. ACM, 2006.","DOI":"10.1145\/1134707.1134717"},{"key":"ref007","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2016.2522401"},{"key":"ref008","doi-asserted-by":"crossref","unstructured":"Pierluca D\u2019Oro, Alberto\u00a0Maria Metelli, Andrea Tirinzoni, Matteo Papini and Marcello Restelli, \nGradient-aware model-based policy search, In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), pages 3801\u20133808, AAAI Press, 2020.","DOI":"10.1609\/aaai.v34i04.5791"},{"key":"ref009","unstructured":"Zehao Dou, Zhuoran Yang, Zhaoran Wang and Simon\u00a0S. Du, \nGap-dependent bounds for two-player markov games, CoRR, abs\/2107.00685, 2021."},{"key":"ref010","doi-asserted-by":"crossref","unstructured":"Tuomas Haarnoja, Sehoon Ha, Aurick Zhou, Jie Tan, George Tucker and Sergey Levine, Learning to walk via deep reinforcement learning, In Robotics: Science and Systems XV, University of Freiburg, 2019.","DOI":"10.15607\/RSS.2019.XV.011"},{"key":"ref011","doi-asserted-by":"crossref","unstructured":"Sarah Keren, Luis\u00a0Enrique Pineda, Avigdor Gal, Erez Karpas and Shlomo Zilberstein, \nEqui-reward utility maximizing design in stochastic environments, In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 4353\u20134360. ijcai.org, 2017.","DOI":"10.24963\/ijcai.2017\/608"},{"key":"ref012","unstructured":"Erwan Lecarpentier and Emmanuel Rachelson, \nNon-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019)), pages 7214\u20137223, 2019."},{"key":"ref013","unstructured":"Boris Lesner and Bruno Scherrer, \nNon-stationary approximate modified policy iteration, In Proceedings of the 32nd International Conference on Machine Learning (ICML)), volume 37, pages 1567\u20131575. JMLR.org, 2015."},{"key":"ref014","doi-asserted-by":"publisher","DOI":"10.1016\/j.robot.2020.103568"},{"key":"ref015","author":"Littman Michael L.","year":"1994","journal-title":"Proceedings of the Eleventh International Conference (ICML)"},{"key":"ref016","unstructured":"Yuzhe Ma, Xuezhou Zhang, Wen Sun and Jerry Zhu, \nPolicy poisoning in batch reinforcement learning and control, In Advances in Neural Information Processing Systems 32 (NeurIPS)), pages 14543\u201314553, 2019."},{"key":"ref017","doi-asserted-by":"publisher","DOI":"10.3233\/FAIA361"},{"key":"ref018","doi-asserted-by":"crossref","unstructured":"Alberto\u00a0Maria Metelli, \nConfigurable environments in reinforcement learning: An overview, In Luigi Piroddi, editor, Special Topics in Information Technology, pages 101\u2013113, Cham, 2022. Springer International Publishing.","DOI":"10.1007\/978-3-030-85918-3_9"},{"key":"ref019","unstructured":"Alberto\u00a0Maria Metelli, Emanuele Ghelfi and Marcello Restelli, \nReinforcement learning in configurable continuous environments, In Proceedings of the 36th International Conference on Machine Learning, (ICML), volume\u00a097, pages 4546\u20134555. PMLR, 2019."},{"key":"ref020","unstructured":"Alberto\u00a0Maria Metelli, Amarildo Likmeta and Marcello Restelli, \nPropagating uncertainty in reinforcement learning via wasserstein barycenters, In Advances in Neural Information Processing Systems 32 (NeurIPS)), pages 4335\u20134347, 2019."},{"key":"ref021","doi-asserted-by":"publisher","DOI":"10.1007\/s10994-021-06033-3"},{"key":"ref022","unstructured":"Alberto\u00a0Maria Metelli, Flavio Mazzolini, Lorenzo Bisi, Luca Sabbioni and Marcello Restelli, \nControl frequency adaptation via action persistence in batch reinforcement learning, In Proceedings of the 37th International Conference on Machine Learning (ICML)), volume 119, pages 6862\u20136873. PMLR, 2020."},{"key":"ref023","unstructured":"Alberto\u00a0Maria Metelli, Mirco Mutti and Marcello Restelli, \nConfigurable Markov decision processes, In Proceedings of the 35th International Conference on Machine Learning (ICML)), volume\u00a080, pages 3488\u20133497. PMLR, 2018."},{"key":"ref024","doi-asserted-by":"crossref","unstructured":"Alberto\u00a0Maria Metelli, Matteo Papini, Pierluca D\u2019Oro and Marcello Restelli, \nPolicy optimization as online learning with mediator feedback, In The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), pages 8958\u20138966. \nAAAI Press, 2021.","DOI":"10.1609\/aaai.v35i10.17083"},{"key":"ref025","unstructured":"Alberto\u00a0Maria Metelli, Matteo Papini, Francesco Faccio and Marcello Restelli, \nPolicy optimization via importance sampling, In Advances in Neural Information Processing Systems 31 (NeurIPS)), pages 5447\u20135459, 2018."},{"issue":"141","key":"ref026","first-page":"1","volume":"21","year":"2020","journal-title":"Journal of Machine Learning Research"},{"issue":"1","key":"ref027","first-page":"91","volume":"14","year":"2020","journal-title":"Intelligenza Artificiale"},{"key":"ref028","unstructured":"Alberto\u00a0Maria Metelli, Giorgia Ramponi, Alessandro Concetti and Marcello Restelli, \nProvably efficient learning of transferable rewards, In Proceedings of the 38th International Conference on Machine Learning (ICML) ), volume 139, pages 7665\u20137676. PMLR, 2021."},{"key":"ref029","unstructured":"Alberto\u00a0Maria Metelli, Alessio Russo and Marcello Restelli, \nSubgaussian and differentiable importance sampling for off-policy evaluation and learning, In Advances in Neural Information Processing Systems 34 (NeurIPS). 2021."},{"key":"ref030","doi-asserted-by":"publisher","DOI":"10.1038\/nature14236"},{"key":"ref031","doi-asserted-by":"publisher","DOI":"10.1137\/040614384"},{"key":"ref032","doi-asserted-by":"publisher","DOI":"10.2307\/1969529"},{"key":"ref033","unstructured":"Andrew Y. Ng and Stuart J. Russell, Algorithms for inverse reinforcement learning, In Proceedings of the Seventeenth International Conference on Machine Learning (ICML)), pages 663\u2013670. Morgan Kaufmann, 2000."},{"key":"ref034","unstructured":"Matteo Papini, Alberto\u00a0Maria Metelli, Lorenzo Lupo and Marcello Restelli, \nOptimistic policy optimization via multiple importance sampling, In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 4989\u20134999. PMLR, 2019."},{"key":"ref035","doi-asserted-by":"publisher","DOI":"10.1007\/BF01737555"},{"key":"ref036","unstructured":"Doina Precup, Richard S. Sutton and Satinder P. Singh, Eligibility traces for off-policy policy evaluation, In Pat Langley, editor,\n                      Proceedings of the Seventeenth International Conference on Machine Learning (ICML)\n                      , pages 759\u2013766. Morgan Kaufmann, 2000."},{"key":"ref037","unstructured":"Martin\u00a0L Puterman, Markov decision processes: discrete stochastic dynamic programming, John Wiley & Sons, 2014."},{"key":"ref038","unstructured":"Giorgia Ramponi, Amarildo Likmeta, Alberto\u00a0Maria Metelli, Andrea Tirinzoni and Marcello Restelli, \nTruly batch model-free inverse reinforcement learning about multiple intentions, In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108, pages 2359\u20132369. PMLR, 2020."},{"key":"ref039","unstructured":"Giorgia Ramponi, Alberto\u00a0Maria Metelli, Alessandro Concetti and Marcello Restelli, \nLearning in non-cooperative configurable markov decision processes, In Advances in Neural Information Processing Systems 34 (NeurIPS), 2021."},{"key":"ref040","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.39.10.1095"},{"key":"ref041","doi-asserted-by":"crossref","unstructured":"Rui Silva, Gabriele Farina, Francisco\u00a0S. Melo and Manuela Veloso, \nA theoretical and algorithmic analysis of configurable mdps, In Proceedings of the Twenty-Ninth International Conference on Automated Planning and Scheduling (ICAPS)), pages 455\u2013463. AAAI Press, 2019.","DOI":"10.1609\/icaps.v29i1.3551"},{"key":"ref042","doi-asserted-by":"crossref","unstructured":"Rui Silva, Francisco\u00a0S. Melo and Manuela Veloso, \nWhat if the world were different? gradient-based exploration for new optimal policies, In 4th Global Conference on Artificial Intelligence (GCAI)), volume 55 of EPiC Series in Computing, pages 229\u2013242. EasyChair, 2018.","DOI":"10.29007\/6jsv"},{"key":"ref043","unstructured":"Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018."},{"key":"ref044","unstructured":"Csaba Szepesv\u00e1ri and Michael\u00a0L. Littman, \nGeneralized markov decision processes: Dynamic-programming and reinforcement-learning algorithms, In Proceedings of International Conference of Machine Learning (ICML), volume\u00a096, 1996."},{"key":"ref045","unstructured":"Heinrich Von\u00a0Stackelberg, Marktform und gleichgewicht, J. Springer, 1934."},{"key":"ref046","unstructured":"Yevgeniy Vorobeychik and Satinder\u00a0P. Singh, \nComputing Stackelberg equilibria in discounted stochastic games, In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2012."},{"key":"ref047","unstructured":"Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel and Jimmy Ba, \nBenchmarking model-based reinforcement learning, CoRR, abs\/1907.02057, 2019."},{"key":"ref048","unstructured":"Haoqi Zhang, Yiling Chen and David\u00a0C. Parkes, \nA general approach to environment design with one agent, In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI)), pages 2002\u20132014, 2009."},{"key":"ref049","unstructured":"Haoqi Zhang and David\u00a0C. Parkes, \nValue-based policy teaching with active indirect elicitation, In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI), pages 208\u2013214. AAAI Press, 2008."},{"key":"ref050","doi-asserted-by":"crossref","unstructured":"Haoqi Zhang, David\u00a0C. Parkes and Yiling Chen, \nPolicy teaching through reward function learning. In Proceedings 10th ACM Conference on Electronic Commerce (EC), pages 295\u2013304. ACM, 2009.","DOI":"10.1145\/1566374.1566417"},{"key":"ref051","unstructured":"Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell and Anind K. Dey, Maximum entropy inverse reinforcement learning, In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI), pages 1433\u20131438. AAAI Press, 2008."}],"container-title":["Intelligenza Artificiale: The international journal of the AIxIA"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/IA-220140","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.3233\/IA-220140","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/IA-220140","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T10:51:45Z","timestamp":1777459905000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.3233\/IA-220140"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,27]]},"references-count":51,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,12,27]]}},"alternative-id":["10.3233\/IA-220140"],"URL":"https:\/\/doi.org\/10.3233\/ia-220140","relation":{},"ISSN":["1724-8035","2211-0097"],"issn-type":[{"value":"1724-8035","type":"print"},{"value":"2211-0097","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,12,27]]}}}