{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T00:38:54Z","timestamp":1769560734724,"version":"3.49.0"},"reference-count":47,"publisher":"Springer Science and Business Media LLC","issue":"9","license":[{"start":{"date-parts":[[2021,5,12]],"date-time":"2021-05-12T00:00:00Z","timestamp":1620777600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,5,12]],"date-time":"2021-05-12T00:00:00Z","timestamp":1620777600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2021,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differential convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). 
Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.<\/jats:p>","DOI":"10.1007\/s10994-021-05984-x","type":"journal-article","created":{"date-parts":[[2021,5,12]],"date-time":"2021-05-12T19:02:41Z","timestamp":1620846161000},"page":"2295-2334","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Inverse reinforcement learning in contextual MDPs"],"prefix":"10.1007","volume":"110","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0923-3217","authenticated-orcid":false,"given":"Stav","family":"Belogolovsky","sequence":"first","affiliation":[]},{"given":"Philip","family":"Korsunsky","sequence":"additional","affiliation":[]},{"given":"Shie","family":"Mannor","sequence":"additional","affiliation":[]},{"given":"Chen","family":"Tessler","sequence":"additional","affiliation":[]},{"given":"Tom","family":"Zahavy","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,5,12]]},"reference":[{"key":"5984_CR1","doi-asserted-by":"crossref","unstructured":"Abbeel, P., & Ng, A.Y.(2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning (pp. 1). ACM.","DOI":"10.1145\/1015330.1015430"},{"key":"5984_CR2","doi-asserted-by":"publisher","unstructured":"Abbeel, P., & Ng, A.Y. (2005). Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML \u201905 (pp. 1-8). New York, NY, USA: Association for Computing Machinery. ISBN 1595931805. https:\/\/doi.org\/10.1145\/1102351.1102352.","DOI":"10.1145\/1102351.1102352"},{"key":"5984_CR3","unstructured":"Amin, K., Jiang, N., & Singh, S. (2017). 
Repeated inverse reinforcement learning. Advances in Neural Information Processing Systems, 1815\u20131824."},{"key":"5984_CR4","unstructured":"Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., & Silver, D. (2017). Successor features for transfer in reinforcement learning. Advances in neural information processing systems, 4055\u20134065."},{"key":"5984_CR5","doi-asserted-by":"publisher","first-page":"167","DOI":"10.1016\/S0167-6377(02)00231-6","volume":"31","author":"A Beck","year":"2003","unstructured":"Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31, 167\u2013175.","journal-title":"Operations Research Letters"},{"issue":"3","key":"5984_CR6","doi-asserted-by":"publisher","first-page":"E172","DOI":"10.21037\/jtd.2016.02.57","volume":"8","author":"SC Berngard","year":"2016","unstructured":"Berngard, S. C., Beitler, J. R., & Malhotra, A. (2016). Personalizing mechanical ventilation for acute respiratory distress syndrome. Journal of thoracic disease, 8(3), E172.","journal-title":"Journal of thoracic disease"},{"issue":"3","key":"5984_CR7","doi-asserted-by":"publisher","first-page":"334","DOI":"10.1057\/palgrave.jors.2600425","volume":"48","author":"DP Bertsekas","year":"1997","unstructured":"Bertsekas, D. P. (1997). Nonlinear programming. Journal of the Operational Research Society, 48(3), 334\u2013334.","journal-title":"Journal of the Operational Research Society"},{"key":"5984_CR8","unstructured":"Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., & Zhang, J., et\u00a0al. (2016). End to end learning for self-driving cars. arXiv preprintarXiv:1604.07316 ."},{"key":"5984_CR9","volume-title":"Linear controller design: Limits of performance","author":"SP Boyd","year":"1991","unstructured":"Boyd, S. P., & Barratt, C. H. (1991). 
Linear controller design: Limits of performance. Hoboken: Prentice Hall Englewood Cliffs."},{"key":"5984_CR10","doi-asserted-by":"crossref","unstructured":"Bubeck, S. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends\u00ae in Machine Learning, 8(3\u20134), 231\u2013357.","DOI":"10.1561\/2200000050"},{"key":"5984_CR11","doi-asserted-by":"publisher","first-page":"447","DOI":"10.1146\/annurev-statistics-022513-115553","volume":"1","author":"B Chakraborty","year":"2014","unstructured":"Chakraborty, B., & Murphy, S. A. (2014). Dynamic treatment regimes. Annual Review of Statistics and its Application, 1, 447\u2013464.","journal-title":"Annual Review of Statistics and its Application"},{"key":"5984_CR12","unstructured":"Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 1126\u20131135). JMLR. org."},{"issue":"3","key":"5984_CR13","doi-asserted-by":"publisher","first-page":"1493","DOI":"10.1137\/140985366","volume":"26","author":"D Garber","year":"2016","unstructured":"Garber, D., & Hazan, E. (2016). A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3), 1493\u20131528.","journal-title":"SIAM Journal on Optimization"},{"key":"5984_CR14","unstructured":"Ghasemipour, S. K. S., Gu, S. S., & Zemel, R. (2019). Smile: Scalable meta inverse reinforcement learning through context-conditional policies. Advances in Neural Information Processing Systems, 7879\u20137889."},{"key":"5984_CR15","unstructured":"Hallak, A., Di Castro, D., & Mannor, S. (2015). Contextual markov decision processes. arXiv preprintarXiv:1502.02259."},{"key":"5984_CR16","doi-asserted-by":"crossref","unstructured":"Hazan, E. (2016). Introduction to online convex optimization. 
Foundations and Trends\u00ae in Optimization, 2(3\u20134), 157\u2013325.","DOI":"10.1561\/2400000013"},{"key":"5984_CR17","unstructured":"Ho, J. & Ermon, S (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565\u20134573."},{"key":"5984_CR18","doi-asserted-by":"crossref","unstructured":"Itenov, T., Murray, D., & Jensen, J. (2018). Sepsis: Personalized medicine utilizing \u2018omic\u2019technologies\u2013a paradigm shift? In Healthcare (pp. 111). Multidisciplinary Digital Publishing Institute.","DOI":"10.3390\/healthcare6030111"},{"key":"5984_CR19","unstructured":"Jaggi, M. (2013). Revisiting frank-wolfe: Projection-free sparse convex optimization."},{"key":"5984_CR20","unstructured":"Jeter, R., Josef, C., Shashikumar, S., & Nemati, S. (2019). Does the \u201cartificial intelligence clinician\u201d learn optimal treatment strategies for sepsis in intensive care?. URL https:\/\/github.com\/point85AI\/Policy-Iteration-AI-Clinician.git."},{"key":"5984_CR21","doi-asserted-by":"publisher","first-page":"160035","DOI":"10.1038\/sdata.2016.35","volume":"3","author":"AE Johnson","year":"2016","unstructured":"Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-W.H., Feng, M., Ghassemi, M., et al. (2016). Mimic-iii, a freely accessible critical care database. Scientific Data, 3, 160035. https:\/\/doi.org\/10.1038\/sdata.2016.35.","journal-title":"Scientific Data"},{"key":"5984_CR22","unstructured":"Juskalian, R., Regalado, A., Orcutt, M., Piore, A., Rotman, D., Patel, N. V., Lichfield, G., Hao, K., Chen, A., & Temple, J. (2020). Mit technology review. URL https:\/\/www.technologyreview.com\/lists\/technologies\/2020\/."},{"key":"5984_CR23","unstructured":"Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. 
International conference on Machine learning, 267\u2013274."},{"issue":"2\u20133","key":"5984_CR24","doi-asserted-by":"publisher","first-page":"209","DOI":"10.1023\/A:1017984413808","volume":"49","author":"M Kearns","year":"2002","unstructured":"Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2\u20133), 209\u2013232.","journal-title":"Machine Learning"},{"issue":"11","key":"5984_CR25","doi-asserted-by":"publisher","first-page":"1716","DOI":"10.1038\/s41591-018-0213-5","volume":"24","author":"M Komorowski","year":"2018","unstructured":"Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., & Faisal, A. A. (2018). The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11), 1716.","journal-title":"Nature Medicine"},{"key":"5984_CR26","unstructured":"Laskey, M., Lee, J., Hsieh, W., Liaw, R., Mahler, J., Fox, R., & Goldberg, K. (2017). Iterative noise injection for scalable imitation learning. arXiv preprintarXiv:1703.09327 ."},{"key":"5984_CR27","doi-asserted-by":"crossref","unstructured":"Lee, D., Srinivasan, S., & Doshi-Velez, F. (2019). Truly batch apprenticeship learning with deep successor features. arXiv preprintarXiv:1903.10077 .","DOI":"10.24963\/ijcai.2019\/819"},{"key":"5984_CR28","unstructured":"MacQueen, J. et\u00a0al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability (pp. 281\u2013297). Oakland, CA, USA."},{"key":"5984_CR29","unstructured":"Modi, A. & Tewari, A. (2019). Contextual markov decision processes using generalized linear models. arXiv preprintarXiv:1903.06187 ."},{"key":"5984_CR30","unstructured":"Modi, A., Jiang, N., Singh, S., & Tewari, A. (2018). Markov decision processes with continuous side information. 
Algorithmic Learning Theory, 597\u2013618."},{"key":"5984_CR31","volume-title":"In Problem complexity and method efficiency in optimization","author":"AS Nemirovsky","year":"1983","unstructured":"Nemirovsky, A. S., & Yudin, D. B. (1983). In Problem complexity and method efficiency in optimization. New York: Wiley."},{"issue":"2","key":"5984_CR32","doi-asserted-by":"publisher","first-page":"527","DOI":"10.1007\/s10208-015-9296-2","volume":"17","author":"Y Nesterov","year":"2017","unstructured":"Nesterov, Y., & Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2), 527\u2013566.","journal-title":"Foundations of Computational Mathematics"},{"key":"5984_CR33","first-page":"2","volume":"1","author":"AY Ng","year":"2000","unstructured":"Ng, A. Y., & Russell, S. J. (2000). Algorithms for inverse reinforcement learning. ICML, 1, 2.","journal-title":"ICML"},{"key":"5984_CR34","unstructured":"Pomerleau, D. A. (1989). Alvinn: An autonomous land vehicle in a neural network. Advances in Neural Information Processing Systems, pp. 305\u2013313."},{"key":"5984_CR35","unstructured":"Prasad, N., Cheng, L.-F., Chivers, C., Draugelis, M., & Engelhardt, B. E. (2017). A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. UAI."},{"key":"5984_CR36","doi-asserted-by":"publisher","DOI":"10.1002\/9780470316887","volume-title":"Markov decision processes: Discrete stochastic dynamic programming","author":"ML Puterman","year":"1994","unstructured":"Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. London: John Wiley & Sons."},{"key":"5984_CR37","doi-asserted-by":"crossref","unstructured":"Ratliff, N., Bagnell, J. A., & Srinivasa, S. S. (2007). Imitation learning for locomotion and manipulation. In 2007 7th IEEE-RAS International Conference on Humanoid Robots (pp. 392\u2013397). 
IEEE.","DOI":"10.1109\/ICHR.2007.4813899"},{"key":"5984_CR38","doi-asserted-by":"publisher","first-page":"400","DOI":"10.1214\/aoms\/1177729586","volume":"22","author":"H Robbins","year":"1951","unstructured":"Robbins, H., & Monro, S. (1951). A stochastic approximation method. The annals of Mathematical Statistics, 22, 400\u2013407.","journal-title":"The annals of Mathematical Statistics"},{"key":"5984_CR39","unstructured":"Ross, S., & Bagnell, D. (2010). Efficient reductions for imitation learning. Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 661\u2013668)."},{"key":"5984_CR40","unstructured":"Ross, S., Gordon, G., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics (pp. 627\u2013635)."},{"key":"5984_CR41","unstructured":"Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprintarXiv:1703.03864 ."},{"key":"5984_CR42","doi-asserted-by":"crossref","unstructured":"Syed, U., & Schapire, R. E. (2008). A game-theoretic approach to apprenticeship learning. Advances in Neural Information Processing Systems, 1449\u20131456.","DOI":"10.1145\/1390156.1390286"},{"key":"5984_CR43","doi-asserted-by":"publisher","first-page":"706","DOI":"10.1016\/j.bja.2018.04.036","volume":"121","author":"E Wesselink","year":"2018","unstructured":"Wesselink, E., Kappen, T., Torn, H., Slooter, A., & van Klei, W. (2018). Intraoperative hypotension and the risk of postoperative adverse outcomes: a systematic review. British Journal of Anaesthesia, 121, 706\u2013721.","journal-title":"British Journal of Anaesthesia"},{"key":"5984_CR44","unstructured":"Xu, K., Ratner, E., Dragan, A., Levine, S., & Finn, C. (2018). Learning a prior over intent via meta-inverse reinforcement learning. 
arXiv preprintarXiv:1805.12573 ."},{"key":"5984_CR45","doi-asserted-by":"crossref","unstructured":"Zahavy, T., Cohen, A., Kaplan, H., Mansour, Y. (2020). Apprenticeship learning via frank-wolfe.","DOI":"10.1609\/aaai.v34i04.6150"},{"key":"5984_CR46","unstructured":"Zahavy, T., Cohen, A., Kaplan, H., & Mansour, Y. (2020). Average reward reinforcement learning with unknown mixing times. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial. (Intelligence)."},{"key":"5984_CR47","unstructured":"Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03) (pp. 928\u2013936)."}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-021-05984-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-021-05984-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-021-05984-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,8,30]],"date-time":"2021-08-30T18:21:06Z","timestamp":1630347666000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-021-05984-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,12]]},"references-count":47,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2021,9]]}},"alternative-id":["5984"],"URL":"https:\/\/doi.org\/10.1007\/s10994-021-05984-x","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"value":"0885-6125","type":"print"},{"value":"1573-0565","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,5,12]]},"assertion":[{"value":"15 May 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 September 2020","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 April 2021","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 May 2021","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}