{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,18]],"date-time":"2026-01-18T10:32:31Z","timestamp":1768732351829,"version":"3.49.0"},"reference-count":66,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2021,5,7]],"date-time":"2021-05-07T00:00:00Z","timestamp":1620345600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,5,7]],"date-time":"2021-05-07T00:00:00Z","timestamp":1620345600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000083","name":"Directorate for Computer and Information Science and Engineering","doi-asserted-by":"publisher","award":["IIS-1724157"],"award-info":[{"award-number":["IIS-1724157"]}],"id":[{"id":"10.13039\/100000083","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000083","name":"Directorate for Computer and Information Science and Engineering","doi-asserted-by":"publisher","award":["IIS-1638107"],"award-info":[{"award-number":["IIS-1638107"]}],"id":[{"id":"10.13039\/100000083","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000083","name":"Directorate for Computer and Information Science and Engineering","doi-asserted-by":"publisher","award":["IIS-1617639"],"award-info":[{"award-number":["IIS-1617639"]}],"id":[{"id":"10.13039\/100000083","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000083","name":"Directorate for Computer and Information Science and Engineering","doi-asserted-by":"publisher","award":["IIS-1749204"],"award-info":[{"award-number":["IIS-1749204"]}],"id":[{"id":"10.13039\/100000083","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000006","name":"Office of Naval Research","doi-asserted-by":"publisher","award":["N00014-18-2243"],"award-info":[{"award-number":["N00014-18-2243"]}],"id":[{"id":"10.13039\/100000006","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000083","name":"Directorate for Computer and Information Science and Engineering","doi-asserted-by":"publisher","award":["CPS-1739964"],"award-info":[{"award-number":["CPS-1739964"]}],"id":[{"id":"10.13039\/100000083","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000083","name":"Directorate for Computer and Information Science and Engineering","doi-asserted-by":"publisher","award":["NRI-1925082"],"award-info":[{"award-number":["NRI-1925082"]}],"id":[{"id":"10.13039\/100000083","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100017561","name":"Future of Life Institute","doi-asserted-by":"crossref","award":["RFP2-000"],"award-info":[{"award-number":["RFP2-000"]}],"id":[{"id":"10.13039\/100017561","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000183","name":"Army Research Office","doi-asserted-by":"publisher","award":["W911NF-19-2-0333"],"award-info":[{"award-number":["W911NF-19-2-0333"]}],"id":[{"id":"10.13039\/100000183","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000185","name":"Defense Advanced Research Projects Agency","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100002186","name":"Lockheed Martin","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100002186","id-type":"DOI","asserted-by":"publisher"}]},{"name":"General Motors"},{"DOI":"10.13039\/100011993","name":"Robert Bosch","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100011993","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2021,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In reinforcement learning, importance sampling is a widely used method for evaluating an expectation under the distribution of data of one policy when the data has in fact been generated by a different policy. Importance sampling requires computing the likelihood ratio between the action probabilities of a target policy and those of the data-producing behavior policy. In this article, we study importance sampling where the behavior policy action probabilities are replaced by their maximum likelihood estimate of these probabilities under the observed data. We show this general technique reduces variance due to sampling error in Monte Carlo style estimators. We introduce two novel estimators that use this technique to estimate expected values that arise in the RL literature. We find that these general estimators reduce the variance of Monte Carlo sampling methods, leading to faster learning for policy gradient algorithms and more accurate off-policy policy evaluation. We also provide theoretical analysis showing that our new estimators are consistent and have asymptotically lower variance than Monte Carlo estimators.<\/jats:p>","DOI":"10.1007\/s10994-020-05938-9","type":"journal-article","created":{"date-parts":[[2021,5,7]],"date-time":"2021-05-07T18:03:58Z","timestamp":1620410638000},"page":"1267-1317","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":22,"title":["Importance sampling in reinforcement learning with an estimated behavior policy"],"prefix":"10.1007","volume":"110","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7411-0398","authenticated-orcid":false,"given":"Josiah P.","family":"Hanna","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Scott","family":"Niekum","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Peter","family":"Stone","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2021,5,7]]},"reference":[{"key":"5938_CR1","unstructured":"Asadi, K., Allen, C., Roderick, M., Mohamed, A.-R., Konidaris, G., & Littman, M. (2017). Mean actor critic. arXiv preprint arXiv:1709.00503v1."},{"key":"5938_CR2","unstructured":"Asis, K.\u00a0D., Hernandez-Garcia, J.\u00a0F., Holland, G.\u00a0Z., & Sutton, R.\u00a0S. (2018). Multi-step reinforcement learning: A unifying algorithm. In Proceedings of the 32nd AAAI conference on artificial intelligence (AAAI)."},{"issue":"3","key":"5938_CR3","doi-asserted-by":"publisher","first-page":"399","DOI":"10.1080\/00273171.2011.568786","volume":"46","author":"PC Austin","year":"2011","unstructured":"Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3), 399\u2013424.","journal-title":"Multivariate Behavioral Research"},{"issue":"3731","key":"5938_CR4","doi-asserted-by":"publisher","first-page":"34","DOI":"10.1126\/science.153.3731.34","volume":"153","author":"R Bellman","year":"1966","unstructured":"Bellman, R. (1966). Dynamic programming. Science, 153(3731), 34\u201337.","journal-title":"Science"},{"key":"5938_CR5","unstructured":"Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI gym. arXiv preprint arXiv:1606.01540."},{"key":"5938_CR6","doi-asserted-by":"crossref","unstructured":"Ciosek, K., & Whiteson, S. (2017). OFFER: Off-environment reinforcement learning. In Proceedings of the 31st AAAI conference on artificial intelligence (AAAI).","DOI":"10.1609\/aaai.v31i1.10810"},{"key":"5938_CR7","doi-asserted-by":"crossref","unstructured":"Ciosek, K., & Whiteson, S. (2018). Expected policy gradients. In Proceedings of the 32nd AAAI conference on artificial intelligence (AAAI).","DOI":"10.1609\/aaai.v32i1.11607"},{"issue":"4","key":"5938_CR8","doi-asserted-by":"publisher","first-page":"2177","DOI":"10.3150\/15-BEJ725","volume":"22","author":"B Delyon","year":"2016","unstructured":"Delyon, B., & Portier, F. (2016). Integral approximation by kernel smoothing. Bernoulli, 22(4), 2177\u20132208.","journal-title":"Bernoulli"},{"key":"5938_CR9","unstructured":"Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., & Wu, Y. (2017). OpenAI baselines. https:\/\/github.com\/openai\/baselines."},{"key":"5938_CR10","doi-asserted-by":"crossref","unstructured":"Doroudi, S., Thomas, P.\u00a0S., & Brunskill, E. (2017). Importance sampling for fair policy selection. In Proceedings of uncertainty in artificial intelligence (UAI).","DOI":"10.24963\/ijcai.2018\/729"},{"key":"5938_CR11","unstructured":"Dud\u00edk, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. In Proceedings of the 28th international conference on machine learning (ICML)."},{"issue":"3","key":"5938_CR12","doi-asserted-by":"publisher","first-page":"642","DOI":"10.1214\/aoms\/1177728174","volume":"27","author":"A Dvoretzky","year":"1956","unstructured":"Dvoretzky, A., Kiefer, J., & Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27(3), 642\u2013669.","journal-title":"The Annals of Mathematical Statistics"},{"key":"5938_CR13","unstructured":"Farajtabar, M., Chow, Y., & Ghavamzadeh, M. (2018). More robust doubly robust off-policy evaluation. In Proceedings of the 35th international conference on machine learning (ICML)."},{"key":"5938_CR14","unstructured":"Fellows, M., Ciosek, K., & Whiteson, S. (2018). Fourier policy gradients. arXiv preprint arXiv:1802.06891."},{"key":"5938_CR15","doi-asserted-by":"crossref","unstructured":"Frank, J., Mannor, S., & Precup, D. (2008). Reinforcement learning in the presence of rare events. In Proceedings of the 25th international conference on machine learning, ACM, pp. 336\u2013343.","DOI":"10.1145\/1390156.1390199"},{"key":"5938_CR16","doi-asserted-by":"publisher","first-page":"3647","DOI":"10.1609\/aaai.v33i01.33013647","volume":"33","author":"C Gelada","year":"2019","unstructured":"Gelada, C., & Bellemare, M. G. (2019). Off-policy deep reinforcement learning by bootstrapping the covariate shift. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 3647\u20133655.","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"5938_CR17","first-page":"1471","volume":"5","author":"E Greensmith","year":"2004","unstructured":"Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471\u20131530.","journal-title":"Journal of Machine Learning Research"},{"key":"5938_CR18","unstructured":"Hallak, A., & Mannor, S. (2017). Consistent on-line off-policy evaluation. In Proceedings of the 34th international conference on machine learning, pp. 1372\u20131383."},{"key":"5938_CR19","doi-asserted-by":"crossref","unstructured":"Hammersley, J., & Handscomb, D. (1964). Monte Carlo methods. Methuen & co. Ltd., London, p.\u00a040.","DOI":"10.1007\/978-94-009-5819-7"},{"key":"5938_CR20","unstructured":"Hanna, J.\u00a0P., & Stone, P. (2019). Reducing sampling error in the monte carlo policy gradient estimator. In Proceedings of the 19th international conference on autonomous agents and multi-agent systems (AAMAS)."},{"key":"5938_CR21","unstructured":"Hanna, J. P., Thomas, P. S., Stone, P., & Niekum, S. (2017). Data-efficient policy evaluation through behavior policy search. In Proceedings of the 34th international conference on machine learning (ICML)."},{"key":"5938_CR22","unstructured":"Hanna, J. P., Niekum, S., & Stone, P. (2019). Importance sampling with an estimated behavior policy. In Proceedings of the 36th international conference on machine learning (ICML)."},{"issue":"4","key":"5938_CR23","doi-asserted-by":"publisher","first-page":"985","DOI":"10.1093\/biomet\/asm076","volume":"94","author":"M Henmi","year":"2007","unstructured":"Henmi, M., Yoshida, R., & Eguchi, S. (2007). Importance sampling via the estimated sampler. Biometrika, 94(4), 985\u2013991.","journal-title":"Biometrika"},{"issue":"4","key":"5938_CR24","doi-asserted-by":"publisher","first-page":"1161","DOI":"10.1111\/1468-0262.00442","volume":"71","author":"K Hirano","year":"2003","unstructured":"Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161\u20131189.","journal-title":"Econometrica"},{"key":"5938_CR25","unstructured":"Jiang, N., & Li, L. (2016). Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 33rd international conference on machine learning (ICML)."},{"key":"5938_CR26","unstructured":"Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the international conference on learning representations (ICLR)."},{"key":"5938_CR27","unstructured":"Li, L., Munos, R., & Szepesv\u00e1ri, C. (2015). Toward minimax off-policy value estimation. In Proceedings of the 18th international conference on artificial intelligence and statistics."},{"key":"5938_CR28","unstructured":"Liu, Q., & Lee, J. D. (2017). Black-box importance sampling. In Proceedings of the 20th international conference on artificial intelligence and statistics."},{"key":"5938_CR29","first-page":"5356","volume":"31","author":"Q Liu","year":"2018","unstructured":"Liu, Q., Li, L., Tang, Z., & Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems (NeurIPS), 31, 5356\u20135366.","journal-title":"Advances in Neural Information Processing Systems (NeurIPS)"},{"key":"5938_CR30","first-page":"159","volume":"1","author":"S Mahadevan","year":"1996","unstructured":"Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 1, 159\u2013196.","journal-title":"Machine Learning"},{"key":"5938_CR31","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511809071","volume-title":"Introduction to information retrieval","author":"CD Manning","year":"2008","unstructured":"Manning, C. D., Raghavan, P., & Sch\u00fctze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press."},{"issue":"7540","key":"5938_CR32","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","volume":"518","author":"V Mnih","year":"2015","unstructured":"Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529\u2013533.","journal-title":"Nature"},{"key":"5938_CR33","unstructured":"Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd international conference on machine learning (ICML), pp. 1928\u20131937."},{"key":"5938_CR34","unstructured":"Moore, A (1990). Efficient Memory-based learning for robot control. PhD thesis, University of Cambridge."},{"key":"5938_CR35","unstructured":"Mousavi, A., Li, L., Liu, Q., & Zhou, D. (2020). Black-box off-policy estimation for infinite-horizon reinforcement learning. In International conference on learning representations (ICLR)"},{"key":"5938_CR36","doi-asserted-by":"crossref","unstructured":"Narita, Y., Yasui, S., & Yata, K. (2019). Efficient counterfactual learning from bandit feedback. In Proceedings of the 35th AAAI conference on artificial intelligence (AAAI).","DOI":"10.2139\/ssrn.3300346"},{"issue":"3","key":"5938_CR37","doi-asserted-by":"publisher","first-page":"695","DOI":"10.1111\/rssb.12185","volume":"79","author":"CJ Oates","year":"2017","unstructured":"Oates, C. J., Girolami, M., & Chopin, N. (2017). Control functionals for monte carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3), 695\u2013718.","journal-title":"Journal of the Royal Statistical Society: Series B (Statistical Methodology)"},{"key":"5938_CR38","unstructured":"Pavse, B. S., Durugkar, I., Hanna, J. P., & Stone, P. (2020). Reducing sampling error in batch temporal difference learning. In Proceedings of the 37th international conference on machine learning (ICML)."},{"key":"5938_CR39","doi-asserted-by":"crossref","unstructured":"Peters, J., & Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 682\u2013697.","DOI":"10.1016\/j.neunet.2008.02.003"},{"key":"5938_CR40","unstructured":"Petit, B., Amdahl-Culleton, L., Liu, Y., Smith, J., & Bacon, P. L. (2019). All-action policy gradient methods: A numerical integration approach. arXiv preprint arXiv:1910.09093."},{"key":"5938_CR41","unstructured":"Precup, D., Sutton, R. S., & Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th international conference on machine learning (ICML), pp. 759\u2013766."},{"key":"5938_CR42","volume-title":"Markov decision processes: Discrete stochastic dynamic programming","author":"ML Puterman","year":"2014","unstructured":"Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley."},{"key":"5938_CR43","unstructured":"Raghu, A., Gottesman, O., Liu, Y., Komorowski, M., Faisal, A., Doshi-Velez, F., & Brunskill, F. (2018). Behaviour policy estimation in off-policy policy evaluation: Calibration matters. In Proceedings of the ICML workshop on causal inference, counterfactual prediction, and autonomous action."},{"issue":"398","key":"5938_CR44","doi-asserted-by":"publisher","first-page":"387","DOI":"10.1080\/01621459.1987.10478441","volume":"82","author":"PR Rosenbaum","year":"1987","unstructured":"Rosenbaum, P. R. (1987). Model-based direct adjustment. Journal of the American Statistical Association, 82(398), 387\u2013394.","journal-title":"Journal of the American Statistical Association"},{"key":"5938_CR45","volume-title":"The cross-entropy method: a unified approach to combinatorial optimization Monte Carlo simulation and machine learning","author":"RY Rubinstein","year":"2013","unstructured":"Rubinstein, R. Y., & Kroese, D. P. (2013). The cross-entropy method: a unified approach to combinatorial optimization Monte Carlo simulation and machine learning. Berlin: Springer."},{"key":"5938_CR46","unstructured":"Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems, Vol. 37. Department of Engineering Cambridge, England: University of Cambridge."},{"key":"5938_CR47","unstructured":"Schulman, J., Levine, S., Moritz, P., Jordan, M., & Abbeel, P. (2015). Trust region policy optimization. In Proceedings of the 32nd international conference on machine learning (ICML). URL http:\/\/jmlr.csail.mit.edu\/proceedings\/papers\/v37\/schulman15.html."},{"key":"5938_CR48","unstructured":"Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In Proceedings of the international conference on learning representations (ICLR)."},{"key":"5938_CR49","unstructured":"Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347."},{"key":"5938_CR50","doi-asserted-by":"crossref","unstructured":"Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the 10th international conference on machine learning (ICML).","DOI":"10.1016\/B978-1-55860-307-3.50045-9"},{"key":"5938_CR51","unstructured":"Shi, L., Li, S., Cao, L., Yang, L., & Pan, G. (2019). TBQ ($$\\sigma$$): Improving efficiency of trace utilization for off-policy reinforcement learning. In Proceedings of the 18th international conference on autonomous agents and multiagent systems (AAMAS), pp. 1025\u20131032."},{"issue":"7587","key":"5938_CR52","doi-asserted-by":"publisher","first-page":"484","DOI":"10.1038\/nature16961","volume":"529","author":"D Silver","year":"2016","unstructured":"Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484\u2013489.","journal-title":"Nature"},{"key":"5938_CR53","first-page":"123","volume":"22","author":"SP Singh","year":"1996","unstructured":"Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22, 123\u2013158.","journal-title":"Machine Learning"},{"key":"5938_CR54","unstructured":"Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts, Amherst."},{"key":"5938_CR55","volume-title":"Reinforcement learning: An introduction","author":"RS Sutton","year":"1998","unstructured":"Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press."},{"key":"5938_CR56","unstructured":"Sutton, R. S., Singh, S., & McAllester, D. (2000). Comparing policy-gradient algorithms."},{"key":"5938_CR57","unstructured":"Thomas, P. S. (2015). Safe reinforcement learning. PhD thesis, University of Massachusetts Amherst."},{"key":"5938_CR58","unstructured":"Thomas, P. S., & Brunskill, E. (2016a). Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd international conference on machine learning (ICML)."},{"key":"5938_CR59","unstructured":"Thomas, P. S., & Brunskill, E. (2016b). Magical policy search: Data efficient reinforcement learning with guarantees of global optimality. In European workshop on reinforcement learning."},{"key":"5938_CR60","doi-asserted-by":"crossref","unstructured":"Thomas, P. S., & Brunskill, E. (2017). Importance sampling with unequal support. In Thirty-first AAAI conference on artificial intelligence.","DOI":"10.1609\/aaai.v31i1.10932"},{"key":"5938_CR61","unstructured":"Thomas, P. S., Theocharous, G., & Ghavamzadeh, M. (2015). High confidence policy improvement. In Proceedings of the 32nd international conference on machine learning (ICML)."},{"key":"5938_CR62","doi-asserted-by":"crossref","unstructured":"Van\u00a0Seijen, H., Van\u00a0Hasselt, H., Whiteson, S., & Wiering, M. (2009). A theoretical and empirical analysis of expected SARSA. In Proceedings of the IEEE symposium on adaptive dynamic programming and reinforcement learning, IEEE, pp. 177\u2013184.","DOI":"10.1109\/ADPRL.2009.4927542"},{"issue":"3\u20134","key":"5938_CR63","first-page":"229","volume":"8","author":"RJ Williams","year":"1992","unstructured":"Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3\u20134), 229\u2013256.","journal-title":"Machine Learning"},{"key":"5938_CR64","unstructured":"Xie, Y., Liu, B., Liu, Q., Wang, Z., Zhou, Y., & Peng, J. (2018). Off-policy evaluation and learning from logged bandit feedback: Error reduction via surrogate policy. In Proceedings of the international conference on learning representations (ICLR)."},{"key":"5938_CR65","doi-asserted-by":"crossref","unstructured":"Yang, L., Shi, M., Zheng, Q., Meng, W., & Pan, G. (2018). A unified approach for multi-step temporal-difference learning with eligibility traces in reinforcement learning. In Proceedings of the 27th international joint conference on artificial intelligence (IJCAI).","DOI":"10.24963\/ijcai.2018\/414"},{"key":"5938_CR66","doi-asserted-by":"crossref","unstructured":"Yang, M., Nachum, O., Dai, B., Li, L., & Schuurmans, D. (2020). Off-policy evaluation via the regularized lagrangian. In Advances in neural information processing systems (NeurIPS), Vol.\u00a033.","DOI":"10.1007\/978-3-030-63823-8"}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-020-05938-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-020-05938-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-020-05938-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,30]],"date-time":"2024-08-30T10:46:24Z","timestamp":1725014784000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-020-05938-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,7]]},"references-count":66,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2021,6]]}},"alternative-id":["5938"],"URL":"https:\/\/doi.org\/10.1007\/s10994-020-05938-9","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"value":"0885-6125","type":"print"},{"value":"1573-0565","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,5,7]]},"assertion":[{"value":"9 June 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 June 2020","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 December 2020","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 May 2021","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Compliance with ethical standards"}},{"value":"Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}