{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T23:08:48Z","timestamp":1773788928144,"version":"3.50.1"},"reference-count":116,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,2,7]],"date-time":"2023-02-07T00:00:00Z","timestamp":1675728000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100011878","name":"Flemish Government","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100011878","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Recomm. Syst."],"published-print":{"date-parts":[[2023,3,31]]},"abstract":"<jats:p>Modern recommender systems are often modelled under the sequential decision-making paradigm, where the system <jats:italic>decides<\/jats:italic> which recommendations to show in order to maximise some notion of either imminent or long-term reward. Such methods often require an explicit model of the reward a certain context-action pair will yield \u2013 for example, the probability of a click on a recommendation. This common machine learning task is highly non-trivial, as the data-generating process for contexts and actions can be skewed by the recommender system itself. Indeed, when the deployed recommendation policy at data collection time does not pick its actions uniformly-at-random, this leads to a selection bias that can impede effective reward modelling. This in turn makes off-policy learning \u2013 the typical setup in industry \u2013 particularly challenging. 
Existing approaches for value-based learning break down in such environments.<\/jats:p><jats:p>In this work, we propose and validate a general <jats:italic>pessimistic<\/jats:italic> reward modelling approach for off-policy learning in recommendation. Bayesian uncertainty estimates allow us to express scepticism about our own reward model, which can in turn be used to generate a conservative decision rule. We show how it alleviates a well-known decision-making phenomenon known as the Optimiser\u2019s Curse, and draw parallels with existing work on pessimistic policy learning. Leveraging the available closed-form expressions for both the posterior mean and variance when a ridge regressor models the reward, we show how to apply pessimism effectively and efficiently to an off-policy recommendation use-case. Empirical observations in a wide range of simulated environments show that pessimistic decision-making leads to a significant and robust increase in recommendation performance. The merits of our approach are most pronounced in realistic settings with limited logging randomisation, limited training samples, and larger action spaces. 
We discuss the impact of our contributions in the context of related applications like computational advertising, and present a scope for future research based on hybrid off-\/on-policy bandit learning methods for recommendation.<\/jats:p>","DOI":"10.1145\/3568029","type":"journal-article","created":{"date-parts":[[2022,10,26]],"date-time":"2022-10-26T14:17:24Z","timestamp":1666793844000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":21,"title":["Pessimistic Decision-Making for Recommender Systems"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6256-5814","authenticated-orcid":false,"given":"Olivier","family":"Jeunen","sequence":"first","affiliation":[{"name":"Amazon, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9327-9554","authenticated-orcid":false,"given":"Bart","family":"Goethals","sequence":"additional","affiliation":[{"name":"University of Antwerp, Belgium and Monash University, Clayton VIC, Australia"}]}],"member":"320","published-online":{"date-parts":[[2023,2,7]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3097983.3098155"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331202"},{"key":"e_1_3_2_4_2","first-page":"4","volume-title":"Proc. of the 2019 World Wide Web Conference (WWW\u201919)","author":"Agarwal A.","year":"2019","unstructured":"A. Agarwal, X. Wang, C. Li, M. Bendersky, and M. Najork. 2019. Addressing trust bias for unbiased learning-to-rank. In Proc. of the 2019 World Wide Web Conference (WWW\u201919). ACM, 4\u201314."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1013689704352"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3383313.3412217"},{"key":"e_1_3_2_7_2","first-page":"35","volume-title":"Proc. of the KDD Cup and Workshop","author":"Bennett J.","year":"2007","unstructured":"J. Bennett, S. 
Lanning, et\u00a0al. 2007. The Netflix Prize. In Proc. of the KDD Cup and Workshop, Vol. 2007. 35."},{"key":"e_1_3_2_8_2","doi-asserted-by":"crossref","unstructured":"J. O. Berger and R. L. Wolpert. 1988. The likelihood principle. IMS.","DOI":"10.1214\/lnms\/1215466210"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.5555\/2567709.2567766"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240323.3240370"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.5555\/2986459.2986710"},{"key":"e_1_3_2_12_2","first-page":"456","volume-title":"Proc. of the 12th ACM International Conference on Web Search and Data Mining (WSDM\u201919)","author":"Chen M.","year":"2019","unstructured":"M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi. 2019. Top-k off-policy correction for a REINFORCE recommender system. In Proc. of the 12th ACM International Conference on Web Search and Data Mining (WSDM\u201919). ACM, 456\u2013464."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3437963.3441764"},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1145\/3460231.3474236","volume-title":"Proc. of the Fifteenth ACM Conference on Recommender Systems (RecSys\u201921)","author":"Chen M.","year":"2021","unstructured":"M. Chen, Y. Wang, C. Xu, Y. Le, M. Sharma, L. Richardson, S. Wu, and E. Chi. 2021. Values of user exploration in recommender systems. In Proc. of the Fifteenth ACM Conference on Recommender Systems (RecSys\u201921). ACM, 85\u201395."},{"issue":"4","key":"e_1_3_2_15_2","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1145\/3411754","article-title":"Block-aware item similarity models for top-n recommendation","volume":"38","author":"Chen Y.","year":"2020","unstructured":"Y. Chen, Y. Wang, X. Zhao, J. Zou, and M. de Rijke. 2020. Block-aware item similarity models for top-n recommendation. ACM Trans. Inf. Syst. 
38, 4, Article 42 (Sept. 2020), 26 pages.","journal-title":"ACM Trans. Inf. Syst."},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3437963.3441770"},{"key":"e_1_3_2_17_2","volume-title":"Proc. of the 2021 World Wide Web Conference (WWW\u201921)","author":"Choi M.","year":"2021","unstructured":"M. Choi, J. Kim, J. Lee, H. Shim, and J. Lee. 2021. Session-aware linear item-item models for session-based recommendation. In Proc. of the 2021 World Wide Web Conference (WWW\u201921)."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3298689.3347058"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340531.3411962"},{"key":"e_1_3_2_20_2","first-page":"1097","volume-title":"Proc. of the 28th International Conference on International Conference on Machine Learning (ICML\u201911)","author":"Dud\u00edk M.","year":"2011","unstructured":"M. Dud\u00edk, J. Langford, and L. Li. 2011. Doubly robust policy evaluation and learning. In Proc. of the 28th International Conference on International Conference on Machine Learning (ICML\u201911). 1097\u20131104."},{"key":"e_1_3_2_21_2","first-page":"4629","volume-title":"Proc. of the 32nd International Conference on Neural Information Processing Systems (NIPS\u201918)","author":"Dumitrascu B.","year":"2018","unstructured":"B. Dumitrascu, K. Feng, and B. E. Engelhardt. 2018. PG-TS: Improved Thompson Sampling for logistic contextual bandits. In Proc. of the 32nd International Conference on Neural Information Processing Systems (NIPS\u201918). 4629\u20134638."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1201\/9780429246593"},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","first-page":"803","DOI":"10.1145\/3460231.3470938","volume-title":"Proc. of the Fifteenth ACM Conference on Recommender Systems (RecSys\u201921)","author":"Ekstrand M. D.","year":"2021","unstructured":"M. D. Ekstrand, A. Chaney, P. Castells, R. Burke, D. Rohde, and M. Slokom. 2021. 
SimuRec: Workshop on synthetic data and simulation methods for recommender systems research. In Proc. of the Fifteenth ACM Conference on Recommender Systems (RecSys\u201921). ACM, 803\u2013805."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3298689.3347036"},{"issue":"1","key":"e_1_3_2_25_2","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1214\/18-STS668","article-title":"Generalized multiple importance sampling","volume":"34","author":"Elvira V.","year":"2019","unstructured":"V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo. 2019. Generalized multiple importance sampling. Statist. Sci. 34, 1 (2019), 129\u2013155.","journal-title":"Statist. Sci."},{"key":"e_1_3_2_26_2","series-title":"Proc. of the 35th International Conference on Machine Learning","first-page":"1447","volume":"80","author":"Farajtabar M.","year":"2018","unstructured":"M. Farajtabar, Y. Chow, and M. Ghavamzadeh. 2018. More robust doubly robust off-policy evaluation. In Proc. of the 35th International Conference on Machine Learning(ICML\u201918, Vol. 80). PMLR, 1447\u20131456."},{"key":"e_1_3_2_27_2","volume-title":"Proc. of the 34th AAAI Conference on Artificial Intelligence (AAAI\u201920)","author":"Faury L.","year":"2020","unstructured":"L. Faury, U. Tanielian, F. Vasile, E. Smirnova, and E. Dohmatob. 2020. Distributionally robust counterfactual risk minimization. In Proc. of the 34th AAAI Conference on Artificial Intelligence (AAAI\u201920). AAAI Press."},{"key":"e_1_3_2_28_2","first-page":"1050","volume-title":"Proc. of the 33rd International Conference on Machine Learning (ICML\u201916)","author":"Gal Y.","year":"2016","unstructured":"Y. Gal and Z. Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proc. of the 33rd International Conference on Machine Learning (ICML\u201916). PMLR, 1050\u20131059."},{"key":"e_1_3_2_29_2","first-page":"169","volume-title":"Proc. 
of the 8th ACM Conference on Recommender Systems (RecSys\u201914)","author":"Garcin F.","year":"2014","unstructured":"F. Garcin, B. Faltings, O. Donatsch, A. Alazzawi, C. Bruttin, and A. Huber. 2014. Offline and online evaluation of news recommender systems at swissinfo.ch. In Proc. of the 8th ACM Conference on Recommender Systems (RecSys\u201914). 169\u2013176."},{"key":"e_1_3_2_30_2","first-page":"198","volume-title":"Proc. of the 11th ACM International Conference on Web Search and Data Mining (WSDM\u201918)","author":"Gilotte A.","year":"2018","unstructured":"A. Gilotte, C. Calauz\u00e8nes, T. Nedelec, A. Abraham, and S. Doll\u00e9. 2018. Offline A\/B testing for recommender systems. In Proc. of the 11th ACM International Conference on Web Search and Data Mining (WSDM\u201918). ACM, 198\u2013206."},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3383313.3412214"},{"key":"e_1_3_2_32_2","series-title":"Proc. of the 35th International Conference on Machine Learning","first-page":"1861","volume":"80","author":"Haarnoja T.","year":"2018","unstructured":"T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proc. of the 35th International Conference on Machine Learning(ICML\u201918, Vol. 80). PMLR, 1861\u20131870."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/2827872"},{"key":"e_1_3_2_34_2","first-page":"1","volume-title":"Proc. of the 8th International Workshop on Data Mining for Online Advertising (ADKDD\u201914)","author":"He X.","year":"2014","unstructured":"X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Q. Candela. 2014. Practical lessons from predicting clicks on ads at Facebook. In Proc. of the 8th International Workshop on Data Mining for Online Advertising (ADKDD\u201914). ACM, 1\u20139."},{"key":"e_1_3_2_35_2","volume-title":"Proc. 
of the 9th International Conference on Learning Representations (ICLR\u201921)","author":"Hui L.","year":"2021","unstructured":"L. Hui and M. Belkin. 2021. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In Proc. of the 9th International Conference on Learning Representations (ICLR\u201921). arxiv:2006.07322 [cs.LG]."},{"key":"e_1_3_2_36_2","unstructured":"E. Ie C. Hsu M. Mladenov V. Jain S. Narvekar J. Wang R. Wu and C. Boutilier. 2019. RecSim: A Configurable Simulation Platform for Recommender Systems. arxiv:1909.04847 [cs.LG]."},{"key":"e_1_3_2_37_2","first-page":"2592","volume-title":"Proc. of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19)","author":"Ie E.","year":"2019","unstructured":"E. Ie, V. Jain, J. Wang, S. Narvekar, R. Agarwal, R. Wu, H. Cheng, T. Chandra, and C. Boutilier. 2019. SlateQ: A tractable decomposition for reinforcement learning with recommendation sets. In Proc. of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19). 2592\u20132599."},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1198\/106186008X320456"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401230"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3298689.3347069"},{"key":"e_1_3_2_41_2","volume-title":"Offline Approaches to Recommendation with Online Success","author":"Jeunen O.","year":"2021","unstructured":"O. Jeunen. 2021. Offline Approaches to Recommendation with Online Success. Ph. D. Dissertation. University of Antwerp."},{"key":"e_1_3_2_42_2","volume-title":"Proc. of the ACM RecSys Workshop on Bandit Learning from User Interactions (REVEAL\u201920)","author":"Jeunen O.","year":"2020","unstructured":"O. Jeunen and B. Goethals. 2020. An empirical evaluation of doubly robust learning for recommendation. In Proc. 
of the ACM RecSys Workshop on Bandit Learning from User Interactions (REVEAL\u201920)."},{"key":"e_1_3_2_43_2","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1145\/3460231.3474247","volume-title":"Proc. of the Fifteenth ACM Conference on Recommender Systems (RecSys\u201921)","author":"Jeunen O.","year":"2021","unstructured":"O. Jeunen and B. Goethals. 2021. Pessimistic reward models for off-policy learning in recommendation. In Proc. of the Fifteenth ACM Conference on Recommender Systems (RecSys\u201921). ACM, 63\u201374."},{"key":"e_1_3_2_44_2","doi-asserted-by":"crossref","first-page":"310","DOI":"10.1145\/3460231.3474248","volume-title":"Proc. of the Fifteenth ACM Conference on Recommender Systems (RecSys\u201921)","author":"Jeunen O.","year":"2021","unstructured":"O. Jeunen and B. Goethals. 2021. Top-k contextual bandits with equity of exposure. In Proc. of the Fifteenth ACM Conference on Recommender Systems (RecSys\u201921). ACM, 310\u2013320."},{"key":"e_1_3_2_45_2","volume-title":"Proc. of the AdKDD Workshop at the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (AdKDD\u201922)","author":"Jeunen O.","year":"2022","unstructured":"O. Jeunen, S. Murphy, and B. Allison. 2022. Learning to bid with AuctionGym. In Proc. of the AdKDD Workshop at the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (AdKDD\u201922)."},{"key":"e_1_3_2_46_2","volume-title":"Proc. of the ACM RecSys Workshop on Reinforcement Learning and Robust Estimators for Recommendation (REVEAL\u201919)","author":"Jeunen O.","year":"2019","unstructured":"O. Jeunen, D. Mykhaylov, D. Rohde, F. Vasile, A. Gilotte, and M. Bompaire. 2019. Learning from bandit feedback: An overview of the state-of-the-art. In Proc. of the ACM RecSys Workshop on Reinforcement Learning and Robust Estimators for Recommendation (REVEAL\u201919)."},{"key":"e_1_3_2_47_2","volume-title":"Proc. 
of the ACM RecSys Workshop on Reinforcement Learning and Robust Estimators for Recommendation (REVEAL\u201919)","author":"Jeunen O.","year":"2019","unstructured":"O. Jeunen, D. Rohde, and F. Vasile. 2019. On the value of bandit feedback for offline recommender system evaluation. In Proc. of the ACM RecSys Workshop on Reinforcement Learning and Robust Estimators for Recommendation (REVEAL\u201919)."},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403175"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3383313.3418480"},{"key":"e_1_3_2_50_2","doi-asserted-by":"crossref","DOI":"10.1007\/s11257-021-09314-7","article-title":"Embarrassingly shallow auto-encoders for dynamic collaborative filtering","author":"Jeunen O.","year":"2022","unstructured":"O. Jeunen, J. Van Balen, and B. Goethals. 2022. Embarrassingly shallow auto-encoders for dynamic collaborative filtering. User Modeling and User-Adapted Interaction (2022).","journal-title":"User Modeling and User-Adapted Interaction"},{"key":"e_1_3_2_51_2","unstructured":"Y. Jin Z. Yang and Z. Wang. 2020. Is Pessimism Provably Efficient for Offline RL?arxiv:2012.15085 [cs.LG]."},{"key":"e_1_3_2_52_2","volume-title":"Proc. of the 6th International Conference on Learning Representations (ICLR\u201918)","author":"Joachims T.","year":"2018","unstructured":"T. Joachims, A. Swaminathan, and M. de Rijke. 2018. Deep learning with logged bandit feedback. In Proc. of the 6th International Conference on Learning Representations (ICLR\u201918)."},{"key":"e_1_3_2_53_2","doi-asserted-by":"crossref","first-page":"781","DOI":"10.1145\/3018661.3018699","volume-title":"Proc. of the 10th ACM International Conference on Web Search and Data Mining (WSDM\u201917)","author":"Joachims T.","year":"2017","unstructured":"T. Joachims, A. Swaminathan, and T. Schnabel. 2017. Unbiased learning-to-rank with biased feedback. In Proc. 
of the 10th ACM International Conference on Web Search and Data Mining (WSDM\u201917). ACM, 781\u2013789."},{"key":"e_1_3_2_54_2","series-title":"Proc. of the 38th International Conference on Machine Learning","first-page":"5247","volume":"139","author":"Kallus N.","year":"2021","unstructured":"N. Kallus, Y. Saito, and M. Uehara. 2021. Optimal off-policy evaluation from multiple logging policies. In Proc. of the 38th International Conference on Machine Learning(ICML\u201921, Vol. 139). PMLR, 5247\u20135256."},{"key":"e_1_3_2_55_2","series-title":"NeurIPS\u201920","volume-title":"Advances in Neural Information Processing Systems","author":"Kidambi R.","year":"2020","unstructured":"R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims. 2020. MOReL: Model-based offline reinforcement learning. In Advances in Neural Information Processing Systems(NeurIPS\u201920, Vol. 33)."},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2009.263"},{"key":"e_1_3_2_57_2","series-title":"NeurIPS\u201920","volume-title":"Advances in Neural Information Processing Systems","author":"Kumar A.","year":"2020","unstructured":"A. Kumar, A. Zhou, G. Tucker, and S. Levine. 2020. Conservative q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems(NeurIPS\u201920, Vol. 33)."},{"key":"e_1_3_2_58_2","article-title":"Large-scale validation of counterfactual learning methods: A test-bed","author":"Lefortier D.","year":"2016","unstructured":"D. Lefortier, A. Swaminathan, X. Gu, T. Joachims, and M. de Rijke. 2016. Large-scale validation of counterfactual learning methods: A test-bed. arXiv preprint arXiv:1612.00367 (2016).","journal-title":"arXiv preprint arXiv:1612.00367"},{"key":"e_1_3_2_59_2","unstructured":"S. Levine A. Kumar G. Tucker and J. Fu. 2020. Offline Reinforcement Learning: Tutorial Review and Perspectives on Open Problems. 
arxiv:2005.01643 [cs.LG]."},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/1772690.1772758"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911548"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3178876.3186150"},{"key":"e_1_3_2_63_2","volume-title":"Proc. of the 33rd International Conference on Neural Information Processing Systems (NeurIPS\u201919)","author":"Liu B.","year":"2019","unstructured":"B. Liu, Q. Cai, Z. Yang, and Z. Wang. 2019. Neural proximal\/trust region policy optimization attains globally optimal policy. In Proc. of the 33rd International Conference on Neural Information Processing Systems (NeurIPS\u201919). Article 948, 12 pages."},{"key":"e_1_3_2_64_2","series-title":"NeurIPS\u201920","volume-title":"Advances in Neural Information Processing Systems","author":"Liu Y.","year":"2020","unstructured":"Y. Liu, A. Swaminathan, A. Agarwal, and E. Brunskill. 2020. Provably good batch off-policy reinforcement learning without great exploration. In Advances in Neural Information Processing Systems(NeurIPS\u201920, Vol. 33)."},{"key":"e_1_3_2_65_2","series-title":"Proc. of the 36th International Conference on Machine Learning","first-page":"4125","volume":"97","author":"London B.","year":"2019","unstructured":"B. London and T. Sandler. 2019. Bayesian counterfactual risk minimization. In Proc. of the 36th International Conference on Machine Learning(ICML\u201919, Vol. 97). PMLR, 4125\u20134133."},{"key":"e_1_3_2_66_2","volume-title":"Proc. of the 35th AAAI Conference on Artificial Intelligence (AAAI\u201921)","author":"Lopez R.","year":"2021","unstructured":"R. Lopez, I. Dhillon, and M. I. Jordan. 2021. Learning from eXtreme bandit feedback. In Proc. of the 35th AAAI Conference on Artificial Intelligence (AAAI\u201921). 
AAAI Press."},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403147"},{"key":"e_1_3_2_68_2","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3220007"},{"key":"e_1_3_2_69_2","volume-title":"Proc. of the 2020 World Wide Web Conference (WWW\u201920)","author":"Ma J.","year":"2020","unstructured":"J. Ma, Z. Zhao, X. Yi, J. Yang, M. Chen, J. Tang, L. Hong, and E. H. Chi. 2020. Off-policy learning in two-stage recommender systems. In Proc. of the 2020 World Wide Web Conference (WWW\u201920). ACM."},{"key":"e_1_3_2_70_2","series-title":"Proc. of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS)","first-page":"2956","volume":"89","author":"Ma Y.","year":"2019","unstructured":"Y. Ma, Y. Wang, and B. Narayanaswamy. 2019. Imitation-regularized offline learning. In Proc. of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS)(AIStats\u201919, Vol. 89). PMLR, 2956\u20132965."},{"key":"e_1_3_2_71_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340531.3412152"},{"key":"e_1_3_2_72_2","series-title":"NeurIPS\u201920","first-page":"5479","volume-title":"Advances in Neural Information Processing Systems","author":"Masegosa A.","year":"2020","unstructured":"A. Masegosa. 2020. Learning under model misspecification: Applications to variational and ensemble methods. In Advances in Neural Information Processing Systems(NeurIPS\u201920, Vol. 33). 5479\u20135491."},{"key":"e_1_3_2_73_2","first-page":"21","article-title":"Empirical Bernstein bounds and sample variance penalization","volume":"1050","author":"Maurer A.","year":"2009","unstructured":"A. Maurer and M. Pontil. 2009. Empirical Bernstein bounds and sample variance penalization. Stat. 
1050 (2009), 21.","journal-title":"Stat."},{"key":"e_1_3_2_74_2","doi-asserted-by":"publisher","DOI":"10.5555\/2188385.2343711"},{"key":"e_1_3_2_75_2","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1145\/3240323.3240354","volume-title":"Proc. of the 12th ACM Conference on Recommender Systems (RecSys\u201918)","author":"McInerney J.","year":"2018","unstructured":"J. McInerney, B. Lacker, S. Hansen, K. Higley, H. Bouchard, A. Gruson, and R. Mehrotra. 2018. Explore, exploit, and explain: Personalizing explainable recommendations with bandits. In Proc. of the 12th ACM Conference on Recommender Systems (RecSys\u201918). ACM, 31\u201339."},{"key":"e_1_3_2_76_2","doi-asserted-by":"publisher","DOI":"10.1145\/2487575.2488200"},{"key":"e_1_3_2_77_2","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3481893"},{"key":"e_1_3_2_78_2","doi-asserted-by":"publisher","DOI":"10.1145\/3269206.3272027"},{"key":"e_1_3_2_79_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403374"},{"key":"e_1_3_2_80_2","volume-title":"Probabilistic Machine Learning: An Introduction","author":"Murphy K. P.","year":"2021","unstructured":"K. P. Murphy. 2021. Probabilistic Machine Learning: An Introduction. MIT Press."},{"key":"e_1_3_2_81_2","volume-title":"Proc. of the NeurIPS Workshop on Causality and Machine Learning (CausalML\u201919)","author":"Mykhaylov D.","year":"2019","unstructured":"D. Mykhaylov, D. Rohde, F. Vasile, M. Bompaire, and O. Jeunen. 2019. Three methods for training on bandit feedback. In Proc. 
of the NeurIPS Workshop on Causality and Machine Learning (CausalML\u201919)."},{"key":"e_1_3_2_82_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2011.134"},{"key":"e_1_3_2_83_2","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3209992"},{"key":"e_1_3_2_84_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401102"},{"key":"e_1_3_2_85_2","doi-asserted-by":"publisher","DOI":"10.1145\/3437963.3441794"},{"key":"e_1_3_2_86_2","first-page":"4026","volume-title":"Advances in Neural Information Processing Systems","author":"Osband I.","year":"2016","unstructured":"I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. 2016. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, Vol. 29. 4026\u20134034."},{"key":"e_1_3_2_87_2","volume-title":"Monte Carlo Theory, Methods and Examples","author":"Owen A. B.","year":"2013","unstructured":"A. B. Owen. 2013. Monte Carlo Theory, Methods and Examples."},{"key":"e_1_3_2_88_2","volume-title":"Proc. of the ACM RecSys Workshop on Offline Evaluation for Recommender Systems (REVEAL\u201918)","author":"Rohde D.","year":"2018","unstructured":"D. Rohde, S. Bonner, T. Dunlop, F. Vasile, and A. Karatzoglou. 2018. RecoGym: A reinforcement learning environment for the problem of product recommendation in online advertising. In Proc. of the ACM RecSys Workshop on Offline Evaluation for Recommender Systems (REVEAL\u201918)."},{"key":"e_1_3_2_89_2","doi-asserted-by":"publisher","DOI":"10.1145\/2959100.2959176"},{"key":"e_1_3_2_90_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403139"},{"key":"e_1_3_2_91_2","unstructured":"Y. Saito S. Aihara M. Matsutani and Y. Narita. 2020. Large-scale Open Dataset Pipeline and Benchmark for Bandit Algorithms. arxiv:2008.07146 [cs.LG]."},{"key":"e_1_3_2_92_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403121"},{"key":"e_1_3_2_93_2","first-page":"1889","volume-title":"Proc. 
of the 32nd International Conference on Machine Learning","volume":"37","author":"Schulman J.","year":"2015","unstructured":"J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. 2015. Trust region policy optimization. In Proc. of the 32nd International Conference on Machine Learning, Vol. 37. PMLR, 1889\u20131897."},{"key":"e_1_3_2_94_2","article-title":"Proximal policy optimization algorithms","volume":"1707","author":"Schulman J.","year":"2017","unstructured":"J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. 2017. Proximal policy optimization algorithms. CoRR abs\/1707.06347 (2017). arXiv:1707.06347.","journal-title":"CoRR"},{"issue":"1","key":"e_1_3_2_95_2","doi-asserted-by":"crossref","DOI":"10.1609\/aaai.v30i1.9991","article-title":"On the effectiveness of linear models for one-class collaborative filtering","volume":"30","author":"Sedhain S.","year":"2016","unstructured":"S. Sedhain, A. Menon, S. Sanner, and D. Braziunas. 2016. On the effectiveness of linear models for one-class collaborative filtering. Proc. of the AAAI Conference on Artificial Intelligence 30, 1 (2016).","journal-title":"Proc. of the AAAI Conference on Artificial Intelligence"},{"issue":"9","key":"e_1_3_2_96_2","article-title":"An MDP-based recommender system.","volume":"6","author":"Shani G.","year":"2005","unstructured":"G. Shani, D. Heckerman, and R. I. Brafman. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, 9 (2005).","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_97_2","doi-asserted-by":"publisher","DOI":"10.1145\/3336191.3371831"},{"key":"e_1_3_2_98_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0378-3758(00)00115-4"},{"key":"e_1_3_2_99_2","volume-title":"International Conference on Machine Learning (ICML\u201920)","author":"Si N.","year":"2020","unstructured":"N. Si, F. Zhang, Z. Zhou, and J. Blanchet. 2020. Distributionally robust policy evaluation and learning in offline contextual bandits. 
In International Conference on Machine Learning (ICML\u201920)."},{"key":"e_1_3_2_100_2","doi-asserted-by":"publisher","DOI":"10.1287\/mnsc.1050.0451"},{"key":"e_1_3_2_101_2","doi-asserted-by":"publisher","DOI":"10.1145\/1835804.1835895"},{"key":"e_1_3_2_102_2","doi-asserted-by":"crossref","first-page":"3251","DOI":"10.1145\/3308558.3313710","volume-title":"The World Wide Web Conference (WWW\u201919)","author":"Steck H.","year":"2019","unstructured":"H. Steck. 2019. Embarrassingly shallow autoencoders for sparse data. In The World Wide Web Conference (WWW\u201919). ACM, 3251\u20133257."},{"key":"e_1_3_2_103_2","doi-asserted-by":"publisher","DOI":"10.5555\/3524938.3525788"},{"key":"e_1_3_2_104_2","first-page":"6005","volume-title":"International Conference on Machine Learning (ICML\u201919)","author":"Su Y.","year":"2019","unstructured":"Y. Su, L. Wang, M. Santacatterina, and T. Joachims. 2019. CAB: Continuous adaptive blending for policy evaluation and learning. In International Conference on Machine Learning (ICML\u201919). 6005\u20136014."},{"key":"e_1_3_2_105_2","first-page":"814","volume-title":"Proc. of the 32nd International Conference on International Conference on Machine Learning (ICML\u201915)","author":"Swaminathan A.","year":"2015","unstructured":"A. Swaminathan and T. Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. In Proc. of the 32nd International Conference on International Conference on Machine Learning (ICML\u201915). JMLR.org, 814\u2013823."},{"key":"e_1_3_2_106_2","first-page":"3231","volume-title":"Advances in Neural Information Processing Systems","author":"Swaminathan A.","year":"2015","unstructured":"A. Swaminathan and T. Joachims. 2015. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems. 
3231\u20133239."},{"key":"e_1_3_2_107_2","doi-asserted-by":"publisher","DOI":"10.1145\/3383313.3412236"},{"key":"e_1_3_2_108_2","doi-asserted-by":"publisher","DOI":"10.1257\/jep.2.1.191"},{"key":"e_1_3_2_109_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240323.3240347"},{"key":"e_1_3_2_110_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340631.3398666"},{"key":"e_1_3_2_111_2","doi-asserted-by":"publisher","DOI":"10.1111\/j.1540-6261.1961.tb02789.x"},{"key":"e_1_3_2_112_2","first-page":"591","volume-title":"Proc. of the 25th Conference on Uncertainty in Artificial Intelligence (UAI\u201909)","author":"Walsh T. J.","year":"2009","unstructured":"T. J. Walsh, I. Szita, C. Diuk, and M. L. Littman. 2009. Exploring compact reinforcement-learning representations with linear regression. In Proc. of the 25th Conference on Uncertainty in Artificial Intelligence (UAI\u201909). AUAI Press, 591\u2013598."},{"key":"e_1_3_2_113_2","series-title":"Proc. of The 35th Uncertainty in Artificial Intelligence Conference","first-page":"113","volume":"115","author":"Wang Y.","year":"2020","unstructured":"Y. Wang, H. He, and X. Tan. 2020. Truly proximal policy optimization. In Proc. of The 35th Uncertainty in Artificial Intelligence Conference(UAI\u201921, Vol. 115). PMLR, 113\u2013122."},{"key":"e_1_3_2_114_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401147"},{"key":"e_1_3_2_115_2","series-title":"NeurIPS\u201920","volume-title":"Advances in Neural Information Processing Systems","author":"Yu T.","year":"2020","unstructured":"T. Yu, G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma. 2020. MOPO: Model-based offline policy optimization. In Advances in Neural Information Processing Systems(NeurIPS\u201920, Vol. 
33)."},{"key":"e_1_3_2_116_2","doi-asserted-by":"publisher","DOI":"10.1145\/3298689.3346997"},{"key":"e_1_3_2_117_2","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330668"}],"container-title":["ACM Transactions on Recommender Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3568029","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3568029","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:26:14Z","timestamp":1750281974000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3568029"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,7]]},"references-count":116,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,3,31]]}},"alternative-id":["10.1145\/3568029"],"URL":"https:\/\/doi.org\/10.1145\/3568029","relation":{},"ISSN":["2770-6699"],"issn-type":[{"value":"2770-6699","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,7]]},"assertion":[{"value":"2022-03-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-08-25","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}