{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T05:20:48Z","timestamp":1755926448892,"version":"3.41.0"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2020,4,21]],"date-time":"2020-04-21T00:00:00Z","timestamp":1587427200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100002429","name":"VSNU Vereniging van Universiteiten","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100002429","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Ahold Delhaize"},{"name":"ICAI Innovation Center for Artificial Intelligence"},{"DOI":"10.13039\/501100003246","name":"Nederlandse Organisatie voor Wetenschappelijk Onderzoek","doi-asserted-by":"publisher","award":["612.001.551"],"award-info":[{"award-number":["612.001.551"]}],"id":[{"id":"10.13039\/501100003246","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2020,7,31]]},"abstract":"<jats:p>Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, recommendation, and so on. However, existing learning methods for contextual bandit problems have one of two drawbacks: They either do not explore the space of all possible document rankings (i.e., actions) and, thus, may miss the optimal ranking, or they present suboptimal rankings to a user and, thus, may harm the user experience. We introduce a new learning method for contextual bandit problems, Safe Exploration Algorithm (SEA), which overcomes the above drawbacks. 
SEA starts by using a baseline (or production) ranking system (i.e., policy), which does not harm the user experience and, thus, is safe to execute but has suboptimal performance and, thus, needs to be improved. Then SEA uses counterfactual learning to learn a new policy based on the behavior of the baseline policy. SEA also uses high-confidence off-policy evaluation to estimate the performance of the newly learned policy. Once the performance of the newly learned policy is at least as good as the performance of the baseline policy, SEA starts using the new policy to execute new actions, allowing it to actively explore favorable regions of the action space. This way, SEA never performs worse than the baseline policy and, thus, does not harm the user experience, while still exploring the action space and, thus, being able to find an optimal policy. Our experiments using text classification and document retrieval confirm the above by comparing SEA (and a boundless variant called BSEA) to online and offline learning methods for contextual bandit problems.<\/jats:p>","DOI":"10.1145\/3385670","type":"journal-article","created":{"date-parts":[[2020,5,4]],"date-time":"2020-05-04T09:57:41Z","timestamp":1588586261000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Safe Exploration for Optimizing Contextual Bandits"],"prefix":"10.1145","volume":"38","author":[{"given":"Rolf","family":"Jagerman","sequence":"first","affiliation":[{"name":"University of Amsterdam, Amsterdam, The Netherlands"}]},{"given":"Ilya","family":"Markov","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Amsterdam, The Netherlands"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1086-0202","authenticated-orcid":false,"given":"Maarten","family":"de Rijke","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Amsterdam, The 
Netherlands"}]}],"member":"320","published-online":{"date-parts":[[2020,4,21]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_2_1_1_1","DOI":"10.1145\/1055709.1055714"},{"volume-title":"Proceedings of the International Conference on Machine Learning. 127--135","year":"2013","author":"Agrawal Shipra","key":"e_1_2_1_2_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_3_1","DOI":"10.1145\/1557019.1557040"},{"unstructured":"Nicol\u00f2 Cesa-Bianchi, Claudio Gentile, Gergely Neu, and Gabor Lugosi. 2017. Boltzmann exploration done right. In Advances in Neural Information Processing Systems. 6275--6284.","key":"e_1_2_1_4_1"},{"volume-title":"Proceedings of the Learning to Rank Challenge. 1--24","year":"2011","author":"Chapelle Olivier","key":"e_1_2_1_5_1"},{"doi-asserted-by":"crossref","unstructured":"Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. 2015. Click Models for Web Search. Morgan & Claypool.","key":"e_1_2_1_6_1","DOI":"10.1007\/978-3-031-02294-4"},{"doi-asserted-by":"publisher","key":"e_1_2_1_7_1","DOI":"10.1145\/2505515.2507859"},{"volume-title":"Proceedings of the 5th Asian Conference on Machine Learning. 
PMLR, 245--260","year":"2013","author":"Galichet Nicolas","key":"e_1_2_1_8_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_9_1","DOI":"10.1613\/jair.3761"},{"doi-asserted-by":"publisher","key":"e_1_2_1_10_1","DOI":"10.1561\/1500000067"},{"doi-asserted-by":"publisher","key":"e_1_2_1_11_1","DOI":"10.1145\/2766462.2767730"},{"doi-asserted-by":"publisher","key":"e_1_2_1_12_1","DOI":"10.1145\/2433396.2433419"},{"doi-asserted-by":"publisher","key":"e_1_2_1_13_1","DOI":"10.1007\/978-3-642-20161-5_25"},{"volume-title":"Proceedings of the Conference on Neural Information Processing Systems Workshop on Bayesian Optimization, Experimental Design, and Bandits (NIPS\u201911)","year":"2011","author":"Hofmann Katja","key":"e_1_2_1_14_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_15_1","DOI":"10.1145\/2536736.2536737"},{"doi-asserted-by":"publisher","key":"e_1_2_1_16_1","DOI":"10.1109\/34.291440"},{"volume-title":"Risk-aware multi-armed bandit problem with application to portfolio selection. Arxiv Preprint Arxiv:1709.04415","year":"2017","author":"Huo Xiaoguang","key":"e_1_2_1_17_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_18_1","DOI":"10.1145\/3331184.3331269"},{"doi-asserted-by":"publisher","key":"e_1_2_1_19_1","DOI":"10.1145\/582415.582418"},{"doi-asserted-by":"publisher","key":"e_1_2_1_20_1","DOI":"10.1145\/775047.775067"},{"doi-asserted-by":"publisher","key":"e_1_2_1_21_1","DOI":"10.1145\/1076034.1076063"},{"volume-title":"Proceedings of the International Conference on Learning Representations.","year":"2018","author":"Joachims Thorsten","key":"e_1_2_1_22_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_23_1","DOI":"10.1145\/3018661.3018699"},{"doi-asserted-by":"publisher","key":"e_1_2_1_24_1","DOI":"10.1023\/A:1022689909846"},{"doi-asserted-by":"publisher","key":"e_1_2_1_25_1","DOI":"10.1145\/1390156.1390212"},{"volume-title":"Advances in Neural Information Processing Systems 30. 
Curran Associates","author":"Kazerouni Abbas","key":"e_1_2_1_26_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_27_1","DOI":"10.5555\/3091622.3091662"},{"doi-asserted-by":"publisher","key":"e_1_2_1_28_1","DOI":"10.1145\/1390156.1390223"},{"volume-title":"RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 5 (Apr","year":"2004","author":"Lewis David D.","key":"e_1_2_1_29_1"},{"volume-title":"Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI\u201919)","year":"2019","author":"Li Chang","key":"e_1_2_1_30_1"},{"volume-title":"Proceedings of the 19th International Conference on World Wide Web. ACM, 661--670","author":"Li Lihong","key":"e_1_2_1_31_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_32_1","DOI":"10.1145\/1935826.1935878"},{"doi-asserted-by":"publisher","key":"e_1_2_1_33_1","DOI":"10.1145\/2911451.2914763"},{"doi-asserted-by":"publisher","key":"e_1_2_1_34_1","DOI":"10.1145\/3132847.3132896"},{"doi-asserted-by":"publisher","key":"e_1_2_1_35_1","DOI":"10.1007\/978-3-319-30671-1_50"},{"volume-title":"Monte Carlo Theory, Methods and Examples","year":"2013","author":"Owen Art B.","key":"e_1_2_1_36_1"},{"volume-title":"Arxiv Preprint Arxiv:1306.2597","year":"2013","author":"Qin Tao","key":"e_1_2_1_37_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_38_1","DOI":"10.1145\/1458082.1458092"},{"volume-title":"Advances in Neural Information Processing Systems","year":"2010","author":"Strehl Alex","key":"e_1_2_1_39_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_40_1","DOI":"10.1145\/1143844.1143956"},{"volume-title":"Risk-aware algorithms for adversarial contextual bandits. Arxiv Preprint Arxiv:1610.05129","year":"2016","author":"Sun Wen","key":"e_1_2_1_41_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_42_1","DOI":"10.5555\/2789272.2886805"},{"volume-title":"Proceedings of the International Conference on Machine Learning. 
814--823","year":"2015","author":"Swaminathan Adith","key":"e_1_2_1_43_1"},{"volume-title":"Off-policy evaluation for slate recommendation. Arxiv Preprint Arxiv:1605.04812","year":"2016","author":"Swaminathan Adith","key":"e_1_2_1_44_1"},{"volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence (AAAI\u201915)","year":"2015","author":"Thomas Philip S.","key":"e_1_2_1_45_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_46_1","DOI":"10.1145\/3159652.3159732"},{"doi-asserted-by":"publisher","key":"e_1_2_1_47_1","DOI":"10.1145\/3077136.3080685"},{"doi-asserted-by":"publisher","key":"e_1_2_1_48_1","DOI":"10.1007\/s10791-009-9112-1"},{"volume-title":"Proceedings of the International Conference on Machine Learning. 1254--1262","year":"2016","author":"Wu Yifan","key":"e_1_2_1_49_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_50_1","DOI":"10.1145\/1553374.1553527"}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3385670","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3385670","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:32:49Z","timestamp":1750199569000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3385670"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,4,21]]},"references-count":50,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,7,31]]}},"alternative-id":["10.1145\/3385670"],"URL":"https:\/\/doi.org\/10.1145\/3385670","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"type":"print","value":"1046-8188"},{"type":"electronic","value":"1558-2868"}],"subject":[],"published":{"date-parts":[[2020,4,21]]},"assertion":[{"value":"2019-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-04-21","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}