{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T19:11:09Z","timestamp":1757617869116,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":31,"publisher":"ACM","funder":[{"DOI":"10.13039\/100000001","name":"NSF (National Science Foundation)","doi-asserted-by":"publisher","award":["IIS-2312865, OAC-2311521"],"award-info":[{"award-number":["IIS-2312865, OAC-2311521"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,9,22]]},"DOI":"10.1145\/3705328.3748088","type":"proceedings-article","created":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T10:46:13Z","timestamp":1757155573000},"page":"41-50","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-6378-4365","authenticated-orcid":false,"given":"Haruka","family":"Kiyohara","sequence":"first","affiliation":[{"name":"Cornell University, Ithaca, NY, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-5294-5542","authenticated-orcid":false,"given":"Daniel Yiming","family":"Cao","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, NY, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4357-5835","authenticated-orcid":false,"given":"Yuta","family":"Saito","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, NY, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3654-3683","authenticated-orcid":false,"given":"Thorsten","family":"Joachims","sequence":"additional","affiliation":[{"name":"Cornell University, Ithaca, NY, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,9,7]]},"reference":[{"key":"e_1_3_3_2_2_2","unstructured":"Yuntao Bai Saurav Kadavath Sandipan Kundu Amanda Askell Jackson Kernion Andy Jones Anna Chen Anna Goldie Azalia Mirhoseini Cameron McKinnon et\u00a0al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2212.08073 (2022)."},{"key":"e_1_3_3_2_3_2","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel\u00a0M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever and Dario Amodei. 2020. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (2020) 1877\u20131901."},{"key":"e_1_3_3_2_4_2","unstructured":"Paul\u00a0F Christiano Jan Leike Tom Brown Miljan Martic Shane Legg and Dario Amodei. 2017. Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems 30 (2017)."},{"key":"e_1_3_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.emnlp-main.222"},{"key":"e_1_3_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.5555\/3104482.3104620"},{"key":"e_1_3_3_2_7_2","first-page":"12215","volume-title":"Proceedings of the 41th International Conference on Machine Learning","volume":"235","author":"Dwaracherla Vikranth","year":"2024","unstructured":"Vikranth Dwaracherla, Seyed\u00a0Mohammad Asghari, Botao Hao, and Benjamin Van\u00a0Roy. 2024. Efficient Exploration for LLMs. In Proceedings of the 41th International Conference on Machine Learning , Vol.\u00a0235. 12215\u201312227."},{"key":"e_1_3_3_2_8_2","volume-title":"International Conference on Learning Representations","author":"Fu Justin","year":"2020","unstructured":"Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Alexander Novikov, Mengjiao Yang, Michael\u00a0R Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, et\u00a0al. 2020. Benchmarks for Deep Off-Policy Evaluation. In International Conference on Learning Representations."},{"key":"e_1_3_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3159652.3159687"},{"key":"e_1_3_3_2_10_2","doi-asserted-by":"crossref","unstructured":"F\u00a0Maxwell Harper and Joseph\u00a0A Konstan. 2015. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5 4 (2015) 1\u201319.","DOI":"10.1145\/2827872"},{"key":"e_1_3_3_2_11_2","first-page":"1645","volume-title":"International Conference on Machine Learning","author":"Jaques Natasha","year":"2017","unstructured":"Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, Jos\u00e9\u00a0Miguel Hern\u00e1ndez-Lobato, Richard\u00a0E Turner, and Douglas Eck. 2017. Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control. In International Conference on Machine Learning. PMLR, 1645\u20131654."},{"key":"e_1_3_3_2_12_2","unstructured":"Albert\u00a0Q Jiang Alexandre Sablayrolles Arthur Mensch Chris Bamford Devendra\u00a0Singh Chaplot Diego de\u00a0las Casas Florian Bressand Gianna Lengyel Guillaume Lample Lucile Saulnier L\u00e9lio\u00a0Renard Lavaud Marie-Anne Lachaux Pierre Stock Teven\u00a0Le Scao Thibaut Lavril Thomas Wang Timoth\u00e9e Lacroix and William\u00a0El Sayed. 2023. Mistral 7B. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2310.06825 (2023)."},{"key":"e_1_3_3_2_13_2","first-page":"1243","volume-title":"International Conference on Artificial Intelligence and Statistics","author":"Kallus Nathan","year":"2018","unstructured":"Nathan Kallus and Angela Zhou. 2018. Policy Evaluation and Optimization with Continuous Treatments. In International Conference on Artificial Intelligence and Statistics. 1243\u20131251."},{"key":"e_1_3_3_2_14_2","unstructured":"Haruka Kiyohara Ren Kishimoto Kosuke Kawakami Ken Kobayashi Kazuhide Nakata and Yuta Saito. 2024. Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation. International Conference on Learning Representations (2024)."},{"key":"e_1_3_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3488560.3498380"},{"key":"e_1_3_3_2_16_2","unstructured":"Vijay Konda and John Tsitsiklis. 1999. Actor-Critic Algorithms. Advances in Neural Information Processing Systems 12 (1999)."},{"key":"e_1_3_3_2_17_2","unstructured":"Harrison Lee Samrat Phatale Hassan Mansoor Kellie Lu Thomas Mesnard Colton Bishop Victor Carbune and Abhinav Rastogi. 2023. Rlaif: Scaling Reinforcement Learning from Human Feedback with AI Feedback. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2309.00267 (2023)."},{"key":"e_1_3_3_2_18_2","unstructured":"Xiaoqiang Lin Zhongxiang Dai Arun Verma See-Kiong Ng Patrick Jaillet and Bryan Kian\u00a0Hsiang Low. 2024. Prompt Optimization with Human Feedback. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2405.17346 (2024)."},{"key":"e_1_3_3_2_19_2","volume-title":"International Conference on Learning Representations","author":"Matsushima Tatsuya","year":"2021","unstructured":"Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Gu. 2021. Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization. In International Conference on Learning Representations."},{"key":"e_1_3_3_2_20_2","unstructured":"Long Ouyang Jeffrey Wu Xu Jiang Diogo Almeida Carroll Wainwright Pamela Mishkin Chong Zhang Sandhini Agarwal Katarina Slama Alex Ray et\u00a0al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022) 27730\u201327744."},{"key":"e_1_3_3_2_21_2","first-page":"759","volume-title":"Proceedings of the 17th International Conference on Machine Learning","author":"Precup Doina","year":"2000","unstructured":"Doina Precup, Richard\u00a0S. Sutton, and Satinder\u00a0P. Singh. 2000. Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the 17th International Conference on Machine Learning. 759\u2013766."},{"key":"e_1_3_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403139"},{"key":"e_1_3_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3589334.3645501"},{"key":"e_1_3_3_2_24_2","unstructured":"Chitwan Saharia William Chan Saurabh Saxena Lala Li Jay Whang Emily\u00a0L Denton Kamyar Ghasemipour Raphael Gontijo\u00a0Lopes Burcu Karagol\u00a0Ayan Tim Salimans et\u00a0al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35 (2022) 36479\u201336494."},{"key":"e_1_3_3_2_25_2","volume-title":"35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track","author":"Saito Yuta","year":"2021","unstructured":"Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. 2021. Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. In 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track."},{"key":"e_1_3_3_2_26_2","first-page":"19089","volume-title":"Proceedings of the 39th International; Conference of Machine Learning","volume":"162","author":"Saito Yuta","year":"2022","unstructured":"Yuta Saito and Thorsten Joachims. 2022. Off-policy evaluation for large action spaces via embeddings. In Proceedings of the 39th International; Conference of Machine Learning , Vol.\u00a0162. 19089\u201319122."},{"key":"e_1_3_3_2_27_2","first-page":"29734","volume-title":"Proceedings of the 40th International; Conference of Machine Learning","volume":"202","author":"Saito Yuta","year":"2023","unstructured":"Yuta Saito, Qingyang Ren, and Thorsten Joachims. 2023. Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling. In Proceedings of the 40th International; Conference of Machine Learning , Vol.\u00a0202. 29734\u201329759."},{"key":"e_1_3_3_2_28_2","unstructured":"Yuta Saito Jihan Yao and Thorsten Joachims. 2024. POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2402.06151 (2024)."},{"key":"e_1_3_3_2_29_2","unstructured":"Victor Sanh Lysandre Debut Julien Chaumond and Thomas Wolf. 2019. DistilBERT a distilled version of BERT: smaller faster cheaper and lighter. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1910.01108 (2019)."},{"key":"e_1_3_3_2_30_2","volume-title":"The 11th International Conference on Learning Representations","author":"Snell Charlie\u00a0Victor","year":"2022","unstructured":"Charlie\u00a0Victor Snell, Ilya Kostrikov, Yi Su, Sherry Yang, and Sergey Levine. 2022. Offline RL for Natural Language Generation with Implicit Language Q Learning. In The 11th International Conference on Learning Representations."},{"key":"e_1_3_3_2_31_2","unstructured":"Nisan Stiennon Long Ouyang Jeffrey Wu Daniel Ziegler Ryan Lowe Chelsea Voss Alec Radford Dario Amodei and Paul\u00a0F Christiano. 2020. Learning to Summarize with Human Feedback. Advances in Neural Information Processing Systems 33 (2020) 3008\u20133021."},{"key":"e_1_3_3_2_32_2","unstructured":"Adith Swaminathan and Thorsten Joachims. 2015. Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. The Journal of Machine Learning Research 16 1 (2015) 1731\u20131755."}],"event":{"name":"RecSys '25: Nineteenth ACM Conference on Recommender Systems","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction","SIGAI ACM Special Interest Group on Artificial Intelligence","SIGIR ACM Special Interest Group on Information Retrieval","SIGKDD ACM Special Interest Group on Knowledge Discovery in Data","SIGWEB ACM Special Interest Group on Hypertext, Hypermedia, and Web"],"location":"Prague Czech Republic","acronym":"RecSys '25"},"container-title":["Proceedings of the Nineteenth ACM Conference on Recommender Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3705328.3748088","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T11:48:11Z","timestamp":1757159291000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3705328.3748088"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,7]]},"references-count":31,"alternative-id":["10.1145\/3705328.3748088","10.1145\/3705328"],"URL":"https:\/\/doi.org\/10.1145\/3705328.3748088","relation":{},"subject":[],"published":{"date-parts":[[2025,9,7]]},"assertion":[{"value":"2025-09-07","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}