{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T18:21:54Z","timestamp":1775067714628,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":51,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,1,20]],"date-time":"2020-01-20T00:00:00Z","timestamp":1579478400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,1,20]]},"DOI":"10.1145\/3336191.3371801","type":"proceedings-article","created":{"date-parts":[[2020,1,22]],"date-time":"2020-01-22T19:08:16Z","timestamp":1579720096000},"page":"816-824","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":93,"title":["Pseudo Dyna-Q"],"prefix":"10.1145","author":[{"given":"Lixin","family":"Zou","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Long","family":"Xia","sequence":"additional","affiliation":[{"name":"York University, Toronto, Canada"}]},{"given":"Pan","family":"Du","sequence":"additional","affiliation":[{"name":"University of Montreal, Montreal, Canada"}]},{"given":"Zhuo","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Melbourne, Melbourne, Australia"}]},{"given":"Ting","family":"Bai","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}]},{"given":"Weidong","family":"Liu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Jian-Yun","family":"Nie","sequence":"additional","affiliation":[{"name":"University of Montreal, Montreal, Canada"}]},{"given":"Dawei","family":"Yin","sequence":"additional","affiliation":[{"name":"JD Data Science Lab, Beijing, 
China"}]}],"member":"320","published-online":{"date-parts":[[2020,1,22]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3210129"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3331184.3331199"},{"key":"e_1_3_2_1_3_1","volume-title":"Neural networks: Tricks of the trade","author":"Bottou L\u00e9on","unstructured":"L\u00e9on Bottou . 2012. Stochastic gradient descent tricks . In Neural networks: Tricks of the trade . Springer , 421--436. L\u00e9on Bottou. 2012. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade. Springer, 421--436."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052627"},{"key":"e_1_3_2_1_5_1","volume-title":"Large-scale Interactive Recommendation with Tree-structured Policy Gradient. arXiv preprint arXiv:1811.05869","author":"Chen Haokun","year":"2018","unstructured":"Haokun Chen , Xinyi Dai , Han Cai , Weinan Zhang , Xuejian Wang, Ruiming Tang , Yuzhou Zhang , and Yong Yu. 2018. Large-scale Interactive Recommendation with Tree-structured Policy Gradient. arXiv preprint arXiv:1811.05869 ( 2018 ). Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. 2018. Large-scale Interactive Recommendation with Tree-structured Policy Gradient. arXiv preprint arXiv:1811.05869 (2018)."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2019\/293"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3159652.3159668"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2988450.2988454"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052585"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390334.1390446"},{"key":"e_1_3_2_1_11_1","volume-title":"Deep reinforcement learning in large discrete action spaces. 
arXiv preprint arXiv:1512.07679","author":"Dulac-Arnold Gabriel","year":"2015","unstructured":"Gabriel Dulac-Arnold , Richard Evans , Hado van Hasselt , Peter Sunehag , Timothy Lillicrap , Jonathan Hunt , Timothy Mann , Theophane Weber, Thomas Degris , and Ben Coppin . 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 ( 2015 ). Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015)."},{"key":"e_1_3_2_1_12_1","volume-title":"More Robust Doubly Robust Off-policy Evaluation. ICML'18","author":"Farajtabar Mehrdad","year":"2018","unstructured":"Mehrdad Farajtabar , Yinlam Chow , and Mohammad Ghavamzadeh . 2018 . More Robust Doubly Robust Off-policy Evaluation. ICML'18 (2018). Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. 2018. More Robust Doubly Robust Off-policy Evaluation. ICML'18 (2018)."},{"key":"e_1_3_2_1_13_1","volume-title":"Hierarchical User Profiling for E-commerce Recommender Systems. In WSDM'20","author":"Gu Yulong","year":"2020","unstructured":"Yulong Gu , Zhuoye Ding , Shuaiqiang Wang , and Dawei Yin . 2020 . Hierarchical User Profiling for E-commerce Recommender Systems. In WSDM'20 . ACM. Yulong Gu, Zhuoye Ding, Shuaiqiang Wang, and Dawei Yin. 2020. Hierarchical User Profiling for E-commerce Recommender Systems. In WSDM'20. ACM."},{"key":"e_1_3_2_1_14_1","first-page":"1457","article-title":"Non-negative matrix factorization with sparseness constraints","author":"Hoyer Patrik O","year":"2004","unstructured":"Patrik O Hoyer . 2004 . Non-negative matrix factorization with sparseness constraints . Journal of machine learning research 5 , Nov (2004), 1457 -- 1469 . Patrik O Hoyer. 2004. Non-negative matrix factorization with sparseness constraints. 
Journal of machine learning research 5, Nov (2004), 1457--1469.","journal-title":"Journal of machine learning research 5"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330790"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1198\/106186008X320456"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Dietmar Jannach and Malte Ludewig. 2017. When recurrent neural networks meet the neighborhood for session-based recommendation. In RecSys'17. ACM 306--310.  Dietmar Jannach and Malte Ludewig. 2017. When recurrent neural networks meet the neighborhood for session-based recommendation. In RecSys'17. ACM 306--310.","DOI":"10.1145\/3109859.3109872"},{"key":"e_1_3_2_1_18_1","volume-title":"ICML'15","author":"Jiang Nan","year":"2015","unstructured":"Nan Jiang and Lihong Li . 2015 . Doubly robust off-policy value evaluation for reinforcement learning . ICML'15 (2015). Nan Jiang and Lihong Li. 2015. Doubly robust off-policy value evaluation for reinforcement learning. ICML'15 (2015)."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2009.263"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132847.3132926"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1772690.1772758"},{"key":"e_1_3_2_1_22_1","volume-title":"Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602","author":"Mnih Volodymyr","year":"2013","unstructured":"Volodymyr Mnih , Koray Kavukcuoglu , David Silver , Alex Graves , Ioannis Antonoglou , Daan Wierstra , and Martin Riedmiller . 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 ( 2013 ). Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. 
arXiv preprint arXiv:1312.5602 (2013)."},{"key":"e_1_3_2_1_23_1","volume-title":"Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning. ACL'18","author":"Peng Baolin","year":"2018","unstructured":"Baolin Peng , Xiujun Li , Jianfeng Gao , Jingjing Liu , Kam-Fai Wong , and Shang-Yu Su . 2018 . Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning. ACL'18 (2018). Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Kam-Fai Wong, and Shang-Yu Su. 2018. Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning. ACL'18 (2018)."},{"key":"e_1_3_2_1_24_1","volume-title":"SDM'14","author":"Qin Lijing","unstructured":"Lijing Qin , Shouyuan Chen , and Xiaoyan Zhu . 2014. Contextual combinatorial bandit and its application on diversified online recommendation . In SDM'14 . SIAM , 461--469. Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. 2014. Contextual combinatorial bandit and its application on diversified online recommendation. In SDM'14. SIAM, 461--469."},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2010.127"},{"key":"e_1_3_2_1_26_1","volume-title":"BPR: Bayesian personalized ranking from implicit feedback. In UAI'09","author":"Rendle Steffen","year":"2009","unstructured":"Steffen Rendle , Christoph Freudenthaler , Zeno Gantner , and Lars Schmidt-Thieme . 2009 . BPR: Bayesian personalized ranking from implicit feedback. In UAI'09 . AUAI Press , 452--461. Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI'09. AUAI Press, 452--461."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1772690.1772773"},{"key":"e_1_3_2_1_28_1","volume-title":"RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising. 
RecSys'18","author":"Rohde David","year":"2018","unstructured":"David Rohde , Stephen Bonner , Travis Dunlop , Flavian Vasile , and Alexandros Karatzoglou . 2018. RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising. RecSys'18 ( 2018 ). David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou. 2018. RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising. RecSys'18 (2018)."},{"key":"e_1_3_2_1_29_1","volume-title":"ICML'16","author":"Schnabel Tobias","year":"2016","unstructured":"Tobias Schnabel , Adith Swaminathan , Ashudeep Singh , Navin Chandak , and Thorsten Joachims . 2016 . Recommendations as Treatments: Debiasing Learning and Evaluation . In ICML'16 . 1670--1679. Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as Treatments: Debiasing Learning and Evaluation. In ICML'16. 1670--1679."},{"key":"e_1_3_2_1_30_1","first-page":"1265","article-title":"An MDP-based recommender system","author":"Shani Guy","year":"2005","unstructured":"Guy Shani , David Heckerman , and Ronen I Brafman . 2005 . An MDP-based recommender system . Journal of Machine Learning Research 6 , Sep (2005), 1265 -- 1295 . Guy Shani, David Heckerman, and Ronen I Brafman. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265-- 1295.","journal-title":"Journal of Machine Learning Research 6"},{"key":"e_1_3_2_1_31_1","volume-title":"Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al.","author":"Silver David","year":"2016","unstructured":"David Silver , Aja Huang , Chris J Maddison , Arthur Guez , Laurent Sifre , George Van Den Driessche , Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016 . Mastering the game of Go with deep neural networks and tree search. nature 529, 7587 (2016), 484. 
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. nature 529, 7587 (2016), 484."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"crossref","unstructured":"David Silver Julian Schrittwieser Karen Simonyan Ioannis Antonoglou Aja Huang Arthur Guez Thomas Hubert Lucas Baker Matthew Lai Adrian Bolton et al. 2017. Mastering the game of Go without human knowledge. Nature 550 7676 (2017) 354.  David Silver Julian Schrittwieser Karen Simonyan Ioannis Antonoglou Aja Huang Arthur Guez Thomas Hubert Lucas Baker Matthew Lai Adrian Bolton et al. 2017. Mastering the game of Go without human knowledge. Nature 550 7676 (2017) 354.","DOI":"10.1038\/nature24270"},{"key":"e_1_3_2_1_33_1","volume-title":"NIPS'15","author":"Sukhbaatar Sainbayar","year":"2015","unstructured":"Sainbayar Sukhbaatar , Jason Weston, Rob Fergus , 2015 . End-to-end memory networks . In NIPS'15 . 2440--2448. Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In NIPS'15. 2440--2448."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/122344.122377"},{"key":"e_1_3_2_1_35_1","volume-title":"Reinforcement learning: An introduction","author":"Sutton Richard S","unstructured":"Richard S Sutton and Andrew G Barto . 2018. Reinforcement learning: An introduction . MIT press . Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press."},{"key":"e_1_3_2_1_36_1","volume-title":"NIPS'09","author":"Sutton Richard S","year":"2009","unstructured":"Richard S Sutton , Hamid R. Maei , and Csaba Szepesv\u00e1ri . 2009 . A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation . In NIPS'09 . 1609--1616. Richard S Sutton, Hamid R. Maei, and Csaba Szepesv\u00e1ri. 2009. 
A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation. In NIPS'09. 1609--1616."},{"key":"e_1_3_2_1_37_1","volume-title":"ICML'15","author":"Swaminathan Adith","year":"2015","unstructured":"Adith Swaminathan and Thorsten Joachims . 2015 . Counterfactual risk minimization: Learning from logged bandit feedback . In ICML'15 . 814--823. Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML'15. 814--823."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3159652.3159656"},{"key":"e_1_3_2_1_39_1","volume-title":"AAAI'15","author":"Thomas Philip S","year":"2015","unstructured":"Philip S Thomas , Georgios Theocharous , and Mohammad Ghavamzadeh . 2015 . High-Confidence Off-Policy Evaluation .. In AAAI'15 . 3000--3006. Philip S Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. 2015. High-Confidence Off-Policy Evaluation.. In AAAI'15. 3000--3006."},{"key":"e_1_3_2_1_40_1","volume-title":"ICML'17","author":"Touati Ahmed","year":"2017","unstructured":"Ahmed Touati , Pierre-Luc Bacon , Doina Precup , and Pascal Vincent . 2017 . Convergent tree-backup and retrace with function approximation . ICML'17 (2017). Ahmed Touati, Pierre-Luc Bacon, Doina Precup, and Pascal Vincent. 2017. Convergent tree-backup and retrace with function approximation. ICML'17 (2017)."},{"key":"e_1_3_2_1_41_1","volume-title":"NIPS'17","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , \u0141ukasz Kaiser, and Illia Polosukhin . 2017 . Attention is all you need . In NIPS'17 . 5998--6008. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS'17. 
5998--6008."},{"key":"e_1_3_2_1_42_1","volume-title":"AAAI'17","year":"2017","unstructured":"Huazheng Wang, Qingyun Wu, and Hongning Wang. 2017 . Factorization Bandits for Interactive Recommendation .. In AAAI'17 . 2695--2702. Huazheng Wang, Qingyun Wu, and Hongning Wang. 2017. Factorization Bandits for Interactive Recommendation.. In AAAI'17. 2695--2702."},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3159652.3159710"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2835776.2835837"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939878"},{"key":"e_1_3_2_1_46_1","volume-title":"Long Xia, Jiliang Tang, and Dawei Yin with Martin Vesely as coordinator. ACM SIGWEB Newsletter Spring","author":"Zhao Xiangyu","year":"2019","unstructured":"Xiangyu Zhao , Long Xia , Jiliang Tang , and Dawei Yin . 2019. Deep reinforcement learning for search, recommendation, and online advertising: a survey by Xiangyu Zhao , Long Xia, Jiliang Tang, and Dawei Yin with Martin Vesely as coordinator. ACM SIGWEB Newsletter Spring ( 2019 ), 4. Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin. 2019. Deep reinforcement learning for search, recommendation, and online advertising: a survey by Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin with Martin Vesely as coordinator. ACM SIGWEB Newsletter Spring (2019), 4."},{"key":"e_1_3_2_1_47_1","unstructured":"Xiangyu Zhao Long Xia Liang Zhang Zhuoye Ding Dawei Yin and Jiliang Tang. 2018. Deep Reinforcement Learning for Page-wise Recommendations. In RecSys'18. ACM 95--103.  Xiangyu Zhao Long Xia Liang Zhang Zhuoye Ding Dawei Yin and Jiliang Tang. 2018. Deep Reinforcement Learning for Page-wise Recommendations. In RecSys'18. 
ACM 95--103."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219886"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178876.3185994"},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330668"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-18579-4_7"}],"event":{"name":"WSDM '20: The Thirteenth ACM International Conference on Web Search and Data Mining","location":"Houston TX USA","acronym":"WSDM '20","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGWEB ACM Special Interest Group on Hypertext, Hypermedia, and Web","SIGKDD ACM Special Interest Group on Knowledge Discovery in Data","SIGIR ACM Special Interest Group on Information Retrieval"]},"container-title":["Proceedings of the 13th International Conference on Web Search and Data Mining"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3336191.3371801","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3336191.3371801","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:26:10Z","timestamp":1750206370000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3336191.3371801"}},"subtitle":["A Reinforcement Learning Framework for Interactive Recommendation"],"short-title":[],"issued":{"date-parts":[[2020,1,20]]},"references-count":51,"alternative-id":["10.1145\/3336191.3371801","10.1145\/3336191"],"URL":"https:\/\/doi.org\/10.1145\/3336191.3371801","relation":{},"subject":[],"published":{"date-parts":[[2020,1,20]]},"assertion":[{"value":"2020-01-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}