{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T05:19:51Z","timestamp":1773465591389,"version":"3.50.1"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2019,10,15]],"date-time":"2019-10-15T00:00:00Z","timestamp":1571097600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100004377","name":"The Hong Kong Polytechnic University","doi-asserted-by":"crossref","award":["G-YBP6"],"award-info":[{"award-number":["G-YBP6"]}],"id":[{"id":"10.13039\/501100004377","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61672445"],"award-info":[{"award-number":["61672445"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Knowl. Discov. Data"],"published-print":{"date-parts":[[2019,12,31]]},"abstract":"<jats:p>In this article, we study a multi-step interactive recommendation problem for explicit-feedback recommender systems. Different from the existing works, we propose a novel user-specific deep reinforcement learning approach to the problem. Specifically, we first formulate the problem of interactive recommendation for each target user as a Markov decision process (MDP). We then derive a multi-MDP reinforcement learning task for all involved users. To model the possible relationships (including similarities and differences) between different users\u2019 MDPs, we construct user-specific latent states by using matrix factorization. 
After that, we propose a user-specific deep Q-learning (UDQN) method to estimate optimal policies based on the constructed user-specific latent states. Furthermore, we propose Biased UDQN (BUDQN) to explicitly model user-specific information by employing an additional bias parameter when estimating the Q-values for different users. Finally, we validate the effectiveness of our approach by comprehensive experimental results and analysis.<\/jats:p>","DOI":"10.1145\/3359554","type":"journal-article","created":{"date-parts":[[2019,10,15]],"date-time":"2019-10-15T16:35:58Z","timestamp":1571157358000},"page":"1-15","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":39,"title":["Interactive Recommendation with User-Specific Deep Reinforcement Learning"],"prefix":"10.1145","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8482-3140","authenticated-orcid":false,"given":"Yu","family":"Lei","sequence":"first","affiliation":[{"name":"The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenjie","family":"Li","sequence":"additional","affiliation":[{"name":"The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,10,15]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2005.99"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00224-008-9100-7"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1013689704352"},{"key":"e_1_2_1_4_1","volume-title":"Dynamic programming. Science 153, 3731","author":"Bellman R.","year":"1966"},{"key":"e_1_2_1_5_1","first-page":"71","article-title":"Bandit problems: Sequential allocation of experiments","volume":"5","author":"Berry D. 
A.","year":"1985","journal-title":"Monographs on Statistics and Applied Probability"},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","unstructured":"S. Chen Y. Yu Q. Da J. Tan H. Huang and H. Tang. 2018. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In SIGKDD. ACM 1187--1196.","DOI":"10.1145\/3219819.3220122"},{"key":"e_1_2_1_7_1","doi-asserted-by":"crossref","unstructured":"K. Christakopoulou F. Radlinski and K. Hofmann. 2016. Towards conversational recommender systems. In SIGKDD. ACM 815--824.","DOI":"10.1145\/2939672.2939746"},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","unstructured":"P. Cremonesi Y. Koren and R. Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In RecSys. ACM 39--46.","DOI":"10.1145\/1864708.1864721"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/s13735-012-0025-1"},{"key":"e_1_2_1_10_1","doi-asserted-by":"crossref","unstructured":"J. Gittins K. Glazebrook and R. Weber. 2011. Multi-armed Bandit Allocation Indices. John Wiley & Sons.","DOI":"10.1002\/9780470980033"},{"key":"e_1_2_1_11_1","doi-asserted-by":"crossref","unstructured":"X. He H. Zhang M. Kan and T. Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In SIGIR. 
ACM 549--558.","DOI":"10.1145\/2911451.2911489"},{"key":"e_1_2_1_12_1","doi-asserted-by":"crossref","unstructured":"B. Hu C. Shi and J. Liu. 2017. Playlist recommendation based on reinforcement learning. In ICIS. Springer 172--182.","DOI":"10.1007\/978-3-319-68121-4_18"},{"key":"e_1_2_1_13_1","unstructured":"J. Kawale H. H. Bui B. Kveton L. Tran-Thanh and S. Chawla. 2015. Efficient Thompson sampling for online matrix-factorization recommendation. In NIPS. 1297--1305."},{"key":"e_1_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Y. Koren. 2008. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In SIGKDD. ACM 426--434.","DOI":"10.1145\/1401890.1401944"},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Y. Koren and R. Bell. 2015. Advances in collaborative filtering. In Recommender Systems Handbook. Springer 77--118.","DOI":"10.1007\/978-1-4899-7637-6_3"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2009.263"},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"L. Li W. Chu J. Langford and R. E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW. ACM 661--670.","DOI":"10.1145\/1772690.1772758"},{"key":"e_1_2_1_19_1","doi-asserted-by":"crossref","unstructured":"V. Mnih K. 
Kavukcuoglu D. Silver A. A. Rusu J. Veness M. G. Bellemare A. Graves M. Riedmiller A. K. Fidjeland G. Ostrovski S. Petersen C. Beattie A. Sadik I. Antonoglou H. King D. Kumaran D. Wierstra S. Legg and D. Hassabis. 2015. Human-level control through deep reinforcement learning. Nature 518 7540 (2015) 529--533.","DOI":"10.1038\/nature14236"},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","unstructured":"R. P\u00e1lovics A. A. Bencz\u00far L. Kocsis T. Kiss and E. Frig\u00f3. 2014. Exploiting temporal influence in online recommendation. In RecSys. ACM 273--280.","DOI":"10.1145\/2645710.2645723"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"F. Ricci L. Rokach and B. Shapira. 2011. Introduction to recommender systems handbook. In Recommender Systems Handbook. Springer 1--35.","DOI":"10.1007\/978-0-387-85820-3_1"},{"key":"e_1_2_1_22_1","first-page":"1265","article-title":"An MDP-based recommender system","author":"Shani G.","year":"2005","journal-title":"Journal of Machine Learning Research 6"},{"key":"e_1_2_1_23_1","volume-title":"J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. 
Hassabis.","author":"Silver D.","year":"2016"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/551283"},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"N. Taghipour A. Kardan and S. S. Ghidary. 2007. Usage-based web recommendations: A reinforcement learning approach. In RecSys. ACM 113--120.","DOI":"10.1145\/1297231.1297250"},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"L. Tang Y. Jiang L. Li C. Zeng and T. Li. 2015. Personalized recommendation via parameter-free contextual bandits. In SIGIR. ACM 323--332.","DOI":"10.1145\/2766462.2767707"},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"H. P. Vanchinathan I. Nikolic F. De Bona and A. Krause. 2014. Explore-exploit in top-n recommender systems via Gaussian processes. In RecSys. ACM 225--232.","DOI":"10.1145\/2645710.2645733"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2623372"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF00992698"},{"key":"e_1_2_1_30_1","doi-asserted-by":"crossref","unstructured":"X. Zhao L. Xia L. Zhang Z. Ding D. Yin and J. Tang. 2018. Deep reinforcement learning for page-wise recommendations. In RecSys. ACM 95--103.","DOI":"10.1145\/3240323.3240374"},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","unstructured":"X. Zhao L. Zhang Z. Ding L. Xia J. Tang and D. Yin. 2018. 
Recommendations with negative feedback via pairwise deep reinforcement learning. In SIGKDD. ACM 1040--1048.","DOI":"10.1145\/3219819.3219886"},{"key":"e_1_2_1_32_1","doi-asserted-by":"crossref","unstructured":"X. Zhao W. Zhang and J. Wang. 2013. Interactive collaborative filtering. In CIKM. ACM 1411--1420.","DOI":"10.1145\/2505515.2505690"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3178876.3185994"}],"container-title":["ACM Transactions on Knowledge Discovery from Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3359554","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3359554","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:23:55Z","timestamp":1750202635000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3359554"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,15]]},"references-count":32,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2019,12,31]]}},"alternative-id":["10.1145\/3359554"],"URL":"https:\/\/doi.org\/10.1145\/3359554","relation":{},"ISSN":["1556-4681","1556-472X"],"issn-type":[{"value":"1556-4681","type":"print"},{"value":"1556-472X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,10,15]]},"assertion":[{"value":"2007-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2009-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-10-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}