{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T02:52:58Z","timestamp":1773888778553,"version":"3.50.1"},"reference-count":30,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2019,8,15]],"date-time":"2019-08-15T00:00:00Z","timestamp":1565827200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"National Natural Science Fund Projects","award":["61203192"],"award-info":[{"award-number":["61203192"]}]},{"name":"Natural Science Fund Project of Jiangsu Province","award":["BK2011124"],"award-info":[{"award-number":["BK2011124"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,7,17]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Simple and efficient exploration remains a core challenge in deep reinforcement learning. While many exploration methods can be applied to high-dimensional tasks, these methods manually adjust exploration parameters according to domain knowledge. This paper proposes a novel method that can automatically balance exploration and exploitation, as well as combine on-policy and off-policy update targets through a dynamic weighted way based on value difference. The proposed method does not directly affect the probability of a selected action but utilizes the value difference produced during the learning process to adjust update target for guiding the direction of agent\u2019s learning. We demonstrate the performance of the proposed method on CartPole-v1, MountainCar-v0, and LunarLander-v2 classic control tasks from the OpenAI Gym. Empirical evaluation results show that by integrating on-policy and off-policy update targets dynamically, this method exhibits superior performance and stability than does the exclusive use of the update target.<\/jats:p>","DOI":"10.1093\/comjnl\/bxz066","type":"journal-article","created":{"date-parts":[[2019,5,30]],"date-time":"2019-05-30T19:12:14Z","timestamp":1559243534000},"page":"995-1003","source":"Crossref","is-referenced-by-count":5,"title":["Deep Reinforcement Learning with Adaptive Update Target Combination"],"prefix":"10.1093","volume":"63","author":[{"given":"Z","family":"Xu","sequence":"first","affiliation":[{"name":"Institute of Command and Control Engineering, Army Engineering University, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"L","family":"Cao","sequence":"additional","affiliation":[{"name":"Institute of Command and Control Engineering, Army Engineering University, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"X","family":"Chen","sequence":"additional","affiliation":[{"name":"Institute of Command and Control Engineering, Army Engineering University, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2019,8,15]]},"reference":[{"key":"2020071706563162800_ref1","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1038\/nature14236","article-title":"Human-level control through deep reinforcement learning","volume":"518","author":"Mnih","year":"2015","journal-title":"Nature"},{"key":"2020071706563162800_ref2","first-page":"1889","article-title":"Trust region policy optimization","author":"Schulman","year":"2015","journal-title":"Computer Science"},{"key":"2020071706563162800_ref3","first-page":"564","article-title":"Generalization and exploration via randomized value functions","author":"Osband","year":"2014","journal-title":"Computer Science"},{"key":"2020071706563162800_ref4","author":"Osband","year":"2016"},{"key":"2020071706563162800_ref5","volume-title":"Advances in Neural Information Processing Systems (NIPS)","author":"Houthooft","year":"2016"},{"key":"2020071706563162800_ref6","volume-title":"IJCAI","author":"Celiberto","year":"2011"},{"key":"2020071706563162800_ref7","doi-asserted-by":"crossref","first-page":"102","DOI":"10.1016\/j.artint.2015.05.008","article-title":"Transferring knowledge as heuristics in reinforcement learning: a case-based approach","volume":"226","author":"Bianchi","year":"2015","journal-title":"Artificial Intelligence"},{"key":"2020071706563162800_ref8","volume-title":"Advances in Neural Information Processing Systems","author":"Bellemare","year":"2016"},{"key":"2020071706563162800_ref9","first-page":"03012","article-title":"Stochastic neural networks for hierarchical reinforcement learning","author":"Florensa","year":"2017"},{"key":"2020071706563162800_ref10","article-title":"Surprise-based intrinsic motivation for deep reinforcement learning","author":"Achiam","year":"2017"},{"key":"2020071706563162800_ref11","volume-title":"AAAI","author":"Van Hasselt","year":"2016"},{"key":"2020071706563162800_ref12","first-page":"1456","article-title":"Averaged-DQN: variance reduction and stabilization for deep reinforcement learning","author":"Anschel","year":"2016","journal-title":"Computer Science"},{"key":"2020071706563162800_ref13","volume-title":"Deep Reinforcement Learning: Frontiers and Challenges, IJCAI","author":"Hausknecht","year":"2016"},{"key":"2020071706563162800_ref14","first-page":"01327","article-title":"Multi-step reinforcement learning: a unifying algorithm","author":"De Asis","year":"2017"},{"issue":"3\u20134","key":"2020071706563162800_ref15","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1007\/BF00992698","article-title":"Q-learning","volume":"8","author":"Watkins","year":"1992","journal-title":"Machine Learning"},{"key":"2020071706563162800_ref16","first-page":"95","volume-title":"Online Q-learning Using Connectionist Systems","author":"Rummery","year":"1994"},{"key":"2020071706563162800_ref17","article-title":"Playing Atari with deep reinforcement learning","author":"Mnih","year":"2013"},{"key":"2020071706563162800_ref18","first-page":"1502","article-title":"Prioritized experience replay","author":"Schaul","year":"2015","journal-title":"Computer Science"},{"key":"2020071706563162800_ref19","first-page":"1115","article-title":"Dueling network architectures for deep reinforcement learning","author":"Wang","year":"2015","journal-title":"Computer Science"},{"key":"2020071706563162800_ref20","article-title":"Asynchronous methods for deep reinforcement learning","author":"Mnih","year":"2016"},{"key":"2020071706563162800_ref21","first-page":"1","article-title":"Deep reinforcement learning with experience replay based on SARSA","author":"Dongbin","year":"2016","journal-title":"In IEEE Symposium Series on Computational Intelligence (SSCI)."},{"issue":"04","key":"2020071706563162800_ref22","doi-asserted-by":"crossref","first-page":"159","DOI":"10.4236\/jdaip.2016.44014","article-title":"Double Sarsa and double expected Sarsa with shallow and deep learning","volume":"4","author":"Ganger","year":"2016","journal-title":"Journal of Data Analysis and Information Processing"},{"key":"2020071706563162800_ref23","volume-title":"Advances in Neural Information Processing Systems","author":"Hasselt","year":"2010"},{"key":"2020071706563162800_ref24","first-page":"1","article-title":"Ensemble network architecture for deep reinforcement learning","author":"Chen","year":"2018","journal-title":"Mathematical Problems in Engineering"},{"issue":"144","key":"2020071706563162800_ref25","first-page":"645","article-title":"Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo","volume":"211","author":"Zamora","year":"2016","journal-title":"Computer Science"},{"issue":"5","key":"2020071706563162800_ref26","first-page":"45","article-title":"A method for stochastic optimization","volume":"89","author":"Kingma","year":"2014","journal-title":"Computer Science"},{"key":"2020071706563162800_ref27","first-page":"568","article-title":"Deep reinforcement learning in parameterized action space","author":"Hausknecht","year":"2015","journal-title":"Computer Science."},{"issue":"9","key":"2020071706563162800_ref28","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3022670.2976746","article-title":"TensorFlow: learning functions at scale","author":"Abadi","year":"2016","journal-title":"Acm Sigplan Notices"},{"key":"2020071706563162800_ref29","first-page":"1563","article-title":"Research on timing problem of Lunar Lander guidance and control system based on simulation analysis","author":"Sun","year":"2018","journal-title":"Computer Simulation."},{"key":"2020071706563162800_ref30","first-page":"01540","article-title":"OpenAI Gym","author":"Brockman","year":"2016"}],"container-title":["The Computer Journal"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/comjnl\/article-pdf\/63\/7\/995\/33506014\/bxz066.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/comjnl\/article-pdf\/63\/7\/995\/33506014\/bxz066.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,7,17]],"date-time":"2020-07-17T18:38:01Z","timestamp":1595011081000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/comjnl\/article\/63\/7\/995\/5543068"}},"subtitle":[],"editor":[{"given":"Jin-Hee","family":"Cho","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2019,8,15]]},"references-count":30,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2019,8,15]]},"published-print":{"date-parts":[[2020,7,17]]}},"URL":"https:\/\/doi.org\/10.1093\/comjnl\/bxz066","relation":{},"ISSN":["0010-4620","1460-2067"],"issn-type":[{"value":"0010-4620","type":"print"},{"value":"1460-2067","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,7]]},"published":{"date-parts":[[2019,8,15]]}}}