{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,25]],"date-time":"2025-06-25T04:13:08Z","timestamp":1750824788644,"version":"3.41.0"},"reference-count":34,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2025,1,5]],"date-time":"2025-01-05T00:00:00Z","timestamp":1736035200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/pages\/standard-publication-reuse-rights"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,6,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>In deep reinforcement learning (DRL), bias is systematic in asynchronous training due to different state distributions, different policies and lacking knowledge of transition probability in model-free learning. Therefore, we bring the notions of parallel executors, shared actor and central critic into DRL, and propose a general framework that enables parallel collecting, unbiased data processing and centralized training. Specifically, we employ parallel executors to obtain observations, and follow a shared policy from central thread to pass a batch of four-tuple transition slots to the critic. Simultaneously, the next state in the transition slots are fed back to executors. Then, the network parameters are updated by a central learner. A backup storage can be adopted to make the executors, actor and critic work concurrently. There exists two working modes for our framework, and several variants can be achieved to suit different environments by tuning some hyperparameters. One special case of variants is the existing DRL. Another extreme case can produce unbiased estimation of loss function whose estimation exactly matches the joint probability distribution of observations and the policy, thus avoiding the instability of importance sampling. 
We propose several efficient algorithms under our new framework to deal with typical discrete and continuous scenarios.<\/jats:p>","DOI":"10.1093\/comjnl\/bxae138","type":"journal-article","created":{"date-parts":[[2025,1,6]],"date-time":"2025-01-06T07:10:48Z","timestamp":1736147448000},"page":"649-662","source":"Crossref","is-referenced-by-count":0,"title":["Unbiased training framework on deep reinforcement learning"],"prefix":"10.1093","volume":"68","author":[{"given":"Huihui","family":"Zhang","sequence":"first","affiliation":[{"name":"Department of Electrical and Electronic Engineering, Tsinghua University , Beijing 100084 ,","place":["China"]}]}],"member":"286","published-online":{"date-parts":[[2025,1,5]]},"reference":[{"key":"2025062421180812100_ref1","doi-asserted-by":"publisher","first-page":"58","DOI":"10.1145\/203330.203343","article-title":"Temporal difference learning and TD-Gammon","volume":"38","author":"Tesauro","year":"1995","journal-title":"Commun ACM"},{"key":"2025062421180812100_ref2","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1007\/s10514-009-9120-4","article-title":"Reinforcement learning for robot soccer","volume":"27","author":"Riedmiller","year":"2009","journal-title":"Auton Robots"},{"key":"2025062421180812100_ref3","doi-asserted-by":"crossref","first-page":"240","DOI":"10.1145\/1390156.1390187","article-title":"An object-oriented representation for efficient reinforcement learning","volume-title":"Proceedings of the 25th international conference on Machine learning","author":"Diuk","year":"2008"},{"key":"2025062421180812100_ref4","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1007\/978-3-642-27645-3","article-title":"Reinforcement learning","volume":"12","author":"Wiering","year":"2012","journal-title":"Adapt Learn Optim"},{"key":"2025062421180812100_ref5","doi-asserted-by":"publisher","first-page":"354","DOI":"10.1038\/nature24270","article-title":"Mastering the game of go without human knowledge","volume":"550","author":"Silver","year":"2017","journal-title":"Nature"},{"key":"2025062421180812100_ref6","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1561\/2200000006","article-title":"Learning deep architectures for ai","volume":"2","author":"Bengio","year":"2009","journal-title":"Found Trends Mach Learn"},{"key":"2025062421180812100_ref7","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume-title":"Adv Neural Inf Process Syst","author":"Krizhevsky","year":"2012"},{"key":"2025062421180812100_ref8","doi-asserted-by":"publisher","first-page":"504","DOI":"10.1126\/science.1127647","article-title":"Reducing the dimensionality of data with neural networks","volume":"313","author":"Hinton","year":"2006","journal-title":"Science"},{"key":"2025062421180812100_ref9","first-page":"2139","article-title":"Data-efficient off-policy policy evaluation for reinforcement learning","author":"Thomas","year":"2016"},{"key":"2025062421180812100_ref10","first-page":"652","article-title":"Doubly robust off-policy value evaluation for reinforcement learning","author":"Jiang","year":"2016"},{"year":"2016","author":"Wang","article-title":"Sample efficient actor-critic with experience replay","key":"2025062421180812100_ref11"},{"key":"2025062421180812100_ref12","first-page":"1146","article-title":"Stabilising experience replay for deep multi-agent reinforcement learning","author":"Foerster","year":"2017"},{"key":"2025062421180812100_ref13","first-page":"5442","article-title":"Policy optimization via 
importance sampling","author":"Metelli","year":"2018"},{"key":"2025062421180812100_ref14","first-page":"1889","article-title":"Trust region policy optimization","volume-title":"Icml","author":"Schulman","year":"2015"},{"year":"2017","author":"Schulman","article-title":"Proximal policy optimization algorithms","key":"2025062421180812100_ref15"},{"key":"2025062421180812100_ref16","article-title":"Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation","volume-title":"Advances in Neural Information Processing Systems","author":"Yuhuai","year":"2017"},{"key":"2025062421180812100_ref17","article-title":"Reinforcement learning through asynchronous advantage actor-critic on a GPU","volume-title":"Proc. ICLR","author":"Babaeizadeh","year":"2016"},{"key":"2025062421180812100_ref18","first-page":"1928","article-title":"Asynchronous methods for deep reinforcement learning","volume-title":"International conference on machine learning","author":"Mnih","year":"2016"},{"key":"2025062421180812100_ref19","article-title":"IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures","volume-title":"Proc. ICML","author":"Espeholt","year":"2018"},{"key":"2025062421180812100_ref20","article-title":"DD-PPO: learning near-perfect pointgoal navigators from 2.5 billion frames","volume-title":"Proc. ICLR","author":"Wijmans","year":"2020"},{"key":"2025062421180812100_ref21","doi-asserted-by":"publisher","DOI":"10.34133\/2020\/1375957","article-title":"High-throughput synchronous deep RL","volume-title":"Proc. NeurIPS","author":"Liu","year":"2020"},{"key":"2025062421180812100_ref22","doi-asserted-by":"publisher","first-page":"309","DOI":"10.1007\/978-3-642-29946-9_30","article-title":"Mapreduce for parallel reinforcement learning","volume-title":"European Workshop on Reinforcement Learning","author":"Li","year":"2011"},{"key":"2025062421180812100_ref23","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1007\/978-3-540-77949-0_5","article-title":"Parallel reinforcement learning with linear function approximation","volume-title":"Adaptive Agents and Multi-Agent Systems III. Adaptation and Multi-Agent Learning","author":"Grounds","year":"2005"},{"key":"2025062421180812100_ref24","doi-asserted-by":"publisher","first-page":"610","DOI":"10.1109\/TAC.1982.1102980","article-title":"Distributed dynamic programming","volume":"27","author":"Bertsekas","year":"1982","journal-title":"IEEE Trans Automat Contr"},{"year":"2015","author":"Nair","article-title":"Et al","key":"2025062421180812100_ref25"},{"key":"2025062421180812100_ref26","doi-asserted-by":"publisher","first-page":"827","DOI":"10.1109\/JAS.2018.7511144","article-title":"Parallel reinforcement learning: a framework and case study","volume":"5","author":"Liu","year":"2018","journal-title":"IEEE\/CAA J Autom Sin"},{"key":"2025062421180812100_ref27","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","article-title":"Human-level control through deep reinforcement learning","volume":"518","author":"Mnih","year":"2015","journal-title":"Nature"},{"year":"2015","author":"Lillicrap","article-title":"Continuous control with deep reinforcement learning","key":"2025062421180812100_ref28"},{"key":"2025062421180812100_ref29","article-title":"Policy gradient methods for reinforcement learning with function approximation","volume-title":"Proc. 
NeurIPS","author":"Sutton","year":"2000"},{"key":"2025062421180812100_ref30","first-page":"1587","article-title":"Addressing function approximation error in actor-critic methods","author":"Fujimoto","year":"2018"},{"volume-title":"International Conference on Machine Learning","author":"Haarnoja","article-title":"Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor","key":"2025062421180812100_ref31"},{"year":"2019","author":"Gattami","article-title":"Reinforcement learning for multi-objective and constrained Markov decision processes","key":"2025062421180812100_ref32"},{"key":"2025062421180812100_ref33","doi-asserted-by":"publisher","first-page":"287","DOI":"10.1023\/A:1007678930559","article-title":"Convergence results for single-step on-policy reinforcement-learning algorithms","volume":"38","author":"Singh","year":"2000","journal-title":"Mach Learn"},{"key":"2025062421180812100_ref34","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/978-3-031-01551-9","article-title":"Algorithms for reinforcement learning","volume":"4","author":"Szepesv\u00e1ri","year":"2010","journal-title":"Synth Lect Artif Intell Mach Learn"}],"container-title":["The Computer Journal"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/comjnl\/article-pdf\/68\/6\/649\/61330931\/bxae138.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/comjnl\/article-pdf\/68\/6\/649\/61330931\/bxae138.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,25]],"date-time":"2025-06-25T01:18:17Z","timestamp":1750814297000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/comjnl\/article\/68\/6\/649\/7942795"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,5]]},"references-count":34,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2025,1,5]]},"published-print":{"date-parts":[[2025,6,12]]}},"URL":"https:\/\/doi.org\/10.1093\/comjnl\/bxae138","relation":{},"ISSN":["0010-4620","1460-2067"],"issn-type":[{"type":"print","value":"0010-4620"},{"type":"electronic","value":"1460-2067"}],"subject":[],"published-other":{"date-parts":[[2025,6]]},"published":{"date-parts":[[2025,1,5]]}}}