{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,11]],"date-time":"2026-01-11T01:13:41Z","timestamp":1768094021849,"version":"3.49.0"},"reference-count":55,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2023,1,9]],"date-time":"2023-01-09T00:00:00Z","timestamp":1673222400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["51779059"],"award-info":[{"award-number":["51779059"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Neurorobot."],"abstract":"<jats:sec><jats:title>Introduction<\/jats:title><jats:p>The value approximation bias is known to lead to suboptimal policies or catastrophic overestimation bias accumulation that prevent the agent from making the right decisions between exploration and exploitation. Algorithms have been proposed to mitigate the above contradiction. However, we still lack an understanding of how the value bias impact performance and a method for efficient exploration while keeping stable updates. This study aims to clarify the effect of the value bias and improve the reinforcement learning algorithms to enhance sample efficiency.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods<\/jats:title><jats:p>This study designs a simple episodic tabular MDP to research value underestimation and overestimation in actor-critic methods. This study proposes a unified framework called Realistic Actor-Critic (RAC), which employs Universal Value Function Approximators (UVFA) to simultaneously learn policies with different value confidence-bound with the same neural network, each with a different under overestimation trade-off.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>This study highlights that agents could over-explore low-value states due to inflexible under-overestimation trade-off in the fixed hyperparameters setting, which is a particular form of the exploration-exploitation dilemma. And RAC performs directed exploration without over-exploration using the upper bounds while still avoiding overestimation using the lower bounds. Through carefully designed experiments, this study empirically verifies that RAC achieves 10x sample efficiency and 25% performance improvement compared to Soft Actor-Critic in the most challenging Humanoid environment. 
All the source codes are available at <jats:ext-link>https:\/\/github.com\/ihuhuhu\/RAC<\/jats:ext-link>.<\/jats:p><\/jats:sec><jats:sec><jats:title>Discussion<\/jats:title><jats:p>This research not only provides valuable insights for research on the exploration-exploitation trade-off by studying the frequency of policies access to low-value states under different value confidence-bounds guidance, but also proposes a new unified framework that can be combined with current actor-critic methods to improve sample efficiency in the continuous control domain.<\/jats:p><\/jats:sec>","DOI":"10.3389\/fnbot.2022.1081242","type":"journal-article","created":{"date-parts":[[2023,1,9]],"date-time":"2023-01-09T08:31:31Z","timestamp":1673253091000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Realistic Actor-Critic: A framework for balance between value overestimation and underestimation"],"prefix":"10.3389","volume":"16","author":[{"given":"Sicen","family":"Li","sequence":"first","affiliation":[]},{"given":"Qinyun","family":"Tang","sequence":"additional","affiliation":[]},{"given":"Yiming","family":"Pang","sequence":"additional","affiliation":[]},{"given":"Xinmeng","family":"Ma","sequence":"additional","affiliation":[]},{"given":"Gang","family":"Wang","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2023,1,9]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"243","DOI":"10.1016\/j.inffus.2021.05.008","article-title":"A review of uncertainty quantification in deep learning: techniques, applications and challenges","volume":"76","author":"Abdar","year":"2021","journal-title":"Inf. Fusion"},{"key":"B2","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1804.06318","article-title":"Learning awareness models","author":"Amos","year":"2018","journal-title":"arXiv preprint arXiv:1804.06318."},{"key":"B3","unstructured":"Averaged-DQN: variance reduction and stabilization for deep reinforcement learning176185\n            AnschelO.\n            BaramN.\n            ShimkinN.\n          International Conference on Machine Learning2017"},{"key":"B4","unstructured":"Agent57: outperforming the atari human benchmark507517\n            BadiaA. P.\n            PiotB.\n            KapturowskiS.\n            SprechmannP.\n            VitvitskyiA.\n            GuoZ. D.\n          International Conference on Machine Learning"},{"key":"B5","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2002.06038","article-title":"Never give up: learning directed exploration strategies","author":"Badia","year":"","journal-title":"arXiv preprint arXiv:2002.06038."},{"key":"B6","doi-asserted-by":"publisher","first-page":"213","DOI":"10.1162\/153244303765208377","article-title":"R-max-a general polynomial time algorithm for near-optimal reinforcement learning","volume":"3","author":"Brafman","year":"2002","journal-title":"J. Mach. Learn. 
Res"},{"key":"B7","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1606.01540","article-title":"Openai gym","author":"Brockman","year":"2016","journal-title":"arXiv preprint arXiv:1606.01540."},{"key":"B8","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1902.05551","article-title":"Off-policy actor-critic in an ensemble: achieving maximum general entropy and effective environment exploration in deep reinforcement learning","author":"Chen","year":"2019","journal-title":"arXiv preprint arXiv:1902.05551."},{"key":"B9","doi-asserted-by":"publisher","first-page":"883562","DOI":"10.3389\/fnbot.2022.883562","article-title":"Deep reinforcement learning based trajectory planning under uncertain constraints","volume":"16","author":"Chen","year":"2022","journal-title":"Front. Neurorobot"},{"key":"B10","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1706.01502","article-title":"Ucb exploration via q-ensembles","author":"Chen","year":"2017","journal-title":"arXiv preprint arXiv:1706.01502"},{"key":"B11","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2101.05982","article-title":"Randomized ensembled double q-learning: Learning fast without a model","author":"Chen","year":"2021","journal-title":"arXiv preprint arXiv:2101.05982."},{"key":"B12","unstructured":"Better exploration with optimistic actor critic\n            CiosekK.\n            VuongQ.\n            LoftinR.\n            HofmannK.\n          Advances in Neural Information Processing Systems 322019"},{"key":"B13","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2102.04881","article-title":"Measuring progress in deep reinforcement learning sample efficiency","author":"Dorner","year":"2021","journal-title":"arXiv preprint arXiv:2102.04881."},{"key":"B14","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2003.11881","article-title":"An empirical investigation of the challenges of real-world reinforcement learning","author":"Dulac-Arnold","year":"2020","journal-title":"arXiv preprint arXiv:2003.11881"},{"key":"B15","unstructured":"Efficient and scalable bayesian neural nets with rank-1 factors27822792\n            DusenberryM.\n            JerfelG.\n            WenY.\n            MaY.\n            SnoekJ.\n            HellerK.\n          International Conference on Machine Learning2020"},{"key":"B16","unstructured":"Addressing function approximation error in actor-critic methods15871596\n            FujimotoS.\n            HoofH.\n            MegerD.\n          International Conference on Machine Learning2018"},{"key":"B17","doi-asserted-by":"publisher","first-page":"1310389","DOI":"10.34133\/2020\/1310389","article-title":"Cyborg and bionic systems: Signposting the future","volume":"2020","author":"Fukuda","year":"2020","journal-title":"Cyborg Bionic Syst"},{"key":"B18","unstructured":"Deep sparse rectifier neural networks315323\n            GlorotX.\n            BordesA.\n            BengioY.\n          Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics2011"},{"key":"B19","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1906.10667","article-title":"Reinforcement learning with competitive ensembles of information-constrained primitives","author":"Goyal","year":"2019","journal-title":"arXiv preprint arXiv:1906.10667"},{"key":"B20","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1812.05905","article-title":"Soft actor-critic algorithms and applications","author":"Haarnoja","year":"2018","journal-title":"arXiv preprint 
arXiv:1812.05905"},{"key":"B21","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2010.06610","article-title":"Training independent subnetworks for robust prediction","author":"Havasi","year":"2020","journal-title":"arXiv preprint arXiv:2010.06610"},{"key":"B22","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2109.10552","article-title":"MEPG: a minimalist ensemble policy gradient framework for deep reinforcement learning","author":"He","year":"2021","journal-title":"arXiv preprint arXiv:2109.10552"},{"key":"B23","unstructured":"When to trust your model: Model-based policy optimization\n            JannerM.\n            FuJ.\n            ZhangM.\n            LevineS.\n          Advances in Neural Information Processing Systems 322019"},{"key":"B24","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2001.02907","article-title":"Population-guided parallel policy search for reinforcement learning","author":"Jung","year":"2020","journal-title":"arXiv preprint arXiv:2001.02907"},{"key":"B25","unstructured":"Uncertainty-driven imagination for continuous deep reinforcement learning195206\n            KalweitG.\n            BoedeckerJ.\n          Conference on Robot Learning2017"},{"key":"B26","doi-asserted-by":"publisher","first-page":"32","DOI":"10.3389\/fnbot.2018.00032","article-title":"Experience replay using transition sequences","volume":"12","author":"Karimpanal","year":"2018","journal-title":"Front. Neurorobot"},{"key":"B27","unstructured":"EMI: exploration with mutual information33603369\n            KimH.\n            KimJ.\n            JeongY.\n            LevineS.\n            SongH. O.\n          International Conference on Machine Learning2019"},{"key":"B28","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1412.6980","article-title":"Adam: a method for stochastic optimization","author":"Kingma","year":"2014","journal-title":"arXiv preprint arXiv:1412.6980"},{"key":"B29","doi-asserted-by":"publisher","first-page":"18560","DOI":"10.48550\/arXiv.2003.07305","article-title":"Discor: Corrective feedback in reinforcement learning via distribution correction","volume":"33","author":"Kumar","year":"2020","journal-title":"Adv. Neural Inf. Process. 
Syst"},{"key":"B30","unstructured":"Automating control of overestimation bias for continuous reinforcement learning\n            KuznetsovA.\n            GrishinA.\n            TsypinA.\n            AshukhaA.\n            VetrovD.\n          10.48550\/arXiv.2110.13523arXiv preprint arXiv:2110.135232021"},{"key":"B31","first-page":"5556","article-title":"Controlling overestimation bias with truncated mixture of continuous distributional quantile critics","author":"Kuznetsov","year":"2020","journal-title":"International Conference on Machine Learning"},{"key":"B32","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2002.06487","article-title":"Maxmin q-learning: controlling the estimation bias of q-learning","author":"Lan","year":"2020","journal-title":"arXiv preprint arXiv:2002.06487."},{"key":"B33","unstructured":"Sunrise: a simple unified framework for ensemble learning in deep reinforcement learning61316141\n            LeeK.\n            LaskinM.\n            SrinivasA.\n            AbbeelP.\n          International Conference on Machine Learning2021"},{"key":"B34","doi-asserted-by":"publisher","first-page":"421","DOI":"10.1177\/0278364917710318","article-title":"Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection","volume":"37","author":"Levine","year":"2018","journal-title":"Int. J. Rob. Res"},{"key":"B35","unstructured":"On the effect of auxiliary tasks on representation dynamics19\n            LyleC.\n            RowlandM.\n            OstrovskiG.\n            DabneyW.\n          International Conference on Artificial Intelligence and Statistics2021"},{"key":"B36","doi-asserted-by":"publisher","first-page":"9851834","DOI":"10.34133\/2021\/9851834","article-title":"Origami folding by multifingered hands with motion primitives","volume":"2021","author":"Namiki","year":"2021","journal-title":"Cyborg Bionic Syst."},{"key":"B37","unstructured":"Deep exploration via bootstrapped DQN\n            OsbandI.\n            BlundellC.\n            PritzelA.\n            Van RoyB.\n          Advances in Neural Information Processing Systems 292016"},{"key":"B38","doi-asserted-by":"publisher","first-page":"18050","DOI":"10.48550\/arXiv.2002.00632","article-title":"Effective diversity in population based reinforcement learning","volume":"33","author":"Parker-Holder","year":"2020","journal-title":"Adv. Neural Inf. Process. 
Syst"},{"key":"B39","unstructured":"Self-supervised exploration via disagreement50625071\n            PathakD.\n            GandhiD.\n            GuptaA.\n          International Conference on Machine Learning2019"},{"key":"B40","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2103.00445","article-title":"Ensemble bootstrapping for q-learning","author":"Peer","year":"2021","journal-title":"arXiv preprint arXiv:2103.00445."},{"key":"B41","author":"Pendrith","year":"1997","journal-title":"Estimator variance in reinforcement learning: Theoretical problems and practical solutions"},{"key":"B42","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2002.12174","article-title":"Optimistic exploration even with a pessimistic initialization","author":"Rashid","year":"2020","journal-title":"International Conference on Learning Representations (ICLR)"},{"key":"B43","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2001.05209","article-title":"SEERL: sample efficient ensemble reinforcement learning","author":"Saphal","year":"2020","journal-title":"arXiv preprint arXiv:2001.05209"},{"key":"B44","unstructured":"Universal value function approximators13121320\n            SchaulT.\n            HorganD.\n            GregorK.\n            SilverD.\n          International Conference on Machine Learning2015"},{"key":"B45","unstructured":"SuttonR. S.\n            BartoA. G.\n          MIT PressReinforcement Learning: An Introduction2018"},{"key":"B46","first-page":"255","article-title":"Issues in using function approximation for reinforcement learning","volume-title":"Proceedings of the Fourth Connectionist Models Summer School","author":"Thrun","year":"1993"},{"key":"B47","doi-asserted-by":"crossref","DOI":"10.1109\/IROS.2012.6386109","article-title":"MuJoCo: a physics engine for model-based control","volume-title":"2012 IEEE\/RSJ International Conference on Intelligent Robots and Systems","author":"Todorov","year":"2012"},{"key":"B48","unstructured":"Deep reinforcement learning with double q-learning\n            Van HasseltH.\n            GuezA.\n            SilverD.\n          Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 302016"},{"key":"B49","unstructured":"WarwickD. P.\n            LiningerC. A.\n          The Sample Survey: Theory and Practice. McGraw-Hill1975"},{"key":"B50","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2002.06715","article-title":"Batchensemble: an alternative approach to efficient ensemble and lifelong learning","author":"Wen","year":"2020","journal-title":"arXiv preprint arXiv:2002.06715"},{"key":"B51","doi-asserted-by":"publisher","first-page":"6514","DOI":"10.48550\/arXiv.2006.13570","article-title":"Hyperparameter ensembles for robustness and uncertainty quantification","volume":"33","author":"Wenzel","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B52","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2105.08140","article-title":"Uncertainty weighted actor-critic for offline reinforcement learning","author":"Wu","year":"2021","journal-title":"arXiv preprint arXiv:2105.08140"},{"key":"B53","unstructured":"Towards sample efficient reinforcement learning\n            YuY.\n          26903687IJCAI2018"},{"key":"B54","unstructured":"Self-adaptive double bootstrapped DDPG\n            ZhengZ.\n            YuanC.\n            LinZ.\n            ChengY.\n          International Joint Conference on Artificial Intelligence2018"},{"key":"B55","unstructured":"ZiebartB. 
{"key":"B55","unstructured":"Ziebart, B. D. (2010). Modeling Purposeful Adaptive Behavior With the Principle of Maximum Causal Entropy. Carnegie Mellon University."}],"container-title":["Frontiers in Neurorobotics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2022.1081242\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,9]],"date-time":"2023-01-09T08:32:13Z","timestamp":1673253133000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2022.1081242\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,9]]},"references-count":55,"alternative-id":["10.3389\/fnbot.2022.1081242"],"URL":"https:\/\/doi.org\/10.3389\/fnbot.2022.1081242","relation":{},"ISSN":["1662-5218"],"issn-type":[{"value":"1662-5218","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1,9]]},"article-number":"1081242"}}