{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T16:04:19Z","timestamp":1753891459034,"version":"3.41.2"},"reference-count":38,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2022,12,13]],"date-time":"2022-12-13T00:00:00Z","timestamp":1670889600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["11832009"],"award-info":[{"award-number":["11832009"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Neurorobot."],"abstract":"<jats:p>The atypical Markov decision processes (MDPs) are decision-making for maximizing the immediate returns in only one state transition. Many complex dynamic problems can be regarded as the atypical MDPs, e.g., football trajectory control, approximations of the compound Poincar\u00e9 maps, and parameter identification. However, existing deep reinforcement learning (RL) algorithms are designed to maximize long-term returns, causing a waste of computing resources when applied in the atypical MDPs. These existing algorithms are also limited by the estimation error of the value function, leading to a poor policy. To solve such limitations, this paper proposes an immediate-return algorithm for the atypical MDPs with continuous action space by designing an unbiased and low variance target Q-value and a simplified network framework. Then, two examples of atypical MDPs considering the uncertainty are presented to illustrate the performance of the proposed algorithm, i.e., passing the football to a moving player and chipping the football over the human wall. Compared with the existing deep RL algorithms, such as deep deterministic policy gradient and proximal policy optimization, the proposed algorithm shows significant advantages in learning efficiency, the effective rate of control, and computing resource usage.<\/jats:p>","DOI":"10.3389\/fnbot.2022.1012427","type":"journal-article","created":{"date-parts":[[2022,12,13]],"date-time":"2022-12-13T05:09:47Z","timestamp":1670908187000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["An immediate-return reinforcement learning for the atypical Markov decision processes"],"prefix":"10.3389","volume":"16","author":[{"given":"Zebang","family":"Pan","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guilin","family":"Wen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhao","family":"Tan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shan","family":"Yin","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaoyan","family":"Hu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1965","published-online":{"date-parts":[[2022,12,13]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"679","DOI":"10.1512\/iumj.1957.6.56038","article-title":"A Markovian decision process","volume":"6","author":"Bellman","year":"1957","journal-title":"J. Mathem. Mech."},{"key":"B2","doi-asserted-by":"publisher","first-page":"48","DOI":"10.1016\/j.neucom.2017.02.096","article-title":"Multi-objectivization and ensembles of shapings in reinforcement learning","volume":"263","author":"Brys","year":"2017","journal-title":"Neurocomputing"},{"key":"B3","doi-asserted-by":"publisher","first-page":"883562","DOI":"10.3389\/fnbot.2022.883562","article-title":"Deep reinforcement learning based trajectory planning under uncertain constraints","volume":"16","author":"Chen","year":"2022","journal-title":"Front. Neurorob"},{"key":"B4","article-title":"Reinforcement learning and the reward engineering principle","author":"Dewey","year":"2014","journal-title":"2014 AAAI Spring Symposium Series"},{"key":"B5","doi-asserted-by":"publisher","first-page":"1509","DOI":"10.1519\/JSC.0000000000001642","article-title":"Maximal sprinting speed of elite soccer players during training and matches","volume":"31","author":"Djaoui","year":"2017","journal-title":"J. Strength Condit. Res"},{"key":"B6","article-title":"Addressing function approximation error in actor-critic methods","author":"Fujimoto","year":"2018","journal-title":"International Conference on Machine Learning"},{"key":"B7","article-title":"Learning both weights and connections for efficient neural network","author":"Han","year":"2015","journal-title":"Advances in Neural Information Processing Systems"},{"key":"B8","doi-asserted-by":"publisher","first-page":"5353","DOI":"10.1109\/CVPR.2015.7299173","article-title":"Convolutional neural networks at constrained time cost","author":"He","year":"2015","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B9","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11694","article-title":"Deep reinforcement learning that matters","author":"Henderson","year":"2018","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"B10","doi-asserted-by":"publisher","first-page":"251","DOI":"10.1017\/S0022112009993934","article-title":"The effect of Reynolds number on the dynamics and wakes of freely rising and falling spheres","volume":"651","author":"Horowitz","year":"2010","journal-title":"J. Fluid Mech."},{"key":"B11","doi-asserted-by":"publisher","first-page":"4076","DOI":"10.1109\/TIE.2016.2636126","article-title":"An overview of dynamic-linearization-based data-driven control and applications","volume":"64","author":"Hou","year":"2016","journal-title":"IEEE T Ind. Electron"},{"key":"B12","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1016\/j.ins.2012.07.014","article-title":"From model-based control to data-driven control: Survey, classification and perspective","volume":"235","author":"Hou","year":"2013","journal-title":"Inform Sci"},{"key":"B13","doi-asserted-by":"publisher","DOI":"10.1051\/matecconf\/201814501002","article-title":"Study of soccer ball flight trajectory","author":"Javorova","year":"2018","journal-title":"MATEC Web of Conferences"},{"key":"B14","doi-asserted-by":"publisher","first-page":"34001","DOI":"10.1088\/1361-6404\/aaa888","article-title":"An aerodynamic analysis of recent FIFA world cup balls","volume":"39","author":"Kiratidis","year":"2018","journal-title":"Eur. J. Phys"},{"key":"B15","doi-asserted-by":"publisher","first-page":"6202","DOI":"10.1007\/s10489-021-02218-4","article-title":"Learning to trade in financial time series using high-frequency through wavelet transformation and deep reinforcement learning","volume":"51","author":"Lee","year":"2021","journal-title":"Appl. Intell"},{"article-title":"Offline reinforcement learning: Tutorial, review, and perspectives on open problems","year":"2020","author":"Levine","key":"B16"},{"key":"B17","doi-asserted-by":"publisher","first-page":"1141","DOI":"10.1007\/s40435-020-00678-z","article-title":"Global dynamic analysis of the North Pacific Ocean by data-driven generalized cell mapping method","volume":"8","author":"Li","year":"2020","journal-title":"Int. J. Dynam. Control"},{"article-title":"Continuous control with deep reinforcement learning","year":"2015","author":"Lillicrap","key":"B18"},{"key":"B19","doi-asserted-by":"publisher","first-page":"864380","DOI":"10.3389\/fnbot.2022.864380","article-title":"Model-Based and Model-Free Replay Mechanisms for Reinforcement Learning in Neurorobotics","volume":"16","author":"Massi","year":"2022","journal-title":"Front. Neurorobot"},{"volume-title":"Theory of Neural-Analog Reinforcement Systems and its Application to the Brain-Model Problem","year":"1954","author":"Minsky","key":"B20"},{"key":"B21","article-title":"Asynchronous methods for deep reinforcement learning","author":"Mnih","year":"2016","journal-title":"International Conference on Machine Learning"},{"key":"B22","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","article-title":"Human-level control through deep reinforcement learning","volume":"518","author":"Mnih","year":"2015","journal-title":"Nature"},{"key":"B23","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1007\/s12283-012-0105-8","article-title":"A mathematical analysis of the motion of an in-flight soccer ball","volume":"16","author":"Myers","year":"2013","journal-title":"Sports Eng"},{"volume-title":"The Dynamic Testing of Soccer Balls.","year":"2003","author":"Neilson","key":"B24"},{"key":"B25","doi-asserted-by":"publisher","first-page":"1439","DOI":"10.1007\/s00348-011-1161-8","article-title":"Unsteady force measurements in sphere flow from subcritical to supercritical Reynolds numbers","volume":"51","author":"Norman","year":"2011","journal-title":"Exp. Fluids."},{"key":"B26","doi-asserted-by":"publisher","first-page":"522304","DOI":"10.1007\/s10409-022-22304-x","article-title":"Reinforcement learning control for a three-link biped robot with energy-efficient periodic gaits","volume":"39","author":"Pan","year":"2023","journal-title":"Acta Mechan. Sinica"},{"volume-title":"Optimizing expectations: From deep reinforcement learning to stochastic computation graphs","year":"2016","author":"Schulman","key":"B27"},{"article-title":"Proximal policy optimization algorithms","year":"2017","author":"Schulman","key":"B28"},{"key":"B29","article-title":"MRL extended team description 2011","author":"Sharbafi","year":"2011","journal-title":"Proceedings of the 15th international RoboCup symposium, Istanbul, Turkey"},{"key":"B30","doi-asserted-by":"publisher","first-page":"103535","DOI":"10.1016\/j.artint.2021.103535","article-title":"Reward is enough","volume":"299","author":"Silver","year":"2021","journal-title":"Artif. Intell"},{"volume-title":"Reinforcement Learning: An Introduction.","year":"2018","author":"Sutton","key":"B31"},{"key":"B32","article-title":"Policy gradient methods for reinforcement learning with function approximation","author":"Sutton","year":"1999","journal-title":"Advances in Neural Information Processing Systems"},{"key":"B33","doi-asserted-by":"publisher","first-page":"108","DOI":"10.1002\/oca.2156","article-title":"Chaotic dynamics and convergence analysis of temporal difference algorithms with bang-bang control","volume":"37","author":"Tutsoy","year":"2016","journal-title":"Optimal Control Applic. Methods."},{"key":"B34","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v30i1.10295","article-title":"Deep reinforcement learning with double q-learning","author":"Van Hasselt","year":"2016","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"B35","doi-asserted-by":"publisher","first-page":"73","DOI":"10.1115\/1.3424276","article-title":"Analysis of sheet metal stamping by a finite-element method","volume":"45","author":"Wang","year":"1978","journal-title":"J. Appl. Mech"},{"key":"B36","article-title":"Dueling network architectures for deep reinforcement learning","author":"Wang","year":"2016","journal-title":"International Conference on Machine Learning"},{"journal-title":"Learning from Delayed Rewards","year":"1989","author":"Watkins","key":"B37"},{"key":"B38","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1016\/j.ijrmms.2007.04.012","article-title":"Numerical investigation of blasting-induced damage in cylindrical rocks","volume":"45","author":"Zhu","year":"2008","journal-title":"Int. J. Rock. Mech. Min"}],"container-title":["Frontiers in Neurorobotics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2022.1012427\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,13]],"date-time":"2022-12-13T05:09:54Z","timestamp":1670908194000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2022.1012427\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,13]]},"references-count":38,"alternative-id":["10.3389\/fnbot.2022.1012427"],"URL":"https:\/\/doi.org\/10.3389\/fnbot.2022.1012427","relation":{},"ISSN":["1662-5218"],"issn-type":[{"type":"electronic","value":"1662-5218"}],"subject":[],"published":{"date-parts":[[2022,12,13]]},"article-number":"1012427"}}