{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T04:57:18Z","timestamp":1758776238579,"version":"3.44.0"},"reference-count":48,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T00:00:00Z","timestamp":1758758400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Robot. AI"],"abstract":"<jats:p>This study investigates a novel nonlinear update rule for value and policy functions based on temporal difference (TD) errors in reinforcement learning (RL). The update rule in standard RL states that the TD error is linearly proportional to the degree of updates, treating all rewards equally without any bias. On the other hand, recent biological studies have revealed that there are nonlinearities in the TD error and the degree of updates, biasing policies towards being either optimistic or pessimistic. Such biases in learning due to nonlinearities are expected to be useful and intentionally leftover features in biological learning. Therefore, this research explores a theoretical framework that can leverage the nonlinearity between the degree of the update and TD errors. To this end, we focus on a <jats:italic>control as inference<\/jats:italic> framework utilized in the previous work, in which the uncomputable nonlinear term needed to be approximately excluded from the derivation of the standard RL. By analyzing it, the Weber\u2013Fechner law (WFL) is found, in which perception (i.e., the degree of updates) in response to a change in stimulus (i.e., TD error) is attenuated as the stimulus intensity (i.e., the value function) increases. To numerically demonstrate the utilities of WFL on RL, we propose a practical implementation using a reward\u2013punishment framework and modify the definition of optimality. Further analysis of this implementation reveals that two utilities can be expected: i) to accelerate escaping from the situations with small rewards and ii) to pursue the minimum punishment as much as possible. We finally investigate and discuss the expected utilities through simulations and robot experiments. As a result, the proposed RL algorithm with WFL shows the expected utilities that accelerate the reward-maximizing startup and continue to suppress punishments during learning.<\/jats:p>","DOI":"10.3389\/frobt.2025.1649154","type":"journal-article","created":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T04:17:50Z","timestamp":1758773870000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Weber\u2013Fechner law in temporal difference learning derived from control as inference"],"prefix":"10.3389","volume":"12","author":[{"given":"Keiichiro","family":"Takahashi","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Taisuke","family":"Kobayashi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tomoya","family":"Yamanokuchi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Takamitsu","family":"Matsubara","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1965","published-online":{"date-parts":[[2025,9,25]]},"reference":[{"key":"B1","first-page":"1300","article-title":"Robel: robotics benchmarks for learning with low-cost robots","volume-title":"Conference on robot learning","author":"Ahn","year":"2020"},{"key":"B2","doi-asserted-by":"publisher","first-page":"251","DOI":"10.1162\/089976698300017746","article-title":"Natural gradient works efficiently in learning","volume":"10","author":"Amari","year":"1998","journal-title":"Neural Comput."},{"key":"B3","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1177\/0278364919887447","article-title":"Learning dexterous in-hand manipulation","volume":"39","author":"Andrychowicz","year":"2020","journal-title":"Int. J. Robotics Res."},{"key":"B4","doi-asserted-by":"publisher","first-page":"1036","DOI":"10.1007\/s11055-023-01497-3","article-title":"Magnetic navigation in animals, visual contrast sensitivity and the weber\u2013fechner law","volume":"53","author":"Binhi","year":"2023","journal-title":"Neurosci. Behav. Physiology"},{"key":"B5","article-title":"Provably robust temporal difference learning for heavy-tailed rewards","volume":"36","author":"Cayci","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"B6","doi-asserted-by":"publisher","first-page":"671","DOI":"10.1038\/s41586-019-1924-6","article-title":"A distributional code for value in dopamine-based reinforcement learning","volume":"577","author":"Dabney","year":"2020","journal-title":"Nature"},{"key":"B7","doi-asserted-by":"publisher","first-page":"285","DOI":"10.1016\/s0896-6273(02)00963-7","article-title":"Reward, motivation, and reinforcement learning","volume":"36","author":"Dayan","year":"2002","journal-title":"Neuron"},{"key":"B8","doi-asserted-by":"publisher","first-page":"160","DOI":"10.1016\/j.cobeha.2021.07.003","article-title":"Canonical cortical circuits and the duality of bayesian inference and optimal control","volume":"41","author":"Doya","year":"2021","journal-title":"Curr. Opin. Behav. Sci."},{"key":"B9","article-title":"Sharpness-aware minimization for efficiently improving generalization","volume-title":"International conference on learning Representations","author":"Foret","year":"2021"},{"key":"B10","first-page":"1861","article-title":"Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor","volume-title":"International conference on machine learning","author":"Haarnoja","year":"2018"},{"key":"B11","doi-asserted-by":"publisher","first-page":"e2422144122","DOI":"10.1073\/pnas.2422144122","article-title":"Evolving choice hysteresis in reinforcement learning: comparing the adaptive value of positivity bias and gradual perseveration","volume":"122","author":"Hoxha","year":"2025","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"B12","first-page":"12585","article-title":"Cleanrl: high-quality single-file implementations of deep reinforcement learning algorithms","volume":"23","author":"Huang","year":"2022","journal-title":"J. Mach. Learn. Res."},{"key":"B13","doi-asserted-by":"publisher","first-page":"126692","DOI":"10.1016\/j.neucom.2023.126692","article-title":"Adaterm: adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization","volume":"557","author":"Ilboudo","year":"2023","journal-title":"Neurocomputing"},{"key":"B14","doi-asserted-by":"publisher","first-page":"982","DOI":"10.1038\/s41586-023-06419-4","article-title":"Champion-level drone racing using deep reinforcement learning","volume":"620","author":"Kaufmann","year":"2023","journal-title":"Nature"},{"key":"B15","first-page":"4032","article-title":"L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning","volume-title":"IEEE\/RSJ international Conference on intelligent Robots and systems","author":"Kobayashi","year":""},{"key":"B16","doi-asserted-by":"publisher","first-page":"169","DOI":"10.1016\/j.neunet.2022.04.021","article-title":"Optimistic reinforcement learning by forward kullback\u2013leibler divergence optimization","volume":"152","author":"Kobayashi","year":"","journal-title":"Neural Netw."},{"key":"B17","doi-asserted-by":"publisher","first-page":"100192","DOI":"10.1016\/j.rico.2022.100192","article-title":"Proximal policy optimization with adaptive threshold for symmetric relative density ratio","volume":"10","author":"Kobayashi","year":"2023","journal-title":"Results Control Optim."},{"key":"B18","first-page":"1","article-title":"Consolidated adaptive t-soft update for deep reinforcement learning","volume-title":"International joint Conference on neural networks","author":"Kobayashi","year":""},{"key":"B19","article-title":"Drop: distributional and regular optimism and pessimism for reinforcement learning","author":"Kobayashi","year":""},{"key":"B20","first-page":"37","article-title":"Reward-punishment actor-critic algorithm applying to robotic non-grasping manipulation","volume-title":"Joint IEEE international Conference on Development and Learning and Epigenetic robotics","author":"Kobayashi","year":"2019"},{"key":"B21","doi-asserted-by":"publisher","first-page":"425","DOI":"10.1016\/s0079-6123(07)64023-0","article-title":"Emergence and development of embodied cognition: a constructivist approach using robots","volume":"164","author":"Kuniyoshi","year":"2007","journal-title":"Prog. brain Res."},{"key":"B22","article-title":"Mixing adam and sgd: a combined optimization method","author":"Landro","year":"2020"},{"key":"B23","article-title":"Reinforcement learning and control as probabilistic inference: Tutorial and review","author":"Levine","year":"2018"},{"key":"B24","article-title":"Eureka: human-level reward design via coding large language models","volume-title":"International conference on learning Representations","author":"Ma","year":"2024"},{"key":"B25","article-title":"Reinforcement learning with adaptive temporal discounting","author":"Maini","year":"2025","journal-title":"Reinf. Learn. J"},{"key":"B26","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1038\/s41593-023-01535-w","article-title":"Distributional reinforcement learning in prefrontal cortex","volume":"27","author":"Muller","year":"2024","journal-title":"Nat. Neurosci."},{"key":"B27","first-page":"8651","article-title":"Robot skill adaptation via soft actor-critic Gaussian mixture models","volume-title":"International Conference on Robotics and automation","author":"Nematollahi","year":"2022"},{"key":"B28","first-page":"4104","article-title":"Exponential td learning: a risk-sensitive actor-critic reinforcement learning algorithm","volume-title":"American control conference","author":"Noorani","year":"2023"},{"key":"B29","doi-asserted-by":"publisher","first-page":"199","DOI":"10.1007\/s10658-005-4732-9","article-title":"The role of psychophysics in phytopathology: the weber\u2013fechner law revisited","volume":"114","author":"Nutter","year":"2006","journal-title":"Eur. J. Plant Pathology"},{"key":"B30","doi-asserted-by":"publisher","first-page":"329","DOI":"10.1016\/s0896-6273(03)00169-7","article-title":"Temporal difference models and reward-related learning in the human brain","volume":"38","author":"O\u2019Doherty","year":"2003","journal-title":"Neuron"},{"key":"B31","doi-asserted-by":"crossref","DOI":"10.7551\/mitpress\/12441.001.0001","volume-title":"Active inference: the free energy principle in mind, brain, and behavior","author":"Parr","year":"2022"},{"key":"B32","doi-asserted-by":"publisher","first-page":"73","DOI":"10.1007\/s11023-010-9221-z","article-title":"Weber-fechner law and the optimality of the logarithmic scale","volume":"21","author":"Portugal","year":"2011","journal-title":"Minds Mach."},{"key":"B33","doi-asserted-by":"publisher","first-page":"eadi9579","DOI":"10.1126\/scirobotics.adi9579","article-title":"Real-world humanoid locomotion with reinforcement learning","volume":"9","author":"Radosavovic","year":"2024","journal-title":"Sci. Robotics"},{"key":"B34","first-page":"1","article-title":"Stable-baselines3: reliable reinforcement learning implementations","volume":"22","author":"Raffin","year":"2021","journal-title":"J. Mach. Learn. Res."},{"key":"B35","doi-asserted-by":"publisher","first-page":"1222","DOI":"10.12688\/f1000research.12130.1","article-title":"Logarithmic distributions prove that intrinsic learning is Hebbian","volume":"6","author":"Scheler","year":"2017","journal-title":"F1000Research"},{"key":"B36","article-title":"Proximal policy optimization algorithms","author":"Schulman","year":"2017"},{"key":"B37","doi-asserted-by":"publisher","first-page":"900","DOI":"10.1523\/jneurosci.13-03-00900.1993","article-title":"Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task","volume":"13","author":"Schultz","year":"1993","journal-title":"J. Neurosci."},{"key":"B38","doi-asserted-by":"publisher","first-page":"1298","DOI":"10.1162\/neco_a_00600","article-title":"Risk-sensitive reinforcement learning","volume":"26","author":"Shen","year":"2014","journal-title":"Neural Comput."},{"key":"B39","article-title":"Don\u2019t decay the learning rate, increase the batch size","volume-title":"International conference on learning Representations","author":"Smith","year":"2018"},{"key":"B40","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1016\/j.conb.2020.08.014","article-title":"Dopamine signals as temporal difference errors: recent advances","volume":"67","author":"Starkweather","year":"2021","journal-title":"Curr. Opin. Neurobiol."},{"key":"B41","first-page":"2904","article-title":"Least absolute policy iteration for robust value function approximation","volume-title":"IEEE international conference on robotics and automation","author":"Sugiyama","year":"2009"},{"key":"B42","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1007\/bf00115009","article-title":"Learning to predict by the methods of temporal differences","volume":"3","author":"Sutton","year":"1988","journal-title":"Mach. Learn."},{"key":"B43","volume-title":"Reinforcement learning: an introduction","author":"Sutton","year":"2018"},{"key":"B44","article-title":"Policy gradient methods for reinforcement learning with function approximation","volume":"12","author":"Sutton","year":"1999","journal-title":"Adv. neural Inf. Process. Syst."},{"key":"B45","first-page":"5026","article-title":"Mujoco: a physics engine for model-based control","volume-title":"IEEE\/RSJ international conference on intelligent robots and systems","author":"Todorov","year":"2012"},{"key":"B46","first-page":"1684","article-title":"Learning object-conditioned exploration using distributed soft actor critic","volume-title":"Conference on robot learning","author":"Wahid","year":"2021"},{"key":"B47","doi-asserted-by":"publisher","first-page":"115","DOI":"10.1016\/j.neunet.2020.12.001","article-title":"Modular deep reinforcement learning from reward and punishment for robot navigation","volume":"135","author":"Wang","year":"2021","journal-title":"Neural Netw."},{"key":"B48","doi-asserted-by":"publisher","first-page":"8964","DOI":"10.1109\/lra.2022.3189156","article-title":"Randomized-to-canonical model predictive control for real-world visual robotic manipulation","volume":"7","author":"Yamanokuchi","year":"2022","journal-title":"IEEE Robotics Automation Lett."}],"container-title":["Frontiers in Robotics and AI"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frobt.2025.1649154\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,25]],"date-time":"2025-09-25T04:17:54Z","timestamp":1758773874000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frobt.2025.1649154\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,25]]},"references-count":48,"alternative-id":["10.3389\/frobt.2025.1649154"],"URL":"https:\/\/doi.org\/10.3389\/frobt.2025.1649154","relation":{},"ISSN":["2296-9144"],"issn-type":[{"value":"2296-9144","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,25]]},"article-number":"1649154"}}