{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,25]],"date-time":"2025-11-25T05:53:39Z","timestamp":1764050019950,"version":"3.45.0"},"reference-count":45,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2025,11,20]],"date-time":"2025-11-20T00:00:00Z","timestamp":1763596800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Innovation Teams of Ordinary Universities in Guangdong Province","award":["2021KCXTD038","2023KCXTD022"],"award-info":[{"award-number":["2021KCXTD038","2023KCXTD022"]}]},{"name":"Key Laboratory of Ordinary Universities in Guangdong Province","award":["2022KSYS003"],"award-info":[{"award-number":["2022KSYS003"]}]},{"name":"Key Discipline Research Ability Improvement Project of Guangdong Province","award":["2021ZDJS043","2022ZDJS068"],"award-info":[{"award-number":["2021ZDJS043","2022ZDJS068"]}]},{"DOI":"10.13039\/501100003453","name":"Natural Science Foundation of Guangdong Province","doi-asserted-by":"crossref","award":["2022A1515010990"],"award-info":[{"award-number":["2022A1515010990"]}],"id":[{"id":"10.13039\/501100003453","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Research Fund of the Department of Education of Guangdong Province","award":["2022KTSCX079","2023ZDZX1013","2022ZDZX3012","2022ZDZX3011","2023ZDZX2038"],"award-info":[{"award-number":["2022KTSCX079","2023ZDZX1013","2022ZDZX3012","2022ZDZX3011","2023ZDZX2038"]}]},{"name":"Chaozhou Engineering Technology Research Center","award":["z25025"],"award-info":[{"award-number":["z25025"]}]},{"DOI":"10.13039\/501100010844","name":"Hanshan Normal 
University","doi-asserted-by":"crossref","award":["XY202105"],"award-info":[{"award-number":["XY202105"]}],"id":[{"id":"10.13039\/501100010844","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>In numerous episodic reinforcement learning (RL) environments, SARSA-based methodologies are employed to enhance policies aimed at maximizing returns over long horizons. Traditional SARSA algorithms face challenges in achieving an optimal balance between bias and variation, primarily due to their dependence on a single, constant discount factor (\u03b7). This study enhances the temporal difference decomposition method, TD(\u0394), by applying it to the SARSA algorithm, wherein the action-value function is segmented into several components based on the differences between action-value functions linked to specific discount factors. Each component, referred to as a delta estimator (D), is linked to a specific discount factor and learned independently. This modified technique is referred to as SARSA(\u0394). SARSA is a widely used on-policy RL method that enhances action-value functions via temporal difference updates. This decomposition, namely SARSA(\u0394), facilitates learning across a range of time scales. This analysis makes learning more effective and guarantees consistency, especially in situations where long-horizon improvement is needed. The results of this research show that the proposed technique works to lower bias in SARSA\u2019s updates and speed up convergence in both deterministic and stochastic settings, even in dense-reward Atari environments. 
Experimental results from a variety of benchmark settings show that the proposed SARSA(\u0394) outperforms existing TD learning techniques in both tabular and deep RL environments.<\/jats:p>","DOI":"10.3390\/a18110729","type":"journal-article","created":{"date-parts":[[2025,11,20]],"date-time":"2025-11-20T15:00:17Z","timestamp":1763650817000},"page":"729","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Segmenting Action-Value Functions over Time Scales in SARSA via TD(\u0394)"],"prefix":"10.3390","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6629-8186","authenticated-orcid":false,"given":"Mahammad","family":"Humayoo","sequence":"first","affiliation":[{"name":"Hanshan Normal University, Chaozhou 521041, China"},{"name":"CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"University of Chinese Academy of Sciences, Beijing 101408, China"},{"name":"School of Computer Science, Beijing Institute of Technology, Beijing 100081, China"}]},{"given":"Gengzhong","family":"Zheng","sequence":"additional","affiliation":[{"name":"Hanshan Normal University, Chaozhou 521041, China"}]},{"given":"Xiaoqing","family":"Dong","sequence":"additional","affiliation":[{"name":"Hanshan Normal University, Chaozhou 521041, China"}]},{"given":"Wei","family":"Huang","sequence":"additional","affiliation":[{"name":"Hanshan Normal University, Chaozhou 521041, China"}]},{"given":"Liming","family":"Miao","sequence":"additional","affiliation":[{"name":"Hanshan Normal University, Chaozhou 521041, China"}]},{"given":"Shuwei","family":"Qiu","sequence":"additional","affiliation":[{"name":"Hanshan Normal University, Chaozhou 521041, China"}]},{"given":"Zexun","family":"Zhou","sequence":"additional","affiliation":[{"name":"Hanshan Normal University, Chaozhou 521041, 
China"}]},{"given":"Peitao","family":"Wang","sequence":"additional","affiliation":[{"name":"Hanshan Normal University, Chaozhou 521041, China"}]},{"given":"Zakir","family":"Ullah","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Sciences, Beijing 101408, China"},{"name":"School of Data Science, Capital University of Economics and Business, Beijing 100070, China"}]},{"given":"Naveed Ur Rehman","family":"Junejo","sequence":"additional","affiliation":[{"name":"Hanshan Normal University, Chaozhou 521041, China"}]},{"given":"Xueqi","family":"Cheng","sequence":"additional","affiliation":[{"name":"CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"University of Chinese Academy of Sciences, Beijing 101408, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,20]]},"reference":[{"key":"ref_1","unstructured":"Romoff, J., Henderson, P., Touati, A., Brunskill, E., Pineau, J., and Ollivier, Y. (2019, January 9\u201315). Separating value functions across time-scales. Proceedings of the International Conference on Machine Learning. PMLR, Long Beach, CA, USA."},{"key":"ref_2","unstructured":"Sutton, R.S., and Barto, A.G. (2018). Reinforcement Learning: An Introduction, MIT Press."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016, January 5\u201310). Unifying count-based exploration and intrinsic motivation. Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.","DOI":"10.1609\/aaai.v30i1.10303"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"997","DOI":"10.1109\/72.623201","article-title":"Adaptive critic designs","volume":"8","author":"Prokhorov","year":"1997","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_5","unstructured":"Mnih, V. (2013). 
Playing atari with deep reinforcement learning. arXiv."},{"key":"ref_6","unstructured":"Berner, C., Brockman, G., Chan, B., Cheung, V., D\u0119biak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., and Hesse, C. (2019). Dota 2 with large scale deep reinforcement learning. arXiv."},{"key":"ref_7","unstructured":"Fran\u00e7ois-Lavet, V., Fonteneau, R., and Ernst, D. (2015). How to discount deep reinforcement learning: Towards new dynamic strategies. arXiv."},{"key":"ref_8","unstructured":"Xu, Z., van Hasselt, H.P., and Silver, D. (2018, January 3\u20138). Meta-gradient reinforcement learning. Proceedings of the NIPS\u201918: Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Van Seijen, H., Van Hasselt, H., Whiteson, S., and Wiering, M. (April, January 30). A theoretical and empirical analysis of expected sarsa. Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Nashville, TN, USA.","DOI":"10.1109\/ADPRL.2009.4927542"},{"key":"ref_10","unstructured":"Fedus, W., Gelada, C., Bengio, Y., Bellemare, M.G., and Larochelle, H. (2019). Hyperbolic discounting and learning over multiple horizons. arXiv."},{"key":"ref_11","unstructured":"Kearns, M.J., and Singh, S. (July, January 28). Bias-Variance Error Bounds for Temporal Difference Updates. Proceedings of the COLT, the Thirteenth Annual Conference on Computational Learning Theory, Palo Alto, CA, USA."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Sutton, R.S., and Barto, A.G. (1998). Introduction to Reinforcement Learning, MIT Press.","DOI":"10.1109\/TNN.1998.712192"},{"key":"ref_13","first-page":"3","article-title":"Generalizing value estimation over timescale","volume":"2","author":"Sherstan","year":"2018","journal-title":"Network"},{"key":"ref_14","unstructured":"Ali, R.F., Woods, J., Seraj, E., Duong, K., Behzadan, V., and Hsu, W. 
(2024, January 9). Hyperbolic Discounting in Multi-Agent Reinforcement Learning. Proceedings of the Finding the Frame: An RLC Workshop for Examining Conceptual Frameworks, Amherst, MA, USA."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Kim, M., Kim, J.S., Choi, M.S., and Park, J.H. (2022). Adaptive discount factor for deep reinforcement learning in continuing tasks with uncertainty. Sensors, 22.","DOI":"10.3390\/s22197266"},{"key":"ref_16","unstructured":"Amit, R., Meir, R., and Ciosek, K. (2020, January 13\u201318). Discount factor as a regularizer in reinforcement learning. Proceedings of the International Conference on Machine Learning, PMLR, Virtual."},{"key":"ref_17","unstructured":"Wang, J., Zhang, Q., Zhao, D., Zhao, M., and Hao, J. (2020). Dynamic horizon value estimation for model-based reinforcement learning. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Dietterich, T.G. (1999). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. arXiv.","DOI":"10.1007\/3-540-44914-0_2"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Henderson, P., Chang, W.D., Bacon, P.L., Meger, D., Pineau, J., and Precup, D. (2017). OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning. arXiv.","DOI":"10.1609\/aaai.v32i1.11775"},{"key":"ref_20","unstructured":"Hengst, B. (2002, January 8\u201312). Discovering Hierarchy in Reinforcement Learning with HEXQ. Proceedings of the International Conference on Machine Learning, San Francisco, CA, USA."},{"key":"ref_21","unstructured":"Reynolds, S.I. (1999, January 27\u201330). Decision Boundary Partitioning: Variable Resolution Model-Free Reinforcement Learning. Proceedings of the International Conference on Machine Learning, San Francisco, CA, USA."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Menache, I., Mannor, S., and Shimkin, N. (2002, January 19\u201323). 
Q-Cut-Dynamic Discovery of Sub-goals in Reinforcement Learning. Proceedings of the European Conference on Machine Learning, Helsinki, Finland.","DOI":"10.1007\/3-540-36755-1_25"},{"key":"ref_23","unstructured":"Russell, S.J., and Zimdars, A. (2003, January 5). Q-Decomposition for Reinforcement Learning Agents. Proceedings of the International Conference on Machine Learning, Xi\u2019an, China."},{"key":"ref_24","unstructured":"van Seijen, H., Fatemi, M., Laroche, R., Romoff, J., Barnes, T., and Tsang, J. (2017). Hybrid Reward Architecture for Reinforcement Learning. arXiv."},{"key":"ref_25","unstructured":"Ali, R.F., Nafi, N.M., Duong, K., and Hsu, W. (December, January 28). Efficient Multi-Horizon Learning for Off-Policy Reinforcement Learning. Proceedings of the Deep Reinforcement Learning Workshop NeurIPS, New Orleans, LA, USA."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Ali, R.F., Duong, K., Nafi, N.M., and Hsu, W. (2023, January 7\u201314). Multi-horizon learning in procedurally-generated environments for off-policy reinforcement learning (student abstract). Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA.","DOI":"10.1609\/aaai.v37i13.26935"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"7715","DOI":"10.1109\/LRA.2024.3426273","article-title":"Multi-Horizon Multi-Agent Planning Using Decentralised Monte Carlo Tree Search","volume":"9","author":"Seiler","year":"2024","journal-title":"IEEE Robot. Autom. Lett."},{"key":"ref_28","unstructured":"Benechehab, A., Paolo, G., Thomas, A., Filippone, M., and K\u00e9gl, B. (2023). Multi-timestep models for Model-based Reinforcement Learning. arXiv."},{"key":"ref_29","unstructured":"Bonnet, C., Caron, P., Barrett, T., Davies, I., and Laterre, A. (2021). One step at a time: Pros and cons of multi-step meta-gradient reinforcement learning. 
arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"104019","DOI":"10.1016\/j.robot.2021.104019","article-title":"Adaptive and multiple time-scale eligibility traces for online deep reinforcement learning","volume":"151","author":"Kobayashi","year":"2022","journal-title":"Robot. Auton. Syst."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1023\/A:1022633531479","article-title":"Learning to predict by the methods of temporal differences","volume":"3","author":"Sutton","year":"1988","journal-title":"Mach. Learn."},{"key":"ref_32","unstructured":"Tsitsiklis, J., and Van Roy, B. (1996, January 2). Analysis of temporal-difference learning with function approximation. Proceedings of the Advances in Neural Information Processing Systems 9 (NIPS 1996), Denver, CO, USA."},{"key":"ref_33","first-page":"679","article-title":"A Markovian decision process","volume":"6","author":"Bellman","year":"1957","journal-title":"J. Math. Mech."},{"key":"ref_34","unstructured":"Sutton, R.S. (1984). Temporal Credit Assignment in Reinforcement Learning. [Ph.D. Thesis, University of Massachusetts]."},{"key":"ref_35","unstructured":"Schulman, J., Moritz, P., Levine, S., Jordan, M.I., and Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv."},{"key":"ref_36","unstructured":"Sutton, R.S., McAllester, D.A., Singh, S., and Mansour, Y. (December, January 29). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Proceedings of the Neural Information Processing Systems, Denver, CO, USA."},{"key":"ref_37","unstructured":"Konda, V.R., and Tsitsiklis, J.N. (December, January 29). Actor-Critic Algorithms. Proceedings of the Neural Information Processing Systems, Denver, CO, USA."},{"key":"ref_38","unstructured":"Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D., and Kavukcuoglu, K. (2016, January 19\u201324). 
Asynchronous Methods for Deep Reinforcement Learning. Proceedings of the International Conference on Machine Learning, New York, NY, USA."},{"key":"ref_39","unstructured":"Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv."},{"key":"ref_40","unstructured":"Henderson, P., Romoff, J., and Pineau, J. (2018). Where did my optimum go?: An empirical analysis of gradient descent optimization in policy gradient methods. arXiv."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1023\/A:1017984413808","article-title":"Near-optimal reinforcement learning in polynomial time","volume":"49","author":"Kearns","year":"2002","journal-title":"Mach. Learn."},{"key":"ref_42","unstructured":"Brockman, G. (2016). OpenAI Gym. arXiv."},{"key":"ref_43","unstructured":"Peter Henderson, W.D.C. (2025, August 20). Value and Policy Iterations. Available online: https:\/\/github.com\/Breakend\/ValuePolicyIterationVariations."},{"key":"ref_44","unstructured":"Kostrikov, I. (2025, August 20). PyTorch Implementations of Reinforcement Learning Algorithms. Available online: https:\/\/github.com\/ikostrikov\/pytorch-a2c-ppo-acktr."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"583","DOI":"10.1080\/01621459.1952.10483441","article-title":"Use of ranks in one-criterion variance analysis","volume":"47","author":"Kruskal","year":"1952","journal-title":"J. Am. Stat. 
Assoc."}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/18\/11\/729\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,25]],"date-time":"2025-11-25T05:13:42Z","timestamp":1764047622000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/18\/11\/729"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,20]]},"references-count":45,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2025,11]]}},"alternative-id":["a18110729"],"URL":"https:\/\/doi.org\/10.3390\/a18110729","relation":{},"ISSN":["1999-4893"],"issn-type":[{"type":"electronic","value":"1999-4893"}],"subject":[],"published":{"date-parts":[[2025,11,20]]}}}