{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T10:45:39Z","timestamp":1761129939078,"version":"build-2065373602"},"reference-count":41,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2021,7,2]],"date-time":"2021-07-02T00:00:00Z","timestamp":1625184000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Reinforcement Learning (RL) enables an agent to learn control policies for achieving its long-term goals. One key parameter of RL algorithms is a discount factor that scales down future cost in the state\u2019s current value estimate. This study introduces and analyses a transition-based discount factor in two model-free reinforcement learning algorithms: Q-learning and SARSA, and shows their convergence using the theory of stochastic approximation for finite state and action spaces. This causes an asymmetric discounting, favouring some transitions over others, which allows (1) faster convergence than constant discount factor variant of these algorithms, which is demonstrated by experiments on the Taxi domain and MountainCar environments; (2) provides better control over the RL agents to learn risk-averse or risk-taking policy, as demonstrated in a Cliff Walking experiment.<\/jats:p>","DOI":"10.3390\/sym13071197","type":"journal-article","created":{"date-parts":[[2021,7,2]],"date-time":"2021-07-02T10:06:34Z","timestamp":1625220394000},"page":"1197","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Transition Based Discount Factor for Model Free Algorithms in Reinforcement Learning"],"prefix":"10.3390","volume":"13","author":[{"given":"Abhinav","family":"Sharma","sequence":"first","affiliation":[{"name":"Department of Computer Science, PDPM Indian Institute of Information Technology Jabalpur, Madhya Pradesh 482005, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9970-3889","authenticated-orcid":false,"given":"Ruchir","family":"Gupta","sequence":"additional","affiliation":[{"name":"Department of Computer Science, JNU, Delhi 110001, India"}]},{"given":"K.","family":"Lakshmanan","sequence":"additional","affiliation":[{"name":"Department of Computer Science, IIT (BHU) Varanasi, Uttar Pradesh 221005, India"}]},{"given":"Atul","family":"Gupta","sequence":"additional","affiliation":[{"name":"Department of Computer Science, PDPM Indian Institute of Information Technology Jabalpur, Madhya Pradesh 482005, India"}]}],"member":"1968","published-online":{"date-parts":[[2021,7,2]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Sutton, R.S., and Barto, A.G. (1998). Reinforcement Learning: An Introduction, MIT Press.","DOI":"10.1109\/TNN.1998.712192"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"125915","DOI":"10.1016\/j.jclepro.2021.125915","article-title":"Deep reinforcement learning optimization framework for a power generation plant considering performance and environmental issues","volume":"291","author":"Adams","year":"2021","journal-title":"J. Clean. 
Prod."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"350","DOI":"10.1038\/s41586-019-1724-z","article-title":"Grandmaster level in StarCraft II using multi-agent reinforcement learning","volume":"575","author":"Vinyals","year":"2019","journal-title":"Nature"},{"key":"ref_4","unstructured":"Napolitano, N. (2020). Testing match-3 video games with Deep Reinforcement Learning. arXiv."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Ojea, J.A., Solowjow, E., and Levine, S. (2019, January 20\u201324). Residual reinforcement learning for robot control. Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada.","DOI":"10.1109\/ICRA.2019.8794127"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"103078","DOI":"10.1016\/j.autcon.2020.103078","article-title":"Complete coverage path planning using reinforcement learning for tetromino based cleaning and maintenance robot","volume":"112","author":"Lakshmanan","year":"2020","journal-title":"Autom. Constr."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"6255","DOI":"10.1109\/TWC.2020.3001736","article-title":"Power allocation in multi-user cellular networks: Deep reinforcement learning approaches","volume":"19","author":"Meng","year":"2020","journal-title":"IEEE Trans. Wirel. Commun."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"108759","DOI":"10.1016\/j.automatica.2019.108759","article-title":"Deep reinforcement learning for wireless sensor scheduling in cyber\u2013physical systems","volume":"113","author":"Leong","year":"2020","journal-title":"Automatica"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"297","DOI":"10.1049\/iet-its.2019.0317","article-title":"Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data","volume":"14","author":"Duan","year":"2020","journal-title":"IET Intell. Transp. Syst."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Kiran, B.R., Sobh, I., Talpaert, V., Mannion, P., Al Sallab, A.A., Yogamani, S., and P\u00e9rez, P. (2021). Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst.","DOI":"10.1109\/TITS.2021.3054625"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Hu, B., Li, J., Yang, J., Bai, H., Li, S., Sun, Y., and Yang, X. (2019). Reinforcement learning approach to design practical adaptive control for a small-scale intelligent vehicle. Symmetry, 11.","DOI":"10.3390\/sym11091139"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1007\/BF00992698","article-title":"Q-learning","volume":"8","author":"Watkins","year":"1992","journal-title":"Mach. Learn."},{"key":"ref_13","unstructured":"Rummery, G.A., and Niranjan, M. (1994). On-Line Q-Learning Using Connectionist Systems, Department of Engineering, University of Cambridge."},{"key":"ref_14","unstructured":"Bertsekas, D.P. (2019). Reinforcement Learning and Optimal Control, Athena Scientific."},{"key":"ref_15","unstructured":"Sutton, R.S. (1996). Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. Advances in Neural Information Processing Systems, The MIT Press. 
Available online: http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.51.4764&rep=rep1&type=pdf."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"237","DOI":"10.1613\/jair.301","article-title":"Reinforcement learning: A survey","volume":"4","author":"Kaelbling","year":"1996","journal-title":"J. Artif. Intell. Res."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1109\/MSP.2017.2743240","article-title":"Deep reinforcement learning: A brief survey","volume":"34","author":"Arulkumaran","year":"2017","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_18","unstructured":"Fran\u00e7ois-Lavet, V., Fonteneau, R., and Ernst, D. (2015). How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies. arXiv."},{"key":"ref_19","unstructured":"Edwards, A., Littman, M.L., and Isbell, C.L. (2021, June 16). Expressing Tasks Robustly via Multiple Discount Factors. Available online: https:\/\/www.semanticscholar.org\/paper\/Expressing-Tasks-Robustly-via-Multiple-Discount-Edwards-Littman\/3b4f5a83ca49d09ce3bf355be8b7e1e956dc27fe."},{"key":"ref_20","first-page":"7949","article-title":"Rethinking the discount factor in reinforcement learning: A decision theoretic approach","volume":"33","author":"Pitis","year":"2019","journal-title":"Proc. Aaai Conf. Artif. Intell."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"377","DOI":"10.1007\/s00186-020-00716-8","article-title":"Discrete-time control with non-constant discount factor","volume":"92","author":"Menaldi","year":"2020","journal-title":"Math. Methods Oper. Res."},{"key":"ref_22","first-page":"369","article-title":"Markov decision processes with state-dependent discount factors and unbounded rewards\/costs","volume":"39","author":"Wei","year":"2011","journal-title":"Oper. Res. Lett."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Groman, S.M. (2020). The Neurobiology of Impulsive Decision-Making and Reinforcement Learning in Nonhuman Animals, Springer.","DOI":"10.1007\/7854_2020_127"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1007\/s12035-012-8232-6","article-title":"The role of serotonin in the regulation of patience and impulsivity","volume":"45","author":"Miyazaki","year":"2012","journal-title":"Mol. Neurobiol."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Ayd\u0131n, A., and Surer, E. (2020). Using Generative Adversarial Nets on Atari Games for Feature Extraction in Deep Reinforcement Learning. arXiv.","DOI":"10.1109\/SIU49456.2020.9302454"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Ning, Z., Zhang, K., Wang, X., Obaidat, M.S., Guo, L., Hu, X., Hu, B., Guo, Y., Sadoun, B., and Kwok, R.Y. (2020). Joint computing and caching in 5G-envisioned Internet of vehicles: A deep reinforcement learning-based traffic control system. IEEE Trans. Intell. Transp. Syst.","DOI":"10.1109\/TITS.2020.2970276"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"108","DOI":"10.1002\/oca.2156","article-title":"Chaotic dynamics and convergence analysis of temporal difference algorithms with bang-bang control","volume":"37","author":"Tutsoy","year":"2016","journal-title":"Optim. Control Appl. Methods"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1186","DOI":"10.1177\/0142331215581638","article-title":"Reinforcement learning analysis for a minimum time balance problem","volume":"38","author":"Tutsoy","year":"2016","journal-title":"Trans. Inst. Meas. 
Control"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1007\/s00186-006-0092-2","article-title":"Markov control processes with randomized discounted cost","volume":"65","year":"2007","journal-title":"Math. Methods Oper. Res."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yoshida, N., Uchibe, E., and Doya, K. (2013, January 18\u201322). Reinforcement learning with state-dependent discount factor. Proceedings of the 2013 IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL), Osaka, Japan.","DOI":"10.1109\/DevLrn.2013.6652533"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"105190","DOI":"10.1016\/j.jet.2021.105190","article-title":"Dynamic programming with state-dependent discounting","volume":"192","author":"Stachurski","year":"2021","journal-title":"J. Econ. Theory"},{"key":"ref_32","unstructured":"Zhang, S., Veeriah, V., and Whiteson, S. (2020). Learning retrospective knowledge with reverse reinforcement learning. arXiv."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Hasanbeig, M., Abate, A., and Kroening, D. (2020). Cautious reinforcement learning with logical constraints. arXiv.","DOI":"10.1007\/978-3-030-57628-8_1"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Hasanbeig, M., Kroening, D., and Abate, A. (2020). Deep reinforcement learning with temporal logics. International Conference on Formal Modeling and Analysis of Timed Systems, Vienna, Austria, 1\u20133 September 2020, Springer.","DOI":"10.1007\/978-3-030-57628-8_1"},{"key":"ref_35","unstructured":"White, M. (2017, January 6\u201311). Unifying Task Specification in Reinforcement Learning. Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"185","DOI":"10.1007\/BF00993306","article-title":"Asynchronous stochastic approximation and Q-learning","volume":"16","author":"Tsitsiklis","year":"1994","journal-title":"Mach. Learn."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"1185","DOI":"10.1162\/neco.1994.6.6.1185","article-title":"On the convergence of stochastic iterative dynamic programming algorithms","volume":"6","author":"Jaakkola","year":"1994","journal-title":"Neural Comput."},{"key":"ref_38","unstructured":"Rummery, G.A., and Niranjan, M. (1994). On-Line Q-Learning Using Connectionist Systems, Cambridge University Engineering Department. Technical Report TR 166."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"287","DOI":"10.1023\/A:1007678930559","article-title":"Convergence results for single-step on-policy reinforcement-learning algorithms","volume":"38","author":"Singh","year":"2000","journal-title":"Mach. Learn."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1613\/jair.639","article-title":"Hierarchical reinforcement learning with the MAXQ value function decomposition","volume":"13","author":"Dietterich","year":"2000","journal-title":"J. Artif. Intell. Res."},{"key":"ref_41","unstructured":"Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. 
arXiv."}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/13\/7\/1197\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:25:10Z","timestamp":1760163910000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/13\/7\/1197"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,2]]},"references-count":41,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2021,7]]}},"alternative-id":["sym13071197"],"URL":"https:\/\/doi.org\/10.3390\/sym13071197","relation":{},"ISSN":["2073-8994"],"issn-type":[{"type":"electronic","value":"2073-8994"}],"subject":[],"published":{"date-parts":[[2021,7,2]]}}}