{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,5]],"date-time":"2025-11-05T08:44:25Z","timestamp":1762332265537,"version":"build-2065373602"},"reference-count":33,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2025,8,31]],"date-time":"2025-08-31T00:00:00Z","timestamp":1756598400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,8,31]],"date-time":"2025-08-31T00:00:00Z","timestamp":1756598400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100005765","name":"Universidade de Lisboa","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005765","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Appl Math Optim"],"published-print":{"date-parts":[[2025,10]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Reinforcement learning algorithms aim at solving discrete time stochastic control problems with unknown underlying dynamical systems by an iterative process of interaction. The process is formalized as a Markov decision process, where at each time step, a control action is given, the system provides a reward, and the state changes stochastically. The objective of the controller is the expected sum of rewards obtained throughout the interaction. When the set of states and or actions is large, it is necessary to use some form of function approximation. But even if the function approximation set is simply a linear span of fixed features, the reinforcement learning algorithms may diverge. In this work, we propose and analyze regularized two-time-scale variations of the algorithms, and prove that they are guaranteed to converge almost-surely to a unique solution to the reinforcement learning problem.<\/jats:p>","DOI":"10.1007\/s00245-025-10304-z","type":"journal-article","created":{"date-parts":[[2025,8,31]],"date-time":"2025-08-31T07:33:21Z","timestamp":1756625601000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Regularization and Two Time Scales for Convergence of Reinforcement Learning"],"prefix":"10.1007","volume":"92","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3008-7322","authenticated-orcid":false,"given":"Diogo S.","family":"Carvalho","sequence":"first","affiliation":[]},{"given":"Pedro A.","family":"Santos","sequence":"additional","affiliation":[]},{"given":"Francisco S.","family":"Melo","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,8,31]]},"reference":[{"key":"10304_CR1","unstructured":"Puterman, M.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, (2005)"},{"key":"10304_CR2","unstructured":"Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, 2nd edition, (2018)"},{"key":"10304_CR3","first-page":"369","volume":"7","author":"J Boyan","year":"1995","unstructured":"Boyan, J., Moore, A.: Generalization in reinforcement learning: safely approximating the value function. Adv. Neural. Inf. Process. Syst. 7, 369\u2013376 (1995)","journal-title":"Adv. Neural. Inf. Process. 
Syst."},{"key":"10304_CR4","unstructured":"Gordon, G.J.: Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems, pp. 1040\u20131046, (2001)"},{"key":"10304_CR5","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1023\/A:1018008221616","volume":"22","author":"J Tsitsiklis","year":"1996","unstructured":"Tsitsiklis, J., Van Roy, B.: Feature-based methods for large scale dynamic programming. Mach. Learn. 22, 59\u201394 (1996)","journal-title":"Mach. Learn."},{"key":"10304_CR6","first-page":"19412","volume":"33","author":"DS Carvalho","year":"2020","unstructured":"Carvalho, D.S., Melo, F.S., Santos, P.: A new convergent variant of q-learning with linear function approximation. Adv. Neural. Inf. Process. Syst. 33, 19412 (2020)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"10304_CR7","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","volume":"518","author":"V Mnih","year":"2015","unstructured":"Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518, 529\u2013533 (2015)","journal-title":"Nature"},{"issue":"3","key":"10304_CR8","doi-asserted-by":"publisher","first-page":"400","DOI":"10.1214\/aoms\/1177729586","volume":"22","author":"H Robbins","year":"1951","unstructured":"Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400\u2013407 (1951)","journal-title":"Ann. Math. Stat."},{"key":"10304_CR9","doi-asserted-by":"publisher","DOI":"10.1007\/978-93-86279-38-5","volume-title":"Stochastic Approximation: A Dynamical Systems Viewpoint","author":"V Borkar","year":"2008","unstructured":"Borkar, V.: Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, Cambridge (2008)"},{"issue":"3","key":"10304_CR10","doi-asserted-by":"publisher","first-page":"531","DOI":"10.1080\/00207179208934253","volume":"55","author":"Aleksandr Mikhailovich Lyapunov","year":"1992","unstructured":"Aleksandr Mikhailovich Lyapunov: The general problem of the stability of motion. Int. J. Control 55(3), 531\u2013534 (1992)","journal-title":"Int. J. Control"},{"key":"10304_CR11","doi-asserted-by":"publisher","first-page":"108","DOI":"10.1016\/j.automatica.2016.12.014","volume":"79","author":"C Lakshminarayanan","year":"2017","unstructured":"Lakshminarayanan, C., Bhatnagar, S.: A stability criterion for two timescale stochastic approximation schemes. Automatica 79, 108\u2013114 (2017)","journal-title":"Automatica"},{"issue":"4","key":"10304_CR12","doi-asserted-by":"publisher","first-page":"3001","DOI":"10.1137\/08073041X","volume":"47","author":"O Bokanowski","year":"2009","unstructured":"Bokanowski, O., Maroso, S., Zidani, H.: Some convergence results for Howard\u2019s algorithm. SIAM J. Numer. Anal. 47(4), 3001\u20133026 (2009)","journal-title":"SIAM J. Numer. Anal."},{"issue":"2","key":"10304_CR13","doi-asserted-by":"publisher","first-page":"331","DOI":"10.1007\/s10208-020-09460-1","volume":"21","author":"K Ito","year":"2021","unstructured":"Ito, K., Reisinger, C., Zhang, Y.: A neural network-based policy iteration algorithm with global h 2-superlinear convergence for stochastic games on domains. Found. Comput. Math. 21(2), 331\u2013374 (2021)","journal-title":"Found. Comput. 
Math."},{"key":"10304_CR14","unstructured":"Yang, L., Wang, M.: Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning, pp. 6995\u20137004. PMLR, (2019)"},{"key":"10304_CR15","first-page":"59172","volume":"36","author":"G Weisz","year":"2024","unstructured":"Weisz, G., Gy\u00f6rgy, A., Szepesv\u00e1ri, C.: Online rl in linearly qpi-realizable mdps is as easy as in linear mdps if you learn what to ignore. Adv. Neural. Inf. Process. Syst. 36, 59172 (2024)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"10304_CR16","doi-asserted-by":"crossref","unstructured":"Baird, L.: Residual algorithms: reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pp. 30\u201337, (1995)","DOI":"10.1016\/B978-1-55860-377-6.50013-X"},{"issue":"2","key":"10304_CR17","doi-asserted-by":"publisher","first-page":"161","DOI":"10.1023\/A:1017928328829","volume":"49","author":"D Ormoneit","year":"2002","unstructured":"Ormoneit, D., Sen, \u015a: Kernel-based reinforcement learning. Mach. Learn. 49(2), 161\u2013178 (2002)","journal-title":"Mach. Learn."},{"key":"10304_CR18","unstructured":"Singh, S., Jaakkola, T., Jordan, M.: Reinforcement learning with soft state aggregation. Advances in Neural Information Processing Systems, 7, (1994)"},{"key":"10304_CR19","doi-asserted-by":"crossref","unstructured":"Szepesv\u00e1ri, C., Smart, W. D.: Interpolation-based q-learning. In International Conference on Machine Learning, p. 100, (2004)","DOI":"10.1145\/1015330.1015445"},{"key":"10304_CR20","doi-asserted-by":"crossref","unstructured":"Melo, F.S., Meyn, S., Ribeiro, M.I.: An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine learning, pp. 664\u2013671, (2008)","DOI":"10.1145\/1390156.1390240"},{"issue":"7540","key":"10304_CR21","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","volume":"518","author":"V Mnih","year":"2015","unstructured":"Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S.: Human-level control through deep reinforcement learning. Nature 518(7540), 529\u2013533 (2015)","journal-title":"Nature"},{"key":"10304_CR22","unstructured":"Chen, Z., Clarke, J. P., Maguluri, S. T.: Target network and truncation overcome the deadly triad in q-learning. arXiv preprint arXiv:2203.02628, (2022)"},{"key":"10304_CR23","unstructured":"Zhang, S., Yao, H., Whiteson, S.: Breaking the deadly triad with a target network. arXiv preprint arXiv:2101.08862, (2021)"},{"key":"10304_CR24","unstructured":"Pich\u00e9, A., Thomas, V., Pardinas, R., Marino, J., Marconi, G. M., Pal, C., Khan, M. E.: Beyond target networks: improving deep $$ q $$-learning with functional regularization. arXiv preprint arXiv:2106.02613, (2021)"},{"key":"10304_CR25","doi-asserted-by":"crossref","unstructured":"Farahmand, A.-M.: Regularization in reinforcement learning. PhD Thesis, (2011)","DOI":"10.1007\/s10994-011-5254-7"},{"key":"10304_CR26","unstructured":"Agarwal, N., Chaudhuri, S., Jain, P., Nagaraj, D., Netrapalli, P.: Online target q-learning with reverse experience replay: efficiently finding the optimal policy for linear mdps. arXiv preprintarXiv:2110.08440, (2021)"},{"key":"10304_CR27","unstructured":"Lim, H. D., Lee, D.: Regularized q-learning. 
arXiv preprint arXiv:2202.05404, (2022)"},{"issue":"1\u20132","key":"10304_CR28","doi-asserted-by":"publisher","first-page":"419","DOI":"10.1007\/s10107-016-1017-3","volume":"161","author":"M Wang","year":"2017","unstructured":"Wang, M., Fang, E.X., Liu, H.: Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program. 161(1\u20132), 419\u2013449 (2017)","journal-title":"Math. Program."},{"key":"10304_CR29","unstructured":"Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, (2017)"},{"key":"10304_CR30","unstructured":"Maei, H.R., Szepesv\u00e1ri, C., Bhatnagar, S., Sutton, R.S.: Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, pp. 719\u2013726, (2010)"},{"key":"10304_CR31","doi-asserted-by":"crossref","unstructured":"Sutton, R. S., Maei, H., Szepesv\u00e1ri, C.: A convergent o (n) temporal-difference algorithm for off-policy learning with linear function approximation. In NIPS, (2008)","DOI":"10.1145\/1553374.1553501"},{"key":"10304_CR32","unstructured":"Perkins, T.J., Precup, D.: A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems, pp. 1627\u20131634, (2003)"},{"issue":"2","key":"10304_CR33","first-page":"55","volume":"10","author":"C Pukdeboon","year":"2011","unstructured":"Pukdeboon, C.: A review of fundamentals of Lyapunov theory. J. Appl. Sci. 10(2), 55\u201361 (2011)","journal-title":"J. Appl. Sci."}],"container-title":["Applied Mathematics &amp; Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00245-025-10304-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00245-025-10304-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00245-025-10304-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,5]],"date-time":"2025-11-05T08:40:40Z","timestamp":1762332040000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00245-025-10304-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,31]]},"references-count":33,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,10]]}},"alternative-id":["10304"],"URL":"https:\/\/doi.org\/10.1007\/s00245-025-10304-z","relation":{},"ISSN":["0095-4616","1432-0606"],"issn-type":[{"type":"print","value":"0095-4616"},{"type":"electronic","value":"1432-0606"}],"subject":[],"published":{"date-parts":[[2025,8,31]]},"assertion":[{"value":"5 August 2025","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"31 August 2025","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no relevant financial or non-financial interests to disclose.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of 
interest"}}],"article-number":"30"}}