{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T17:54:38Z","timestamp":1740160478948,"version":"3.37.3"},"reference-count":53,"publisher":"Springer Science and Business Media LLC","license":[{"start":{"date-parts":[[2022,6,16]],"date-time":"2022-06-16T00:00:00Z","timestamp":1655337600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,6,16]],"date-time":"2022-06-16T00:00:00Z","timestamp":1655337600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Dyn Games Appl"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Multi-agent actor-critic algorithms are an important part of the Reinforcement Learning (RL) paradigm. We propose three fully decentralized multi-agent natural actor-critic (MAN) algorithms in this work. The objective is to collectively find a joint policy that maximizes the average long-term return of these agents. In the absence of a central controller and to preserve privacy, agents communicate some information to their neighbors via a time-varying communication network. We prove convergence of all the three MAN algorithms to a globally asymptotically stable set of the ODE corresponding to actor update; these use linear function approximations. We show that the Kullback\u2013Leibler divergence between policies of successive iterates is proportional to the objective function\u2019s gradient. We observe that the minimum singular value of the Fisher information matrix is well within the reciprocal of the policy parameter dimension. Using this, we theoretically show that the optimal value of the deterministic variant of the MAN algorithm at each iterate dominates that of the standard gradient-based multi-agent actor-critic (MAAC) algorithm. To our knowledge, it is the first such result in multi-agent reinforcement learning (MARL). To illustrate the usefulness of our proposed algorithms, we implement them on a bi-lane traffic network to reduce the average network congestion. We observe an almost 25% reduction in the average congestion in 2 MAN algorithms; the average congestion in another MAN algorithm is on par with the MAAC algorithm. We also consider a generic 15 agent MARL; the performance of the MAN algorithms is again as good as the MAAC algorithm.<\/jats:p>","DOI":"10.1007\/s13235-022-00449-9","type":"journal-article","created":{"date-parts":[[2022,6,16]],"date-time":"2022-06-16T17:02:32Z","timestamp":1655398952000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Multi-Agent Natural Actor-Critic Reinforcement Learning Algorithms"],"prefix":"10.1007","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7709-1452","authenticated-orcid":false,"given":"Prashant","family":"Trivedi","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nandyala","family":"Hemachandra","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,6,16]]},"reference":[{"issue":"98","key":"449_CR1","first-page":"1","volume":"22","author":"A Alekh","year":"2021","unstructured":"Alekh A, Kakade Sham M, Lee Jason D, Gaurav M (2021) On the theory of policy gradient methods: optimality, approximation, and distribution shift. J Mach Learn Res 22(98):1\u201376","journal-title":"J Mach Learn Res"},{"issue":"2","key":"449_CR2","doi-asserted-by":"publisher","first-page":"251","DOI":"10.1162\/089976698300017746","volume":"10","author":"SI Amari","year":"1998","unstructured":"Amari SI (1998) Natural gradient works efficiently in learning. Neural Comput 10(2):251\u2013276","journal-title":"Neural Comput"},{"key":"449_CR3","doi-asserted-by":"crossref","unstructured":"Amari SI, Douglas SC (1998) Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP\u201998 (Cat. No. 98CH36181), 2:1213\u20131216. IEEE","DOI":"10.1109\/ICASSP.1998.675489"},{"key":"449_CR4","volume-title":"Optimization of stochastic systems: topics in discrete-time systems","author":"M Aoki","year":"2016","unstructured":"Aoki M (2016) Optimization of stochastic systems: topics in discrete-time systems. Elsevier, Amsterdam"},{"key":"449_CR5","volume-title":"Dynamic programming and optimal control","author":"P Bertsekas Dimitri","year":"1995","unstructured":"Bertsekas Dimitri P (1995) Dynamic programming and optimal control, vol 1. Athena scientific, Belmont"},{"key":"449_CR6","volume-title":"Reinforcement learning and optimal control","author":"DP Bertsekas","year":"2019","unstructured":"Bertsekas DP (2019) Reinforcement learning and optimal control. Athena scientific, Belmont"},{"issue":"11","key":"449_CR7","doi-asserted-by":"publisher","first-page":"2471","DOI":"10.1016\/j.automatica.2009.07.008","volume":"45","author":"S Bhatnagar","year":"2009","unstructured":"Bhatnagar S, Sutton RS, Ghavamzadeh M, Lee M (2009) Natural actor-critic algorithms. Automatica 45(11):2471\u20132482","journal-title":"Automatica"},{"key":"449_CR8","doi-asserted-by":"crossref","unstructured":"Bhatnagar S, Sutton RiS, Ghavamzadeh M, Lee M (2009) Natural actor\u2013critic algorithms. University of Alberta Department of Computing Science Technical Report TR 09-10","DOI":"10.1016\/j.automatica.2009.07.008"},{"key":"449_CR9","unstructured":"Bhatnagar S (2020) Single and multi-agent reinforcement learning in changing environments. Master\u2019s thesis, IE & OR, IIT Bombay"},{"issue":"11","key":"449_CR10","doi-asserted-by":"publisher","first-page":"7405","DOI":"10.1109\/TIT.2013.2275131","volume":"59","author":"P Bianchi","year":"2013","unstructured":"Bianchi P, Fort G, Hachem W (2013) Performance of a distributed stochastic approximation algorithm. IEEE Trans Inf Theory 59(11):7405\u20137418","journal-title":"IEEE Trans Inf Theory"},{"key":"449_CR11","volume-title":"Stochastic approximation: a dynamical systems viewpoint","author":"VS Borkar","year":"2009","unstructured":"Borkar VS (2009) Stochastic approximation: a dynamical systems viewpoint, vol 48. Springer, Cham"},{"issue":"2","key":"449_CR12","doi-asserted-by":"publisher","first-page":"447","DOI":"10.1137\/S0363012997331639","volume":"38","author":"S Borkar Vivek","year":"2000","unstructured":"Borkar Vivek S, Meyn Sean P (2000) The ode method for convergence of stochastic approximation and reinforcement learning. SIAM J Control Optim 38(2):447\u2013469","journal-title":"SIAM J Control Optim"},{"issue":"2","key":"449_CR13","doi-asserted-by":"publisher","first-page":"223","DOI":"10.1137\/16M1080173","volume":"60","author":"L Bottou","year":"2018","unstructured":"Bottou L, Curtis FE, Nocedal J (2018) Optimization methods for large-scale machine learning. SIAM Rev 60(2):223\u2013311","journal-title":"SIAM Rev"},{"issue":"6","key":"449_CR14","doi-asserted-by":"publisher","first-page":"2508","DOI":"10.1109\/TIT.2006.874516","volume":"52","author":"S Boyd","year":"2006","unstructured":"Boyd S, Ghosh A, Prabhakar B, Shah D (2006) Randomized gossip algorithms. IEEE Trans Inf Theory 52(6):2508\u20132530","journal-title":"IEEE Trans Inf Theory"},{"key":"449_CR15","doi-asserted-by":"crossref","unstructured":"Busoniu L, Babuska R, De\u00a0Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern Part C (Appl Rev), 38(2):156\u2013172","DOI":"10.1109\/TSMCC.2007.913919"},{"key":"449_CR16","doi-asserted-by":"crossref","unstructured":"Corke P, Peterson R, Rus D (2005) Networked robots: flying robot navigation using a sensor net. In: Robotics Research. The eleventh international symposium pp. 234\u2013243. Springer","DOI":"10.1007\/11008941_25"},{"issue":"3","key":"449_CR17","doi-asserted-by":"publisher","first-page":"1464","DOI":"10.1109\/TSG.2013.2248175","volume":"4","author":"E Dall\u2019Anese","year":"2013","unstructured":"Dall\u2019Anese E, Zhu H, Giannakis GB (2013) Distributed optimal power flow for smart microgrids. IEEE Trans Smart Grid 4(3):1464\u20131475","journal-title":"IEEE Trans Smart Grid"},{"key":"449_CR18","first-page":"809","volume":"15","author":"C Dann","year":"2014","unstructured":"Dann C, Neumann G, Peters J (2014) Policy evaluation with temporal differences: a survey and comparison. JMLR 15:809\u2013883","journal-title":"JMLR"},{"key":"449_CR19","unstructured":"Dewanto V, Dunn G, Eshragh A, Gallagher M, Roosta F (2020) Average-reward model-free reinforcement learning: a systematic review and literature mapping. arxiv:2010.08920"},{"issue":"1","key":"449_CR20","doi-asserted-by":"publisher","first-page":"298","DOI":"10.1137\/20M1311971","volume":"3","author":"TT Doan","year":"2021","unstructured":"Doan TT, Maguluri ST, Romberg J (2021) Finite-time performance of distributed temporal-difference learning with linear function approximation. SIAM J Math Data Sci 3(1):298\u2013320","journal-title":"SIAM J Math Data Sci"},{"issue":"9","key":"449_CR21","doi-asserted-by":"publisher","first-page":"1465","DOI":"10.1109\/TAC.2004.834433","volume":"49","author":"J Alexander Fax","year":"2004","unstructured":"Alexander Fax J, Murray RM (2004) Information flow and cooperative control of vehicle formations. IEEE Trans Autom Control 49(9):1465\u20131476","journal-title":"IEEE Trans Autom Control"},{"key":"449_CR22","volume-title":"Real analysis: modern techniques and their applications","author":"GB Folland","year":"1999","unstructured":"Folland GB (1999) Real analysis: modern techniques and their applications, vol 40. John Wiley & Sons, New York"},{"issue":"6","key":"449_CR23","doi-asserted-by":"publisher","first-page":"1291","DOI":"10.1109\/TSMCC.2012.2218595","volume":"42","author":"G Ivo","year":"2012","unstructured":"Ivo G, Lucian B, Lopes Gabriel AD, Robert B (2012) A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Trans Syst Man Cybern Part C (Appl Rev) 42(6):1291\u20131307","journal-title":"IEEE Trans Syst Man Cybern Part C (Appl Rev)"},{"issue":"20","key":"449_CR24","doi-asserted-by":"publisher","first-page":"363","DOI":"10.1016\/j.ifacol.2019.12.182","volume":"52","author":"PC Heredia","year":"2019","unstructured":"Heredia PC, Mou S (2019) Distributed multi-agent reinforcement learning by actor-critic method. IFAC-PapersOnLine 52(20):363\u2013368","journal-title":"IFAC-PapersOnLine"},{"key":"449_CR25","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9781139020411","volume-title":"Matrix analysis","author":"Horn Roger","year":"2012","unstructured":"Roger Horn, Johnson R (2012) Matrix analysis. Cambridge University Press, Cambridge"},{"key":"449_CR26","unstructured":"Kakade SM (2001) A natural policy gradient. Adv Neural Inf Process Syst 14"},{"key":"449_CR27","unstructured":"Konda VR, Tsitsiklis JN (2000) Actor-critic algorithms. Adv Neural Inf Process Syst pp. 1008\u20131014. Citeseer"},{"key":"449_CR28","unstructured":"Krajzewicz D, Erdmann J, Behrisch M, Bieker L (2012) Recent development and applications of sumo-simulation of urban mobility. Int J Adv Syst Measure, 5(3 &4)"},{"key":"449_CR29","volume-title":"Stochastic approximation methods for constrained and unconstrained systems","author":"HJ Kushner","year":"2012","unstructured":"Kushner HJ, Clark DS (2012) Stochastic approximation methods for constrained and unconstrained systems, vol 26. Springer Science & Business Media, Cham"},{"issue":"1","key":"449_CR30","first-page":"159","volume":"22","author":"S Mahadevan","year":"1996","unstructured":"Mahadevan S (1996) Average reward reinforcement learning: foundations, algorithms, and empirical results. Mach Learn 22(1):159\u2013195","journal-title":"Mach Learn"},{"issue":"146","key":"449_CR31","first-page":"1","volume":"21","author":"J Martens","year":"2020","unstructured":"Martens J (2020) New insights and perspectives on the natural gradient method. J Mach Learn Res 21(146):1\u201376","journal-title":"J Mach Learn Res"},{"issue":"3","key":"449_CR32","doi-asserted-by":"publisher","first-page":"1535","DOI":"10.1137\/140992588","volume":"54","author":"AS Mathkar","year":"2016","unstructured":"Mathkar AS, Borkar VS (2016) Nonlinear gossip. SIAM J Control Optim 54(3):1535\u20131557","journal-title":"SIAM J Control Optim"},{"key":"449_CR33","unstructured":"Mei J, Xiao C, Szepesvari C, Schuurmans D (2020) On the global convergence rates of softmax policy gradient methods. In: International conference on machine learning, pp. 6820\u20136829. PMLR"},{"issue":"4","key":"449_CR34","doi-asserted-by":"publisher","first-page":"2597","DOI":"10.1137\/16M1084316","volume":"27","author":"A Nedic","year":"2017","unstructured":"Nedic A, Olshevsky A, Shi W (2017) Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J Optim 27(4):2597\u20132633","journal-title":"SIAM J Optim"},{"issue":"1","key":"449_CR35","doi-asserted-by":"publisher","first-page":"48","DOI":"10.1109\/TAC.2008.2009515","volume":"54","author":"A Nedic","year":"2009","unstructured":"Nedic A, Ozdaglar A (2009) Distributed subgradient methods for multi-agent optimization. IEEE Trans Autom Control 54(1):48\u201361","journal-title":"IEEE Trans Autom Control"},{"key":"449_CR36","unstructured":"Neveu TP (1975) Jacques translated by\u00a0Speed. Discrete-parameter martingales, vol. 10. North-Holland, Amsterdam"},{"key":"449_CR37","volume-title":"Numerical optimization","author":"J Nocedal","year":"2006","unstructured":"Nocedal J, Wright S (2006) Numerical optimization. Springer Science & Business Media, Cham"},{"issue":"2","key":"449_CR38","doi-asserted-by":"publisher","first-page":"337","DOI":"10.1080\/00207177108931948","volume":"13","author":"JW Patchell","year":"1971","unstructured":"Patchell JW, Jacobs OLR (1971) Separability, neutrality and certainty equivalence. Int J Control 13(2):337\u2013342","journal-title":"Int J Control"},{"key":"449_CR39","unstructured":"Peters J, Vijayakumar S, Schaal S (2003) Reinforcement learning for humanoid robotics. In: 3rd International conference on humanoid robots, pp. 1\u201320"},{"issue":"7\u20139","key":"449_CR40","doi-asserted-by":"publisher","first-page":"1180","DOI":"10.1016\/j.neucom.2007.11.026","volume":"71","author":"J Peters","year":"2008","unstructured":"Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7\u20139):1180\u20131190","journal-title":"Neurocomputing"},{"key":"449_CR41","volume-title":"Markov decision processes: discrete stochastic dynamic programming","author":"L Puterman Martin","year":"2014","unstructured":"Puterman Martin L (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, New York"},{"key":"449_CR42","unstructured":"Ratliff N (2013) Information geometry and natural gradients. Available at https:\/\/ipvs.informatik.uni-stuttgart.de\/mlr\/wp-content\/uploads\/2015\/01\/mathematics_for_intelligent_systems_lecture12_notes_I.pdf. Last accessed August 20, 2021"},{"issue":"1","key":"449_CR43","doi-asserted-by":"publisher","first-page":"124","DOI":"10.1214\/aoms\/1177729893","volume":"21","author":"J Sherman","year":"1950","unstructured":"Sherman J, Morrison WJ (1950) Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann Math Stat 21(1):124\u2013127","journal-title":"Ann Math Stat"},{"issue":"2","key":"449_CR44","doi-asserted-by":"publisher","first-page":"1549","DOI":"10.1016\/j.ifacol.2020.12.2021","volume":"53","author":"W Suttle","year":"2020","unstructured":"Suttle W, Yang Z, Zhang K, Wang Z, Ba\u015far T, Liu J (2020) A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning. IFAC-PapersOnLine 53(2):1549\u20131554","journal-title":"IFAC-PapersOnLine"},{"key":"449_CR45","volume-title":"Reinforcement learning: an introduction","author":"RS Sutton","year":"2018","unstructured":"Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT press, Cambridge"},{"key":"449_CR46","first-page":"1057","volume":"99","author":"RS Sutton","year":"1999","unstructured":"Sutton RS, McAllester D, Singh S, Mansour Y (1999) Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst 99:1057\u20131063","journal-title":"Adv Neural Inf Process Syst"},{"key":"449_CR47","doi-asserted-by":"crossref","unstructured":"Thoppe G, Borkar V (2019) A concentration bound for stochastic approximation via Alekseev\u2019s formula. Stoch Syst 9(1):1\u201326","DOI":"10.1287\/stsy.2018.0019"},{"key":"449_CR48","doi-asserted-by":"crossref","unstructured":"Thoppe GC, Kumar B (2021) A law of iterated logarithm for multi-agent reinforcement learning. Adv Neural Inf Process Syst, 34","DOI":"10.1109\/ICC54714.2021.9702912"},{"key":"449_CR49","unstructured":"Trivedi P, Hemachandra N (2021) Multi-agent natural actor-critic reinforcement learning algorithms. arxiv:2109.01654"},{"issue":"3","key":"449_CR50","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1609\/aimag.v33i3.2426","volume":"33","author":"K Tuyls","year":"2012","unstructured":"Tuyls K, Weiss G (2012) Multiagent learning: basics, challenges, and prospects. AI Magaz 33(3):41\u201341","journal-title":"AI Magaz"},{"issue":"6","key":"449_CR51","doi-asserted-by":"publisher","first-page":"802","DOI":"10.1631\/FITEE.1900661","volume":"22","author":"K Zhang","year":"2021","unstructured":"Zhang K, Yang Z, Ba\u015far T (2021) Decentralized multi-agent reinforcement learning with networked agents: recent advances. Front Inf Technol Electron Eng 22(6):802\u2013814","journal-title":"Front Inf Technol Electron Eng"},{"key":"449_CR52","doi-asserted-by":"crossref","unstructured":"Zhang K, Yang Z, Ba\u015far T (2021) Multi-agent reinforcement learning: a selective overview of theories and algorithms. Handbook of reinforcement learning and control, pp. 321\u2013384","DOI":"10.1007\/978-3-030-60990-0_12"},{"key":"449_CR53","doi-asserted-by":"crossref","unstructured":"Zhang K, Yang Z, Liu H, Zhang T, Basar T (2018) Fully decentralized multi-agent reinforcement learning with networked agents. In: International conference on machine learning, pp. 5872\u20135881. PMLR","DOI":"10.1109\/CDC.2018.8619581"}],"container-title":["Dynamic Games and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s13235-022-00449-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s13235-022-00449-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s13235-022-00449-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,27]],"date-time":"2024-09-27T02:10:35Z","timestamp":1727403035000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s13235-022-00449-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,16]]},"references-count":53,"alternative-id":["449"],"URL":"https:\/\/doi.org\/10.1007\/s13235-022-00449-9","relation":{},"ISSN":["2153-0785","2153-0793"],"issn-type":[{"type":"print","value":"2153-0785"},{"type":"electronic","value":"2153-0793"}],"subject":[],"published":{"date-parts":[[2022,6,16]]},"assertion":[{"value":"11 April 2022","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 June 2022","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}