{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T15:44:08Z","timestamp":1775576648603,"version":"3.50.1"},"reference-count":119,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2021,6,23]],"date-time":"2021-06-23T00:00:00Z","timestamp":1624406400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,6,23]],"date-time":"2021-06-23T00:00:00Z","timestamp":1624406400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100014013","name":"UK Research and Innovation","doi-asserted-by":"crossref","award":["Future Leaders Fellowship MR\/S032525\/1"],"award-info":[{"award-number":["Future Leaders Fellowship MR\/S032525\/1"]}],"id":[{"id":"10.13039\/100014013","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100013296","name":"Max-Planck-Institut f\u00fcr Mathematik in den Naturwissenschaften","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100013296","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput & Applic"],"published-print":{"date-parts":[[2022,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>A dynamical systems perspective on multi-agent learning, based on the link between evolutionary game theory and reinforcement learning, provides an improved, qualitative understanding of the emerging collective learning dynamics. However, confusion exists with respect to how this dynamical systems account of multi-agent learning should be interpreted. In this article, I propose to embed the dynamical systems description of multi-agent learning into different abstraction levels of cognitive analysis. 
The purpose of this work is to make the connections between these levels explicit in order to gain improved insight into multi-agent learning. I demonstrate the usefulness of this framework with the general and widespread class of temporal-difference reinforcement learning. I find that its deterministic dynamical systems description follows a minimum free-energy principle and unifies a boundedly rational account of game theory with decision-making under uncertainty. I then propose an on-line sample-batch temporal-difference algorithm which is characterized by the combination of applying a memory-batch and separated state-action value estimation. I find that this algorithm serves as a micro-foundation of the deterministic learning equations by showing that its learning trajectories approach the ones of the deterministic learning equations under large batch sizes. Ultimately, this framework of embedding a dynamical systems description into different abstraction levels gives guidance on how to unleash the full potential of the dynamical systems approach to multi-agent learning.<\/jats:p>","DOI":"10.1007\/s00521-021-06117-0","type":"journal-article","created":{"date-parts":[[2021,6,23]],"date-time":"2021-06-23T03:48:50Z","timestamp":1624420130000},"page":"1653-1671","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":18,"title":["Dynamical systems as a level of cognitive analysis of multi-agent learning"],"prefix":"10.1007","volume":"34","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9077-5242","authenticated-orcid":false,"given":"Wolfram","family":"Barfuss","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2021,6,23]]},"reference":[{"key":"6117_CR1","unstructured":"Abdallah S, Kaisers M (2013) Addressing the policy-bias of q-learning by repeating updates. 
In: Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pp. 1045\u20131052. International Foundation for Autonomous Agents and Multiagent Systems"},{"issue":"1","key":"6117_CR2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41467-018-04968-1","volume":"9","author":"M Abou Chakra","year":"2018","unstructured":"Abou Chakra M, Bumann S, Schenk H, Oschlies A, Traulsen A (2018) Immediate action is the best strategy when facing uncertain climate change. Nat Commun 9(1):1\u20139","journal-title":"Nat Commun"},{"issue":"5\u20136","key":"6117_CR3","doi-asserted-by":"publisher","first-page":"433","DOI":"10.1016\/S0968-090X(02)00030-X","volume":"10","author":"JL Adler","year":"2002","unstructured":"Adler JL, Blue VJ (2002) A cooperative multi-agent transportation management and route guidance system. Transp Res Part C Emerg Technol 10(5\u20136):433\u2013454","journal-title":"Transp Res Part C Emerg Technol"},{"key":"6117_CR4","doi-asserted-by":"crossref","unstructured":"Anderson SP, Goeree JK, Holt CA (2002) The logit equilibrium: a perspective on intuitive behavioral anomalies. Southern Econ J pp. 21\u201347","DOI":"10.1002\/j.2325-8012.2002.tb00476.x"},{"key":"6117_CR5","unstructured":"Barfuss W (2020) Reinforcement learning dynamics in the infinite memory limit. In: AAMAS, pp. 1768\u20131770"},{"key":"6117_CR6","unstructured":"Barfuss W (2020) Towards a unified treatment of the dynamics of collective learning. Challenges and Opportunities for Multi-Agent Reinforcement Learning, AAAI Spring Symposium"},{"key":"6117_CR7","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.99.043305","author":"W Barfuss","year":"2019","unstructured":"Barfuss W, Donges JF, Kurths J (2019) Deterministic limit of temporal difference reinforcement learning for stochastic games. Phys Rev E. 
https:\/\/doi.org\/10.1103\/PhysRevE.99.043305","journal-title":"Phys Rev E"},{"issue":"1","key":"6117_CR8","doi-asserted-by":"publisher","first-page":"2354","DOI":"10.1038\/s41467-018-04738-z","volume":"9","author":"W Barfuss","year":"2018","unstructured":"Barfuss W, Donges JF, Lade SJ, Kurths J (2018) When optimization for governing human-environment tipping elements is neither sustainable nor safe. Nat Commun 9(1):2354. https:\/\/doi.org\/10.1038\/s41467-018-04738-z","journal-title":"Nat Commun"},{"issue":"23","key":"6117_CR9","doi-asserted-by":"publisher","first-page":"12915","DOI":"10.1073\/pnas.1916545117","volume":"117","author":"W Barfuss","year":"2020","unstructured":"Barfuss W, Donges JF, Vasconcelos VV, Kurths J, Levin SA (2020) Caring for the future can turn tragedy into comedy for long-term collective action under risk of collapse. Proc Natl Acad Sci 117(23):12915\u201312922","journal-title":"Proc Natl Acad Sci"},{"issue":"2","key":"6117_CR10","doi-asserted-by":"publisher","first-page":"255","DOI":"10.5194\/esd-8-255-2017","volume":"8","author":"W Barfuss","year":"2017","unstructured":"Barfuss W, Donges JF, Wiedermann M, Lucht W (2017) Sustainable use of renewable resources in a stylized social-ecological network model under heterogeneous resource distribution. Earth Syst Dyn 8(2):255\u2013264","journal-title":"Earth Syst Dyn"},{"issue":"1\u20132","key":"6117_CR11","doi-asserted-by":"publisher","first-page":"81","DOI":"10.1016\/0004-3702(94)00011-O","volume":"72","author":"AG Barto","year":"1995","unstructured":"Barto AG, Bradtke SJ, Singh SP (1995) Learning to act using real-time dynamic programming. Artif Intell 72(1\u20132):81\u2013138","journal-title":"Artif Intell"},{"issue":"1\u20132","key":"6117_CR12","doi-asserted-by":"publisher","first-page":"173","DOI":"10.1016\/0004-3702(94)00005-L","volume":"72","author":"RD Beer","year":"1995","unstructured":"Beer RD (1995) A dynamical systems perspective on agent-environment interaction. 
Artif Intell 72(1\u20132):173\u2013215","journal-title":"Artif Intell"},{"issue":"3","key":"6117_CR13","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1016\/S1364-6613(99)01440-0","volume":"4","author":"RD Beer","year":"2000","unstructured":"Beer RD (2000) Dynamical approaches to cognitive science. Trends Cognit Sci 4(3):91\u201399","journal-title":"Trends Cognit Sci"},{"key":"6117_CR14","doi-asserted-by":"publisher","DOI":"10.1103\/physreve.84.041132","author":"AJ Bladon","year":"2011","unstructured":"Bladon AJ, Galla T (2011) Learning dynamics in public goods games. Phys Rev E. https:\/\/doi.org\/10.1103\/physreve.84.041132","journal-title":"Phys Rev E"},{"key":"6117_CR15","doi-asserted-by":"publisher","first-page":"659","DOI":"10.1613\/jair.4818","volume":"53","author":"D Bloembergen","year":"2015","unstructured":"Bloembergen D, Tuyls K, Hennes D, Kaisers M (2015) Evolutionary dynamics of multi-agent learning: a survey. J Artif Intell Res 53:659\u2013697. https:\/\/doi.org\/10.1613\/jair.4818","journal-title":"J Artif Intell Res"},{"issue":"2","key":"6117_CR16","doi-asserted-by":"publisher","first-page":"215","DOI":"10.1016\/S0004-3702(02)00121-2","volume":"136","author":"M Bowling","year":"2002","unstructured":"Bowling M, Veloso M (2002) Multiagent learning using a variable learning rate. Artif Intell 136(2):215\u2013250","journal-title":"Artif Intell"},{"issue":"2","key":"6117_CR17","doi-asserted-by":"publisher","first-page":"156","DOI":"10.1109\/TSMCC.2007.913919","volume":"38","author":"L Busoniu","year":"2008","unstructured":"Busoniu L, Babuska R, De Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. 
IEEE Trans Syst Man Cybernet Part C Appl Rev 38(2):156\u2013172","journal-title":"IEEE Trans Syst Man Cybernet Part C Appl Rev"},{"issue":"1","key":"6117_CR18","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1006\/jeth.1997.2319","volume":"77","author":"T B\u00f6rgers","year":"1997","unstructured":"B\u00f6rgers T, Sarin R (1997) Learning through reinforcement and replicator dynamics. J Econ Theory 77(1):1\u201314. https:\/\/doi.org\/10.1006\/jeth.1997.2319","journal-title":"J Econ Theory"},{"issue":"1","key":"6117_CR19","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1016\/S1389-0417(01)00013-4","volume":"2","author":"C Castelfranchi","year":"2001","unstructured":"Castelfranchi C (2001) The theory of social functions: challenges for computational social science and multi-agent learning. Cognit Syst Res 2(1):5\u201338","journal-title":"Cognit Syst Res"},{"key":"6117_CR20","first-page":"2","volume":"746\u2013752","author":"C Claus","year":"1998","unstructured":"Claus C, Boutilier C (1998) The dynamics of reinforcement learning in cooperative multiagent systems. AAAI\/IAAI 746\u2013752:2","journal-title":"AAAI\/IAAI"},{"issue":"Supplement 3","key":"6117_CR21","doi-asserted-by":"publisher","first-page":"10810","DOI":"10.1073\/pnas.1400823111","volume":"111","author":"R Cressman","year":"2014","unstructured":"Cressman R, Tao Y (2014) The replicator equation and other game dynamics. Proc Natl Acad Sci 111(Supplement 3):10810\u201310817","journal-title":"Proc Natl Acad Sci"},{"issue":"2","key":"6117_CR22","doi-asserted-by":"publisher","first-page":"239","DOI":"10.2307\/1882186","volume":"87","author":"JG Cross","year":"1973","unstructured":"Cross JG (1973) A stochastic learning model of economic behavior. Q J Econ 87(2):239\u2013266. 
https:\/\/doi.org\/10.2307\/1882186","journal-title":"Q J Econ"},{"issue":"1","key":"6117_CR23","doi-asserted-by":"publisher","first-page":"169","DOI":"10.1146\/annurev.ps.31.020180.001125","volume":"31","author":"RM Dawes","year":"1980","unstructured":"Dawes RM (1980) Social dilemmas. Ann Rev Psychol 31(1):169\u2013193","journal-title":"Ann Rev Psychol"},{"issue":"2","key":"6117_CR24","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1016\/j.conb.2008.08.003","volume":"18","author":"P Dayan","year":"2008","unstructured":"Dayan P, Niv Y (2008) Reinforcement learning: the good, the bad and the ugly. Curr Opin Neurobiol 18(2):185\u2013196","journal-title":"Curr Opin Neurobiol"},{"issue":"12","key":"6117_CR25","doi-asserted-by":"publisher","first-page":"101752","DOI":"10.1016\/j.isci.2020.101752","volume":"23","author":"EF Domingos","year":"2020","unstructured":"Domingos EF, Gruji\u0107 J, Burguillo JC, Kirchsteiger G, Santos FC, Lenaerts T (2020) Timing uncertainty in collective risk dilemmas encourages group reciprocation and polarization. iScience 23(12):101752","journal-title":"iScience"},{"issue":"3","key":"6117_CR26","doi-asserted-by":"publisher","first-page":"369","DOI":"10.3982\/TE632","volume":"5","author":"U Doraszelski","year":"2010","unstructured":"Doraszelski U, Escobar JF (2010) A theory of regular Markov perfect equilibria in dynamic stochastic games: genericity, stability, and purification. Theor Econ 5(3):369\u2013402","journal-title":"Theor Econ"},{"key":"6117_CR27","unstructured":"Doshi-Velez F, Kim B (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608"},{"issue":"7256","key":"6117_CR28","doi-asserted-by":"publisher","first-page":"685","DOI":"10.1038\/460685a","volume":"460","author":"JD Farmer","year":"2009","unstructured":"Farmer JD, Foley D (2009) The economy needs agent-based modelling. 
Nature 460(7256):685\u2013686","journal-title":"Nature"},{"issue":"1","key":"6117_CR29","first-page":"89","volume":"28","author":"AM Fink","year":"1964","unstructured":"Fink AM et al (1964) Equilibrium in a stochastic $$ n $$-person game. J Sci Hiroshima Univ 28(1):89\u201393","journal-title":"J Sci Hiroshima Univ"},{"issue":"2","key":"6117_CR30","doi-asserted-by":"publisher","first-page":"127","DOI":"10.1038\/nrn2787","volume":"11","author":"K Friston","year":"2010","unstructured":"Friston K (2010) The free-energy principle: a unified brain theory? Nat Rev Neurosci 11(2):127\u2013138","journal-title":"Nat Rev Neurosci"},{"key":"6117_CR31","volume-title":"The theory of learning in games","author":"D Fudenberg","year":"1998","unstructured":"Fudenberg D, Levine DK (1998) The theory of learning in games, vol 2. MIT Press Cambridge, Massachusetts, London, England"},{"key":"6117_CR32","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevLett.103.198702","author":"T Galla","year":"2009","unstructured":"Galla T (2009) Intrinsic Noise in Game Dynamical Learning. Physical Review Letters. https:\/\/doi.org\/10.1103\/PhysRevLett.103.198702","journal-title":"Physical Review Letters"},{"issue":"08","key":"6117_CR33","doi-asserted-by":"publisher","first-page":"P08007","DOI":"10.1088\/1742-5468\/2011\/08\/p08007","volume":"2011","author":"T Galla","year":"2011","unstructured":"Galla T (2011) Cycles of cooperation and defection in imperfect learning. J Stat Mech Theory Exp 2011(08):P08007. https:\/\/doi.org\/10.1088\/1742-5468\/2011\/08\/p08007","journal-title":"J Stat Mech Theory Exp"},{"issue":"4","key":"6117_CR34","doi-asserted-by":"publisher","first-page":"1232","DOI":"10.1073\/pnas.1109672110","volume":"110","author":"T Galla","year":"2013","unstructured":"Galla T, Farmer JD (2013) Complex dynamics in learning complicated games. Proc Natl Acad Sci 110(4):1232\u20131236. 
https:\/\/doi.org\/10.1073\/pnas.1109672110","journal-title":"Proc Natl Acad Sci"},{"issue":"2","key":"6117_CR35","doi-asserted-by":"publisher","first-page":"217","DOI":"10.1111\/tops.12142","volume":"7","author":"TL Griffiths","year":"2015","unstructured":"Griffiths TL, Lieder F, Goodman ND (2015) Rational use of cognitive resources: levels of analysis between the computational and the algorithmic. Top Cognit Sci 7(2):217\u2013229","journal-title":"Top Cognit Sci"},{"key":"6117_CR36","unstructured":"Hafner D, Ortega PA, Ba J, Parr T, Friston K, Heess N (2020) Action and perception as divergence minimization. arXiv preprint arXiv:2009.01791"},{"issue":"2","key":"6117_CR37","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1145\/1998549.1998551","volume":"10","author":"JY Halpern","year":"2011","unstructured":"Halpern JY, Pass R (2011) Algorithmic rationality: adding cost of computation to game theory. ACM SIGecom Exch 10(2):9\u201315","journal-title":"ACM SIGecom Exch"},{"key":"6117_CR38","first-page":"2613","volume":"23","author":"H Hasselt","year":"2010","unstructured":"Hasselt H (2010) Double q-learning. Adv Neural Inf Process Syst 23:2613\u20132621","journal-title":"Adv Neural Inf Process Syst"},{"key":"6117_CR39","unstructured":"Heess N, Silver D, Teh YW (2013) Actor-critic reinforcement learning with energy-based policies. In: European Workshop on Reinforcement Learning, pp. 45\u201358"},{"key":"6117_CR40","unstructured":"Hennes D, Kaisers M, Tuyls K (2010) RESQ-learning in stochastic games. In: Adaptive and Learning Agents Workshop at AAMAS, ALA\u201910"},{"key":"6117_CR41","unstructured":"Hennes D, Tuyls K, Rauterberg M (2009) State-coupled replicator dynamics. In: Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2009, pp. 
789\u2013796"},{"key":"6117_CR42","unstructured":"Hernandez-Leal P, Kaisers M, Baarslag T, de\u00a0Cote EM (2017) A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183"},{"issue":"6","key":"6117_CR43","doi-asserted-by":"publisher","first-page":"750","DOI":"10.1007\/s10458-019-09421-1","volume":"33","author":"P Hernandez-Leal","year":"2019","unstructured":"Hernandez-Leal P, Kartal B, Taylor ME (2019) A survey and critique of multiagent deep reinforcement learning. Auton Agents Multi-Agent Syst 33(6):750\u2013797","journal-title":"Auton Agents Multi-Agent Syst"},{"key":"6117_CR44","doi-asserted-by":"crossref","unstructured":"Hester T, Stone P (2012) Learning and using models. In: Reinforcement learning, pp. 111\u2013141. Springer","DOI":"10.1007\/978-3-642-27645-3_4"},{"key":"6117_CR45","doi-asserted-by":"publisher","first-page":"106685","DOI":"10.1016\/j.knosys.2020.106685","volume":"214","author":"A Heuillet","year":"2021","unstructured":"Heuillet A, Couthouis F, D\u00edaz-Rodr\u00edguez N (2021) Explainability in deep reinforcement learning. Knowl Based Syst 214:106685","journal-title":"Knowl Based Syst"},{"issue":"6","key":"6117_CR46","doi-asserted-by":"publisher","first-page":"e66490","DOI":"10.1371\/journal.pone.0066490","volume":"8","author":"C Hilbe","year":"2013","unstructured":"Hilbe C, Abou Chakra M, Altrock PM, Traulsen A (2013) The evolution of strategic timing in collective-risk dilemmas. PLoS ONE 8(6):e66490","journal-title":"PLoS ONE"},{"issue":"7713","key":"6117_CR47","doi-asserted-by":"publisher","first-page":"246","DOI":"10.1038\/s41586-018-0277-x","volume":"559","author":"C Hilbe","year":"2018","unstructured":"Hilbe C, \u0160imsa \u0160, Chatterjee K, Nowak MA (2018) Evolution of cooperation in stochastic games. 
Nature 559(7713):246\u2013249","journal-title":"Nature"},{"key":"6117_CR48","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781139173179","volume-title":"Evolutionary games and population dynamics","author":"J Hofbauer","year":"1998","unstructured":"Hofbauer J, Sigmund K (1998) Evolutionary games and population dynamics. Cambridge University Press, Cambridge"},{"issue":"4","key":"6117_CR49","doi-asserted-by":"publisher","first-page":"479","DOI":"10.1090\/S0273-0979-03-00988-1","volume":"40","author":"J Hofbauer","year":"2003","unstructured":"Hofbauer J, Sigmund K (2003) Evolutionary game dynamics. Bull Am Math Soc 40(4):479\u2013519","journal-title":"Bull Am Math Soc"},{"issue":"4","key":"6117_CR50","doi-asserted-by":"publisher","first-page":"717","DOI":"10.1037\/a0017187","volume":"116","author":"A Howes","year":"2009","unstructured":"Howes A, Lewis RL, Vera A (2009) Rational adaptation under task and processing constraints: implications for testing theories of cognition and action. Psychol Rev 116(4):717","journal-title":"Psychol Rev"},{"key":"6117_CR51","unstructured":"Hu H, Lerer A, Peysakhovich A, Foerster J (2020) \u201cother-play\u201d for zero-shot coordination. In: International Conference on Machine Learning, pp. 4399\u20134410. PMLR"},{"key":"6117_CR52","unstructured":"Icard T (2014) Toward boundedly rational analysis. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol.\u00a036"},{"key":"6117_CR53","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511790423","volume-title":"Probability Theory: The Logic of Science","author":"ET Jaynes","year":"2003","unstructured":"Jaynes ET (2003) Probability Theory: The Logic of Science. Cambridge University Press, Cambridge. https:\/\/doi.org\/10.1017\/CBO9780511790423"},{"key":"6117_CR54","unstructured":"John GH (1994) When the best move isn\u2019t optimal: Q-learning with exploration. In: AAAI, p. 1464. 
Citeseer"},{"key":"6117_CR55","unstructured":"Kaisers M, Tuyls K (2010) Frequency adjusted multi-agent Q-learning. In: Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, AAMAS \u201910, pp. 309\u2013316. International Foundation for Autonomous Agents and Multiagent Systems, Toronto, Canada"},{"key":"6117_CR56","unstructured":"Kaisers M, Tuyls K (2011) FAQ-Learning in matrix games: Demonstrating convergence near Nash equilibria, and bifurcation of attractors in the battle of sexes. In: Proceedings of the 13th AAAI Conference on Interactive Decision Theory and Game Theory, AAAIWS\u201911-13, pp. 36\u201342"},{"issue":"2","key":"6117_CR57","doi-asserted-by":"publisher","first-page":"159","DOI":"10.1007\/s10994-012-5278-7","volume":"87","author":"HJ Kappen","year":"2012","unstructured":"Kappen HJ, G\u00f3mez V, Opper M (2012) Optimal control as a graphical model inference problem. Mach Learn 87(2):159\u2013182","journal-title":"Mach Learn"},{"issue":"4","key":"6117_CR58","doi-asserted-by":"publisher","first-page":"041145","DOI":"10.1103\/PhysRevE.85.041145","volume":"85","author":"A Kianercy","year":"2012","unstructured":"Kianercy A, Galstyan A (2012) Dynamics of Boltzmann Q learning in two-player two-action games. Phys Rev E 85(4):041145","journal-title":"Phys Rev E"},{"key":"6117_CR59","unstructured":"Konda VR, Tsitsiklis JN (2000) Actor-critic algorithms. In: Advances in neural information processing systems, pp. 1008\u20131014"},{"key":"6117_CR60","doi-asserted-by":"crossref","unstructured":"Lange S, Gabel T, Riedmiller M (2012) Batch reinforcement learning. In: Reinforcement learning, pp. 45\u201373. 
Springer","DOI":"10.1007\/978-3-642-27645-3_2"},{"issue":"6","key":"6117_CR61","doi-asserted-by":"publisher","first-page":"864","DOI":"10.1109\/TSMCA.2007.904825","volume":"37","author":"JW Lee","year":"2007","unstructured":"Lee JW, Park J, Jangmin O, Lee J, Hong E (2007) A multiagent approach to $$ q $$-learning for daily stock trading. IEEE Trans Syst Man Cybern Part A Syst Hum 37(6):864\u2013877","journal-title":"IEEE Trans Syst Man Cybern Part A Syst Hum"},{"key":"6117_CR62","unstructured":"Levine S (2018) Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909 [cs, stat]. URL http:\/\/arxiv.org\/abs\/1805.00909"},{"issue":"2","key":"6117_CR63","doi-asserted-by":"publisher","first-page":"279","DOI":"10.1111\/tops.12086","volume":"6","author":"RL Lewis","year":"2014","unstructured":"Lewis RL, Howes A, Singh S (2014) Computational rationality: linking mechanism and behavior through bounded utility maximization. Top Cognit Sci 6(2):279\u2013311","journal-title":"Top Cognit Sci"},{"issue":"3\u20134","key":"6117_CR64","doi-asserted-by":"publisher","first-page":"293","DOI":"10.1007\/BF00992699","volume":"8","author":"LJ Lin","year":"1992","unstructured":"Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn 8(3\u20134):293\u2013321","journal-title":"Mach Learn"},{"issue":"3","key":"6117_CR65","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1145\/3236386.3241340","volume":"16","author":"ZC Lipton","year":"2018","unstructured":"Lipton ZC (2018) The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16(3):31\u201357","journal-title":"Queue"},{"key":"6117_CR66","doi-asserted-by":"crossref","unstructured":"Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Machine learning proceedings 1994, pp. 157\u2013163. 
Elsevier","DOI":"10.1016\/B978-1-55860-335-6.50027-1"},{"key":"6117_CR67","volume-title":"Information theory, inference and learning algorithms","author":"DJ MacKay","year":"2003","unstructured":"MacKay DJ (2003) Information theory, inference and learning algorithms. Cambridge University Press, Cambridge"},{"issue":"44","key":"6117_CR68","doi-asserted-by":"publisher","first-page":"E10387","DOI":"10.1073\/pnas.1811964115","volume":"115","author":"RP Mann","year":"2018","unstructured":"Mann RP (2018) Collective decision making by rational individuals. Proc Natl Acad Sci 115(44):E10387\u2013E10396","journal-title":"Proc Natl Acad Sci"},{"issue":"20","key":"6117_CR69","doi-asserted-by":"publisher","first-page":"5077","DOI":"10.1073\/pnas.1618722114","volume":"114","author":"RP Mann","year":"2017","unstructured":"Mann RP, Helbing D (2017) Optimal incentives for collective intelligence. Proc Natl Acad Sci 114(20):5077\u20135082","journal-title":"Proc Natl Acad Sci"},{"key":"6117_CR70","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/9780262514620.001.0001","volume-title":"Vision: a computational investigation into the human representation and processing of visual information","author":"D Marr","year":"2010","unstructured":"Marr D (2010) Vision: a computational investigation into the human representation and processing of visual information. MIT press, Cambridge"},{"key":"6117_CR71","first-page":"470","volume":"15","author":"D Marr","year":"1977","unstructured":"Marr D, Poggio T (1977) From understanding computation to understanding neural circuitry. Neurosci Res Prog Bull 15:470\u2013488","journal-title":"Neurosci Res Prog Bull"},{"issue":"1","key":"6117_CR72","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1017\/S0269888912000057","volume":"27","author":"L Matignon","year":"2012","unstructured":"Matignon L, Laurent GJ, Le Fort-Piat N (2012) Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems. 
Knowl Eng Rev 27(1):1\u201331. https:\/\/doi.org\/10.1017\/S0269888912000057","journal-title":"Knowl Eng Rev"},{"issue":"2","key":"6117_CR73","doi-asserted-by":"publisher","first-page":"251","DOI":"10.1007\/s10640-009-9314-4","volume":"45","author":"M McGinty","year":"2010","unstructured":"McGinty M (2010) International environmental agreements as evolutionary games. Environ Res Econ 45(2):251\u2013269","journal-title":"Environ Res Econ"},{"issue":"1","key":"6117_CR74","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1006\/game.1995.1023","volume":"10","author":"RD McKelvey","year":"1995","unstructured":"McKelvey RD, Palfrey TR (1995) Quantal response equilibria for normal form games. Games Econ Behav 10(1):6\u201338","journal-title":"Games Econ Behav"},{"issue":"2","key":"6117_CR75","doi-asserted-by":"publisher","first-page":"186","DOI":"10.1111\/j.1468-5876.1996.tb00043.x","volume":"47","author":"RD McKelvey","year":"1996","unstructured":"McKelvey RD, Palfrey TR (1996) A statistical theory of equilibrium in games. Jpn Econ Rev 47(2):186\u2013209","journal-title":"Jpn Econ Rev"},{"issue":"7540","key":"6117_CR76","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","volume":"518","author":"V Mnih","year":"2015","unstructured":"Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529\u2013533","journal-title":"Nature"},{"key":"6117_CR77","unstructured":"O\u2019Donoghue B, Munos R, Kavukcuoglu K, Mnih V (2017) Combining policy gradient and q-learning. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. 
URL https:\/\/openreview.net\/forum?id=B1kJ6H9ex"},{"issue":"1","key":"6117_CR78","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41598-019-45619-9","volume":"9","author":"S Omidshafiei","year":"2019","unstructured":"Omidshafiei S, Papadimitriou C, Piliouras G, Tuyls K, Rowland M, Lespiau JB, Czarnecki WM, Lanctot M, Perolat J, Munos R (2019) $$\\alpha $$-rank: multi-agent evaluation by evolution. Sci Rep 9(1):1\u201329","journal-title":"Sci Rep"},{"key":"6117_CR79","doi-asserted-by":"crossref","unstructured":"Ortega PA, Braun DA (2011) Information, utility and bounded rationality. In: International Conference on Artificial General Intelligence, pp. 269\u2013274. Springer","DOI":"10.1007\/978-3-642-22887-2_28"},{"issue":"2153","key":"6117_CR80","first-page":"20120683","volume":"469","author":"PA Ortega","year":"2013","unstructured":"Ortega PA, Braun DA (2013) Thermodynamics as a theory of decision-making with information-processing costs. Proc R Soc A Math Phys Eng Sci 469(2153):20120683","journal-title":"Proc R Soc A Math Phys Eng Sci"},{"key":"6117_CR81","first-page":"423","volume":"9","author":"L Panait","year":"2008","unstructured":"Panait L, Tuyls K, Luke S (2008) Theoretical advantages of lenient learners: an evolutionary game theoretic perspective. J Mach Learn Res 9:423\u2013457","journal-title":"J Mach Learn Res"},{"key":"6117_CR82","doi-asserted-by":"crossref","unstructured":"Riedmiller M, Moore A, Schneider J (2000) Reinforcement learning for cooperating and communicating reactive agents in electrical power grids. In: Workshop on Balancing Reactivity and Social Deliberation in Multi-Agent Systems, pp. 137\u2013149. Springer","DOI":"10.1007\/3-540-44568-4_9"},{"issue":"1\u20132","key":"6117_CR83","doi-asserted-by":"publisher","first-page":"57","DOI":"10.1016\/S0004-3702(97)00026-X","volume":"94","author":"SJ Russell","year":"1997","unstructured":"Russell SJ (1997) Rationality and intelligence. 
Artif Intell 94(1\u20132):57\u201377","journal-title":"Artif Intell"},{"key":"6117_CR84","first-page":"1063","volume":"5","author":"B Sallans","year":"2004","unstructured":"Sallans B, Hinton GE (2004) Reinforcement learning with factored states and actions. J Mach Learn Res 5:1063\u20131088","journal-title":"J Mach Learn Res"},{"issue":"26","key":"6117_CR85","doi-asserted-by":"publisher","first-page":"10421","DOI":"10.1073\/pnas.1015648108","volume":"108","author":"FC Santos","year":"2011","unstructured":"Santos FC, Pacheco JM (2011) Risk of collective failure provides an escape from the tragedy of the commons. Proc Natl Acad Sci 108(26):10421\u201310425","journal-title":"Proc Natl Acad Sci"},{"key":"6117_CR86","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.67.015206","author":"Y Sato","year":"2003","unstructured":"Sato Y, Crutchfield JP (2003) Coupled replicator equations for the dynamics of learning in multiagent systems. Phys Rev E. https:\/\/doi.org\/10.1103\/PhysRevE.67.015206","journal-title":"Phys Rev E"},{"issue":"5306","key":"6117_CR87","doi-asserted-by":"publisher","first-page":"1593","DOI":"10.1126\/science.275.5306.1593","volume":"275","author":"W Schultz","year":"1997","unstructured":"Schultz W, Dayan P, Montague PR (1997) A neural substrate of prediction and reward. Science 275(5306):1593\u20131599","journal-title":"Science"},{"key":"6117_CR88","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1016\/j.conb.2017.03.013","volume":"43","author":"W Schultz","year":"2017","unstructured":"Schultz W, Stauffer WR, Lak A (2017) The phasic dopamine signal maturing: from reward via behavioural activation to formal economic utility. Curr Opin Neurobiol 43:139\u2013148","journal-title":"Curr Opin Neurobiol"},{"key":"6117_CR89","unstructured":"Settles B (2009) Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, Tech. 
rep"},{"key":"6117_CR90","unstructured":"Shalev-Shwartz S, Shammah S, Shashua A (2016) Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295"},{"issue":"3","key":"6117_CR91","doi-asserted-by":"publisher","first-page":"379","DOI":"10.1002\/j.1538-7305.1948.tb01338.x","volume":"27","author":"CE Shannon","year":"1948","unstructured":"Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379\u2013423","journal-title":"Bell Syst Tech J"},{"key":"6117_CR92","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511811654","volume-title":"Multiagent systems: algorithmic, game-theoretic, and logical foundations","author":"Y Shoham","year":"2008","unstructured":"Shoham Y, Leyton-Brown K (2008) Multiagent systems: algorithmic, game-theoretic, and logical foundations. Cambridge University Press, USA"},{"issue":"7","key":"6117_CR93","doi-asserted-by":"publisher","first-page":"365","DOI":"10.1016\/j.artint.2006.02.006","volume":"171","author":"Y Shoham","year":"2007","unstructured":"Shoham Y, Powers R, Grenager T (2007) If multi-agent learning is the answer, what is the question? Artif Intell 171(7):365\u2013377","journal-title":"Artif Intell"},{"key":"6117_CR94","doi-asserted-by":"crossref","unstructured":"Singh SP, Jaakkola T, Jordan MI (1994) Learning without state-estimation in partially observable markovian decision processes. In: Machine Learning Proceedings 1994, pp. 284\u2013292. Elsevier","DOI":"10.1016\/B978-1-55860-335-6.50042-8"},{"key":"6117_CR95","doi-asserted-by":"crossref","unstructured":"Stone P, Kaminka G, Kraus S, Rosenschein J (2010) Ad hoc autonomous agent teams: Collaboration without pre-coordination. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.\u00a024","DOI":"10.1609\/aaai.v24i1.7529"},{"issue":"1","key":"6117_CR96","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1007\/BF00115009","volume":"3","author":"RS Sutton","year":"1988","unstructured":"Sutton RS (1988) Learning to predict by the methods of temporal differences. Mach Learn 3(1):9\u201344","journal-title":"Mach Learn"},{"key":"6117_CR97","doi-asserted-by":"crossref","unstructured":"Sutton RS (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Machine learning proceedings 1990, pp. 216\u2013224. Elsevier","DOI":"10.1016\/B978-1-55860-141-3.50030-4"},{"key":"6117_CR98","unstructured":"Sutton RS, Barto AG (2018) Reinforcement learning: an introduction, 2nd edn. The MIT Press. URL https:\/\/mitpress.mit.edu\/books\/reinforcement-learning-second-edition"},{"key":"6117_CR99","unstructured":"Sutton RS, McAllester DA, Singh SP, Mansour Y, et\u00a0al (2000) Policy gradient methods for reinforcement learning with function approximation. In: Advances in neural information processing systems, pp. 1057\u20131063"},{"key":"6117_CR100","doi-asserted-by":"publisher","DOI":"10.1093\/acprof:oso\/9780195315448.001.0001","volume-title":"Ecological rationality: intelligence in the world","author":"PM Todd","year":"2012","unstructured":"Todd PM, Gigerenzer GE (2012) Ecological rationality: intelligence in the world. Oxford University Press, Oxford"},{"key":"6117_CR101","doi-asserted-by":"crossref","unstructured":"Todorov E (2007) Linearly-solvable markov decision problems. In: Advances in neural information processing systems, pp. 1369\u20131376","DOI":"10.7551\/mitpress\/7503.003.0176"},{"key":"6117_CR102","doi-asserted-by":"crossref","unstructured":"Tokic M, Palm G (2011) Value-difference based exploration: adaptive control between epsilon-greedy and softmax. In: Annual Conference on Artificial Intelligence, pp. 
335\u2013346. Springer","DOI":"10.1007\/978-3-642-24455-1_33"},{"issue":"1","key":"6117_CR103","doi-asserted-by":"publisher","first-page":"63","DOI":"10.1017\/S026988890500041X","volume":"20","author":"K Tuyls","year":"2005","unstructured":"Tuyls K, Now\u00e9 A (2005) Evolutionary game theory and multi-agent reinforcement learning. Knowl Eng Rev 20(1):63\u201390. https:\/\/doi.org\/10.1017\/S026988890500041X","journal-title":"Knowl Eng Rev"},{"key":"6117_CR104","doi-asserted-by":"publisher","unstructured":"Tuyls K, Verbeeck K, Lenaerts T (2003) A selection-mutation model for q-learning in multi-agent systems. In: Proceedings of the second international joint conference on Autonomous agents and multiagent systems, AAMAS \u201903, pp. 693\u2013700. Association for Computing Machinery, Melbourne, Australia . https:\/\/doi.org\/10.1145\/860575.860687","DOI":"10.1145\/860575.860687"},{"issue":"3","key":"6117_CR105","first-page":"41","volume":"33","author":"K Tuyls","year":"2012","unstructured":"Tuyls K, Weiss G (2012) Multiagent learning: Basics, challenges, and prospects. Ai Mag 33(3):41\u201341","journal-title":"Ai Mag"},{"key":"6117_CR106","doi-asserted-by":"crossref","unstructured":"Van\u00a0Seijen H, Van\u00a0Hasselt H, Whiteson S, Wiering M (2009) A theoretical and empirical analysis of expected sarsa. In: 2009 ieee symposium on adaptive dynamic programming and reinforcement learning, pp. 177\u2013184. IEEE","DOI":"10.1109\/ADPRL.2009.4927542"},{"key":"6117_CR107","unstructured":"Vanseijen H, Sutton R (2015) A deeper look at planning as learning from replay. In: International conference on machine learning, pp. 2314\u20132322"},{"issue":"6","key":"6117_CR108","doi-asserted-by":"publisher","first-page":"2212","DOI":"10.1073\/pnas.1323479111","volume":"111","author":"VV Vasconcelos","year":"2014","unstructured":"Vasconcelos VV, Santos FC, Pacheco JM, Levin SA (2014) Climate policies under wealth inequality. 
Proc Natl Acad Sci 111(6):2212\u20132216","journal-title":"Proc Natl Acad Sci"},{"key":"6117_CR109","unstructured":"Vrancx P, Tuyls K, Westra R (2008) Switching dynamics of multi-agent learning. In: Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent systems, AAMAS 2008, pp. 307\u2013313"},{"issue":"1","key":"6117_CR110","doi-asserted-by":"publisher","first-page":"016101","DOI":"10.1103\/PhysRevE.80.016101","volume":"80","author":"J Wang","year":"2009","unstructured":"Wang J, Fu F, Wu T, Wang L (2009) Emergence of social cooperation in threshold public goods games with collective risk. Phys Rev E 80(1):016101","journal-title":"Phys Rev E"},{"key":"6117_CR111","doi-asserted-by":"publisher","first-page":"158","DOI":"10.1016\/j.comnet.2015.12.017","volume":"101","author":"S Wang","year":"2016","unstructured":"Wang S, Wan J, Zhang D, Li D, Zhang C (2016) Towards smart factory for industry 4.0: a self-organized multi-agent system with big data based feedback and coordination. Comput Netw 101:158\u2013168","journal-title":"Comput Netw"},{"issue":"3\u20134","key":"6117_CR112","doi-asserted-by":"publisher","first-page":"279","DOI":"10.1007\/BF00992698","volume":"8","author":"CJ Watkins","year":"1992","unstructured":"Watkins CJ, Dayan P (1992) Q-learning. Mach Learn 8(3\u20134):279\u2013292","journal-title":"Mach Learn"},{"key":"6117_CR113","unstructured":"Wiering MA (2000) Multi-agent reinforcement learning for traffic light control. In: Machine Learning: Proceedings of the Seventeenth International Conference (ICML\u20192000), pp. 1151\u20131158"},{"key":"6117_CR114","doi-asserted-by":"crossref","unstructured":"Wolpert DH (2006) Information theory\u2014the bridge connecting bounded rational game theory and statistical physics. In: Complex Engineered Systems, pp. 262\u2013290. 
Springer","DOI":"10.1007\/3-540-32834-3_12"},{"issue":"3","key":"6117_CR115","doi-asserted-by":"publisher","first-page":"036102","DOI":"10.1103\/PhysRevE.85.036102","volume":"85","author":"DH Wolpert","year":"2012","unstructured":"Wolpert DH, Harr\u00e9 M, Olbrich E, Bertschinger N, Jost J (2012) Hysteresis effects of changing the parameters of noncooperative games. Phys Rev E 85(3):036102. https:\/\/doi.org\/10.1103\/PhysRevE.85.036102","journal-title":"Phys Rev E"},{"key":"6117_CR116","unstructured":"Wunder M, Littman M, Babes M (2010) Classes of multiagent Q-learning dynamics with epsilon-greedy exploration. In: Proceedings of the 27th International Conference on Machine Learning, ICML\u201910, pp. 1167\u20131174"},{"key":"6117_CR117","unstructured":"Zhang K, Yang Z, Ba\u015far T (2019) Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635"},{"key":"6117_CR118","unstructured":"Zhang S, Sutton R (2018) A deeper look at experience replay. arXiv preprint arXiv:1712.01275"},{"key":"6117_CR119","unstructured":"Ziebart BD (2010) Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph.D. 
thesis"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-021-06117-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-021-06117-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-021-06117-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,5]],"date-time":"2023-11-05T06:50:28Z","timestamp":1699167028000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-021-06117-0"}},"subtitle":["Algorithmic foundations of temporal-difference learning dynamics"],"short-title":[],"issued":{"date-parts":[[2021,6,23]]},"references-count":119,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2022,2]]}},"alternative-id":["6117"],"URL":"https:\/\/doi.org\/10.1007\/s00521-021-06117-0","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"value":"0941-0643","type":"print"},{"value":"1433-3058","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,6,23]]},"assertion":[{"value":"12 November 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 May 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 June 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The author declares that he has no conflict of 
interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Python code to generate all reported results is available at Zenodo () and GitHub ().","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Code availability"}}]}}