{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,25]],"date-time":"2026-01-25T02:38:52Z","timestamp":1769308732671,"version":"3.49.0"},"reference-count":57,"publisher":"Springer Science and Business Media LLC","issue":"19","license":[{"start":{"date-parts":[[2022,11,11]],"date-time":"2022-11-11T00:00:00Z","timestamp":1668124800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,11,11]],"date-time":"2022-11-11T00:00:00Z","timestamp":1668124800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","award":["EP\/R001227\/1"],"award-info":[{"award-number":["EP\/R001227\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100007601","name":"Horizon 2020","doi-asserted-by":"publisher","award":["758824"],"award-info":[{"award-number":["758824"]}],"id":[{"id":"10.13039\/501100007601","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput & Applic"],"published-print":{"date-parts":[[2025,7]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Policy gradient methods have become one of the most popular classes of algorithms for multi-agent reinforcement learning. A key challenge, however, that is not addressed by many of these methods is multi-agent credit assignment: assessing an agent\u2019s contribution to the overall performance, which is crucial for learning good policies. 
We propose a novel algorithm called Dr.Reinforce that explicitly tackles this by combining difference rewards with policy gradients to allow for learning decentralized policies when the reward function is known. By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the <jats:italic>Q<\/jats:italic>-function as done by counterfactual multi-agent policy gradients (COMA), a state-of-the-art difference rewards method. For applications where the reward function is unknown, we show the effectiveness of a version of Dr.Reinforce that learns an additional reward network that is used to estimate the difference rewards.<\/jats:p>","DOI":"10.1007\/s00521-022-07960-5","type":"journal-article","created":{"date-parts":[[2022,11,11]],"date-time":"2022-11-11T09:06:42Z","timestamp":1668157602000},"page":"13163-13186","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Difference rewards policy gradients"],"prefix":"10.1007","volume":"37","author":[{"given":"Jacopo","family":"Castellini","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sam","family":"Devlin","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Frans A.","family":"Oliehoek","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rahul","family":"Savani","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,11,11]]},"reference":[{"key":"7960_CR1","doi-asserted-by":"publisher","first-page":"320","DOI":"10.1007\/s10458-008-9046-9","volume":"17","author":"AK Agogino","year":"2008","unstructured":"Agogino AK, Tumer K (2008) Analyzing and visualizing multiagent rewards in dynamic and stochastic domains. 
Auton Agent Multi-Agent Syst 17:320\u2013338","journal-title":"Auton Agent Multi-Agent Syst"},{"key":"7960_CR2","first-page":"3","volume":"8","author":"CE Bonferroni","year":"1936","unstructured":"Bonferroni CE (1936) Teoria statistica delle classi e calcolo delle probabilit\u00e0. Pubbl del R Ist Super di Sci Econ e Commer di Firenze 8:3\u201362","journal-title":"Pubbl del R Ist Super di Sci Econ e Commer di Firenze"},{"key":"7960_CR3","doi-asserted-by":"crossref","unstructured":"Bottou L (1998) Online learning and stochastic approximations","DOI":"10.1017\/CBO9780511569920.003"},{"key":"7960_CR4","unstructured":"Boutilier C (1996) Planning, learning and coordination in multiagent decision processes. In: Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge. Morgan Kaufmann Publishers Inc., TARK \u201996, pp. 195\u2013210"},{"key":"7960_CR5","first-page":"156","volume":"38","author":"L Busoniu","year":"2008","unstructured":"Busoniu L, Babuska R, De Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics. Part C Appl Rev 38:156\u2013172","journal-title":"Part C Appl Rev"},{"issue":"1","key":"7960_CR6","doi-asserted-by":"publisher","first-page":"427","DOI":"10.1109\/TII.2012.2219061","volume":"9","author":"Y Cao","year":"2013","unstructured":"Cao Y, Yu W, Ren W et al (2013) An overview of recent progress in the study of distributed multi-agent coordination. IEEE Trans Industr Inf 9(1):427\u2013438","journal-title":"IEEE Trans Industr Inf"},{"key":"7960_CR7","unstructured":"Castellini J, Oliehoek FA, Savani R, et\u00a0al (2019) The representational capacity of action-value networks for multi-agent reinforcement learning. In: Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems. 
International Foundation for Autonomous Agents and Multiagent Systems, AAMAS\u201919, pp 1862\u20131864"},{"key":"7960_CR8","unstructured":"Castellini J, Devlin S, Oliehoek FA, et\u00a0al (2021) Difference rewards policy gradients. In: proceedings of the 20th international conference on autonomous agents and multiagent systems. international foundation for autonomous agents and multiagent systems, AAMAS\u201921, pp 1475\u20131477"},{"key":"7960_CR9","unstructured":"Chang YH, Ho T, Kaelbling LP (2003) All learning is local: multi-agent learning in global reward games. In: advances in neural information processing systems 16. NIPS\u201903, MIT Press, p 807\u2013814"},{"key":"7960_CR10","unstructured":"Chung J, Gulcehre C, Cho KH, et\u00a0al (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS\u201914 workshop on deep learning and representation learning. NIPS\u201914"},{"key":"7960_CR11","unstructured":"Claus C, Boutilier C (1998) The dynamics of reinforcement learning in cooperative multiagent systems. In: Proceedings of the 15th\/10th AAAI conference on artificial intelligence\/innovative applications of artificial intelligence. american association for artificial intelligence, AAAI\u201998\/IAAI\u201998, pp 746\u2013752"},{"key":"7960_CR12","unstructured":"Colby MK, Curran W, Rebhuhn C, et\u00a0al (2014) Approximating difference evaluations with local knowledge. In: Proceedings of the 13th international conference on autonomous agents and multiagent systems. international foundation for autonomous agents and multiagent systems, AAMAS\u201914, pp 1577\u20131578"},{"key":"7960_CR13","unstructured":"Colby MK, Curran W, Tumer K (2015) Approximating difference evaluations with local information. In: Proceedings of the 14th international conference on autonomous agents and multiagent systems. 
international foundation for autonomous agents and multiagent systems, AAMAS\u201915, pp 1659\u20131660"},{"key":"7960_CR14","unstructured":"Devlin S, Kudenko D (2011) Theoretical considerations of potential-based reward shaping for multi-agent systems. In: AAMAS. international foundation for autonomous agents and multiagent systems, pp 225\u2013232"},{"key":"7960_CR15","unstructured":"Devlin S, Kudenko D (2012) Dynamic potential-based reward shaping. In: Proceedings of the 11th international conference on autonomous agents and multiagent systems. international foundation for autonomous agents and multiagent systems, AAMAS\u201912, pp 433\u2013440"},{"key":"7960_CR16","unstructured":"Devlin S, Yliniemi L, Kudenko D, et\u00a0al (2014) Potential-based difference rewards for multiagent reinforcement learning. In: Proceedings of the 13th international conference on autonomous agents and multiagent systems. international foundation for autonomous agents and multiagent systems, AAMAS\u201914, pp 165\u2013172"},{"key":"7960_CR17","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1007\/s10458-015-9292-6","volume":"30","author":"A Eck","year":"2015","unstructured":"Eck A, Soh LK, Devlin S et al (2015) Potential-based reward shaping for finite horizon online pomdp planning. Auton Agent Multi-Agent Syst 30:403\u2013445","journal-title":"Auton Agent Multi-Agent Syst"},{"key":"7960_CR18","doi-asserted-by":"crossref","unstructured":"Foerster JN, Farquhar G, Afouras T, et\u00a0al (2018) Counterfactual multi-agent policy gradients. In: Proceedings of the 32nd AAAI conference on artificial intelligence. AAAI Press, AAAI\u201918, pp 2974\u20132982","DOI":"10.1609\/aaai.v32i1.11794"},{"key":"7960_CR19","unstructured":"Fujimoto S, van Hoof H, Meger D (2018) Addressing function approximation error in actor-critic methods. In: Proceedings of the 36th international conference on machine learning. 
PMLR, ICML\u201918, pp 1587\u20131596"},{"key":"7960_CR20","first-page":"1471","volume":"5","author":"E Greensmith","year":"2004","unstructured":"Greensmith E, Bartlett PL, Baxter J (2004) Variance reduction techniques for gradient estimates in reinforcement learning. J Mach Learn Res 5:1471\u20131530","journal-title":"J Mach Learn Res"},{"key":"7960_CR21","unstructured":"Guestrin C, Lagoudakis MG, Parr R (2002) Coordinated reinforcement learning. In: Proceedings of the 19th international conference on machine learning. morgan kaufmann publishers Inc., ICML\u201902, pp 227\u2013234"},{"key":"7960_CR22","doi-asserted-by":"crossref","unstructured":"Gupta JK, Egorov M, Kochenderfer MJ (2017) Cooperative multi-agent control using deep reinforcement learning. autonomous agents and multi-agent systems. Springer, Cham pp 66\u201383","DOI":"10.1007\/978-3-319-71682-4_5"},{"key":"7960_CR23","unstructured":"Hansen EA, Bernstein DS, Zilberstein S (2004) Dynamic programming for partially observable stochastic games. In: Proceedings of the 19th AAAI conference on artificial intelligence. AAAI Press, AAAI\u201904, pp 709\u2013715"},{"key":"7960_CR24","doi-asserted-by":"publisher","first-page":"750","DOI":"10.1007\/s10458-019-09421-1","volume":"33","author":"P Hernandez-Leal","year":"2019","unstructured":"Hernandez-Leal P, Kartal B, Taylor ME (2019) A survey and critique of multiagent deep reinforcement learning. Auton Agent Multi-Agent Syst 33:750\u2013797","journal-title":"Auton Agent Multi-Agent Syst"},{"key":"7960_CR25","unstructured":"Jaderberg M, Mnih V, Czarnecki WM, et\u00a0al (2016) Reinforcement learning with unsupervised auxiliary tasks. arXiv abs\/1611.05397"},{"issue":"1","key":"7960_CR26","doi-asserted-by":"publisher","first-page":"237","DOI":"10.1613\/jair.301","volume":"4","author":"LP Kaelbling","year":"1996","unstructured":"Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: A survey. 
J Artif Intell Res 4(1):237\u2013285","journal-title":"J Artif Intell Res"},{"issue":"4","key":"7960_CR27","doi-asserted-by":"publisher","first-page":"1143","DOI":"10.1137\/S0363012901385691","volume":"42","author":"VR Konda","year":"2003","unstructured":"Konda VR, Tsitsiklis JN (2003) On actor-critic algorithms. SIAM J Control Optim 42(4):1143\u20131166","journal-title":"SIAM J Control Optim"},{"key":"7960_CR28","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1016\/j.neucom.2016.01.031","volume":"190","author":"L Kraemer","year":"2016","unstructured":"Kraemer L, Banerjee B (2016) Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190:82\u201394","journal-title":"Neurocomputing"},{"key":"7960_CR29","unstructured":"Lowe R, Wu Y, Tamar A, et\u00a0al (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in neural information processing systems 30. NIPS\u201917, curran associates, Inc., p 6379\u20136390"},{"issue":"1","key":"7960_CR30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1017\/S0269888912000057","volume":"27","author":"L Matignon","year":"2012","unstructured":"Matignon L, Laurent GJ, Le Fort-Piat N (2012) Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems. Knowl Eng Rev 27(1):1\u201331","journal-title":"Knowl Eng Rev"},{"issue":"7540","key":"7960_CR31","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","volume":"518","author":"V Mnih","year":"2015","unstructured":"Mnih V, Kavukcuoglu K, Silver D et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529\u2013533","journal-title":"Nature"},{"key":"7960_CR32","unstructured":"Mnih V, Badia AP, Mirza M, et\u00a0al (2016) Asynchronous methods for deep reinforcement learning. In: Proceedings 33rd international conference on machine learning. 
PMLR, ICML\u201916, pp 1928\u20131937"},{"key":"7960_CR33","unstructured":"Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th international conference on international conference on machine learning. omnipress, ICML\u201910, pp 807\u2013814"},{"key":"7960_CR34","unstructured":"Ng AY, Harada D, Russell S (1999) Policy invariance under reward transformations: Theory and application to reward shaping. In: Proceedings of the 16th international conference on machine learning. Morgan Kaufmann, ICML\u201999, pp 278\u2013287"},{"key":"7960_CR35","unstructured":"Nguyen DT, Kumar A, Lau HC (2018) Credit assignment for collective multiagent rl with global rewards. In: Advances in neural information processing systems 32. NIPS\u201918, curran associates, Inc., p 8113\u20138124"},{"key":"7960_CR36","unstructured":"Nissim R, Brafman RI (2012) Multi-agent a* for parallel and distributed systems. In: Proceedings of the 11th international conference on autonomous agents and multiagent systems. international foundation for autonomous agents and multiagent systems, AAMAS\u201912, pp 1265\u20131266"},{"key":"7960_CR37","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-28929-8","volume-title":"A concise introduction to decentralized POMDPs","author":"FA Oliehoek","year":"2016","unstructured":"Oliehoek FA, Amato C (2016) A concise introduction to decentralized POMDPs, 1st edn. Springer Publishing Company, Incorporated","edition":"1"},{"key":"7960_CR38","unstructured":"Papoudakis G, Christianos F, Rahman A, et\u00a0al (2019) Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv abs\/1906.04737"},{"key":"7960_CR39","unstructured":"Peshkin L, Kim KE, Meuleau N, et\u00a0al (2000) Learning to cooperate via policy search. In: Proceedings of the 16th conference on uncertainty in artificial intelligence. 
Morgan Kaufmann Publishers Inc., UAI\u201900, pp 489\u2014496"},{"key":"7960_CR40","unstructured":"Van\u00a0der Pol E, Oliehoek FA (2016) Coordinated deep reinforcement learners for traffic light control. In: NIPS\u201916 workshop on learning, inference and control of multi-agent systems. NIPS\u201916"},{"key":"7960_CR41","unstructured":"Proper S, Tumer K (2012) Modeling difference rewards for multiagent learning. In: Proceedings of the 11th international conference on autonomous agents and multiagent systems. international foundation for autonomous agents and multiagent systems, AAMAS\u201912, pp 1397\u20131398"},{"key":"7960_CR42","unstructured":"Romoff J, Henderson P, Piche A, et\u00a0al (2018) Reward estimation for variance reduction in deep reinforcement learning. In: Proceedings of the 6th international conference on learning representations, ICLR\u201918"},{"key":"7960_CR43","unstructured":"Samvelyan M, Rashid T, Schr\u00f6eder\u00a0de Witt C, et\u00a0al (2019) The starcraft multi-agent challenge. arXiv abs\/1902.04043"},{"key":"7960_CR44","unstructured":"Srinivasan S, Lanctot M, Zambaldi V, et\u00a0al (2018) Actor-critic policy optimization in partially observable multiagent environments. In: Advances in neural information processing systems 32. NIPS\u201918, Curran Associates Inc., p 3426\u20133439"},{"issue":"1","key":"7960_CR45","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1007\/BF00115009","volume":"3","author":"RS Sutton","year":"1988","unstructured":"Sutton RS (1988) Learning to predict by the methods of temporal differences. Mach Learn 3(1):9\u201344","journal-title":"Mach Learn"},{"key":"7960_CR46","volume-title":"Introduction to reinforcement learning","author":"RS Sutton","year":"1998","unstructured":"Sutton RS, Barto AG (1998) Introduction to reinforcement learning, 1st edn. 
MIT Press","edition":"1"},{"key":"7960_CR47","unstructured":"Sutton RS, McAllester DA, Singh SP, et\u00a0al (2000) Policy gradient methods for reinforcement learning with function approximation. In: Advances in neural information processing systems 12. NIPS\u201900, MIT Press, p 1057\u20131063"},{"key":"7960_CR48","doi-asserted-by":"crossref","unstructured":"Tan M (1993) Multi-agent reinforcement learning: Independent vs. cooperative agents. In: Proceedings of the 10th international conference on machine learning. Morgan Kaufmann Publishers Inc., ICML\u201993, pp 330\u2013337","DOI":"10.1016\/B978-1-55860-307-3.50049-6"},{"key":"7960_CR49","doi-asserted-by":"crossref","unstructured":"Tumer K, Agogino A (2007) Distributed agent-based air traffic flow management. In: Proceedings of the 6th international conference on autonomous agents and multiagent systems. association for computing machinery, AAMAS\u201907","DOI":"10.1145\/1329125.1329434"},{"key":"7960_CR50","unstructured":"Vinyals O, Ewalds T, Bartunov S, et\u00a0al (2017) StarCraft II: A new challenge for reinforcement learning. arXiv abs\/1708.04782"},{"key":"7960_CR51","unstructured":"Wang Y, Han B, Wang T, et\u00a0al (2020) Off-policy multi-agent decomposed policy gradients. arXiv abs\/2007.12322"},{"issue":"3","key":"7960_CR52","doi-asserted-by":"publisher","first-page":"229","DOI":"10.1007\/BF00992696","volume":"8","author":"RJ Williams","year":"1992","unstructured":"Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8(3):229\u201356","journal-title":"Mach Learn"},{"key":"7960_CR53","unstructured":"Wolpert DH, Tumer K (1999) An introduction to collective intelligence. Tech. 
rep., NASA-ARC-IC-99-63, Nasa Ames Research Center"},{"key":"7960_CR54","doi-asserted-by":"publisher","first-page":"265","DOI":"10.1142\/S0219525901000188","volume":"4","author":"DH Wolpert","year":"2001","unstructured":"Wolpert DH, Tumer K (2001) Optimal payoff functions for members of collectives. Adv Complex Syst 4:265\u2013280","journal-title":"Adv Complex Syst"},{"issue":"5","key":"7960_CR55","doi-asserted-by":"publisher","first-page":"10026","DOI":"10.3390\/s150510026","volume":"15","author":"D Ye","year":"2015","unstructured":"Ye D, Zhang M, Yang Y (2015) A multi-agent framework for packet routing in wireless sensor networks. Sensors 15(5):10026\u201347","journal-title":"Sensors"},{"key":"7960_CR56","doi-asserted-by":"crossref","unstructured":"Yliniemi L, Tumer K (2014) Multi-objective multiagent credit assignment through difference rewards in reinforcement learning. In: Asia-Pacific conference on simulated evolution and learning. Springer International Publishing, pp 407\u2013418","DOI":"10.1007\/978-3-319-13563-2_35"},{"key":"7960_CR57","doi-asserted-by":"crossref","unstructured":"Zhang Y, Zavlanos MM (2019) Distributed off-policy actor-critic reinforcement learning with policy consensus. 
arXiv abs\/1903.09255","DOI":"10.1109\/CDC40024.2019.9029969"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-022-07960-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-022-07960-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-022-07960-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,27]],"date-time":"2025-06-27T08:28:35Z","timestamp":1751012915000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-022-07960-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,11]]},"references-count":57,"journal-issue":{"issue":"19","published-print":{"date-parts":[[2025,7]]}},"alternative-id":["7960"],"URL":"https:\/\/doi.org\/10.1007\/s00521-022-07960-5","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"value":"0941-0643","type":"print"},{"value":"1433-3058","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,11,11]]},"assertion":[{"value":"17 November 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 October 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 November 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no conflict of interest to declare that are relevant to the content of this 
article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of Interest"}}]}}