{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T09:56:39Z","timestamp":1775037399915,"version":"3.50.1"},"reference-count":59,"publisher":"Springer Science and Business Media LLC","issue":"31","license":[{"start":{"date-parts":[[2025,1,18]],"date-time":"2025-01-18T00:00:00Z","timestamp":1737158400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,1,18]],"date-time":"2025-01-18T00:00:00Z","timestamp":1737158400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"University of Innsbruck and Medical University of Innsbruck"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput &amp; Applic"],"published-print":{"date-parts":[[2025,11]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>In this paper, we provide a novel view upon <jats:italic>reinforcement learning<\/jats:italic> (<jats:italic>RL<\/jats:italic> for short). In particular, we are interested in applications of RL in use cases, where average rewards may be nonzero. While RL methodologies have been extensively researched upon, this particular application area has only received scarce attention in the Literature. In part, our motivation stems from applications in Operation Research (OR for short), where it is typically the case that rewards are profit derived. Similar use cases can be found in more general applications in economics. Based on a principled study of the mathematical background of discounted reinforcement learning we establish a novel adaptation of standard RL, dubbed <jats:italic>Average Reward Adjusted Discounted Reinforcement Learning<\/jats:italic> (<jats:italic>ARAL<\/jats:italic> for short). Our approach stems from revisiting the Laurent Series expansion of the discounted state value and a subsequent reformulation of the target function guiding the learning process. While the theoretical advance is arguably incremental, we provide ample experimental evidence that the thus obtained novel RL methodology compares favorable to well-established techniques like Q-learning or R-learning.<\/jats:p>","DOI":"10.1007\/s00521-024-10620-5","type":"journal-article","created":{"date-parts":[[2025,1,18]],"date-time":"2025-01-18T14:42:18Z","timestamp":1737211338000},"page":"25663-25694","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Average reward adjusted discounted reinforcement learning"],"prefix":"10.1007","volume":"37","author":[{"given":"Manuel","family":"Schneckenreither","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Georg","family":"Moser","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,1,18]]},"reference":[{"key":"10620_CR1","volume-title":"Reinforcement learning: an introduction","author":"RS Sutton","year":"1998","unstructured":"Sutton RS, Barto AG (1998) Reinforcement learning: an introduction, vol 1. MIT press Cambridge, Cambridge"},{"key":"10620_CR2","unstructured":"Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347"},{"key":"10620_CR3","unstructured":"Schwartz A (1993) Thinking locally to act globally: a novel approach to reinforcement learning. In: Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, pp. 906\u2013911, Lawrence Erlbaum Associates Hillsdale, NJ"},{"key":"10620_CR4","doi-asserted-by":"publisher","first-page":"159","DOI":"10.1023\/A:1018064306595","volume":"22","author":"S Mahadevan","year":"1996","unstructured":"Mahadevan S (1996) Average reward reinforcement learning: foundations, algorithms, and empirical results. Mach Learn 22:159\u2013195","journal-title":"Mach Learn"},{"key":"10620_CR5","unstructured":"Mahadevan S (1996) Optimality criteria in reinforcement learning. In: Proceedings of the AAAI Fall Symposium on Learning Complex Behaviors in Adaptive Intelligent Systems"},{"key":"10620_CR6","unstructured":"Mahadevan S (1996) An average-reward reinforcement learning algorithm for computing bias-optimal policies. In: AAAI\/IAAI, Vol. 1, pp. 875\u2013880"},{"key":"10620_CR7","unstructured":"Mahadevan S (1996) Sensitive discount optimality: Unifying discounted and average reward reinforcement learning. In: ICML, pp. 328\u2013336"},{"key":"10620_CR8","unstructured":"Mahadevan S, Marchalleck N, Das TK, Gosavi A (1997) Self-improving factory simulation using continuous-time average-reward reinforcement learning. In: Machine Learning-International Workshop Then Conference, pp. 202\u2013210. Morgan Kaufmann Publishers, Inc"},{"issue":"3","key":"10620_CR9","doi-asserted-by":"publisher","first-page":"313","DOI":"10.1007\/s002919900032","volume":"22","author":"WH Zijm","year":"2000","unstructured":"Zijm WH (2000) Towards intelligent manufacturing planning and control systems. OR-Spektrum 22(3):313\u2013345","journal-title":"OR-Spektrum"},{"issue":"4","key":"10620_CR10","doi-asserted-by":"publisher","first-page":"471","DOI":"10.1007\/s00291-004-0170-x","volume":"26","author":"J Rohde","year":"2004","unstructured":"Rohde J (2004) Hierarchical supply chain planning using artificial neural networks to anticipate base-level outcomes. OR Spectrum 26(4):471\u2013492","journal-title":"OR Spectrum"},{"key":"10620_CR11","unstructured":"Hax AC, Meal HC (1973) Hierarchical integration of production planning and scheduling. Report, DTIC Document"},{"key":"10620_CR12","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2022.108765","author":"M Schneckenreither","year":"2022","unstructured":"Schneckenreither M, Haeussler S, Peiro J (2022) Average reward adjusted deep reinforcement learning for order release planning in manufacturing. Knowl Based Syst. https:\/\/doi.org\/10.1016\/j.knosys.2022.108765","journal-title":"Knowl. Based Syst."},{"key":"10620_CR13","unstructured":"Boyan J, Moore A (1994) Generalization in reinforcement learning: safely approximating the value function. Advances in neural information processing systems, 7"},{"key":"10620_CR14","unstructured":"Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, King\u2019s College"},{"key":"10620_CR15","doi-asserted-by":"crossref","unstructured":"Schwartz A (1993) A reinforcement learning method for maximizing undiscounted rewards. In: Proceedings of the Tenth International Conference on Machine Learning, vol. 298, pp. 298\u2013305","DOI":"10.1016\/B978-1-55860-307-3.50045-9"},{"issue":"7540","key":"10620_CR16","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","volume":"518","author":"V Mnih","year":"2015","unstructured":"Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529","journal-title":"Nature"},{"key":"10620_CR17","unstructured":"Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928\u20131937"},{"key":"10620_CR18","unstructured":"Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T et al (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815"},{"key":"10620_CR19","doi-asserted-by":"publisher","DOI":"10.4324\/9780203842423","volume-title":"Finance: the basics","author":"E Banks","year":"2010","unstructured":"Banks E (2010) Finance: the basics. Routledge, Oxfordshire"},{"key":"10620_CR20","doi-asserted-by":"publisher","first-page":"1994","DOI":"10.1002\/9780470316887","volume-title":"Markov decision processes","author":"ML Puterman","year":"1994","unstructured":"Puterman ML (1994) Markov decision processes. Wiley, Hoboken, p 1994"},{"key":"10620_CR21","doi-asserted-by":"crossref","unstructured":"Schneckenreither M, Haeussler S (2018) Reinforcement learning methods for operations research applications: the order release problem. In: International Conference on Machine Learning, Optimization, and Data Science, pp. 545\u2013559. Springer","DOI":"10.1007\/978-3-030-13709-0_46"},{"key":"10620_CR22","doi-asserted-by":"crossref","unstructured":"Gijsbrechts J, Boute RN, Van\u00a0Mieghem JA, Zhang D (2018) Can deep reinforcement learning improve inventory management? Performance and implementation of dual sourcing-mode problems. Performance and Implementation of Dual Sourcing-Mode Problems (Dec 17, 2018)","DOI":"10.2139\/ssrn.3302881"},{"key":"10620_CR23","unstructured":"Balaji B, Bell-Masterson J, Bilgin E, Damianou A, Garcia PM, Jain A, Luo R, Maggiar A, Narayanaswamy B, Ye C (2019) Orl: Reinforcement learning benchmarks for online stochastic optimization problems. arXiv preprint arXiv:1911.10641"},{"key":"10620_CR24","unstructured":"Nazari M, Oroojlooy A, Snyder L, Tak\u00e1c M (2018) Reinforcement learning for solving the vehicle routing problem. In: Advances in Neural Information Processing Systems, pp. 9839\u20139849"},{"key":"10620_CR25","doi-asserted-by":"crossref","unstructured":"Vera JM, Abad AG (2019) Deep reinforcement learning for routing a heterogeneous fleet of vehicles. arXiv preprint arXiv:1912.03341","DOI":"10.1109\/LA-CCI47412.2019.9037042"},{"key":"10620_CR26","unstructured":"Bello I, Pham H, Le QV, Norouzi M, Bengio S (2016) Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940"},{"key":"10620_CR27","unstructured":"Kool W, van Hoof H, Welling M (2018) Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475"},{"issue":"4","key":"10620_CR28","doi-asserted-by":"publisher","first-page":"949","DOI":"10.1016\/j.dss.2008.03.007","volume":"45","author":"SK Chaharsooghi","year":"2008","unstructured":"Chaharsooghi SK, Heydari J, Zegordi SH (2008) A reinforcement learning model for supply chain ordering management: an application to the beer game. Decis Support Syst 45(4):949\u2013959","journal-title":"Decis Support Syst"},{"key":"10620_CR29","unstructured":"Oroojlooyjadid A, Nazari M, Snyder L, Tak\u00e1\u010d M (2017) A deep q-network for the beer game: a reinforcement learning algorithm to solve inventory optimization problems. arXiv preprint arXiv:1708.05924"},{"key":"10620_CR30","doi-asserted-by":"crossref","unstructured":"Merkle F, Haeussler S, Blossey G, Schneckenreither M (2023) Reinforcement learning for multi-vehicle systems: a structured review. Hawaii International Conference on System Sciences","DOI":"10.1145\/3651781.3651828"},{"key":"10620_CR31","doi-asserted-by":"crossref","unstructured":"Schneckenreither M, Windmueller S, Haeussler S (2021) Smart short term capacity planning: a reinforcement learning approach. In: IFIP International Conference on Advances in Production Management Systems, pp. 258\u2013266. Springer","DOI":"10.1007\/978-3-030-85874-2_27"},{"issue":"2","key":"10620_CR32","doi-asserted-by":"publisher","first-page":"366","DOI":"10.1214\/aoms\/1177697700","volume":"40","author":"BL Miller","year":"1969","unstructured":"Miller BL, Veinott AF (1969) Discrete dynamic programming with a small interest rate. Ann Math Stat 40(2):366\u2013370","journal-title":"Ann Math Stat"},{"key":"10620_CR33","doi-asserted-by":"crossref","unstructured":"Ma X, Tang X, Xia L, Yang J, Zhao Q (2021) Average-reward reinforcement learning with trust region methods. arXiv preprint arXiv:2106.03442","DOI":"10.24963\/ijcai.2021\/385"},{"key":"10620_CR34","unstructured":"Haarnoja T, Zhou A, Abbeel P, Levine S (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: International Conference on Machine Learning, pp. 1861\u20131870. PMLR"},{"key":"10620_CR35","unstructured":"Hafner D, Pasukonis J, Ba J, Lillicrap T (2023) Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104"},{"key":"10620_CR36","doi-asserted-by":"crossref","unstructured":"Hessel M, Modayil J, Van\u00a0Hasselt H, Schaul T, Ostrovski G, Dabney W, Horgan D, Piot B, Azar M, Silver D (2018) Rainbow: combining improvements in deep reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32","DOI":"10.1609\/aaai.v32i1.11796"},{"issue":"1\u20132","key":"10620_CR37","doi-asserted-by":"publisher","first-page":"177","DOI":"10.1016\/S0004-3702(98)00002-2","volume":"100","author":"P Tadepalli","year":"1998","unstructured":"Tadepalli P, Ok D (1998) Model-based average reward reinforcement learning. Artif Intell 100(1\u20132):177\u2013224","journal-title":"Artif Intell"},{"key":"10620_CR38","unstructured":"Mahadevan S, Theocharous G (1998) Optimizing production manufacturing using reinforcement learning. In: FLAIRS Conference, pp. 372\u2013377"},{"issue":"4","key":"10620_CR39","doi-asserted-by":"publisher","first-page":"560","DOI":"10.1287\/mnsc.45.4.560","volume":"45","author":"TK Das","year":"1999","unstructured":"Das TK, Gosavi A, Mahadevan S, Marchalleck N (1999) Solving semi-markov decision problems using average reward reinforcement learning. Manage Sci 45(4):560\u2013574","journal-title":"Manage Sci"},{"issue":"1","key":"10620_CR40","doi-asserted-by":"publisher","first-page":"90","DOI":"10.1108\/09576060410512365","volume":"15","author":"ST Enns","year":"2004","unstructured":"Enns ST, Suwanruji P (2004) Work load responsive adjustment of planned lead times. J Manuf Technol Manag 15(1):90\u2013100","journal-title":"J Manuf Technol Manag"},{"key":"10620_CR41","doi-asserted-by":"crossref","unstructured":"Bertsekas DP, Bertsekas DP, Bertsekas DP, Bertsekas DP (1995) Dynamic Programming and Optimal Control, vol 1. Athena scientific Belmont, MA, Massachusetts Institute of Technology","DOI":"10.1007\/978-3-030-54621-2_440-1"},{"issue":"3","key":"10620_CR42","doi-asserted-by":"publisher","first-page":"681","DOI":"10.1137\/S0363012999361974","volume":"40","author":"J Abounadi","year":"2001","unstructured":"Abounadi J, Bertsekas D, Borkar VS (2001) Learning algorithms for markov decision processes with average cost. SIAM J Control Optim 40(3):681\u2013698","journal-title":"SIAM J Control Optim"},{"issue":"3","key":"10620_CR43","doi-asserted-by":"publisher","first-page":"654","DOI":"10.1016\/S0377-2217(02)00874-3","volume":"155","author":"A Gosavi","year":"2004","unstructured":"Gosavi A (2004) Reinforcement learning for long-run average cost. Eur J Oper Res 155(3):654\u2013674","journal-title":"Eur J Oper Res"},{"key":"10620_CR44","doi-asserted-by":"crossref","unstructured":"Auer P, Ortner R (2007) Logarithmic online regret bounds for undiscounted reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 49\u201356","DOI":"10.7551\/mitpress\/7503.003.0011"},{"key":"10620_CR45","doi-asserted-by":"crossref","unstructured":"Yang S, Gao Y, An B, Wang H, Chen X (2016) Efficient average reward reinforcement learning using constant shifting values. In: AAAI, pp. 2258\u20132264","DOI":"10.1609\/aaai.v30i1.10285"},{"key":"10620_CR46","volume-title":"Dynamic programming and Markov processes","author":"RA Howard","year":"1960","unstructured":"Howard RA (1960) Dynamic programming and Markov processes. Wiley, London"},{"key":"10620_CR47","doi-asserted-by":"publisher","first-page":"719","DOI":"10.1214\/aoms\/1177704593","volume":"344","author":"D Blackwell","year":"1962","unstructured":"Blackwell D (1962) Discrete dynamic programming. Ann Math Stat 344:719\u2013726","journal-title":"Ann Math Stat"},{"issue":"5","key":"10620_CR48","doi-asserted-by":"publisher","first-page":"1635","DOI":"10.1214\/aoms\/1177697379","volume":"40","author":"AF Veinott","year":"1969","unstructured":"Veinott AF (1969) Discrete dynamic programming with sensitive discount optimality criteria. Ann Math Stat 40(5):1635\u20131660. https:\/\/doi.org\/10.1214\/aoms\/1177697379","journal-title":"Ann Math Stat"},{"key":"10620_CR49","first-page":"1114","volume":"95","author":"W Zhang","year":"1995","unstructured":"Zhang W, Dietterich TG (1995) A reinforcement learning approach to job-shop scheduling. IJCAI 95:1114\u20131120","journal-title":"IJCAI"},{"key":"10620_CR50","unstructured":"Boyan JA, Moore AW (1995) Generalization in reinforcement learning: safely approximating the value function. In: Advances in Neural Information Processing Systems, pp. 369\u2013376"},{"issue":"7540","key":"10620_CR51","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","volume":"518","author":"V Mnih","year":"2015","unstructured":"Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529","journal-title":"Nature"},{"key":"10620_CR52","unstructured":"Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971"},{"issue":"1","key":"10620_CR53","doi-asserted-by":"publisher","first-page":"136","DOI":"10.1239\/jap\/1032192558","volume":"35","author":"M Haviv","year":"1998","unstructured":"Haviv M, Puterman ML (1998) Bias optimality in controlled queueing systems. J Appl Probab 35(1):136\u2013150","journal-title":"J Appl Probab"},{"key":"10620_CR54","doi-asserted-by":"publisher","first-page":"834","DOI":"10.1109\/TSMC.1983.6313077","volume":"5","author":"AG Barto","year":"1983","unstructured":"Barto AG, Sutton RS, Anderson CW (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans Syst Man Cybern 5:834\u2013846","journal-title":"IEEE Trans Syst Man Cybern"},{"key":"10620_CR55","unstructured":"Florian RV (2007) Correct equations for the dynamics of the cart-pole system. Center for Cognitive and Neural Studies (Coneural), Romania, 63"},{"key":"10620_CR56","unstructured":"Moore AW (1990) Efficient memory-based learning for robot control. Technical report, University of Cambridge, Computer Laboratory"},{"issue":"1","key":"10620_CR57","doi-asserted-by":"publisher","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","volume":"57","author":"Y Benjamini","year":"1995","unstructured":"Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol) 57(1):289\u2013300","journal-title":"J Roy Stat Soc: Ser B (Methodol)"},{"key":"10620_CR58","doi-asserted-by":"crossref","unstructured":"Watkins CJ, Dayan P (1992) Q-learning. Machine learning, 8(3\u20134):279\u2013292","DOI":"10.1023\/A:1022676722315"},{"key":"10620_CR59","unstructured":"Tsitsiklis JN, Van\u00a0Roy B (1997) Analysis of temporal-diffference learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1075\u20131081"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-024-10620-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-024-10620-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-024-10620-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,19]],"date-time":"2025-10-19T05:02:34Z","timestamp":1760850154000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-024-10620-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,18]]},"references-count":59,"journal-issue":{"issue":"31","published-print":{"date-parts":[[2025,11]]}},"alternative-id":["10620"],"URL":"https:\/\/doi.org\/10.1007\/s00521-024-10620-5","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"value":"0941-0643","type":"print"},{"value":"1433-3058","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,18]]},"assertion":[{"value":"17 November 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 October 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 January 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to participate"}}]}}