{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T03:36:18Z","timestamp":1776310578664,"version":"3.50.1"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"11","license":[{"start":{"date-parts":[[2025,10,29]],"date-time":"2025-10-29T00:00:00Z","timestamp":1761696000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Commun. ACM"],"published-print":{"date-parts":[[2025,11]]},"abstract":"<jats:p>\n                    Reinforcement learning (RL) is a prominent machine learning technique used to optimize an agent\u2019s performance in potentially unknown environments. Despite its popularity and success, RL lacks safety guarantees, both during the learning phase and deployment. This paper reviews a runtime enforcement method called\n                    <jats:italic toggle=\"yes\">shielding<\/jats:italic>\n                    that ensures provable safety for RL. We describe the underlying models, the types of guarantees that can be delivered, and the process of computing shields. 
Furthermore, we describe several techniques for integrating shields into RL, discuss the advantages and potential drawbacks of this integration, and highlight the current challenges in shielded learning.\n                  <\/jats:p>","DOI":"10.1145\/3715958","type":"journal-article","created":{"date-parts":[[2025,10,20]],"date-time":"2025-10-20T16:16:31Z","timestamp":1760976991000},"page":"80-90","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Shields for Safe Reinforcement Learning"],"prefix":"10.1145","volume":"68","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5183-5452","authenticated-orcid":false,"given":"Bettina","family":"K\u00f6nighofer","sequence":"first","affiliation":[{"name":"Graz University of Technology, Graz, Austria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1411-5744","authenticated-orcid":false,"given":"Roderick","family":"Bloem","sequence":"additional","affiliation":[{"name":"Graz University of Technology, Graz, Austria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1318-8973","authenticated-orcid":false,"given":"Nils","family":"Jansen","sequence":"additional","affiliation":[{"name":"Ruhr-Universitat Bochum, Bochum, Germany"},{"name":"Radboud Universiteit, Nijmegen, Netherlands"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0978-8466","authenticated-orcid":false,"given":"Sebastian","family":"Junges","sequence":"additional","affiliation":[{"name":"Radboud Universiteit, Nijmegen, Netherlands"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6011-9925","authenticated-orcid":false,"given":"Stefan","family":"Pranger","sequence":"additional","affiliation":[{"name":"Graz University of Technology, Graz, Austria"}]}],"member":"320","published-online":{"date-parts":[[2025,10,29]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"crossref","unstructured":"Arulkumaran K. et al. Deep reinforcement learning: A brief survey. 
IEEE Signal Processing Magazine 34 6 (2017) 26\u201338.","DOI":"10.1109\/MSP.2017.2743240"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-25540-4_36"},{"key":"e_1_3_2_4_2","volume-title":"Principles of Model Checking","author":"Baier C.","year":"2008","unstructured":"Baier, C. and Katoen, J. Principles of Model Checking. MIT Press\u00a0(2008)."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.23919\/ACC.2019.8815233"},{"key":"e_1_3_2_6_2","doi-asserted-by":"crossref","unstructured":"Bloem R. et al. Shield synthesis: Runtime enforcement for reactive systems. In Intern. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS) vol. 9035 of Lecture Notes in Computer Science.\u00a0Springer\u00a0(2015) 533\u2013548.","DOI":"10.1007\/978-3-662-46681-0_51"},{"key":"e_1_3_2_7_2","first-page":"33","volume-title":"Symposium on Bridging the Gap Between AI and Reality (AISoLA), vol. 14380 of Lecture Notes in Computer Science.","author":"Brorholt A.H.","year":"2023","unstructured":"Brorholt, A.H. et al. Shielded reinforcement learning for hybrid systems. In Symposium on Bridging the Gap Between AI and Reality (AISoLA), vol. 14380 of Lecture Notes in Computer Science.\u00a0Springer\u00a0(2023), 33\u201354."},{"key":"e_1_3_2_8_2","doi-asserted-by":"crossref","unstructured":"Carr S. Safe reinforcement learning via shielding under partial observability. In Conf. on Artificial Intelligence (AAAI) AAAI Press (2023) 14748\u201314756.","DOI":"10.1609\/aaai.v37i12.26723"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10575-8"},{"key":"e_1_3_2_10_2","unstructured":"Dalrymple D. et al. Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. 
CoRR abs\/2405.06624 (2024)."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-46681-0_16"},{"issue":"2","key":"e_1_3_2_12_2","article-title":"Permissive controller synthesis for probabilistic systems","volume":"11","author":"Dr\u00e4ger K.","year":"2015","unstructured":"Dr\u00e4ger, K. et al. Permissive controller synthesis for probabilistic systems. Logical Methods in Computer Science 11, 2 (2015).","journal-title":"Logical Methods in Computer Science"},{"key":"e_1_3_2_13_2","unstructured":"Elsayed-Aly I. et al. Safe multi-agent reinforcement learning via shielding. In Intern. Conf. on Autonomous Agents and Multi-Agent Systems (AAMAS). ACM\u00a0(2021) 483\u2013491."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.5555\/3115971.3116162"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAC.2018.2876389"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12107"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.5555\/2789272.2886795"},{"key":"e_1_3_2_18_2","unstructured":"Giacobbe M. et al.\u00a0Shielding atari games with bounded prescience. In AAMAS \u201921.\u00a0\u00a0International Foundation for Autonomous Agents and Multiagent Systems\u00a0(May 2021) 1507\u20131509."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2022.3155229"},{"key":"e_1_3_2_20_2","first-page":"3:1","volume-title":"Intern. Conf. on Concurrency Theory (CONCUR), vol. 171 of LIPIcs.","author":"Jansen N.","year":"2020","unstructured":"Jansen, N. et al. Safe reinforcement learning using probabilistic shields (invited paper). In Intern. Conf. on Concurrency Theory (CONCUR), vol. 171 of LIPIcs.\u00a0Schloss Dagstuhl - Leibniz-Zentrum f\u00fcr Informatik\u00a0(2020), 3:1\u20133:16."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2021.3097660"},{"key":"e_1_3_2_22_2","doi-asserted-by":"crossref","unstructured":"Johnson T.T. et al. 
Real-time reachability for verified simplex design. ACM Transactions on Embedded Computing Systems (TECS) 15 2 (2016) 1\u201327.","DOI":"10.1145\/2723871"},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Junges S. et al. Safety-constrained reinforcement learning for MDPs. In Intern. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS) vol. 9636 of Lecture Notes in Computer Science.\u00a0Springer\u00a0(2016) 130\u2013146.","DOI":"10.1007\/978-3-662-49674-9_8"},{"key":"e_1_3_2_24_2","unstructured":"Kochdumper N. et al. Provably safe reinforcement learning via action projection using reachability analysis and polynomial zonotopes. CoRR abs\/2210.10691 (2022)."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10703-017-0276-9"},{"key":"e_1_3_2_26_2","first-page":"440","volume-title":"Intern. Conf. on Machine Learning, (ICML)","author":"Laud A.","year":"2003","unstructured":"Laud, A. and DeJong, G. The influence of reward on the speed of reinforcement learning: An analysis of shaping. In Intern. Conf. on Machine Learning, (ICML) AAAI Press, (2003), 440\u2013447."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0004-3702(02)00378-8"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-32157-3_8"},{"key":"e_1_3_2_29_2","unstructured":"Melcer D. Amato C. and Tripakis S. Shield decentralization for safe multi-agent reinforcement learning. In Annual Conf. on Neural Information Processing Systems\u00a0(2022)."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature14236"},{"key":"e_1_3_2_31_2","unstructured":"Moldovan T. M. and Abbeel P. Safe exploration in Markov decision processes. In Intern. Conf. on Machine Learning\u00a0(2012)."},{"key":"e_1_3_2_32_2","doi-asserted-by":"crossref","unstructured":"Mousavi S.S. Schukat M. and Howley E. Deep reinforcement learning: An overview. In Proceedings of SAI Intelligent Systems Conf. 
2016: Volume 2. Springer\u00a0(2018) 426\u2013440.","DOI":"10.1007\/978-3-319-56991-8_32"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-55754-6_6"},{"key":"e_1_3_2_34_2","doi-asserted-by":"crossref","unstructured":"Pranger S. et al. TEMPEST - synthesis tool for reactive systems and shields in probabilistic environments. In Automated Technology for Verification and Analysis 2021 vol. 12971 of Lecture Notes in Computer Science.\u00a0Springer\u00a0(2021) 222\u2013228.","DOI":"10.1007\/978-3-030-88885-5_15"},{"key":"e_1_3_2_35_2","first-page":"268:1","article-title":"Stable-baselines3: Reliable reinforcement learning implementations","volume":"22","author":"Raffin A.","year":"2021","unstructured":"Raffin, A. et al. Stable-baselines3: Reliable reinforcement learning implementations. J. of Machine Learning Research 22 (2021), 268:1\u2013268:8.","journal-title":"J. of Machine Learning Research"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1137\/0325013"},{"key":"e_1_3_2_37_2","unstructured":"Rodriguez A. et al. Shield synthesis for LTL modulo theories. CoRR abs\/2406.04184 (2024)."},{"key":"e_1_3_2_38_2","unstructured":"Schulman J. et al. Proximal policy optimization algorithms. CoRR abs\/1707.06347 (2017)."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503914"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.1998.712192"},{"issue":"3","key":"e_1_3_2_41_2","first-page":"200","article-title":"Implementing action mask in proximal policy optimization (PPO) algorithm","volume":"6","author":"Tang C.","year":"2020","unstructured":"Tang, C.\u00a0et al. Implementing action mask in proximal policy optimization (PPO) algorithm. 
Information & Communications Technology Express 6, 3 (2020), 200\u2013203.","journal-title":"Information & Communications Technology Express"}],"container-title":["Communications of the ACM"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3715958","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,29]],"date-time":"2025-10-29T17:56:48Z","timestamp":1761760608000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3715958"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,29]]},"references-count":40,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2025,11]]}},"alternative-id":["10.1145\/3715958"],"URL":"https:\/\/doi.org\/10.1145\/3715958","relation":{},"ISSN":["0001-0782","1557-7317"],"issn-type":[{"value":"0001-0782","type":"print"},{"value":"1557-7317","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,29]]},"assertion":[{"value":"2024-07-19","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-29","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}