{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,9]],"date-time":"2026-06-09T15:48:14Z","timestamp":1781020094150,"version":"3.54.1"},"reference-count":40,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2022,9,23]],"date-time":"2022-09-23T00:00:00Z","timestamp":1663891200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,9,23]],"date-time":"2022-09-23T00:00:00Z","timestamp":1663891200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100007601","name":"Horizon 2020","doi-asserted-by":"publisher","award":["956123"],"award-info":[{"award-number":["956123"]}],"id":[{"id":"10.13039\/501100007601","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100018461","name":"Silicon Austria Labs","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100018461","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100008332","name":"Technische Universit\u00e4t Graz","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100008332","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Lamarr Security Research"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Innovations Syst Softw Eng"],"published-print":{"date-parts":[[2023,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Besides the recent impressive results on reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given, and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability to not violate the safety specification within the next <jats:italic>k<\/jats:italic> steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches compute exhaustively the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well-suited for high-level planning problems where the time between decisions can be used for safety computations and it is sustainable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game <jats:sc>Snake<\/jats:sc>. The game represents a high-level planning problem that requires fast decisions and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.<\/jats:p>","DOI":"10.1007\/s11334-022-00480-4","type":"journal-article","created":{"date-parts":[[2022,9,23]],"date-time":"2022-09-23T12:03:40Z","timestamp":1663934620000},"page":"379-394","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":26,"title":["Online shielding for reinforcement learning"],"prefix":"10.1007","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5183-5452","authenticated-orcid":false,"given":"Bettina","family":"K\u00f6nighofer","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Julian","family":"Rudolf","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Alexander","family":"Palmisano","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Martin","family":"Tappler","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Roderick","family":"Bloem","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2022,9,23]]},"reference":[{"key":"480_CR1","doi-asserted-by":"crossref","unstructured":"Alshiekh M, Bloem R, Ehlers R, et\u00a0al (2018) Safe reinforcement learning via shielding. In: AAAI. AAAI Press","DOI":"10.1609\/aaai.v32i1.11797"},{"key":"480_CR2","unstructured":"Amodei D, Olah C, Steinhardt J, et\u00a0al (2016) Concrete problems in AI safety. arXiv:1606.06565"},{"key":"480_CR3","doi-asserted-by":"publisher","first-page":"630","DOI":"10.1007\/978-3-030-25540-4_36","volume-title":"CAV 2019, Part I, LNCS","author":"G Avni","year":"2019","unstructured":"Avni G, Bloem R, Chatterjee K et al (2019) Run-time optimization for learned controllers through quantitative games. In: Dillig I, Tasiran S (eds) CAV 2019, Part I, LNCS, vol 11561. Springer, Cham, pp 630\u2013649. https:\/\/doi.org\/10.1007\/978-3-030-25540-4_36"},{"key":"480_CR4","volume-title":"Principles of model checking","author":"C Baier","year":"2008","unstructured":"Baier C, Katoen J (2008) Principles of model checking. MIT Press, Cambridge"},{"key":"480_CR5","doi-asserted-by":"crossref","unstructured":"Bloem R, K\u00f6nighofer B, K\u00f6nighofer R, et\u00a0al (2015) Shield synthesis: - runtime enforcement for reactive systems. In: TACAS, LNCS, vol 9035. Springer, pp 533\u2013548","DOI":"10.1007\/978-3-662-46681-0_51"},{"key":"480_CR6","doi-asserted-by":"publisher","unstructured":"Carr S, Jansen N, Junges S, et\u00a0al (2022) Safe reinforcement learning via shielding for pomdps. https:\/\/doi.org\/10.48550\/arXiv.2204.00755,","DOI":"10.48550\/arXiv.2204.00755"},{"key":"480_CR7","doi-asserted-by":"crossref","unstructured":"Cheng R, Orosz G, Murray RM, et\u00a0al (2019) End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In: AAAI","DOI":"10.1609\/aaai.v33i01.33013387"},{"key":"480_CR8","unstructured":"Chow Y, Nachum O, Duenez-Guzman E, et\u00a0al (2018) A Lyapunov-based approach to safe reinforcement learning. In: NIPS, pp 8103\u20138112"},{"key":"480_CR9","doi-asserted-by":"crossref","unstructured":"Dehnert C, Junges S, Katoen J, et\u00a0al (2017) A storm is coming: A modern probabilistic model checker. In: CAV (2), LNCS, vol 10427. Springer, pp 592\u2013600","DOI":"10.1007\/978-3-319-63390-9_31"},{"key":"480_CR10","unstructured":"Elsayed-Aly I, Bharadwaj S, Amato C, et\u00a0al (2021) Safe multi-agent reinforcement learning via shielding. In: Dignum F, Lomuscio A, Endriss U, et\u00a0al (eds) AAMAS \u201921: 20th international conference on autonomous agents and multiagent systems, virtual event, United Kingdom, May 3-7, 2021. ACM, pp 483\u2013491, https:\/\/dl.acm.org\/doi\/10.5555\/3463952.3464013"},{"key":"480_CR11","doi-asserted-by":"crossref","unstructured":"Fulton N, Platzer A (2018) Safe reinforcement learning via formal methods: Toward safe control through proof and learning. In: AAAI. AAAI Press","DOI":"10.1609\/aaai.v32i1.12107"},{"key":"480_CR12","doi-asserted-by":"publisher","unstructured":"Fulton N, Platzer A (2019) Verifiably safe off-model reinforcement learning. In: Vojnar T, Zhang L (eds) Tools and algorithms for the construction and analysis of systems - 25th international conference, TACAS 2019, held as part of the European joint conferences on theory and practice of software, ETAPS 2019, Prague, Czech Republic, April 6\u201311, 2019, Proceedings, Part I, Lecture Notes in Computer Science, vol 11427. Springer, pp 413\u2013430, https:\/\/doi.org\/10.1007\/978-3-030-17462-0_28","DOI":"10.1007\/978-3-030-17462-0_28"},{"issue":"1","key":"480_CR13","first-page":"1437","volume":"16","author":"J Garc\u0131a","year":"2015","unstructured":"Garc\u0131a J, Fern\u00e1ndez F (2015) A comprehensive survey on safe reinforcement learning. J Mach Learn Res 16(1):1437\u20131480","journal-title":"J Mach Learn Res"},{"key":"480_CR14","unstructured":"Giacobbe M, Hasanbeig M, Kroening D, et\u00a0al (2021) Shielding atari games with bounded prescience. In: Dignum F, Lomuscio A, Endriss U, et\u00a0al (eds) AAMAS \u201921: 20th International conference on autonomous agents and multiagent systems, virtual event, United Kingdom, May 3\u20137, 2021. ACM, pp 1507\u20131509, https:\/\/dl.acm.org\/doi\/10.5555\/3463952.3464141"},{"key":"480_CR15","doi-asserted-by":"crossref","unstructured":"Hahn EM, Perez M, Schewe S, et\u00a0al (2019) Omega-regular objectives in model-free reinforcement learning. In: TACAS (1), LNCS, vol 11427. Springer, pp 395\u2013412","DOI":"10.1007\/978-3-030-17462-0_27"},{"key":"480_CR16","unstructured":"Hasanbeig M, Abate A, Kroening D (2019) Certified reinforcement learning with logic guidance. arXiv:1902.00778"},{"key":"480_CR17","unstructured":"Hasanbeig M, Abate A, Kroening D (2020) Cautious reinforcement learning with logical constraints. In: Proceedings of the 19th international conference on autonomous agents and multiagent systems, AAMAS 2020, Auckland, New Zealand, May 9-13, 2020. International Foundation for Autonomous Agents and Multiagent Systems, pp 483\u2013491, https:\/\/dl.acm.org\/doi\/abs\/10.5555\/3398761.3398821"},{"key":"480_CR18","doi-asserted-by":"publisher","unstructured":"Hunt N, Fulton N, Magliacane S, et\u00a0al (2021) Verifiably safe exploration for end-to-end reinforcement learning. In: Bogomolov S, Jungers RM (eds) HSCC \u201921: 24th ACM International Conference on Hybrid Systems: Computation and Control, Nashville, Tennessee, May 19-21, 2021. ACM, pp 14:1\u201314:11, https:\/\/doi.org\/10.1145\/3447928.3456653","DOI":"10.1145\/3447928.3456653"},{"issue":"8","key":"480_CR19","doi-asserted-by":"publisher","first-page":"2589","DOI":"10.3390\/s21082589","volume":"21","author":"TB Ionescu","year":"2021","unstructured":"Ionescu TB (2021) Adaptive simplex architecture for safe, real-time robot path planning. Sensors 21(8):2589","journal-title":"Sensors"},{"key":"480_CR20","doi-asserted-by":"publisher","unstructured":"Jansen N, K\u00f6nighofer B, Junges S, et\u00a0al (2020) Safe reinforcement learning using probabilistic shields (invited paper). In: Konnov I, Kov\u00e1cs L (eds) CONCUR, LIPIcs, vol 171. Schloss Dagstuhl - Leibniz-Zentrum f\u00fcr Informatik, pp 3:1\u20133:16, https:\/\/doi.org\/10.4230\/LIPIcs.CONCUR.2020.3","DOI":"10.4230\/LIPIcs.CONCUR.2020.3"},{"key":"480_CR21","doi-asserted-by":"crossref","unstructured":"Katoen JP (2016) The probabilistic model checking landscape. In: LICS. ACM, pp 31\u201345","DOI":"10.1145\/2933575.2934574"},{"key":"480_CR22","doi-asserted-by":"publisher","unstructured":"K\u00f6nighofer B, Lorber F, Jansen N et al (2020) Shield synthesis for reinforcement learning. ISoLA, Part I. pp 290\u2013306. https:\/\/doi.org\/10.1007\/978-3-030-61362-4_16","DOI":"10.1007\/978-3-030-61362-4_16"},{"key":"480_CR23","doi-asserted-by":"publisher","unstructured":"K\u00f6nighofer B, Rudolf J, Palmisano A, et\u00a0al (2021) Online shielding for stochastic systems. In: Dutle A, Moscato MM, Titolo L, et\u00a0al (eds) NASA formal methods\u201413th international symposium, NFM 2021, Virtual Event, May 24-28, 2021, proceedings, lecture notes in computer science, vol 12673. Springer, Berlin, pp 231\u2013248, https:\/\/doi.org\/10.1007\/978-3-030-76384-8_15","DOI":"10.1007\/978-3-030-76384-8_15"},{"key":"480_CR24","doi-asserted-by":"crossref","unstructured":"Kwiatkowska MZ (2003) Model checking for probability and time: from theory to practice. In: LICS. IEEE CS, p 351","DOI":"10.1109\/LICS.2003.1210075"},{"key":"480_CR25","doi-asserted-by":"crossref","unstructured":"Kwiatkowska MZ, Norman G, Parker D (2011) PRISM 4.0: Verification of probabilistic real-time systems. In: CAV, LNCS, vol 6806. Springer, pp 585\u2013591","DOI":"10.1007\/978-3-642-22110-1_47"},{"key":"480_CR26","doi-asserted-by":"publisher","unstructured":"Li S, Bastani O (2020) Robust model predictive shielding for safe reinforcement learning with stochastic dynamics. In: ICRA. IEEE, pp 7166\u20137172, https:\/\/doi.org\/10.1109\/ICRA40945.2020.9196867","DOI":"10.1109\/ICRA40945.2020.9196867"},{"issue":"2","key":"480_CR27","doi-asserted-by":"publisher","first-page":"255","DOI":"10.1007\/s10994-016-5565-9","volume":"105","author":"H Mao","year":"2016","unstructured":"Mao H, Chen Y, Jaeger M et al (2016) Learning deterministic probabilistic automata from a model checking perspective. Mach Learn 105(2):255\u2013299. https:\/\/doi.org\/10.1007\/s10994-016-5565-9","journal-title":"Mach Learn"},{"key":"480_CR28","doi-asserted-by":"crossref","unstructured":"Ohnishi M, Wang L, Notomista G, et\u00a0al (2019) Barrier-certified adaptive reinforcement learning with applications to brushbot navigation. IEEE Trans Robot 35:1\u201320","DOI":"10.1109\/TRO.2019.2920206"},{"key":"480_CR29","unstructured":"Paszke A, Gross S, Massa F, et\u00a0al (2019) Pytorch: An imperative style, high-performance deep learning library. In: Wallach HM, Larochelle H, Beygelzimer A, et\u00a0al (eds) Advances in neural information processing systems 32. Curran Associates, Inc., p 8024\u20138035"},{"key":"480_CR30","doi-asserted-by":"crossref","unstructured":"Pnueli A (1977) The temporal logic of programs. In: Foundations of Computer Science, IEEE, pp 46\u201357","DOI":"10.1109\/SFCS.1977.32"},{"key":"480_CR31","doi-asserted-by":"publisher","unstructured":"Pranger S, K\u00f6nighofer B, Posch L, et\u00a0al (2021a) TEMPEST - synthesis tool for reactive systems and shields in probabilistic environments. In: Hou Z, Ganesh V (eds) Automated Technology for Verification and Analysis\u201419th International Symposium, ATVA 2021, Gold Coast, QLD, Australia, October 18-22, 2021, Proceedings, Lecture Notes in Computer Science, vol 12971. Springer, pp 222\u2013228, https:\/\/doi.org\/10.1007\/978-3-030-88885-5_15","DOI":"10.1007\/978-3-030-88885-5_15"},{"key":"480_CR32","doi-asserted-by":"publisher","unstructured":"Pranger S, K\u00f6nighofer B, Tappler M, et\u00a0al (2021b) Adaptive shielding under uncertainty. In: 2021 American Control Conference, ACC 2021, New Orleans, LA, USA, May 25-28, 2021. IEEE, pp 3467\u20133474, https:\/\/doi.org\/10.23919\/ACC50511.2021.9482889","DOI":"10.23919\/ACC50511.2021.9482889"},{"key":"480_CR33","unstructured":"Sadigh D, Sastry S, Seshia SA, et\u00a0al (2016) Planning for autonomous cars that leverage effects on human actions. In: Robotics: Science and Systems"},{"issue":"7","key":"480_CR34","doi-asserted-by":"publisher","first-page":"1405","DOI":"10.1007\/s10514-018-9746-1","volume":"42","author":"D Sadigh","year":"2018","unstructured":"Sadigh D, Landolfi N, Sastry SS et al (2018) Planning for cars that coordinate with people: leveraging effects on human actions for planning and active information gathering over human internal state. Auton Robots 42(7):1405\u20131426","journal-title":"Auton Robots"},{"issue":"7587","key":"480_CR35","doi-asserted-by":"publisher","first-page":"484","DOI":"10.1038\/nature16961","volume":"529","author":"D Silver","year":"2016","unstructured":"Silver D, Huang A, Maddison CJ et al (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484","journal-title":"Nature"},{"key":"480_CR36","volume-title":"Reinforcement learning: an introduction","author":"RS Sutton","year":"1998","unstructured":"Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge"},{"key":"480_CR37","doi-asserted-by":"publisher","unstructured":"Tappler M, Aichernig BK, Bacci G et al (2021) $$L^*$$-based learning of markov decision processes (extended version). Formal Aspects Comput 33(4):575\u2013615. https:\/\/doi.org\/10.1007\/s00165-021-00536-5","DOI":"10.1007\/s00165-021-00536-5"},{"key":"480_CR38","doi-asserted-by":"crossref","unstructured":"Tappler M, Mu\u0161kardin E, Aichernig BK, et\u00a0al (2021b) Active model learning of stochastic reactive systems. In: SEFM 2021, in press","DOI":"10.1007\/978-3-030-92124-8_27"},{"key":"480_CR39","doi-asserted-by":"crossref","unstructured":"Wang A, Kurutach T, Liu K, et\u00a0al (2019) Learning robotic manipulation through visual planning and acting. arXiv preprint arXiv:1905.04411","DOI":"10.15607\/RSS.2019.XV.074"},{"key":"480_CR40","unstructured":"Zhang W, Bastani O (2019) MAMPS: safe multi-agent reinforcement learning via model predictive shielding. arXiv:1910.12639"}],"container-title":["Innovations in Systems and Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11334-022-00480-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11334-022-00480-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11334-022-00480-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,11]],"date-time":"2023-11-11T01:05:49Z","timestamp":1699664749000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11334-022-00480-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,23]]},"references-count":40,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12]]}},"alternative-id":["480"],"URL":"https:\/\/doi.org\/10.1007\/s11334-022-00480-4","relation":{},"ISSN":["1614-5046","1614-5054"],"issn-type":[{"value":"1614-5046","type":"print"},{"value":"1614-5054","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,9,23]]},"assertion":[{"value":"16 November 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 August 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 September 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}