{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T19:19:25Z","timestamp":1773775165839,"version":"3.50.1"},"reference-count":48,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2022,6,23]],"date-time":"2022-06-23T00:00:00Z","timestamp":1655942400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,6,23]],"date-time":"2022-06-23T00:00:00Z","timestamp":1655942400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001831","name":"Technische Universiteit Delft","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100001831","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100005320","name":"Xidian University","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100005320","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2023,3]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety-cost is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, it can be risky to set constraints only on the expectation, neglecting the tail of the distribution, which might have prohibitively large values. In this paper, we propose a method called Worst-Case Soft Actor Critic for safe RL that approximates the distribution of accumulated safety-costs to achieve risk control. 
More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety constraint, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can compute policies whose worst-case performance satisfies the constraints. We investigate two ways to estimate the safety-cost distribution, namely a Gaussian approximation and a quantile regression algorithm. On the one hand, the Gaussian approximation is simple and easy to implement, but may underestimate the safety cost; on the other hand, the quantile regression leads to more conservative behavior. The empirical analysis shows that the quantile regression method achieves excellent results in complex safety-constrained environments, showing good risk control.<\/jats:p>","DOI":"10.1007\/s10994-022-06187-8","type":"journal-article","created":{"date-parts":[[2022,6,23]],"date-time":"2022-06-23T22:26:02Z","timestamp":1656023162000},"page":"859-887","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":27,"title":["Safety-constrained reinforcement learning with a distributional safety critic"],"prefix":"10.1007","volume":"112","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9686-2697","authenticated-orcid":false,"given":"Qisong","family":"Yang","sequence":"first","affiliation":[]},{"given":"Thiago D.","family":"Sim\u00e3o","sequence":"additional","affiliation":[]},{"given":"Simon H.","family":"Tindemans","sequence":"additional","affiliation":[]},{"given":"Matthijs T. J.","family":"Spaan","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,6,23]]},"reference":[{"key":"6187_CR1","unstructured":"Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained policy optimization. Proceedings of the 34th international conference on machine learning (pp. 22-31). PMLR."},{"key":"6187_CR2","unstructured":"Altman, E. (1999). 
Constrained Markov decision processes (Vol. 7). CRC Press."},{"key":"6187_CR3","unstructured":"Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. Proceedings of the 34th international conference on machine learning (pp. 449-458). PMLR."},{"key":"6187_CR4","doi-asserted-by":"crossref","unstructured":"Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods (Vol. 1). Academic press.","DOI":"10.1016\/B978-0-12-093480-5.50005-2"},{"key":"6187_CR5","unstructured":"Bharadhwaj, H., Kumar, A., Rhinehart, N., Levine, S., Shkurti, F., & Garg, A. (2021). Conservative safety critics for exploration. 9th international conference on learning representations (pp. 1-9)."},{"issue":"3","key":"6187_CR6","doi-asserted-by":"publisher","first-page":"207","DOI":"10.1016\/j.sysconle.2004.08.007","volume":"54","author":"VS Borkar","year":"2005","unstructured":"Borkar, V. S. (2005). An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3), 207\u2013213.","journal-title":"Systems & Control Letters"},{"issue":"1","key":"6187_CR7","first-page":"6070","volume":"18","author":"Y Chow","year":"2017","unstructured":"Chow, Y., Ghavamzadeh, M., Janson, L., & Pavone, M. (2017). Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1), 6070\u20136120.","journal-title":"The Journal of Machine Learning Research"},{"key":"6187_CR8","unstructured":"Dabney, W., Ostrovski, G., Silver, D., & Munos, R. (2018). Implicit quantile networks for distributional reinforcement learning. Proceedings of the 35th international conference on machine learning (pp. 1096-1105)."},{"key":"6187_CR9","doi-asserted-by":"crossref","unstructured":"Dabney, W., Rowland, M., Bellemare, M. G., & Munos, R. (2018). Distributional reinforcement learning with quantile regression. Thirty-Second AAAI Conference on Artificial Intelligence (pp. 2892-2901). 
AAAI Press.","DOI":"10.1609\/aaai.v32i1.11791"},{"key":"6187_CR10","unstructured":"Duan, J., Guan, Y., Li, S. E., Ren, Y., & Cheng, B. (2020). Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors. arXiv preprint arxiv:2001.02811."},{"key":"6187_CR11","doi-asserted-by":"crossref","unstructured":"Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., & Hester, T. (2021). Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 2419-2468.","DOI":"10.1007\/s10994-021-05961-4"},{"key":"6187_CR12","unstructured":"Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. Proceedings of the 34th international conference on machine learning (pp. 1126-1135). PMLR."},{"issue":"1","key":"6187_CR13","first-page":"1437","volume":"16","author":"J Garc\u00eda","year":"2015","unstructured":"Garc\u00eda, J., & Fern\u00e1ndez, F. (2015). A comprehensive survey on safe reinforcement learning. The Journal of Machine Learning Research, 16(1), 1437\u20131480.","journal-title":"The Journal of Machine Learning Research"},{"key":"6187_CR14","unstructured":"Ha, S., Xu, P., Tan, Z., Levine, S., & Tan, J. (2020). Learning to walk in the real world with minimal human effort. arXiv preprint arxiv:2002.08550."},{"key":"6187_CR15","unstructured":"Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th international conference on machine learning (pp. 1861-1870). PMLR."},{"key":"6187_CR16","unstructured":"Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., & Levine, S. (2018). Soft actor-critic algorithms and applications. arXiv preprint arxiv:1812.05905."},{"key":"6187_CR17","doi-asserted-by":"crossref","unstructured":"Huber, P. J. (1964). Robust estimation of a location parameter. 
The Annals of Mathematical Statistics, 73-101.","DOI":"10.1214\/aoms\/1177703732"},{"key":"6187_CR18","doi-asserted-by":"crossref","unstructured":"Kamran, D., Lopez, C. F., Lauer, M., & Stiller, C. (2020). Risk-aware high-level decisions for automated driving at occluded intersections with reinforcement learning. IEEE intelligent vehicles symposium, IV (pp. 1205-1212). IEEE.","DOI":"10.1109\/IV47402.2020.9304606"},{"key":"6187_CR19","doi-asserted-by":"crossref","unstructured":"Keramati, R., Dann, C., Tamkin, A., & Brunskill, E. (2020). Being optimistic to be conservative: Quickly learning a CVaR policy. Proceedings of the AAAI conference on artificial intelligence (pp. 4436-4443).","DOI":"10.1609\/aaai.v34i04.5870"},{"issue":"6","key":"6187_CR20","first-page":"70","volume":"2","author":"V Khokhlov","year":"2016","unstructured":"Khokhlov, V. (2016). Conditional value-at-risk for elliptical distributions. Evropsk\u1ef3 \u010dasopis ekonomiky a managementu, 2(6), 70\u201379.","journal-title":"Evropsk\u1ef3 \u010dasopis ekonomiky a managementu"},{"issue":"1","key":"6187_CR21","doi-asserted-by":"publisher","first-page":"79","DOI":"10.1214\/aoms\/1177729694","volume":"22","author":"S Kullback","year":"1951","unstructured":"Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79\u201386.","journal-title":"The Annals of Mathematical Statistics"},{"key":"6187_CR22","first-page":"1179","volume":"33","author":"A Kumar","year":"2020","unstructured":"Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33, 1179\u20131191.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6187_CR23","unstructured":"Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., & Wierstra, D. (2015). Continuous control with deep reinforcement learning. 
4th international conference on learning representations (pp. 1-10). ICLR."},{"key":"6187_CR24","doi-asserted-by":"crossref","unstructured":"Liu, Y., Ding, J., & Liu, X. (2020). IPO: Interior-point policy optimization under constraints. Proceedings of the AAAI conference on artificial intelligence (pp. 4940-4947).","DOI":"10.1609\/aaai.v34i04.5932"},{"key":"6187_CR25","unstructured":"Ma, X., Zhang, Q., Xia, L., Zhou, Z., Yang, J., & Zhao, Q. (2020). Distributional soft actor critic for risk sensitive learning. arXiv preprint arxiv:2004.14547."},{"issue":"7540","key":"6187_CR26","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","volume":"518","author":"V Mnih","year":"2015","unstructured":"Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529\u2013533.","journal-title":"Nature"},{"key":"6187_CR27","unstructured":"Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010). Parametric return density estimation for reinforcement learning. Twenty-sixth conference on uncertainty in artificial intelligence (pp. 368-375). AUAI Press."},{"key":"6187_CR28","doi-asserted-by":"publisher","first-page":"257","DOI":"10.1016\/0024-3795(82)90112-4","volume":"48","author":"I Olkin","year":"1982","unstructured":"Olkin, I., & Pukelsheim, F. (1982). The distance between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48, 257\u2013263.","journal-title":"Linear Algebra and its Applications"},{"key":"6187_CR29","doi-asserted-by":"crossref","unstructured":"Pecka, M., & Svoboda, T. (2014). Safe exploration techniques for reinforcement learning\u2013an overview. First international workshop on modelling and simulation for autonomous systems (pp. 357-375). 
Springer.","DOI":"10.1007\/978-3-319-13823-7_31"},{"key":"6187_CR30","unstructured":"Rakelly, K., Zhou, A., Finn, C., Levine, S., & Quillen, D. (2019). Efficient off-policy meta-reinforcement learning via probabilistic context variables. Proceedings of the 36th international conference on machine learning (Vol. 97, pp. 5331-5340). PMLR."},{"key":"6187_CR31","unstructured":"Ray, A., Achiam, J., & Amodei, D. (2019). Benchmarking safe exploration in deep reinforcement learning. Retrieved from https:\/\/cdn.openai.com\/safexp-short.pdf"},{"issue":"3","key":"6187_CR32","doi-asserted-by":"publisher","first-page":"21","DOI":"10.21314\/JOR.2000.038","volume":"2","author":"RT Rockafellar","year":"2000","unstructured":"Rockafellar, R. T., & Uryasev, S. (2000). Optimization of conditional value-at-risk. Journal of Risk, 2(3), 21\u201341.","journal-title":"Journal of Risk"},{"key":"6187_CR33","unstructured":"Rowland, M., Dadashi, R., Kumar, S., Munos, R., Bellemare, M. G., & Dabney, W. (2019). Statistics and samples in distributional reinforcement learning. Proceedings of the 36th international conference on machine learning (pp. 5528-5536)."},{"key":"6187_CR34","unstructured":"Roy, J., Girgis, R., Romoff, J., Bacon, P.-L., & Pal, C. (2021). Direct behavior specification via constrained reinforcement learning. arXiv preprint arxiv:2112.12228."},{"key":"6187_CR35","unstructured":"Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. Proceedings of the 32nd international conference on machine learning (pp. 1889-1897). JMLR.org."},{"key":"6187_CR36","unstructured":"Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy optimization algorithms. arXiv preprint arxiv:1707.06347."},{"key":"6187_CR37","unstructured":"Sim\u00e3o, T. D., Jansen, N., & Spaan, M. T. J. (2021). AlwaysSafe: Reinforcement learning without safety constraint violations during training. 
Proceedings of the 20th international conference on autonomous agents and multiagent systems (AAMAS) (pp. 1226-1235). IFAAMAS."},{"issue":"4","key":"6187_CR38","doi-asserted-by":"publisher","first-page":"794","DOI":"10.2307\/3213832","volume":"19","author":"MJ Sobel","year":"1982","unstructured":"Sobel, M. J. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4), 794\u2013802.","journal-title":"Journal of Applied Probability"},{"key":"6187_CR39","unstructured":"Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (Vol. 2). MIT Press."},{"issue":"1","key":"6187_CR40","first-page":"361","volume":"17","author":"A Tamar","year":"2016","unstructured":"Tamar, A., Di Castro, D., & Mannor, S. (2016). Learning the variance of the reward-to-go. The Journal of Machine Learning Research, 17(1), 361\u2013396.","journal-title":"The Journal of Machine Learning Research"},{"key":"6187_CR41","unstructured":"Tang, Y. C., Zhang, J., & Salakhutdinov, R. (2020). Worst cases policy gradients. 3rd annual conference on robot learning (pp. 1078-1093). PMLR."},{"key":"6187_CR42","unstructured":"Th\u00e9ate, T., Wehenkel, A., Bolland, A., Louppe, G., & Ernst, D. (2021). Distributional reinforcement learning with unconstrained monotonic neural networks. arXiv preprint arxiv:2106.03228."},{"key":"6187_CR43","unstructured":"Urp\u00ed, N. A., Curi, S., & Krause, A. (2021). Risk-averse offline reinforcement learning. 9th international conference on learning representations."},{"key":"6187_CR46","unstructured":"Yang, T.-Y., Rosca, J., Narasimhan, K., & Ramadge, P. J. (2020). Projection-based constrained policy optimization. 8th international conference on learning representations."},{"key":"6187_CR48","unstructured":"Yang, Q., Sim\u00e3o, T. D., Jansen, N., Tindemans, S. H., & Spaan, M. T. J. (2022). Training and transferring safe policies in reinforcement learning. 
AAMAS 2022 Workshop on Adaptive Learning Agents."},{"key":"6187_CR45","doi-asserted-by":"crossref","unstructured":"Yang, Q., Sim\u00e3o, T. D., Tindemans, S. H., & Spaan, M. T. J. (2021). WCSAC: Worst-case soft actor critic for safety-constrained reinforcement learning. Thirty-Fifth AAAI conference on artificial intelligence (pp. 10639\u201310646). AAAI Press.","DOI":"10.1609\/aaai.v35i12.17272"},{"key":"6187_CR44","unstructured":"Yang, D., Zhao, L., Lin, Z., Qin, T., Bian, J., & Liu, T.-Y. (2019). Fully parameterized quantile function for distributional reinforcement learning. Advances in Neural Information Processing Systems 32 (pp. 6193-6202). Curran Associates, Inc."},{"key":"6187_CR47","unstructured":"Zheng, L., & Ratliff, L. (2020). Constrained upper confidence reinforcement learning. Proceedings of the 2nd conference on learning for dynamics and control (pp. 620-629). online: PMLR."}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-022-06187-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-022-06187-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-022-06187-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,3,6]],"date-time":"2023-03-06T19:11:17Z","timestamp":1678129877000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-022-06187-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,23]]},"references-count":48,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,3]]}},"alternative-id":["6187"],"URL":"https:\/\/doi.org\/10.1007\/s10994-022-06187-8","relation":{},
"ISSN":["0885-6125","1573-0565"],"issn-type":[{"value":"0885-6125","type":"print"},{"value":"1573-0565","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,23]]},"assertion":[{"value":"17 November 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 February 2022","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 May 2022","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 June 2022","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest\/Competing interests"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to participate"}},{"value":"Not applicable.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}}]}}
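The abstract in this record hinges on one quantity: the conditional Value-at-Risk (CVaR) of the accumulated safety-cost, estimated either via a Gaussian approximation (closed-form tail expectation, the standard result for elliptical distributions) or via quantiles. A minimal stdlib-only sketch of both estimates follows, assuming a risk level alpha in (0, 1); the helpers `gaussian_cvar` and `empirical_cvar` are illustrative names, not the paper's implementation (which uses learned distributional critics):

```python
from statistics import NormalDist

def gaussian_cvar(mean: float, std: float, alpha: float) -> float:
    """CVaR under a Gaussian approximation of the safety-cost distribution.

    For N(mean, std^2), the standard closed form is
    CVaR_alpha = mean + std * pdf(z_alpha) / (1 - alpha),
    where z_alpha is the alpha-quantile of the standard normal (the VaR).
    """
    z = NormalDist().inv_cdf(alpha)          # standard-normal alpha-quantile
    return mean + std * NormalDist().pdf(z) / (1.0 - alpha)

def empirical_cvar(samples, alpha: float) -> float:
    """Sample-based CVaR: mean of the worst (1 - alpha) fraction of costs.

    A crude stand-in for the quantile-regression critic: given enough
    samples of accumulated safety-cost, average the tail above the
    alpha-quantile.
    """
    s = sorted(samples)
    k = int(alpha * len(s))                  # first index of the upper tail
    tail = s[k:] or [s[-1]]                  # guard against an empty tail
    return sum(tail) / len(tail)

# A risk-neutral constraint bounds only the mean; the CVaR bound is stricter.
print(gaussian_cvar(5.0, 2.0, 0.95))         # worst-5% expected cost for N(5, 2^2), > 5
print(empirical_cvar(range(1, 101), 0.95))   # mean of the 5 worst of 100 costs
```

Consistent with the abstract's observation, when the true cost distribution has a heavier tail than a Gaussian, `gaussian_cvar` understates the risk, while the empirical tail average behaves more conservatively as sample size grows.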