{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,3,3]],"date-time":"2025-03-03T05:26:54Z","timestamp":1740979614194,"version":"3.38.0"},"reference-count":69,"publisher":"SAGE Publications","issue":"9","license":[{"start":{"date-parts":[[2023,7,25]],"date-time":"2023-07-25T00:00:00Z","timestamp":1690243200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of Robotics Research"],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:p> State-of-the art deep reinforcement learning has enabled autonomous agents to learn complex strategies from scratch on many problems including continuous control tasks. Deep Q-networks (DQN) and deep deterministic policy gradients (DDPGs) are two such algorithms which are both based on Q-learning. They therefore all share function approximation, off-policy behavior, and bootstrapping\u2014the constituents of the so-called deadly triad that is known for its convergence issues. We suggest to take a graph perspective on the data an agent has collected and show that the structure of this data graph is linked to the degree of divergence that can be expected. We further demonstrate that a subset of states and actions from the data graph can be selected such that the resulting finite graph can be interpreted as a simplified Markov decision process (MDP) for which the Q-values can be computed analytically. These Q-values are lower bounds for the Q-values in the original problem, and enforcing these bounds in temporal difference learning can help to prevent soft divergence. We show further effects on a simulated continuous control task, including improved sample efficiency, increased robustness toward hyperparameters as well as a better ability to cope with limited replay memory. Finally, we demonstrate the benefits of our method on a large robotic benchmark with an industrial assembly task and approximately 60\u00a0h of real-world interaction. <\/jats:p>","DOI":"10.1177\/02783649231185165","type":"journal-article","created":{"date-parts":[[2023,7,25]],"date-time":"2023-07-25T08:54:49Z","timestamp":1690275289000},"page":"633-654","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":1,"title":["Stabilizing deep Q-learning with Q-graph-based bounds"],"prefix":"10.1177","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6958-8015","authenticated-orcid":false,"given":"Sabrina","family":"Hoppe","sequence":"first","affiliation":[{"name":"Bosch Center for Artificial Intelligence, Robert Bosch GmbH, Renningen, Germany"},{"name":"Machine Learning and Robotics Lab, University of Stuttgart, Stuttgart, Germany"}]},{"given":"Markus","family":"Giftthaler","sequence":"additional","affiliation":[{"name":"Google Germany GmbH, Munich, Germany"}]},{"given":"Robert","family":"Krug","sequence":"additional","affiliation":[{"name":"Bosch Center for Artificial Intelligence, Robert Bosch GmbH, Renningen, Germany"}]},{"given":"Marc","family":"Toussaint","sequence":"additional","affiliation":[{"name":"Learning and Intelligent Systems Group, TU Berlin, Berlin, Germany"}]}],"member":"179","published-online":{"date-parts":[[2023,7,25]]},"reference":[{"key":"bibr1-02783649231185165","unstructured":"Abadi M, Agarwal A, Barham P, et al. 
(2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. URL: https:\/\/www.tensorflow.org\/"},{"key":"bibr2-02783649231185165","unstructured":"Achiam J, Knight E, Abbeel P (2019) Towards characterizing divergence in deep Q-learning. ArXiv Preprint arXiv:1903.08894."},{"key":"bibr3-02783649231185165","unstructured":"Amiranashvili A, Dosovitskiy A, Koltun V, et al. (2018) Analyzing the role of temporal differencing in deep reinforcement learning. In: Proceedings of the international conference on learning representations. URL: https:\/\/openreview.net\/forum?id=HyiAuyb0b"},{"key":"bibr4-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/56.20440"},{"key":"bibr5-02783649231185165","unstructured":"Anschel O, Baram N, Shimkin N (2017) Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. In: Proceedings of the 34 th International Conference on Machine, Sydney, Australia, PMLR 70, 2017, August 6\u201311 2017, pp. 176\u2013185."},{"key":"bibr6-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1016\/B978-1-55860-377-6.50013-X"},{"key":"bibr7-02783649231185165","unstructured":"Baird LC (1999) Reinforcement learning through gradient descent. PhD Thesis, Pittsburgh, PA: Carnegie Mellon University, URL: http:\/\/reports-archive.adm.cs.cmu.edu\/anon\/1999\/CMU-CS-99-132.pdf"},{"key":"bibr8-02783649231185165","unstructured":"Blundell C, Uria B, Pritzel A, et al. (2016) Model-free episodic control. ArXiv Preprint arXiv:1606.04460."},{"key":"bibr9-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/ROBOT.1995.525545"},{"volume-title":"Pinocchio: Fast Forward and Inverse Dynamics for Poly-Articulated Systems","year":"2015","author":"Carpentier J","key":"bibr10-02783649231185165"},{"key":"bibr11-02783649231185165","unstructured":"Corneil D, Gerstner W, Brea J (2018) Efficient model-based deep reinforcement learning with variational state tabulation. In: Proceedings of the international conference on machine learning, Stockholm, Sweden, 10\u201315 July, pp. 1049\u20131058."},{"key":"bibr12-02783649231185165","doi-asserted-by":"crossref","unstructured":"De Asis K, Chan A, Pitis S, et al. (2019) Fixed-horizon temporal difference methods for stable reinforcement learning. ArXiv Preprint arXiv:1909.03906.","DOI":"10.1609\/aaai.v34i04.5784"},{"key":"bibr13-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"bibr14-02783649231185165","unstructured":"Durugkar I, Stone P (2018) TD learning with constrained gradients. URL: https:\/\/openreview.net\/forum?id=Bk-ofQZRb"},{"key":"bibr15-02783649231185165","first-page":"15246","volume":"32","author":"Eysenbach B","year":"2019","journal-title":"Advances in Neural Information Processing Systems"},{"key":"bibr16-02783649231185165","unstructured":"Fazeli N, Zapolsky S, Drumwright E, et al. (2017) Learning data-efficient rigid-body contact models: case study of planar impact. In: Proceedings of the 1st annual conference on robot learning, Mountain View, United States, 13\u201315 November, pp. 388\u2013397."},{"key":"bibr17-02783649231185165","unstructured":"Fedus W, Ramachandran P, Agarwal R, et al. (2020) Revisiting fundamentals of experience replay. 
In: Proceedings of the international conference on machine learning, Virtual Event, 13\u201318 July."},{"key":"bibr18-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1023\/A:1022698606139"},{"key":"bibr19-02783649231185165","unstructured":"Fu J, Kumar A, Soh M, et al. (2019) Diagnosing bottlenecks in deep Q-learning algorithms. In: Proceedings of the international conference on machine learning, Long Beach, CA, 9\u201315 June, pp. 2021\u20132030."},{"key":"bibr20-02783649231185165","unstructured":"Fujimoto S, Meger D, Precup D (2019) Off-policy deep reinforcement learning without exploration. In: Proceedings of the international conference on machine learning, Long Beach, CA, 9\u201315 June, p. 2052."},{"key":"bibr21-02783649231185165","first-page":"1587","volume":"80","author":"Fujimoto S","year":"2018","journal-title":"Proceedings of Machine Learning Research"},{"key":"bibr22-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/SIMPAR.2018.8376281"},{"key":"bibr23-02783649231185165","unstructured":"Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, Sardinia, Italy, 13\u201315 May, pp. 249\u2013256."},{"key":"bibr24-02783649231185165","first-page":"3846","volume-title":"NeurIPS","author":"Gu SS","year":"2017"},{"key":"bibr25-02783649231185165","unstructured":"Haarnoja T, Zhou A, Abbeel P, et al. (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. ArXiv Preprint arXiv:1801.01290."},{"key":"bibr26-02783649231185165","unstructured":"He FS, Liu Y, Schwing AG, et al. (2017) Learning to play in a day: faster deep reinforcement learning by optimality tightening. In: Proceedings of the international conference on learning representations, Toulon, France, 24\u201326 April."},{"key":"bibr27-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.123"},{"key":"bibr28-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11694"},{"key":"bibr29-02783649231185165","unstructured":"Hernandez-Garcia JF, Sutton RS (2019) Understanding multi-step deep reinforcement learning: a systematic study of the dqn target. ArXiv Preprint arXiv:1901.07510."},{"key":"bibr30-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/IROS45743.2020.9341390"},{"key":"bibr31-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2019.2928212"},{"key":"bibr32-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2017.8202244"},{"key":"bibr33-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2019.8794127"},{"key":"bibr34-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2019.8793542"},{"key":"bibr35-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1613\/jair.301"},{"key":"bibr36-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-007-2069-5_83"},{"key":"bibr37-02783649231185165","unstructured":"Kloss A, Schaal S, Bohg J (2017) Combining learned and analytical models for predicting action effects. ArXiv Preprint arXiv:1710.04102."},{"key":"bibr38-02783649231185165","first-page":"11784","volume-title":"NeurIPS","author":"Kumar A","year":"2019"},{"key":"bibr39-02783649231185165","unstructured":"Kumar A, Gupta A, Levine S (2020a) Discor: corrective feedback in reinforcement learning via distribution correction. 
ArXiv Preprint arXiv:2003.07305."},{"key":"bibr40-02783649231185165","unstructured":"Kumar A, Zhou A, Tucker G, et al. (2020b) Conservative Q-learning for offline reinforcement learning. ArXiv Preprint arXiv:2006.04779."},{"key":"bibr41-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v29i1.9700"},{"key":"bibr42-02783649231185165","first-page":"2112","volume-title":"NeurIPS","author":"Lee SY","year":"2019"},{"key":"bibr43-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1163\/156855305323383767"},{"key":"bibr44-02783649231185165","unstructured":"Levine S, Kumar A, Tucker G, et al. (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. ArXiv Preprint arXiv:2005.01643."},{"journal-title":"CoRR Abs\/1603.02199","year":"2016","author":"Levine S","key":"bibr45-02783649231185165"},{"key":"bibr46-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/100.591646"},{"key":"bibr47-02783649231185165","unstructured":"Lillicrap TP, Hunt JJ, Pritzel A, et al. (2015) Continuous control with deep reinforcement learning. ArXiv Preprint arXiv:1509.02971."},{"key":"bibr48-02783649231185165","unstructured":"Mnih V, Badia AP, Mirza M, et al. (2016) Asynchronous methods for deep reinforcement learning. In: Proceedings of the international conference on machine learning, New York, NY, USA, 19\u2013 24 June, pp. 1928\u20131937."},{"key":"bibr49-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1038\/nature14236"},{"key":"bibr50-02783649231185165","first-page":"1054","volume-title":"NeurIPS","author":"Munos R","year":"2016"},{"key":"bibr51-02783649231185165","doi-asserted-by":"publisher","DOI":"10.3389\/fnbot.2019.00103"},{"key":"bibr52-02783649231185165","unstructured":"Plappert M, Houthooft R, Dhariwal P, et al. (2017) Parameter space noise for exploration. ArXiv Preprint arXiv:1706.01905."},{"key":"bibr53-02783649231185165","unstructured":"Precup D, Sutton RS, Singh S (2000) Eligibility traces for off-policy policy evaluation. In: Proceedings of the international conference on machine learning, Stanford, USA, June 29\u2013July 2."},{"key":"bibr54-02783649231185165","unstructured":"Rueb A, Becker L (US Patent US10480923B2, November 2016) Sensor apparatus and robot system having the sensor apparatus."},{"key":"bibr55-02783649231185165","doi-asserted-by":"crossref","unstructured":"Schoettler G, Nair A, Luo J, et al. (2019) Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards. ArXiv Preprint arXiv:1906.05841.","DOI":"10.1109\/IROS45743.2020.9341714"},{"key":"bibr56-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2021.103535"},{"key":"bibr57-02783649231185165","unstructured":"Silver T, Allen K, Tenenbaum J, et al. (2018) Residual policy learning. ArXiv Preprint arXiv:1812.06298."},{"key":"bibr58-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/TMECH.2019.2891177"},{"volume-title":"Reinforcement Learning: An Introduction","year":"2018","author":"Sutton RS","key":"bibr59-02783649231185165"},{"key":"bibr60-02783649231185165","unstructured":"Tang Y (2020) Self-imitation learning via generalized lower bound Q-learning. ArXiv Preprint arXiv:2006.07442."},{"key":"bibr61-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2018.8460696"},{"key":"bibr62-02783649231185165","unstructured":"Touati A, Zhang A, Pineau J, et al. (2020) Stable policy optimization via off-policy divergence regularization. 
ArXiv Preprint arXiv:2003.04108."},{"key":"bibr63-02783649231185165","unstructured":"Van Hasselt H, Doron Y, Strub F, et al. (2018) Deep reinforcement learning and the deadly triad. ArXiv Preprint arXiv:1812.02648."},{"key":"bibr64-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v30i1.10295"},{"key":"bibr65-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1007\/BF00992698"},{"volume-title":"Learning from delayed rewards","year":"1989","author":"Watkins CJCH","key":"bibr66-02783649231185165"},{"key":"bibr67-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2018.8460995"},{"key":"bibr68-02783649231185165","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2019.2939174"},{"key":"bibr69-02783649231185165","unstructured":"Zhu G, Lin Z, Yang G, et al. (2019) Episodic reinforcement learning with associative memory. In: Proceedings of the international conference on learning representations, New Orleans, USA, 6-9 May."}],"container-title":["The International Journal of Robotics Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/02783649231185165","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/02783649231185165","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/02783649231185165","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,2]],"date-time":"2025-03-02T07:10:29Z","timestamp":1740899429000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/02783649231185165"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,25]]},"references-count":69,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2023,8]]}},"alternative-id":["10.1177\/02783649231185165"],"URL":"https:\/\/doi.org\/10.1177\/02783649231185165","relation":{},"ISSN":["0278-3649","1741-3176"],"issn-type":[{"type":"print","value":"0278-3649"},{"type":"electronic","value":"1741-3176"}],"subject":[],"published":{"date-parts":[[2023,7,25]]}}}
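
The record above has the standard shape of a Crossref REST API work response ("status", "message-type", and the "message" object holding the bibliographic fields). A minimal Python sketch for fetching and unpacking such a record follows; the endpoint and field names match the record shown, while the User-Agent contact address is a placeholder you would replace with your own, per Crossref's "polite pool" guidance.

```python
import requests

DOI = "10.1177/02783649231185165"

# Fetch the work record from the public Crossref REST API.
resp = requests.get(
    f"https://api.crossref.org/works/{DOI}",
    # Placeholder contact address; Crossref asks for a mailto in the UA.
    headers={"User-Agent": "example-script/0.1 (mailto:you@example.org)"},
    timeout=30,
)
resp.raise_for_status()

# The bibliographic payload sits under "message".
work = resp.json()["message"]

print(work["title"][0])                # article title (list of strings)
print(work["container-title"][0])      # journal name
print(", ".join(f"{a['given']} {a['family']}" for a in work["author"]))
print(work["URL"])                     # canonical https://doi.org/... link
```

Note that "title", "container-title", and "author" are lists in this schema, so even a single-title article needs the `[0]` index.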
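The abstract describes computing Q-values analytically on a finite MDP induced by the collected data graph and using them as lower bounds on TD targets. The sketch below only illustrates that general idea and is not the authors' implementation: it assumes hashable tabular states and actions, deterministic dynamics, and that dead-end states can conservatively be valued at 0 (valid when optimal returns are non-negative); the function names `q_graph_bounds` and `clipped_td_target` are hypothetical, and the paper's actual state/action subset selection for continuous control is omitted.

```python
from collections import defaultdict

GAMMA = 0.99

def q_graph_bounds(transitions, sweeps=200):
    """Value iteration on the finite MDP induced by observed transitions.

    transitions: iterable of (s, a, r, s_next, done) tuples.
    Returns {(s, a): Q}. Because the graph offers at most the actions that
    were actually tried, the best return achievable inside it cannot exceed
    the optimum of the full MDP, so (under the assumptions above) these
    values are lower bounds on the true optimal Q-values.
    """
    # Deterministic-dynamics assumption: one outcome per (s, a) edge.
    edges = {(s, a): (r, s_next, done) for s, a, r, s_next, done in transitions}
    actions_at = defaultdict(list)  # actions observed in each state
    for (s, a) in edges:
        actions_at[s].append(a)

    q = {sa: 0.0 for sa in edges}
    for _ in range(sweeps):  # plain synchronous value iteration
        for (s, a), (r, s_next, done) in edges.items():
            v_next = 0.0  # dead ends / terminals valued at 0 (assumption)
            if not done and actions_at[s_next]:
                v_next = max(q[(s_next, b)] for b in actions_at[s_next])
            q[(s, a)] = r + GAMMA * v_next
    return q

def clipped_td_target(r, done, q_next_max, q_low, s, a):
    """One natural way to enforce the bound: clamp the TD target from below."""
    target = r + (1.0 - float(done)) * GAMMA * q_next_max
    return max(target, q_low.get((s, a), float("-inf")))
```

In this reading, the graph-derived bound never changes a target that is already consistent; it only intervenes when bootstrapped estimates drift below what the data provably supports, which is the abstract's mechanism for preventing soft divergence.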