{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,13]],"date-time":"2025-12-13T23:11:09Z","timestamp":1765667469357,"version":"3.37.3"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"5","license":[{"start":{"date-parts":[[2024,2,5]],"date-time":"2024-02-05T00:00:00Z","timestamp":1707091200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,5]],"date-time":"2024-02-05T00:00:00Z","timestamp":1707091200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000266","name":"Engineering and Physical Sciences Research Council","doi-asserted-by":"publisher","award":["EP\/V024868\/1"],"award-info":[{"award-number":["EP\/V024868\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2024,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Offline reinforcement learning (RL) aims to create policies for sequential decision-making using exclusively offline datasets. This presents a significant challenge, especially when attempting to accomplish multiple distinct goals or outcomes within a given scenario while receiving sparse rewards. Prior methods using advantage weighting for offline goal-conditioned learning improve policies monotonically. However, they still face challenges from distribution shift and multi-modality that arise due to conflicting ways to reach a goal. This issue is especially challenging in long-horizon tasks, where the presence of multiple, often conflicting, solutions makes it hard to identify a single optimal policy for transitioning from a state to a desired goal. To address these challenges, we introduce a complementary advantage-based weighting scheme that incorporates an additional source of inductive bias. Given a value-based partitioning of the state space, the contribution of actions expected to lead to target regions that are easier to reach, compared to the final goal, is further increased. Our proposed approach, Dual-Advantage Weighted Offline Goal-conditioned RL, outperforms several competing offline algorithms in widely used benchmarks. Furthermore, we provide a theoretical guarantee that the learned policy will not be inferior to the underlying behavior policy.<\/jats:p>","DOI":"10.1007\/s10994-023-06500-z","type":"journal-article","created":{"date-parts":[[2024,2,5]],"date-time":"2024-02-05T21:22:02Z","timestamp":1707168122000},"page":"2435-2465","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Goal-conditioned offline reinforcement learning through state space partitioning"],"prefix":"10.1007","volume":"113","author":[{"given":"Mianchu","family":"Wang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yue","family":"Jin","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3942-3900","authenticated-orcid":false,"given":"Giovanni","family":"Montana","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,2,5]]},"reference":[{"key":"6500_CR1","unstructured":"Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., Zaremba, W. (2017). Hindsight experience replay. In: Advances in neural information processing systems."},{"key":"6500_CR2","first-page":"103","volume":"15","author":"M Bain","year":"1995","unstructured":"Bain, M., & Sammut, C. (1995). A framework for behavioural cloning. Machine Intelligence, 15, 103\u2013129.","journal-title":"Machine Intelligence"},{"key":"6500_CR3","unstructured":"Chane-Sane, E., Schmid, C., Laptev, I. (2021). Goal-conditioned reinforcement learning with imagined subgoals. In: International conference on machine learning."},{"key":"6500_CR4","first-page":"8532","volume":"33","author":"H Charlesworth","year":"2020","unstructured":"Charlesworth, H., & Montana, G. (2020). Plangan: Model-based planning with sparse rewards and multiple goals. Advances in Neural Information Processing Systems, 33, 8532\u20138542.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6500_CR5","unstructured":"Chebotar, Y., Hausman, K., Lu, Y., Xiao, T., Kalashnikov, D., Varley, J., Irpan, A., Eysenbach, B., Julian, R., Finn, C., Levine, S. (2021). Actionable models: Unsupervised offline reinforcement learning of robotic skills. In: International conference on machine learning."},{"key":"6500_CR6","unstructured":"Ding, Y., Florensa, C., Abbeel, P., Phielipp, M. (2019). Goal-conditioned imitation learning. In: Advances in Neural Information Processing Systems."},{"key":"6500_CR7","first-page":"8622","volume":"34","author":"I Durugkar","year":"2021","unstructured":"Durugkar, I., Tec, M., Niekum, S., & Stone, P. (2021). Adversarial intrinsic motivation for reinforcement learning. Advances in Neural Information Processing Systems, 34, 8622\u20138636.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6500_CR8","unstructured":"Emmons, S., Eysenbach, B., Kostrikov, I., Levine, S. (2022). RvS: What is essential for offline RL via supervised learning? In: International conference on learning representations."},{"key":"6500_CR9","first-page":"35603","volume":"35","author":"B Eysenbach","year":"2022","unstructured":"Eysenbach, B., Zhang, T., Levine, S., & Salakhutdinov, R. (2022). Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems, 35, 35603\u201335620.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6500_CR10","unstructured":"Fang, M., Zhou, T., Du, Y., Han, L., Zhang, Z. (2019). Curriculum-guided hindsight experience replay. In: Advances in neural information processing systems."},{"key":"6500_CR11","unstructured":"Ferret, J., Pietquin, O., Geist, M. (2021). Self-imitation advantage learning. In: International conference on autonomous agents and multiagent systems."},{"key":"6500_CR12","unstructured":"Fu, J., Kumar, A., Nachum, O., Tucker, G., Levine, S. (2020).D4RL: Datasets for deep data-driven reinforcement learning."},{"key":"6500_CR13","unstructured":"Fujimoto, S., Hoof, H., Meger, D. (2018). Addressing function approximation error in actor-critic methods. In: International conference on machine learning."},{"key":"6500_CR14","unstructured":"Fujimoto, S., Meger, D., Precup, D. (2019). Off-policy deep reinforcement learning without exploration. In: International conference on machine learning."},{"key":"6500_CR15","first-page":"20132","volume":"34","author":"S Fujimoto","year":"2021","unstructured":"Fujimoto, S., & Gu, S. (2021). A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34, 20132\u201320145.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6500_CR16","unstructured":"Ghosh, D., Gupta, A., Reddy, A., Fu, J., Devin, C.M., Eysenbach, B., Levine, S. (2021). Learning to reach goals via iterated supervised learning. In: International conference on learning representations."},{"key":"6500_CR17","unstructured":"Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., Levine, S. (2018). Divide-and-conquer reinforcement learning. In: International conference on learning representations."},{"key":"6500_CR18","unstructured":"Ho, J., Ermon, S. (2016). Generative adversarial imitation learning. In: Advances in Neural Information Processing Systems."},{"key":"6500_CR19","unstructured":"Jurgenson, T., Avner, O., Groshev, E., Tamar, A. (2020). Sub-goal trees a framework for goal-based reinforcement learning. In: International conference on machine learning."},{"key":"6500_CR20","unstructured":"Kaelbling, L.P. (1993). Learning to achieve goals. In: International joint conference on artificial intelligence."},{"key":"6500_CR21","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1016\/j.neucom.2017.04.074","volume":"263","author":"TG Karimpanal","year":"2017","unstructured":"Karimpanal, T. G., & Wilhelm, E. (2017). Identification and off-policy learning of multiple objectives using adaptive clustering. Neurocomputing, 263, 39\u201347.","journal-title":"Neurocomputing"},{"key":"6500_CR22","first-page":"28336","volume":"34","author":"J Kim","year":"2021","unstructured":"Kim, J., Seo, Y., & Shin, J. (2021). Landmark-guided subgoal generation in hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 34, 28336\u201328349.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6500_CR23","unstructured":"Kingma, D.P., Ba, J. (2014). Adam: A method for stochastic optimization."},{"key":"6500_CR24","first-page":"1179","volume":"33","author":"A Kumar","year":"2020","unstructured":"Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33, 1179\u20131191.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6500_CR25","unstructured":"Li, Y., Gao, T., Yang, J., Xu, H., Wu, Y. (2022). Phasic self-imitative reduction for sparse-reward goal-conditioned reinforcement learning. In: International conference on machine learning."},{"key":"6500_CR26","unstructured":"Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N.M.O., Erez, T., Tassa, Y., Silver, D., Wierstra, D. (2016). Continuous control with deep reinforcement learning. In: International conference on learning representations."},{"issue":"4","key":"6500_CR27","doi-asserted-by":"publisher","first-page":"10216","DOI":"10.1109\/LRA.2022.3190100","volume":"7","author":"J Li","year":"2022","unstructured":"Li, J., Tang, C., Tomizuka, M., & Zhan, W. (2022). Hierarchical planning through goal-conditioned offline reinforcement learning. IEEE Robotics and Automation Letters, 7(4), 10216\u201310223.","journal-title":"IEEE Robotics and Automation Letters"},{"key":"6500_CR28","doi-asserted-by":"crossref","unstructured":"Liu, M., Zhu, M., Zhang, W. (2022). Goal-conditioned reinforcement learning: Problems and solutions. In: International joint conference on artificial intelligence.","DOI":"10.24963\/ijcai.2022\/770"},{"key":"6500_CR29","unstructured":"Ma, X., Zhao, S.-Y., Yin, Z.-H., Li, W.-J. (2020). Clustered reinforcement learning."},{"key":"6500_CR30","doi-asserted-by":"crossref","unstructured":"Mandlekar, A., Ramos, F., Boots, B., Savarese, S., Fei-Fei, L., Garg, A., Fox, D. (2020). Iris: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. In: International conference on robotics and automation.","DOI":"10.1109\/ICRA40945.2020.9196935"},{"key":"6500_CR31","doi-asserted-by":"crossref","unstructured":"Mannor, S., Menache, I., Hoze, A., Klein, U. (2004). Dynamic abstraction in reinforcement learning via clustering. In: International conference on machine learning.","DOI":"10.1145\/1015330.1015355"},{"key":"6500_CR32","unstructured":"Mezghani, L., Sukhbaatar, S., Bojanowski, P., Lazaric, A., Alahari, K. (2022). Learning goal-conditioned policies offline with self-supervised reward shaping. In: Conference on robot learning."},{"issue":"7540","key":"6500_CR33","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","volume":"518","author":"V Mnih","year":"2015","unstructured":"Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G.,  et al., (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529\u2013533.","journal-title":"Nature"},{"key":"6500_CR34","unstructured":"Nair, A.V., Pong, V., Dalal, M., Bahl, S., Lin, S., Levine, S. (2018). Visual reinforcement learning with imagined goals. In: Advances in neural information processing systems."},{"key":"6500_CR35","unstructured":"Nasiriany, S., Pong, V., Lin, S., Levine, S. (2019). Planning with goal-conditioned policies. In: Advances in neural information processing systems."},{"key":"6500_CR36","unstructured":"Oh, J., Guo, Y., Singh, S., Lee, H. (2018). Self-imitation learning. In: International conference on machine learning."},{"key":"6500_CR37","unstructured":"Peng, X.B., Kumar, A., Zhang, G., Levine, S. (2019). Advantage-weighted regression: Simple and scalable off-policy reinforcement learning."},{"key":"6500_CR38","unstructured":"Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., Kumar, V., Zaremba, W. (2018). Multi-goal reinforcement learning: Challenging robotics environments and request for research."},{"issue":"268","key":"6500_CR39","first-page":"1","volume":"22","author":"A Raffin","year":"2021","unstructured":"Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., & Dormann, N. (2021). Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268), 1\u20138.","journal-title":"Journal of Machine Learning Research"},{"key":"6500_CR40","unstructured":"Schaul, T., Horgan, D., Gregor, K., Silver, D. (2015). Universal value function approximators. In: International conference on machine learning."},{"key":"6500_CR41","volume-title":"Reinforcement learning: An introduction","author":"RS Sutton","year":"2018","unstructured":"Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press."},{"key":"6500_CR42","unstructured":"Wang, Q., Xiong, J., Han, L., sun, p., Liu, H., Zhang, T. (2018). Exponentially weighted imitation learning for batched historical data. In: Advances in neural information processing systems."},{"key":"6500_CR43","doi-asserted-by":"crossref","unstructured":"Wei, H., Corder, K., Decker, K. (2018). Q-learning acceleration via state-space partitioning. In: International conference on machine learning and applications.","DOI":"10.1109\/ICMLA.2018.00050"},{"key":"6500_CR44","unstructured":"Yang, R., Fang, M., Han, L., Du, Y., Luo, F., Li, X. (2021). MHER: Model-based hindsight experience replay. In: Deep RL Workshop NeurIPS 2021."},{"key":"6500_CR45","unstructured":"Yang, R., Lu, Y., Li, W., Sun, H., Fang, M., Du, Y., Li, X., Han, L., Zhang, C. (2022). Rethinking goal-conditioned supervised learning and its connection to offline RL. In: International conference on learning representations."},{"key":"6500_CR46","unstructured":"Zhang, L., Yang, G., Stadie, B.C. (2021). World model as a graph: Learning latent landmarks for planning. In: International conference on machine learning."}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-023-06500-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-023-06500-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-023-06500-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,2]],"date-time":"2024-05-02T18:14:17Z","timestamp":1714673657000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-023-06500-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,5]]},"references-count":46,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2024,5]]}},"alternative-id":["6500"],"URL":"https:\/\/doi.org\/10.1007\/s10994-023-06500-z","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"type":"print","value":"0885-6125"},{"type":"electronic","value":"1573-0565"}],"subject":[],"published":{"date-parts":[[2024,2,5]]},"assertion":[{"value":"15 February 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 November 2023","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 December 2023","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 February 2024","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"No competing and financial interests to disclose.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}},{"value":"The authors give their consent to participate.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to participate"}},{"value":"The authors give their consent for publication.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}}]}}