{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T01:33:24Z","timestamp":1761528804951,"version":"build-2065373602"},"reference-count":54,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,10,22]],"date-time":"2025-10-22T00:00:00Z","timestamp":1761091200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003086","name":"the Basque Government","doi-asserted-by":"publisher","award":["KK-2024\/00064"],"award-info":[{"award-number":["KK-2024\/00064"]}],"id":[{"id":"10.13039\/501100003086","id-type":"DOI","asserted-by":"publisher"}]},{"name":"MATHMODE","award":["IT1866-26"],"award-info":[{"award-number":["IT1866-26"]}]},{"DOI":"10.13039\/501100019124","name":"IKASLAGUN project","doi-asserted-by":"publisher","award":["2024-CIE2-000006-01"],"award-info":[{"award-number":["2024-CIE2-000006-01"]}],"id":[{"id":"10.13039\/501100019124","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Diputaci\u00f3n Foral de Gipuzcoa"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Reinforcement learning (RL) agents face significant challenges in sparse-reward environments, as insufficient exploration of the state space can result in inefficient training or incomplete policy learning. To address this challenge, this work proposes a teacher\u2013student framework for RL that leverages the inherent knowledge of large language models (LLMs) to decompose complex tasks into manageable subgoals. The capabilities of LLMs to comprehend problem structure and objectives, based on textual descriptions, can be harnessed to generate subgoals, similar to the guidance a human supervisor would provide. For this purpose, we introduce the following three subgoal types: positional, representation-based, and language-based. 
Moreover, we propose an LLM surrogate model to reduce computational overhead and demonstrate that the supervisor can be decoupled once the policy has been learned, further lowering computational costs. Under this framework, we evaluate the performance of three open-source LLMs (namely, Llama, DeepSeek, and Qwen). Furthermore, we assess our teacher\u2013student framework on the MiniGrid benchmark\u2014a collection of procedurally generated environments that demand generalization to previously unseen tasks. Experimental results indicate that our teacher\u2013student framework facilitates more efficient learning and encourages enhanced exploration in complex tasks, resulting in faster training convergence and outperforming recent teacher\u2013student methods designed for sparse-reward environments.<\/jats:p>","DOI":"10.3390\/make7040126","type":"journal-article","created":{"date-parts":[[2025,10,23]],"date-time":"2025-10-23T01:14:02Z","timestamp":1761182042000},"page":"126","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Large Language Models for Structured Task Decomposition in Reinforcement Learning Problems with Sparse Rewards"],"prefix":"10.3390","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-2536-5964","authenticated-orcid":false,"given":"Unai","family":"Ruiz-Gonzalez","sequence":"first","affiliation":[{"name":"Department of Communications Engineering, University of the Basque Country (UPV\/EHU), 48013 Bilbao, Spain"},{"name":"TECNALIA, Basque Research and Technology Alliance (BRTA), 48160 Derio, Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4688-1304","authenticated-orcid":false,"given":"Alain","family":"Andres","sequence":"additional","affiliation":[{"name":"TECNALIA, Basque Research and Technology Alliance (BRTA), 48160 Derio, Spain"},{"name":"Faculty of Engineering, University of Deusto, 20012 Donostia, 
Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1260-9775","authenticated-orcid":false,"given":"Javier","family":"Del Ser","sequence":"additional","affiliation":[{"name":"TECNALIA, Basque Research and Technology Alliance (BRTA), 48160 Derio, Spain"},{"name":"Department of Mathematics, University of the Basque Country (UPV\/EHU), 48940 Leioa, Spain"}]}],"member":"1968","published-online":{"date-parts":[[2025,10,22]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Pathak, D., Agrawal, P., Efros, A.A., and Darrell, T. (2017, January 6\u201311). Curiosity-driven exploration by self-supervised prediction. Proceedings of the International Conference on Machine Learning (ICML), Sydney, NSW, Australia.","DOI":"10.1109\/CVPRW.2017.70"},{"key":"ref_2","unstructured":"Rengarajan, D., Vaidya, G., Sarvesh, A., Kalathil, D., and Shakkottai, S. (2022, January 25\u201329). Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration. Proceedings of the International Conference on Learning Representations (ICLR), Virtual."},{"key":"ref_3","unstructured":"Klink, P., D\u2019Eramo, C., Peters, J.R., and Pajarinen, J. (2020, January 6\u201312). Self-paced deep reinforcement learning. Proceedings of the Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"129079","DOI":"10.1016\/j.neucom.2024.129079","article-title":"Using offline data to speed up reinforcement learning in procedurally generated environments","volume":"618","author":"Andres","year":"2025","journal-title":"Neurocomputing"},{"key":"ref_5","unstructured":"Vezhnevets, A.S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. (2017, January 6\u201311). Feudal networks for hierarchical reinforcement learning. 
Proceedings of the International Conference on Machine Learning (ICML), Sydney, NSW, Australia."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Anand, D., Gupta, V., Paruchuri, P., and Ravindran, B. (2021, January 2\u20139). An enhanced advising model in teacher-student framework using state categorization. Proceedings of the AAAI Conference on Artificial Intelligence, Virtual.","DOI":"10.1609\/aaai.v35i8.16823"},{"key":"ref_7","unstructured":"Hu, S., Huang, T., Liu, G., Kompella, R.R., Ilhan, F., Tekin, S.F., Xu, Y., Yahn, Z., and Liu, L. (2024). A survey on large language model-based game agents. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"109517","DOI":"10.1016\/j.compeleceng.2024.109517","article-title":"How rationals boost textual entailment modeling: Insights from large language models","volume":"119","author":"Pham","year":"2024","journal-title":"Comput. Electr. Eng."},{"key":"ref_9","unstructured":"Ruiz-Gonzalez, U., Andres, A., Bascoy, P.G., and Ser, J.D. (2024, January 15). Words as Beacons: Guiding RL Agents with High-Level Language Prompts. Proceedings of the NeurIPS 2024 Workshop on Open-World Agents, Vancouver, BC, Canada."},{"key":"ref_10","unstructured":"Chevalier-Boisvert, M., Dai, B., Towers, M., de Lazcano, R., Willems, L., Lahlou, S., Pal, S., Castro, P.S., and Terry, J. (2023, January 10\u201316). Minigrid & Miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. 
Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1016\/0010-0277(93)90058-4","article-title":"Learning and development in neural networks: The importance of starting small","volume":"48","author":"Elman","year":"1993","journal-title":"Cognition"},{"key":"ref_12","unstructured":"Mu, J., Zhong, V., Raileanu, R., Jiang, M., Goodman, N., Rockt\u00e4schel, T., and Grefenstette, E. (December, January 28). Improving intrinsic exploration with language abstractions. Proceedings of the Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009, January 14\u201318). Curriculum learning. Proceedings of the International Conference on Machine Learning (ICML), Montreal, QC, Canada.","DOI":"10.1145\/1553374.1553380"},{"key":"ref_14","unstructured":"Graves, A., Bellemare, M.G., Menick, J., Munos, R., and Kavukcuoglu, K. (2017, January 6\u201311). Automated curriculum learning for neural networks. Proceedings of the International Conference on Machine Learning (ICML), Sydney, NSW, Australia."},{"key":"ref_15","unstructured":"Hacohen, G., and Weinshall, D. (2019, January 10\u201315). On the power of curriculum learning in training deep networks. Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Platanios, E.A., Stretcu, O., Neubig, G., Poczos, B., and Mitchell, T.M. (2019, January 2\u20137). Competence-based curriculum learning for neural machine translation. 
Proceedings of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA.","DOI":"10.18653\/v1\/N19-1119"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"3732","DOI":"10.1109\/TNNLS.2019.2934906","article-title":"Teacher\u2013student curriculum learning","volume":"31","author":"Matiisen","year":"2019","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_18","unstructured":"Jiang, M., Grefenstette, E., and Rocktaschel, T. (2021, January 18\u201324). Prioritized Level Replay. Proceedings of the 38th International Conference on Machine Learning, Virtual."},{"key":"ref_19","unstructured":"Kanitscheider, I., Huizinga, J., Farhi, D., Guss, W.H., Houghton, B., Sampedro, R., Zhokhov, P., Baker, B., Ecoffet, A., and Tang, J. (2021). Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft. arXiv."},{"key":"ref_20","unstructured":"Racani\u00e8re, S., Lampinen, A.K., Santoro, A., Reichert, D.P., Firoiu, V., and Lillicrap, T.P. (2019, January 6\u20139). Automated curricula through setter-solver interactions. Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA."},{"key":"ref_21","unstructured":"Campero, A., Raileanu, R., Kuttler, H., Tenenbaum, J.B., Rocktaschel, T., and Grefenstette, E. (2021, January 3\u20137). Learning with AMIGo: Adversarially motivated intrinsic goals. Proceedings of the International Conference on Learning Representations (ICLR), Virtual."},{"key":"ref_22","unstructured":"Florensa, C., Held, D., Geng, X., and Abbeel, P. (2018, January 10\u201315). Automatic goal generation for reinforcement learning agents. Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden."},{"key":"ref_23","unstructured":"Dayan, P., and Hinton, G.E. (December, January 30). Feudal Reinforcement Learning. 
Proceedings of the Neural Information Processing Systems (NeurIPS), Denver, CO, USA."},{"key":"ref_24","unstructured":"Jiang, Y., Gu, S.S., Murphy, K.P., and Finn, C. (2019, January 8\u201314). Language as an abstraction for hierarchical deep reinforcement learning. Proceedings of the Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada."},{"key":"ref_25","unstructured":"Goecks, V.G., Waytowich, N.R., Watkins-Valls, D., and Prakash, B. (2022, January 21\u201323). Combining learning from human feedback and knowledge engineering to solve hierarchical tasks in minecraft. Proceedings of the Machine Learning and Knowledge Engineering for Hybrid Intelligence (AAAI-MAKE), Stanford University, Palo Alto, CA, USA."},{"key":"ref_26","unstructured":"Prakash, R., Pavlamuri, S., and Chernova, S. (2021). Interactive Hierarchical Task Learning from Language Instructions and Demonstrations, Association for Computational Linguistics."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Prakash, B., Oates, T., and Mohsenin, T. (2024, January 20\u201323). Using LLMs for augmenting hierarchical agents with common sense priors. Proceedings of the International FLAIRS Conference, Daytona Beach, FL, USA.","DOI":"10.32473\/flairs.37.1.135602"},{"key":"ref_28","unstructured":"Shridhar, M., Yuan, X., Cote, M.A., Bisk, Y., Trischler, A., and Hausknecht, M. (2021, January 3\u20137). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. Proceedings of the International Conference on Learning Representations (ICLR), Virtual."},{"key":"ref_29","unstructured":"Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language Models Are Unsupervised Multitask Learners, OpenAI Blog."},{"key":"ref_30","unstructured":"Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., and Askell, A. (2020, January 6\u201312). Language models are few-shot learners. 
Proceedings of the Neural Information Processing Systems (NeurIPS), Virtual."},{"key":"ref_31","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). GPT-4 Technical Report, OpenAI Blog."},{"key":"ref_32","unstructured":"Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Rozi\u00e8re, B., Goyal, N., Hambro, E., and Azhar, F. (2023). Llama: Open and efficient foundation language models. arXiv."},{"key":"ref_33","unstructured":"Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., and Huang, F. (2023). Qwen Technical Report. arXiv."},{"key":"ref_34","unstructured":"Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., and Fu, Z. (2024). DeepSeek LLM: Scaling open-source language models with longtermism. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"9737","DOI":"10.1109\/TNNLS.2024.3497992","article-title":"Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods","volume":"36","author":"Cao","year":"2024","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_36","unstructured":"Li, S., Puig, X., Paxton, C., Du, Y., Wang, C., Fan, L., Chen, T., Huang, D., Aky\u00fcrek, E., and Anandkumar, A. (December, January 28). Pre-trained language models for interactive decision-making. Proceedings of the Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA."},{"key":"ref_37","unstructured":"Carta, T., Romac, C., Wolf, T., Lamprier, S., Sigaud, O., and Oudeyer, P. (2023, January 23\u201329). Grounding large language models in interactive environments with online reinforcement learning. 
Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA."},{"key":"ref_38","unstructured":"Ma, Y.J., Liang, W., Wang, G., Huang, D., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A. (2023). Eureka: Human-level reward design via coding large language models. arXiv."},{"key":"ref_39","unstructured":"Kwon, M., Xie, S.M., Bullard, K., and Sadigh, D. (2023, January 1\u20135). Reward design with language models. Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda."},{"key":"ref_40","unstructured":"Klissarov, M., D\u2019Oro, P., Sodhani, S., Raileanu, R., Bacon, P., Vincent, P., Zhang, A., and Henaff, M. (2024, January 7\u201311). Motif: Intrinsic motivation from artificial intelligence feedback. Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria."},{"key":"ref_41","unstructured":"Klissarov, M., Henaff, M., Raileanu, R., Sodhani, S., Vincent, P., Zhang, A., Bacon, P., Precup, D., Machado, M.C., and D\u2019Oro, P. (2025, January 24\u201328). MaestroMotif: Skill Design from Artificial Intelligence Feedback. Proceedings of the International Conference on Learning Representations (ICLR), Singapore."},{"key":"ref_42","unstructured":"Wang, Z., Cai, S., Chen, G., Liu, A., Ma, X., and Liang, Y. (2023, January 10\u201316). Describe, Explain, Plan and Select: Interactive planning with large language models enables open-world multi-task agents. Proceedings of the Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA."},{"key":"ref_43","unstructured":"Du, Y., Watkins, O., Wang, Z., Colas, C., Darrell, T., Abbeel, P., Gupta, A., and Andreas, J. (2023, January 23\u201329). Guiding Pretraining in reinforcement learning with large language models. 
Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA."},{"key":"ref_44","unstructured":"Pignatelli, E., Ferret, J., Rocktaschel, T., Grefenstette, E., Paglieri, D., Coward, S., and Toni, L. (2024). Assessing the Zero-Shot Capabilities of LLMs for Action Evaluation in RL. arXiv."},{"key":"ref_45","unstructured":"Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., and Hausman, K. (2022, January 14\u201318). Do as I can, not as I say: Grounding language in robotic affordances. Proceedings of the Conference on Robot Learning (CoRL), Auckland, New Zealand."},{"key":"ref_46","unstructured":"Wang, Z., Cai, S., Liu, A., Jin, Y., Hou, J., Zhang, B., Lin, H., He, Z., Zheng, Z., and Yang, Y. (2023). JARVIS-1: Open-world multi-task agents with memory-augmented multimodal language models. arXiv."},{"key":"ref_47","unstructured":"Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. (2020, January 13\u201318). Leveraging Procedural Generation to Benchmark Reinforcement Learning. Proceedings of the 37th International Conference on Machine Learning, Virtual."},{"key":"ref_48","unstructured":"Emani, M., Foreman, S., Sastry, V., Xie, Z., Raskar, S., Arnold, W., Thakur, R., Vishwanath, V., and Papka, M.E. (2023). A Comprehensive Performance Study of Large Language Models on Novel AI Accelerators. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Reynolds, L., and McDonell, K. (2021, January 8\u201313). Prompt programming for large language models: Beyond the few-shot paradigm. Proceedings of the CHI Conference on Human Factors in Computing Systems, Yokohama, Japan.","DOI":"10.1145\/3411763.3451760"},{"key":"ref_50","unstructured":"Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., and Iwasawa, Y. (December, January 28). Large language models are zero-shot reasoners. 
Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA."},{"key":"ref_51","unstructured":"Reimers, N., and Gurevych, I. (2025, August 22). all-MiniLM-L6-v2: Sentence Embeddings Using MiniLM-L6-v2. Available online: https:\/\/huggingface.co\/sentence-transformers\/all-MiniLM-L6-v2."},{"key":"ref_52","unstructured":"Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Andres, A., Villar-Rodriguez, E., and Del Ser, J. (2022). An evaluation study of intrinsic motivation techniques applied to reinforcement learning over hard exploration environments. Proceedings of the International Cross-Domain Conference for Machine Learning and Knowledge Extraction, Springer.","DOI":"10.1007\/978-3-031-14463-9_13"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3459991","article-title":"A Survey of Reinforcement Learning Algorithms for Dynamically Varying Environments","volume":"54","author":"Padakandla","year":"2021","journal-title":"ACM Comput. 
Surv."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/4\/126\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T01:31:45Z","timestamp":1761528705000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/4\/126"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,22]]},"references-count":54,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["make7040126"],"URL":"https:\/\/doi.org\/10.3390\/make7040126","relation":{},"ISSN":["2504-4990"],"issn-type":[{"type":"electronic","value":"2504-4990"}],"subject":[],"published":{"date-parts":[[2025,10,22]]}}}