{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,18]],"date-time":"2025-12-18T05:20:39Z","timestamp":1766035239963,"version":"3.48.0"},"reference-count":53,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2025,12,16]],"date-time":"2025-12-16T00:00:00Z","timestamp":1765843200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62272239"],"award-info":[{"award-number":["62272239"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100007540","name":"Jiangsu Agriculture Science and Technology Innovation Fund","doi-asserted-by":"crossref","award":["CX(22)1007"],"award-info":[{"award-number":["CX(22)1007"]}],"id":[{"id":"10.13039\/100007540","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Natural Science Research Start-up Foundation for Recruiting Talents of Nanjing University","award":["NY222029"],"award-info":[{"award-number":["NY222029"]}]},{"name":"Guizhou Provincial Key Technology R&amp;D Program","award":["[2023]272"],"award-info":[{"award-number":["[2023]272"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Systems"],"abstract":"<jats:p>A central obstacle to the practical deployment of Reinforcement Learning (RL) is the prevalence of sparse rewards, which often necessitates task-specific dense signals crafted through costly trial-and-error. Automated reward decomposition and return\u2013redistribution methods can reduce this burden, but they are largely semantically agnostic and may fail to capture the multifaceted nature of task performance, leading to reward hacking or stalled exploration. Recent work uses Large Language Models (LLMs) to generate reward functions from high-level task descriptions, but these specifications are typically static and may encode biases or inaccuracies from the pretrained model, resulting in a priori reward misspecification. To address this, we propose the Metacognitive Introspective Reward Architecture (MIRA), a closed-loop architecture that treats LLM-generated reward code as a dynamic object refined through empirical feedback. An LLM first produces a set of computable reward factors. A dual-loop design then decouples policy learning from reward revision: an inner loop jointly trains the agent\u2019s policy and a reward-synthesis network to align with sparse ground-truth outcomes, while an outer loop monitors learning dynamics via diagnostic metrics and, upon detecting pathological signatures, invokes the LLM to perform targeted structural edits. 
Experiments on MuJoCo benchmarks show that MIRA corrects flawed initial specifications and improves asymptotic performance and sample efficiency over strong reward-design baselines.<\/jats:p>","DOI":"10.3390\/systems13121124","type":"journal-article","created":{"date-parts":[[2025,12,16]],"date-time":"2025-12-16T15:41:12Z","timestamp":1765899672000},"page":"1124","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["MIRA: An LLM-Driven Dual-Loop Architecture for Metacognitive Reward Design"],"prefix":"10.3390","volume":"13","author":[{"given":"Weiying","family":"Zhang","sequence":"first","affiliation":[{"name":"Post Big Data Technology and Application Engineering Research Center of Jiangsu Province, Nanjing University of Posts and Telecommunications, 66 Xinmofan Road, Nanjing 210003, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1545-3753","authenticated-orcid":false,"given":"Yuhua","family":"Xu","sequence":"additional","affiliation":[{"name":"Post Industry Technology R&D Center of the State Posts Bureau (IoT Technology), Nanjing University of Posts and Telecommunications, 66 Xinmofan Road, Nanjing 210003, China"}]},{"given":"Zhixin","family":"Sun","sequence":"additional","affiliation":[{"name":"Post Big Data Technology and Application Engineering Research Center of Jiangsu Province, Nanjing University of Posts and Telecommunications, 66 Xinmofan Road, Nanjing 210003, China"},{"name":"Post Industry Technology R&D Center of the State Posts Bureau (IoT Technology), Nanjing University of Posts and Telecommunications, 66 Xinmofan Road, Nanjing 210003, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,12,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"354","DOI":"10.1038\/nature24270","article-title":"Mastering the game of go without human knowledge","volume":"550","author":"Silver","year":"2017","journal-title":"Nature"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"350","DOI":"10.1038\/s41586-019-1724-z","article-title":"Grandmaster level in StarCraft II using multi-agent reinforcement learning","volume":"575","author":"Vinyals","year":"2019","journal-title":"Nature"},{"key":"ref_3","unstructured":"Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., and Ribas, R. (2019). Solving rubik\u2019s cube with a robot hand. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Chen, M., Li, Y., Dai, Z., Zhang, T., Zhou, Y., and Wang, H. (2025). A Robust Multi-Domain Adaptive Anti-Jamming Communication System for a UAV Swarm in Urban ITS Traffic Monitoring via Multi-Agent Deep Deterministic Policy Gradient. IEEE Trans. Intell. Transp. Syst., 1\u201317.","DOI":"10.1109\/TITS.2025.3584216"},{"key":"ref_5","unstructured":"Ng, A.Y., and Russell, S. (July, January 29). Algorithms for inverse reinforcement learning. Proceedings of the ICML, Stanford, CA, USA."},{"key":"ref_6","unstructured":"Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S.J., and Dragan, A. (2017). Inverse reward design. Adv. Neural Inf. Process. Syst., 30."},{"key":"ref_7","unstructured":"Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. (2022, January 18\u201321). Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. 
{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. (2023). Code as policies: Language model programs for embodied control. arXiv.","DOI":"10.1109\/ICRA48891.2023.10160591"},
{"key":"ref_9","unstructured":"Yu, W., Gileadi, N., Fu, C., Kirmani, S., Lee, K.H., Arenas, M.G., Chiang, H.T.L., Erez, T., Hasenclever, L., and Humplik, J. (2023). Language to rewards for robotic skill synthesis. arXiv."},
{"key":"ref_10","unstructured":"Ho, J., and Ermon, S. (2016). Generative adversarial imitation learning. Adv. Neural Inf. Process. Syst., 29."},
{"key":"ref_11","unstructured":"Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. Adv. Neural Inf. Process. Syst., 30."},
{"key":"ref_12","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},
{"key":"ref_13","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3571730","article-title":"Survey of hallucination in natural language generation","volume":"55","author":"Ji","year":"2023","journal-title":"ACM Comput. Surv."},
{"key":"ref_14","unstructured":"Zhang, M., Press, O., Merrill, W., Liu, A., and Smith, N.A. (2023). How language model hallucinations can snowball. arXiv."},
{"key":"ref_15","unstructured":"Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Man\u00e9, D. (2016). Concrete problems in AI safety. arXiv."},
{"key":"ref_16","doi-asserted-by":"crossref","first-page":"580","DOI":"10.1038\/s41586-020-03157-9","article-title":"First return, then explore","volume":"590","author":"Ecoffet","year":"2021","journal-title":"Nature"},
{"key":"ref_17","unstructured":"Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., and Hochreiter, S. (2019). Rudder: Return decomposition for delayed rewards. Adv. Neural Inf. Process. Syst., 32."},
{"key":"ref_18","unstructured":"Xu, Z., van Hasselt, H.P., and Silver, D. (2018). Meta-gradient reinforcement learning. Adv. Neural Inf. Process. Syst., 31."},
{"key":"ref_19","doi-asserted-by":"crossref","first-page":"681","DOI":"10.1109\/TCDS.2023.3286465","article-title":"Supervised meta-reinforcement learning with trajectory optimization for manipulation tasks","volume":"16","author":"Wang","year":"2023","journal-title":"IEEE Trans. Cogn. Dev. Syst."},
{"key":"ref_20","unstructured":"Zhang, X., Ruan, J., Ma, X., Zhu, Y., Chen, J., Zeng, K., and Cai, X. (2023). Reasoner for real-world event detection: Scaling reinforcement learning via adaptive perplexity-aware sampling strategy. arXiv."},
{"key":"ref_21","doi-asserted-by":"crossref","first-page":"103829","DOI":"10.1016\/j.artint.2022.103829","article-title":"Reward (mis)design for autonomous driving","volume":"316","author":"Knox","year":"2023","journal-title":"Artif. Intell."},
{"key":"ref_22","unstructured":"Ng, A.Y., Harada, D., and Russell, S. (1999, June 27\u201330). Policy invariance under reward transformations: Theory and application to reward shaping. Proceedings of the ICML, Bled, Slovenia."},
{"key":"ref_23","unstructured":"Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., and Henighan, T. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv."},
{"key":"ref_24","unstructured":"Pan, A., Bhatia, K., and Steinhardt, J. (2022). The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv."},
{"key":"ref_25","first-page":"20208","article-title":"Interpretable reward redistribution in reinforcement learning: A causal approach","volume":"36","author":"Zhang","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},
{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. (2018, February 2\u20137). Counterfactual multi-agent policy gradients. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11794"},
{"key":"ref_27","unstructured":"Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., and Hausman, K. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv."},
{"key":"ref_28","unstructured":"Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2023, May 1\u20135). ReAct: Synergizing reasoning and acting in language models. Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda."},
{"key":"ref_29","first-page":"8634","article-title":"Reflexion: Language agents with verbal reinforcement learning","volume":"36","author":"Shinn","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},
{"key":"ref_30","unstructured":"Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. (2023). Voyager: An open-ended embodied agent with large language models. arXiv."},
{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., and Lin, Y. (2024). A survey on large language model based autonomous agents. Front. Comput. Sci., 18.","DOI":"10.1007\/s11704-024-40231-1"},
{"key":"ref_32","unstructured":"Zeng, F., Gan, W., Wang, Y., Liu, N., and Yu, P.S. (2023). Large language models for robotics: A survey. arXiv."},
{"key":"ref_33","unstructured":"Kwon, M., Xie, S.M., Bullard, K., and Sadigh, D. (2023). Reward design with language models. arXiv."},
{"key":"ref_34","unstructured":"Ma, Y.J., Liang, W., Wang, G., Huang, D.A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A. (2023). Eureka: Human-level reward design via coding large language models. arXiv."},
{"key":"ref_35","unstructured":"Qu, Y., Jiang, Y., Wang, B., Mao, Y., Wang, C., Liu, C., and Ji, X. (2025, February 25\u2013March 4). Latent reward: LLM-empowered credit assignment in episodic reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA."},
{"key":"ref_36","unstructured":"Puterman, M.L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons."},
{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Sutton, R.S., and Barto, A.G. (1998). Reinforcement Learning: An Introduction, MIT Press.","DOI":"10.1109\/TNN.1998.712192"},
{"key":"ref_38","unstructured":"Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2018). Exploration by random network distillation. arXiv."},
{"key":"ref_39","unstructured":"Badia, A.P., Sprechmann, P., Vitvitskyi, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., and Bolt, A. (2020). Never give up: Learning directed exploration strategies. arXiv."},
{"key":"ref_40","unstructured":"Raileanu, R., and Rockt\u00e4schel, T. (2020). RIDE: Rewarding impact-driven exploration for procedurally-generated environments. arXiv."},
{"key":"ref_41","unstructured":"Barreto, A., Dabney, W., Munos, R., Hunt, J.J., Schaul, T., van Hasselt, H.P., and Silver, D. (2017). Successor features for transfer in reinforcement learning. Adv. Neural Inf. Process. Syst., 30."},
{"key":"ref_42","first-page":"10026","article-title":"Compositional reinforcement learning from logical specifications","volume":"34","author":"Jothimurugan","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},
{"key":"ref_43","unstructured":"Ziebart, B.D., Maas, A.L., Bagnell, J.A., and Dey, A.K. (2008, July 13\u201317). Maximum entropy inverse reinforcement learning. Proceedings of the AAAI, Chicago, IL, USA."},
{"key":"ref_44","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},
{"key":"ref_45","unstructured":"Zhang, H., Sun, K., Xu, B., Kong, L., and M\u00fcller, M. (2021). A distance-based anomaly detection framework for deep reinforcement learning. arXiv."},
{"key":"ref_46","unstructured":"M\u00fcller, R., Illium, S., Phan, T., Haider, T., and Linnhoff-Popien, C. (2022, May 9\u201313). Towards anomaly detection in reinforcement learning. Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, Virtual."},
{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Sorrenti, A., Bellitto, G., Salanitri, F.P., Pennisi, M., Spampinato, C., and Palazzo, S. (2023, October 1\u20136). Selective freezing for efficient continual learning. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCVW60793.2023.00381"},
{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Todorov, E., Erez, T., and Tassa, Y. (2012, October 7\u201312). MuJoCo: A physics engine for model-based control. Proceedings of the 2012 IEEE\/RSJ International Conference on Intelligent Robots and Systems, Algarve, Portugal.","DOI":"10.1109\/IROS.2012.6386109"},
{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Efroni, Y., Merlis, N., and Mannor, S. (2021, February 2\u20139). Reinforcement learning with trajectory feedback. Proceedings of the AAAI Conference on Artificial Intelligence, Virtually.","DOI":"10.1609\/aaai.v35i8.16895"},
{"key":"ref_50","first-page":"822","article-title":"Learning guidance rewards with trajectory-space smoothing","volume":"33","author":"Gangwani","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},
{"key":"ref_51","unstructured":"Alemi, A.A., Fischer, I., Dillon, J.V., and Murphy, K. (2016). Deep variational information bottleneck. arXiv."},
{"key":"ref_52","unstructured":"Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv."},
{"key":"ref_53","unstructured":"Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018, July 10\u201315). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the International Conference on Machine Learning, Stockholm, Sweden."}
],
"container-title":["Systems"],"original-title":[],"language":"en",
"link":[{"URL":"https:\/\/www.mdpi.com\/2079-8954\/13\/12\/1124\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],
"deposited":{"date-parts":[[2025,12,18]],"date-time":"2025-12-18T05:16:09Z","timestamp":1766034969000},
"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2079-8954\/13\/12\/1124"}},
"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,16]]},
"references-count":53,
"journal-issue":{"issue":"12","published-online":{"date-parts":[[2025,12]]}},
"alternative-id":["systems13121124"],
"URL":"https:\/\/doi.org\/10.3390\/systems13121124",
"relation":{},"ISSN":["2079-8954"],"issn-type":[{"type":"electronic","value":"2079-8954"}],"subject":[],
"published":{"date-parts":[[2025,12,16]]}}}
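
The abstract above describes a dual-loop control flow: an inner loop fits a reward-synthesis model against sparse ground-truth outcomes, while an outer loop watches diagnostic metrics and requests targeted edits from an LLM when it sees a pathological signature. The toy sketch below illustrates only that control flow and is not the authors' implementation: rollout, factors_v1/factors_v2, llm_revise, and the 0.9 correlation threshold are hypothetical stand-ins invented for illustration, the outer-loop "LLM call" is a stub that swaps in a revised factor set, and a linear least-squares fit with a correlation diagnostic stands in for the paper's reward-synthesis network and learning-dynamics monitors.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(T=50):
    """Toy episode: scalar states; the sparse ground-truth outcome is the
    episodic return, revealed only at the end of the trajectory."""
    states = rng.normal(size=T)
    return states, float(states.sum())

def factors_v1(s):
    # Deliberately flawed initial factors (mimicking a priori
    # misspecification): a noise-like term and a near-constant term.
    return np.array([np.cos(50.0 * s), s ** 2])

def factors_v2(s):
    # Revised factors, standing in for the LLM's targeted structural edit.
    return np.array([s, np.tanh(s)])

def llm_revise(diagnostics):
    """Stub for the outer-loop LLM call; in MIRA this would return edited
    reward code rather than a hard-coded alternative."""
    return factors_v2

def inner_loop(factors, n_episodes=200):
    """Fit linear synthesis weights w so the summed shaped reward matches
    the sparse episodic outcome; return an alignment diagnostic."""
    X, y = [], []
    for _ in range(n_episodes):
        states, ret = rollout()
        X.append(np.stack([factors(s) for s in states]).sum(axis=0))
        y.append(ret)
    X, y = np.array(X), np.array(y)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # reward-synthesis fit
    corr = np.corrcoef(X @ w, y)[0, 1]          # diagnostic metric
    return w, corr

factors = factors_v1
for t in range(3):                               # outer loop
    w, corr = inner_loop(factors)
    print(f"round {t}: shaped-return/outcome correlation = {corr:.3f}")
    if corr < 0.9:                               # pathological signature
        factors = llm_revise({"corr": corr})     # request a structural edit
    else:
        break
```

Under these invented assumptions, the first round's alignment is near zero, the stubbed revision pushes it near 1, and the outer loop halts, which is the qualitative signature the abstract attributes to MIRA (flawed specification detected, edited, and corrected) rather than a reproduction of its results.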