{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:06:46Z","timestamp":1760058406504,"version":"build-2065373602"},"reference-count":35,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,3,30]],"date-time":"2025-03-30T00:00:00Z","timestamp":1743292800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>This paper studies value function transfer within reinforcement learning frameworks, focusing on tasks continuously assigned to an agent through a probabilistic distribution. Specifically, we focus on environments characterized by sparse rewards with a terminal goal. Initially, we propose and theoretically demonstrate that the distribution of the computed value function from such environments, whether in cases where the goals or the dynamics are changing across tasks, can be reformulated as the distribution of the number of steps to the goal generated by their optimal policies, which we name the expected optimal path length. To test our propositions, we hypothesize that the distribution of the expected optimal path lengths resulting from the task distribution is normal. This claim leads us to propose that if the distribution is normal, then the distribution of the value function follows a log-normal pattern. Leveraging this insight, we introduce \u201cLogQInit\u201d as a novel value function transfer method, based on the properties of log-normality. Finally, we run experiments on a scenario of goals and dynamics distributions, validate our proposition by providing an a dequate analysis of the results, and demonstrate that LogQInit outperforms existing methods of value function initialization, policy transfer, and reward shaping.<\/jats:p>","DOI":"10.3390\/e27040367","type":"journal-article","created":{"date-parts":[[2025,3,31]],"date-time":"2025-03-31T05:10:07Z","timestamp":1743397807000},"page":"367","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["From Task Distributions to Expected Paths Lengths Distributions: Value Function Initialization in Sparse Reward Environments for Lifelong Reinforcement Learning"],"prefix":"10.3390","volume":"27","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-4108-5191","authenticated-orcid":false,"given":"Soumia","family":"Mehimeh","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Harbin Institute of Technology, 92 West Dazhi Street, Nangang District, Harbin 150001, China"}]},{"given":"Xianglong","family":"Tang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Harbin Institute of Technology, 92 West Dazhi Street, Nangang District, Harbin 150001, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,3,30]]},"reference":[{"key":"ref_1","unstructured":"Sutton, R., and Barto, A. (2018). Reinforcement Learning: An Introduction, MIT Press."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1401","DOI":"10.1613\/jair.1.13673","article-title":"Towards continual reinforcement learning: A review and perspectives","volume":"75","author":"Khetarpal","year":"2022","journal-title":"J. Artif. Intell. 
Res."},{"key":"ref_3","first-page":"1633","article-title":"Transfer learning for reinforcement learning domains: A survey","volume":"10","author":"Taylor","year":"2009","journal-title":"J. Mach. Learn. Res."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"13344","DOI":"10.1109\/TPAMI.2023.3292075","article-title":"Transfer learning in deep reinforcement learning: A survey","volume":"45","author":"Zhu","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Bi, Z., Guo, X., Wang, J., Qin, S., and Liu, G. (2023). Deep Reinforcement Learning for Truck-Drone Delivery Problem. Drones, 7.","DOI":"10.3390\/drones7070445"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3459991","article-title":"A survey of reinforcement learning algorithms for dynamically varying environments","volume":"54","author":"Padakandla","year":"2021","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Ojea, J., Solowjow, E., and Levine, S. (2019, January 20\u201324). Residual reinforcement learning for robot control. Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada.","DOI":"10.1109\/ICRA.2019.8794127"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"153171","DOI":"10.1109\/ACCESS.2021.3126658","article-title":"Crossing the reality gap: A survey on sim-to-real transferability of robot controllers in reinforcement learning","volume":"9","author":"Salvato","year":"2021","journal-title":"IEEE Access"},{"key":"ref_9","first-page":"216","article-title":"Edge-enhanced attentions for drone delivery in presence of winds and recharging stations","volume":"20","author":"Liu","year":"2023","journal-title":"J. Aerosp. Inf. Syst."},{"key":"ref_10","first-page":"20","article-title":"Policy and value transfer in lifelong reinforcement learning","volume":"80","author":"Abel","year":"2018","journal-title":"Int. Conf. Mach. Learn."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"111036","DOI":"10.1016\/j.knosys.2023.111036","article-title":"Value function optimistic initialization with uncertainty and confidence awareness in lifelong reinforcement learning","volume":"280","author":"Mehimeh","year":"2023","journal-title":"Knowl.-Based Syst."},{"key":"ref_12","unstructured":"Uchendu, I., Xiao, T., Lu, Y., Zhu, B., Yan, M., Simon, J., Bennice, M., Fu, C., Ma, C., and Jiao, J. (2022). Jump-start reinforcement learning. arXiv."},{"key":"ref_13","first-page":"2413","article-title":"Reinforcement Learning in Finite MDPs: PAC Analysis","volume":"10","author":"Strehl","year":"2009","journal-title":"J. Mach. Learn. Res."},{"key":"ref_14","first-page":"36626","article-title":"Offline meta reinforcement learning with in-distribution online adaptation","volume":"202","author":"Wang","year":"2023","journal-title":"Int. Conf. Mach. Learn."},{"key":"ref_15","first-page":"25856","article-title":"Distributionally Adaptive Meta Reinforcement Learning","volume":"35","author":"Ajay","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_16","first-page":"5331","article-title":"Efficient off-policy meta-reinforcement learning via probabilistic context variables","volume":"97","author":"Rakelly","year":"2019","journal-title":"Int. Conf. Mach. 
Learn."},{"key":"ref_17","unstructured":"Brys, T., Harutyunyan, A., Taylor, M., and Now\u00e9, A. (2015, January 4\u20138). Policy Transfer using Reward Shaping. Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, Istanbul, Turkey."},{"key":"ref_18","unstructured":"D\u2019Eramo, C., Tateo, D., Bonarini, A., Restelli, M., and Peters, J. (2024). Sharing knowledge in multi-task deep reinforcement learning. arXiv."},{"key":"ref_19","first-page":"4936","article-title":"Importance weighted transfer of samples in reinforcement learning","volume":"80","author":"Tirinzoni","year":"2018","journal-title":"Int. Conf. Mach. Learn."},{"key":"ref_20","unstructured":"Agrawal, P., and Agrawal, S. (2024). Optimistic Q-learning for average reward and episodic reinforcement learning. arXiv."},{"key":"ref_21","first-page":"7612","article-title":"Optimistic initialization for exploration in continuous control","volume":"36","author":"Lobel","year":"2022","journal-title":"Proc. Aaai Conf. Artif. Intell."},{"key":"ref_22","first-page":"1065","article-title":"Using bisimulation for policy transfer in MDPs","volume":"24","author":"Castro","year":"2010","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Lecarpentier, E., Abel, D., Asadi, K., Jinnai, Y., Rachelson, E., and Littman, M. (2020). Lipschitz lifelong reinforcement learning. arXiv.","DOI":"10.1609\/aaai.v35i9.17006"},{"key":"ref_24","first-page":"28311","article-title":"Understanding and addressing the pitfalls of bisimulation-based representations in offline reinforcement learning","volume":"36","author":"Zang","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_25","first-page":"8500","article-title":"Goal-conditioned Q-learning as knowledge distillation","volume":"37","author":"Levine","year":"2023","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Liu, M., Zhu, M., and Zhang, W. (2022). Goal-conditioned reinforcement learning: Problems and solutions. arXiv.","DOI":"10.24963\/ijcai.2022\/770"},{"key":"ref_27","first-page":"1","article-title":"Hindsight experience replay","volume":"30","author":"Andrychowicz","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_28","first-page":"1401","article-title":"Learning goal-conditioned policies offline with self-supervised reward shaping","volume":"205","author":"Mezghani","year":"2023","journal-title":"Conf. Robot. Learn."},{"key":"ref_29","first-page":"1146","article-title":"Reinforcement learning with Demonstrations from Mismatched Task under Sparse Reward","volume":"205","author":"Guo","year":"2023","journal-title":"Conf. Robot. Learn."},{"key":"ref_30","first-page":"11210","article-title":"Learning task-distribution reward shaping with meta-learning","volume":"35","author":"Zou","year":"2021","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"847","DOI":"10.1613\/jair.1.13326","article-title":"Computational benefits of intermediate rewards for goal-reaching policy learning","volume":"73","author":"Zhai","year":"2022","journal-title":"J. Artif. Intell. Res."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Zhu, T., Qiu, Y., Zhou, H., and Li, J. (2023, January 19\u201325). Towards Long-delayed Sparsity: Learning a Better Transformer through Reward Redistribution. 
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23), Macao, China.","DOI":"10.24963\/ijcai.2023\/522"},{"key":"ref_33","first-page":"449","article-title":"A distributional perspective on reinforcement learning","volume":"70","author":"Bellemare","year":"2017","journal-title":"Int. Conf. Mach. Learn."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Li, S., and Zhang, C. (2018, January 2\u20137). An optimal online method of selecting source policies for reinforcement learning. Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, LO, USA.","DOI":"10.1609\/aaai.v32i1.11718"},{"key":"ref_35","unstructured":"Moore, A. (1990). Efficient Memory-Based Learning for Robot Control, University of Cambridge, Computer Laboratory."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/27\/4\/367\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:06:00Z","timestamp":1760029560000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/27\/4\/367"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,30]]},"references-count":35,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,4]]}},"alternative-id":["e27040367"],"URL":"https:\/\/doi.org\/10.3390\/e27040367","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2025,3,30]]}}}
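The abstract's central step (normal expected optimal path lengths imply a log-normal value function) follows from the sparse-terminal-reward setting: if reaching the goal yields a single reward and the return is discounted, the optimal value scales as V* = γ^L for path length L, so log V* = L log γ is a linear function of a normal variable and V* is log-normal by definition. The sketch below illustrates this numerically; it is a minimal illustration assuming a terminal reward of 1 and the γ^L convention, with gamma, mu, and sigma as hypothetical values, and the closed-form mean E[γ^L] is just one candidate initialization statistic, not necessarily the paper's exact LogQInit estimator.

```python
import numpy as np

# Assumed setup (not from the paper's text): each task gives a single
# reward of 1 at the goal, so the start-state optimal value is
# V* = gamma ** L, where L is the expected optimal path length drawn
# from the task distribution.
rng = np.random.default_rng(0)
gamma = 0.95
mu, sigma = 40.0, 6.0  # hypothetical mean/std of optimal path lengths

# If L ~ Normal(mu, sigma^2), then log V = L * log(gamma) is normal,
# hence V = gamma ** L is log-normal.
L = rng.normal(mu, sigma, size=100_000)
V = gamma ** L

# Log-normal closed form: E[gamma^L] = exp(mu*ln(gamma) + 0.5*(sigma*ln(gamma))^2).
log_g = np.log(gamma)
v_init = np.exp(mu * log_g + 0.5 * (sigma * log_g) ** 2)

print(f"empirical  E[V] = {V.mean():.4f}")
print(f"analytical E[V] = {v_init:.4f}")  # candidate value for initializing Q
```

With these numbers the empirical and analytical means agree closely, which is the log-normal structure the abstract says LogQInit exploits when initializing value functions for new tasks.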