{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T15:10:09Z","timestamp":1753888209357,"version":"3.41.2"},"reference-count":52,"publisher":"Wiley","issue":"1","license":[{"start":{"date-parts":[[2023,4,18]],"date-time":"2023-04-18T00:00:00Z","timestamp":1681776000000},"content-version":"vor","delay-in-days":107,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2022ZD0116401"],"award-info":[{"award-number":["2022ZD0116401"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["International Journal of Intelligent Systems"],"published-print":{"date-parts":[[2023,1]]},"abstract":"<jats:p>Reinforcement learning (RL) with sparse and deceptive rewards is a significant challenge because nonzero rewards are rarely obtained, and hence, the gradient calculated by the agent can be stochastic and without valid information. Recent work demonstrates that using memory buffers of previous experiences can lead to a more efficient learning process. However, existing methods usually require these experiences to be successful and may overly exploit them, which can cause the agent to adopt suboptimal behaviors. This study develops an approach that exploits diverse past trajectories for faster and more efficient online RL, even if these trajectories are suboptimal or not highly rewarded. The proposed algorithm merges a policy improvement step with an additional policy exploration step by using offline demonstration data. The main contribution of this study is that by regarding diverse past trajectories as guidance, instead of imitating them, our method directs its policy to follow and expand past trajectories, while still being able to learn without rewards and gradually approach optimality. Furthermore, a novel diversity measurement is introduced to maintain the diversity of the team and regulate exploration. The proposed algorithm is evaluated on a series of discrete and continuous control tasks with sparse and deceptive rewards. 
The experimental results indicate that the proposed algorithm significantly outperforms existing baseline RL methods in terms of diverse exploration and avoidance of local optima.<\/jats:p>","DOI":"10.1155\/2023\/4705291","type":"journal-article","created":{"date-parts":[[2023,4,18]],"date-time":"2023-04-18T23:50:07Z","timestamp":1681861807000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Learning Diverse Policies with Soft Self\u2010Generated Guidance"],"prefix":"10.1155","volume":"2023","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7456-9844","authenticated-orcid":false,"given":"Guojian","family":"Wang","sequence":"first","affiliation":[]},{"given":"Faguo","family":"Wu","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4927-5016","authenticated-orcid":false,"given":"Xiao","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Jianxiang","family":"Liu","sequence":"additional","affiliation":[]}],"member":"311","published-online":{"date-parts":[[2023,4,18]]},"reference":[{"key":"e_1_2_12_1_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature14236"},{"key":"e_1_2_12_2_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature16961"},{"key":"e_1_2_12_3_2","unstructured":"LillicrapT. HuntJ. J. PritzelA. HeessN. ErezT. TassaY. SilverD. andWierstraD. Continuous control with deep reinforcement learning 2016 https:\/\/arxiv.org\/abs\/1509.02971."},{"key":"e_1_2_12_4_2","unstructured":"SchulmanJ. LevineS. AbbeelP. JordanM. andMoritzP. Trust region policy optimization Proceedings of the International Conference on Machine Learning July 2015 Lille France 1889\u20131897."},{"key":"e_1_2_12_5_2","unstructured":"FujimotoS. HoofH. andMegerD. Addressing function approximation error in actor-critic methods Proceedings of the International Conference on Machine Learning January 2018 Stockholm Sweden 1587\u20131596."},{"key":"e_1_2_12_6_2","doi-asserted-by":"crossref","unstructured":"NairA. McGrewB. AndrychowiczM. ZarembaW. andAbbeelP. Overcoming exploration in reinforcement learning with demonstrations Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA) May 2018 Brisbane Australia IEEE 6292\u20136299.","DOI":"10.1109\/ICRA.2018.8463162"},{"key":"e_1_2_12_7_2","article-title":"Vime: variational information maximizing exploration","volume":"29","author":"Houthooft R.","year":"2016","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_12_8_2","unstructured":"FlorensaC. DuanY. andAbbeelP. Stochastic neural networks for hierarchical reinforcement learning Proceedings of the International Conference on Learning Representations August 2017 Sydney Australia."},{"key":"e_1_2_12_9_2","unstructured":"OhJ. GuoY. SinghS. andLeeH. Self-imitation learning Proceedings of the International Conference on Machine Learning July 2018 Xi'an China PMLR 3878\u20133887."},{"key":"e_1_2_12_10_2","article-title":"Memory based trajectory-conditioned policies for learning from sparse rewards","volume":"33","author":"Guo Y.","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_12_11_2","unstructured":"SchaulT. QuanJ. AntonoglouI. andSilverD. Prioritized experience replay 2015 https:\/\/arxiv.org\/abs\/1511.05952."},{"key":"e_1_2_12_12_2","unstructured":"GangwaniT. LiuQ. andPengJ. 
Learning self-imitating diverse policies Proceedings of the International Conference on Learning Representations May 2019 New Orleans LA USA."},{"key":"e_1_2_12_13_2","unstructured":"OhJ. GuoY. SinghS. andLeeH. Generative adversarial self-imitation learning Proceedings of the International Conference on Learning Representations May 2019 Montreal Canada."},{"key":"e_1_2_12_14_2","unstructured":"LiangC. NorouziM. BerantJ. LeQ. V. andLaoN. Memory augmented policy optimization for program synthesis and semantic parsing Proceedings of the Advances in Neural Information Processing Systems December 2018 Montr\u00e9al Canada."},{"key":"e_1_2_12_15_2","unstructured":"LinZ. ZhaoT. YangG. andZhangL. Episodic memory deep q-networks 2018 https:\/\/arxiv.org\/abs\/1805.07603."},{"key":"e_1_2_12_16_2","unstructured":"BlundellC. UriaB. PritzelA. LiY. RudermanA. LeiboJ. Z. RaeJ. WierstraD. andHassabisD. Model-free episodic control 2016 https:\/\/arxiv.org\/abs\/1606.04460."},{"key":"e_1_2_12_17_2","unstructured":"PritzelA. UriaB. SrinivasanS. BadiaA. P. VinyalsO. HassabisD. WierstraD. andBlundellC. Neural episodic control Proceedings of the International Conference on Machine Learning May 2017 Sydney Australia PMLR 2827\u20132836."},{"key":"e_1_2_12_18_2","unstructured":"PengZ. SunH. andZhouB. Non-local policy optimization via diversity-regularized collaborative exploration 2020 https:\/\/arxiv.org\/abs\/2006.07781."},{"key":"e_1_2_12_19_2","unstructured":"ZhangY. YuW. andTurkG. Learning novel policies for tasks Proceedings of the International Conference on Machine Learning January 2019 Long Beach CA USA PMLR 7483\u20137492."},{"key":"e_1_2_12_20_2","unstructured":"SchulmanJ. WolskiF. DhariwalP. RadfordA. andKlimovO. Proximal policy optimization algorithms 2017 https:\/\/arxiv.org\/abs\/1707.06347."},{"key":"e_1_2_12_21_2","unstructured":"PlappertM. HouthooftR. DhariwalP. SidorS. ChenR. Y. ChenX. AsfourT. AbbeelP. andAndrychowiczM. Parameter space noise for exploration Proceedings of the International Conference on Learning Representations August 2018 Xi'an China."},{"key":"e_1_2_12_22_2","unstructured":"FortunatoM. AzarM. G. PiotB. MenickJ. HesselM. OsbandI. GravesA. MnihV. MunosR. andHassabisD. Noisy networks for exploration Proceedings of the International Conference on Learning Representations March 2018 Vancouver Canada."},{"key":"e_1_2_12_23_2","first-page":"1","article-title":"Deep exploration via randomized value functions","volume":"20","author":"Osband I.","year":"2019","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_12_24_2","doi-asserted-by":"crossref","unstructured":"PathakD. AgrawalP. EfrosA. A. andDarrellT. Curiosity-driven exploration by self-supervised prediction Proceedings of the International Conference on Machine Learning May 2017 Sydney Australia PMLR 2778\u20132787.","DOI":"10.1109\/CVPRW.2017.70"},{"key":"e_1_2_12_25_2","unstructured":"AchiamJ. andSastryS. Surprise-based intrinsic motivation for deep reinforcement learning 2017 https:\/\/arxiv.org\/abs\/1703.01732."},{"key":"e_1_2_12_26_2","unstructured":"SavinovN. RaichukA. VincentD. MarinierR. PollefeysM. LillicrapT. andGellyS. 
Episodic curiosity through reachability Proceedings of the International Conference on Learning Representations December 2019 Sydney Australia."},{"key":"e_1_2_12_27_2","article-title":"Diversity-driven exploration strategy for deep reinforcement learning","volume":"31","author":"Hong Z.-W.","year":"2018","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_12_28_2","unstructured":"HanS. andSungY. Diversity actor-critic: sample-aware entropy regularization for sample-efficient exploration Proceedings of the 38th International Conference on Machine Learning ICML 2021 August 2021 Sydney Australia PMLR 4018\u20134029."},{"key":"e_1_2_12_29_2","doi-asserted-by":"crossref","unstructured":"MasoodM. A. andDoshi-VelezF. Diversity-inducing policy gradient: using maximum mean discrepancy to find a set of diverse policies Proceedings of the International Joint Conferences on Artificial Intelligence Organization December 2019 Vancouver Canada.","DOI":"10.24963\/ijcai.2019\/821"},{"key":"e_1_2_12_30_2","unstructured":"MnihV. BadiaA. P. MirzaM. GravesA. LillicrapT. HarleyT. SilverD. andKavukcuogluK. Asynchronous methods for deep reinforcement learning Proceedings of the International Conference on Machine Learning June 2016 New York NY USA PMLR 1928\u20131937."},{"key":"e_1_2_12_31_2","unstructured":"SchmittS. HesselM. andSimonyanK. Off-policy actor-critic with shared experience replay Proceedings of the International Conference on Machine Learning May 2020 Vienna Austria PMLR 8545\u20138554."},{"key":"e_1_2_12_32_2","unstructured":"EspeholtL. SoyerH. MunosR. SimonyanK. MnihV. WardT. DoronY. FiroiuV. HarleyT. andDunningI. Impala: scalable distributed deep-rl with importance weighted actor-learner architectures Proceedings of the International Conference on Machine Learning June 2018 Stockholm Sweden PMLR."},{"key":"e_1_2_12_33_2","unstructured":"HesterT. Vecer\u00edkM. PietquinO. LanctotM. SchaulT. PiotB. HorganD. QuanJ. SendonarisA. OsbandI. Dulac-ArnoldG. AgapiouJ. P. LeiboJ. Z. andGruslysA. Deep q-learning from demonstrations 2018 https:\/\/arxiv.org\/abs\/1704.03732."},{"key":"e_1_2_12_34_2","unstructured":"BadiaA. P. SprechmannP. VitvitskyiA. GuoD. PiotB. KapturowskiS. TielemanO. ArjovskyM. PritzelA. andBoltA. Never give up: learning directed exploration strategies Proceedings of the International Conference on Learning Representations September 2020 Xi'an China."},{"key":"e_1_2_12_35_2","unstructured":"RossS. GordonG. andBagnellD. A reduction of imitation learning and structured prediction to no-regret online learning Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics January 2011 Fort Lauderdale FL USA 627\u2013635."},{"key":"e_1_2_12_36_2","article-title":"Algorithms for inverse reinforcement learning","volume":"1","author":"Ng A. Y.","year":"2000","journal-title":"ICML"},{"key":"e_1_2_12_37_2","first-page":"1433","article-title":"Maximum entropy inverse reinforcement learning","volume":"8","author":"Ziebart B. D.","year":"2008","journal-title":"AAAI"},{"key":"e_1_2_12_38_2","article-title":"Generative adversarial imitation learning","volume":"29","author":"Ho J.","year":"2016","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_12_39_2","article-title":"Policy gradient methods for reinforcement learning with function approximation","volume":"12","author":"Sutton R. 
S.","year":"1999","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_12_40_2","first-page":"513","article-title":"A kernel method for the two-sample-problem","volume":"19","author":"Gretton A.","year":"2006","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_12_41_2","unstructured":"GrettonA. SejdinovicD. StrathmannH. BalakrishnanS. PontilM. FukumizuK. andSriperumbudurB. K. Optimal kernel choice for large-scaletwo-sample tests Proceedings of the Advances in Neural Information Processing Systems December 2012 Lake Tahoe Nevada Citeseer 1205\u20131213."},{"key":"e_1_2_12_42_2","unstructured":"ThomasP. SilvaB. C. DannC. andBrunskillE. Energetic natural gradient descent Proceedings of the International Conference on Machine Learning June 2016 New York NY USA PMLR 2887\u20132895."},{"key":"e_1_2_12_43_2","unstructured":"DziugaiteG. K. RoyD. M. andGhahramaniZ. Training generative neural networks via maximum mean discrepancy optimization Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence July 2015 Amsterdam Netherlands 258\u2013267."},{"key":"e_1_2_12_44_2","doi-asserted-by":"publisher","DOI":"10.24033\/asens.1013"},{"key":"e_1_2_12_45_2","unstructured":"HaarnojaT. ZhouA. AbbeelP. andLevineS. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor Proceedings of the International Conference on Machine Learning January 2018 Stockholm Sweden PMLR 1861\u20131870."},{"key":"e_1_2_12_46_2","unstructured":"LoweR. TamarA. HarbJ. Pieter AbbeelO. andMordatchI. Multi-agentactor-critic for mixed cooperative-competitive environments Proceedings of the Advances in Neural Information Processing Systems June 2017 Long Beach CA USA."},{"key":"e_1_2_12_47_2","unstructured":"DuanY. ChenX. HouthooftR. SchulmanJ. andAbbeelP. Benchmarking deep reinforcement learning for continuous control Proceedings of the International Conference on Machine Learning May 2016 New York NY USA PMLR 1329\u20131338."},{"key":"e_1_2_12_48_2","unstructured":"ContiE. MadhavanV. SuchF. P. LehmanJ. StanleyK. O. andCluneJ. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents Proceedings of the 32nd International Conference on Neural Information Processing Systems December 2018 Montr Canada 5032\u20135043."},{"key":"e_1_2_12_49_2","unstructured":"EysenbachB. GuptaA. IbarzJ. andLevineS. Diversity is all you need: learning skills without a reward function Proceedings of theInternational Conference on Learning Representations December 2018 New York NY USA."},{"volume-title":"Open AI Gym","year":"2016","author":"Brockman G.","key":"e_1_2_12_50_2"},{"key":"e_1_2_12_51_2","doi-asserted-by":"crossref","unstructured":"TodorovE. ErezT. andTassaY. Mujoco: a physics engine for model-based control Proceedings of the 2012 IEEE\/RSJ International Conference on Intelligent Robots and Systems October 2012 Vilamoura-Algarve Portugal IEEE 5026\u20135033.","DOI":"10.1109\/IROS.2012.6386109"},{"key":"e_1_2_12_52_2","unstructured":"HeessN. TbD. SriramS. LemmonJ. MerelJ. WayneG. TassaY. ErezT. WangZ. andEslamiS. 
Emergence of locomotion behaviours in rich environments 2017 https:\/\/arxiv.org\/abs\/1707.02286."}],"container-title":["International Journal of Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/downloads.hindawi.com\/journals\/ijis\/2023\/4705291.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/downloads.hindawi.com\/journals\/ijis\/2023\/4705291.xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1155\/2023\/4705291","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,31]],"date-time":"2024-12-31T05:40:41Z","timestamp":1735623641000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1155\/2023\/4705291"}},"subtitle":[],"editor":[{"given":"Tao","family":"Li","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2023,1]]},"references-count":52,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1]]}},"alternative-id":["10.1155\/2023\/4705291"],"URL":"https:\/\/doi.org\/10.1155\/2023\/4705291","archive":["Portico"],"relation":{},"ISSN":["0884-8173","1098-111X"],"issn-type":[{"type":"print","value":"0884-8173"},{"type":"electronic","value":"1098-111X"}],"subject":[],"published":{"date-parts":[[2023,1]]},"assertion":[{"value":"2022-12-14","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-21","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-04-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"4705291"}}