{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,5]],"date-time":"2025-11-05T06:55:33Z","timestamp":1762325733223},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2023,10,11]],"date-time":"2023-10-11T00:00:00Z","timestamp":1696982400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,10,11]],"date-time":"2023-10-11T00:00:00Z","timestamp":1696982400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"National Key Research and Development Project Grant","award":["2018AAA01008-02"],"award-info":[{"award-number":["2018AAA01008-02"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2024,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Deep reinforcement learning has always been used to solve high-dimensional complex sequential decision problems. However, one of the biggest challenges for reinforcement learning is sample efficiency, especially for high-dimensional complex problems. Model-based reinforcement learning can solve the problem with a learned world model, but the performance is limited by the imperfect world model, so it usually has worse approximate performance than model-free reinforcement learning. In this paper, we propose a novel model-based reinforcement learning algorithm called World Model with Trajectory Discrimination (WMTD). We learn the representation of temporal dynamics information by adding a trajectory discriminator to the world model, and then compute the weight of state value estimation based on the trajectory discriminator to optimize the policy. Specifically, we augment the trajectories to generate negative samples and train a trajectory discriminator that shares the feature extractor with the world model. Experimental results demonstrate that our method improves the sample efficiency and achieves state-of-the-art performance on DeepMind control tasks.<\/jats:p>","DOI":"10.1007\/s40747-023-01247-5","type":"journal-article","created":{"date-parts":[[2023,10,11]],"date-time":"2023-10-11T13:01:49Z","timestamp":1697029309000},"page":"1927-1936","update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Data-efficient model-based reinforcement learning with trajectory discrimination"],"prefix":"10.1007","volume":"10","author":[{"given":"Tuo","family":"Qu","sequence":"first","affiliation":[]},{"given":"Fuqing","family":"Duan","sequence":"additional","affiliation":[]},{"given":"Junge","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Bo","family":"Zhao","sequence":"additional","affiliation":[]},{"given":"Wenzhen","family":"Huang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,10,11]]},"reference":[{"key":"1247_CR1","unstructured":"Buckman J, Hafner D, Tucker G, et\u00a0al (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. Adv Neural Inf Process Syst 31"},{"key":"1247_CR2","first-page":"9912","volume":"33","author":"M Caron","year":"2020","unstructured":"Caron M, Misra I, Mairal J et al (2020) Unsupervised learning of visual features by contrasting cluster assignments. 
Adv Neural Inf Process Syst 33:9912\u20139924","journal-title":"Adv Neural Inf Process Syst"},{"key":"1247_CR3","doi-asserted-by":"crossref","unstructured":"Caron M, Touvron H, Misra I, et\u00a0al (2021) Emerging properties in self-supervised vision transformers. Proceedings of the IEEE\/CVF international conference on computer vision. 9650-9660","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"1247_CR4","doi-asserted-by":"crossref","unstructured":"Choi H, Lee H, Song W, et\u00a0al (2023) Local-Guided Global: Paired Similarity Representation for Visual Reinforcement Learning.Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 15072-15082","DOI":"10.1109\/CVPR52729.2023.01447"},{"key":"1247_CR5","unstructured":"Chua K, Calandra R, McAllister R, et\u00a0al (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Adv Neural Inf Process Syst 31"},{"key":"1247_CR6","first-page":"14156","volume":"33","author":"S Curi","year":"2020","unstructured":"Curi S, Berkenkamp F, Krause A (2020) Efficient model-based reinforcement learning through optimistic policy search and planning. Adv Neural Inf Process Syst 33:14156\u201314170","journal-title":"Adv Neural Inf Process Syst"},{"key":"1247_CR7","unstructured":"Deng F, Jang I, Ahn S (2022) Dreamerpro: Reconstruction-free model-based reinforcement learning with prototypical representations. International Conference on Machine Learning. PMLR, 4956-4975"},{"key":"1247_CR8","unstructured":"Feinberg V, Wan A, Stoica I, et\u00a0al (2018) Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101"},{"key":"1247_CR9","unstructured":"Ghosh P, Sajjadi MSM, Vergari A, et\u00a0al (2019) From variational to deterministic autoencoders. arXiv preprint arXiv:1903.12436"},{"key":"1247_CR10","unstructured":"Haarnoja T, Zhou A, Hartikainen K, et\u00a0al (2018) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905"},{"key":"1247_CR11","unstructured":"Hafner D, Lillicrap T, Ba J, et\u00a0al (2019) Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603"},{"key":"1247_CR12","unstructured":"Hafner D, Lillicrap T, Fischer I, et\u00a0al (2019) Learning latent dynamics for planning from pixels. International conference on machine learning. PMLR, 2555-2565"},{"key":"1247_CR13","unstructured":"Ha D, Schmidhuber J (2018) Recurrent world models facilitate policy evolution. Adv Neural Inf Process Syst 31"},{"issue":"13","key":"1247_CR14","doi-asserted-by":"publisher","first-page":"6939","DOI":"10.1016\/j.jfranklin.2022.06.043","volume":"359","author":"P He","year":"2022","unstructured":"He P, Wen J, Stojanovic V et al (2022) Finite-time control of discrete-time semi-Markov jump linear systems: A self-triggered MPC approach. J Frankl Inst 359(13):6939\u20136957","journal-title":"J Frankl Inst"},{"key":"1247_CR15","first-page":"1612","volume":"35","author":"AK Jain","year":"2022","unstructured":"Jain AK, Sujit S, Joshi S et al (2022) Learning Robust Dynamics through Variational Sparse Gating. Adv Neural Inf Process Syst 35:1612\u20131626","journal-title":"Adv Neural Inf Process Syst"},{"key":"1247_CR16","unstructured":"Janner M, Fu J, Zhang M, et\u00a0al (2019) When to trust your model: Model-based policy optimization. Adv Neural Inf Process Syst, 32"},{"key":"1247_CR17","unstructured":"Kingma DP, Welling M (2013) Auto-encoding variational bayes. 
arXiv preprint arXiv:1312.6114"},{"key":"1247_CR18","unstructured":"Kostrikov I, Yarats D, Fergus R (2020) Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649"},{"key":"1247_CR19","unstructured":"Kurutach T, Clavera I, Duan Y, et\u00a0al (2018) Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592"},{"key":"1247_CR20","unstructured":"Laskin M, Srinivas A, Abbeel P (2020) Curl: Contrastive unsupervised representations for reinforcement learning. International Conference on Machine Learning. PMLR, 5639-5650"},{"key":"1247_CR21","first-page":"741","volume":"33","author":"AX Lee","year":"2020","unstructured":"Lee AX, Nagabandi A, Abbeel P et al (2020) Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Adv Neural Inf Process Syst 33:741\u2013752","journal-title":"Adv Neural Inf Process Syst"},{"key":"1247_CR22","unstructured":"Luo Y, Xu H, Li Y, et\u00a0al (2018) Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858"},{"key":"1247_CR23","unstructured":"Micheli V, Alonso E, Fleuret F (2022) Transformers are sample efficient world models. arXiv preprint arXiv:2209.00588"},{"key":"1247_CR24","unstructured":"Oh J, Singh S, Lee H (2017) Value prediction network. Adv Neural Inf Process Syst 30"},{"key":"1247_CR25","unstructured":"Oord A, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv:1807.03748"},{"key":"1247_CR26","first-page":"10537","volume":"33","author":"F Pan","year":"2020","unstructured":"Pan F, He J, Tu D et al (2020) Trust the model when it is confident: Masked model-based actor-critic. Advances in neural information processing systems 33:10537\u201310546","journal-title":"Advances in neural information processing systems"},{"issue":"7839","key":"1247_CR27","doi-asserted-by":"publisher","first-page":"604","DOI":"10.1038\/s41586-020-03051-4","volume":"588","author":"J Schrittwieser","year":"2020","unstructured":"Schrittwieser J, Antonoglou I, Hubert T et al (2020) Mastering atari, go, chess and shogi by planning with a learned model. Nature 588(7839):604\u2013609","journal-title":"Nature"},{"key":"1247_CR28","unstructured":"Schwarzer M, Anand A, Goel R, et\u00a0al (2020) Data-efficient reinforcement learning with self-predictive representations. arXiv preprint arXiv:2007.05929"},{"key":"1247_CR29","doi-asserted-by":"crossref","unstructured":"Song F, Xing H, Wang X, et\u00a0al (2022) Evolutionary multi-objective reinforcement learning based trajectory control and task offloading in UAV-assisted mobile edge computing. IEEE Trans Mobile Comput","DOI":"10.1109\/TMC.2022.3208457"},{"issue":"4","key":"1247_CR30","doi-asserted-by":"publisher","first-page":"160","DOI":"10.1145\/122344.122377","volume":"2","author":"RS Sutton","year":"1991","unstructured":"Sutton RS (1991) Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bullet 2(4):160\u2013163","journal-title":"ACM Sigart Bullet"},{"key":"1247_CR31","doi-asserted-by":"crossref","unstructured":"Talvitie E (2017) Self-correcting models for model-based reinforcement learning. Proc AAAI Conf Artif Intell 31(1)","DOI":"10.1609\/aaai.v31i1.10850"},{"key":"1247_CR32","unstructured":"Tassa Y, Doron Y, Muldal A, et\u00a0al (2018) Deepmind control suite. 
arXiv preprint arXiv:1801.00690"},{"key":"1247_CR33","doi-asserted-by":"crossref","unstructured":"Venkatraman A, Hebert M, Bagnell J (2015) Improving multi-step prediction of learned time series models. Proc AAAI Conf Artif Intell 29(1)","DOI":"10.1609\/aaai.v29i1.9590"},{"issue":"12","key":"1247_CR34","first-page":"10674","volume":"35","author":"D Yarats","year":"2021","unstructured":"Yarats D, Zhang A, Kostrikov I et al (2021) Improving sample efficiency in model-free reinforcement learning from images. Proc AAAI Conf Artif Intell 35(12):10674\u201310681","journal-title":"Proc AAAI Conf Artif Intell"},{"key":"1247_CR35","first-page":"5276","volume":"34","author":"T Yu","year":"2021","unstructured":"Yu T, Lan C, Zeng W et al (2021) Playvirtual: Augmenting cycle-consistent virtual trajectories for reinforcement learning. Adv Neural Inf Process Syst 34:5276\u20135289","journal-title":"Adv Neural Inf Process Syst"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-023-01247-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-023-01247-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-023-01247-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,3,30]],"date-time":"2024-03-30T15:20:10Z","timestamp":1711812010000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-023-01247-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,11]]},"references-count":35,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,4]]}},"alternative-id":["1247"],"URL":"https:\/\/doi.org\/10.1007\/s40747-023-01247-5","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,11]]},"assertion":[{"value":"28 May 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 September 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 October 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}
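The record's abstract describes the WMTD mechanism only at a high level: a trajectory discriminator shares a feature extractor with the world model, is trained against augmented negative trajectories, and its score is used to weight state-value estimates. The PyTorch sketch below is a loose, hypothetical illustration of that idea, not the paper's actual architecture: the MLP encoder, the GRU discriminator, the time-shuffle augmentation, and all tensor shapes are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Feature extractor, assumed to be shared between the world model and the discriminator."""
    def __init__(self, obs_dim, feat_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def forward(self, obs):  # obs: (B, T, obs_dim) -> (B, T, feat_dim)
        return self.net(obs)

class TrajectoryDiscriminator(nn.Module):
    """Scores whether a feature trajectory is temporally consistent (real) or augmented (negative)."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats):  # feats: (B, T, feat_dim) -> (B, 1) realism score in [0, 1]
        _, h = self.rnn(feats)
        return torch.sigmoid(self.head(h[-1]))

def make_negatives(obs_seq):
    # Hypothetical augmentation: permute timesteps to destroy the temporal dynamics
    # while keeping the marginal state distribution unchanged.
    idx = torch.randperm(obs_seq.size(1))
    return obs_seq[:, idx]

# --- one training step of the discriminator (shapes are illustrative) ---
B, T, obs_dim, feat_dim = 8, 16, 24, 32
enc = Encoder(obs_dim, feat_dim)
disc = TrajectoryDiscriminator(feat_dim)
opt = torch.optim.Adam(list(enc.parameters()) + list(disc.parameters()), lr=3e-4)
bce = nn.BCELoss()

real = torch.randn(B, T, obs_dim)   # stand-in for replay-buffer trajectories
fake = make_negatives(real)         # augmented negative samples

p_real, p_fake = disc(enc(real)), disc(enc(fake))
loss = bce(p_real, torch.ones_like(p_real)) + bce(p_fake, torch.zeros_like(p_fake))
opt.zero_grad(); loss.backward(); opt.step()

# --- use the discriminator score to weight state-value estimates ---
with torch.no_grad():
    w = disc(enc(real))             # (B, 1) confidence that each trajectory is plausible
values = torch.randn(B, 1)          # stand-in for critic value estimates
weighted_value = (w * values).mean()
```

The point mirrored from the abstract is twofold: `enc` is shared, so the discriminator's gradients shape the same features the world model uses, and the score `w` discounts value estimates computed from trajectories the discriminator judges implausible.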