{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T21:20:56Z","timestamp":1774646456413,"version":"3.50.1"},"reference-count":64,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,7,26]],"date-time":"2024-07-26T00:00:00Z","timestamp":1721952000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,7,26]],"date-time":"2024-07-26T00:00:00Z","timestamp":1721952000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001774","name":"University of Sydney","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001774","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2025,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches have made enormous progress in navigation in new environments, such as beam search, pre-exploration, and dynamic or hierarchical history encoding. To balance generalization and efficiency, we resort to memorizing visited scenarios apart from the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent\u2019s memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture of the next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. 
We provide a simple yet effective implementation of ESceme by enhancing the accessible views at each location and progressively completing the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. Our ESceme also wins first place on the CVDN leaderboard. Code is available: <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/qizhust\/esceme\">https:\/\/github.com\/qizhust\/esceme<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s11263-024-02159-8","type":"journal-article","created":{"date-parts":[[2024,7,27]],"date-time":"2024-07-27T00:02:04Z","timestamp":1722038524000},"page":"254-274","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["ESceme: Vision-and-Language Navigation with Episodic Scene Memory"],"prefix":"10.1007","volume":"133","author":[{"given":"Qi","family":"Zheng","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daqing","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chaoyue","family":"Wang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6595-7661","authenticated-orcid":false,"given":"Jing","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dadong","family":"Wang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dacheng","family":"Tao","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,7,26]]},"reference":[{"key":"2159_CR1","doi-asserted-by":"crossref","unstructured":"An, D., Qi, Y., Huang, Y., Wu, Q., 
Wang, L., & Tan, T. (2021). Neighbor-view enhanced model for vision and language navigation. In ACMMM, pp. 5101\u20135109.","DOI":"10.1145\/3474085.3475282"},{"key":"2159_CR2","doi-asserted-by":"crossref","unstructured":"An, D., Wang, H., Wang, W., Wang, Z., Huang, Y., He, K., & Wang, L. (2023). Etpnav: Evolving topological planning for vision-language navigation in continuous environments. arXiv preprint arXiv:2304.03047.","DOI":"10.1109\/TPAMI.2024.3386695"},{"key":"2159_CR3","unstructured":"Anderson, P., Chang, A., Chaplot, D.\u00a0S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., & Savva, M., et\u00a0al. (2018). On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757."},{"key":"2159_CR4","doi-asserted-by":"crossref","unstructured":"Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., S\u00fcnderhauf, N., Reid, I., Gould, S., & Van Den\u00a0Hengel, A. (2018). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, pp. 3674\u20133683.","DOI":"10.1109\/CVPR.2018.00387"},{"key":"2159_CR5","doi-asserted-by":"crossref","unstructured":"Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.\u00a0L., & Parikh, D. (2015). Vqa: Visual question answering. In ICCV, pp. 2425\u20132433.","DOI":"10.1109\/ICCV.2015.279"},{"key":"2159_CR6","doi-asserted-by":"crossref","unstructured":"Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., & Zhang, A. (2017). Matterport3d: Learning from rgb-d data in indoor environments. In International Conference on 3D Vision, pp. 667\u2013676.","DOI":"10.1109\/3DV.2017.00081"},{"key":"2159_CR7","unstructured":"Chaplot, D.\u00a0S., Salakhutdinov, R., Gupta, A., & Gupta, S. (2020). Neural topological slam for visual navigation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp.
12875\u201312884."},{"key":"2159_CR8","doi-asserted-by":"crossref","unstructured":"Chen, J., Gao, C., Meng, E., Zhang, Q., & Liu, S. (2022). Reinforced structured state-evolution for vision-language navigation. In CVPR, pp. 15450\u201315459.","DOI":"10.1109\/CVPR52688.2022.01501"},{"key":"2159_CR9","doi-asserted-by":"crossref","unstructured":"Chen, K., Chen, J.\u00a0K., Chuang, J., V\u00e1zquez, M., & Savarese, S. (2021). Topological planning with transformers for vision-and-language navigation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 11276\u201311286.","DOI":"10.1109\/CVPR46437.2021.01112"},{"key":"2159_CR10","first-page":"5834","volume":"34","author":"S Chen","year":"2021","unstructured":"Chen, S., Guhur, P.-L., Schmid, C., & Laptev, I. (2021). History aware multimodal transformer for vision-and-language navigation. In NeurIPS, 34, 5834\u20135847.","journal-title":"In NeurIPS"},{"key":"2159_CR11","doi-asserted-by":"crossref","unstructured":"Chen, S., Guhur, P.-L., Tapaswi, M., Schmid, C., & Laptev, I. (2022). Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In CVPR, pp. 16537\u201316547.","DOI":"10.1109\/CVPR52688.2022.01604"},{"key":"2159_CR12","unstructured":"Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Doll\u00e1r, P., & Zitnick, C.\u00a0L. (2015). Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325."},{"key":"2159_CR13","unstructured":"Landi, F., Baraldi, L., Cornia, M., Corsini, M., & Cucchiara, R. (2019). Perceive, transform, and act: Multi-modal attention networks for vision-and-language navigation. arXiv preprint arXiv:1911.12377."},{"key":"2159_CR14","doi-asserted-by":"crossref","unstructured":"Datta, S., Dharur, S., Cartillier, V., Desai, R., Khanna, M., Batra, D., & Parikh, D. (2022). Episodic memory question answering.
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 19119\u201319128.","DOI":"10.1109\/CVPR52688.2022.01853"},{"key":"2159_CR15","first-page":"20660","volume":"33","author":"Z Deng","year":"2020","unstructured":"Deng, Z., Narasimhan, K., & Russakovsky, O. (2020). Evolving graphical planner: Contextual global planning for vision-and-language navigation. In NeurIPS, 33, 20660\u201320672.","journal-title":"In NeurIPS"},{"key":"2159_CR16","unstructured":"Dwivedi, V.\u00a0P., Joshi, C.\u00a0K., Laurent, T., Bengio, Y., & Bresson, X. (2020). Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982."},{"key":"2159_CR17","unstructured":"Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.-P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., & Darrell, T. (2018). Speaker-follower models for vision-and-language navigation. In NeurIPS, volume\u00a031."},{"key":"2159_CR18","doi-asserted-by":"crossref","unstructured":"Georgakis, G., Schmeckpeper, K., Wanchoo, K., Dan, S., Miltsakaki, E., Roth, D., & Daniilidis, K. (2022). Cross-modal map learning for vision and language navigation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 15460\u201315470.","DOI":"10.1109\/CVPR52688.2022.01502"},{"key":"2159_CR19","doi-asserted-by":"crossref","unstructured":"Guhur, P.-L., Tapaswi, M., Chen, S., Laptev, I., & Schmid, C. (2021). Airbert: In-domain pretraining for vision-and-language navigation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 1634\u20131643.","DOI":"10.1109\/ICCV48922.2021.00166"},{"key":"2159_CR20","doi-asserted-by":"crossref","unstructured":"Gupta, S., Davidson, J., Levine, S., Sukthankar, R., & Malik, J. (2017). Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
2616\u20132625.","DOI":"10.1109\/CVPR.2017.769"},{"key":"2159_CR21","doi-asserted-by":"crossref","unstructured":"Hao, W., Li, C., Li, X., Carin, L., & Gao, J. (2020). Towards learning a generic agent for vision-and-language navigation via pre-training. In CVPR, pp. 13137\u201313146.","DOI":"10.1109\/CVPR42600.2020.01315"},{"issue":"8","key":"2159_CR22","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735\u20131780.","journal-title":"Neural computation"},{"key":"2159_CR23","first-page":"7685","volume":"33","author":"Y Hong","year":"2020","unstructured":"Hong, Y., Rodriguez, C., Qi, Y., Wu, Q., & Gould, S. (2020). Language and visual entity relationship graph for agent navigation. In NeurIPS, 33, 7685\u20137696.","journal-title":"In NeurIPS"},{"key":"2159_CR24","doi-asserted-by":"crossref","unstructured":"Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., & Gould, S. (2021). Vln bert: A recurrent vision-and-language bert for navigation. In CVPR, pp. 1643\u20131653.","DOI":"10.1109\/CVPR46437.2021.00169"},{"key":"2159_CR25","doi-asserted-by":"crossref","unstructured":"Hong, Y., Wang, Z., Wu, Q., & Gould, S. (2022). Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 15439\u201315449.","DOI":"10.1109\/CVPR52688.2022.01500"},{"key":"2159_CR26","doi-asserted-by":"crossref","unstructured":"Jain, V., Magalhaes, G., Ku, A., Vaswani, A., Ie, E., & Baldridge, J. (2019). Stay on the path: Instruction fidelity in vision-and-language navigation. In ACL, pp. 
1862\u20131872.","DOI":"10.18653\/v1\/P19-1181"},{"key":"2159_CR27","doi-asserted-by":"crossref","unstructured":"Ke, L., Li, X., Bisk, Y., Holtzman, A., Gan, Z., Liu, J., Gao, J., Choi, Y., & Srinivasa, S. (2019). Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In CVPR, pp. 6741\u20136749.","DOI":"10.1109\/CVPR.2019.00690"},{"key":"2159_CR28","unstructured":"Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171\u20134186."},{"key":"2159_CR29","doi-asserted-by":"crossref","unstructured":"Krantz, J., Banerjee, S., Zhu, W., Corso, J., Anderson, P., Lee, S., & Thomason, J. (2023). Iterative vision-and-language navigation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 14921\u201314930.","DOI":"10.1109\/CVPR52729.2023.01433"},{"key":"2159_CR30","doi-asserted-by":"crossref","unstructured":"Ku, A., Anderson, P., Patel, R., Ie, E., & Baldridge, J. (2020). Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In EMNLP.","DOI":"10.18653\/v1\/2020.emnlp-main.356"},{"key":"2159_CR31","doi-asserted-by":"crossref","unstructured":"Ku, A., Anderson, P., Patel, R., Ie, E., & Baldridge, J. (2020). Room-Across-Room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Conference on Empirical Methods for Natural Language Processing (EMNLP).","DOI":"10.18653\/v1\/2020.emnlp-main.356"},{"key":"2159_CR32","doi-asserted-by":"crossref","unstructured":"Li, J., & Bansal, M. (2023). Improving vision-and-language navigation by generating future-view image semantics. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 10803\u201310812.","DOI":"10.1109\/CVPR52729.2023.01040"},{"key":"2159_CR33","doi-asserted-by":"crossref","unstructured":"Li, J., Tan, H., & Bansal, M.
(2022). Envedit: Environment editing for vision-and-language navigation. In CVPR, pp. 15407\u201315417.","DOI":"10.1109\/CVPR52688.2022.01497"},{"key":"2159_CR34","unstructured":"Li, M., Wang, Z., Tuytelaars, T., & Moens, M.-F. (2022). Layout-aware dreamer for embodied referring expression grounding. arXiv preprint arXiv:2212.00171."},{"key":"2159_CR35","doi-asserted-by":"crossref","unstructured":"Li, X., Li, C., Xia, Q., Bisk, Y., Celikyilmaz, A., Gao, J., Smith, N., & Choi, Y. (2019). Robust navigation with language pretraining and stochastic sampling. In EMNLP-IJCNLP.","DOI":"10.18653\/v1\/D19-1159"},{"key":"2159_CR36","doi-asserted-by":"crossref","unstructured":"Li, X., Wang, Z., Yang, J., Wang, Y., & Jiang, S. (2023). Kerm: Knowledge enhanced reasoning for vision-and-language navigation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 2583\u20132592.","DOI":"10.1109\/CVPR52729.2023.00254"},{"key":"2159_CR37","doi-asserted-by":"crossref","unstructured":"Lin, B., Zhu, Y., Chen, Z., Liang, X., Liu, J., & Liang, X. (2022). Adapt: Vision-language navigation with modality-aligned action prompts. In CVPR, pp. 15396\u201315406.","DOI":"10.1109\/CVPR52688.2022.01496"},{"key":"2159_CR38","doi-asserted-by":"crossref","unstructured":"Lin, C., Jiang, Y., Cai, J., Qu, L., Haffari, G., & Yuan, Z. (2022). Multimodal transformer with variable-length memory for vision-and-language navigation. In ECCV.","DOI":"10.1007\/978-3-031-20059-5_22"},{"key":"2159_CR39","unstructured":"Ma, C.-Y., Lu, J., Wu, Z., AlRegib, G., Kira, Z., Socher, R., & Xiong, C. (2019). Self-monitoring navigation agent via auxiliary progress estimation. In ICLR."},{"key":"2159_CR40","doi-asserted-by":"crossref","unstructured":"Ma, C.-Y., Wu, Z., AlRegib, G., Xiong, C., & Kira, Z. (2019). The regretful agent: Heuristic-aided navigation through progress estimation. In CVPR, pp.
6732\u20136740.","DOI":"10.1109\/CVPR.2019.00689"},{"key":"2159_CR41","doi-asserted-by":"crossref","unstructured":"Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., & Batra, D. (2020). Improving vision-and-language navigation with image-text pairs from the web. In ECCV, pp. 259\u2013274. Springer.","DOI":"10.1007\/978-3-030-58539-6_16"},{"key":"2159_CR42","unstructured":"Maron, H., Ben-Hamu, H., Serviansky, H., & Lipman, Y. (2019). Provably powerful graph networks. NeurIPS, 32."},{"key":"2159_CR43","unstructured":"Mnih, V., Badia, A.\u00a0P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In ICML, pp. 1928\u20131937. PMLR."},{"key":"2159_CR44","doi-asserted-by":"crossref","unstructured":"Qi, Y., Pan, Z., Zhang, S., Hengel, A.\u00a0v.\u00a0d., & Wu, Q. (2020). Object-and-action aware model for visual language navigation. In ECCV, pp. 303\u2013317. Springer.","DOI":"10.1007\/978-3-030-58607-2_18"},{"key":"2159_CR45","doi-asserted-by":"crossref","unstructured":"Qi, Y., Wu, Q., Anderson, P., Wang, X., Wang, W.\u00a0Y., Shen, C., & Hengel, A.\u00a0v.\u00a0d. (2020). Reverie: Remote embodied visual referring expression in real indoor environments. In CVPR, pp. 9982\u20139991.","DOI":"10.1109\/CVPR42600.2020.01000"},{"key":"2159_CR46","doi-asserted-by":"crossref","unstructured":"Qiao, Y., Qi, Y., Hong, Y., Yu, Z., Wang, P., & Wu, Q. (2022). Hop: History-and-order aware pre-training for vision-and-language navigation. In CVPR, pp. 15418\u201315427.","DOI":"10.1109\/CVPR52688.2022.01498"},{"key":"2159_CR47","unstructured":"Shah, D., Eysenbach, B., Rhinehart, N., & Levine, S. (2022). Rapid exploration for open-world navigation with latent goal models. In Conference on Robot Learning, pp. 674\u2013684. 
PMLR."},{"key":"2159_CR48","first-page":"1984","volume":"2022","author":"A Shrivastava","year":"2022","unstructured":"Shrivastava, A., Gopalakrishnan, K., Liu, Y., Piramuthu, R., T\u00fcr, G., Parikh, D., & Hakkani-Tur, D. (2022). Visitron: Visual semantics-aligned interactively trained object-navigator. In Findings of the Association for Computational Linguistics: ACL, 2022, 1984\u20131994.","journal-title":"In Findings of the Association for Computational Linguistics: ACL"},{"issue":"1\u20132","key":"2159_CR49","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1162\/1064546053278973","volume":"11","author":"L Smith","year":"2005","unstructured":"Smith, L., & Gasser, M. (2005). The development of embodied cognition: Six lessons from babies. Artificial life, 11(1\u20132), 13\u201329.","journal-title":"Artificial life"},{"key":"2159_CR50","doi-asserted-by":"crossref","unstructured":"Tan, H., Yu, L., & Bansal, M. (2019). Learning to navigate unseen environments: Back translation with environmental dropout. In NAACL.","DOI":"10.18653\/v1\/N19-1268"},{"key":"2159_CR51","unstructured":"Thomason, J., Murray, M., Cakmak, M., & Zettlemoyer, L. (2020). Vision-and-dialog navigation. In Conference on Robot Learning, pp. 394\u2013406. PMLR."},{"key":"2159_CR52","doi-asserted-by":"publisher","first-page":"246","DOI":"10.1007\/s11263-020-01374-3","volume":"129","author":"AB Vasudevan","year":"2021","unstructured":"Vasudevan, A. B., Dai, D., & Van Gool, L. (2021). Talk2nav: Long-range vision-and-language navigation with dual attention and spatial memory. International Journal of Computer Vision, 129, 246\u2013266.","journal-title":"International Journal of Computer Vision"},{"key":"2159_CR53","doi-asserted-by":"crossref","unstructured":"Wang, H., Wang, W., Shu, T., Liang, W., & Shen, J. (2020). Active visual information gathering for vision-language navigation. In ECCV, pp. 307\u2013322. 
Springer.","DOI":"10.1007\/978-3-030-58542-6_19"},{"key":"2159_CR54","doi-asserted-by":"crossref","unstructured":"Wang, H., Wang, W., Liang, W., Xiong, C., & Shen, J. (2021). Structured scene memory for vision-language navigation. In CVPR, pp. 8455\u20138464.","DOI":"10.1109\/CVPR46437.2021.00835"},{"key":"2159_CR55","doi-asserted-by":"crossref","unstructured":"Wang, X., Xiong, W., Wang, H., & Wang, W.\u00a0Y. (2018). Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In ECCV, pp. 37\u201353.","DOI":"10.1007\/978-3-030-01270-0_3"},{"key":"2159_CR56","doi-asserted-by":"crossref","unstructured":"Wang, X., Huang, Q., Celikyilmaz, A., Gao, J., Shen, D., Wang, Y.-F., Wang, W.\u00a0Y., & Zhang, L. (2019). Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In CVPR, pp. 6629\u20136638.","DOI":"10.1109\/CVPR.2019.00679"},{"key":"2159_CR57","doi-asserted-by":"crossref","unstructured":"Wang, X.\u00a0E., Jain, V., Ie, E., Wang, W.\u00a0Y., Kozareva, Z., & Ravi, S. (2020). Environment-agnostic multitask learning for natural language grounded navigation. In ECCV, pp. 413\u2013430. Springer.","DOI":"10.1007\/978-3-030-58586-0_25"},{"key":"2159_CR58","doi-asserted-by":"crossref","unstructured":"Wang, Z., Li, X., Yang, J., Liu, Y., & Jiang, S. (2023). Gridmm: Grid memory map for vision-and-language navigation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 15625\u201315636.","DOI":"10.1109\/ICCV51070.2023.01432"},{"key":"2159_CR59","doi-asserted-by":"crossref","unstructured":"Wu, S., Fu, X., Wu, F., & Zha, Z.-J. (2022). Cross-modal semantic alignment pre-training for vision-and-language navigation. In ACMMM, pp. 4233\u20134241.","DOI":"10.1145\/3503161.3548283"},{"key":"2159_CR60","doi-asserted-by":"crossref","unstructured":"Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). 
Msr-vtt: A large video description dataset for bridging video and language. In CVPR, pp. 5288\u20135296.","DOI":"10.1109\/CVPR.2016.571"},{"key":"2159_CR61","doi-asserted-by":"crossref","unstructured":"Zhao, Y., Chen, J., Gao, C., Wang, W., Yang, L., Ren, H., Xia, H., & Liu, S. (2022). Target-driven structured transformer planner for vision-language navigation. In ACMMM, pp. 4194\u20134203.","DOI":"10.1145\/3503161.3548281"},{"key":"2159_CR62","doi-asserted-by":"crossref","unstructured":"Zhu, F., Zhu, Y., Chang, X., & Liang, X. (2020). Vision-language navigation with self-supervised auxiliary reasoning tasks. In CVPR, pp. 10012\u201310022.","DOI":"10.1109\/CVPR42600.2020.01003"},{"key":"2159_CR63","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Zhu, F., Zhan, Z., Lin, B., Jiao, J., Chang, X., & Liang, X. (2020). Vision-dialog navigation by exploring cross-modal memory. In CVPR, pp. 10730\u201310739.","DOI":"10.1109\/CVPR42600.2020.01074"},{"key":"2159_CR64","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Weng, Y., Zhu, F., Liang, X., Ye, Q., Lu, Y., & Jiao, J. (2021). Self-motivated communication agent for real-world vision-dialog navigation. In ICCV, pp. 
1594\u20131603.","DOI":"10.1109\/ICCV48922.2021.00162"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-024-02159-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-024-02159-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-024-02159-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,7]],"date-time":"2025-01-07T06:12:01Z","timestamp":1736230321000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-024-02159-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,26]]},"references-count":64,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1]]}},"alternative-id":["2159"],"URL":"https:\/\/doi.org\/10.1007\/s11263-024-02159-8","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,26]]},"assertion":[{"value":"17 July 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 June 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 July 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}