{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,21]],"date-time":"2026-03-21T20:17:09Z","timestamp":1774124229143,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":64,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62122010,61876177"],"award-info":[{"award-number":["62122010,61876177"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Key Research and Development Program of Zhejiang Province","award":["2022C01082"],"award-info":[{"award-number":["2022C01082"]}]},{"name":"ARC DECRA","award":["DE220101390"],"award-info":[{"award-number":["DE220101390"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548281","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:46Z","timestamp":1665416566000},"page":"4194-4203","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":47,"title":["Target-Driven Structured Transformer Planner for Vision-Language Navigation"],"prefix":"10.1145","author":[{"given":"Yusheng","family":"Zhao","sequence":"first","affiliation":[{"name":"Beihang University, Beijing, 
China"}]},{"given":"Jinyu","family":"Chen","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}]},{"given":"Chen","family":"Gao","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}]},{"given":"Wenguan","family":"Wang","sequence":"additional","affiliation":[{"name":"University of Technology Sydney, Sydney, NSW, Australia"}]},{"given":"Lirong","family":"Yang","sequence":"additional","affiliation":[{"name":"Meituan Inc., Beijing, China"}]},{"given":"Haibing","family":"Ren","sequence":"additional","affiliation":[{"name":"Meituan Inc., Beijing, China"}]},{"given":"Huaxia","family":"Xia","sequence":"additional","affiliation":[{"name":"Meituan Inc., Beijing, China"}]},{"given":"Si","family":"Liu","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_25"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475282"},{"key":"e_1_3_2_2_3_1","volume-title":"Chasing ghosts: Instruction following as bayesian state tracking. Advances in neural information processing systems","author":"Anderson Peter","year":"2019","unstructured":"Peter Anderson, Ayush Shrivastava, Devi Parikh, Dhruv Batra, and Stefan Lee. 2019. Chasing ghosts: Instruction following as bayesian state tracking. Advances in neural information processing systems, Vol. 
32 (2019)."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00387"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSMCC.2007.913919"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_2_7_1","volume-title":"Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158","author":"Chang Angel","year":"2017","unstructured":"Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158 (2017)."},{"key":"e_1_3_2_2_8_1","volume-title":"Reinforced Structured State-Evolution for Vision-Language Navigation. arXiv preprint arXiv:2204.09280","author":"Chen Jinyu","year":"2022","unstructured":"Jinyu Chen, Chen Gao, Erli Meng, Qiong Zhang, and Si Liu. 2022a. Reinforced Structured State-Evolution for Vision-Language Navigation. arXiv preprint arXiv:2204.09280 (2022)."},{"key":"e_1_3_2_2_9_1","unstructured":"Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. 2021. History Aware multimodal Transformer for Vision-and-Language Navigation. 
In Advances in neural information processing systems."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01604"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.261"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00323"},{"key":"e_1_3_2_2_13_1","first-page":"20660","article-title":"Evolving graphical planner: Contextual global planning for vision-and-language navigation","volume":"33","author":"Deng Zhiwei","year":"2020","unstructured":"Zhiwei Deng, Karthik Narasimhan, and Olga Russakovsky. 2020. Evolving graphical planner: Contextual global planning for vision-and-language navigation. Advances in Neural Information Processing Systems, Vol. 33 (2020), 20660--20672.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_14_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_15_1","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_2_16_1","volume-title":"Advances in Neural Information Processing Systems","volume":"31","author":"Fried Daniel","year":"2018","unstructured":"Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. 2018. Speaker-follower models for vision-and-language navigation. Advances in Neural Information Processing Systems, Vol. 31 (2018)."},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58539-6_5"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00166"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.769"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01315"},{"key":"e_1_3_2_2_21_1","volume-title":"Proceedings of the 29th ACM International Conference on Multimedia. 2344--2352","author":"He Dailan","year":"2021","unstructured":"
Dailan He, Yusheng Zhao, Junyu Luo, Tianrui Hui, Shaofei Huang, Aixi Zhang, and Si Liu. 2021. TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding. In Proceedings of the 29th ACM International Conference on Multimedia. 2344--2352."},{"key":"e_1_3_2_2_22_1","first-page":"7685","article-title":"Language and visual entity relationship graph for agent navigation","volume":"33","author":"Hong Yicong","year":"2020","unstructured":"Yicong Hong, Cristian Rodriguez, Yuankai Qi, Qi Wu, and Stephen Gould. 2020. Language and visual entity relationship graph for agent navigation. Advances in Neural Information Processing Systems, Vol. 33 (2020), 7685--7696.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_23_1","volume-title":"Proceedings of the IEEE\/CVF conference on Computer Vision and Pattern Recognition. 1643--1653","author":"Hong Yicong","year":"2021","unstructured":"Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, and Stephen Gould. 2021. A Recurrent Vision-and-Language BERT for Navigation. In Proceedings of the IEEE\/CVF conference on Computer Vision and Pattern Recognition. 1643--1653."},{"key":"e_1_3_2_2_24_1","volume-title":"Are you looking? grounding to multiple modalities in vision-and-language navigation. arXiv preprint arXiv:1906.00347","author":"Hu Ronghang","year":"2019","unstructured":"Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, and Kate Saenko. 2019. Are you looking? 
grounding to multiple modalities in vision-and-language navigation. arXiv preprint arXiv:1906.00347 (2019)."},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00750"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00690"},{"key":"e_1_3_2_2_27_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_2_28_1","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision. 14738--14748","author":"Koh Jing Yu","year":"2021","unstructured":"Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. 2021. Pathdreamer: A world model for indoor navigation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 14738--14748."},{"key":"e_1_3_2_2_29_1","volume-title":"Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474","author":"Kolve Eric","year":"2017","unstructured":"
Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. 2017. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474 (2017)."},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01488"},{"key":"e_1_3_2_2_31_1","volume-title":"Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954","author":"Ku Alexander","year":"2020","unstructured":"Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. 2020. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954 (2020)."},{"key":"e_1_3_2_2_32_1","volume-title":"Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation. arXiv preprint arXiv:2111.05759","author":"Lin Chuang","year":"2021","unstructured":"Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, and Zehuan Yuan. 2021a. Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation. 
arXiv preprint arXiv:2111.05759 (2021)."},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00696"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00167"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_2_36_1","volume-title":"Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, Vol. 32 (2019)."},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01596"},{"key":"e_1_3_2_2_38_1","volume-title":"Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035","author":"Ma Chih-Yao","year":"2019","unstructured":"Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. 2019a. Self-monitoring navigation agent via auxiliary progress estimation. 
arXiv preprint arXiv:1901.03035 (2019)."},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00689"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58539-6_16"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11492"},{"key":"e_1_3_2_2_42_1","volume-title":"Advances in Neural Information Processing Systems","volume":"34","author":"Moudgil Abhinav","year":"2021","unstructured":"Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, and Dhruv Batra. 2021. SOAT: A Scene-and Object-Aware Transformer for Vision-and-Language Navigation. Advances in Neural Information Processing Systems, Vol. 34 (2021)."},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01564"},{"key":"e_1_3_2_2_44_1","volume-title":"Proceedings of the European Conference on Computer Vision. Springer, 303--317","author":"Qi Yuankai","unstructured":"Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, and Qi Wu. 2020a. Object-and-action aware model for visual language navigation. In Proceedings of the European Conference on Computer Vision. Springer, 303--317."},{"key":"e_1_3_2_2_45_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 9982--9991","author":"Qi Yuankai","unstructured":"Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. 2020b. 
Reverie: Remote embodied visual referring expression in real indoor environments. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 9982--9991."},{"key":"e_1_3_2_2_46_1","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, Vol. 1, 8 (2019), 9."},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00943"},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01075"},{"key":"e_1_3_2_2_49_1","volume-title":"Learning to navigate unseen environments: Back translation with environmental dropout. arXiv preprint arXiv:1904.04195","author":"Tan Hao","year":"2019","unstructured":"Hao Tan, Licheng Yu, and Mohit Bansal. 2019. Learning to navigate unseen environments: Back translation with environmental dropout. arXiv preprint arXiv:1904.04195 (2019)."},{"key":"e_1_3_2_2_50_1","volume-title":"Conference on Robot Learning. PMLR, 394--406","author":"Thomason Jesse","year":"2020","unstructured":"Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2020. 
Vision-and-dialog navigation. In Conference on Robot Learning. PMLR, 394--406."},{"key":"e_1_3_2_2_51_1","volume-title":"Attention is all you need. Advances in neural information processing systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, Vol. 30 (2017)."},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01503"},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00835"},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58542-6_19"},{"key":"e_1_3_2_2_55_1","volume-title":"Collaborative visual navigation. arXiv preprint arXiv:2107.01151","author":"Wang Haiyang","year":"2021","unstructured":"Haiyang Wang, Wenguan Wang, Xizhou Zhu, Jifeng Dai, and Liwei Wang. 2021b. Collaborative visual navigation. 
arXiv preprint arXiv:2107.01151 (2021)."},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58545-7_8"},{"key":"e_1_3_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00679"},{"key":"e_1_3_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01270-0_3"},{"key":"e_1_3_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58586-0_25"},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00945"},{"key":"e_1_3_2_2_61_1","volume-title":"Advances in Neural Information Processing Systems","volume":"34","author":"Zhang Jiwen","year":"2021","unstructured":"Jiwen Zhang, Jianqing Fan, Jiajie Peng, et al. 2021. Curriculum Learning for Vision-and-Language Navigation. Advances in Neural Information Processing Systems, Vol. 34 (2021)."},{"key":"e_1_3_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01003"},{"key":"e_1_3_2_2_63_1","volume-title":"Babywalk: Going farther in vision-and-language navigation by taking baby steps. arXiv preprint arXiv:2005.04625","author":"Zhu Wang","year":"2020","unstructured":"Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, and Fei Sha. 2020a. Babywalk: Going farther in vision-and-language navigation by taking baby steps. 
arXiv preprint arXiv:2005.04625 (2020)."},{"key":"e_1_3_2_2_64_1","volume-title":"Diagnosing Vision-and-Language Navigation: What Really Matters. arXiv preprint arXiv:2103.16561","author":"Zhu Wanrong","year":"2021","unstructured":"Wanrong Zhu, Yuankai Qi, Pradyumna Narayana, Kazoo Sone, Sugato Basu, Xin Eric Wang, Qi Wu, Miguel Eckstein, and William Yang Wang. 2021. Diagnosing Vision-and-Language Navigation: What Really Matters. arXiv preprint arXiv:2103.16561 (2021)."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548281","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548281","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:42Z","timestamp":1750186842000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548281"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":64,"alternative-id":["10.1145\/3503161.3548281","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548281","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}