{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T20:42:57Z","timestamp":1768336977436,"version":"3.49.0"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"1","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62272343"],"award-info":[{"award-number":["62272343"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>\n                    Navigation instruction generation aims to address data scarcity in Vision-and-Language Navigation (VLN) by generating navigation instructions for unannotated routes from data sources like simulators or online data. However, existing methods usually suffer from high reliance on panoramic views, poor cross-task generalization ability, and limited availability of training data. To address these challenges, we propose a novel speaker, CaneSpeaker, to generate human-like instructions from front-facing images for a variety of VLN tasks. First, to mitigate the limited amount of speaker training data, we propose a Large Language Model (LLM)-based instruction augmentation method, LLM-IA, that utilizes an off-the-shelf LLM to create augmented instructions for training by distilling and reformulating existing instructions. This method allows us to collect an instruction-augmented dataset with human-level accuracy for speaker training, namely Rx2R. 
Second, to eliminate the dependency on panoramic views, we propose a novel Vision-Language Model (VLM)-based speaker architecture, VL-Sp. By leveraging the advanced reasoning capabilities of a pre-trained VLM, CaneSpeaker can effectively generate high-quality instructions directly from front-facing images without relying on panoramic views. Also, the prompt-based characteristic of the VLM allows us to devise a unified input representation to enable the processing of multiple VLN tasks, thus further addressing the problem of data scarcity by combining multiple datasets from different VLN tasks. Finally, we utilize CaneSpeaker to synthesize a large-scale augmented dataset, CANE, from unannotated routes in the Matterport3D Simulator. Comprehensive experiments demonstrate that CaneSpeaker generates precise instructions with diverse expressions across various VLN tasks, and the VLN agent trained on our datasets clearly outperforms its counterparts. The source code and datasets are available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/zheng19845\/CaneSpeaker\">https:\/\/github.com\/zheng19845\/CaneSpeaker<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1145\/3785009","type":"journal-article","created":{"date-parts":[[2025,12,15]],"date-time":"2025-12-15T18:28:28Z","timestamp":1765823308000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["CaneSpeaker: An LLM-Assisted Speaker for Generating Human-Like Navigation Instructions"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-1208-4220","authenticated-orcid":false,"given":"Yuanyu","family":"Zheng","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Tongji University, Shanghai, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4360-5523","authenticated-orcid":false,"given":"Lin","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Tongji University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-3926-540X","authenticated-orcid":false,"given":"Yunda","family":"Sun","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Tongji University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2966-7955","authenticated-orcid":false,"given":"Ying","family":"Shen","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Tongji University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4301-394X","authenticated-orcid":false,"given":"Shengjie","family":"Zhao","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Tongji University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2026,1,13]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Marah Abdin Jyoti Aneja Hany Awadalla Ahmed Awadallah Ammar Ahmad Awan Nguyen Bach Amit Bahree Arash Bakhtiari Jianmin Bao Harkirat Behl et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv:2404.14219. Retrieved from https:\/\/arxiv.org\/abs\/2404.14219"},{"key":"e_1_3_1_3_2","unstructured":"Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya Florencia Leoni Aleman Diogo Almeida Janko Altenschmidt Sam Altman Shyamal Anadkat et al. 2024. GPT-4 technical report. arXiv:2303.08774. 
Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_1_4_2","first-page":"7","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops","author":"Agarwal Sanyam","year":"2019","unstructured":"Sanyam Agarwal, Devi Parikh, Dhruv Batra, Peter Anderson, and Stefan Lee. 2019. Visual landmark selection for generating grounded and interpretable navigation instructions. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops, 7."},{"key":"e_1_3_1_5_2","unstructured":"Dong An Yuankai Qi Yangguang Li Yan Huang Liang Wang Tieniu Tan and Jing Shao. 2023. BEVBert: Multimodal map pre-training for language-guided navigation. arXiv:2212.04385. Retrieved from https:\/\/arxiv.org\/abs\/2212.04385"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3386695"},{"key":"e_1_3_1_7_2","first-page":"3674","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Anderson Peter","year":"2018","unstructured":"Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S\u00fcnderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 3674\u20133683."},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/3DV.2017.00081"},{"key":"e_1_3_1_9_2","first-page":"9796","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics","author":"Chen Jiaqi","year":"2024","unstructured":"Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee Wong. 2024. MapGPT: Map-guided prompting with adaptive path planning for vision-and-language navigation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 
Association for Computational Linguistics, 9796\u20139810."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3526024"},{"key":"e_1_3_1_11_2","first-page":"5834","volume-title":"Advances in Neural Information Processing Systems","author":"Chen Shizhe","year":"2021","unstructured":"Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. 2021. History aware multimodal transformer for vision-and-language navigation. In Advances in Neural Information Processing Systems, 5834\u20135847."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01604"},{"key":"e_1_3_1_13_2","unstructured":"Xinlei Chen Hao Fang Tsung-Yi Lin Ramakrishna Vedantam Saurabh Gupta Piotr Dollar and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325. Retrieved from https:\/\/arxiv.org\/abs\/1504.00325"},{"key":"e_1_3_1_14_2","first-page":"368","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Fan Sheng","year":"2024","unstructured":"Sheng Fan, Rui Liu, Wenguan Wang, and Yi Yang. 2024. Navigation instruction generation with BEV perception and large language models. In Proceedings of the European Conference on Computer Vision, 368\u2013387."},{"key":"e_1_3_1_15_2","first-page":"3318","volume-title":"Advances in Neural Information Processing Systems","author":"Fried Daniel","year":"2018","unstructured":"Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. 2018. Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, 3318\u20133329."},{"key":"e_1_3_1_16_2","unstructured":"Google Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning Multimodality Long Context and Next Generation Agentic Capabilities. 
Retrieved from https:\/\/storage.googleapis.com\/deepmind-media\/gemini\/gemini_v2_5_report.pdf"},{"key":"e_1_3_1_17_2","unstructured":"Aaron Grattafiori Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Alex Vaughan et al. 2024. The Llama 3 herd of models. arXiv:2407.21783. Retrieved from https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00166"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01315"},{"key":"e_1_3_1_20_2","unstructured":"Edward J. Hu Yelong Shen Phillip Wallis Zeyuan Allen-Zhu Yuanzhi Li Shean Wang Lu Wang and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685. Retrieved from https:\/\/arxiv.org\/abs\/2106.09685"},{"key":"e_1_3_1_21_2","unstructured":"Jared Kaplan Sam McCandlish Tom Henighan Tom B. Brown Benjamin Chess Rewon Child Scott Gray Alec Radford Jeffrey Wu and Dario Amodei. 2020. Scaling laws for neural language models. arXiv:2001.08361. Retrieved from https:\/\/arxiv.org\/abs\/2001.08361"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.356"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2025.3554559"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00764"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01544"},{"key":"e_1_3_1_26_2","first-page":"17380","volume-title":"Proceedings of the IEEE International Conference on Robotics and Automation","author":"Long Yuxing","year":"2024","unstructured":"Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. 2024. Discuss before moving: Visual language navigation via multi-expert discussions. In Proceedings of the IEEE International Conference on Robotics and Automation. 
IEEE, 17380\u201317387."},{"key":"e_1_3_1_27_2","first-page":"950","volume-title":"Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Pan Bowen","year":"2024","unstructured":"Bowen Pan, Rameswar Panda, SouYoung Jin, Rogerio Feris, Aude Oliva, Phillip Isola, and Yoon Kim. 2024. LangNav: Language as a perceptual representation for navigation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 950\u2013974."},{"key":"e_1_3_1_28_2","first-page":"9982","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Qi Yuankai","year":"2020","unstructured":"Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. 2020. REVERIE: Remote embodied visual referring expression in real indoor environments. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 9982\u20139991."},{"key":"e_1_3_1_29_2","first-page":"459","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Qiao Yanyuan","year":"2024","unstructured":"Yanyuan Qiao, Qianyi Liu, Jiajun Liu, Jing Liu, and Qi Wu. 2024. LLM as copilot for coarse-grained vision-and-language navigation. In Proceedings of the European Conference on Computer Vision, 459\u2013476."},{"key":"e_1_3_1_30_2","first-page":"15418","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Qiao Yanyuan","year":"2022","unstructured":"Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, and Qi Wu. 2022. HOP: History-and-order aware pre-training for vision-and-language navigation. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 15418\u201315427."},{"key":"e_1_3_1_31_2","first-page":"2070","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops","author":"Rawal Niyati","year":"2024","unstructured":"Niyati Rawal, Roberto Bigazzi, Lorenzo Baraldi, and Rita Cucchiara. 2024. AIGeN: An adversarial approach for instruction generation in VLN. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2070\u20132080."},{"key":"e_1_3_1_32_2","first-page":"1","volume-title":"Proceedings of the IEEE International Conference on Multimedia and Expo","author":"Sun Qiang","year":"2021","unstructured":"Qiang Sun, Yifeng Zhuang, Zhengqing Chen, Yanwei Fu, and Xiangyang Xue. 2021. Depth-guided AdaIN and shift attention network for vision-and-language navigation. In Proceedings of the IEEE International Conference on Multimedia and Expo, 1\u20136."},{"key":"e_1_3_1_33_2","first-page":"2610","volume-title":"Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Tan Hao","year":"2019","unstructured":"Hao Tan, Licheng Yu, and Mohit Bansal. 2019. Learning to navigate unseen environments: Back translation with environmental dropout. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2610\u20132621."},{"key":"e_1_3_1_34_2","unstructured":"Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. Retrieved from https:\/\/qwenlm.github.io\/blog\/qwen2.5\/"},{"key":"e_1_3_1_35_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar et al. 2023. LLaMA: Open and efficient foundation language models. arXiv:2302.13971. 
Retrieved from https:\/\/arxiv.org\/abs\/2302.13971"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01503"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01499"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01826"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01103"},{"key":"e_1_3_1_40_2","first-page":"1302","volume-title":"Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics","author":"Zhao Ming","year":"2021","unstructured":"Ming Zhao, Peter Anderson, Vihan Jain, Su Wang, Alexander Ku, Jason Baldridge, and Eugene Ie. 2021. On the evaluation of vision-and-language navigation instructions. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, 1302\u20131316."},{"key":"e_1_3_1_41_2","first-page":"13624","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zheng Duo","year":"2024","unstructured":"Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Liwei Wang. 2024. Towards learning a generalist model for embodied navigation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 13624\u201313634."},{"key":"e_1_3_1_42_2","unstructured":"Baichuan Zhou Ying Hu Xi Weng Junlong Jia Jie Luo Xien Liu Ji Wu and Lei Huang. 2024. TinyLLaVA: A framework of small-scale large multimodal models. arXiv:2402.14289. Retrieved from https:\/\/arxiv.org\/abs\/2402.14289"},{"key":"e_1_3_1_43_2","first-page":"260","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Zhou Gengze","year":"2024","unstructured":"Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. 2024. NavGPT-2: Unleashing navigational reasoning capability for large vision-language models. In Proceedings of the European Conference on Computer Vision. 
Springer, 260\u2013278."},{"key":"e_1_3_1_44_2","first-page":"7641","volume-title":"Proceedings of the 38th Conference on Artificial Intelligence","author":"Zhou Gengze","year":"2024","unstructured":"Gengze Zhou, Yicong Hong, and Qi Wu. 2024. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the 38th Conference on Artificial Intelligence, 7641\u20137649."},{"key":"e_1_3_1_45_2","first-page":"12689","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhu Fengda","year":"2021","unstructured":"Fengda Zhu, Xiwen Liang, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. 2021. SOON: Scenario oriented object navigation with graph-based exploration. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 12689\u201312699."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3785009","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,13]],"date-time":"2026-01-13T14:19:56Z","timestamp":1768313996000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3785009"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,13]]},"references-count":44,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3785009"],"URL":"https:\/\/doi.org\/10.1145\/3785009","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,13]]},"assertion":[{"value":"2025-04-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2025-12-07","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}