{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:03:28Z","timestamp":1781535808576,"version":"3.54.5"},"reference-count":41,"publisher":"MDPI AG","issue":"13","license":[{"start":{"date-parts":[[2023,6,29]],"date-time":"2023-06-29T00:00:00Z","timestamp":1687996800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100002241","name":"JST SPRING","doi-asserted-by":"publisher","award":["JPMJSP2123"],"award-info":[{"award-number":["JPMJSP2123"]}],"id":[{"id":"10.13039\/501100002241","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>In the field of embodied AI, vision-and-language navigation (VLN) is a crucial and challenging multi-modal task. Specifically, outdoor VLN involves an agent navigating within a graph-based environment, while simultaneously interpreting information from real-world urban environments and natural language instructions. Existing outdoor VLN models predict actions using a combination of panorama and instruction features. However, these methods may cause the agent to struggle to understand complicated outdoor environments and ignore the details in the environments to fail to navigate. Human navigation often involves the use of specific objects as reference landmarks when navigating to unfamiliar places, providing a more rational and efficient approach to navigation. Inspired by this natural human behavior, we propose an object-level alignment module (OAlM), which guides the agent to focus more on object tokens mentioned in the instructions and recognize these landmarks during navigation. 
By treating these landmarks as sub-goals, our method effectively decomposes a long-range path into a series of shorter paths, ultimately improving the agent\u2019s overall performance. In addition to enabling better object recognition and alignment, our proposed OAlM also fosters a more robust and adaptable agent capable of navigating complex environments. This adaptability is particularly crucial for real-world applications where environmental conditions can be unpredictable and varied. Experimental results show our OAlM is a more object-focused model, and our approach outperforms all metrics on a challenging outdoor VLN Touchdown dataset, exceeding the baseline by 3.19% on task completion (TC). These results highlight the potential of leveraging object-level information in the form of sub-goals to improve navigation performance in embodied AI systems, paving the way for more advanced and efficient outdoor navigation.<\/jats:p>","DOI":"10.3390\/s23136028","type":"journal-article","created":{"date-parts":[[2023,6,30]],"date-time":"2023-06-30T01:14:12Z","timestamp":1688087652000},"page":"6028","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Outdoor Vision-and-Language Navigation Needs Object-Level Alignment"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-1523-5689","authenticated-orcid":false,"given":"Yanjun","family":"Sun","sequence":"first","affiliation":[{"name":"Department of Electronics and Electrical Engineering, Faculty of Science and Technology, Keio University, 3-14-1, Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan"},{"name":"National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2181-9475","authenticated-orcid":false,"given":"Yue","family":"Qiu","sequence":"additional","affiliation":[{"name":"National 
Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7361-0027","authenticated-orcid":false,"given":"Yoshimitsu","family":"Aoki","sequence":"additional","affiliation":[{"name":"Department of Electronics and Electrical Engineering, Faculty of Science and Technology, Keio University, 3-14-1, Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8844-165X","authenticated-orcid":false,"given":"Hirokatsu","family":"Kataoka","sequence":"additional","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2023,6,29]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Gu, J., Stefani, E., Wu, Q., Thomason, J., and Wang, X. (2022, January 22\u201327). Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions. Proceedings of the Association for Computational Linguistics, Dublin, Ireland.","DOI":"10.18653\/v1\/2022.acl-long.524"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Qi, Y., Pan, Z., Zhang, S., van den Hengel, A., and Wu, Q. (2020, January 23\u201328). Object-and-Action Aware Model for Visual Language Navigation. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58607-2_18"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Gao, C., Chen, J., Liu, S., Wang, L., Zhang, Q., and Wu, Q. (2021, January 20\u201325). Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression. 
Proceedings of the Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00308"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., S\u00fcnderhauf, N., Reid, I., Gould, S., and van den Hengel, A. (2018, January 18\u201322). Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00387"},{"key":"ref_5","unstructured":"Mirowski, P., Grimes, M.K., Malinowski, M., Hermann, K.M., Anderson, K., Teplyashin, D., Simonyan, K., Kavukcuoglu, K., Zisserman, A., and Hadsell, R. (2018, January 3\u20138). Learning to Navigate in Cities Without a Map. Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Chen, H., Suhr, A., Misra, D., Snavely, N., and Artzi, Y. (2019, January 15\u201320). TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments. Proceedings of the Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01282"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Chaplot, D.S., Sathyendra, K.M., Pasumarthi, R.K., Rajagopal, D., and Salakhutdinov, R. (2018, January 2\u20137). Gated-Attention Architectures for Task-Oriented Language Grounding. Proceedings of the Association for the Advancement of Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11832"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Xiang, J., Wang, X., and Wang, W.Y. (2020, January 16\u201320). Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation. 
Proceedings of the Empirical Methods in Natural Language Processing, Online.","DOI":"10.18653\/v1\/2020.findings-emnlp.62"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zhu, W., Wang, X., Fu, T.J., Yan, A., Narayana, P., Sone, K., Basu, S., and Wang, W.Y. (2021, January 19\u201323). Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation. Proceedings of the European Chapter of the Association for Computational Linguistics, Online.","DOI":"10.18653\/v1\/2021.eacl-main.103"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Schumann, R., and Riezler, S. (2022, January 22\u201327). Analyzing Generalization of Vision and Language Navigation to Unseen Outdoor Areas. Proceedings of the Association for Computational Linguistics, Dublin, Ireland.","DOI":"10.18653\/v1\/2022.acl-long.518"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"304","DOI":"10.3389\/fpsyg.2012.00304","article-title":"From Objects to Landmarks: The Function of Visual Location Information in Spatial Navigation","volume":"3","author":"Chan","year":"2012","journal-title":"Front. Psychol."},{"key":"ref_12","unstructured":"Moudgil, A., Majumdar, A., Agrawal, H., Lee, S., and Batra, D. (2021, January 6\u201314). SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation. Proceedings of the Neural Information Processing Systems, Online."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zhu, F., Liang, X., Zhu, Y., Yu, Q., Chang, X., and Liang, X. (2021, January 20\u201325). SOON: Scenario Oriented Object Navigation with Graph-Based Exploration. Proceedings of the Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01250"},{"key":"ref_14","unstructured":"Hu, R., Fried, D., Rohrbach, A., Klein, D., Darrell, T., and Saenko, K. (August, January 28). Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation. 
Proceedings of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Tan, H., and Bansal, M. (2020, January 19\u201326). Diagnosing the Environment Bias in Vision-and-Language Navigation. Proceedings of the International Joint Conference on Artificial Intelligence, Online.","DOI":"10.24963\/ijcai.2020\/124"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., and Batra, D. (2020, January 23\u201328). Improving Vision-and-Language Navigation with Image-Text Pairs from the Web. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58539-6_16"},{"key":"ref_17","unstructured":"Yan, A., Wang, X.E., Feng, J., Li, L., and Wang, W.Y. (2019). Cross-Lingual Vision-Language Navigation. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Ku, A., Anderson, P., Patel, R., Ie, E., and Baldridge, J. (2020, January 16\u201320). Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding. Proceedings of the Empirical Methods in Natural Language Processing, Online.","DOI":"10.18653\/v1\/2020.emnlp-main.356"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Mehta, H., Artzi, Y., Baldridge, J., Ie, E., and Mirowski, P. (2020, January 16\u201320). Retouchdown: Releasing Touchdown on StreetLearn as a Public Resource for Language Grounding Tasks in Street View. Proceedings of the Empirical Methods in Natural Language Processing-SpLU, Online.","DOI":"10.18653\/v1\/2020.splu-1.7"},{"key":"ref_20","first-page":"11773","article-title":"Learning to Follow Directions in Street View","volume":"34","author":"Hermann","year":"2020","journal-title":"Assoc. Adv. Artif. Intell."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Schumann, R., and Riezler, S. (2021, January 1\u20136). 
Generating Landmark Navigation Instructions from Maps as a Graph-to-Text Problem. Proceedings of the Association for Computational Linguistics, Bangkok, Thailand.","DOI":"10.18653\/v1\/2021.acl-long.41"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1007\/s11263-020-01374-3","article-title":"Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory","volume":"129","author":"Vasudevan","year":"2021","journal-title":"Int. J. Comput. Vis."},{"key":"ref_23","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA."},{"key":"ref_24","unstructured":"Lu, J., Batra, D., Parikh, D., and Lee, S. (2019, January 8\u201314). ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Qi, Y., Pan, Z., Hong, Y., Yang, M.H., Hengel, A.v.d., and Wu, Q. (2021, January 11\u201317). The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation. Proceedings of the International Conference on Computer Vision, Online.","DOI":"10.1109\/ICCV48922.2021.00168"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Song, C.H., Kil, J., Pan, T.Y., Sadler, B.M., Chao, W.L., and Su, Y. (2022, January 18\u201324). One Step at a Time: Long-Horizon Vision-and-Language Navigation With Milestones. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01504"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., Zettlemoyer, L., and Fox, D. 
(2020, January 13\u201319). ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01075"},{"key":"ref_28","unstructured":"Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., and Lin, H. (2020, January 6\u201312). Language Models are Few-Shot Learners. Proceedings of the Neural Information Processing Systems, Online."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Duch, W., Kacprzyk, J., Oja, E., and Zadro\u017cny, S. (2005, January 11\u201315). Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. Proceedings of the Artificial Neural Networks: Formal Models and Their Applications\u2014ICANN, Warsaw, Poland.","DOI":"10.1007\/11550907"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (July, January 26). Deep Residual Learning for Image Recognition. Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA. Available online: https:\/\/openaccess.thecvf.com\/content_cvpr_2016\/html\/He_Deep_Residual_Learning_CVPR_2016_paper.html.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_32","unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., and Antiga, L. (2019, January 8\u201314). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_33","unstructured":"Srihari, S.N., Shekhawat, A., and Lam, S.W. (2003). 
Encyclopedia of Computer Science, Van Nostrand Reinhold Company."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., and Girshick, R. (2017, January 22\u201329). Mask R-CNN. Proceedings of the International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"ImageNet Large Scale Visual Recognition Challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., and Doll\u00e1r, P. (2015). Microsoft COCO: Common Objects in Context. arXiv.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Sennrich, R., Haddow, B., and Birch, A. (2016, January 7\u201312). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the Association for Computational Linguistics, Berlin, Germany.","DOI":"10.18653\/v1\/P16-1162"},{"key":"ref_38","unstructured":"Levenshtein, V. (2021, November 10). Levenshtein Distance. Available online: https:\/\/rybn.org\/ANTI\/ADMXI\/documentation\/ALGORITHM_DOCUMENTATION\/HARMONY_OF_THE_SPEARS\/LEVENSHTEIN_EDIT_DISTANCE\/ABOUT\/NIST_Levenshtein_Edit_Distance.pdf."},{"key":"ref_39","unstructured":"Jain, V., Magalhaes, G., Ku, A., Vaswani, A., Ie, E., and Baldridge, J. (August, January 28). Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation. Proceedings of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_40","unstructured":"Magalhaes, G.I., Jain, V., Ku, A., Ie, E., and Baldridge, J. (2019, January 13). General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping. 
Proceedings of the Neural Information Processing Systems Visually Grounded Interaction and Language (ViGIL) Workshop, Vancouver, BC, Canada."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Zhu, W., Qi, Y., Narayana, P., Sone, K., Basu, S., Wang, X., Wu, Q., Eckstein, M., and Wang, W.Y. (2022, January 22\u201327). Diagnosing Vision-and-Language Navigation: What Really Matters. Proceedings of the North American Chapter of the Association for Computational Linguistics, Dublin, Ireland.","DOI":"10.18653\/v1\/2022.naacl-main.438"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/13\/6028\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:03:40Z","timestamp":1760126620000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/13\/6028"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,29]]},"references-count":41,"journal-issue":{"issue":"13","published-online":{"date-parts":[[2023,7]]}},"alternative-id":["s23136028"],"URL":"https:\/\/doi.org\/10.3390\/s23136028","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,29]]}}}