{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:47:38Z","timestamp":1760233658877,"version":"build-2065373602"},"reference-count":25,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2021,2,2]],"date-time":"2021-02-02T00:00:00Z","timestamp":1612224000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003052","name":"Ministry of Trade, Industry and Energy","doi-asserted-by":"publisher","award":["10077538"],"award-info":[{"award-number":["10077538"]}],"id":[{"id":"10.13039\/501100003052","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Due to the development of computer vision and natural language processing technologies in recent years, there has been a growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data such as images and text. Vision-and-language navigation (VLN) require the alignment and grounding of multimodal input data to enable real-time perception of the task status on panoramic images and natural language instruction. This study proposes a novel deep neural network model (JMEBS), with joint multimodal embedding and backtracking search for VLN tasks. The proposed JMEBS model uses a transformer-based joint multimodal embedding module. JMEBS uses both multimodal context and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm with a backtracking feature designed to improve the task success rate and optimize the navigation path, based on the local and global scores related to candidate actions. A novel global scoring method is also used for performance improvement by comparing the partial trajectories searched thus far with a plurality of natural language instructions. The performance of the proposed model on various operations was then experimentally demonstrated and compared with other models using the Matterport3D Simulator and room-to-room (R2R) benchmark datasets.<\/jats:p>","DOI":"10.3390\/s21031012","type":"journal-article","created":{"date-parts":[[2021,2,2]],"date-time":"2021-02-02T13:01:12Z","timestamp":1612270872000},"page":"1012","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9522-3981","authenticated-orcid":false,"given":"Jisu","family":"Hwang","sequence":"first","affiliation":[{"name":"Department of Computer Science, Kyonggi University, Suwon-si 16227, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5754-133X","authenticated-orcid":false,"given":"Incheol","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Kyonggi University, Suwon-si 16227, Korea"}]}],"member":"1968","published-online":{"date-parts":[[2021,2,2]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1007\/s11263-016-0966-6","article-title":"VQA: Visual question answering","volume":"123","author":"Agrawal","year":"2017","journal-title":"Int. J. Comput. 
Vis."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1242","DOI":"10.1109\/TPAMI.2018.2828437","article-title":"Visual dialog","volume":"41","author":"Das","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., and Batra, D. (2018, January 18\u201322). Embodied Question Answering. Proceedings of the IEEE Conference Computer Vision Patt. Recogn. (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00008"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., and Farhadi, A. (2018, January 18\u201322). IQA: Visual question answering in interactive environments. Proceedings of the IEEE Conference Computer Vision Patt. Recogn. (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00430"},{"key":"ref_5","unstructured":"Thomason, J., Murray, M., Cakmak, M., and Zettlemoyer, L. (November, January 30). Vision-and-dialog navigation. Proceedings of the International Conference Robot. Learning (CoRL), Osaka, Japan."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Qi, Y., Wu, Q., Anderson, P., Wang, X., Wang, W.-Y., Shen, C., and van den Henglel, A. (2020, January 14\u201319). REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments. Proceedings of the IEEE Conference Computer Vision Patt. Recogn. (CVPR), Online.","DOI":"10.1109\/CVPR42600.2020.01000"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., S\u00fcnderhauf, N., Reid, I., Gould, S., and van den Hengel, A. (2018, January 18\u201322). Vision-and-language navigation: Interpreting visually grounded navigation instructions in real environments. Proceedings of the IEEE Conference Computer Vision Patt. Recogn. (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00387"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Chang, A., Dai, A., Funkhouser, T., Halber, M., Nie\u00dfner, M., Savva, M., Song, S., Zeng, A., and Zhang, Y. (2017, January 10\u201312). Matterport3D: Learning from RGB-D data in indoor environments. Proceedings of the International Conference 3D. Vision 2017, Qingdao, China.","DOI":"10.1109\/3DV.2017.00081"},{"key":"ref_9","unstructured":"Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.-P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., and Darrell, T. (2018). Speaker\u2013follower models for vision-and-language navigation. arXiv."},{"key":"ref_10","unstructured":"Ma, C., Lu, J., Wu, Z., AlRegib, G., Kira, Z., Socher, R., and Xiong, C. (2019, January 6\u20139). Self-monitoring navigation agent via auxiliary progress estimation. Proceedings of the International Conference Learning Representation. (ICLR), New Orleans, LA, USA."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Qi, Y., Pan, Z., Zhang, S., van den Hengel, A., and Wu, Q. (2020, January 23\u201328). Object-and-Action Aware Model for Visual Language Navigation. Proceedings of the Europa Conference Computer Vision (ECCV), Glasgow, UK.","DOI":"10.1007\/978-3-030-58607-2_18"},{"key":"ref_12","unstructured":"Hong, Y., Rodriguez -O, C., Qi, Y., Wu, Q., and Gould, S. (2020). Language and Visual Entity Relationship Graph for Agent Navigation. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Ma, C., Wu, Z., AlRegib, G., Xiong, C., and Kira, Z. (2019, January 16\u201320). 
"unstructured":"Ma, C., Wu, Z., AlRegib, G., Xiong, C., and Kira, Z. (2019, June 16\u201320). The regretful agent: Heuristic-aided navigation through progress estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00689"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Ke, L., Li, X., Bisk, Y., Holtzman, A., Gan, Z., Liu, J., Gao, J., Choi, Y., and Srinivasa, S. (2019, June 16\u201320). Tactical rewind: Self-correction via backtracking in vision-and-language navigation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00690"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Tan, H., Yu, L., and Bansal, M. (2019, June 2\u20137). Learning to navigate unseen environments: Back translation with environmental dropout. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA.","DOI":"10.18653\/v1\/N19-1268"},{"key":"ref_16","unstructured":"Jain, V., Magalhaes, G., Ku, A., Vaswani, A., Ie, E., and Baldridge, J. (2019, July 28\u2013August 2). Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wang, X., Xiong, W., Wang, H., and Wang, W.Y. (2018, September 8\u201314). Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany. Lecture Notes in Computer Science.","DOI":"10.1007\/978-3-030-01270-0_3"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wang, X., Huang, Q., Celikyilmaz, A., Gao, J., Shen, D., Wang, Y.-F., Wang, W.-Y., and Zhang, L. (2019, June 16\u201320). Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00679"},{"key":"ref_19","unstructured":"Landi, F., Baraldi, L., Cornia, M., Corsini, M., and Cucchiara, R. (2019). Perceive, Transform, and Act: Multi-modal Attention Networks for Vision-and-Language Navigation. arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Hao, W., Li, C., Li, X., Carin, L., and Gao, J. (2020, June 14\u201319). Towards learning a generic agent for vision-and-language navigation via pre-training. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01315"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., and Batra, D. (2020, August 23\u201328). Improving vision-and-language navigation with image-text pairs from the Web. Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK.","DOI":"10.1007\/978-3-030-58539-6_16"},{"key":"ref_22","unstructured":"Li, L.-H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. (2019). VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv."},{"key":"ref_23",
"unstructured":"Lu, J., Batra, D., Parikh, D., and Lee, S. (2019, December 8\u201314). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Proceedings of the Thirty-Third Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada."},{"key":"ref_24","unstructured":"Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., and Dai, J. (2020, April 26\u201330). VL-BERT: Pre-training of generic visual-linguistic representations. Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. (2019). UNITER: Universal Image-Text Representation Learning. arXiv.","DOI":"10.1007\/978-3-030-58577-8_7"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/3\/1012\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:18:58Z","timestamp":1760159938000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/3\/1012"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,2,2]]},"references-count":25,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2021,2]]}},"alternative-id":["s21031012"],"URL":"https:\/\/doi.org\/10.3390\/s21031012","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2021,2,2]]}}}
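
Note on the abstract above: it describes a backtracking-enabled greedy local search (BGLS) that selects among candidate actions by combining a local score with a global score computed over the partial trajectory searched so far. The Python sketch below is only a hypothetical illustration of that general idea, derived from the abstract's wording; the graph representation, the alpha weighting, and the toy scoring functions are invented for the example and are not the authors' JMEBS/BGLS implementation.

import heapq


def bgls(graph, start, goal, local_score, global_score, alpha=0.5, max_steps=50):
    """Greedy local search with backtracking (illustrative sketch only).

    graph        : dict mapping a node to its neighbouring nodes
    start, goal  : node identifiers
    local_score  : f(current_node, candidate) -> float, higher is better
    global_score : f(partial_trajectory)      -> float, higher is better
    alpha        : assumed weight balancing local vs. global scores
    """
    # The frontier keeps every candidate action seen so far, ordered by its
    # combined score, so the agent can backtrack to the best unexplored
    # alternative instead of committing to a purely greedy path.
    frontier = [(0.0, start, [start])]   # max-heap via negated scores
    visited = set()
    trajectory = [start]

    for _ in range(max_steps):
        if not frontier:
            break
        _, node, trajectory = heapq.heappop(frontier)   # may jump (backtrack) to an earlier branch
        if node in visited:
            continue
        if node == goal:
            return trajectory
        visited.add(node)
        for cand in graph.get(node, []):
            if cand in visited:
                continue
            new_traj = trajectory + [cand]
            combined = alpha * local_score(node, cand) + (1 - alpha) * global_score(new_traj)
            heapq.heappush(frontier, (-combined, cand, new_traj))
    return trajectory  # best-effort path if the goal was not reached


if __name__ == "__main__":
    # Toy example: nodes stand in for viewpoints; the scores are made-up
    # placeholders for an action probability (local) and an
    # instruction-trajectory similarity (global).
    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["E"]}
    local = lambda u, v: {"B": 0.9, "C": 0.4, "D": 0.6, "E": 0.8}[v]
    global_ = lambda traj: 1.0 / len(traj)   # toy stand-in: prefer shorter partial paths
    print(bgls(graph, "A", "E", local, global_))

Keeping every previously seen candidate in a score-ordered frontier is what allows the search to return to an earlier, better-scoring viewpoint, which is the backtracking behaviour the abstract attributes to BGLS.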