{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T16:07:10Z","timestamp":1753891630245,"version":"3.41.2"},"reference-count":31,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2022,11,17]],"date-time":"2022-11-17T00:00:00Z","timestamp":1668643200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100010661","name":"Horizon 2020 Framework Programme","doi-asserted-by":"publisher","award":["728003"],"award-info":[{"award-number":["728003"]}],"id":[{"id":"10.13039\/100010661","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004837","name":"Ministerio de Ciencia e Innovaci\u00f3n","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100004837","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Neurorobot."],"abstract":"<jats:p>With the rapid development of artificial intelligence technology, many researchers have begun to focus on visual language navigation, which is one of the most important tasks in multi-modal machine learning. The focus of this multi-modal field is how to fuse multiple inputs, which is crucial for the integrated feedback of intrinsic information. However, the existing models are only implemented through simple data augmentation or expansion, and are obviously far from being able to tap the intrinsic relationship between modalities. In this paper, to overcome these challenges, a novel multi-modal matching feedback self-tuning model is proposed, which is a novel neural network called Vital Information Matching Feedback Self-tuning Network (VIM-Net). Our VIM-Net network is mainly composed of two matching feedback modules, a visual matching feedback module (V-mat) and a trajectory matching feedback module (T-mat). Specifically, V-mat matches the target information of visual recognition with the entity information extracted by the command; T-mat matches the serialized trajectory feature with the direction of movement of the command. Ablation experiments and comparative experiments are conducted on the proposed model using the Matterport3D simulator and the Room-to-Room (R2R) benchmark datasets, and the final navigation effect is shown in detail. The results prove that the model proposed in this paper is indeed effective on the task.<\/jats:p>","DOI":"10.3389\/fnbot.2022.1035921","type":"journal-article","created":{"date-parts":[[2022,11,17]],"date-time":"2022-11-17T06:04:39Z","timestamp":1668665079000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Vital information matching in vision-and-language navigation"],"prefix":"10.3389","volume":"16","author":[{"given":"Zixi","family":"Jia","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kai","family":"Yu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jingyu","family":"Ru","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sikai","family":"Yang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sonya","family":"Coleman","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1965","published-online":{"date-parts":[[2022,11,17]]},"reference":[{"key":"B1","first-page":"3674","article-title":"\u201cVision-and-language navigation: interpreting visually-grounded navigation instructions in real environments,\u201d","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Anderson","year":"2018"},{"key":"B2","first-page":"2425","article-title":"VQA: visual question answering,\u201d","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Antol","year":"2015"},{"key":"B3","first-page":"12538","article-title":"\u201cTouchdown: natural language navigation and spatial reasoning in visual street environments,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Chen","year":"2019"},{"key":"B4","first-page":"326","article-title":"\u201cVisual dialog,\u201d","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Das","year":"2017"},{"key":"B5","first-page":"3318","article-title":"\u201cSpeaker-follower models for vision-and-language navigation,\u201d","author":"Fried","year":"2018","journal-title":"Proceedings of the 32nd International Conference on Neural Information Processing Systems"},{"key":"B6","first-page":"1440","article-title":"\u201cFast r-CNN,\u201d","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Girshick","year":"2015"},{"key":"B7","first-page":"13137","article-title":"\u201cTowards learning a generic agent for vision-and-language navigation via pre-training,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Hao","year":"2020"},{"key":"B8","first-page":"7404","article-title":"\u201cTransferable representation learning in vision-and-language navigation,\u201d","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Huang","year":"2019"},{"key":"B9","doi-asserted-by":"publisher","first-page":"1012","DOI":"10.3390\/s21031012","article-title":"Joint multimodal embedding and backtracking search in vision-and-language navigation","volume":"21","author":"Hwang","year":"2021","journal-title":"Sensors"},{"key":"B10","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/P17-1089","article-title":"Learning a neural semantic parser from user feedback,\u201d","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics","author":"Iyer","year":"2017"},{"key":"B11","doi-asserted-by":"crossref","first-page":"1862","DOI":"10.18653\/v1\/P19-1181","article-title":"\u201cStay on the path: instruction fidelity in vision-and-language navigation,\u201d","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Jain","year":"2019"},{"key":"B12","first-page":"11773","article-title":"\u201cLearning to follow directions in street view,\u201d","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Karl","year":"2020"},{"key":"B13","first-page":"6741","article-title":"\u201cTactical rewind: self-correction via backtracking in vision-and-language navigation,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ke","year":"2019"},{"key":"B14","first-page":"1","article-title":"\u201cEmbodied vision-and-language navigation with dynamic convolutional filters,\u201d","volume-title":"30th British Machine Vision Conference","author":"Landi","year":"2019"},{"key":"B15","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/TCYB.2021.3086501","article-title":"Learning to optimize: reference vector reinforcement learning adaption to constrained many-objective optimization of industrial copper burdening system","author":"Lianbo","year":"","journal-title":"IEEE Trans. Cybern"},{"key":"B16","doi-asserted-by":"publisher","first-page":"6723","DOI":"10.1109\/TSMC.2020.2963943","article-title":"Enhancing learning efficiency of brain storm optimization via orthogonal learning design","volume":"51","author":"Lianbo","year":"","journal-title":"IEEE Trans. Syst. Man Cybern. Syst"},{"key":"B17","doi-asserted-by":"publisher","first-page":"4125","DOI":"10.1109\/TMC.2021.3064314","article-title":"Truthful combinatorial double auctions for mobile edge computing in industrial internet of things","author":"Lianbo","year":"","journal-title":"IEEE Trans. Mobile Comput"},{"key":"B18","first-page":"6732","article-title":"\u201cThe regretful agent: heuristic-aided navigation through progress estimation,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ma","year":"2019"},{"key":"B19","first-page":"259","article-title":"\u201cImproving vision-and-language navigation with image-text pairs from the web,\u201d","volume-title":"European Conference on Computer Vision","author":"Majumdar","year":"2020"},{"key":"B20","doi-asserted-by":"publisher","first-page":"133529","DOI":"10.1109\/ACCESS.2019.2941547","article-title":"Mini-yolov3: real-time object detector for embedded applications","volume":"7","author":"Mao","year":"2019","journal-title":"IEEE Access"},{"key":"B21","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1063","article-title":"Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning","author":"Nguyen","year":"2019","journal-title":"arXiv preprint arXiv:1909.01871"},{"key":"B22","doi-asserted-by":"crossref","first-page":"6449","DOI":"10.18653\/v1\/D19-1681","article-title":"\u201cRun through the streets: a new dataset and baseline models for realistic urban navigation,\u201d","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Paz-Argaman","year":"2019"},{"key":"B23","doi-asserted-by":"crossref","first-page":"303","DOI":"10.1007\/978-3-030-58607-2_18","article-title":"\u201cObject-and-action aware model for visual language navigation,\u201d","volume-title":"Computer Vision-ECCV 2020: 16th European Conference","author":"Qi","year":"2020"},{"key":"B24","first-page":"779","article-title":"\u201cYou only look once: unified, real-time object detection,\u201d","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Redmon","year":"2016"},{"key":"B25","first-page":"2238","article-title":"\u201cLearning to map context-dependent sentences to executable formal queries,\u201d","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Suhr","year":"2018"},{"key":"B26","doi-asserted-by":"publisher","first-page":"246","DOI":"10.1007\/s11263-020-01374-3","article-title":"TALK2NAV: long-range vision-and-language navigation with dual attention and spatial memory","volume":"129","author":"Vasudevan","year":"2021","journal-title":"Int. J. Comput. Vision"},{"key":"B27","doi-asserted-by":"crossref","first-page":"6622","DOI":"10.1109\/CVPR.2019.00679","article-title":"\u201cReinforced cross-modal matching and self-supervised imitation learning for vision-language navigation,\u201d","volume-title":"2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Wang","year":"2019"},{"key":"B28","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1910.11301","article-title":"Cross-lingual vision-language navigation","author":"Yan","year":"2019","journal-title":"arXiv preprint arXiv:1910.11301"},{"key":"B29","doi-asserted-by":"publisher","first-page":"1302","DOI":"10.18653\/v1\/2021.eacl-main.111","article-title":"\u201cOn the evaluation of vision-and-language navigation instructions,\u201d","author":"Zhao","year":"2021","journal-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume"},{"key":"B30","first-page":"10012","article-title":"\u201cVision-language navigation with self-supervised auxiliary reasoning tasks,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhu","year":"2020"},{"key":"B31","first-page":"10730","article-title":"\u201cVision-dialog navigation by exploring cross-modal memory,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhu","year":"2020"}],"container-title":["Frontiers in Neurorobotics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2022.1035921\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,11,17]],"date-time":"2022-11-17T06:05:01Z","timestamp":1668665101000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2022.1035921\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,17]]},"references-count":31,"alternative-id":["10.3389\/fnbot.2022.1035921"],"URL":"https:\/\/doi.org\/10.3389\/fnbot.2022.1035921","relation":{},"ISSN":["1662-5218"],"issn-type":[{"type":"electronic","value":"1662-5218"}],"subject":[],"published":{"date-parts":[[2022,11,17]]},"article-number":"1035921"}}