{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,21]],"date-time":"2026-05-21T09:36:11Z","timestamp":1779356171082,"version":"3.51.4"},"reference-count":42,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2023,3,28]],"date-time":"2023-03-28T00:00:00Z","timestamp":1679961600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2021YFB2501403"],"award-info":[{"award-number":["2021YFB2501403"]}]},{"name":"National Key Research and Development Program of China","award":["STS20201600200122"],"award-info":[{"award-number":["STS20201600200122"]}]},{"name":"Science and Technology Service Network Initiative Program of The Chinese Academy of Sciences","award":["2021YFB2501403"],"award-info":[{"award-number":["2021YFB2501403"]}]},{"name":"Science and Technology Service Network Initiative Program of The Chinese Academy of Sciences","award":["STS20201600200122"],"award-info":[{"award-number":["STS20201600200122"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Embodied PointGoal navigation is a fundamental task for embodied agents. Recent works have shown that the performance of the embodied navigation agent degrades significantly in the presence of visual corruption, including Spatter, Speckle Noise, and Defocus Blur, showing the weak robustness of the agent. To improve the robustness of embodied navigation agents to various visual corruptions, we propose a navigation framework called Regularized Denoising Masked AutoEncoders Navigation (RDMAE-Nav). In a nutshell, RDMAE-Nav mainly consists of two modules: a visual module and a policy module. 
In the visual module, a self-supervised pretraining method, dubbed Regularized Denoising Masked AutoEncoders (RDMAE), is designed to enable the Vision Transformers (ViT)-based visual encoder to learn robust representations. The bidirectional Kullback\u2013Leibler divergence is introduced in RDMAE as the regularization term for a denoising masked modeling task. Specifically, RDMAE mitigates the gap between clean and noisy image representations by minimizing the bidirectional Kullback\u2013Leibler divergence. Then, the visual encoder is pretrained by RDMAE. In contrast to existing works, RDMAE-Nav applies denoising masked visual pretraining for PointGoal navigation to improve robustness to various visual corruptions. Finally, the pretrained visual encoder with frozen weights is applied to extract robust visual representations for policy learning in the RDMAE-Nav. Extensive experiments show that RDMAE-Nav performs competitively compared with state of the arts (SOTAs) on various visual corruptions. 
In detail, RDMAE-Nav performs the absolute improvement: 28.2% in SR and 23.68% in SPL under Spatter; 2.28% in SR and 6.41% in SPL under Speckle Noise; and 9.46% in SR and 9.55% in SPL under Defocus Blur.<\/jats:p>","DOI":"10.3390\/s23073553","type":"journal-article","created":{"date-parts":[[2023,3,29]],"date-time":"2023-03-29T01:33:00Z","timestamp":1680053580000},"page":"3553","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Regularized Denoising Masked Visual Pretraining for Robust Embodied PointGoal Navigation"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2874-4744","authenticated-orcid":false,"given":"Jie","family":"Peng","sequence":"first","affiliation":[{"name":"Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, China"},{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yangbin","family":"Xu","sequence":"additional","affiliation":[{"name":"Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, China"},{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Luqing","family":"Luo","sequence":"additional","affiliation":[{"name":"Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Haiyang","family":"Liu","sequence":"additional","affiliation":[{"name":"Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kaiqiang","family":"Lu","sequence":"additional","affiliation":[{"name":"Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jian","family":"Liu","sequence":"additional","affiliation":[{"name":"Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,3,28]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"230","DOI":"10.1109\/TETCI.2022.3141105","article-title":"A survey of embodied ai: From simulators to research tasks","volume":"6","author":"Duan","year":"2022","journal-title":"IEEE Trans. Emerg. Top. Comput. Intell."},{"key":"ref_2","unstructured":"Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A., and Farhadi, A. (2017). Ai2-thor: An interactive 3d environment for visual ai. arXiv."},{"key":"ref_3","unstructured":"Li, C., Xia, F., Mart\u00edn-Mart\u00edn, R., Lingelbach, M., Srivastava, S., Shen, B., Vainio, K.E., Gokmen, C., Dharan, G., and Jain, T. (2022, January 8\u201311). iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks. Proceedings of the Conference on Robot Learning, London, UK."},{"key":"ref_4","first-page":"251","article-title":"Habitat 2.0: Training home assistants to rearrange their habitat","volume":"34","author":"Szot","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_5","unstructured":"Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., and Savva, M. (2018). On evaluation of embodied navigation agents. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Tai, L., Paolo, G., and Liu, M. (2017, January 24\u201328). Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. 
Proceedings of the 2017 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada.","DOI":"10.1109\/IROS.2017.8202134"},{"key":"ref_7","unstructured":"Bansal, S., Tolani, V., Gupta, S., Malik, J., and Tomlin, C. (2020, January 16\u201318). Combining optimal control and learning for visual navigation in novel environments. Proceedings of the Conference on Robot Learning, Virtual."},{"key":"ref_8","unstructured":"Wijmans, E., Kadian, A., Morcos, A., Lee, S., Essa, I., Parikh, D., Savva, M., and Batra, D. (2019). Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. arXiv."},{"key":"ref_9","unstructured":"Wijmans, E., Essa, I., and Batra, D. (2022, January 9\u201313). How to Train PointGoal Navigation Agents on a (Sample and Compute) Budget. Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, Virtual."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Zhao, X., Agrawal, H., Batra, D., and Schwing, A.G. (2021, January 11\u201317). The surprising effectiveness of visual odometry techniques for embodied pointgoal navigation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.01582"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Karkus, P., Cai, S., and Hsu, D. (2021, January 11\u201317). Differentiable slam-net: Learning particle slam for visual navigation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Montreal, BC, Canada.","DOI":"10.1109\/CVPR46437.2021.00284"},{"key":"ref_12","first-page":"5422","article-title":"Monocular Camera-based Point-goal Navigation by Learning Depth Channel and Cross-modality Pyramid Fusion","volume":"36","author":"Tang","year":"2022","journal-title":"Proc. AAAI Conf. Artif. 
Intell."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Chattopadhyay, P., Hoffman, J., Mottaghi, R., and Kembhavi, A. (2021, January 11\u201317). Robustnav: Towards benchmarking robustness in embodied navigation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.01540"},{"key":"ref_14","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_15","unstructured":"Wu, Q., Ye, H., Gu, Y., Zhang, H., Wang, L., and He, D. (2022). Denoising Masked AutoEncoders are Certifiable Robust Vision Learners. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Cho, K., Van Merri\u00ebnboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_17","unstructured":"Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., and Malik, J. (November, January 27). Habitat: A platform for embodied ai research. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_19","unstructured":"Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. 
arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"2634","DOI":"10.1109\/LRA.2021.3062303","article-title":"Bi-directional domain adaptation for sim2real transfer of embodied navigation agents","volume":"6","author":"Truong","year":"2021","journal-title":"IEEE Robot. Autom. Lett."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Sadek, A., Bono, G., Chidlovskii, B., and Wolf, C. (2022, January 23\u201327). An in-depth experimental study of sensor usage and visual reasoning of robots navigating in real environments. Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA.","DOI":"10.1109\/ICRA46639.2022.9811833"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Partsey, R., Wijmans, E., Yokoyama, N., Dobosevych, O., Batra, D., and Maksymets, O. (2022, January 18\u201324). Is Mapping Necessary for Realistic PointGoal Navigation?. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01672"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Lee, E.S., Kim, J., and Kim, Y.M. (2022, January 18\u201324). Self-Supervised Domain Adaptation for Visual Navigation with Global Map Consistency. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, New Orleans, LA, USA.","DOI":"10.1109\/WACV51458.2022.00193"},{"key":"ref_24","unstructured":"Sax, A., Emi, B., Zamir, A.R., Guibas, L., Savarese, S., and Malik, J. (2018). Mid-level visual representations improve generalization and sample efficiency for learning visuomotor policies. arXiv."},{"key":"ref_25","unstructured":"Ramakrishnan, S.K., Nagarajan, T., Al-Halah, Z., and Grauman, K. (2021). Environment predictive coding for embodied agents. arXiv."},{"key":"ref_26","unstructured":"Du, H., Yu, X., and Zheng, L. (2021). VTNet: Visual transformer network for object goal navigation. 
arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Khandelwal, A., Weihs, L., Mottaghi, R., and Kembhavi, A. (2022, January 18\u201324). Simple but effective: Clip embeddings for embodied ai. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01441"},{"key":"ref_28","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 6\u201314). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Saavedra-Ruiz, M., Morin, S., and Paull, L. (2022). Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers. arXiv.","DOI":"10.1109\/CRV55824.2022.00033"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Caron, M., Touvron, H., Misra, I., J\u00e9gou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021, January 11\u201317). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"683","DOI":"10.1109\/LRA.2020.3048662","article-title":"Embodied visual navigation with automatic curriculum learning in real environments","volume":"6","author":"Morad","year":"2021","journal-title":"IEEE Robot. Autom. Lett."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., and Girshick, R. (2022, January 18\u201324). Masked autoencoders are scalable vision learners. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"ref_33","unstructured":"Zhang, C., Zhang, C., Song, J., Yi, J.S.K., Zhang, K., and Kweon, I.S. (2022). A survey on masked autoencoder for self-supervised learning in vision and beyond. arXiv."},{"key":"ref_34","unstructured":"Xu, H., Ding, S., Zhang, X., Xiong, H., and Tian, Q. (2022). Masked autoencoders are robust data augmentors. arXiv."},{"key":"ref_35","unstructured":"Xiao, T., Radosavovic, I., Darrell, T., and Malik, J. (2022). Masked visual pre-training for motor control. arXiv."},{"key":"ref_36","unstructured":"Tao, T., Reda, D., and van de Panne, M. (2022). Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels. arXiv."},{"key":"ref_37","unstructured":"Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv."},{"key":"ref_38","first-page":"1","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Deitke, M., Han, W., Herrasti, A., Kembhavi, A., Kolve, E., Mottaghi, R., Salvador, J., Schwenk, D., VanderBilt, E., and Wallingford, M. (2020, January 13\u201319). Robothor: An open simulation-to-real embodied ai platform. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00323"},{"key":"ref_40","unstructured":"Murali, A., Chen, T., Alwala, K.V., Gandhi, D., Pinto, L., Gupta, S., and Gupta, A. (2019). Pyrobot: An open-source robotics framework for research and benchmarking. arXiv."},{"key":"ref_41","unstructured":"Loshchilov, I., and Hutter, F. (May, January 30). Decoupled Weight Decay Regularization. 
Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada."},{"key":"ref_42","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/7\/3553\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:05:25Z","timestamp":1760123125000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/7\/3553"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,28]]},"references-count":42,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2023,4]]}},"alternative-id":["s23073553"],"URL":"https:\/\/doi.org\/10.3390\/s23073553","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,28]]}}}