{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T13:13:22Z","timestamp":1775913202451,"version":"3.50.1"},"reference-count":50,"publisher":"MDPI AG","issue":"24","license":[{"start":{"date-parts":[[2023,12,10]],"date-time":"2023-12-10T00:00:00Z","timestamp":1702166400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61971279"],"award-info":[{"award-number":["61971279"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62022054"],"award-info":[{"award-number":["62022054"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Self-supervised learning (SSL) has significantly bridged the gap between supervised and unsupervised learning in computer vision tasks and shown impressive success in the field of remote sensing (RS). However, these methods have primarily focused on single-modal RS data, which may have limitations in capturing the diversity of information in complex scenes. In this paper, we propose the Asymmetric Attention Fusion (AAF) framework to explore the potential of multi-modal representation learning compared to two simpler fusion methods: early fusion and late fusion. Given that data from active sensors (e.g., digital surface models and light detection and ranging) is often noisier and less informative than optical images, the AAF is designed with an asymmetric attention mechanism within a two-stream encoder, applied at each encoder stage. Additionally, we introduce a Transfer Gate module to select more informative features from the fused representations, enhancing performance in downstream tasks. Our comparative analyses on the ISPRS Potsdam datasets, focusing on scene classification and segmentation tasks, demonstrate significant performance enhancements with AAF compared to baseline methods. The proposed approach achieves an improvement of over 7% in all metrics compared to randomly initialized methods for both tasks. Furthermore, when compared to early fusion and late fusion methods, AAF consistently outperforms in achieving superior improvements. These results underscore the effectiveness of AAF in leveraging the strengths of multi-modal RS data for SSL, opening doors for more sophisticated and nuanced RS analysis.<\/jats:p>","DOI":"10.3390\/rs15245682","type":"journal-article","created":{"date-parts":[[2023,12,11]],"date-time":"2023-12-11T13:18:21Z","timestamp":1702300701000},"page":"5682","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Exploring Self-Supervised Learning for Multi-Modal Remote Sensing Pre-Training via Asymmetric Attention Fusion"],"prefix":"10.3390","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5083-5896","authenticated-orcid":false,"given":"Guozheng","family":"Xu","sequence":"first","affiliation":[{"name":"The School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7099-6817","authenticated-orcid":false,"given":"Xue","family":"Jiang","sequence":"additional","affiliation":[{"name":"The School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0550-8247","authenticated-orcid":false,"given":"Xiangtai","family":"Li","sequence":"additional","affiliation":[{"name":"The S-Lab, Nanyang Technological University, Singapore 639798, Singapore"}]},{"given":"Ze","family":"Zhang","sequence":"additional","affiliation":[{"name":"The School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4533-3904","authenticated-orcid":false,"given":"Xingzhao","family":"Liu","sequence":"additional","affiliation":[{"name":"The School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,12,10]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"103921","DOI":"10.1016\/j.landurbplan.2020.103921","article-title":"Remote sensing in urban planning: Contributions towards ecologically sound policies?","volume":"204","author":"Wellmann","year":"2020","journal-title":"Landsc. Urban Plan."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1016\/j.rse.2014.09.034","article-title":"SAR and optical remote sensing: Assessment of complementarity and interoperability in the context of a large-scale operational forest monitoring system","volume":"156","author":"Lehmann","year":"2015","journal-title":"Remote Sens. Environ."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1915","DOI":"10.1016\/S2095-3119(17)61859-8","article-title":"Agricultural remote sensing big data: Management and applications","volume":"17","author":"Huang","year":"2018","journal-title":"J. Integr. Agric."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Schumann, G.J., Brakenridge, G.R., Kettner, A.J., Kashif, R., and Niebuhr, E. (2018). Assisting flood disaster response with earth observation data and products: A critical assessment. Remote Sens., 10.","DOI":"10.3390\/rs10081230"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3485128","article-title":"Tackling climate change with machine learning","volume":"55","author":"Rolnick","year":"2022","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020, January 14\u201319). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"ref_7","unstructured":"Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020, January 13\u201318). A simple framework for contrastive learning of visual representations. Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event."},{"key":"ref_8","first-page":"9912","article-title":"Unsupervised learning of visual features by contrasting cluster assignments","volume":"33","author":"Caron","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_9","first-page":"21271","article-title":"Bootstrap your own latent-a new approach to self-supervised learning","volume":"33","author":"Grill","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Chen, X., and He, K. (2021, January 19\u201325). Exploring simple siamese representation learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01549"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Manas, O., Lacoste, A., Gir\u00f3-i Nieto, X., Vazquez, D., and Rodriguez, P. (2021, January 11\u201317). Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00928"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., and Ermon, S. (2021, January 11\u201317). Geography-aware self-supervised learning. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.01002"},{"key":"ref_13","first-page":"1","article-title":"Geographical knowledge-driven representation learning for remote sensing images","volume":"60","author":"Li","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1016\/j.isprsjprs.2017.11.011","article-title":"Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks","volume":"140","author":"Audebert","year":"2018","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"1551","DOI":"10.1109\/JSTARS.2020.2983993","article-title":"Change detection in heterogeneous optical and SAR remote sensing images via deep homogeneous feature fusion","volume":"13","author":"Jiang","year":"2020","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"5","DOI":"10.5194\/isprs-annals-IV-1-5-2018","article-title":"Sar to Optical Image Synthesis for Cloud Removal with Generative Adversarial Networks","volume":"4","author":"Bermudez","year":"2018","journal-title":"ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"11485","DOI":"10.1109\/JSTARS.2021.3119191","article-title":"Multisensor land cover classification with sparsely annotated data based on convolutional neural networks and self-distillation","volume":"14","author":"Gbodjo","year":"2021","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1016\/j.isprsjprs.2019.09.016","article-title":"Combining Sentinel-1 and Sentinel-2 Satellite Image Time Series for land cover mapping via a multi-source deep learning architecture","volume":"158","author":"Ienco","year":"2019","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Chen, Y., and Bruzzone, L. (2021). Self-supervised SAR-optical Data Fusion and Land-cover Mapping using Sentinel-1\/-2 Images. arXiv.","DOI":"10.1109\/TGRS.2021.3128072"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TGRS.2023.3335484","article-title":"Nearest Neighbor-Based Contrastive Learning for Hyperspectral and LiDAR Data Classification","volume":"61","author":"Wang","year":"2023","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Lin, J., Gao, F., Shi, X., Dong, J., and Du, Q. (2023). SS-MAE: Spatial-Spectral Masked Auto-Encoder for Multi-Source Remote Sensing Image Classification. arXiv.","DOI":"10.1109\/TGRS.2023.3331717"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Scheibenreif, L., Hanna, J., Mommert, M., and Borth, D. (2022, January 18\u201324). Self-supervised vision transformers for land-cover segmentation and classification. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPRW56347.2022.00148"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zhang, R., Isola, P., and Efros, A.A. (2016, January 11\u201314). Colorful image colorization. Proceedings of the Computer Vision\u2013ECCV 2016: 14th European Conference, Amsterdam, The Netherlands. Proceedings, Part III 14.","DOI":"10.1007\/978-3-319-46487-9_40"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Doersch, C., Gupta, A., and Efros, A.A. (2015, January 7\u201313). Unsupervised visual representation learning by context prediction. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.167"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wang, X., and Gupta, A. (2015, January 7\u201313). Unsupervised learning of visual representations using videos. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.320"},{"key":"ref_26","unstructured":"Gidaris, S., Singh, P., and Komodakis, N. (2018). Unsupervised representation learning by predicting image rotations. arXiv."},{"key":"ref_27","unstructured":"Chen, X., Fan, H., Girshick, R., and He, K. (2020). Improved baselines with momentum contrastive learning. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Chen, X., Xie, S., and He, K. (2021, January 11\u201317). An empirical study of training self-supervised vision transformers. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00950"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., and Girshick, R. (2022, January 18\u201324). Masked autoencoders are scalable vision learners. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"ref_30","unstructured":"Bao, H., Dong, L., Piao, S., and Wei, F. (2021). Beit: Bert pre-training of image transformers. arXiv."},{"key":"ref_31","unstructured":"Dong, X., Bao, J., Zhang, T., Chen, D., Zhang, W., Yuan, L., Chen, D., Wen, F., and Yu, N. (2021). Peco: Perceptual codebook for bert pre-training of vision transformers. arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Caron, M., Touvron, H., Misra, I., J\u00e9gou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021, January 11\u201317). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"ref_33","first-page":"1","article-title":"Remote sensing image scene classification with self-supervised paradigm under limited labeled samples","volume":"19","author":"Tao","year":"2020","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"2598","DOI":"10.1109\/TGRS.2020.3007029","article-title":"Deep unsupervised embedding for remotely sensed images based on spatially augmented momentum contrast","volume":"59","author":"Kang","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Chen, W., Zheng, X., and Lu, X. (2021). Hyperspectral image super-resolution with self-supervised spectral-spatial residual network. Remote Sens., 13.","DOI":"10.3390\/rs13071260"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"3215","DOI":"10.1109\/JSTARS.2021.3063335","article-title":"Self-supervised deep subspace clustering for hyperspectral images with adaptive self-expressive coefficient matrix initialization","volume":"14","author":"Li","year":"2021","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"7734","DOI":"10.1109\/TGRS.2020.2983420","article-title":"Feature-Enhanced Speckle Reduction via Low-Rank and Space-Angle Continuity for Circular SAR Target Recognition","volume":"58","author":"Chen","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1109\/MGRS.2016.2561021","article-title":"Data fusion and remote sensing: An ever-growing relationship","volume":"4","author":"Schmitt","year":"2016","journal-title":"IEEE Geosci. Remote Sens. Mag."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1109\/MGRS.2018.2890023","article-title":"Multisource and multitemporal data fusion in remote sensing: A comprehensive review of the state of the art","volume":"7","author":"Ghamisi","year":"2019","journal-title":"IEEE Geosci. Remote Sens. Mag."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1016\/j.isprsjprs.2020.01.013","article-title":"ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data","volume":"162","author":"Diakogiannis","year":"2020","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Hazirbas, C., Ma, L., Domokos, C., and Cremers, D. (2016, January 20\u201324). Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. Proceedings of the Computer Vision\u2013ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan. Revised Selected Papers, Part I 13.","DOI":"10.1007\/978-3-319-54181-5_14"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"3463","DOI":"10.1109\/JSTARS.2022.3165005","article-title":"A crossmodal multiscale fusion network for semantic segmentation of remote sensing data","volume":"15","author":"Ma","year":"2022","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1080\/19479830903562041","article-title":"Fusing high-resolution SAR and optical imagery for improved urban land cover study and classification","volume":"1","author":"Amarsaikhan","year":"2010","journal-title":"Int. J. Image Data Fusion"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Geng, J., Wang, H., Fan, J., and Ma, X. (2017, January 23\u201328). Classification of fusing SAR and multispectral image via deep bimodal autoencoders. Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA.","DOI":"10.1109\/IGARSS.2017.8127079"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Wang, Y., Albrecht, C.M., and Zhu, X.X. (2022, January 17\u201322). Self-supervised vision transformers for joint SAR-optical representation learning. Proceedings of the IGARSS 2022\u20132022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia.","DOI":"10.1109\/IGARSS46834.2022.9883983"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201322). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.Y., and Kweon, I.S. (2018, January 8\u201314). Cbam: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_49","unstructured":"Contributors (2020, June 16). MMSelfSup: Openmmlab Self-Supervised Learning Toolbox and Benchmark. Available online: https:\/\/github.com\/open-mmlab\/mmselfsup."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/24\/5682\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T21:36:23Z","timestamp":1760132183000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/24\/5682"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,10]]},"references-count":50,"journal-issue":{"issue":"24","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["rs15245682"],"URL":"https:\/\/doi.org\/10.3390\/rs15245682","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,10]]}}}