{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,17]],"date-time":"2026-07-17T10:29:06Z","timestamp":1784284146655,"version":"3.55.0"},"reference-count":70,"publisher":"Association for Computing Machinery (ACM)","issue":"12","funder":[{"DOI":"10.13039\/501100004731","name":"Natural Science Foundation of Zhejiang Province","doi-asserted-by":"crossref","award":["LZJMZ24D050009"],"award-info":[{"award-number":["LZJMZ24D050009"]}],"id":[{"id":"10.13039\/501100004731","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Emergency Management Research and Development Project of Zhejiang Province","award":["2024YJ018"],"award-info":[{"award-number":["2024YJ018"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>Learning representations through self-supervision on unlabeled data has proven highly effective for understanding diverse images. However, remote sensing images often have complex and densely populated scenes with multiple land objects and no clear foreground objects. This intrinsic property generates high object density, resulting in false positive pairs or missing contextual information in self-supervised learning. To address these problems, we propose a context-enhanced masked image modeling (CtxMIM) method, a simple yet efficient MIM-based self-supervised learning for remote sensing image understanding. CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches. A context-enhanced generative branch is introduced to provide contextual information through context consistency constraints in the reconstruction. With the simple and elegant design, CtxMIM encourages the pretraining model to learn object-level or pixel-level features on a large-scale dataset without specific temporal or geographical constraints. Finally, extensive experiments show that features learned by CtxMIM outperform fully supervised and state-of-the-art self-supervised learning methods on various downstream tasks, including land cover classification, semantic segmentation, object detection, and instance segmentation. These results demonstrate that CtxMIM learns impressive remote sensing representations with high generalization and transferability.<\/jats:p>","DOI":"10.1145\/3769084","type":"journal-article","created":{"date-parts":[[2025,9,23]],"date-time":"2025-09-23T13:20:49Z","timestamp":1758633649000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6415-2423","authenticated-orcid":false,"given":"Mingming","family":"Zhang","sequence":"first","affiliation":[{"name":"State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5181-6451","authenticated-orcid":false,"given":"Qingjie","family":"Liu","sequence":"additional","affiliation":[{"name":"Hangzhou Innovation Institute, Beihang University, Hangzhou, China and State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8001-2703","authenticated-orcid":false,"given":"Yunhong","family":"Wang","sequence":"additional","affiliation":[{"name":"State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,11,21]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01002"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19836-6_20"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01538"},{"key":"e_1_3_1_5_2","first-page":"9912","volume-title":"Proceedings of the Advances in Neural Information Processing Systems (NeurIPS)","author":"Caron Mathilde","year":"2020","unstructured":"Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 9912\u20139924."},{"key":"e_1_3_1_6_2","first-page":"1597","volume-title":"Proceedings of the International Conference on Machine Learning (ICML)","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), 1597\u20131607."},{"key":"e_1_3_1_7_2","unstructured":"Xinlei Chen Haoqi Fan Ross Girshick and Kaiming He. 2020. Improved baselines with momentum contrastive learning. arXiv:2003.04297. Retrieved from https:\/\/arxiv.org\/abs\/2003.04297"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01549"},{"key":"e_1_3_1_9_2","doi-asserted-by":"crossref","unstructured":"Yuxing Chen and Lorenzo Bruzzone. 2022. Self-supervised change detection in multiview remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1\u201312.","DOI":"10.1109\/TGRS.2021.3089453"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2017.2675998"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00646"},{"key":"e_1_3_1_12_2","first-page":"197","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Cong Yezhen","year":"2022","unstructured":"Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. 2022. SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. In Advances in Neural Information Processing Systems (NeurIPS), 197\u2013211."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3556978"},{"key":"e_1_3_1_14_2","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR)","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_1_15_2","doi-asserted-by":"crossref","unstructured":"Matthias Drusch Umberto Del Bello S\u00e9bastien Carlier Olivier Colin Veronica Fernandez Ferran Gascon Bianca Hoersch Claudia Isola Paolo Laberinti Philippe Martimort et al. 2012. Sentinel-2: ESA\u2019s optical high-resolution mission for GMES operational services. Remote Sensing of Environment 120 (2012) 25\u201336.","DOI":"10.1016\/j.rse.2011.11.026"},{"key":"e_1_3_1_16_2","doi-asserted-by":"crossref","unstructured":"Zhanzhou Feng and Shiliang Zhang. 2025. Evolved hierarchical masking for self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 47 2 (2025) 1013\u20131027.","DOI":"10.1109\/TPAMI.2024.3490776"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.rse.2017.06.031"},{"key":"e_1_3_1_18_2","first-page":"21271","volume-title":"Advances in Neural Information Processing Systems (NeurIPS 2020)","author":"Grill Jean-Bastien","year":"2020","unstructured":"Jean-Bastien Grill, Florian Strub, Florent Altch\u00e9, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent\u2014A new approach to self-supervised learning. In Advances in Neural Information Processing Systems (NeurIPS 2020), 21271\u201321284."},{"key":"e_1_3_1_19_2","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Gupta Agrim","year":"2024","unstructured":"Agrim Gupta, Jiajun Wu, Jia Deng, and Fei-Fei Li. 2024. Siamese masked autoencoders. In Advances in Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.322"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_24_2","doi-asserted-by":"crossref","unstructured":"Patrick Helber Benjamin Bischke Andreas Dengel and Damian Borth. 2019. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 7 (2019) 2217\u20132226.","DOI":"10.1109\/JSTARS.2019.2918242"},{"key":"e_1_3_1_25_2","doi-asserted-by":"crossref","unstructured":"D. Hong B. Zhang X. Li Y. Li C. Li J. Yao N. Yokoya H. Li P. Ghamisi and X. Jia. 2024. SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 46 (2024) 8.","DOI":"10.1109\/TPAMI.2024.3362475"},{"key":"e_1_3_1_26_2","doi-asserted-by":"crossref","unstructured":"Ziyue Huang Mingming Zhang Yuan Gong Qingjie Liu and Yunhong Wang. 2024. Generic knowledge boosted pre-training for remote sensing images. IEEE Transactions on Geoscience and Remote Sensing.","DOI":"10.1109\/TGRS.2024.3354031"},{"key":"e_1_3_1_27_2","doi-asserted-by":"crossref","unstructured":"Heechul Jung Yoonju Oh Seongho Jeong Chaehyeon Lee and Taegyun Jeon. 2021. Contrastive self-supervised learning with smoothed representation for remote sensing. IEEE Geoscience and Remote Sensing Letters 19 (2021) 1\u20135.","DOI":"10.1109\/LGRS.2021.3069799"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2020.3007029"},{"key":"e_1_3_1_29_2","unstructured":"Darius Lam Richard Kuzma Kevin McGee Samuel Dooley Michael Laielli Matthew Klaric Yaroslav Bulatov and Brendan McCord. 2018. xView: Objects in context in overhead imagery. arXiv:1802.07856. Retrieved from https:\/\/arxiv.org\/abs\/1802.07856"},{"key":"e_1_3_1_30_2","first-page":"1","article-title":"Global and local contrastive self-supervised learning for semantic segmentation of HR remote sensing images","volume":"60","author":"Li Haifeng","year":"2022","unstructured":"Haifeng Li, Yi Li, Guo Zhang, Ruoyun Liu, Haozhe Huang, Qing Zhu, and Chao Tao. 2022. Global and local contrastive self-supervised learning for semantic segmentation of HR remote sensing images. IEEE Transactions on Geoscience Remote Sensing 60 (2022), 1\u201314.","journal-title":"IEEE Transactions on Geoscience Remote Sensing"},{"key":"e_1_3_1_31_2","doi-asserted-by":"crossref","unstructured":"Wenyuan Li Hao Chen and Zhenwei Shi. 2021. Semantic segmentation of remote sensing images with self-supervised multitask representation learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021) 6438\u20136450.","DOI":"10.1109\/JSTARS.2021.3090418"},{"key":"e_1_3_1_32_2","first-page":"1","article-title":"Geographical knowledge-driven representation learning for remote sensing images","volume":"60","author":"Li Wenyuan","year":"2021","unstructured":"Wenyuan Li, Keyan Chen, Hao Chen, and Zhenwei Shi. 2021. Geographical knowledge-driven representation learning for remote sensing images. IEEE Transactions on Geoscience Remote Sensing 60 (2021), 1\u201316.","journal-title":"IEEE Transactions on Geoscience Remote Sensing"},{"key":"e_1_3_1_33_2","doi-asserted-by":"crossref","unstructured":"Wenyuan Li Keyan Chen and Zhenwei Shi. 2022. Geographical supervision correction for remote sensing representation learning. IEEE Transactions on Geoscience Remote Sensing 60 (2022) 1\u201320.","DOI":"10.1109\/TGRS.2022.3202499"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.324"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00605"},{"key":"e_1_3_1_37_2","doi-asserted-by":"crossref","unstructured":"Xiyao Liu Cundian Yang Jianbiao He Hui Fang Gerald Schaefer Jian Zhang Yuesheng Zhu and Shichao Zhang. 2024. Attack-defending contrastive learning for volumetric medical image zero-watermarking. ACM Transactions on Multimedia Computing Communications and Applications 21 2 (2024) 1\u201323.","DOI":"10.1145\/3702230"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSTARS.2021.3070368"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3473342"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00509"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00928"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01541"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2022.3177770"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2023.3268232"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02627"},{"key":"e_1_3_1_47_2","unstructured":"Aaron van den Oord Yazhe Li and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv:1807.03748. Retrieved from https:\/\/arxiv.org\/abs\/1807.03748"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00378"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW56347.2022.00148"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW53098.2021.00129"},{"key":"e_1_3_1_52_2","volume-title":"Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS)","author":"Tang Maofeng","year":"2024","unstructured":"Maofeng Tang, Andrei Cozma, Konstantinos Georgiou, and Hairong Qi. 2024. Cross-scale MAE: A tale of multiscale exploitation in remote sensing. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS)."},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV56688.2023.00379"},{"key":"e_1_3_1_54_2","doi-asserted-by":"crossref","unstructured":"Chao Tao Ji Qi Guo Zhang Qing Zhu Weipeng Lu and Haifeng Li. 2023. TOV: The original vision model for optical remote sensing image understanding via self-supervised learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16 (2023) 4916\u20134930.","DOI":"10.1109\/JSTARS.2023.3271312"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58621-8_45"},{"key":"e_1_3_1_56_2","first-page":"10078","volume-title":"Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS)","author":"Tong Zhan","year":"2022","unstructured":"Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. 2022. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), 10078\u201310093."},{"key":"e_1_3_1_57_2","unstructured":"Adam Van Etten Dave Lindenbaum and Todd M. Bacastow. 2018. SpaceNet: A remote sensing dataset and challenge series. arXiv:1807.01232. Retrieved from https:\/\/arxiv.org\/abs\/1807.01232"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR48806.2021.9413112"},{"key":"e_1_3_1_59_2","first-page":"1","article-title":"Change detection based on supervised contrastive learning for high-resolution remote sensing imagery","volume":"61","author":"Wang Jue","year":"2023","unstructured":"Jue Wang, Yanfei Zhong, and Liangpei Zhang. 2023. Change detection based on supervised contrastive learning for high-resolution remote sensing imagery. IEEE Transactions on Geoscience Remote Sensing 61 (2023), 1\u201316.","journal-title":"IEEE Transactions on Geoscience Remote Sensing"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01426"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00418"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00943"},{"key":"e_1_3_1_63_2","first-page":"1","article-title":"Self-supervised feature learning for multimodal remote sensing image land cover classification","volume":"60","author":"Xue Zhixiang","year":"2022","unstructured":"Zhixiang Xue, Xuchu Yu, Anzhu Yu, Bing Liu, Pengqiang Zhang, and Shentong Wu. 2022. Self-supervised feature learning for multimodal remote sensing image land cover classification. IEEE Transactions on Geoscience Remote Sensing 60 (2022), 1\u201315.","journal-title":"IEEE Transactions on Geoscience Remote Sensing"},{"key":"e_1_3_1_64_2","doi-asserted-by":"crossref","unstructured":"Chao You Licheng Jiao Lingling Li Xu Liu Fang Liu Wenping Ma and Shuyuan Yang. 2025. Contour knowledge-aware perception learning for semantic segmentation. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 35 5 (2025) 4560\u20134575.","DOI":"10.1109\/TCSVT.2024.3515088"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2024.102246"},{"key":"e_1_3_1_66_2","first-page":"12310","volume-title":"Proceedings of the 38th International Conference on Machine Learning (ICML)","author":"Zbontar Jure","year":"2021","unstructured":"Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and St\u00e9phane Deny. 2021. Barlow twins: Self-supervised learning via redundancy reduction. In Proceedings of the 38th International Conference on Machine Learning (ICML), 12310\u201312320."},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01240-3_17"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2024.3407138"},{"key":"e_1_3_1_69_2","unstructured":"Tianyi Zhao Maoxun Yuan Feng Jiang Nan Wang and Xingxing Wei. 2024. Removal then selection: A coarse-to-fine fusion perspective for RGB-infrared object detection. arXiv:2401.10731. Retrieved from https:\/\/arxiv.org\/abs\/2401.10731"},{"key":"e_1_3_1_70_2","first-page":"1","article-title":"Gradient-guided multi-scale focal attention network for remote sensing scene classification","author":"Zhao Yue","year":"2024","unstructured":"Yue Zhao, Maoguo Gong, A. Kai Qin, Mingyang Zhang, Zhuping Hu, Tianqi Gao, and Yan Pu. 2024. Gradient-guided multi-scale focal attention network for remote sensing scene classification. IEEE Transactions on Geoscience Remote Sensing 62 (2024), 1\u201318.","journal-title":"IEEE Transactions on Geoscience Remote Sensing"},{"key":"e_1_3_1_71_2","first-page":"1","article-title":"STMNet: Single-temporal mask-based network for self-supervised hyperspectral change detection","author":"Zhou Tianyuan","year":"2025","unstructured":"Tianyuan Zhou, Fulin Luo, Chuan Fu, Tan Guo, Xiaopan Wang, Bo Du, and Xinbo Gao. 2025. STMNet: Single-temporal mask-based network for self-supervised hyperspectral change detection. IEEE Transactions on Geoscience Remote Sensing 63 (2025), 1\u201312.","journal-title":"IEEE Transactions on Geoscience Remote Sensing"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3769084","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T06:58:45Z","timestamp":1763794725000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3769084"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,21]]},"references-count":70,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12,31]]}},"alternative-id":["10.1145\/3769084"],"URL":"https:\/\/doi.org\/10.1145\/3769084","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,21]]},"assertion":[{"value":"2025-01-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-11","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}