{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,21]],"date-time":"2026-03-21T04:27:29Z","timestamp":1774067249668,"version":"3.50.1"},"reference-count":44,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2023,4,29]],"date-time":"2023-04-29T00:00:00Z","timestamp":1682726400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Program under Ministry of Science and Technology of the People\u2019s Republic of China","award":["2020YFB1600702"],"award-info":[{"award-number":["2020YFB1600702"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>In recent years, the application of semantic segmentation methods to remote sensing images has become increasingly prevalent across a diverse range of domains, including but not limited to forest detection, water body detection, urban rail transportation planning, and building extraction. With the incorporation of the Transformer model into computer vision, the efficacy and accuracy of these algorithms have been significantly enhanced. Nevertheless, the Transformer model\u2019s high computational complexity and dependence on pre-trained weights from large datasets lead to slow convergence when training for remote sensing segmentation tasks. Motivated by the success of the adapter module in the field of natural language processing, this paper presents a novel adapter module (ResAttn) for improving the model training speed for remote sensing segmentation. 
The ResAttn adopts a dual-attention structure to capture the interdependencies between sets of features, thereby improving its global modeling capability, and introduces a Swin Transformer-like down-sampling method that reduces the resolution while limiting information loss and retaining the original architecture. In addition, the existing Transformer model is limited in its ability to capture local high-frequency information, which can lead to an inadequate extraction of edge and texture features. To address this, this paper proposes a Local Feature Extractor (LFE) module, which is based on a convolutional neural network (CNN) and incorporates multi-scale feature extraction and a residual structure to effectively overcome this limitation. Further, a mask-based segmentation method is employed, and a residual-enhanced deformable attention block (Deformer Block) is incorporated to improve small-target segmentation accuracy. Finally, extensive experiments were performed on the ISPRS Potsdam dataset. 
The experimental results demonstrate the superior performance of the model described in this paper.<\/jats:p>","DOI":"10.3390\/rs15092363","type":"journal-article","created":{"date-parts":[[2023,5,1]],"date-time":"2023-05-01T12:10:03Z","timestamp":1682943003000},"page":"2363","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["ACTNet: A Dual-Attention Adapter with a CNN-Transformer Network for the Semantic Segmentation of Remote Sensing Imagery"],"prefix":"10.3390","volume":"15","author":[{"given":"Zheng","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Information, North China University of Technology, Beijing 100144, China"}]},{"given":"Fanchen","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Information, North China University of Technology, Beijing 100144, China"}]},{"given":"Changan","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Information, North China University of Technology, Beijing 100144, China"}]},{"given":"Qing","family":"Tian","sequence":"additional","affiliation":[{"name":"School of Information, North China University of Technology, Beijing 100144, China"}]},{"given":"Hongquan","family":"Qu","sequence":"additional","affiliation":[{"name":"School of Information, North China University of Technology, Beijing 100144, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,4,29]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Ke, L., Xiong, Y., and Gang, W. (2015, January 17\u201318). Remote Sensing Image Classification Method Based on Superpixel Segmentation and Adaptive Weighting K-Means. 
Proceedings of the 2015 International Conference on Virtual Reality and Visualization (ICVRV), Xiamen, China.","DOI":"10.1109\/ICVRV.2015.35"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1805","DOI":"10.1007\/s12524-018-0841-8","article-title":"Computationally efficient mean-shift parallel segmentation algorithm for high-resolution remote sensing images","volume":"46","author":"Wu","year":"2018","journal-title":"J. Indian Soc. Remote Sens."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Moser, G., and Serpico, S.B. (2008, January 8\u201311). Classification of High-Resolution Images Based on MRF Fusion and Multiscale Segmentation. Proceedings of the IGARSS 2008-2008 IEEE International Geoscience and Remote Sensing Symposium, Boston, MA, USA.","DOI":"10.1109\/IGARSS.2008.4778981"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Miao, C., Liu, C.A., and Tian, Q. (2022). DCS-TransUperNet: Road segmentation network based on CSwin transformer with dual resolution. Appl. Sci., 12.","DOI":"10.3390\/app12073511"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"9362","DOI":"10.1109\/TGRS.2019.2926397","article-title":"Multi-scale and multi-task deep learning framework for automatic road extraction","volume":"57","author":"Lu","year":"2019","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Xu, Z., Liu, C.A., Tian, Q., and Wang, Y. (2022). Cloudformer: Supplementary aggregation feature and mask-classification network for cloud detection. Appl. Sci., 12.","DOI":"10.3390\/app12073221"},{"key":"ref_7","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv."},{"key":"ref_8","first-page":"12077","article-title":"SegFormer: Simple and efficient design for semantic segmentation with transformers","volume":"34","author":"Xie","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Miao, C., Liu, C., Tian, Q., and Zhou, Y. (2022). HA-RoadFormer: Hybrid attention transformer with multi-branch for large-scale high-resolution dense road segmentation. Mathematics, 10.","DOI":"10.3390\/math10111915"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Sertel, E., Ekim, B., Osgouei, P.E., and Kabadayi, M.E. (2022). Land Use and Land Cover Mapping Using Deep Learning Based Segmentation Approaches and VHR Worldview-3 Images. Remote Sens., 14.","DOI":"10.3390\/rs14184558"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020, January 23\u201328). End-to-end Object Detection with Transformers. Proceedings of the Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Lu, X., Cao, G., Yang, Y., Jiao, L., and Liu, F. (2021, January 10\u201317). ViT-YOLO: Transformer-Based YOLO for Object Detection. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00314"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Xu, Z., Liu, C.A., Tian, Q., and Zhou, Y. (2022). Cloudformer V2: Set Prior Prediction and Binary Mask Weighted Network for Cloud Detection. Mathematics, 10.","DOI":"10.3390\/math10152710"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Yang, S., Quan, Z., Nie, M., and Yang, W. (2021, January 10\u201317). Transpose: Keypoint Localization via Transformer. 
Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.01159"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"He, Y., Yan, R., Fragkiadaki, K., and Yu, S.-I. (2020, January 14\u201319). Epipolar Transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00780"},{"key":"ref_16","unstructured":"He, K., Girshick, R., and Doll\u00e1r, P. (November, January 27). Rethinking Imagenet Pre-Training. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., and Girdhar, R. (2022, January 18\u201324). Masked-Attention Mask Transformer for Universal Image Segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00135"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 11\u201317). Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TGRS.2022.3230846","article-title":"Swin transformer embedding UNet for remote sensing image semantic segmentation","volume":"60","author":"He","year":"2022","journal-title":"IEEE Trans. Geosci. 
Remote Sens."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully Convolutional Networks for Semantic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Wang, H., Chen, X., Zhang, T., Xu, Z., and Li, J. (2022). CCTNet: Coupled CNN and transformer network for crop segmentation of remote sensing images. Remote Sens., 14.","DOI":"10.3390\/rs14091956"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Wang, R., Tang, D., Duan, N., Wei, Z., Huang, X., Cao, G., Jiang, D., and Zhou, M. (2020). K-adapter: Infusing knowledge into pre-trained models with adapters. arXiv.","DOI":"10.18653\/v1\/2021.findings-acl.121"},{"key":"ref_24","first-page":"1","article-title":"A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images","volume":"19","author":"Wang","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). U-net: Convolutional Networks for Biomedical Image Segmentation. Proceedings of the Medical Image Computing and Computer-Assisted Intervention\u2013MICCAI 2015: 18th International Conference, Munich, Germany.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_26","first-page":"1","article-title":"Transformer and CNN hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery","volume":"60","author":"Zhang","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_27","unstructured":"Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., and Qiao, Y. (2021). Clip-adapter: Better vision-language models with feature adapters. 
arXiv."},{"key":"ref_28","first-page":"1","article-title":"TransRoadNet: A novel road extraction method for remote sensing images via combining high-level semantic feature and context","volume":"19","author":"Yang","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","article-title":"Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs","volume":"40","author":"Chen","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_30","unstructured":"Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018, January 8\u201314). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Pfeiffer, J., Kamath, A., R\u00fcckl\u00e9, A., Cho, K., and Gurevych, I. (2020). AdapterFusion: Non-destructive task composition for transfer learning. arXiv.","DOI":"10.18653\/v1\/2021.eacl-main.39"},{"key":"ref_33","first-page":"1","article-title":"SwinSUNet: Pure transformer network for remote sensing image change detection","volume":"60","author":"Zhang","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Wu, G., Shao, X., Guo, Z., Chen, Q., Yuan, W., Shi, X., Xu, Y., and Shibasaki, R. (2018). Automatic building segmentation of aerial imagery using multi-constraint fully convolutional networks. 
Remote Sens., 10.","DOI":"10.3390\/rs10030407"},{"key":"ref_35","first-page":"1","article-title":"When CNNs meet vision transformer: A joint framework for remote sensing scene classification","volume":"19","author":"Deng","year":"2021","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_36","unstructured":"Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., and Qiao, Y. (2022). Vision transformer adapter for dense predictions. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017, January 21\u201326). Pyramid Scene Parsing Network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.660"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., and Guo, B. (2022, January 18\u201324). Cswin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01181"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"8","DOI":"10.1109\/MGRS.2017.2762307","article-title":"Deep learning in remote sensing: A comprehensive review and list of resources","volume":"5","author":"Zhu","year":"2017","journal-title":"IEEE Geosci. Remote Sens. Mag."},{"key":"ref_40","unstructured":"Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. (2019, January 10\u201315). Parameter-Efficient Transfer Learning for NLP. Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA."},{"key":"ref_41","first-page":"17864","article-title":"Per-pixel classification is not all you need for semantic segmentation","volume":"34","author":"Cheng","year":"2021","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Milletari, F., Navab, N., and Ahmadi, S.-A. (2016, January 25\u201328). V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA.","DOI":"10.1109\/3DV.2016.79"},{"key":"ref_43","unstructured":"Bao, H., Dong, L., Piao, S., and Wei, F. (2021). Beit: Bert pre-training of image transformers. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2015, January 7\u201313). Delving Deep into Rectifiers: Surpassing Human-Level Performance on Imagenet Classification. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.123"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/9\/2363\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:26:48Z","timestamp":1760124408000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/9\/2363"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,29]]},"references-count":44,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2023,5]]}},"alternative-id":["rs15092363"],"URL":"https:\/\/doi.org\/10.3390\/rs15092363","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,4,29]]}}}