{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T01:17:38Z","timestamp":1775870258163,"version":"3.50.1"},"reference-count":46,"publisher":"MDPI AG","issue":"18","license":[{"start":{"date-parts":[[2021,9,9]],"date-time":"2021-09-09T00:00:00Z","timestamp":1631145600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB","award":["BK20BE014"],"award-info":[{"award-number":["BK20BE014"]}]},{"name":"Fundamental Research Funds for the China Central Universities of USTB","award":["FRF-DF-19-002"],"award-info":[{"award-number":["FRF-DF-19-002"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Semantic segmentation for remote sensing images (RSIs) is widely applied in geological surveys, urban resource management, and disaster monitoring. Remote sensing segmentation tasks are generally addressed by CNN-based and transformer-based models. In particular, transformer-based architectures generally struggle with two main problems: a high computational load and inaccurate edge classification. To overcome these problems, we propose a novel transformer model that achieves lightweight, accurate edge classification. First, based on a Swin transformer backbone, a pure Efficient transformer with mlphead is proposed to accelerate inference. Moreover, explicit and implicit edge enhancement methods are proposed to address object edge problems. 
Experimental results on the Potsdam and Vaihingen datasets show that the proposed approach significantly improves accuracy while achieving a favorable trade-off between computational complexity (FLOPs) and accuracy (Efficient-L obtains a 3.23% mIoU improvement on Vaihingen and a 2.46% mIoU improvement on Potsdam compared with HRCNet_W48). We therefore believe that the proposed Efficient transformer offers an advantage for remote sensing image segmentation problems.<\/jats:p>","DOI":"10.3390\/rs13183585","type":"journal-article","created":{"date-parts":[[2021,9,9]],"date-time":"2021-09-09T21:36:58Z","timestamp":1631223418000},"page":"3585","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":206,"title":["Efficient Transformer for Remote Sensing Image Segmentation"],"prefix":"10.3390","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3210-8839","authenticated-orcid":false,"given":"Zhiyong","family":"Xu","sequence":"first","affiliation":[{"name":"School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China"}]},{"given":"Weicun","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0996-2586","authenticated-orcid":false,"given":"Tianxiang","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5895-6708","authenticated-orcid":false,"given":"Zhifang","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2288-7901","authenticated-orcid":false,"given":"Jiangyun","family":"Li","sequence":"additional","affiliation":[{"name":"School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China"},{"name":"Shunde Graduate School, University of Science and Technology Beijing, Foshan 528000, China"}]}],"member":"1968","published-online":{"date-parts":[[2021,9,9]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"174","DOI":"10.1016\/j.isprsjprs.2020.10.010","article-title":"Understanding the synergies of deep learning and data fusion of multispectral and panchromatic high resolution commercial satellite imagery for automated ice-wedge polygon detection","volume":"170","author":"Witharana","year":"2020","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"105909","DOI":"10.1016\/j.compag.2020.105909","article-title":"State and parameter estimation of the AquaCrop model for winter wheat using sensitivity informed particle filter","volume":"180","author":"Zhang","year":"2021","journal-title":"Comput. Electron. Agric."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Zhang, J., Lin, S., Ding, L., and Bruzzone, L. (2020). Multi-Scale Context Aggregation for Semantic Segmentation of Remote Sensing Images. Remote Sens., 12.","DOI":"10.3390\/rs12040701"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., and Shah, M. (2021). Transformers in vision: A survey. arXiv.","DOI":"10.1145\/3505244"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Xu, Z., Zhang, W., Zhang, T., and Li, J. (2021). HRCNet: High-resolution context extraction network for semantic segmentation of remote sensing images. 
Remote Sens., 13.","DOI":"10.3390\/rs13122290"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_7","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. arXiv."},{"key":"ref_8","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., and Torr, P.H. (2021, January 19\u201325). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. arXiv.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_11","unstructured":"Zhang, Q., and Yang, Y. (2021). ResT: An Efficient Transformer for Visual Recognition. arXiv."},{"key":"ref_12","unstructured":"Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Keysers, D., Uszkoreit, J., and Lucic, M. (2021). Mlp-mixer: An all-mlp architecture for vision. arXiv."},{"key":"ref_13","unstructured":"Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., and Shen, C. (2021). 
Twins: Revisiting the design of spatial attention in vision transformers. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). U-net: Convolutional Networks for Biomedical Image Segmentation. Proceedings of the International Conference on Medical Image Computing And Computer-Assisted Intervention, Munich, Germany.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Jin, Y., Xu, W., Zhang, C., Luo, X., and Jia, H. (2021). Boundary-aware refined network for automatic building extraction in very high-resolution urban aerial images. Remote Sens., 13.","DOI":"10.3390\/rs13040692"},{"key":"ref_16","unstructured":"Chen, L.C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017, January 21\u201326). Pyramid scene parsing network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.660"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Doll\u00e1r, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., and Lu, H. (2019, January 16\u201320). Dual attention network for scene segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00326"},{"key":"ref_20","unstructured":"Cao, Y., Xu, J., Lin, S., Wei, F., and Hu, H. (November, January 27). 
Gcnet: Non-local networks meet squeeze-excitation networks and beyond. Proceedings of the IEEE International Conference on Computer Vision Workshops, Seoul, Korea."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201322). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., and Sang, N. (2018, January 8\u201314). Bisenet: Bilateral segmentation network for real-time semantic segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01261-8_20"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Yu, C., Gao, C., Wang, J., Yu, G., Shen, C., and Sang, N. (2020). BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation. arXiv.","DOI":"10.1007\/s11263-021-01515-2"},{"key":"ref_24","unstructured":"Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.C. (2018, January 18\u201322). Mobilenetv2: Inverted residuals and linear bottlenecks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00474"},{"key":"ref_26","unstructured":"Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., and Zhou, Y. (2021). Transunet: Transformers make strong encoders for medical image segmentation. arXiv."},{"key":"ref_27","unstructured":"Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., and Luo, P. (2021). 
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Yang, G., Zhang, Q., and Zhang, G. (2020). EANet: Edge-Aware Network for the Extraction of Buildings from Aerial Images. Remote Sens., 12.","DOI":"10.3390\/rs12132161"},{"key":"ref_29","unstructured":"Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W., and Wang, J. (2019). High-resolution representations for labeling pixels and regions. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. (2018, January 8\u201314). Unified perceptual parsing for scene understanding. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01228-1_26"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"776","DOI":"10.1109\/LGRS.2018.2881045","article-title":"Low\u2013high-power consumption architectures for deep-learning models applied to hyperspectral image classification","volume":"16","author":"Haut","year":"2018","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Zhang, C., Jiang, W., and Zhao, Q. (2021). Semantic Segmentation of Aerial Imagery via Split-Attention Networks with Disentangled Nonlocal and Edge Supervision. Remote Sens., 13.","DOI":"10.3390\/rs13061176"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Zhang, T., Su, J., Xu, Z., Luo, Y., and Li, J. (2021). Sentinel-2 satellite imagery for urban land cover classification by optimized random forest classifier. Appl. Sci., 11.","DOI":"10.3390\/app11020543"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Yuan, W., Zhang, W., Lai, Z., and Zhang, J. (2020). Extraction of Yardang characteristics using object-based image analysis and canny edge detection methods. 
Remote Sens., 12.","DOI":"10.3390\/rs12040726"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_36","unstructured":"Sugano, H., and Miyamoto, R. (2008, January 12\u201314). Parallel implementation of morphological processing on cell\/BE with OpenCV interface. Proceedings of the 2008 3rd International Symposium on Communications, Control and Signal Processing, St. Julians, Malta."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"He, J., Deng, Z., Zhou, L., Wang, Y., and Qiao, Y. (2019, January 15\u201320). Adaptive pyramid context network for semantic segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00770"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018, January 8\u201314). Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Yin, M., Yao, Z., Cao, Y., Li, X., Zhang, Z., Lin, S., and Hu, H. (2020). Disentangled non-local neural networks. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-030-58555-6_12"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Zhao, H., Zhang, Y., Liu, S., Shi, J., Loy, C.C., Lin, D., and Jia, J. (2018, January 8\u201314). Psanet: Point-wise spatial attention network for scene parsing. 
Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01240-3_17"},{"key":"ref_41","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., and Torralba, A. (2017, January 21\u201326). Scene parsing through ade20k dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.544"},{"key":"ref_43","unstructured":"Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1214\/aoms\/1177729586","article-title":"A stochastic approximation method","volume":"22","author":"Robbins","year":"1951","journal-title":"Ann. Math. Stat."},{"key":"ref_45","unstructured":"Loshchilov, I., and Hutter, F. (2017). Decoupled weight decay regularization. arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Wang, J., Shen, L., Qiao, W., Dai, Y., and Li, Z. (2019). Deep feature fusion with integration of residual connection and attention model for classification of VHR remote sensing images. 
Remote Sens., 11.","DOI":"10.3390\/rs11131617"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/18\/3585\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:59:17Z","timestamp":1760165957000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/18\/3585"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,9]]},"references-count":46,"journal-issue":{"issue":"18","published-online":{"date-parts":[[2021,9]]}},"alternative-id":["rs13183585"],"URL":"https:\/\/doi.org\/10.3390\/rs13183585","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,9,9]]}}}