{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T08:37:20Z","timestamp":1776328640731,"version":"3.50.1"},"reference-count":43,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2022,3,7]],"date-time":"2022-03-07T00:00:00Z","timestamp":1646611200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Chinese National Key Research and Development Project","award":["2020YFD1100200"],"award-info":[{"award-number":["2020YFD1100200"]}]},{"name":"the Science and Technology Major Project of Hubei Province","award":["2021AAA010"],"award-info":[{"award-number":["2021AAA010"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Taking depth into consideration has been proven to improve the performance of semantic segmentation through providing additional geometry information. Most existing works adopt a two-stream network, extracting features from color images and depth images separately using two branches of the same structure, which suffer from high memory and computation costs. We find that depth features acquired by simple downsampling can also play a complementary part in the semantic segmentation task, sometimes even better than the two-stream scheme with the same two branches. In this paper, a novel and efficient depth fusion transformer network for aerial image segmentation is proposed. The presented network utilizes patch merging to downsample depth input and a depth-aware self-attention (DSA) module is designed to mitigate the gap caused by difference between two branches and two modalities. Concretely, the DSA fuses depth features and color features by computing depth similarity and impact on self-attention map calculated by color feature. 
Extensive experiments on the ISPRS 2D semantic segmentation dataset validate the efficiency and effectiveness of our method. With nearly half the parameters of traditional two-stream scheme, our method acquires 83.82% mIoU on Vaihingen dataset outperforming other state-of-the-art methods and 87.43% mIoU on Potsdam dataset comparable to the state-of-the-art.<\/jats:p>","DOI":"10.3390\/rs14051294","type":"journal-article","created":{"date-parts":[[2022,3,9]],"date-time":"2022-03-09T01:50:53Z","timestamp":1646790653000},"page":"1294","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":28,"title":["Efficient Depth Fusion Transformer for Aerial Image Semantic Segmentation"],"prefix":"10.3390","volume":"14","author":[{"given":"Li","family":"Yan","sequence":"first","affiliation":[{"name":"School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China"},{"name":"School of Computer Science, Wuhan University, Wuhan 430072, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0610-1406","authenticated-orcid":false,"given":"Jianming","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China"}]},{"given":"Hong","family":"Xie","sequence":"additional","affiliation":[{"name":"School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4265-8698","authenticated-orcid":false,"given":"Pengcheng","family":"Wei","sequence":"additional","affiliation":[{"name":"School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China"}]},{"given":"Zhao","family":"Gao","sequence":"additional","affiliation":[{"name":"School of Computer Science, Wuhan University, Wuhan 430072, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,3,7]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"640","DOI":"10.1109\/TPAMI.2016.2572683","article-title":"Fully Convolutional 
Networks for Semantic Segmentation","volume":"39","author":"Shelhamer","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_2","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (July, January 26). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA."},{"key":"ref_3","unstructured":"Yu, F., and Koltun, V. (2016, January 2\u20134). Multi-Scale Context Aggregation by Dilated Convolutions. Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico."},{"key":"ref_4","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021, January 3\u20137). An Image is Worth 16 \u00d7 16 Words: Transformers for Image Recognition at Scale. Proceedings of the International Conference on Learning Representations (ICLR), Online."},{"key":"ref_5","unstructured":"Kampffmeyer, M., Salberg, A.B., and Jenssen, R. (July, January 26). Semantic Segmentation of Small Objects and Modeling of Uncertainty in Urban Remote Sensing Images Using Deep Convolutional Neural Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Las Vegas, NV, USA."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Audebert, N., Le Saux, B., and Lef\u00e8vre, S. (2016, January 20\u201324). Semantic Segmentation of Earth Observation Data Using Multimodal and Multi-scale Deep Networks. Proceedings of the Asian Conference on Computer Vision (ACCV), Taipei, Taiwan.","DOI":"10.1007\/978-3-319-54181-5_12"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhang, W., Huang, H., Schmitz, M., Sun, X., Wang, H., and Mayer, H. (2018). Effective Fusion of Multi-Modal Remote Sensing Data in a Fully Convolutional Network for Semantic Labeling. 
Remote Sens., 10.","DOI":"10.3390\/rs10010052"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1016\/j.isprsjprs.2017.11.011","article-title":"Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks","volume":"140","author":"Audebert","year":"2018","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Hazirbas, C., Ma, L., Domokos, C., and Cremers, D. (2016, January 20\u201324). FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. Proceedings of the Asian Conference on Computer Vision (ACCV), Taipei, Taiwan.","DOI":"10.1007\/978-3-319-54181-5_14"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Hu, X., Yang, K., Fei, L., and Wang, K. (2019, January 22\u201325). ACNET: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. Proceedings of the IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan.","DOI":"10.1109\/ICIP.2019.8803025"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Chen, X., Lin, K.Y., Wang, J., Wu, W., Qian, C., Li, H., and Zeng, G. (2020, January 23\u201328). Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK.","DOI":"10.1007\/978-3-030-58621-8_33"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"3492","DOI":"10.1109\/JSTARS.2019.2930724","article-title":"High-Resolution Aerial Images Semantic Segmentation Using Deep Fully Convolutional Network With Channel Attention Mechanism","volume":"12","author":"Luo","year":"2019","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Cheng, Y., Cai, R., Li, Z., Zhao, X., and Huang, K. (2017, January 21\u201326). 
Locality-Sensitive Deconvolution Networks with Gated Fusion for RGB-D Indoor Semantic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.161"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"22475","DOI":"10.1007\/s11042-018-6056-8","article-title":"RGB-D joint modelling with scene geometric information for indoor semantic segmentation","volume":"77","author":"Liu","year":"2018","journal-title":"Multimed. Tools Appl."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Wang, W., and Ulrich, N. (2018, January 8\u201314). Depth-Aware CNN for RGB-D Segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01252-6_9"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Xing, Y., Wang, J., Chen, X., and Zeng, G. (2019, January 22\u201325). 2.5D Convolution for RGB-D Semantic Segmentation. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan.","DOI":"10.1109\/ICIP.2019.8803757"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Xing, Y., Wang, J., and Zeng, G. (2020, January 23\u201328). Malleable 2.5D Convolution: Learning Receptive Fields along the Depth-axis for RGB-D Scene Parsing. Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK.","DOI":"10.1007\/978-3-030-58529-7_33"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Chen, R., Zhang, F.L., and Rhee, T. (2020, January 25\u201327). Edge-Aware Convolution for RGB-D Image Segmentation. Proceedings of the 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), Wellington, New Zealand.","DOI":"10.1109\/IVCNZ51579.2020.9290608"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 11\u201317). 
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Online.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"297","DOI":"10.1016\/j.neucom.2018.11.051","article-title":"Problems of encoder-decoder frameworks for high-resolution remote sensing image segmentation: Structural stereotype and insufficient learning","volume":"330","author":"Sun","year":"2019","journal-title":"Neurocomputing"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"2011","DOI":"10.1109\/TPAMI.2019.2913372","article-title":"Squeeze-and-Excitation Networks","volume":"42","author":"Hu","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Mou, L., Hua, Y., and Zhu, X.X. (2019, January 15\u201320). A Relation-Augmented Fully Convolutional Network for Semantic Segmentation in Aerial Scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01270"},{"key":"ref_23","first-page":"1","article-title":"Hybrid Multiple Attention Network for Semantic Segmentation in Aerial Images","volume":"60","author":"Niu","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_24","first-page":"1","article-title":"Multistage Attention ResU-Net for Semantic Segmentation of Fine-Resolution Remote Sensing Images","volume":"19","author":"Li","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wang, L., Li, R., Duan, C., Zhang, C., Meng, X., and Fang, S. (2022). A Novel Transformer based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. 
Lett., 19.","DOI":"10.1109\/LGRS.2022.3143368"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Xu, Z., Zhang, W., Zhang, T., Yang, Z., and Li, J. (2021). Efficient Transformer for Remote Sensing Image Segmentation. Remote Sens., 13.","DOI":"10.3390\/rs13183585"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"4499","DOI":"10.1007\/s11042-019-7684-3","article-title":"A survey on indoor RGB-D semantic segmentation: From hand-crafted features to deep convolutional neural networks","volume":"79","author":"Fooladgar","year":"2020","journal-title":"Multimed. Tools Appl."},{"key":"ref_28","unstructured":"Chen, K., Fu, K., Gao, X., Yan, M., Zhang, W., Zhang, Y., and Sun, X. (August, January 28). Effective Fusion of Multi-Modal Data with Group Convolutions for Semantic Segmentation of Aerial Imagery. Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Yokohama, Japan."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"881","DOI":"10.1109\/TGRS.2016.2616585","article-title":"Dense Semantic Labeling of Subdecimeter Resolution Images With Convolutional Neural Networks","volume":"55","author":"Volpi","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1016\/j.isprsjprs.2018.01.021","article-title":"Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models","volume":"145","author":"Marcos","year":"2018","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"7092","DOI":"10.1109\/TGRS.2017.2740362","article-title":"High-Resolution Aerial Image Labeling With Convolutional Neural Networks","volume":"55","author":"Maggiori","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_32","unstructured":"Xie, E., Wang, W., Yu, Z., Anandkumar, A., Lvarez, J.E.M.A., and Luo, P. (2021, January 6\u201314). 
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Proceedings of the Advances in Neural Information Processing Systems (NIPS), Online."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"2481","DOI":"10.1109\/TPAMI.2016.2644615","article-title":"SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation","volume":"39","author":"Badrinarayanan","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). U-Net: Convolutional Networks for Biomedical Image Segmentation. Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Zhao, H., Jia, J., and Koltun, V. (2020, January 13\u201319). Exploring Self-Attention for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01009"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., and Hu, Q. (2020, January 13\u201319). ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01155"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"158","DOI":"10.1016\/j.isprsjprs.2017.11.009","article-title":"Classification with an edge: Improving semantic image segmentation with boundary detection","volume":"135","author":"Marmanis","year":"2018","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_38","unstructured":"Gerke, M. (2015). 
Use of the Stair Vision Library within the ISPRS 2D Semantic Labeling Benchmark (Vaihingen), ResearchGate."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.isprsjprs.2019.07.007","article-title":"TreeUNet: Adaptive Tree convolutional neural networks for subdecimeter aerial image segmentation","volume":"156","author":"Yue","year":"2019","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"78","DOI":"10.1016\/j.isprsjprs.2017.12.007","article-title":"Semantic labeling in very high resolution images via a self-cascaded convolutional neural network","volume":"145","author":"Liu","year":"2018","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_41","first-page":"1","article-title":"Geometry-Aware Segmentation of Remote Sensing Images via Joint Height Estimation","volume":"19","author":"Li","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Xiao, T., Liu, Y., Zhou, B., Jiang, Y., and Sun, J. (2018, January 8\u201314). Unified Perceptual Parsing for Scene Understanding. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01228-1_26"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.Y., and Kweon, I.S. (2018, January 8\u201314). CBAM: Convolutional Block Attention Module. 
Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/14\/5\/1294\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:33:15Z","timestamp":1760135595000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/14\/5\/1294"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,7]]},"references-count":43,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2022,3]]}},"alternative-id":["rs14051294"],"URL":"https:\/\/doi.org\/10.3390\/rs14051294","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,3,7]]}}}