{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,20]],"date-time":"2026-03-20T22:55:11Z","timestamp":1774047311691,"version":"3.50.1"},"reference-count":42,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2023,5,22]],"date-time":"2023-05-22T00:00:00Z","timestamp":1684713600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2018YFB0505300"],"award-info":[{"award-number":["2018YFB0505300"]}]},{"name":"National Key Research and Development Program of China","award":["41701472"],"award-info":[{"award-number":["41701472"]}]},{"name":"National Key Research and Development Program of China","award":["42071316"],"award-info":[{"award-number":["42071316"]}]},{"name":"National Key Research and Development Program of China","award":["41971375"],"award-info":[{"award-number":["41971375"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["2018YFB0505300"],"award-info":[{"award-number":["2018YFB0505300"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["41701472"],"award-info":[{"award-number":["41701472"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["42071316"],"award-info":[{"award-number":["42071316"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of 
China","doi-asserted-by":"publisher","award":["41971375"],"award-info":[{"award-number":["41971375"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Automatically extracting 2D buildings from high-resolution remote sensing images is among the most popular research directions in the area of remote sensing information extraction. Semantic segmentation based on a CNN or transformer has greatly improved building extraction accuracy. A CNN is good at local feature extraction, but its ability to acquire global features is poor, which can lead to incorrect and missed detection of buildings. The advantage of transformer models lies in their global receptive field, but they do not perform well in extracting local features, resulting in poor local detail for building extraction. We propose a CNN-based and transformer-based dual-stream feature extraction network (DSFENet) in this paper, for accurate building extraction. In the encoder, convolution extracts the local features for buildings, and the transformer realizes the global representation of the buildings. The effective combination of local and global features greatly enhances the network\u2019s feature extraction ability. We validated the capability of DSFENet on the Google Image dataset and the ISPRS Vaihingen dataset. 
DSFENet achieved the best accuracy performance compared to other state-of-the-art models.<\/jats:p>","DOI":"10.3390\/rs15102689","type":"journal-article","created":{"date-parts":[[2023,5,23]],"date-time":"2023-05-23T01:36:48Z","timestamp":1684805808000},"page":"2689","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":32,"title":["Dual-Stream Feature Extraction Network Based on CNN and Transformer for Building Extraction"],"prefix":"10.3390","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3190-0178","authenticated-orcid":false,"given":"Liegang","family":"Xia","sequence":"first","affiliation":[{"name":"College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China"}]},{"given":"Shulin","family":"Mi","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China"}]},{"given":"Junxia","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China"}]},{"given":"Jiancheng","family":"Luo","sequence":"additional","affiliation":[{"name":"Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing 100875, China"}]},{"given":"Zhanfeng","family":"Shen","sequence":"additional","affiliation":[{"name":"Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing 100875, China"}]},{"given":"Yubin","family":"Cheng","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,5,22]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1080\/15481603.2017.1361509","article-title":"Segmentation of airborne point cloud data for automatic building roof 
extraction","volume":"55","author":"Gilani","year":"2018","journal-title":"GISci. Remote Sens."},{"key":"ref_2","first-page":"102591","article-title":"DSA-Net: A novel deeply supervised attention-guided network for building change detection in high-resolution remote sensing images","volume":"105","author":"Ding","year":"2021","journal-title":"Int. J. Appl. Earth Obs. Geoinf."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Yang, G., Zhang, Q., and Zhang, G. (2020). EANet: Edge-aware network for the extraction of buildings from aerial images. Remote Sens., 12.","DOI":"10.3390\/rs12132161"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1125","DOI":"10.1080\/15481603.2020.1847453","article-title":"Multi-scale three-dimensional detection of urban buildings using aerial LiDAR data","volume":"57","author":"Cao","year":"2020","journal-title":"GISci. Remote Sens."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"7313","DOI":"10.1109\/ACCESS.2020.2964043","article-title":"Automatic building extraction from high-resolution aerial imagery via fully convolutional encoder-decoder network with non-local block","volume":"8","author":"Wang","year":"2020","journal-title":"IEEE Access"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"749","DOI":"10.1080\/15481603.2018.1564499","article-title":"Semantic segmentation of high spatial resolution images with deep neural networks","volume":"56","author":"Yang","year":"2019","journal-title":"GISci. Remote Sens."},{"key":"ref_7","first-page":"102768","article-title":"Multi-scale attention integrated hierarchical networks for high-resolution building footprint extraction","volume":"109","author":"Liu","year":"2022","journal-title":"Int. J. Appl. Earth Obs. 
Geoinf."},{"key":"ref_8","first-page":"102680","article-title":"Deep Roof Refiner: A detail-oriented deep learning network for refined delineation of roof structure lines using satellite imagery","volume":"107","author":"Qian","year":"2022","journal-title":"Int. J. Appl. Earth Obs. Geoinf."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Huang, H., Sun, G., Rong, J., Zhang, A., and Ma, P. (2018, January 18\u201320). Multi-feature combined for building shadow detection in GF-2 Images. Proceedings of the 2018 Fifth International Workshop on Earth Observation and Remote Sensing Applications (EORSA), Xi\u2019an, China.","DOI":"10.1109\/EORSA.2018.8598603"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"574","DOI":"10.1109\/TGRS.2018.2858817","article-title":"Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set","volume":"57","author":"Ji","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Peng, Z., Huang, W., Gu, S., Xie, L., Wang, Y., Jiao, J., and Ye, Q. (2021, January 11\u201317). Conformer: Local features coupling global representations for visual recognition. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00042"},{"key":"ref_12","first-page":"5999","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_13","unstructured":"Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., and Qiao, Y. (2022). UniFormer: Unifying Convolution and Self-attention for Visual Recognition. 
arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"861","DOI":"10.1080\/15481603.2022.2076382","article-title":"Urban building extraction from high-resolution remote sensing imagery based on multi-scale recurrent conditional generative adversarial network","volume":"59","author":"Wang","year":"2022","journal-title":"GISci. Remote Sens."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Yi, Y., Zhang, Z., Zhang, W., Zhang, C., Li, W., and Zhao, T. (2019). Semantic segmentation of urban buildings from VHR remote sensing imagery using a deep convolutional neural network. Remote Sens., 11.","DOI":"10.3390\/rs11151774"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Liu, P., Liu, X., Liu, M., Shi, Q., Yang, J., Xu, X., and Zhang, Y. (2019). Building footprint extraction from high-resolution images via spatial residual inception convolutional neural network. Remote Sens., 11.","DOI":"10.3390\/rs11070830"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"2611","DOI":"10.1109\/JSTARS.2021.3058097","article-title":"Attention-Gate-Based Encoder\u2013Decoder Network for Automatical Building Extraction","volume":"14","author":"Deng","year":"2021","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"6608","DOI":"10.1109\/JSTARS.2021.3076085","article-title":"Fine building segmentation in high-resolution SAR images via selective pyramid dilated network","volume":"14","author":"Jing","year":"2021","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"6169","DOI":"10.1109\/TGRS.2020.3026051","article-title":"MAP-Net: Multiple attending path neural network for building footprint extraction from remote sensed imagery","volume":"59","author":"Zhu","year":"2020","journal-title":"IEEE Trans. Geosci. 
Remote Sens."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., and Torr, P.H. (2021, January 20\u201325). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"ref_21","first-page":"12077","article-title":"SegFormer: Simple and efficient design for semantic segmentation with transformers","volume":"34","author":"Xie","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_22","unstructured":"Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., and Wang, M. (2021). Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Xu, Z., Zhang, W., Zhang, T., Yang, Z., and Li, J. (2021). Efficient transformer for remote sensing image segmentation. Remote Sens., 13.","DOI":"10.3390\/rs13183585"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 11\u201317). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Chen, K., Zou, Z., and Shi, Z. (2021). Building Extraction from Remote Sensing Images with Sparse Token Transformers. Remote Sens., 13.","DOI":"10.3390\/rs13214441"},{"key":"ref_26","unstructured":"Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., and Patel, V.M. (October, January 27). Medical transformer: Gated axial-attention for medical image segmentation. 
Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France."},{"key":"ref_27","unstructured":"Gao, Y., Zhou, M., and Metaxas, D.N. (October, January 27). UTNet: A hybrid transformer architecture for medical image segmentation. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France."},{"key":"ref_28","unstructured":"Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., and Zhou, Y. (2021). Transunet: Transformers make strong encoders for medical image segmentation. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"196","DOI":"10.1016\/j.isprsjprs.2022.06.008","article-title":"UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery","volume":"190","author":"Wang","year":"2022","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TGRS.2022.3230846","article-title":"Swin Transformer Embedding UNet for Remote Sensing Image Semantic Segmentation","volume":"60","author":"He","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_31","first-page":"1","article-title":"Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery","volume":"60","author":"Zhang","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_32","unstructured":"Wang, L., Fang, S., Zhang, C., Li, R., and Duan, C. (2021). Efficient Hybrid Transformer: Learning Global-local Context for Urban Scene Segmentation. 
arXiv."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"10990","DOI":"10.1109\/JSTARS.2021.3119654","article-title":"STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation","volume":"14","author":"Gao","year":"2021","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Shang, R., Zhang, J., Jiao, L., Li, Y., Marturi, N., and Stolkin, R. (2020). Multi-scale adaptive feature fusion network for semantic segmentation in remote sensing images. Remote Sens., 12.","DOI":"10.3390\/rs12050872"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Huang, H., Lin, L., Tong, R., Hu, H., Zhang, Q., Iwamoto, Y., Han, X., Chen, Y.-W., and Wu, J. (2020, January 4\u20138). Unet 3+: A full-scale connected unet for medical image segmentation. Proceedings of the ICASSP 2020\u20132020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053405"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Tan, M., Pang, R., and Le, Q.V. (2020, January 13\u201319). Efficientdet: Scalable and efficient object detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01079"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Xia, L., Zhang, J., Zhang, X., Yang, H., and Xu, M. (2021). Precise Extraction of Buildings from High-Resolution Remote-Sensing Images Based on Semantic Edges and Segmentation. 
Remote Sens., 13.","DOI":"10.3390\/rs13163083"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). U-net: Convolutional networks for biomedical image segmentation. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Zhou, L., Zhang, C., and Wu, M. (2018, January 18\u201322). D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPRW.2018.00034"},{"key":"ref_41","unstructured":"Zhao, J.-X., Liu, J.-J., Fan, D.-P., Cao, Y., Yang, J., and Cheng, M.-M. (November, January 27). EGNet: Edge guidance network for salient object detection. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_42","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. 
arXiv."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/10\/2689\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:39:57Z","timestamp":1760125197000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/10\/2689"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,22]]},"references-count":42,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2023,5]]}},"alternative-id":["rs15102689"],"URL":"https:\/\/doi.org\/10.3390\/rs15102689","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,22]]}}}