{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,20]],"date-time":"2026-03-20T22:54:53Z","timestamp":1774047293669,"version":"3.50.1"},"reference-count":46,"publisher":"MDPI AG","issue":"19","license":[{"start":{"date-parts":[[2024,9,25]],"date-time":"2024-09-25T00:00:00Z","timestamp":1727222400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61902349"],"award-info":[{"award-number":["61902349"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The accurate extraction of buildings from remote sensing images is crucial in fields such as 3D urban planning, disaster detection, and military reconnaissance. In recent years, models based on Transformer have performed well in global information processing and contextual relationship modeling, but suffer from high computational costs and insufficient ability to capture local information. In contrast, convolutional neural networks (CNNs) are very effective in extracting local features, but have a limited ability to process global information. In this paper, an asymmetric network (CTANet), which combines the advantages of CNN and Transformer, is proposed to achieve efficient extraction of buildings. Specifically, CTANet employs ConvNeXt as an encoder to extract features and combines it with an efficient bilateral hybrid attention transformer (BHAFormer) which is designed as a decoder. The BHAFormer establishes global dependencies from both texture edge features and background information perspectives to extract buildings more accurately while maintaining a low computational cost. 
Additionally, the multiscale mixed attention mechanism module (MSM-AMM) is introduced to learn the multiscale semantic information and channel representations of the encoder features to reduce noise interference and compensate for the loss of information in the downsampling process. Experimental results show that the proposed model achieves the best F1-score (86.7%, 95.74%, and 90.52%) and IoU (76.52%, 91.84%, and 82.68%) compared to other state-of-the-art methods on the Massachusetts building dataset, the WHU building dataset, and the Inria aerial image labeling dataset.<\/jats:p>","DOI":"10.3390\/s24196198","type":"journal-article","created":{"date-parts":[[2024,9,25]],"date-time":"2024-09-25T17:12:04Z","timestamp":1727284324000},"page":"6198","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Asymmetric Network Combining CNN and Transformer for Building Extraction from Remote Sensing Images"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-5057-1746","authenticated-orcid":false,"given":"Junhao","family":"Chang","sequence":"first","affiliation":[{"name":"School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7752-2849","authenticated-orcid":false,"given":"Yuefeng","family":"Cen","sequence":"additional","affiliation":[{"name":"School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gang","family":"Cen","sequence":"additional","affiliation":[{"name":"School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2024,9,25]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"2036","DOI":"10.1080\/15481603.2022.2142727","article-title":"Generating annual high resolution land cover products for 28 metropolises in China based on a deep super-resolution mapping network using Landsat imagery","volume":"59","author":"He","year":"2022","journal-title":"GISci. Remote Sens."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"105773","DOI":"10.1016\/j.landusepol.2021.105773","article-title":"Tracking the history of urban expansion in Guangzhou (China) during 1665\u20132017: Evidence from historical maps and remote sensing images","volume":"112","author":"Liu","year":"2022","journal-title":"Land Use Policy"},{"key":"ref_3","first-page":"1","article-title":"UGS-1m: Fine-grained urban green space mapping of 34 major cities in China based on the deep learning framework","volume":"2022","author":"Shi","year":"2022","journal-title":"Earth Syst. Sci. Data Discuss."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1016\/j.isprsjprs.2013.06.011","article-title":"A comprehensive review of earthquake-induced building damage detection with remote sensing techniques","volume":"84","author":"Dong","year":"2013","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Rui, X., Cao, Y., Yuan, X., Kang, Y., and Song, W. (2021). Disastergan: Generative adversarial networks for remote sensing disaster image generation. Remote Sens., 13.","DOI":"10.3390\/rs13214284"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TGRS.2023.3336471","article-title":"MSRF-Net: Multiscale receptive field network for building detection from remote sensing images","volume":"61","author":"Zhao","year":"2023","journal-title":"IEEE Trans. Geosci. 
Remote Sens."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1109\/JSTARS.2011.2168195","article-title":"Morphological building\/shadow index for building extraction from high-resolution imagery over urban areas","volume":"5","author":"Huang","year":"2011","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Luo, L., Li, P., and Yan, X. (2021). Deep learning-based building extraction from remote sensing images: A comprehensive review. Energies, 14.","DOI":"10.3390\/en14237982"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Maggiori, E., Tarabalka, Y., Charpiat, G., and Alliez, P. (2017, January 23\u201328). Can semantic labeling methods generalize to any city? The inria aerial image labeling benchmark. Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA.","DOI":"10.1109\/IGARSS.2017.8127684"},{"key":"ref_10","unstructured":"Mnih, V. (2013). Machine Learning for Aerial Image Labeling. [Ph.D. Thesis, University of Toronto]."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"574","DOI":"10.1109\/TGRS.2018.2858817","article-title":"Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set","volume":"57","author":"Ji","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"39","author":"Ren","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1145\/3065386","article-title":"ImageNet classification with deep convolutional neural networks","volume":"60","author":"Krizhevsky","year":"2017","journal-title":"Commun. 
ACM"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1109\/MCI.2018.2840738","article-title":"Recent trends in deep learning based natural language processing","volume":"13","author":"Young","year":"2018","journal-title":"IEEE Comput. Intell. Mag."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_16","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Xu, Y., Wu, L., Xie, Z., and Chen, Z. (2018). Building extraction in very high resolution remote sensing imagery using deep learning and guided filters. Remote Sens., 10.","DOI":"10.3390\/rs10010144"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Dong, H., Pan, J., Xiang, L., Hu, Z., Zhang, X., Wang, F., and Yang, M.-H. (2020, January 13\u201319). Multi-scale boosted dehazing network with dense feature fusion. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00223"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Yang, H., Wu, P., Yao, X., Wu, Y., Wang, B., and Xu, Y. (2018). Building extraction in very high resolution imagery by dense-attention networks. 
Remote Sens., 10.","DOI":"10.3390\/rs10111768"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"6699","DOI":"10.1109\/TGRS.2018.2841808","article-title":"Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network","volume":"56","author":"Mou","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017, January 21\u201326). Pyramid scene parsing network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.660"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","article-title":"Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs","volume":"40","author":"Chen","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_23","unstructured":"Yu, F., and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020, January 23\u201328). End-to-end object detection with transformers. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Isensee, F., Petersen, J., Klein, A., Zimmerer, D., Jaeger, P.F., Kohl, S., Wasserthal, J., Koehler, G., Norajitra, T., and Wirkert, S. (2018). nnu-net: Self-adapting framework for u-net-based medical image segmentation. arXiv.","DOI":"10.1007\/978-3-658-25326-4_7"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. (2022, January 18\u201324). A convnet for the 2020s. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01167"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_28","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). U-net: Convolutional networks for biomedical image segmentation. Proceedings of the Medical Image Computing and Computer-Assisted Intervention\u2013MICCAI 2015: 18th International Conference, Munich, Germany. Proceedings, Part III 18."},{"key":"ref_29","unstructured":"Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018, January 8\u201314). Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_49"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Kirillov, A., Girshick, R., He, K., and Doll\u00e1r, P. (2019, January 16\u201317). Panoptic feature pyramid networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00656"},{"key":"ref_32","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. (2021, January 11\u201317). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. Proceedings of the International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 11\u201317). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Ren, S., Zhou, D., He, S., Feng, J., and Wang, X. (2022, January 18\u201324). Shunted self-attention via multi-scale token aggregation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01058"},{"key":"ref_36","first-page":"6008405","article-title":"DSAT-net: Dual spatial attention transformer for building extraction from aerial images","volume":"20","author":"Zhang","year":"2023","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Strudel, R., Garcia, R., Laptev, I., and Schmid, C. (2021, January 11\u201317). Segmenter: Transformer for semantic segmentation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00717"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Chen, K., Zou, Z., and Shi, Z. (2021). Building extraction from remote sensing images with sparse token transformers. 
Remote Sens., 13.","DOI":"10.3390\/rs13214441"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., and Wang, M. (2022, January 23\u201327). Swin-unet: Unet-like pure transformer for medical image segmentation. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-25066-8_9"},{"key":"ref_40","first-page":"1","article-title":"A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images","volume":"19","author":"Wang","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_41","unstructured":"Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., and Zhou, Y. (2021). Transunet: Transformers make strong encoders for medical image segmentation. arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"3123","DOI":"10.1109\/JSTARS.2024.3349625","article-title":"TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images","volume":"17","author":"Xiang","year":"2024","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Liu, H., and Hu, Q. (2021, September 27\u2013October 1). Transfuse: Fusing transformers and cnns for medical image segmentation. Proceedings of the Medical Image Computing and Computer Assisted Intervention\u2013MICCAI 2021: 24th International Conference, Strasbourg, France. Proceedings, Part I 24.","DOI":"10.1007\/978-3-030-87193-2_2"},{"key":"ref_44","first-page":"1","article-title":"Building extraction with vision transformer","volume":"60","author":"Wang","year":"2022","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018, January 18\u201323). Mobilenetv2: Inverted residuals and linear bottlenecks. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00474"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"3023","DOI":"10.1109\/JSTARS.2024.3349657","article-title":"SSNet: A novel transformer and CNN hybrid network for remote sensing semantic segmentation","volume":"17","author":"Yao","year":"2024","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/19\/6198\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:02:20Z","timestamp":1760112140000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/19\/6198"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,25]]},"references-count":46,"journal-issue":{"issue":"19","published-online":{"date-parts":[[2024,10]]}},"alternative-id":["s24196198"],"URL":"https:\/\/doi.org\/10.3390\/s24196198","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,25]]}}}