{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T11:28:39Z","timestamp":1775042919392,"version":"3.50.1"},"reference-count":53,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2023,4,10]],"date-time":"2023-04-10T00:00:00Z","timestamp":1681084800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Sichuan Urban Informatization Surveying and Mapping Engineering Technology Research Center","award":["CDKC-2022001"],"award-info":[{"award-number":["CDKC-2022001"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Extracting building data from remote sensing images is an efficient way to obtain geographic information data, especially following the emergence of deep learning technology, which results in the automatic extraction of building data from remote sensing images becoming increasingly accurate. A CNN (convolution neural network) is a successful structure after a fully connected network. It has the characteristics of saving computation and translation invariance with improved local features, but it has difficulty obtaining global features. Transformers can compensate for the shortcomings of CNNs and more effectively obtain global features. However, the calculation number of transformers is excessive. To solve this problem, a Lite Swin transformer is proposed. The three matrices Q, K, and V of the transformer are simplified to only a V matrix, and the v of the pixel is then replaced by the v with the largest projection value on the pixel feature vector. In order to better integrate global features and local features, we propose the LiteST-Net model, in which the features extracted by the Lite Swin transformer and the CNN are added together and then sampled up step by step to fully utilize the global feature acquisition ability of the transformer and the local feature acquisition ability of the CNN. The comparison experiments on two open datasets are carried out using our proposed LiteST-Net and some classical image segmentation models. The results show that compared with other networks, all metrics of LiteST-Net are the best, and the predicted image is closer to the label.<\/jats:p>","DOI":"10.3390\/rs15081996","type":"journal-article","created":{"date-parts":[[2023,4,10]],"date-time":"2023-04-10T05:59:33Z","timestamp":1681106373000},"page":"1996","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":22,"title":["LiteST-Net: A Hybrid Model of Lite Swin Transformer and Convolution for Building Extraction from Remote Sensing Image"],"prefix":"10.3390","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3158-3486","authenticated-orcid":false,"given":"Wei","family":"Yuan","sequence":"first","affiliation":[{"name":"College of Computer Science, Chengdu University, Chengdu 610106, China"},{"name":"Sichuan Urban Informatization Surveying and Mapping Engineering Technology Research Center, Chengdu 610084, China"}]},{"given":"Xiaobo","family":"Zhang","sequence":"additional","affiliation":[{"name":"Sichuan Urban Informatization Surveying and Mapping Engineering Technology Research Center, Chengdu 610084, China"},{"name":"Chengdu Institute of Survey & Investigation, Chengdu 610084, China"}]},{"given":"Jibao","family":"Shi","sequence":"additional","affiliation":[{"name":"Sichuan Urban Informatization Surveying and Mapping Engineering Technology Research Center, Chengdu 610084, China"},{"name":"Chengdu Institute of Survey & Investigation, Chengdu 610084, China"}]},{"given":"Jin","family":"Wang","sequence":"additional","affiliation":[{"name":"College of Computer Science, Chengdu University, Chengdu 610106, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,4,10]]},"reference":[{"key":"ref_1","first-page":"58","article-title":"Building extraction from high-resolution optical spaceborne images using the integration of support vector machine (SVM) classification, Hough transformation and perceptual grouping","volume":"34","author":"Turker","year":"2015","journal-title":"Int. J. Appl. Earth Obs. Geoinf."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"130","DOI":"10.1016\/j.eswa.2016.03.024","article-title":"Building detection from orthophotos using a machine learning approach: An empirical study on image segmentation and descriptors","volume":"58","author":"Dornaika","year":"2016","journal-title":"Expert Syst. Appl."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1016\/j.isprsjprs.2013.09.004","article-title":"Automated detection of buildings from single VHR multispectral images using shadow information and graph cuts","volume":"86","author":"Ok","year":"2013","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_4","first-page":"143","article-title":"Improved building detection using texture information","volume":"38","author":"Awrangjeb","year":"2011","journal-title":"Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"721","DOI":"10.14358\/PERS.77.7.721","article-title":"A multidirectional and multiscale morphological index for automatic building extraction from multispectral GeoEye-1 imagery","volume":"77","author":"Huang","year":"2011","journal-title":"Photogramm. Eng. Remote Sens."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1109\/JSTARS.2011.2168195","article-title":"Morphological building\/shadow index for building extraction from high-resolution imagery over urban areas","volume":"5","author":"Huang","year":"2011","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_7","first-page":"883","article-title":"Extracting manmade objects from high spatial resolution remote sensing images via fast level set evolutions","volume":"53","author":"Li","year":"2014","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"3265","DOI":"10.1109\/JSTARS.2017.2669217","article-title":"Urban building density estimation from high-resolution imagery using multiple features and support vector regression","volume":"10","author":"Zhang","year":"2017","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"Lecun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_10","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very Deep Convolutional Networks for Large-Scale Image Recognition. Proceedings of the International Conference on Learning Representations, San Diego, CA, USA. Available online: https:\/\/arxiv.org\/abs\/1409.1556."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). Convolutional networks for biomedical image segmentation. Proceedings of the 2015 Medical Image Computing and Computer Assisted Intervention, Piscataway, NJ, USA.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"2481","DOI":"10.1109\/TPAMI.2016.2644615","article-title":"Segnet: A deep convolutional encoder-decoder architecture for image segmentation","volume":"39","author":"Badrinarayanan","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017, January 21\u201326). Pyramid scene parsing network. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.660"},{"key":"ref_15","unstructured":"Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A.L. (2014). Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"834","DOI":"10.1109\/TPAMI.2017.2699184","article-title":"DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs","volume":"40","author":"Chen","year":"2018","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_17","unstructured":"Chen, L.C., Papandreou, G., Schroff, F., and Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Hou, Q., Zhang, L., Cheng, M.M., and Feng, J. (2020, January 13\u201319). Strip Pooling: Rethinking Spatial Pooling for Scene Parsing. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00406"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"3051","DOI":"10.1007\/s11263-021-01515-2","article-title":"BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation","volume":"129","author":"Yu","year":"2021","journal-title":"Int. J. Comput. Vis."},{"key":"ref_20","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words:Transformers for image recognition at scale. arXiv."},{"key":"ref_21","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). Attention is All You Need. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., and Torr, P.H.S. (2020). Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. arXiv.","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Liu, P., Liu, X., Liu, M., Shi, Q., Yang, J., Xu, X., and Zhang, Y. (2019). Building footprint extraction from high-resolution images via spatial residual inception convolutional neural network. Remote Sens., 11.","DOI":"10.3390\/rs11070830"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Yi, Y.N., Zhang, Z.J., Zhang, W.C., Zhang, C.R., Li, W.D., and Zhao, T. (2019). Semantic segmentation of urban buildings from vhr remote sensing imagery using a deep convolutional neural network. Remote Sens., 11.","DOI":"10.3390\/rs11151774"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1016\/j.isprsjprs.2020.01.013","article-title":"Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data","volume":"162","author":"Diakogiannis","year":"2020","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Ye, Z., Fu, Y., Gan, M., Deng, J., Comber, A., and Wang, K. (2019). Building extraction from very high resolution aerial imagery using joint attention deep neural network. Remote Sens., 11.","DOI":"10.3390\/rs11242970"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"3252","DOI":"10.1109\/JSTARS.2018.2860989","article-title":"Semantic segmentation for high spatial resolution remote sensing images based on convolution neural network and pyramid pooling module","volume":"11","author":"Yu","year":"2018","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"154997","DOI":"10.1109\/ACCESS.2020.3015701","article-title":"Arc-net: An efficient network for building extraction from high-resolution aerial images","volume":"8","author":"Liu","year":"2020","journal-title":"IEEE Access"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Pan, X., Yang, F., Gao, L., Chen, Z., Zhang, B., Fan, H., and Ren, J. (2019). Building extraction from high-resolution aerial imagery using a generative adversarial network with spatial and channel attention mechanisms. Remote Sens., 11.","DOI":"10.3390\/rs11080917"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Protopapadakis, E., Doulamis, A., Doulamis, N., and Maltezos, E. (2021). Stacked autoencoders driven by semi-supervised learning for building extraction from near infrared remote sensing imagery. Remote Sens., 13.","DOI":"10.3390\/rs13030371"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Cheng, D., Liao, R., Fidler, S., and Urtasun, R. (2019, January 15\u201320). Darnet: Deep active ray network for building segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00761"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Chen, J., Zhang, D., Wu, Y., Chen, Y., and Yan, X. (2022). A Context Feature Enhancement Network for Building Extraction from High-Resolution Remote Sensing Imagery. Remote Sens., 14.","DOI":"10.3390\/rs14092276"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"5171","DOI":"10.1109\/TGRS.2020.3010055","article-title":"Domain Adaptive Transfer Attack (DATA)-based Segmentation Networks for Building Extraction from Aerial Images","volume":"59","author":"Na","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"75641","DOI":"10.1109\/ACCESS.2021.3082076","article-title":"NeighborLoss: A Loss Function Considering Spatial Correlation for Semantic Segmentation of Remote Sensing Image","volume":"9","author":"Yuan","year":"2021","journal-title":"IEEE Access"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Wang, Y., Zhao, L., Liu, L., Hu, H., and Tao, W. (2021). URNet: A U-Shaped Residual Network for Lightweight Image Super-Resolution. Remote Sens., 13.","DOI":"10.3390\/rs13193848"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Chen, M., Wu, J., Liu, L., Zhao, W., Tian, F., Shen, Q., Zhao, B., and Du, R. (2021). DR-Net: An Improved Network for Building Extraction from High Resolution Remote Sensing Image. Remote Sens., 13.","DOI":"10.3390\/rs13020294"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Miao, Y., Jiang, S., Xu, Y., and Wang, D. (2022). Feature Residual Analysis Network for Building Extraction from Remote Sensing Images. Appl. Sci., 12.","DOI":"10.3390\/app12105095"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"106103","DOI":"10.1016\/j.knosys.2020.106103","article-title":"Lightweight multi-scale residual networks with attention for image super-resolution","volume":"203","author":"Liu","year":"2020","journal-title":"Knowl. Based Syst."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Guo, M., Liu, H., Xu, Y., and Huang, Y. (2020). Building extraction based on U-Net with an attention block and multiple losses. Remote Sens., 12.","DOI":"10.3390\/rs12091400"},{"key":"ref_41","first-page":"8011305","article-title":"Multiscale building extraction with refined attention pyramid networks","volume":"19","author":"Tian","year":"2021","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Das, P., and Chand, S. (2021, January 19\u201320). AttentionBuildNet for Building Extraction from Aerial Imagery. Proceedings of the 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India.","DOI":"10.1109\/ICCCIS51004.2021.9397178"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Chen, Z., Li, D., Fan, W., Guan, H., Wang, C., and Li, J. (2021). Self-attention in reconstruction bias U-Net for semantic segmentation of building rooftops in optical remote sensing images. Remote Sens., 13.","DOI":"10.3390\/rs13132524"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"2611","DOI":"10.1109\/JSTARS.2021.3058097","article-title":"Attention-Gate-Based Encoder\u2013Decoder Network for Automatical Building Extraction","volume":"14","author":"Deng","year":"2021","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"5807","DOI":"10.1109\/JSTARS.2021.3084805","article-title":"MHA-Net: Multipath Hybrid Attention Network for building footprint extraction from high-resolution remote sensing imagery","volume":"14","author":"Cai","year":"2021","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Liu, Y., Wang, S., Chen, J., Chen, B., Wang, X., Hao, D., and Sun, L. (2022). Rice Yield Prediction and Model Interpretation Based on Satellite and Climatic Indicators Using a Transformer Method. Remote Sens., 14.","DOI":"10.3390\/rs14195045"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Yuan, W., and Xu, W. (2021). MSST-Net: A Multi-Scale Adaptive Network for Building Extraction from Remote Sensing Images Based on Swin Transformer. Remote Sens., 13.","DOI":"10.3390\/rs13234743"},{"key":"ref_48","first-page":"2503605","article-title":"Multiscale feature learning by transformer for building extraction from satellite images","volume":"19","author":"Chen","year":"2022","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Chen, K., Zou, Z., and Shi, Z. (2021). Building Extraction from Remote Sensing Images with Sparse Token Transformers. Remote Sens., 13.","DOI":"10.3390\/rs13214441"},{"key":"ref_50","first-page":"5625711","article-title":"Building extraction with vision Transformer","volume":"60","author":"Wang","year":"2021","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_51","first-page":"448","article-title":"Building extraction via convolutional neural networks from an open remote sensing building dataset","volume":"48","author":"Ji","year":"2019","journal-title":"Acta Geod. Cartogr. Sin."},{"key":"ref_52","unstructured":"Mnih, V. (2013). Machine Learning for Aerial Image Labeling, University of Toronto."},{"key":"ref_53","unstructured":"Kingma, D.P., and Ba, J. (2015, January 7\u20139). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference for Learning Representations, San Diego, CA, USA."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/8\/1996\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:13:07Z","timestamp":1760123587000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/15\/8\/1996"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,10]]},"references-count":53,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2023,4]]}},"alternative-id":["rs15081996"],"URL":"https:\/\/doi.org\/10.3390\/rs15081996","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,4,10]]}}}