{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,4]],"date-time":"2026-06-04T00:27:55Z","timestamp":1780532875272,"version":"3.54.1"},"reference-count":61,"publisher":"MDPI AG","issue":"16","license":[{"start":{"date-parts":[[2021,8,4]],"date-time":"2021-08-04T00:00:00Z","timestamp":1628035200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["41971352"],"award-info":[{"award-number":["41971352"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2018YFB0505003"],"award-info":[{"award-number":["2018YFB0505003"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Semantic segmentation from very fine resolution (VFR) urban scene images plays a significant role in several application scenarios including autonomous driving, land cover classification, urban planning, etc. However, the tremendous details contained in the VFR image, especially the considerable variations in scale and appearance of objects, severely limit the potential of the existing deep learning approaches. Addressing such issues represents a promising research field in the remote sensing community, which paves the way for scene-level landscape pattern analysis and decision making. In this paper, we propose a Bilateral Awareness Network which contains a dependency path and a texture path to fully capture the long-range relationships and fine-grained details in VFR images. Specifically, the dependency path is conducted based on the ResT, a novel Transformer backbone with memory-efficient multi-head self-attention, while the texture path is built on the stacked convolution operation. In addition, using the linear attention mechanism, a feature aggregation module is designed to effectively fuse the dependency features and texture features. Extensive experiments conducted on the three large-scale urban scene image segmentation datasets, i.e., ISPRS Vaihingen dataset, ISPRS Potsdam dataset, and UAVid dataset, demonstrate the effectiveness of our BANet. Specifically, a 64.6% mIoU is achieved on the UAVid dataset.<\/jats:p>","DOI":"10.3390\/rs13163065","type":"journal-article","created":{"date-parts":[[2021,8,4]],"date-time":"2021-08-04T08:47:52Z","timestamp":1628066872000},"page":"3065","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":254,"title":["Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images"],"prefix":"10.3390","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8096-6531","authenticated-orcid":false,"given":"Libo","family":"Wang","sequence":"first","affiliation":[{"name":"School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7858-3160","authenticated-orcid":false,"given":"Rui","family":"Li","sequence":"additional","affiliation":[{"name":"School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dongzhi","family":"Wang","sequence":"additional","affiliation":[{"name":"Surveying and Mapping Institute, Lands and Resource Department of Guangdong Province, Guangzhou 510500, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chenxi","family":"Duan","sequence":"additional","affiliation":[{"name":"State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430079, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Teng","family":"Wang","sequence":"additional","affiliation":[{"name":"Surveying and Mapping Institute, Lands and Resource Department of Guangdong Province, Guangzhou 510500, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xiaoliang","family":"Meng","sequence":"additional","affiliation":[{"name":"School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2021,8,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"280","DOI":"10.1016\/j.isprsjprs.2020.09.025","article-title":"Identifying and mapping individual plants in a highly diverse high-elevation ecosystem using UAV imagery and deep learning","volume":"169","author":"Zhang","year":"2020","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"111593","DOI":"10.1016\/j.rse.2019.111593","article-title":"Scale sequence joint deep learning (SS-JDL) for land use and land cover classification","volume":"237","author":"Zhang","year":"2020","journal-title":"Remote Sens. Environ."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Li, R., Zheng, S., Duan, C., Su, J., and Zhang, C. (2021). Multistage attention ResU-Net for Semantic segmentation of fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett.","DOI":"10.1109\/LGRS.2021.3063381"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Li, R., Duan, C., Zheng, S., Zhang, C., and Atkinson, P.M. (2021). MACU-Net for semantic segmentation of fine-resolution remotely sensed images. IEEE Geosci. Remote Sens. Lett.","DOI":"10.1109\/LGRS.2021.3052886"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Wang, L., Fang, S., Zhang, C., Li, R., Duan, C., Meng, X., and Atkinson, P.M. (2021). SaNet: Scale-aware neural network for semantic labelling of multiple spatial resolution aerial images. arXiv.","DOI":"10.3390\/rs13245015"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Huang, Z., Wei, Y., Wang, X., Shi, H., Liu, W., and Huang, T.S. (2021). AlignSeg: Feature-Aligned segmentation networks. IEEE Trans. Pattern Anal. Mach. Intell.","DOI":"10.1109\/TPAMI.2021.3062772"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Yao, H., Qin, R., and Chen, X. (2019). Unmanned aerial vehicle for remote sensing applications\u2014A review. Remote Sens., 11.","DOI":"10.3390\/rs11121443"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Audebert, N., Le Saux, B., and Lef\u00e8vre, S. (2017). Segment-before-Detect: Vehicle detection and classification through semantic segmentation of aerial images. Remote Sens., 9.","DOI":"10.3390\/rs9040368"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1777","DOI":"10.3390\/rs3081777","article-title":"Segment-based land cover mapping of a suburban area\u2014Comparison of high-resolution remotely sensed datasets using classification trees and test field points","volume":"3","author":"Matikainen","year":"2011","journal-title":"Remote Sens."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"2320","DOI":"10.1016\/j.rse.2011.04.032","article-title":"Mapping urbanization dynamics at regional and global scales using multi-temporal DMSP\/OLS nighttime light data","volume":"115","author":"Zhang","year":"2011","journal-title":"Remote Sens. Environ."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"709","DOI":"10.1109\/LGRS.2017.2672734","article-title":"Road structure refined CNN for road extraction in aerial image","volume":"14","author":"Wei","year":"2017","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"4483","DOI":"10.1109\/TGRS.2015.2400462","article-title":"Robust rooftop extraction from visible band images using higher order CRF","volume":"53","author":"Li","year":"2015","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Wang, C., Ji, Y., Chen, J., Deng, Y., Chen, J., and Jie, Y. (2020). Combining segmentation network and nonsubsampled contourlet transform for automatic marine raft aquaculture area extraction from sentinel-1 images. Remote Sens., 12.","DOI":"10.3390\/rs12244182"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Maxwell, A.E., Bester, M.S., Guillen, L.A., Ramezan, C.A., Carpinello, D.J., Fan, Y., Hartley, F.M., Maynard, S.M., and Pyron, J.L. (2020). Semantic segmentation deep learning for extracting surface mine extents from historic topographic maps. Remote Sens., 12.","DOI":"10.3390\/rs12244145"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Kalajdjieski, J., Zdravevski, E., Corizzo, R., Lameski, P., Kalajdziski, S., Pires, I.M., Garcia, N.M., and Trajkovik, V. (2020). Air pollution prediction with multi-modal data and deep neural networks. Remote Sens., 12.","DOI":"10.3390\/rs12244142"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1016\/j.isprsjprs.2020.01.013","article-title":"ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data","volume":"162","author":"Diakogiannis","year":"2020","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Li, R., and Duan, C. (2021). ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of fine-resolution remote sensing images. arXiv.","DOI":"10.1016\/j.isprsjprs.2021.09.005"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"173","DOI":"10.1016\/j.rse.2018.11.014","article-title":"Joint deep learning for land cover and land use classification","volume":"221","author":"Zhang","year":"2019","journal-title":"Remote Sens. Environ."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1016\/j.rse.2018.06.034","article-title":"An object-based convolutional neural network (OCNN) for urban land use classification","volume":"216","author":"Zhang","year":"2018","journal-title":"Remote Sens. Environ."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_21","unstructured":"Sherrah, J. (2016). Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"3036","DOI":"10.1109\/TIP.2018.2808767","article-title":"Effective sequential classifier training for SVM-based multitemporal remote sensing image classification","volume":"27","author":"Guo","year":"2018","journal-title":"IEEE Trans. Image Process."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1080\/01431160412331269698","article-title":"Random forest classifier for remote sensing classification","volume":"26","author":"Pal","year":"2005","journal-title":"Int. J. Remote Sens."},{"key":"ref_24","first-page":"109","article-title":"Efficient inference in fully connected crfs with gaussian edge potentials","volume":"24","author":"Koltun","year":"2011","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"166","DOI":"10.1016\/j.isprsjprs.2019.04.015","article-title":"Deep learning in remote sensing applications: A meta-analysis and review","volume":"152","author":"Ma","year":"2019","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1016\/j.isprsjprs.2018.01.021","article-title":"Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models","volume":"145","author":"Marcos","year":"2018","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.isprsjprs.2019.07.007","article-title":"TreeUNet: Adaptive Tree convolutional neural networks for subdecimeter aerial image segmentation","volume":"156","author":"Yue","year":"2019","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., and Brox, T. (2015, January 5\u20139). U-Net: Convolutional networks for biomedical image segmentation. Proceedings of the Medical Image Computing and Computer-Assisted Intervention\u2014MICCAI 2015, Munich, Germany.","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"78","DOI":"10.1016\/j.isprsjprs.2017.12.007","article-title":"Semantic labeling in very high resolution images via a self-cascaded convolutional neural network","volume":"145","author":"Liu","year":"2018","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"124","DOI":"10.1016\/j.isprsjprs.2021.06.006","article-title":"Real-time semantic segmentation with context aggregation network","volume":"178","author":"Yang","year":"2021","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., and Lu, H. (2019, January 16\u201320). Dual attention network for scene segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00326"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Li, R., Zheng, S., Zhang, C., Duan, C., Su, J., Wang, L., and Atkinson, P.M. (2021). Multiattention network for semantic segmentation of fine-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens.","DOI":"10.1109\/TGRS.2021.3093977"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Zhao, H., Qi, X., Shen, X., Shi, J., and Jia, J. (2018, January 8\u201314). Icnet for real-time semantic segmentation on high-resolution images. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01219-9_25"},{"key":"ref_34","unstructured":"Kampffmeyer, M., Salberg, A.-B., and Jenssen, R. (July, January 26). Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"7092","DOI":"10.1109\/TGRS.2017.2740362","article-title":"High-resolution aerial image labeling with convolutional neural networks","volume":"55","author":"Maggiori","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1016\/j.isprsjprs.2017.11.011","article-title":"Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks","volume":"140","author":"Audebert","year":"2018","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Duan, C., Pan, J., and Li, R. (2020). Thick cloud removal of remote sensing images using temporal smoothness and sparsity regularized tensor optimization. Remote Sens., 12.","DOI":"10.3390\/rs12203446"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"1758","DOI":"10.1109\/JSTARS.2018.2834961","article-title":"Urban land cover classification with missing data modalities using deep convolutional neural networks","volume":"11","author":"Kampffmeyer","year":"2018","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"158","DOI":"10.1016\/j.isprsjprs.2017.11.009","article-title":"Classification with an edge: Improving semantic image segmentation with boundary detection","volume":"135","author":"Marmanis","year":"2018","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1016\/j.isprsjprs.2020.09.019","article-title":"Parsing very high resolution urban scene images by learning deep ConvNets with edge-aware loss","volume":"170","author":"Zheng","year":"2020","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., and He, K. (2018, January 18\u201322). Non-local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"ref_42","unstructured":"Yu, F., and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"6309","DOI":"10.1109\/TGRS.2020.2976658","article-title":"Dense Dilated Convolutions\u2019 Merging Network for Land Cover Classification","volume":"58","author":"Liu","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Huang, Z., Wang, X., Wei, Y., Huang, L., Shi, H., Liu, W., and Huang, T.S. (2020). CCNet: Criss-cross attention for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell.","DOI":"10.1109\/ICCV.2019.00069"},{"key":"ref_45","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. arXiv."},{"key":"ref_46","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_47","unstructured":"Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Wang, L., Li, R., Duan, C., and Fang, S. (2021). Transformer meets DCFAM: A novel semantic segmentation scheme for fine-resolution remote sensing images. arXiv.","DOI":"10.1109\/LGRS.2022.3143368"},{"key":"ref_49","unstructured":"Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). Layer normalization. arXiv."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. arXiv.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_51","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 6\u201311). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_52","unstructured":"Nair, V., and Hinton, G.E. (2010, January 21\u201324). Rectified linear units improve Restricted Boltzmann machines. Proceedings of the International Conference on Machine Learning, Haifa, Israel."},{"key":"ref_53","unstructured":"Zhang, Q., and Yang, Y. (2021). ResT: An efficient transformer for visual recognition. arXiv."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Chollet, F. (2017, January 21\u201326). Xception: Deep learning with depthwise separable convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.195"},{"key":"ref_55","unstructured":"Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv."},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"108","DOI":"10.1016\/j.isprsjprs.2020.05.009","article-title":"UAVid: A semantic segmentation dataset for UAV imagery","volume":"165","author":"Lyu","year":"2020","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., and Sang, N. (2018, January 8\u201314). Bisenet: Bilateral segmentation network for real-time semantic segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01261-8_20"},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"263","DOI":"10.1109\/LRA.2020.3039744","article-title":"Real-time semantic segmentation with fast attention","volume":"6","author":"Hu","year":"2021","journal-title":"IEEE Robot. Autom. Lett."},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"107611","DOI":"10.1016\/j.patcog.2020.107611","article-title":"Efficient semantic segmentation with pyramidal fusion","volume":"110","year":"2021","journal-title":"Pattern Recognit."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Zhuang, J., Yang, J., Gu, L., and Dvornek, N. (2019, January 27\u201328). Shelfnet for fast semantic segmentation. Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops, Seoul, Korea.","DOI":"10.1109\/ICCVW.2019.00113"},{"key":"ref_61","unstructured":"Poudel, R.P.K., Liwicki, S., and Cipolla, R. (2019). Fast-scnn: Fast semantic segmentation network. arXiv."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/16\/3065\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:40:26Z","timestamp":1760164826000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/16\/3065"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,4]]},"references-count":61,"journal-issue":{"issue":"16","published-online":{"date-parts":[[2021,8]]}},"alternative-id":["rs13163065"],"URL":"https:\/\/doi.org\/10.3390\/rs13163065","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,8,4]]}}}