{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,13]],"date-time":"2026-07-13T22:00:14Z","timestamp":1783980014258,"version":"3.55.0"},"reference-count":42,"publisher":"MDPI AG","issue":"23","license":[{"start":{"date-parts":[[2021,11,25]],"date-time":"2021-11-25T00:00:00Z","timestamp":1637798400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Remote sensing image object detection and instance segmentation are widely valued research fields. A convolutional neural network (CNN) has shown defects in the object detection of remote sensing images. In recent years, the number of studies on transformer-based models increased, and these studies achieved good results. However, transformers still suffer from poor small object detection and unsatisfactory edge detail segmentation. In order to solve these problems, we improved the Swin transformer based on the advantages of transformers and CNNs, and designed a local perception Swin transformer (LPSW) backbone to enhance the local perception of the network and to improve the detection accuracy of small-scale objects. We also designed a spatial attention interleaved execution cascade (SAIEC) network framework, which helped to strengthen the segmentation accuracy of the network. Due to the lack of remote sensing mask datasets, the MRS-1800 remote sensing mask dataset was created. Finally, we combined the proposed backbone with the new network framework and conducted experiments on this MRS-1800 dataset. 
Compared with the Swin transformer, the proposed model improved the mask AP by 1.7%, mask APS by 3.6%, AP by 1.1% and APS by 4.6%, demonstrating its effectiveness and feasibility.<\/jats:p>","DOI":"10.3390\/rs13234779","type":"journal-article","created":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T01:45:02Z","timestamp":1638323102000},"page":"4779","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":162,"title":["An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation"],"prefix":"10.3390","volume":"13","author":[{"given":"Xiangkai","family":"Xu","sequence":"first","affiliation":[{"name":"School of Physics and Optoelectronic Engineering, Xidian University, 2 South TaiBai Road, Xi\u2019an 710071, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhejun","family":"Feng","sequence":"additional","affiliation":[{"name":"School of Physics and Optoelectronic Engineering, Xidian University, 2 South TaiBai Road, Xi\u2019an 710071, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Changqing","family":"Cao","sequence":"additional","affiliation":[{"name":"School of Physics and Optoelectronic Engineering, Xidian University, 2 South TaiBai Road, Xi\u2019an 710071, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Mengyuan","family":"Li","sequence":"additional","affiliation":[{"name":"School of Physics and Optoelectronic Engineering, Xidian University, 2 South TaiBai Road, Xi\u2019an 710071, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jin","family":"Wu","sequence":"additional","affiliation":[{"name":"School of Physics and Optoelectronic Engineering, Xidian University, 2 South TaiBai Road, Xi\u2019an 710071, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zengyan","family":"Wu","sequence":"additional","affiliation":[{"name":"School of Physics and 
Optoelectronic Engineering, Xidian University, 2 South TaiBai Road, Xi\u2019an 710071, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yajie","family":"Shang","sequence":"additional","affiliation":[{"name":"School of Physics and Optoelectronic Engineering, Xidian University, 2 South TaiBai Road, Xi\u2019an 710071, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Shubing","family":"Ye","sequence":"additional","affiliation":[{"name":"School of Physics and Optoelectronic Engineering, Xidian University, 2 South TaiBai Road, Xi\u2019an 710071, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2021,11,25]]},"reference":[{"key":"ref_1","first-page":"1","article-title":"An Improved Faster R-CNN for Small Object Detection","volume":"7","author":"Cao","year":"2019","journal-title":"IEEE Access"},{"key":"ref_2","first-page":"1","article-title":"Survey on Aircraft Detection in Optical Remote Sensing Images","volume":"47","author":"Zhu","year":"2020","journal-title":"Comput. Sci."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Wu, J., Cao, C., Zhou, Y., Zeng, X., Feng, Z., Wu, Q., and Huang, Z. (2021). Multiple Ship Tracking in Remote Sensing Images Using Deep Learning. Remote Sens., 13.","DOI":"10.3390\/rs13183601"},{"key":"ref_4","unstructured":"Li, X.Y. (2019). Object Detection in Remote Sensing Images Based on Deep Learning. [Master\u2019s Thesis, Department Computer Application Technology, University of Science and Technology of China]."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1016\/j.compenvurbsys.2013.12.002","article-title":"Using street based metrics to characterize urban typologies","volume":"44","author":"Hermosilla","year":"2014","journal-title":"Comput. Environ. 
Urban Syst."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"He, K., Gkioxari, G., Doll\u00e1r, P., and Girshick, R. (2017, January 22\u201329). Mask R-CNN. Proceedings of the IEEE ICCV, Venice, Italy.","DOI":"10.1109\/ICCV.2017.322"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018, January 18\u201323). Path Aggregation Network for Instance Segmentation. Proceedings of the 2018, IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00913"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Huang, Z., Huang, L., Gong, Y., Huang, C., and Wang, X. (2019, January 16\u201320). Mask Scoring R-CNN. Proceedings of the IEEE\/CVF CVPR, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00657"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Dai, J.F., He, K.M., and Sun, J. (2016, January 27\u201330). Instance-aware semantic segmentation via multi-task network cascades. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.343"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Wang, X., Kong, T., Shen, C., Jiang, Y., and Li, L. (2020, January 23\u201328). Solo: Segmenting objects by locations. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58523-5_38"},{"key":"ref_12","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. Attention is all you need. 
Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_13","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021, January 3\u20137). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, Austria."},{"key":"ref_14","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020, January 23\u201328). End-to-End Object Detection with Transformers. Proceedings of the 16th ECCV, Glasgow, UK."},{"key":"ref_15","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jegou, H. (2021, January 18\u201324). Training data-efficient image transformers & distillation through attention. Proceedings of the 38th ICML, Virtual Event."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., and Torr, P.H.S. (2021, January 19\u201325). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Virtual.","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"ref_17","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"J. Mach. Learn. Res."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. 
arXiv, Available online: https:\/\/arxiv.org\/abs\/2103.14030.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Chen, K., Pang, J.M., Wang, J.Q., Xiong, Y., Li, X.X., Sun, S.Y., Feng, W.F., Liu, Z.W., Shi, J.P., and Wangli, O.Y. (2019, January 15\u201321). Hybrid Task Cascade for Instance Segmentation. Proceedings of the IEEE CVPR, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00511"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1904","DOI":"10.1109\/TPAMI.2015.2389824","article-title":"Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition","volume":"37","author":"He","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015, January 7\u201313). Fast R-CNN. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You Only Look Once: Unified, Real-Time Object Detection. Proceedings of the IEEE CVPR, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C.Y., and Berg, A.C. (2016, January 11\u201314). SSD: Single Shot MultiBox Detector. Proceedings of the IEEE ECCV, Amsterdam, Netherlands.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Lin, T., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature Pyramid Networks for Object Detection. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal Loss for Dense Object Detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"2978","DOI":"10.1109\/TPAMI.2017.2775623","article-title":"Proposal-Free Network for Instance-Level Object Segmentation","volume":"40","author":"Liang","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_27","unstructured":"Wang, X.L., Zhang, R.F., Kong, T., Li, L., and Shen, C.H. (2020). SOLOv2: Dynamic and Fast Instance Segmentation. arXiv, Available online: https:\/\/arxiv.org\/abs\/2003.10152v3."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Lee, Y., and Park, J. (2020, January 13\u201319). CenterMask: Real-Time Anchor-Free Instance Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01392"},{"key":"ref_29","unstructured":"Tian, Z., Shen, C., Chen, H., and He, T. (2019, October 27\u2013November 2). FCOS: Fully Convolutional One-Stage Object Detection. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_30","unstructured":"Zhou, X.Z., Su, W.J., Lu, L.W., Li, B., Wang, X.G., and Dai, J.F. (2020, January 3\u20137). Deformable DETR: Deformable Transformers for End-to-End Object Detection. Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, Austria."},{"key":"ref_31","unstructured":"Zheng, M.H., Gao, P., Wang, X.G., Li, H.S., and Dong, H. (2020). End-to-End Object Detection with Adaptive Clustering Transformer. 
arXiv, Available online: https:\/\/arxiv.org\/abs\/2011.09315."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Wang, Y.Q., Xu, Z.L., Wang, X.L., Shen, C.H., Cheng, B.S., Shen, H., and Xia, H.X. (2021, January 19\u201325). End-to-End Video Instance Segmentation with Transformers. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Virtual.","DOI":"10.1109\/CVPR46437.2021.00863"},{"key":"ref_33","unstructured":"Yu, F., and Koltun, V. (2016, January 2\u20134). Multi-scale context aggregation by dilated convolutions. Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.Y., and Kweon, I. (2018, January 8\u201314). CBAM: Convolutional block attention module. Proceedings of the ECCV, Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_35","unstructured":"Zhu, X.Z., Cheng, D.Z., Zhang, Z., Lin, S., and Dai, J.F. (2019, October 27\u2013November 2). An empirical study of spatial attention mechanisms in deep networks. Proceedings of the ICCV, Seoul, Korea."},{"key":"ref_36","unstructured":"Li, K., Wang, G., Cheng, G., Meng, L.Q., and Han, J.W. (2019). Object Detection in Optical Remote Sensing Images: A Survey and A New Benchmark. arXiv, Available online: https:\/\/arxiv.org\/abs\/1909.00133v2."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"5535","DOI":"10.1109\/TGRS.2019.2900302","article-title":"Hierarchical and Robust Convolutional Neural Network for Very High-Resolution Remote Sensing Object Detection","volume":"57","author":"Zhang","year":"2019","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_38","first-page":"7405","article-title":"Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images","volume":"12","author":"Gong","year":"2016","journal-title":"IEEE Trans. Geosci. 
Remote Sens."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Sun, P.Z., Zhang, R.F., Jiang, Y., Kong, T., Xu, C.F., Zhan, W., Tomizuka, M., Li, L., Yuan, Z.H., and Wang, C.H. (2021, January 19\u201325). Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Virtual.","DOI":"10.1109\/CVPR46437.2021.01422"},{"key":"ref_40","unstructured":"Loshchilov, I., and Hutter, F. (2019, January 6\u20139). Decoupled weight decay regularization. Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1214\/aoms\/1177729586","article-title":"A stochastic approximation method","volume":"22","author":"Robbins","year":"1951","journal-title":"Ann. Math. Stat."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Versaci, M., Calcagno, S., and Morabito, F.C. (2015, January 19\u201321). Fuzzy Geometrical Approach Based on Unit Hyper-Cubes for Image Contrast Enhancement. 
Proceedings of the 2015 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), Kuala Lumpur, Malaysia.","DOI":"10.1109\/ICSIPA.2015.7412240"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/23\/4779\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:35:47Z","timestamp":1760168147000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/23\/4779"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,11,25]]},"references-count":42,"journal-issue":{"issue":"23","published-online":{"date-parts":[[2021,12]]}},"alternative-id":["rs13234779"],"URL":"https:\/\/doi.org\/10.3390\/rs13234779","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,11,25]]}}}