{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T17:12:15Z","timestamp":1774631535561,"version":"3.50.1"},"reference-count":56,"publisher":"MDPI AG","issue":"13","license":[{"start":{"date-parts":[[2022,6,27]],"date-time":"2022-06-27T00:00:00Z","timestamp":1656288000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Mainstream ship classification and detection algorithms can be improved by building on convolutional neural networks (CNNs). By analyzing the characteristics of ship images, we found that the difficulty in ship image classification lies in distinguishing ships with similar hull structures but different equipment and superstructures. To extract features such as ship superstructures, this paper introduces a transformer architecture with self-attention into ship classification and detection, and a CNN and Swin transformer model (CNN-Swin model) is proposed for ship image classification and detection. The main contributions of this study are as follows: (1) The proposed approach attends to features at different scales in ship image classification and detection, introduces a transformer architecture with self-attention into ship classification and detection for the first time, and uses a parallel network of a CNN and a transformer to extract image features. (2) To exploit the CNN\u2019s performance and avoid overfitting as much as possible, a multi-branch CNN-Block is designed and used to construct a simple and accessible CNN backbone to extract features. (3) The performance of the CNN-Swin model is validated on the open FGSC-23 dataset and a dataset containing typical military ship categories based on open-source images. 
The results show that the model achieved accuracies of 90.9% and 91.9% for the FGSC-23 dataset and the military ship dataset, respectively, outperforming nine existing state-of-the-art approaches. (4) The CNN-Swin model\u2019s strong ship-feature extraction is further validated by using it as the backbone of three state-of-the-art detection methods on the open datasets HRSC2016 and FAIR1M. The results show the great potential of the CNN-Swin backbone with self-attention in ship detection.<\/jats:p>","DOI":"10.3390\/rs14133087","type":"journal-article","created":{"date-parts":[[2022,6,28]],"date-time":"2022-06-28T00:07:02Z","timestamp":1656374822000},"page":"3087","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":46,"title":["Fine-Grained Ship Classification by Combining CNN and Swin Transformer"],"prefix":"10.3390","volume":"14","author":[{"given":"Liang","family":"Huang","sequence":"first","affiliation":[{"name":"College of Electronic Engineering, Naval University of Engineering, Wuhan 430000, China"}]},{"given":"Fengxiang","family":"Wang","sequence":"additional","affiliation":[{"name":"College of Electronic Engineering, Naval University of Engineering, Wuhan 430000, China"}]},{"given":"Yalun","family":"Zhang","sequence":"additional","affiliation":[{"name":"Institute of Noise & Vibration, Naval University of Engineering, Wuhan 430000, China"}]},{"given":"Qingxia","family":"Xu","sequence":"additional","affiliation":[{"name":"College of International Studies, National University of Defense Technology, Wuhan 430000, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,6,27]]},"reference":[{"key":"ref_1","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20136). ImageNet classification with deep convolutional neural networks. 
Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_2","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very deep convolutional networks for large-scale image recognition. Proceedings of the International Conference on Learning Representations, San Diego, CA, USA."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 8\u201310). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_4","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 6\u201311). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016, January 27\u201330). Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.308"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. (2017, January 4\u20139). Inception-v4, Inception-ResNet and the impact of residual connections on learning. Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.","DOI":"10.1609\/aaai.v31i1.11231"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Maaten, L.V.D., and Weinberger, K.Q. (2017, January 21\u201326). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_9","unstructured":"Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., and Chen, L. (2018, January 18\u201323). MobileNetV2: Inverted residuals and linear bottlenecks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00474"},{"key":"ref_11","unstructured":"Howard, A., Sandler, M., Chu, G., Chen, L., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., and Vasudevan, V. (November, January 27). Searching for MobileNetV3. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_12","unstructured":"Tan, M., and Le, Q. (2019, January 9\u201316). EfficientNet: Rethinking model scaling for convolutional neural networks. Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Lin, T., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 21\u201326). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Jeon, H., and Yang, C. (2021). 
Enhancement of Ship Type Classification from a Combination of CNN and KNN. Electronics, 10.","DOI":"10.3390\/electronics10101169"},{"key":"ref_15","first-page":"233","article-title":"Research on the Development of Object Detection Algorithm in the Field of Ship Target Recognition","volume":"7","author":"Li","year":"2021","journal-title":"Int. Core J. Eng."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"012081","DOI":"10.1088\/1742-6596\/1450\/1\/012081","article-title":"Object recognition on patrol ship using image processing and convolutional neural network (CNN)","volume":"1450","author":"Julianto","year":"2020","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"104812","DOI":"10.1016\/j.ssci.2020.104812","article-title":"Deep learning for autonomous ship-oriented small ship detection","volume":"130","author":"Chen","year":"2020","journal-title":"Saf. Sci."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"82","DOI":"10.2112\/SI102-011.1","article-title":"Optical Remote Sensing Ship Image Classification Based on Deep Feature Combined Distance Metric Learning","volume":"102","author":"Zhao","year":"2020","journal-title":"J. Coast. Res."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1879","DOI":"10.1049\/iet-rsn.2020.0113","article-title":"Fast ship detection combining visual saliency and a cascade CNN in SAR images","volume":"14","author":"Xu","year":"2020","journal-title":"IET Radar Sonar Navig."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"277","DOI":"10.2112\/JCR-SI115-088.1","article-title":"Design and Implementation of Marine Automatic Target Recognition System Based on Visible Remote Sensing Images","volume":"115","author":"Gao","year":"2020","journal-title":"J. Coast. Res."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Ren, Y., Yang, J., Zhang, Q., and Guo, Z. (2019). 
Multi-Feature Fusion with Convolutional Neural Network for Ship Classification in Optical Images. Appl. Sci., 9.","DOI":"10.3390\/app9204209"},{"key":"ref_22","first-page":"7343","article-title":"Ship classification based on convolutional neural networks","volume":"21","author":"Li","year":"2019","journal-title":"J. Eng."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Bi, F., Hou, J., Chen, L., Yang, Z., and Wang, Y. (2019). Ship Detection for Optical Remote Sensing Images Based on Visual Attention Enhanced Network. Sensors, 19.","DOI":"10.3390\/s19102271"},{"key":"ref_24","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_25","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021, January 3\u20137). An image is worth 16\u00d716 words: Transformers for image recognition at scale. Proceedings of the International Conference on Learning Representations, Vienna, Austria."},{"key":"ref_26","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J\u00e9gou, H. (2020). Training data-efficient image transformers & distillation through attention. arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Tay, F.E., Feng, J., and Yan, S. (2021). Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. arXiv.","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"ref_28","unstructured":"Chu, X., Zhang, B., Tian, Z., Wei, X., and Xia, H. (2021). Do we really need explicit position encodings for vision transformers? arXiv."},{"key":"ref_29","unstructured":"Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., and Wang, Y. (2021). 
Transformer in Transformer. arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Wang, W., Xie, E., Li, X., Fan, D., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv.","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Heo, B., Yun, S., Han, D., Chun, S., Choe, J., and Oh, S.J. (2021). Rethinking Spatial Dimensions of Vision Transformers. arXiv.","DOI":"10.1109\/ICCV48922.2021.01172"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and J\u00e9gou, H. (2021). Going deeper with Image Transformers. arXiv.","DOI":"10.1109\/ICCV48922.2021.00010"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Xu, X., Feng, Z., Cao, C., Li, M., Wu, J., Wu, Z., Shang, Y., and Ye, S. (2021). An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation. Remote Sens., 13.","DOI":"10.3390\/rs13234779"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Huang, B., Guo, Z., Wu, L., He, B., Li, X., and Lin, Y. (2021). Pyramid Information Distillation Attention Network for Super-Resolution Reconstruction of Remote Sensing Images. Remote Sens., 13.","DOI":"10.3390\/rs13245143"},{"key":"ref_36","first-page":"2337","article-title":"FGSC-23: A large-scale dataset of high-resolution optical remote sensing image for deep learning-based fine-grained ship recognition","volume":"26","author":"Yao","year":"2021","journal-title":"J. 
Image Graph."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"1074","DOI":"10.1109\/LGRS.2016.2565705","article-title":"Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds","volume":"13","author":"Liu","year":"2016","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Sun, X., Wang, P., Yan, Z., Xu, F., Wang, R., Diao, W., Chen, J., Li, J., Feng, Y., and Xu, T. (2021). FAIR1M: A Benchmark Dataset for Fine-grained Object Recognition in High-Resolution Remote Sensing Imagery. arXiv.","DOI":"10.1016\/j.isprsjprs.2021.12.004"},{"key":"ref_39","unstructured":"Springenberg, J.T., Dosovitskiy, A., and Riedmiller, M.A. (2014). Striving for Simplicity: The All Convolutional Net. arXiv."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Han, D., Yun, S., Heo, B., and Yoo, Y. (2021, January 20\u201325). Rethinking channel dimensions for efficient model design. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00079"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., and Dollar, P. (2020, January 13\u201319). Designing network design spaces. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01044"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Ding, X., Zhang, X., Ma, N., Han, J., Ding, G., and Sun, J. (2021, January 20\u201325). RepVGG: Making VGG-style convnets great again. Proceedings of the 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01352"},{"key":"ref_43","unstructured":"Veit, A., Wilber, M.J., and Belongie, S. (2016). Residual networks behave like ensembles of relatively shallow networks. 
Advances in Neural Information Processing Systems, Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5\u201310 December 2016, Curran Associates Inc."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Hu, H., Zhang, Z., Xie, Z., and Lin, S. (2019, January 27\u201328). Local relation networks for image recognition. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00356"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Hu, H., Gu, J., Zhang, Z., Dai, J., and Wei, Y. (2018, January 18\u201323). Relation networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00378"},{"key":"ref_46","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"J. Mach. Learn. Res."},{"key":"ref_47","unstructured":"Bao, H., Dong, L., Wei, F., Wang, W., Yang, N., Liu, X., Wang, Y., Gao, J., Piao, S., and Zhou, M. (2020, January 12\u201318). UniLMv2: Pseudo-masked language models for unified language model pre-training. Proceedings of the International Conference on Machine Learning, Vienna, Austria."},{"key":"ref_48","first-page":"3221","article-title":"Accelerating t-SNE using tree-based algorithms","volume":"15","author":"Maaten","year":"2014","journal-title":"J. Mach. Learn. Res."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Xiao, Z., Qian, L., Shao, W., Tan, X., and Wang, K. (2020). Axis learning for orientated objects detection in aerial images. Remote Sens., 12.","DOI":"10.3390\/rs12060908"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Zhong, B., and Ao, K. (2020). Single-stage rotation-decoupled detector for oriented object. 
Remote Sens., 12.","DOI":"10.3390\/rs12193262"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Ming, Q., Miao, L., Zhou, Z., Song, J., and Yang, X. (2021). Sparse Label Assignment for Oriented Object Detection in Aerial Images. Remote Sens., 13.","DOI":"10.3390\/rs13142664"},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"303","DOI":"10.1007\/s11263-009-0275-4","article-title":"The Pascal Visual Object Classes (VOC) Challenge","volume":"88","author":"Everingham","year":"2010","journal-title":"Int. J. Comput. Vis."},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"336","DOI":"10.1007\/s11263-019-01228-7","article-title":"Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization","volume":"128","author":"Selvaraju","year":"2019","journal-title":"Int. J. Comput. Vis."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Abnar, S., and Zuidema, W. (2020, January 5\u201310). Quantifying Attention Flow in Transformers. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online.","DOI":"10.18653\/v1\/2020.acl-main.385"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Zhu, M., Hu, G., Zhou, H., Wang, S., Feng, Z., and Yue, S. (2022). A Ship Detection Method via Redesigned FCOS in Large-Scale SAR Images. Remote Sens., 14.","DOI":"10.3390\/rs14051153"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Li, L., Jiang, L., Zhang, J., Wang, S., and Chen, F. (2022). A Complete YOLO-Based Ship Detection Method for Thermal Infrared Remote Sensing Images under Complex Backgrounds. 
Remote Sens., 14.","DOI":"10.3390\/rs14071534"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/14\/13\/3087\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:39:06Z","timestamp":1760139546000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/14\/13\/3087"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,27]]},"references-count":56,"journal-issue":{"issue":"13","published-online":{"date-parts":[[2022,7]]}},"alternative-id":["rs14133087"],"URL":"https:\/\/doi.org\/10.3390\/rs14133087","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,6,27]]}}}