{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T22:35:41Z","timestamp":1775687741794,"version":"3.50.1"},"reference-count":89,"publisher":"MDPI AG","issue":"20","license":[{"start":{"date-parts":[[2021,10,16]],"date-time":"2021-10-16T00:00:00Z","timestamp":1634342400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"the Provincial Science and Technology Innovation Special Fund Project of Jilin Province","award":["20190302026GX"],"award-info":[{"award-number":["20190302026GX"]}]},{"DOI":"10.13039\/100007847","name":"Natural Science Foundation of Jilin Province","doi-asserted-by":"publisher","award":["20200201037JC"],"award-info":[{"award-number":["20200201037JC"]}],"id":[{"id":"10.13039\/100007847","id-type":"DOI","asserted-by":"publisher"}]},{"name":"the Higher Education Research Project of Jilin Association for Higher Education","award":["JGJX2018D10"],"award-info":[{"award-number":["JGJX2018D10"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Remote sensing scene classification remains challenging due to the complexity and variety of scenes. With the development of attention-based methods, Convolutional Neural Networks (CNNs) have achieved competitive performance in remote sensing scene classification tasks. As an important method of the attention-based model, the Transformer has achieved great success in the field of natural language processing. Recently, the Transformer has been used for computer vision tasks. However, most existing methods divide the original image into multiple patches and encode the patches as the input of the Transformer, which limits the model\u2019s ability to learn the overall features of the image. 
In this paper, we propose a new remote sensing scene classification method, Remote Sensing Transformer (TRS), a powerful \u201cpure CNNs \u2192 Convolution + Transformer \u2192 pure Transformers\u201d structure. First, we integrate self-attention into ResNet in a novel way, using our proposed Multi-Head Self-Attention layer instead of 3 \u00d7 3 spatial convolutions in the bottleneck. Then we connect multiple pure Transformer encoders to further improve the representation learning performance completely depending on attention. Finally, we use a linear classifier for classification. We train our model on four public remote sensing scene datasets: UC-Merced, AID, NWPU-RESISC45, and OPTIMAL-31. The experimental results show that TRS exceeds the state-of-the-art methods and achieves higher accuracy.<\/jats:p>","DOI":"10.3390\/rs13204143","type":"journal-article","created":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T23:25:15Z","timestamp":1634513115000},"page":"4143","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":121,"title":["TRS: Transformers for Remote Sensing Scene Classification"],"prefix":"10.3390","volume":"13","author":[{"given":"Jianrong","family":"Zhang","sequence":"first","affiliation":[{"name":"College of Computer Science and Technology, Jilin University, Changchun 130012, China"},{"name":"Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7088-8848","authenticated-orcid":false,"given":"Hongwei","family":"Zhao","sequence":"additional","affiliation":[{"name":"College of Computer Science and Technology, Jilin University, Changchun 130012, China"},{"name":"Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, 
China"}]},{"given":"Jiao","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Jilin University Library, Jilin University, Changchun 130012, China"}]}],"member":"1968","published-online":{"date-parts":[[2021,10,16]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"3681","DOI":"10.1109\/TGRS.2018.2806371","article-title":"Lunar crater detection based on terrain analysis and mathematical morphology methods using digital elevation models","volume":"56","author":"Chen","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1535","DOI":"10.1109\/LGRS.2018.2847303","article-title":"Remote sensing image retrieval using convolutional neural network features and weighted distance","volume":"15","author":"Ye","year":"2018","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1855","DOI":"10.1109\/JPROC.2017.2729890","article-title":"Spatial technology and social media in remote sensing: A survey","volume":"105","author":"Li","year":"2017","journal-title":"Proc. IEEE"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Luo, F., Huang, H., Duan, Y., Liu, J., and Liao, Y. (2017). Local geometric structure feature for dimensionality reduction of hyperspectral imagery. Remote Sens., 9.","DOI":"10.3390\/rs9080790"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1023\/B:VISI.0000029664.99615.94","article-title":"Distinctive image features from scale-invariant keypoints","volume":"60","author":"Lowe","year":"2004","journal-title":"Int. J. Comput. Vis."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"971","DOI":"10.1109\/TPAMI.2002.1017623","article-title":"Multiresolution gray-scale and rotation invariant texture classification with local binary patterns","volume":"24","author":"Ojala","year":"2002","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"4104","DOI":"10.1109\/JSTARS.2017.2705419","article-title":"Aggregating rich hierarchical features for scene classification in remote sensing imagery","volume":"10","author":"Wang","year":"2017","journal-title":"IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Yang, S., and Ramanan, D. (2015, January 7\u201313). Multi-scale recognition with DAG-CNNs. Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA.","DOI":"10.1109\/ICCV.2015.144"},{"key":"ref_9","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very deep convolutional networks for large-scale image recognition. Proceedings of the ICLR 2015: International Conference on Learning Representations, San Diego, CA, USA."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Liang, Y., Monteiro, S.T., and Saber, E.S. (2016, January 18\u201320). Transfer learning for high resolution aerial image classification. Proceedings of the 2016 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA.","DOI":"10.1109\/AIPR.2016.8010600"},{"key":"ref_12","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2021, January 4\u20139). Attention is all you need. 
Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_13","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI Blog"},{"key":"ref_14","first-page":"1877","article-title":"Language models are few-shot learners","volume":"Volume 33","author":"Brown","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"ref_15","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.N. (2018, January 1\u20136). Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA."},{"key":"ref_16","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A.L., and Chen, L.-C. (2020, January 23\u201328). Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58548-8_7"},{"key":"ref_18","unstructured":"Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. (2019, January 8\u201314). Stand-alone self-attention in vision models. Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201322). Squeeze-and-excitation networks. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.Y., and Kweon, I.S. (2018, January 8\u201314). Cbam: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Li, X., Wang, W., Hu, X., and Yang, J. (2019, January 15\u201320). Selective kernel networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00060"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., and He, K. (2018, January 18\u201322). Non-local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"ref_23","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., and Gelly, S. (2021, January 3\u20137). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. Proceedings of the ICLR 2021: The Ninth International Conference on Learning Representations, Virtual Event."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z., and Yan, S. (2021). Tokens-to-token vit: Training vision transformers from scratch on imagenet. arXiv.","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020, January 23\u201328). End-to-end object detection with transformers. 
Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., and Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv.","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"ref_27","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1865","DOI":"10.1109\/JPROC.2017.2675998","article-title":"Remote sensing image scene classification: Benchmark and state of the art","volume":"105","author":"Cheng","year":"2017","journal-title":"Proc. IEEE"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"287","DOI":"10.1109\/LGRS.2017.2786241","article-title":"Aerial scene classification via multilevel fusion based on deep convolutional neural networks","volume":"15","author":"Yu","year":"2018","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_31","unstructured":"Tan, M., and Le, Q. (2019, January 24\u201326). Efficientnet: Rethinking model scaling for convolutional neural networks. 
Proceedings of the International Conference on Machine Learning, Crete, Greece."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"1603","DOI":"10.1109\/LGRS.2019.2949930","article-title":"APDC-Net: Attention pooling-based convolutional network for aerial scene classification","volume":"17","author":"Bi","year":"2019","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"14680","DOI":"10.3390\/rs71114680","article-title":"Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery","volume":"7","author":"Hu","year":"2015","journal-title":"Remote Sens."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"5653","DOI":"10.1109\/TGRS.2017.2711275","article-title":"Integrating multilayer features of convolutional neural networks for remote sensing scene classification","volume":"55","author":"Li","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"1793","DOI":"10.1109\/TGRS.2015.2488681","article-title":"Scene classification via a gradient boosting random convolutional network framework","volume":"54","author":"Zhang","year":"2015","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1109\/TGRS.2019.2931801","article-title":"Remote sensing scene classification by gated bidirectional network","volume":"58","author":"Sun","year":"2019","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Xu, C., Zhu, G., and Shu, J. (2021). A Lightweight and Robust Lie Group-Convolutional Neural Networks Joint Representation for Remote Sensing Scene Classification. IEEE Trans. Geosci. 
Remote Sens., 1\u201315.","DOI":"10.1109\/TGRS.2020.3048024"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"1155","DOI":"10.1109\/TGRS.2018.2864987","article-title":"Scene classification with recurrent attention of VHR remote sensing images","volume":"57","author":"Wang","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"519","DOI":"10.1109\/TGRS.2019.2937830","article-title":"Attention GANs: Unsupervised deep feature learning for aerial scene classification","volume":"58","author":"Yu","year":"2019","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1007\/BF00994018","article-title":"Support vector machine","volume":"20","author":"Cortes","year":"1995","journal-title":"Mach. Learn."},{"key":"ref_41","unstructured":"Joachims, T. (1999, January 27\u201330). Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning (ICML), Bled, Slovenia."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"336","DOI":"10.1109\/LGRS.2008.916070","article-title":"Semisupervised image classification with Laplacian support vector machines","volume":"5","author":"Calpe","year":"2008","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"2018","DOI":"10.4028\/www.scientific.net\/AMM.644-650.2018","article-title":"A new kind of parallel K_NN network public opinion classification algorithm based on Hadoop platform","volume":"644","author":"Ma","year":"2014","journal-title":"Appl. Mech. Mater."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1155\/2012\/793490","article-title":"Multiclass Boosting with Adaptive Group-Based kNN and Its Application in Text Categorization","volume":"2012","author":"La","year":"2012","journal-title":"Math. Probl. 
Eng."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"747","DOI":"10.1109\/LGRS.2015.2513443","article-title":"Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery","volume":"13","author":"Zhu","year":"2016","journal-title":"IEEE Trans. Geosci. Remote Sens. Lett."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"2279","DOI":"10.1109\/JSTARS.2016.2536143","article-title":"Application and evaluation of a hierarchical patch clustering method for remote sensing images","volume":"9","author":"Yao","year":"2016","journal-title":"IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1016\/j.isprsjprs.2016.03.004","article-title":"A spectral\u2013structural bag-of-features scene classifier for very high spatial resolution remote sensing imagery","volume":"116","author":"Zhao","year":"2016","journal-title":"ISPRS J. Photogram. Remote Sens."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"035004","DOI":"10.1117\/1.JRS.10.035004","article-title":"Feature significance-based multibag-of-visual-words model for remote sensing image scene classification","volume":"10","author":"Zhao","year":"2016","journal-title":"J. Appl. Remote Sens."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Wu, H., Liu, B., Su, W., Zhang, W., and Sun, J. (2016). Hierarchical coding vectors for scene level land-use classification. Remote Sens., 8.","DOI":"10.3390\/rs8050436"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1109\/LGRS.2015.2503142","article-title":"Unsupervised multilayer feature learning for satellite image scene classification","volume":"13","author":"Li","year":"2016","journal-title":"IEEE Trans. Geosci. Remote Sens. Lett."},{"key":"ref_51","unstructured":"Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Lin, H., Zhang, Z., and Smola, A. (2020). Resnest: Split-attention networks. 
arXiv."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Romera-Paredes, B., and Torr, P.H.S. (2016, January 11\u201314). Recurrent instance segmentation. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46466-4_19"},{"key":"ref_53","unstructured":"Olah, C. (2015, October 01). Understanding LSTM Networks. Available online: http:\/\/colah.github.io\/posts\/2015-08-Understanding-LSTMs."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014, January 25\u201329). Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Stewart, R., Andriluka, M., and Ng, A.Y. (2016, January 27\u201330). End-to-end people detection in crowded scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.255"},{"key":"ref_56","unstructured":"Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. (2018, January 10\u201315). Image transformer. Proceedings of the International Conference on Machine Learning, Stockholm, Sweden."},{"key":"ref_57","unstructured":"Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating long sequences with sparse transformers. arXiv."},{"key":"ref_58","unstructured":"Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. (2021, January 3\u20137). Deformable DETR: Deformable Transformers for End-to-End Object Detection. 
Proceedings of the ICLR 2021: The Ninth International Conference on Learning Representations, Virtual Event."},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Bello, I., Zoph, B., Vaswani, A., Shlens, J., and Le, Q.V. (2019, January 27\u201328). Attention augmented convolutional networks. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00338"},{"key":"ref_60","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J\u00e9gou, H. (2021, January 18\u201324). Training data-efficient image transformers & distillation through attention. Proceedings of the International Conference on Machine Learning, Virtual Event."},{"key":"ref_61","unstructured":"Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv."},{"key":"ref_62","unstructured":"Abnar, S., Dehghani, M., and Zuidema, W. (2020). Transferring inductive biases through knowledge distillation. arXiv."},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., and Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. arXiv.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Li, W., Cao, D., Peng, Y., and Yang, C. (2021). MSNet: A Multi-Stream Fusion Network for Remote Sensing Spatiotemporal Fusion Based on Transformer and Convolution. Remote Sens., 13.","DOI":"10.3390\/rs13183724"},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Bazi, Y., Bashmal, L., Rahhal, M.M.A., Dayil, R.A., and Ajlan, N.A. (2021). Vision Transformers for Remote Sensing Image Classification. Remote Sens., 13.","DOI":"10.3390\/rs13030516"},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Xu, Z., Zhang, W., Zhang, T., Yang, Z., and Li, J. (2021). Efficient Transformer for Remote Sensing Image Segmentation. 
Remote Sens., 13.","DOI":"10.3390\/rs13183585"},{"key":"ref_67","unstructured":"Hendrycks, D., and Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv."},{"key":"ref_68","unstructured":"Nair, V., and Hinton, G.E. (2010, January 21\u201324). Rectified linear units improve restricted boltzmann machines. Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel."},{"key":"ref_69","unstructured":"Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). Layer normalization. arXiv."},{"key":"ref_70","unstructured":"Brock, A., De, S., and Smith, S.L. (2021, January 3\u20137). Characterizing signal propagation to close the performance gap in unnormalized ResNets. Proceedings of the ICLR 2021: The Ninth International Conference on Learning Representations, Virtual Event."},{"key":"ref_71","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 6\u201311). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Wu, Y., and He, K. (2018, January 8\u201314). Group normalization. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01261-8_1"},{"key":"ref_73","doi-asserted-by":"crossref","first-page":"818","DOI":"10.1109\/TGRS.2012.2205158","article-title":"Geographic image retrieval using local invariant features","volume":"51","author":"Yang","year":"2012","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Zhang, R., Isola, P., and Efros, A.A. (2016, January 11\u201314). Colorful image colorization. 
Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46487-9_40"},{"key":"ref_75","doi-asserted-by":"crossref","first-page":"3965","DOI":"10.1109\/TGRS.2017.2685945","article-title":"AID: A benchmark data set for performance evaluation of aerial scene classification","volume":"55","author":"Xia","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_76","doi-asserted-by":"crossref","unstructured":"Bazi, Y., Al Rahhal, M.M., Alhichri, H., and Alajlan, N. (2019). Simple yet effective fine-tuning of deep CNNs using an auxiliary classification loss for remote sensing scene classification. Remote Sens., 11.","DOI":"10.3390\/rs11242908"},{"key":"ref_77","doi-asserted-by":"crossref","first-page":"2636","DOI":"10.1109\/TNNLS.2020.3007412","article-title":"C-CNN: Contourlet convolutional neural networks","volume":"32","author":"Liu","year":"2020","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_78","doi-asserted-by":"crossref","unstructured":"Zhao, Z., Luo, Z., Li, J., Chen, C., and Piao, Y. (2020). When self-supervised learning meets scene classification: Remote sensing scene classification based on a multitask learning framework. Remote Sens., 12.","DOI":"10.3390\/rs12203276"},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Liu, Y., Zhong, Y., Fei, F., Zhu, Q., and Qin, Q. (2018). Scene classification based on a deep random-scale stretched convolutional neural network. Remote Sens., 10.","DOI":"10.3390\/rs10030444"},{"key":"ref_80","doi-asserted-by":"crossref","first-page":"119951","DOI":"10.1109\/ACCESS.2020.3005450","article-title":"A new image recognition and classification method combining transfer learning algorithm and mobilenet model for welding defects","volume":"8","author":"Pan","year":"2020","journal-title":"IEEE Access"},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Xie, S., Girshick, R., Doll\u00e1r, P., Tu, Z., and He, K. 
(2017, January 22\u201325). Aggregated residual transformations for deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.634"},{"key":"ref_82","doi-asserted-by":"crossref","first-page":"2636","DOI":"10.1109\/JSTARS.2019.2919317","article-title":"A lightweight and discriminative model for remote sensing scene classification with multidilation pooling module","volume":"12","author":"Zhang","year":"2019","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"136668","DOI":"10.1109\/ACCESS.2020.3005044","article-title":"Automatic detection and monitoring of diabetic retinopathy using efficient convolutional neural networks and contrast limited adaptive histogram equalization","volume":"8","author":"Pour","year":"2020","journal-title":"IEEE Access"},{"key":"ref_84","doi-asserted-by":"crossref","unstructured":"Aral, R.A., Keskin, \u015e.R., Kaya, M., and Hac\u0131\u00f6mero\u011flu, M. (2018, January 11\u201314). Classification of trashnet dataset based on deep learning models. Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Boston, MA, USA.","DOI":"10.1109\/BigData.2018.8622212"},{"key":"ref_85","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1109\/LGRS.2017.2731997","article-title":"Remote sensing image scene classification using bag of convolutional features","volume":"14","author":"Cheng","year":"2017","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_86","doi-asserted-by":"crossref","unstructured":"Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017, January 22\u201329). Grad-cam: Visual explanations from deep networks via gradient-based localization. 
Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.74"},{"key":"ref_87","unstructured":"Springenberg, J.T., Dosovitskiy, A., Brox, T., and Riedmiller, M.A. (2015, January 7\u20139). Striving for Simplicity: The All Convolutional Net. Proceedings of the ICLR (Workshop Track), San Diego, CA, USA ."},{"key":"ref_88","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017, January 22\u201325). Feature Pyramid Networks for Object Detection. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.106"},{"key":"ref_89","unstructured":"Cheng, B., Schwing, A.G., and Kirillov, A. (2021). Per-pixel classification is not all you need for semantic segmentation. arXiv."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/20\/4143\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:15:56Z","timestamp":1760166956000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/20\/4143"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,16]]},"references-count":89,"journal-issue":{"issue":"20","published-online":{"date-parts":[[2021,10]]}},"alternative-id":["rs13204143"],"URL":"https:\/\/doi.org\/10.3390\/rs13204143","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,10,16]]}}}