{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T09:54:21Z","timestamp":1776419661703,"version":"3.51.2"},"reference-count":60,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2021,2,1]],"date-time":"2021-02-01T00:00:00Z","timestamp":1612137600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"The authors extend their appreciation to the Researchers Supporting Project number (RSP-2020\/69), King Saud University, Riyadh, Saudi Arabia","award":["RSP-2020\/69"],"award-info":[{"award-number":["RSP-2020\/69"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>In this paper, we propose a remote-sensing scene-classification method based on vision transformers. These types of networks, which are now recognized as state-of-the-art models in natural language processing, do not rely on convolution layers as in standard convolutional neural networks (CNNs). Instead, they use multihead attention mechanisms as the main building block to derive long-range contextual relation between pixels in images. In a first step, the images under analysis are divided into patches, then converted to sequence by flattening and embedding. To keep information about the position, embedding position is added to these patches. Then, the resulting sequence is fed to several multihead attention layers for generating the final representation. At the classification stage, the first token sequence is fed to a softmax classification layer. To boost the classification performance, we explore several data augmentation strategies to generate additional data for training. Moreover, we show experimentally that we can compress the network by pruning half of the layers while keeping competing classification accuracies. Experimental results conducted on different remote-sensing image datasets demonstrate the promising capability of the model compared to state-of-the-art methods. Specifically, Vision Transformer obtains an average classification accuracy of 98.49%, 95.86%, 95.56% and 93.83% on Merced, AID, Optimal31 and NWPU datasets, respectively. While the compressed version obtained by removing half of the multihead attention layers yields 97.90%, 94.27%, 95.30% and 93.05%, respectively.<\/jats:p>","DOI":"10.3390\/rs13030516","type":"journal-article","created":{"date-parts":[[2021,2,1]],"date-time":"2021-02-01T11:40:48Z","timestamp":1612179648000},"page":"516","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":557,"title":["Vision Transformers for Remote Sensing Image Classification"],"prefix":"10.3390","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9287-0596","authenticated-orcid":false,"given":"Yakoub","family":"Bazi","sequence":"first","affiliation":[{"name":"Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"}]},{"given":"Laila","family":"Bashmal","sequence":"additional","affiliation":[{"name":"Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8105-9746","authenticated-orcid":false,"given":"Mohamad M. 
Al","family":"Rahhal","sequence":"additional","affiliation":[{"name":"Applied Computer Science Department, College of Applied Computer Science, King Saud University, Riyadh 11543, Saudi Arabia"}]},{"given":"Reham Al","family":"Dayil","sequence":"additional","affiliation":[{"name":"Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1846-1131","authenticated-orcid":false,"given":"Naif Al","family":"Ajlan","sequence":"additional","affiliation":[{"name":"Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia"}]}],"member":"1968","published-online":{"date-parts":[[2021,2,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"6026","DOI":"10.3390\/rs5116026","article-title":"Exploring the use of google earth imagery and object-based methods in land use\/cover mapping","volume":"5","author":"Hu","year":"2013","journal-title":"Remote Sens."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1016\/j.isprsjprs.2015.10.004","article-title":"Remote sensing platforms and sensors: A survey","volume":"115","author":"Toth","year":"2016","journal-title":"ISPRS J. Photogramm. Remote Sens."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"121","DOI":"10.3141\/1855-15","article-title":"Microscopic traffic data collection by remote sensing","volume":"1855","author":"Hoogendoorn","year":"2003","journal-title":"Transp. Res. Rec."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Valavanis, K.P. (2008). Advances in Unmanned Aerial Vehicles: State of the Art and the Road to Autonomy, Springer Science & Business Media.","DOI":"10.1007\/978-1-4020-6114-1"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Sheppard, C., and Rahnemoonfar, M. (2017, January 23\u201328). Real-time scene understanding for UAV imagery based on deep convolutional neural networks. Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA.","DOI":"10.1109\/IGARSS.2017.8127435"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Al-Najjar, H.A.H., Kalantar, B., Pradhan, B., Saeidi, V., Halin, A.A., Ueda, N., and Mansor, S. (2019). Land cover classification from fused DSM and UAV images using convolutional neural networks. Remote Sens., 11.","DOI":"10.3390\/rs11121461"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"328","DOI":"10.1016\/j.rse.2018.06.031","article-title":"A fully learnable context-driven object-based model for mapping land cover using multi-view data from unmanned aircraft systems","volume":"216","author":"Liu","year":"2018","journal-title":"Remote Sens. Environ."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Bazi, Y. (August, January 28). Two-branch neural network for learning multi-label classification in UAV imagery. Proceedings of the IGARSS 2019\u20142019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan.","DOI":"10.1109\/IGARSS.2019.8898895"},{"key":"ref_9","first-page":"302","article-title":"Use of remote sensing and GIS for sustainable land management","volume":"3","author":"Skidmore","year":"1997","journal-title":"ITC J."},{"key":"ref_10","unstructured":"Xiao, Y., and Zhan, Q. (2009, January 20\u201322). A review of remote sensing applications in urban planning and management in China. 
Proceedings of the 2009 Joint Urban Remote Sensing Event, Shanghai, China."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"111340","DOI":"10.1016\/j.rse.2019.111340","article-title":"Spectral mixture analysis in google earth engine to model and delineate fire scars over a large extent and a long time-series in a rainforest-savanna transition zone","volume":"232","author":"Daldegan","year":"2019","journal-title":"Remote Sens. Environ."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"2037","DOI":"10.1109\/TPAMI.2006.244","article-title":"Face description with local binary patterns: Application to face recognition","volume":"28","author":"Ahonen","year":"2006","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"886","DOI":"10.1109\/CVPR.2005.177","article-title":"Histograms of oriented gradients for human detection","volume":"Volume 1","author":"Dalal","year":"2005","journal-title":"Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR\u201905)"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1551","DOI":"10.1109\/LGRS.2015.2412955","article-title":"Multispectral image alignment with nonlinear scale-invariant keypoint and enhanced local feature matrix","volume":"12","author":"Li","year":"2015","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., and Freeman, W.T. (2005, January 17\u201321). Discovering objects and their location in images. Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV\u201905), Beijing, China.","DOI":"10.1109\/ICCV.2005.77"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Huang, L., Chen, C., Li, W., and Du, Q. (2016). Remote sensing image scene classification using multi-scale completed local binary patterns and fisher vectors. Remote Sens., 8.","DOI":"10.3390\/rs8060483"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Imbriaco, R., Sebastian, C., Bondarev, E., and de With, P.H.N. (2019). Aggregated deep local features for remote sensing image retrieval. Remote Sens., 11.","DOI":"10.3390\/rs11050493"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1109\/LGRS.2015.2498644","article-title":"Efficient saliency-based object detection in remote sensing images using deep belief networks","volume":"13","author":"Diao","year":"2016","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"2094","DOI":"10.1109\/JSTARS.2014.2329330","article-title":"Deep learning-based classification of hyperspectral data","volume":"7","author":"Chen","year":"2014","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Nogueira, K., Miranda, W.O., and Santos, J.A.D. (2015, January 26\u201329). Improving spatial feature representation from aerial scenes by using convolutional networks. Proceedings of the 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Brazil.","DOI":"10.1109\/SIBGRAPI.2015.39"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"105","DOI":"10.1109\/LGRS.2015.2499239","article-title":"Deep learning earth observation classification using imagenet pretrained networks","volume":"13","author":"Marmanis","year":"2016","journal-title":"IEEE Geosci. Remote Sens. 
Lett."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"645","DOI":"10.1109\/TGRS.2016.2612821","article-title":"Convolutional neural networks for large-scale remote-sensing image classification","volume":"55","author":"Maggiori","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1040","DOI":"10.1049\/iet-cvi.2017.0420","article-title":"Recurrent neural networks for remote sensing image classification","volume":"12","author":"Lakhal","year":"2018","journal-title":"IET Comput. Vis."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"5046","DOI":"10.1109\/TGRS.2018.2805286","article-title":"Generative adversarial networks for hyperspectral image classification","volume":"56","author":"Zhu","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"5329","DOI":"10.1109\/TGRS.2019.2899057","article-title":"Classification of hyperspectral images based on multiclass spatial\u2013spectral generative adversarial networks","volume":"57","author":"Feng","year":"2019","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Mou, L., Lu, X., Li, X., and Zhu, X.X. (2020). Nonlocal graph convolutional networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens., 1\u201312.","DOI":"10.1109\/TGRS.2020.2973363"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"4237","DOI":"10.1109\/TGRS.2019.2961947","article-title":"Spatial\u2013spectral feature extraction via deep ConvLSTM neural networks for hyperspectral image classification","volume":"58","author":"Hu","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"4911","DOI":"10.1109\/TIP.2020.2975718","article-title":"A multiple-instance densely-connected ConvNet for aerial scene classification","volume":"29","author":"Bi","year":"2020","journal-title":"IEEE Trans. Image Process."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"519","DOI":"10.1109\/TGRS.2019.2937830","article-title":"Attention GANs: Unsupervised deep feature learning for aerial scene classification","volume":"58","author":"Yu","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Bazi, Y., Al Rahhal, M.M., Alhichri, H., and Alajlan, N. (2019). Simple yet effective fine-tuning of deep CNNs using an auxiliary classification loss for remote sensing scene classification. Remote Sens., 11.","DOI":"10.3390\/rs11242908"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Sun, H., Li, S., Zheng, X., and Lu, X. (2019). Remote sensing scene classification by gated bidirectional network. IEEE Trans. Geosci. Remote Sens., 1\u201315.","DOI":"10.1109\/TGRS.2019.2931801"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"183","DOI":"10.1109\/LGRS.2017.2779469","article-title":"Scene classification based on two-stage deep feature fusion","volume":"15","author":"Liu","year":"2018","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Yu, Y., and Liu, F. (2020, November 20). A Two-Stream Deep Fusion Framework for High-Resolution Aerial Scene Classification. 
Available online: https:\/\/www.hindawi.com\/journals\/cin\/2018\/8639367\/.","DOI":"10.1155\/2018\/8639367"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"2811","DOI":"10.1109\/TGRS.2017.2783902","article-title":"When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs","volume":"56","author":"Cheng","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"28746","DOI":"10.1109\/ACCESS.2020.2968771","article-title":"Remote sensing scene classification based on multi-structure deep features fusion","volume":"8","author":"Xue","year":"2020","journal-title":"IEEE Access"},{"key":"ref_36","unstructured":"Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D.F., and Chao, L.S. (August, January 28). Learning deep transformer models for machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_37","unstructured":"Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. (August, January 28). Transformer-XL: Attentive language models beyond a fixed-length context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1109\/LSP.2020.3044547","article-title":"Non-autoregressive transformer for speech recognition","volume":"28","author":"Chen","year":"2020","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_39","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Bello, I., Zoph, B., Vaswani, A., Shlens, J., and Le, Q.V. (November, January 27). Attention Augmented Convolutional Networks. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00338"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1155","DOI":"10.1109\/TGRS.2018.2864987","article-title":"Scene classification with recurrent attention of VHR remote sensing images","volume":"57","author":"Wang","year":"2019","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_42","unstructured":"Wu, B., Xu, C., Dai, X., Wan, A., Zhang, P., Tomizuka, M., Keutzer, K., and Vajda, P. (2020). Visual transformers: Token-based image representation and processing for computer vision. arXiv."},{"key":"ref_43","unstructured":"Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. (2019). Stand-alone self-attention in vision models. arXiv."},{"key":"ref_44","unstructured":"Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. (2020, January 12\u201318). Generative pretraining from pixels. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria."},{"key":"ref_45","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. 
arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"165","DOI":"10.1109\/TGRS.2019.2934760","article-title":"HSI-BERT: Hyperspectral image classification using the bidirectional encoder representation from transformers","volume":"58","author":"He","year":"2020","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_47","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MI, USA. Long and Short Papers."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q.V. (2019, January 15\u201321). AutoAugment: Learning augmentation strategies from data. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00020"},{"key":"ref_49","unstructured":"Jackson, P.T., Atapour-Abarghouei, A., Bonner, S., Breckon, T.P., and Obara, B. (2019, January 16\u201320). Style Augmentation: Data Augmentation via Style Randomization. Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA."},{"key":"ref_50","unstructured":"Bowles, C., Chen, L., Guerrero, R., Bentley, P., Gunn, R., Hammers, A., Dickie, D.A., Hern\u00e1ndez, M.V., Wardlaw, J., and Rueckert, D. (2018). GAN augmentation: Augmenting training data using generative adversarial networks. arXiv."},{"key":"ref_51","unstructured":"DeVries, T., and Taylor, G.W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv."},{"key":"ref_52","unstructured":"Zhang, H., Cisse, M., Dauphin, Y.N., and Lopez-Paz, D. (2018). Mixup: Beyond empirical risk minimization. arXiv."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Yun, S., Han, D., Chun, S., Oh, S.J., Yoo, Y., and Choe, J. (November, January 27). CutMix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00612"},{"key":"ref_54","unstructured":"Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv."},{"key":"ref_55","unstructured":"Han, S., Mao, H., and Dally, W.J. (2016). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv."},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Wu, J., Leng, C., Wang, Y., Hu, Q., and Cheng, J. (2016, January 27\u201330). Quantized Convolutional Neural Networks for Mobile Devices. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.521"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Yang, Y., and Newsam, S. (2010, January 2\u20135). Bag-of-visual-words and spatial extensions for land-use classification. 
Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems\u2014GIS \u201910, San Jose, CA, USA.","DOI":"10.1145\/1869790.1869829"},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"3965","DOI":"10.1109\/TGRS.2017.2685945","article-title":"AID: A benchmark data set for performance evaluation of aerial scene classification","volume":"55","author":"Xia","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"6899","DOI":"10.1109\/TGRS.2018.2845668","article-title":"Remote sensing scene classification using multilayer stacked covariance pooling","volume":"56","author":"He","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_60","doi-asserted-by":"crossref","first-page":"1865","DOI":"10.1109\/JPROC.2017.2675998","article-title":"Remote sensing image scene classification: Benchmark and state of the art","volume":"105","author":"Cheng","year":"2017","journal-title":"Proc. IEEE"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/3\/516\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:18:18Z","timestamp":1760159898000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/13\/3\/516"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,2,1]]},"references-count":60,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2021,2]]}},"alternative-id":["rs13030516"],"URL":"https:\/\/doi.org\/10.3390\/rs13030516","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,2,1]]}}}
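The abstract in the record above walks through the Vision Transformer classification pipeline (split the image into patches, flatten and embed them, add position embeddings, prepend a class token, run multihead attention layers, and classify from the first token with a softmax layer) and a compression step that prunes half of the layers. The following is a minimal PyTorch sketch of that pipeline, offered only as a reading aid: the class name MiniViT, the ViT-Base-like sizes, and the 21 output classes (the number of classes in the Merced dataset) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (hypothetical names, not the paper's code): the ViT
# pipeline summarized in the abstract, built from standard PyTorch modules.
import torch
import torch.nn as nn


class MiniViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, num_classes=21):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch splitting + flattening + linear embedding as one strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable class token whose final state is used for classification.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned position embeddings keep information about patch positions.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)   # softmax applied in the loss

    def forward(self, x):                          # x: (B, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend class token
        x = self.encoder(x)                        # multihead attention layers
        return self.head(x[:, 0])                  # classify from the first token


model = MiniViT()
logits = model(torch.randn(2, 3, 224, 224))        # -> (2, 21) class scores

# Rough analogue of the compression experiment: keep only the first half of the
# multihead attention layers (12 -> 6) and fine-tune the pruned model afterwards.
model.encoder.layers = nn.ModuleList(list(model.encoder.layers)[:6])
model.encoder.num_layers = 6
```

The final two lines imitate the compression step mentioned in the abstract by retaining only the first half of the encoder layers; according to the reported accuracies, such a pruned model stays close to the full one (for example, 97.90% versus 98.49% on Merced).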