{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,10]],"date-time":"2026-05-10T14:26:28Z","timestamp":1778423188335,"version":"3.51.4"},"reference-count":30,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2023,4,28]],"date-time":"2023-04-28T00:00:00Z","timestamp":1682640000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Applied Sciences"],"abstract":"<jats:p>Transformers are models that implement a self-attention mechanism, individually weighting the importance of each part of the input data. Their use in image classification tasks is still somewhat limited, since researchers have so far preferred Convolutional Neural Networks for image classification, while transformers were targeted mainly at Natural Language Processing (NLP) tasks. This paper therefore presents a literature review of the differences between Vision Transformers (ViT) and Convolutional Neural Networks. The state of the art in applying the two architectures to image classification is reviewed, and an attempt is made to understand which factors may influence their performance, based on the datasets used, image size, number of target classes (for the classification problems), hardware, and the architectures evaluated, together with their top results. The objective of this work is to identify which of the two architectures is the best for image classification and under what conditions. 
This paper also describes the importance of the Multi-Head Attention mechanism for improving the performance of ViT in image classification.<\/jats:p>","DOI":"10.3390\/app13095521","type":"journal-article","created":{"date-parts":[[2023,5,1]],"date-time":"2023-05-01T12:14:08Z","timestamp":1682943248000},"page":"5521","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":504,"title":["Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review"],"prefix":"10.3390","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8234-9481","authenticated-orcid":false,"given":"Jos\u00e9","family":"Maur\u00edcio","sequence":"first","affiliation":[{"name":"Polytechnic of Coimbra, Coimbra Institute of Engineering (ISEC), Rua Pedro Nunes, 3030-199 Coimbra, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2334-7280","authenticated-orcid":false,"given":"In\u00eas","family":"Domingues","sequence":"additional","affiliation":[{"name":"Polytechnic of Coimbra, Coimbra Institute of Engineering (ISEC), Rua Pedro Nunes, 3030-199 Coimbra, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9660-2011","authenticated-orcid":false,"given":"Jorge","family":"Bernardino","sequence":"additional","affiliation":[{"name":"Polytechnic of Coimbra, Coimbra Institute of Engineering (ISEC), Rua Pedro Nunes, 3030-199 Coimbra, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2023,4,28]]},"reference":[{"key":"ref_1","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_2","unstructured":"Saha, S. (2023, January 08). A Comprehensive Guide to Convolutional Neural Networks\u2014The ELI5 Way. 
Available online: https:\/\/towardsdatascience.com\/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1016\/j.jbusres.2019.07.039","article-title":"Literature Review as a Research Methodology: An Overview and Guidelines","volume":"104","author":"Snyder","year":"2019","journal-title":"J. Bus. Res."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"98754","DOI":"10.1109\/ACCESS.2021.3095559","article-title":"Software Defect Prediction Using Ensemble Learning: A Systematic Literature Review","volume":"9","author":"Matloob","year":"2021","journal-title":"IEEE Access"},{"key":"ref_5","unstructured":"Benz, P., Ham, S., Zhang, C., Karjauv, A., and Kweon, I.S. (2021). Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs. arXiv."},{"key":"ref_6","unstructured":"Bai, Y., Mei, J., Yuille, A., and Xie, C. (2021). Are Transformers More Robust Than CNNs?. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Tyagi, K., Pathak, G., Nijhawan, R., and Mittal, A. (2021, January 2). Detecting Pneumonia Using Vision Transformer and Comparing with Other Techniques. Proceedings of the 2021 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA), IEEE, Coimbatore, India.","DOI":"10.1109\/ICECA52323.2021.9676146"},{"key":"ref_8","unstructured":"Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., and Dosovitskiy, A. (2021). Do Vision Transformers See Like Convolutional Neural Networks?. arXiv."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Gheflati, B., and Rivaz, H. (2021). Vision Transformer for Classification of Breast Ultrasound Images. arXiv.","DOI":"10.1109\/EMBC48229.2022.9871809"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Zhou, H.-Y., Lu, C., Yang, S., and Yu, Y. (2021, January 17). ConvNets vs. 
Transformers: Whose Visual Representations Are More Transferable?. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision Workshops (ICCVW), IEEE, Montreal, BC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00252"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"245016","DOI":"10.1088\/1361-6560\/ac3dc8","article-title":"A Vision Transformer for Emphysema Classification Using CT Images","volume":"66","author":"Wu","year":"2021","journal-title":"Phys. Med. Biol."},{"key":"ref_12","first-page":"1","article-title":"Comparing Vision Transformers and Convolutional Nets for Safety Critical Systems","volume":"3087","author":"Filipiuk","year":"2022","journal-title":"AAAI Workshop Artif. Intell. Saf."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Galdran, A., Carneiro, G., and Ballester, M.A.G. (2022). Convolutional Nets Versus Vision Transformers for Diabetic Foot Ulcer Classification. arXiv.","DOI":"10.1007\/978-3-030-94907-5_2"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Cuenat, S., and Couturier, R. (2022, January 18). Convolutional Neural Network (CNN) vs Vision Transformer (ViT) for Digital Holography. Proceedings of the 2022 2nd International Conference on Computer, Control and Robotics (ICCCR), IEEE, Shanghai, China.","DOI":"10.1109\/ICCCR54399.2022.9790134"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Coccomini, D.A., Caldelli, R., Falchi, F., Gennaro, C., and Amato, G. (2022, January 27\u201330). Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection. Proceedings of the 1st International Workshop on Multimedia AI against Disinformation, Newark, NJ, USA.","DOI":"10.1145\/3512732.3533582"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Wang, H. (2022, January 27\u201329). Traffic Sign Recognition with Vision Transformers. 
Proceedings of the 6th International Conference on Information System and Data Mining, Silicon Valley, CA, USA.","DOI":"10.1145\/3546157.3546166"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"105939","DOI":"10.1016\/j.compbiomed.2022.105939","article-title":"An Improved Transformer Network for Skin Cancer Classification","volume":"149","author":"Xin","year":"2022","journal-title":"Comput. Biol. Med."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"101846","DOI":"10.1016\/j.ecoinf.2022.101846","article-title":"CNN and Transformer Framework for Insect Pest Classification","volume":"72","author":"Peng","year":"2022","journal-title":"Ecol. Inform."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"461","DOI":"10.1016\/j.neunet.2022.06.038","article-title":"Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead","volume":"153","author":"Bakhtiarnia","year":"2022","journal-title":"Neural Netw."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"104316","DOI":"10.1016\/j.autcon.2022.104316","article-title":"Vision Transformer-Based Autonomous Crack Detection on Asphalt and Concrete Surfaces","volume":"140","author":"Xu","year":"2022","journal-title":"Autom. Constr."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Reedha, R., Dericquebourg, E., Canals, R., and Hafiane, A. (2022). Vision Transformers for Weeds and Crops Classification of High Resolution UAV Images. Remote Sens., 14.","DOI":"10.3390\/rs14030592"},{"key":"ref_22","unstructured":"Platt, J., Koller, D., Singer, Y., and Roweis, S. (2007). Advances in Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_23","unstructured":"Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. (2020). Sharpness-Aware Minimization for Efficiently Improving Generalization. 
arXiv."},{"key":"ref_24","first-page":"747","article-title":"The Extragradient Method for Finding Saddle Points and Other Problems","volume":"12","author":"Korpelevich","year":"1976","journal-title":"Ekon. Mat. Metod."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"104863","DOI":"10.1016\/j.dib.2019.104863","article-title":"Dataset of Breast Ultrasound Images","volume":"28","author":"Gomaa","year":"2020","journal-title":"Data Brief"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1218","DOI":"10.1109\/JBHI.2017.2731873","article-title":"Automated Breast Ultrasound Lesions Detection Using Convolutional Neural Networks","volume":"22","author":"Yap","year":"2018","journal-title":"IEEE J. Biomed. Health Inform."},{"key":"ref_27","unstructured":"Zhang, R. (2019). Making Convolutional Networks Shift-Invariant Again. arXiv."},{"key":"ref_28","first-page":"3762","article-title":"Attention Is All You Need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Neural Inf. Process. Syst."},{"key":"ref_29","unstructured":"Zhou, D., Kang, B., Jin, X., Yang, L., Lian, X., Jiang, Z., Hou, Q., and Feng, J. (2021). DeepViT: Towards Deeper Vision Transformer. arXiv."},{"key":"ref_30","unstructured":"Amorim, J.P., Domingues, I., Abreu, P.H., and Santos, J.A.M. (2018, January 25\u201327). Interpreting Deep Learning Models for Ordinal Problems. 
Proceedings of the European Symposium on Artificial Neural Networks, Bruges, Belgium."}],"container-title":["Applied Sciences"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2076-3417\/13\/9\/5521\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:26:11Z","timestamp":1760124371000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2076-3417\/13\/9\/5521"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,28]]},"references-count":30,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2023,5]]}},"alternative-id":["app13095521"],"URL":"https:\/\/doi.org\/10.3390\/app13095521","relation":{},"ISSN":["2076-3417"],"issn-type":[{"value":"2076-3417","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,4,28]]}}}