{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T18:23:35Z","timestamp":1775067815752,"version":"3.50.1"},"reference-count":56,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2025,5,12]],"date-time":"2025-05-12T00:00:00Z","timestamp":1747008000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Diabetic retinopathy (DR) is a leading cause of vision impairment worldwide, requiring early detection for effective treatment. Deep learning models have been widely used for automated DR classification, with Convolutional Neural Networks (CNNs) being the most established approach. Recently, Vision Transformers (ViTs) have shown promise, but a direct comparison of their performance and interpretability remains limited. Additionally, hybrid models that combine CNN and transformer-based architectures have not been extensively studied. This work systematically evaluates CNNs (ResNet-50), ViTs (Vision Transformer and SwinV2-Tiny), and hybrid models (Convolutional Vision Transformer, LeViT-256, and CvT-13) on DR classification using publicly available retinal image datasets. The models are assessed based on classification accuracy and interpretability, applying Grad-CAM and Attention-Rollout to analyze decision-making patterns. Results indicate that hybrid models outperform both standalone CNNs and ViTs, achieving a better balance between local feature extraction and global context awareness. The best-performing model (CvT-13) achieved a Quadratic Weighted Kappa (QWK) score of 0.84 and an AUC of 0.93 on the test set. Interpretability analysis shows that CNNs focus on fine-grained lesion details, while ViTs exhibit broader but less localized attention. These findings provide valuable insights for optimizing deep learning models in medical imaging, supporting the development of clinically viable AI-driven DR screening systems.<\/jats:p>","DOI":"10.3390\/computers14050187","type":"journal-article","created":{"date-parts":[[2025,5,12]],"date-time":"2025-05-12T10:58:07Z","timestamp":1747047487000},"page":"187","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Interpretable Deep Learning for Diabetic Retinopathy: A Comparative Study of CNN, ViT, and Hybrid Architectures"],"prefix":"10.3390","volume":"14","author":[{"given":"Weijie","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Computing, Communication and Business, Hochschule f\u00fcr Technik und Wirtschaft, University of Applied Sciences for Engineering and Economics, 10318 Berlin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-3833-5882","authenticated-orcid":false,"given":"Veronika","family":"Belcheva","sequence":"additional","affiliation":[{"name":"School of Computing, Communication and Business, Hochschule f\u00fcr Technik und Wirtschaft, University of Applied Sciences for Engineering and Economics, 10318 Berlin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0864-3302","authenticated-orcid":false,"given":"Tatiana","family":"Ermakova","sequence":"additional","affiliation":[{"name":"School of Computing, Communication and Business, Hochschule f\u00fcr Technik und Wirtschaft, University of Applied Sciences for Engineering and Economics, 10318 Berlin, Germany"}]}],"member":"1968","published-online":{"date-parts":[[2025,5,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1643","DOI":"10.2337\/dc15-2171","article-title":"Global Estimates on the Number of People Blind or Visually Impaired by Diabetic Retinopathy: A Meta-analysis From 1990 to 2010","volume":"39","author":"Leasher","year":"2016","journal-title":"Diabetes Care"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"107840","DOI":"10.1016\/j.diabres.2019.107840","article-title":"IDF Diabetes Atlas: A review of studies utilising retinal photography on the global prevalence of diabetes related retinopathy between 2015 and 2018","volume":"157","author":"Thomas","year":"2019","journal-title":"Diabetes Res. Clin. Pract."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"e015024","DOI":"10.1136\/bmjopen-2016-015024","article-title":"Retrospective analysis of newly recorded certifications of visual impairment due to diabetic retinopathy in Wales during 2007\u20132015","volume":"7","author":"Thomas","year":"2017","journal-title":"BMJ Open"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Laurik-Feuerstein, K.L., Sapahia, R., Cabrera DeBuc, D., and Somfai, G.M. (2022). The assessment of fundus image quality labeling reliability among graders with different backgrounds. PLoS ONE, 17.","DOI":"10.1371\/journal.pone.0271156"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1167\/iovs.64.15.47","article-title":"The Diabetic Retinopathy \u201cPandemic\u201d and Evolving Global Strategies: The 2023 Friedenwald Lecture","volume":"64","author":"Wong","year":"2023","journal-title":"Investig. Opthalmology Vis. Sci."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Yagin, F.H., Yasar, S., Gormez, Y., Yagin, B., Pinar, A., Alkhateeb, A., and Ardig\u00f2, L.P. (2023). Explainable Artificial Intelligence Paves the Way in Precision Diagnostics and Biomarker Discovery for the Subclass of Diabetic Retinopathy in Type 2 Diabetics. Metabolites, 13.","DOI":"10.3390\/metabo13121204"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"149015","DOI":"10.1016\/j.gene.2024.149015","article-title":"Machine learning-based identification and validation of immune-related biomarkers for early diagnosis and targeted therapy in diabetic retinopathy","volume":"934","author":"Tao","year":"2025","journal-title":"Gene"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Adak, C., Karkera, T., Chattopadhyay, S., and Saqib, M. (2023). Detecting Severity of Diabetic Retinopathy from Fundus Images using Ensembled Transformers. arXiv.","DOI":"10.1016\/j.neucom.2024.127991"},{"key":"ref_9","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2023). Attention Is All You Need. arXiv."},{"key":"ref_10","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"7850","DOI":"10.1002\/mp.15312","article-title":"Vision Transformer-based recognition of diabetic retinopathy grade","volume":"48","author":"Wu","year":"2021","journal-title":"Med. Phys."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"948","DOI":"10.3390\/biomedinformatics3040058","article-title":"Federated Learning for Diabetic Retinopathy Detection Using Vision Transformers","volume":"3","author":"Chetoui","year":"2023","journal-title":"BioMedInformatics"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Mohan, N.J., Murugan, R., Goel, T., and Roy, P. (2022, January 16\u201318). ViT-DR: Vision Transformers in Diabetic Retinopathy Grading Using Fundus Images. Proceedings of the 2022 IEEE 10th Region 10 Humanitarian Technology Conference (R10-HTC), Hyderabad, India.","DOI":"10.1109\/R10-HTC54060.2022.9930027"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"117546","DOI":"10.1109\/ACCESS.2023.3326528","article-title":"Vision Transformer Model for Predicting the Severity of Diabetic Retinopathy in Fundus Photography-Based Retina Images","volume":"11","author":"Nazih","year":"2023","journal-title":"IEEE Access"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Kumar, N.S., and Ramaswamy Karthikeyan, B. (2021, January 16\u201319). Diabetic Retinopathy Detection using CNN, Transformer and MLP based Architectures. Proceedings of the 2021 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Hualien City, Taiwan.","DOI":"10.1109\/ISPACS51563.2021.9651024"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Sun, R., Li, Y., Zhang, T., Mao, Z., Wu, F., and Zhang, Y. (2021, January 20\u201325). Lesion-Aware Transformers for Diabetic Retinopathy Grading. Proceedings of the 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.01079"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Yang, Y., Cai, Z., Qiu, S., and Xu, P. (2024). Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image. PLoS ONE, 19.","DOI":"10.1371\/journal.pone.0299265"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"4787","DOI":"10.1007\/s00521-023-09304-3","article-title":"CTNet: Convolutional Transformer Network for Diabetic Retinopathy Classification","volume":"36","author":"Bala","year":"2024","journal-title":"Neural Comput. Appl."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Wang, Z., Lu, H., Yan, H., Kan, H., and Jin, L. (2023). Vison Transformer Adapter-Based Hyperbolic Embeddings for Multi-Lesion Segmentation in Diabetic Retinopathy. Sci. Rep., 13.","DOI":"10.1038\/s41598-023-38320-5"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zang, F., and Ma, H. (2024). CRA-Net: Transformer guided category-relation attention network for diabetic retinopathy grading. Comput. Biol. Med., 170.","DOI":"10.1016\/j.compbiomed.2024.107993"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"102608","DOI":"10.1016\/j.media.2022.102608","article-title":"Focused Attention in Transformers for interpretable classification of retinal images","volume":"82","author":"Playout","year":"2022","journal-title":"Med. Image Anal."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017, January 22\u201329). Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.74"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Abnar, S., and Zuidema, W. (2020). Quantifying Attention Flow in Transformers. arXiv.","DOI":"10.18653\/v1\/2020.acl-main.385"},{"key":"ref_24","unstructured":"Band, N., Rudner, T.G.J., Feng, Q., Filos, A., Nado, Z., Dusenberry, M.W., Jerfel, G., Tran, D., and Gal, Y. (2022). Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Lee, C.H., and Ke, Y.H. (2021, January 25\u201327). Fundus images classification for Diabetic Retinopathy using Deep Learning. Proceedings of the 13th International Conference on Computer Modeling and Simulation, ICCMS \u201921, New York, NY, USA.","DOI":"10.1145\/3474963.3475849"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Halder, A., Gharami, S., Sadhu, P., Singh, P.K., Wo\u017aniak, M., and Ijaz, M.F. (2024). Implementing vision transformer for classifying 2D biomedical images. Sci. Rep., 14.","DOI":"10.1038\/s41598-024-63094-9"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Philippi, D., Rothaus, K., and Castelli, M. (2023). A vision transformer architecture for the automated segmentation of retinal lesions in spectral domain optical coherence tomography images. Sci. Rep., 13.","DOI":"10.1038\/s41598-023-27616-1"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"He, J., Wang, J., Han, Z., Ma, J., Wang, C., and Qi, M. (2023). An interpretable transformer network for the retinal disease classification using optical coherence tomography. Sci. Rep., 13.","DOI":"10.1038\/s41598-023-30853-z"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"100552","DOI":"10.1016\/j.xops.2024.100552","article-title":"Comparative Analysis of Vision Transformers and Conventional Convolutional Neural Networks in Detecting Referable Diabetic Retinopathy","volume":"4","author":"Goh","year":"2024","journal-title":"Ophthalmol. Sci."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Touati, M., Touati, R., Nana, L., Benzarti, F., and Ben Yahia, S. (2025). DRCCT: Enhancing Diabetic Retinopathy Classification with a Compact Convolutional Transformer. Big Data Cogn. Comput., 9.","DOI":"10.3390\/bdcc9010009"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Sassi Hidri, M., Hidri, A., Alsaif, S.A., Alahmari, M., and AlShehri, E. (2025). Optimal Convolutional Networks for Staging and Detecting of Diabetic Retinopathy. Information, 16.","DOI":"10.3390\/info16030221"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Asia, A.O., Zhu, C.Z., Althubiti, S.A., Al-Alimi, D., Xiao, Y.L., Ouyang, P.B., and Al-Qaness, M.A.A. (2022). Detection of Diabetic Retinopathy in Retinal Fundus Images Using CNN Classification Models. Electronics, 11.","DOI":"10.3390\/electronics11172740"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Akhtar, S., Aftab, S., Ali, O., Ahmad, M., Khan, M.A., Abbas, S., and Ghazal, T.M. (2025). A deep learning based model for diabetic retinopathy grading. Sci. Rep., 15.","DOI":"10.1038\/s41598-025-87171-9"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Xue, J., Wu, J., Bian, Y., Zhang, S., and Du, Q. (2024). Classification of Diabetic Retinopathy Based on Efficient Computational Modeling. Appl. Sci., 14.","DOI":"10.3390\/app142311327"},{"key":"ref_35","unstructured":"Dugas, E., Jared, J., and Cukierski, W. (2024, May 21). Diabetic Retinopathy Detection. Available online: https:\/\/kaggle.com\/competitions\/diabetic-retinopathy-detection."},{"key":"ref_36","unstructured":"Maggie, K., and Dane, S. (2024, May 21). APTOS 2019 Blindness Detection. Available online: https:\/\/kaggle.com\/competitions\/aptos2019-blindness-detection."},{"key":"ref_37","first-page":"10","article-title":"Comparing the International Clinical Diabetic Retinopathy (ICDR) severity scale","volume":"36","author":"Cleland","year":"2023","journal-title":"Community Eye Health"},{"key":"ref_38","unstructured":"Graham, B. (2024, May 21). Diabetic Retinopathy Detection Competition Report. Available online: https:\/\/storage.googleapis.com\/kaggle-forum-message-attachments\/88655\/2795\/competitionreport.pdf."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_40","unstructured":"Tan, M., and Le, Q.V. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv."},{"key":"ref_41","unstructured":"Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., and El-Nouby, A. (2024). DINOv2: Learning Robust Visual Features without Supervision. arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., and Dong, L. (2022, January 18\u201324). Swin Transformer V2: Scaling Up Capacity and Resolution. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01170"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jegou, H., and Douze, M. (2021, January 11\u201317). LeViT: A Vision Transformer in ConvNet\u2019s Clothing for Faster Inference. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Los Alamitos, CA, USA.","DOI":"10.1109\/ICCV48922.2021.01204"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., and Zhang, L. (2021, January 10\u201317). CvT: Introducing Convolutions to Vision Transformers. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00009"},{"key":"ref_45","unstructured":"Microsoft (2024, June 03). ResNet-50. Available online: https:\/\/huggingface.co\/microsoft\/resnet-50."},{"key":"ref_46","unstructured":"Google (2024, June 03). EfficientNet-B0. Available online: https:\/\/huggingface.co\/google\/efficientnet-b0."},{"key":"ref_47","unstructured":"WinKawaks (2024, June 03). vit-small-patch16-224. Available online: https:\/\/huggingface.co\/WinKawaks\/vit-small-patch16-224."},{"key":"ref_48","unstructured":"Facebook (2024, June 03). DINOv2-Small-ImageNet1K-1-Layer. Available online: https:\/\/huggingface.co\/facebook\/dinov2-small-imagenet1k-1-layer."},{"key":"ref_49","unstructured":"Microsoft (2024, June 03). SwinV2-Tiny-Patch4-Window16-256. Available online: https:\/\/huggingface.co\/microsoft\/swinv2-tiny-patch4-window16-256."},{"key":"ref_50","unstructured":"Facebook (2024, June 03). LeViT-256. Available online: https:\/\/huggingface.co\/facebook\/levit-256."},{"key":"ref_51","unstructured":"Microsoft (2024, June 03). CvT-13. Available online: https:\/\/huggingface.co\/microsoft\/cvt-13."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1177\/001316446002000104","article-title":"A Coefficient of Agreement for Nominal Scales","volume":"20","author":"Cohen","year":"1960","journal-title":"Educ. Psychol. Meas."},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1037\/h0026256","article-title":"Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit","volume":"70","author":"Cohen","year":"1968","journal-title":"Psychol. Bull."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1186\/s40537-019-0192-5","article-title":"Survey on deep learning with class imbalance","volume":"6","author":"Johnson","year":"2019","journal-title":"J. Big Data"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Yin, H., Gao, Y., Chen, S., Wen, Y., Cai, G., Gu, T., Du, J., Tall\u00f3n-Ballesteros, A.J., and Zhang, M. (2017). Markov Random Field Based Convolutional Neural Networks for Image Classification. Intelligent Data Engineering and Automated Learning\u2014IDEAL 2017, Springer.","DOI":"10.1007\/978-3-319-68935-7"},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"656","DOI":"10.3348\/kjr.2024.0049","article-title":"Statistical Methods for Comparing Predictive Values in Medical Diagnosis","volume":"25","author":"Park","year":"2024","journal-title":"Korean J. Radiol."}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/5\/187\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:31:24Z","timestamp":1760031084000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/5\/187"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,12]]},"references-count":56,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2025,5]]}},"alternative-id":["computers14050187"],"URL":"https:\/\/doi.org\/10.3390\/computers14050187","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,12]]}}}