{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T01:17:08Z","timestamp":1775870228315,"version":"3.50.1"},"reference-count":47,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2022,4,13]],"date-time":"2022-04-13T00:00:00Z","timestamp":1649808000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>Traditional automatic pavement distress detection methods using convolutional neural networks (CNNs) require substantial computing time and resources and offer poor interpretability. Therefore, inspired by the successful application of the Transformer architecture in natural language processing (NLP) tasks, a novel Transformer method called LeViT was introduced for automatic asphalt pavement image classification. LeViT consists of convolutional layers, transformer stages in which Multi-layer Perceptron (MLP) and multi-head self-attention blocks alternate with residual connections, and two classifier heads. To evaluate the proposed method, pavement image datasets from three different sources and pre-trained weights based on ImageNet were obtained. The performance of the proposed model was compared with six state-of-the-art (SOTA) deep learning models, all of which were trained using a transfer learning strategy. Compared to the tested SOTA methods, LeViT has less than 1\/8 of the parameters of the original Vision Transformer (ViT) and 1\/2 of those of ResNet and InceptionNet. 
Experimental results show that after training for 100 epochs with a batch size of 16, the proposed method achieved 91.56% accuracy, 91.72% precision, 91.56% recall, and 91.45% F1-score on the Chinese asphalt pavement dataset and 99.17% accuracy, 99.19% precision, 99.17% recall, and 99.17% F1-score on the German asphalt pavement dataset, the best performance among all the tested SOTA models. Moreover, it shows superior inference speed (86 ms\/step), approximately 25% of the inference time of the original ViT method and 80% of that of some prevailing CNN-based models, including DenseNet, VGG, and ResNet. Overall, the proposed method achieves competitive performance at a lower computational cost. In addition, a visualization method combining Grad-CAM and Attention Rollout was proposed to analyze the classification results and explore what has been learned in every MLP and attention block of LeViT, improving the interpretability of the proposed pavement image classification model.<\/jats:p>","DOI":"10.3390\/rs14081877","type":"journal-article","created":{"date-parts":[[2022,4,13]],"date-time":"2022-04-13T23:07:16Z","timestamp":1649891236000},"page":"1877","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":57,"title":["A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8011-1290","authenticated-orcid":false,"given":"Yihan","family":"Chen","sequence":"first","affiliation":[{"name":"Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China"},{"name":"National Demonstration Center for Experimental Road and Traffic Engineering Education, Southeast University, Nanjing 211189, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8561-7223","authenticated-orcid":false,"given":"Xingyu","family":"Gu","sequence":"additional","affiliation":[{"name":"Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China"},{"name":"National Demonstration Center for Experimental Road and Traffic Engineering Education, Southeast University, Nanjing 211189, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8012-7682","authenticated-orcid":false,"given":"Zhen","family":"Liu","sequence":"additional","affiliation":[{"name":"Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China"},{"name":"National Demonstration Center for Experimental Road and Traffic Engineering Education, Southeast University, Nanjing 211189, China"}]},{"given":"Jia","family":"Liang","sequence":"additional","affiliation":[{"name":"Department of Roadway Engineering, School of Transportation, Southeast University, Nanjing 211189, China"},{"name":"National Demonstration Center for Experimental Road and Traffic Engineering Education, Southeast University, Nanjing 211189, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,4,13]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Chen, C., Chandra, S., Han, Y., and Seo, H. (2021). Deep Learning-Based Thermal Image Analysis for Pavement Defect Detection and Classification Considering Complex Pavement Conditions. Remote Sens., 14.","DOI":"10.3390\/rs14010106"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Liu, Z., Wu, W., Gu, X., Li, S., Wang, L., and Zhang, T. (2021). Application of combining YOLO models and 3D GPR images in road detection and maintenance. 
Remote Sens., 13.","DOI":"10.3390\/rs13061081"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1031","DOI":"10.1016\/j.conbuildmat.2018.08.011","article-title":"Comparison of deep convolutional neural networks and edge detectors for image-based crack detection in concrete","volume":"186","author":"Dorafshan","year":"2018","journal-title":"Constr. Build. Mater."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"845","DOI":"10.1016\/j.eng.2020.07.030","article-title":"The state-of-the-art review on applications of intrusive sensing, image processing techniques, and machine learning methods in pavement monitoring and analysis","volume":"7","author":"Hou","year":"2021","journal-title":"Engineering"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"04021024","DOI":"10.1061\/JPEODX.0000280","article-title":"3D visualization of airport pavement quality based on BIM and WebGL integration","volume":"147","author":"Liu","year":"2021","journal-title":"J. Transp. Eng. Part B Pavements"},{"key":"ref_6","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_7","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2015, January 7\u201313). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.123"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017, January 21\u201326). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.243"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"805","DOI":"10.1111\/mice.12297","article-title":"Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network","volume":"32","author":"Zhang","year":"2017","journal-title":"Comput.-Aided Civil Infrastruct. Eng."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"04018041","DOI":"10.1061\/(ASCE)CP.1943-5487.0000775","article-title":"Deep learning\u2013based fully automated pavement crack detection on 3D asphalt surfaces with an improved CrackNet","volume":"32","author":"Zhang","year":"2018","journal-title":"J. Comput. Civil. Eng."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1109\/TITS.2019.2891167","article-title":"Pixel-level cracking detection on 3D asphalt pavement images through deep-learning-based CrackNet-V","volume":"21","author":"Fei","year":"2019","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"04020092","DOI":"10.1061\/JPEODX.0000245","article-title":"MobileCrack: Object classification in asphalt pavements using an adaptive lightweight deep learning","volume":"147","author":"Hou","year":"2021","journal-title":"J. Transp. Eng. Part B Pavements"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Ali, L., Alnajjar, F., Jassmi, H.A., Gochoo, M., Khan, W., and Serhani, M.A. (2021). Performance Evaluation of Deep CNN-Based Crack Detection and Localization Techniques for Concrete Structures. 
Sensors, 21.","DOI":"10.3390\/s21051688"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"9289","DOI":"10.1007\/s00521-021-05690-8","article-title":"Surface crack detection using deep learning with shallow CNN architecture for enhanced computation","volume":"33","author":"Kim","year":"2021","journal-title":"Neural Comput. Appl."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"245016","DOI":"10.1088\/1361-6560\/ac3dc8","article-title":"A vision transformer for emphysema classification using CT images","volume":"66","author":"Wu","year":"2021","journal-title":"Phys. Med. Biol."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"119085","DOI":"10.1016\/j.atmosenv.2022.119085","article-title":"Visibility classification and influencing-factors analysis of airport: A deep learning approach","volume":"278","author":"Liu","year":"2022","journal-title":"Atmos. Environ."},{"key":"ref_19","unstructured":"Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., and Woo, W.-C. (2015, January 7\u201312). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Cho, K., Van Merri\u00ebnboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1111\/mice.12409","article-title":"Automated pixel-level pavement crack detection on 3D asphalt surfaces with a recurrent neural network","volume":"34","author":"Zhang","year":"2019","journal-title":"Comput.-Aided Civil Infrastruct. Eng."},{"key":"ref_22","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. 
(2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Bazi, Y., Bashmal, L., Rahhal, M.M.A., Dayil, R.A., and Ajlan, N.A. (2021). Vision transformers for remote sensing image classification. Remote Sens., 13.","DOI":"10.3390\/rs13030516"},{"key":"ref_24","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_25","unstructured":"Zhou, D., Kang, B., Jin, X., Yang, L., Lian, X., Jiang, Z., Hou, Q., and Feng, J. (2021). Deepvit: Towards deeper vision transformer. arXiv."},{"key":"ref_26","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J\u00e9gou, H. (2021, January 18\u201324). Training data-efficient image transformers & distillation through attention. Proceedings of the International Conference on Machine Learning, online."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Chen, C.-F., Fan, Q., and Panda, R. (2021). Crossvit: Cross-attention multi-scale vision transformer for image classification. arXiv.","DOI":"10.1109\/ICCV48922.2021.00041"},{"key":"ref_28","unstructured":"Mehta, S., and Rastegari, M. (2021). MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Liu, H., Miao, X., Mertz, C., Xu, C., and Kong, H. (2021, January 10\u201317). CrackFormer: Transformer Network for Fine-Grained Crack Detection. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00376"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Guo, J.-M., and Markoni, H. (2021, January 26\u201328). 
Transformer based Refinement Network for Accurate Crack Detection. Proceedings of the 2021 International Conference on System Science and Engineering (ICSSE), Ho Chi Minh City, Vietnam.","DOI":"10.1109\/ICSSE52999.2021.9538477"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., J\u00e9gou, H., and Douze, M. (2021). LeViT: A Vision Transformer in ConvNet\u2019s Clothing for Faster Inference. arXiv.","DOI":"10.1109\/ICCV48922.2021.01204"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1038\/538020a","article-title":"Can we open the black box of AI?","volume":"538","author":"Castelvecchi","year":"2016","journal-title":"Nat. News"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017, January 22\u201329). Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.74"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Serrano, S., and Smith, N.A. (2019). Is attention interpretable?. arXiv.","DOI":"10.18653\/v1\/P19-1282"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Eisenbach, M., Stricker, R., Seichter, D., Amende, K., Debes, K., Sesselmann, M., Ebersbach, D., Stoeckert, U., and Gross, H.-M. (2017, January 14\u201319). How to get pavement distress detection ready for deep learning? A systematic approach. Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA.","DOI":"10.1109\/IJCNN.2017.7966101"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"1525","DOI":"10.1109\/TITS.2019.2910595","article-title":"Feature pyramid and hierarchical boosting network for pavement crack detection","volume":"21","author":"Yang","year":"2019","journal-title":"IEEE Trans. Intell. 
Transp. Syst."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Zhang, L., Yang, F., Zhang, Y.D., and Zhu, Y.J. (2016, January 25\u201328). Road crack detection using deep convolutional neural network. Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA.","DOI":"10.1109\/ICIP.2016.7533052"},{"key":"ref_38","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"541","DOI":"10.1162\/neco.1989.1.4.541","article-title":"Backpropagation applied to handwritten zip code recognition","volume":"1","author":"LeCun","year":"1989","journal-title":"Neural Comput."},{"key":"ref_40","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 6\u201311). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., and Vasudevan, V. (2019, January 27\u201328). Searching for mobilenetv3. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00140"},{"key":"ref_42","unstructured":"Lin, M., Chen, Q., and Yan, S. (2013). Network in network. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Abnar, S., and Zuidema, W. (2020). Quantifying attention flow in transformers. arXiv.","DOI":"10.18653\/v1\/2020.acl-main.385"},{"key":"ref_44","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. 
Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016, January 27\u201330). Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.308"},{"key":"ref_47","unstructured":"Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv."}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/14\/8\/1877\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:53:39Z","timestamp":1760136819000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/14\/8\/1877"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,13]]},"references-count":47,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2022,4]]}},"alternative-id":["rs14081877"],"URL":"https:\/\/doi.org\/10.3390\/rs14081877","relation":{},"ISSN":["2072-4292"],"issn-type":[{"value":"2072-4292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,4,13]]}}}