{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T22:24:33Z","timestamp":1775082273446,"version":"3.50.1"},"reference-count":118,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2023,1,9]],"date-time":"2023-01-09T00:00:00Z","timestamp":1673222400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Understanding actions in videos remains a significant challenge in computer vision, which has been the subject of several pieces of research in the last decades. Convolutional neural networks (CNN) are a significant component of this topic and play a crucial role in the renown of Deep Learning. Inspired by the human vision system, CNN has been applied to visual data exploitation and has solved various challenges in various computer vision tasks and video\/image analysis, including action recognition (AR). However, not long ago, along with the achievement of the transformer in natural language processing (NLP), it began to set new trends in vision tasks, which has created a discussion around whether the Vision Transformer models (ViT) will replace CNN in action recognition in video clips. This paper conducts this trending topic in detail, the study of CNN and Transformer for Action Recognition separately and a comparative study of the accuracy-complexity trade-off. Finally, based on the performance analysis\u2019s outcome, the question of whether CNN or Vision Transformers will win the race will be discussed.<\/jats:p>","DOI":"10.3390\/s23020734","type":"journal-article","created":{"date-parts":[[2023,1,9]],"date-time":"2023-01-09T07:05:09Z","timestamp":1673247909000},"page":"734","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":89,"title":["Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9637-7317","authenticated-orcid":false,"given":"Oumaima","family":"Moutik","sequence":"first","affiliation":[{"name":"Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8433-4643","authenticated-orcid":false,"given":"Hiba","family":"Sekkat","sequence":"additional","affiliation":[{"name":"Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9074-4318","authenticated-orcid":false,"given":"Smail","family":"Tigani","sequence":"additional","affiliation":[{"name":"Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4193-6062","authenticated-orcid":false,"given":"Abdellah","family":"Chehri","sequence":"additional","affiliation":[{"name":"Department of Mathematics and Computer Science, Royal Military College of Canada, Kingston, ON 11 K7K 7B4, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0197-8313","authenticated-orcid":false,"given":"Rachid","family":"Saadane","sequence":"additional","affiliation":[{"name":"SIRC-LaGeS, Hassania School of Public Works, Casablanca 8108, Morocco"}]},{"given":"Taha 
Ait","family":"Tchakoucht","sequence":"additional","affiliation":[{"name":"Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco"}]},{"given":"Anand","family":"Paul","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Republic of Korea"}]}],"member":"1968","published-online":{"date-parts":[[2023,1,9]]},"reference":[{"key":"ref_1","first-page":"1","article-title":"Deep Learning for Computer Vision: A Brief Review","volume":"2018","author":"Voulodimos","year":"2018","journal-title":"Comput. Intell. Neurosci."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1109\/38.674971","article-title":"Computer Vision for Interactive","volume":"18","author":"Freeman","year":"1998","journal-title":"IEEE Comput. Graph. Appl."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"295","DOI":"10.1016\/0262-8856(95)99717-F","article-title":"Medical computer vision, virtual reality and robotics","volume":"13","author":"Ayache","year":"1995","journal-title":"Image Vis. Comput."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Che, E., Jung, J., and Olsen, M. (2019). Object Recognition, Segmentation, and Classification of Mobile Laser Scanning Point Clouds: A State of the Art Review. Sensors, 19.","DOI":"10.3390\/s19040810"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"86","DOI":"10.1007\/s41315-021-00193-0","article-title":"Vision-based positioning system for auto-docking of unmanned surface vehicles (USVs)","volume":"6","author":"Volden","year":"2022","journal-title":"Int. J. Intell. Robot. Appl."},{"key":"ref_6","unstructured":"Minaee, S., Luo, P., Lin, Z., and Bowyer, K. (2021). Going Deeper into Face Detection: A Survey. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Militello, C., Rundo, L., Vitabile, S., and Conti, V. (2021). Fingerprint Classification Based on Deep Learning Approaches: Experimental Findings and Comparisons. Symmetry, 13.","DOI":"10.3390\/sym13050750"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"845","DOI":"10.1016\/j.eng.2020.07.030","article-title":"The State-of-the-Art Review on Applications of Intrusive Sensing, Image Processing Techniques, and Machine Learning Methods in Pavement Monitoring and Analysis","volume":"7","author":"Hou","year":"2021","journal-title":"Engineering"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Deng, G., Luo, J., Sun, C., Pan, D., Peng, L., Ding, N., and Zhang, A. (2021, January 27\u201331). Vision-based Navigation for a Small-scale Quadruped Robot Pegasus-Mini. Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China.","DOI":"10.1109\/ROBIO54168.2021.9739369"},{"key":"ref_10","unstructured":"Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2023, January 06). Improving Language Understanding by Generative Pre-Training. Available online: https:\/\/s3-us-west-2.amazonaws.com\/openai-assets\/research-covers\/language-unsupervised\/language_understanding_paper.pdf."},{"key":"ref_11","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021). An Image is Worth 16\u00d716 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Degardin, B., and Proen\u00e7a, H. (2021). 
Human Behavior Analysis: A Survey on Action Recognition. Appl. Sci., 11.","DOI":"10.3390\/app11188324"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Ravanbakhsh, M., Nabi, M., Mousavi, H., Sangineto, E., and Sebe, N. (2018). Plug-and-Play CNN for Crowd Motion Analysis: An Application in Abnormal Event Detection. arXiv.","DOI":"10.1109\/WACV.2018.00188"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Gabeur, V., Sun, C., Alahari, K., and Schmid, C. (2020). Multi-modal Transformer for Video Retrieval. arXiv.","DOI":"10.1007\/978-3-030-58548-8_13"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"James, S., and Davison, A.J. (2022). Q-attention: Enabling Efficient Learning for Vision-based Robotic Manipulation. arXiv.","DOI":"10.1109\/CVPR52688.2022.01337"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"55","DOI":"10.36548\/jscp.2021.2.001","article-title":"An Efficient Dimension Reduction based Fusion of CNN and SVM Model for Detection of Abnormal Incident in Video Surveillance","volume":"3","author":"Sharma","year":"2021","journal-title":"J. Soft Comput. Paradig."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"106","DOI":"10.1113\/jphysiol.1962.sp006837","article-title":"Receptive fields, binocular interaction and functional architecture in the cat\u2019s visual cortex","volume":"160","author":"Hubel","year":"1962","journal-title":"J. Physiol."},{"key":"ref_18","unstructured":"Huang, T.S. (1996). Computer Vision: Evolution and Promise, CERN School of Computing."},{"key":"ref_19","first-page":"396","article-title":"Handwritten Digit Recognition with a Back-Propagation Network","volume":"2","author":"LeCun","year":"1990","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1007\/BF00344251","article-title":"Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position","volume":"36","author":"Fukushima","year":"1980","journal-title":"Biol. Cybern."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2014). Going Deeper with Convolutions. arXiv.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"818","DOI":"10.1007\/978-3-319-10590-1_53","article-title":"Visualizing and Understanding Convolutional Networks","volume":"Volume 8689","author":"Fleet","year":"2014","journal-title":"Computer Vision\u2013ECCV 2014"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, June 27\u201330). Deep Residual Learning for Image Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_24","unstructured":"Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv."},{"key":"ref_25","unstructured":"Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., and Wang, X. (2020). Deep High-Resolution Representation Learning for Visual Recognition. 
arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1007\/s10916-018-1088-1","article-title":"Medical Image Analysis using Convolutional Neural Networks: A Review","volume":"42","author":"Anwar","year":"2018","journal-title":"J. Med. Syst."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Valiente, R., Zaman, M., Ozer, S., and Fallah, Y.P. (2019, January 9\u201312). Controlling Steering Angle for Cooperative Self-driving Vehicles utilizing CNN and LSTM-based Deep Networks. Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France.","DOI":"10.1109\/IVS.2019.8814260"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. arXiv.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2016). YOLO9000: Better, Faster, Stronger. arXiv.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016). SSD: Single Shot MultiBox Detector. arXiv.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Law, H., and Deng, J. (2019). CornerNet: Detecting Objects as Paired Keypoints. arXiv.","DOI":"10.1007\/978-3-030-01264-9_45"},{"key":"ref_32","unstructured":"Law, H., Teng, Y., Russakovsky, O., and Deng, J. (2020). CornerNet-Lite: Efficient Keypoint Based Object Detection. arXiv."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015). Fast R-CNN. arXiv.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv.","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Du, H., Shi, H., Zeng, D., Zhang, X.-P., and Mei, T. (2021). The Elements of End-to-end Deep Face Recognition: A Survey of Recent Advances. arXiv.","DOI":"10.1145\/3507902"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014, January 23\u201328). DeepFace: Closing the Gap to Human-Level Performance in Face Verification. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.220"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Sun, Y., Wang, X., and Tang, X. (2014, January 23\u201328). Deep Learning Face Representation from Predicting 10,000 Classes. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.244"},{"key":"ref_39","unstructured":"Liu, W., Wen, Y., Yu, Z., and Yang, M. (2017). Large-Margin Softmax Loss for Convolutional Neural Networks. arXiv."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Chen, C.-F., Panda, R., Ramakrishnan, K., Feris, R., Cohn, J., Oliva, A., and Fan, Q. (2021). Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition. 
arXiv.","DOI":"10.1109\/CVPR46437.2021.00610"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). ImageNet: A Large-Scale Hierarchical Image Database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_42","unstructured":"Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., and Natsev, P. (2017). The Kinetics Human Action Video Dataset. arXiv."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1007\/978-3-319-46484-8_2","article-title":"Temporal Segment Networks: Towards Good Practices for Deep Action Recognition","volume":"Volume 9912","author":"Leibe","year":"2016","journal-title":"Computer Vision\u2013ECCV 2016"},{"key":"ref_44","unstructured":"Fan, Q. (2019). More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation. arXiv."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Lin, J., Gan, C., and Han, S. (November, January 27). TSM: Temporal Shift Module for Efficient Video Understanding. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00718"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 7\u201311). Quo Vadis, Action Recognition?. A New Model and the Kinetics Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Hara, K., Kataoka, H., and Satoh, Y. (2018, January 10\u201314). Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00685"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Fan, H., Malik, J., and He, K. (November, January 27). SlowFast Networks for Video Recognition. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00630"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Luo, C., and Yuille, A. (November, January 27). Grouped Spatial-Temporal Aggregation for Efficient Action Recognition. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00561"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Sudhakaran, S., Escalera, S., and Lanz, O. (2020, January 13\u201319). Gate-Shift Networks for Video Action Recognition. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00118"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Neimark, D., Bar, O., Zohar, M., and Asselmann, D. (October, January 11). Video Transformer Network. 
Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00355"},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1186\/s40537-021-00444-8","article-title":"Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions","volume":"8","author":"Alzubaidi","year":"2021","journal-title":"J. Big Data"},{"key":"ref_53","unstructured":"Simonyan, K., and Zisserman, A. (2014). Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014, June 23\u201328). Large-scale Video Classification with Convolutional Neural Networks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.223"},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"831","DOI":"10.1007\/978-3-030-01246-5_49","article-title":"Temporal Relational Reasoning in Videos","volume":"Volume 11205","author":"Ferrari","year":"2018","journal-title":"Computer Vision\u2013ECCV 2018"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Goyal, R., Kahou, S.E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., and Mueller-Freitag, M. (2017, October 22\u201329). The \u201cSomething Something\u201d Video Database for Learning and Evaluating Visual Common Sense. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.622"},{"key":"ref_57","unstructured":"Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., and Vijayanarasimhan, S. (2016). YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., and Funtowicz, M. (2020). HuggingFace\u2019s Transformers: State-of-the-art Natural Language Processing. arXiv.","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"101060","DOI":"10.1016\/j.aei.2020.101060","article-title":"Automated text classification of near-misses from safety reports: An improved deep learning approach","volume":"44","author":"Fang","year":"2020","journal-title":"Adv. Eng. Inform."},{"key":"ref_60","unstructured":"Zhu, J., Xia, Y., Wu, L., He, D., Qin, T., Zhou, W., Li, H., and Liu, T.-Y. (2020). Incorporating BERT into Neural Machine Translation. arXiv."},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Wang, Z., Ng, P., Ma, X., Nallapati, R., and Xiang, B. (2019). Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering. arXiv.","DOI":"10.18653\/v1\/D19-1599"},{"key":"ref_62","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is All you Need. arXiv."},{"key":"ref_63","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_64","unstructured":"Fedus, W., Zoph, B., and Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. 
arXiv."},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S.R. (2019). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv.","DOI":"10.18653\/v1\/W18-5446"},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Rajpurkar, P., Jia, R., and Liang, P. (2018). Know What You Don\u2019t Know: Unanswerable Questions for SQuAD. arXiv.","DOI":"10.18653\/v1\/P18-2124"},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. arXiv.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"ref_68","doi-asserted-by":"crossref","unstructured":"Ye, L., Rochan, M., Liu, Z., and Wang, Y. (2019, January 16\u201320). Cross-Modal Self-Attention Network for Referring Image Segmentation. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01075"},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Zellers, R., Bisk, Y., Schwartz, R., and Choi, Y. (2018). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. arXiv.","DOI":"10.18653\/v1\/D18-1009"},{"key":"ref_70","unstructured":"Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., and Wang, Y. (2021). Transformer in Transformer. arXiv."},{"key":"ref_71","unstructured":"Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., and Shen, C. (2021). Twins: Revisiting the Design of Spatial Attention in Vision Transformers. arXiv."},{"key":"ref_72","unstructured":"Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G., and Fu, B. (2021). Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer. arXiv."},{"key":"ref_73","unstructured":"Chen, C.-F., Panda, R., and Fan, Q. (2022). RegionViT: Regional-to-Local Attention for Vision Transformers. arXiv."},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z., Tay, F.E.H., Feng, J., and Yan, S. (October, January 11). Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00060"},{"key":"ref_75","unstructured":"Zhou, D., Kang, B., Jin, X., Yang, L., Lian, X., Jiang, Z., Hou, Q., and Feng, J. (2021). DeepViT: Towards Deeper Vision Transformer. arXiv."},{"key":"ref_76","doi-asserted-by":"crossref","unstructured":"Wang, P., Wang, X., Wang, F., Lin, M., Chang, S., Li, H., and Jin, R. (2022). KVT: K-NN Attention for Boosting Vision Transformers. arXiv.","DOI":"10.1007\/978-3-031-20053-3_17"},{"key":"ref_77","unstructured":"El-Nouby, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., and Verbeek, J. (2021). XCiT: Cross-Covariance Image Transformers. arXiv."},{"key":"ref_78","unstructured":"Beltagy, I., Peters, M.E., and Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv."},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Girdhar, R., Joao Carreira, J., Doersch, C., and Zisserman, A. (2019, January 16\u201320). Video Action Transformer Network. 
Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00033"},{"key":"ref_80","unstructured":"Zhang, Y., Wu, B., Li, W., Duan, L., and Gan, C. (2021). STST: Spatial-Temporal Specialized Transformer for Skeleton-based Action Recognition. Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China."},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., and Schmid, C. (2021, October 11\u201317). ViViT: A Video Vision Transformer. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"ref_82","first-page":"694","article-title":"Spatial Temporal Transformer Network for Skeleton-based Action Recognition","volume":"Volume 12663","author":"Plizzari","year":"2021","journal-title":"International Conference on Pattern Recognition"},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"729","DOI":"10.1109\/TCBB.2021.3078089","article-title":"Netpro2vec: A Graph Embedding Framework for Biomedical Applications","volume":"19","author":"Manipur","year":"2022","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform."},{"key":"ref_84","doi-asserted-by":"crossref","unstructured":"Shahroudy, A., Liu, J., Ng, T.-T., and Wang, G. (2016, June 27\u201330). NTU RGB+D: A Large-Scale Dataset for 3D Human Activity Analysis. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.115"},{"key":"ref_85","doi-asserted-by":"crossref","first-page":"2684","DOI":"10.1109\/TPAMI.2019.2916873","article-title":"NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding","volume":"42","author":"Liu","year":"2020","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_86","doi-asserted-by":"crossref","unstructured":"Shi, L., Zhang, Y., Cheng, J., and Lu, H. (2019, June 16\u201320). Skeleton-Based Action Recognition with Directed Graph Neural Networks. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00810"},{"key":"ref_87","doi-asserted-by":"crossref","unstructured":"Shi, L., Zhang, Y., Cheng, J., and Lu, H. (2019, June 16\u201320). Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01230"},{"key":"ref_88","unstructured":"Koot, R., Hennerbichler, M., and Lu, H. (2021). Evaluating Transformers for Lightweight Action Recognition. arXiv."},{"key":"ref_89","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3505244","article-title":"Transformers in Vision: A Survey","volume":"54","author":"Khan","year":"2022","journal-title":"ACM Comput. Surv."},{"key":"ref_90","unstructured":"Ulhaq, A., Akhtar, N., Pogrebna, G., and Mian, A. (2022). Vision Transformers for Action Recognition: A Survey. arXiv."},{"key":"ref_91","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1007\/s41095-021-0247-3","article-title":"Transformers in computational visual media: A survey","volume":"8","author":"Xu","year":"2022","journal-title":"Comput. Vis. 
Media"},{"key":"ref_92","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1109\/TPAMI.2022.3152247","article-title":"A Survey on Vision Transformer","volume":"45","author":"Han","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_93","unstructured":"Zhao, Y., Wang, G., Tang, C., Luo, C., Zeng, W., and Zha, Z.-J. (2021). A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP. arXiv."},{"key":"ref_94","doi-asserted-by":"crossref","first-page":"2119","DOI":"10.1587\/transinf.2022EDP7058","article-title":"Model-agnostic Multi-Domain Learning with Domain-Specific Adapters for Action Recognition","volume":"E105.D","author":"Omi","year":"2022","journal-title":"IEICE Trans. Inf. Syst."},{"key":"ref_95","first-page":"11669","article-title":"TEINet: Towards an Efficient Architecture for Video Recognition","volume":"34","author":"Liu","year":"2020","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_96","doi-asserted-by":"crossref","unstructured":"Li, X., Shuai, B., and Tighe, J. (2020). Directional Temporal Modeling for Action Recognition. arXiv.","DOI":"10.1007\/978-3-030-58539-6_17"},{"key":"ref_97","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Feiszli, M., and Torresani, L. (November, January 27). Video Classification with Channel-Separated Convolutional Networks. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.","DOI":"10.1109\/ICCV.2019.00565"},{"key":"ref_98","doi-asserted-by":"crossref","unstructured":"Wang, J., and Torresani, L. (2022, January 18\u201324). Deformable Video Transformer. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01366"},{"key":"ref_99","doi-asserted-by":"crossref","unstructured":"Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., and Hu, H. (2022, January 18\u201324). Video Swin Transformer. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00320"},{"key":"ref_100","doi-asserted-by":"crossref","unstructured":"Yang, J., Dong, X., Liu, L., Zhang, C., Shen, J., and Yu, D. (2022, January 18\u201324). Recurring the Transformer for Video Action Recognition. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01367"},{"key":"ref_101","doi-asserted-by":"crossref","unstructured":"Yan, S., Xiong, X., Arnab, A., Lu, Z., Zhang, M., Sun, C., and Schmid, C. Multiview Transformers for Video Recognition. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00333"},{"key":"ref_102","unstructured":"Zha, X., Zhu, W., Lv, T., Yang, S., and Liu, J. (2021). Shifted Chunk Transformer for Spatio-Temporal Representational Learning. arXiv."},{"key":"ref_103","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Li, X., Liu, C., Shuai, B., Zhu, Y., Brattoli, B., Chen, H., Marsic, I., and Tighe, J. (2021, January 11\u201317). VidTr: Video Transformer Without Convolutions. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.01332"},{"key":"ref_104","doi-asserted-by":"crossref","unstructured":"Kalfaoglu, M.E., Kalkan, S., and Alatan, A.A. (2020). 
Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition. arXiv.","DOI":"10.1007\/978-3-030-68238-5_48"},{"key":"ref_105","first-page":"1263","article-title":"Shrinking Temporal Attention in Transformers for Video Action Recognition","volume":"36","author":"Li","year":"2022","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_106","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (2012). UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv."},{"key":"ref_107","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011, November 6\u201313). HMDB: A large video database for human motion recognition. Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain.","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"ref_108","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1007\/s12652-019-01239-9","article-title":"Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition","volume":"11","author":"Imran","year":"2020","journal-title":"J. Ambient Intell. Humaniz. Comput."},{"key":"ref_109","doi-asserted-by":"crossref","unstructured":"Sun, Z., Ke, Q., Rahmani, H., Bennamoun, M., Wang, G., and Liu, J. (2022). Human Action Recognition from Various Data Modalities: A Review. IEEE Trans. Pattern Anal. Mach. Intell., 1\u201320.","DOI":"10.1109\/TPAMI.2022.3183112"},{"key":"ref_110","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1186\/s40649-019-0069-y","article-title":"Graph convolutional networks: A comprehensive review","volume":"6","author":"Zhang","year":"2019","journal-title":"Comput. Soc. Netw."},{"key":"ref_111","unstructured":"Wang, Q., Peng, J., Shi, S., Liu, T., He, J., and Weng, R. (2021). IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition. arXiv."},{"key":"ref_112","doi-asserted-by":"crossref","unstructured":"Yan, S., Xiong, Y., and Lin, D. (2018). Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. arXiv.","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"ref_113","doi-asserted-by":"crossref","first-page":"2206","DOI":"10.1109\/TCSVT.2020.3019293","article-title":"Fuzzy Integral-Based CNN Classifier Fusion for 3D Skeleton Action Recognition","volume":"31","author":"Banerjee","year":"2021","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_114","doi-asserted-by":"crossref","unstructured":"Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., and Hu, W. (2021). Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition. arXiv.","DOI":"10.1109\/ICCV48922.2021.01311"},{"key":"ref_115","doi-asserted-by":"crossref","unstructured":"Chi, H.-G., Ha, M.H., Chi, S., Lee, S.W., Huang, Q., and Ramani, K. (2022, June 18\u201324). InfoGCN: Representation Learning for Human Skeleton-based Action Recognition. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01955"},{"key":"ref_116","doi-asserted-by":"crossref","unstructured":"Song, Y.-F., Zhang, Z., Shan, C., and Wang, L. (2022). Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition. arXiv.","DOI":"10.1109\/TPAMI.2022.3157033"},{"key":"ref_117","unstructured":"Shi, F., Lee, C., Qiu, L., Zhao, Y., Shen, T., Muralidhar, S., Han, T., Zhu, S.-C., and Narayanan, V. (2021). STAR: Sparse Transformer-based Action Recognition. 
arXiv."},{"key":"ref_118","doi-asserted-by":"crossref","first-page":"4111","DOI":"10.1038\/s41598-022-08157-5","article-title":"An efficient self-attention network for skeleton-based action recognition","volume":"12","author":"Qin","year":"2022","journal-title":"Sci. Rep."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/2\/734\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:04:04Z","timestamp":1760119444000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/2\/734"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,9]]},"references-count":118,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2023,1]]}},"alternative-id":["s23020734"],"URL":"https:\/\/doi.org\/10.3390\/s23020734","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1,9]]}}}