{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,20]],"date-time":"2025-11-20T18:50:45Z","timestamp":1763664645881,"version":"build-2065373602"},"reference-count":104,"publisher":"MDPI AG","issue":"18","license":[{"start":{"date-parts":[[2021,9,8]],"date-time":"2021-09-08T00:00:00Z","timestamp":1631059200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001871","name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","doi-asserted-by":"publisher","award":["UIDB\/50008\/2020","UI\/BD\/150765\/2020"],"award-info":[{"award-number":["UIDB\/50008\/2020","UI\/BD\/150765\/2020"]}],"id":[{"id":"10.13039\/501100001871","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100008530","name":"European Regional Development Fund","doi-asserted-by":"publisher","award":["Centro-01-0145-FEDER-000019"],"award-info":[{"award-number":["Centro-01-0145-FEDER-000019"]}],"id":[{"id":"10.13039\/501100008530","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Applied Sciences"],"abstract":"<jats:p>The visual recognition and understanding of human actions remain an active research domain of computer vision, being the scope of various research works over the last two decades. The problem is challenging due to its many interpersonal variations in appearance and motion dynamics between humans, without forgetting the environmental heterogeneity between different video images. This complexity splits the problem into two major categories: action classification, recognising the action being performed in the scene, and spatiotemporal action localisation, concerning recognising multiple localised human actions present in the scene. Previous surveys mainly focus on the evolution of this field, from handcrafted features to deep learning architectures. However, this survey presents an overview of both categories and respective evolution within each one, the guidelines that should be followed and the current benchmarks employed for performance comparison between the state-of-the-art methods.<\/jats:p>","DOI":"10.3390\/app11188324","type":"journal-article","created":{"date-parts":[[2021,9,8]],"date-time":"2021-09-08T10:12:03Z","timestamp":1631095923000},"page":"8324","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":18,"title":["Human Behavior Analysis: A Survey on Action Recognition"],"prefix":"10.3390","volume":"11","author":[{"given":"Bruno","family":"Degardin","sequence":"first","affiliation":[{"name":"IT-Instituto de Telecomunica\u00e7\u00f5es, University of Beira Interior, 6201-001 Covilh\u00e3, Portugal"}]},{"given":"Hugo","family":"Proen\u00e7a","sequence":"additional","affiliation":[{"name":"IT-Instituto de Telecomunica\u00e7\u00f5es, University of Beira Interior, 6201-001 Covilh\u00e3, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2021,9,8]]},"reference":[{"key":"ref_1","unstructured":"Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning, MIT Press."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1038\/nature14539","article-title":"Deep learning","volume":"521","author":"LeCun","year":"2015","journal-title":"Nature"},{"key":"ref_3","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1016\/j.neunet.2014.09.003","article-title":"Deep learning in neural networks: An overview","volume":"61","author":"Schmidhuber","year":"2015","journal-title":"Neural Netw."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1016\/j.cviu.2006.08.002","article-title":"A survey of advances in vision-based human motion capture and analysis","volume":"104","author":"Moeslund","year":"2006","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"976","DOI":"10.1016\/j.imavis.2009.11.014","article-title":"A survey on vision-based human action recognition","volume":"28","author":"Poppe","year":"2010","journal-title":"Image Vis. Comput."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1473","DOI":"10.1109\/TCSVT.2008.2005594","article-title":"Machine recognition of human activities: A survey","volume":"18","author":"Turaga","year":"2008","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1016\/j.imavis.2017.01.010","article-title":"Going deeper into action recognition: A survey","volume":"60","author":"Herath","year":"2017","journal-title":"Image Vis. Comput."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1016\/j.imavis.2016.06.007","article-title":"From handcrafted to learned representations for human action recognition: A survey","volume":"55","author":"Zhu","year":"2016","journal-title":"Image Vis. Comput."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201326). Quo vadis, action recognition? A new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2018, January 18\u201323). A closer look at spatiotemporal convolutions for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00675"},{"key":"ref_13","unstructured":"Kong, Y., and Fu, Y. (2018). Human action recognition and prediction: A survey. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1016\/j.patrec.2021.01.031","article-title":"Iterative weak\/self-supervised classification framework for abnormal events detection","volume":"145","author":"Degardin","year":"2021","journal-title":"Pattern Recognit. Lett."},{"key":"ref_15","unstructured":"Wang, G., Lai, J., Huang, P., and Xie, X. (February, January 27). Spatial-temporal person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Wang, H., and Schmid, C. (2013, January 21\u201323). Action recognition with improved trajectories. Proceedings of the IEEE International Conference on Computer Vision, Kyoto, Japan.","DOI":"10.1109\/ICCV.2013.441"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Bendersky, M., Garcia-Pueyo, L., Harmsen, J., Josifovski, V., and Lepikhin, D. (2014, January 24\u201327). Up next: Retrieval methods for large scale related video suggestion. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA.","DOI":"10.1145\/2623330.2623344"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","article-title":"3D convolutional neural networks for human action recognition","volume":"35","author":"Ji","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014, January 23\u201328). Large-scale video classification with convolutional neural networks. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.223"},{"key":"ref_20","first-page":"568","article-title":"Two-stream convolutional networks for action recognition in videos","volume":"27","author":"Simonyan","year":"2014","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Zeiler, M.D., and Fergus, R. (2014). Visualizing and understanding convolutional networks. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-10590-1_53"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Duan, H., Zhao, Y., Xiong, Y., Liu, W., and Lin, D. (2020). Omni-sourced Webly-supervised Learning for Video Recognition. arXiv.","DOI":"10.1007\/978-3-030-58555-6_40"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hong, J., Cho, B., Hong, Y.W., and Byun, H. (2019). Contextual action cues from camera sensor for multi-stream action recognition. Sensors, 19.","DOI":"10.3390\/s19061382"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Yao, T., Ngo, C.W., Tian, X., and Mei, T. (2019, January 15\u201320). Learning spatio-temporal representation with local and global diffusion. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01233"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"032035","DOI":"10.1088\/1757-899X\/569\/3\/032035","article-title":"I3d-lstm: A new model for human action recognition","volume":"Volume 569","author":"Wang","year":"2019","journal-title":"IOP Conference Series: Materials Science and Engineering"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.patcog.2018.07.028","article-title":"Asymmetric 3d convolutional neural networks for action recognition","volume":"85","author":"Yang","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_27","unstructured":"Ioffe, S., and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014, January 23\u201328). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.81"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Diba, A., Fayyaz, M., Sharma, V., Paluri, M., Gall, J., Stiefelhagen, R., and Van Gool, L. (2019). Holistic large scale video understanding. arXiv.","DOI":"10.1007\/978-3-030-58558-7_35"},{"key":"ref_31","unstructured":"Zhu, Y., Lan, Z., Newsam, S., and Hauptmann, A. (2018). Hidden two-stream convolutional networks for action recognition. Asian Conference on Computer Vision, Springer."},{"key":"ref_32","unstructured":"Lin, J., Gan, C., and Han, S. (November, January 27). Tsm: Temporal shift module for efficient video understanding. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Xie, S., Sun, C., Huang, J., Tu, Z., and Murphy, K. (2018, January 8\u201314). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01267-0_19"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Yao, T., and Mei, T. (2017, January 22\u201329). Learning spatio-temporal representation with pseudo-3d residual networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.590"},{"key":"ref_36","unstructured":"(2021, September 03). DeepDraw. Available online: https:\/\/github.com\/auduno\/deepdraw."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Kalfaoglu, M., Kalkan, S., and Alatan, A.A. (2020). Late temporal modeling in 3d cnn architectures with bert for action recognition. arXiv.","DOI":"10.1007\/978-3-030-68238-5_48"},{"key":"ref_38","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Girshick, R. (2015, January 7\u201313). Fast r-cnn. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.169"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., and Berg, A.C. (2016). Ssd: Single shot multibox detector. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Tan, M., Pang, R., and Le, Q.V. (2020, January 13\u201319). Efficientdet: Scalable and efficient object detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01079"},{"key":"ref_43","unstructured":"Chen, B.X., and Tsotsos, J.K. (2019). Fast visual object tracking with rotated bounding boxes. arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., and Yan, J. (2019, January 19\u201320). Siamrpn++: Evolution of siamese visual tracking with very deep networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00441"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Li, B., Yan, J., Wu, W., Zhu, Z., and Hu, X. (2018, January 18\u201323). High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00935"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Wang, Q., Zhang, L., Bertinetto, L., Hu, W., and Torr, P.H. (2019, January 19\u201320). Fast online object tracking and segmentation: A unifying approach. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00142"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Bergmann, P., Meinhardt, T., and Leal-Taixe, L. (2019, January 19\u201320). Tracking without bells and whistles. Proceedings of the IEEE International Conference on Computer Vision, Long Beach, CA, USA.","DOI":"10.1109\/ICCV.2019.00103"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Bras\u00f3, G., and Leal-Taix\u00e9, L. (2020, January 13\u201319). Learning a neural solver for multiple object tracking. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00628"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Wang, Z., Zheng, L., Liu, Y., Li, Y., and Wang, S. (2019). Towards real-time multi-object tracking. arXiv.","DOI":"10.1007\/978-3-030-58621-8_7"},{"key":"ref_50","unstructured":"Zhan, Y., Wang, C., Wang, X., Zeng, W., and Liu, W. (2020). A Simple Baseline for Multi-Object Tracking. arXiv."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Peng, X., and Schmid, C. (2016). Multi-region two-stream R-CNN for action detection. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46493-0_45"},{"key":"ref_52","first-page":"91","article-title":"Faster r-cnn: Towards real-time object detection with region proposal networks","volume":"28","author":"Ren","year":"2015","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Kalogeiton, V., Weinzaepfel, P., Ferrari, V., and Schmid, C. (2017, January 22\u201329). Action tubelet detector for spatio-temporal action localization. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.472"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Gu, C., Sun, C., Ross, D.A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., and Sukthankar, R. (2018, January 18\u201323). Ava: A video dataset of spatio-temporally localized atomic visual actions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00633"},{"key":"ref_55","unstructured":"K\u00f6p\u00fckl\u00fc, O., Wei, X., and Rigoll, G. (2019). You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization. arXiv."},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Hara, K., Kataoka, H., and Satoh, Y. (2018, January 18\u201323). Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00685"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2017, January 22\u201329). YOLO9000: Better, faster, stronger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Yang, X., Yang, X., Liu, M.Y., Xiao, F., Davis, L.S., and Kautz, J. (2019, January 19\u201320). Step: Spatio-temporal progressive learning for video action detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00035"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Li, Y., Wang, Z., Wang, L., and Wu, G. (2020). Actions as Moving Points. arXiv.","DOI":"10.1007\/978-3-030-58517-4_5"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019, January 19\u201320). Slowfast networks for video recognition. Proceedings of the IEEE International Conference on Computer Vision, Long Beach, CA, USA.","DOI":"10.1109\/ICCV.2019.00630"},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Cao, Z., Simon, T., Wei, S.E., and Sheikh, Y. (2017, January 22\u201329). Realtime multi-person 2d pose estimation using part affinity fields. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy.","DOI":"10.1109\/CVPR.2017.143"},{"key":"ref_62","unstructured":"Neverova, N., Novotny, D., and Vedaldi, A. (2021, September 03). Correlated Uncertainty for Learning Dense Correspondences from Noisy Labels. Available online: https:\/\/openreview.net\/forum?id=SklKNNBx8B."},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Riza Alp G\u00fcler, Natalia Neverova, I.K (2021, September 03). DensePose: Dense Human Pose Estimation in the Wild. Available online: https:\/\/openaccess.thecvf.com\/content_cvpr_2018\/html\/Guler_DensePose_Dense_Human_CVPR_2018_paper.html.","DOI":"10.1109\/CVPR.2018.00762"},{"key":"ref_64","unstructured":"Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., and Girshick, R. (2021, September 03). Detectron2. Available online: https:\/\/github.com\/facebookresearch\/detectron2."},{"key":"ref_65","unstructured":"Xiu, Y., Li, J., Wang, H., Fang, Y., and Lu, C. (2018). Pose Flow: Efficient Online Pose Tracking. arXiv."},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1109\/MMUL.2012.24","article-title":"Microsoft kinect sensor and its effect","volume":"19","author":"Zhang","year":"2012","journal-title":"IEEE Multimed."},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Junejo, I.N., Dexter, E., Laptev, I., and P\u00e9rez, P. (2008). Cross-view action recognition from temporal self-similarities. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-540-88688-4_22"},{"key":"ref_68","doi-asserted-by":"crossref","unstructured":"Degardin, B., Lopes, V., and Proen\u00e7a, H. (2021). REGINA-Reasoning Graph Convolutional Networks in Human Action Recognition. arXiv.","DOI":"10.1109\/TIFS.2021.3130437"},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Liu, J., Shahroudy, A., Xu, D., and Wang, G. (2016). Spatio-temporal lstm with trust gates for 3d human action recognition. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46487-9_50"},{"key":"ref_70","doi-asserted-by":"crossref","first-page":"1915","DOI":"10.1109\/TPAMI.2011.272","article-title":"Context-aware saliency detection","volume":"34","author":"Goferman","year":"2011","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_71","doi-asserted-by":"crossref","unstructured":"Song, S., Lan, C., Xing, J., Zeng, W., and Liu, J. (2016). An end-to-end spatio-temporal attention model for human action recognition from skeleton data. arXiv.","DOI":"10.1609\/aaai.v31i1.11212"},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., and Zheng, N. (2017, January 22\u201329). View adaptive recurrent neural networks for high performance human action recognition from skeleton data. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.233"},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Graves, A. (2012). Supervised sequence labelling. Supervised Sequence Labelling with Recurrent Neural Networks, Springer.","DOI":"10.1007\/978-3-642-24797-2"},{"key":"ref_74","doi-asserted-by":"crossref","first-page":"538","DOI":"10.1007\/s11390-020-0405-6","article-title":"Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition","volume":"35","author":"Jia","year":"2020","journal-title":"J. Comput. Sci. Technol."},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Ke, Q., Bennamoun, M., An, S., Sohel, F., and Boussaid, F. (2017, January 22\u201329). A new representation of skeleton sequences for 3d action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy.","DOI":"10.1109\/CVPR.2017.486"},{"key":"ref_76","doi-asserted-by":"crossref","unstructured":"Kim, T.S., and Reiter, A. (2017, January 22\u201329). Interpretable 3d human action analysis with temporal convolutional networks. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Venice, Italy.","DOI":"10.1109\/CVPRW.2017.207"},{"key":"ref_77","unstructured":"Li, C., Zhong, Q., Xie, D., and Pu, S. (2017, January 10\u201314). Skeleton-based action recognition with convolutional neural networks. Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China."},{"key":"ref_78","unstructured":"Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R.P. (2015). Convolutional networks on graphs for learning molecular fingerprints. arXiv."},{"key":"ref_79","unstructured":"Henaff, M., Bruna, J., and LeCun, Y. (2015). Deep convolutional networks on graph-structured data. arXiv."},{"key":"ref_80","unstructured":"Kipf, T.N., and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv."},{"key":"ref_81","unstructured":"Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. (2015). Gated graph sequence neural networks. arXiv."},{"key":"ref_82","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1109\/TNNLS.2020.2978386","article-title":"A comprehensive survey on graph neural networks","volume":"32","author":"Wu","year":"2020","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_83","doi-asserted-by":"crossref","unstructured":"Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., and Tian, Q. (2019, January 19\u201320). Actional-structural graph convolutional networks for skeleton-based action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00371"},{"key":"ref_84","doi-asserted-by":"crossref","unstructured":"Si, C., Jing, Y., Wang, W., Wang, L., and Tan, T. (2018, January 8\u201314). Skeleton-based action recognition with spatial reasoning and temporal stack learning. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01246-5_7"},{"key":"ref_85","doi-asserted-by":"crossref","unstructured":"Tang, Y., Tian, Y., Lu, J., Li, P., and Zhou, J. (2018, January 8\u201314). Deep progressive reinforcement learning for skeleton-based action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Munich, Germany.","DOI":"10.1109\/CVPR.2018.00558"},{"key":"ref_86","doi-asserted-by":"crossref","unstructured":"Yan, S., Xiong, Y., and Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. arXiv.","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"ref_87","doi-asserted-by":"crossref","unstructured":"Shi, L., Zhang, Y., Cheng, J., and Lu, H. (2019, January 19\u201320). Skeleton-based action recognition with directed graph neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00810"},{"key":"ref_88","doi-asserted-by":"crossref","unstructured":"Si, C., Chen, W., Wang, W., Wang, L., and Tan, T. (2019, January 19\u201320). An attention enhanced graph convolutional lstm network for skeleton-based action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00132"},{"key":"ref_89","unstructured":"Sutton, R.S., and Barto, A.G. (2018). Reinforcement Learning: An Introduction, MIT Press."},{"key":"ref_90","unstructured":"Zoph, B., and Le, Q.V. (2016). Neural architecture search with reinforcement learning. arXiv."},{"key":"ref_91","doi-asserted-by":"crossref","unstructured":"Peng, W., Hong, X., Chen, H., and Zhao, G. (2020, January 7\u201312). Learning Graph Convolutional Network for Skeleton-Based Human Action Recognition by Neural Searching. Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i03.5652"},{"key":"ref_92","first-page":"3856","article-title":"Dynamic routing between capsules","volume":"30","author":"Sabour","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_93","unstructured":"Hinton, G.E., Sabour, S., and Frosst, N. (May, January 30). Matrix capsules with EM routing. Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada."},{"key":"ref_94","first-page":"7610","article-title":"Videocapsulenet: A simplified network for action detection","volume":"31","author":"Duarte","year":"2018","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_95","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv."},{"key":"ref_96","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011, January 6\u201313). HMDB: A large video database for human motion recognition. Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain.","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"ref_97","unstructured":"Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., and Natsev, P. (2017). The kinetics human action video dataset. arXiv."},{"key":"ref_98","unstructured":"Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., and Zisserman, A. (2018). A short note about kinetics-600. arXiv."},{"key":"ref_99","unstructured":"Smaira, L., Carreira, J., Noland, E., Clancy, E., Wu, A., and Zisserman, A. (2020). A Short Note on the Kinetics-700-2020 Human Action Dataset. arXiv."},{"key":"ref_100","unstructured":"Jiang, Y.G., Liu, J., Roshan Zamir, A., Toderici, G., Laptev, I., Shah, M., and Sukthankar, R. (2021, September 03). THUMOS Challenge: Action Recognition with a Large Number of Classes. Available online: http:\/\/crcv.ucf.edu\/THUMOS14\/."},{"key":"ref_101","doi-asserted-by":"crossref","unstructured":"Jhuang, H., Gall, J., Zuffi, S., Schmid, C., and Black, M.J. (2013, January 1\u20138). Towards understanding action recognition. Proceedings of the IEEE international conference on computer vision, Sydney, Australia.","DOI":"10.1109\/ICCV.2013.396"},{"key":"ref_102","unstructured":"Shahroudy, A., Liu, J., Ng, T.T., and Wang, G. (July, January 26). Ntu rgb+ d: A large scale dataset for 3d human activity analysis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_103","doi-asserted-by":"crossref","first-page":"2684","DOI":"10.1109\/TPAMI.2019.2916873","article-title":"Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding","volume":"42","author":"Liu","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_104","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1016\/j.image.2018.09.003","article-title":"TS-LSTM and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition","volume":"71","author":"Ma","year":"2019","journal-title":"Signal Process. Image Commun."}],"container-title":["Applied Sciences"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2076-3417\/11\/18\/8324\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:58:43Z","timestamp":1760165923000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2076-3417\/11\/18\/8324"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,8]]},"references-count":104,"journal-issue":{"issue":"18","published-online":{"date-parts":[[2021,9]]}},"alternative-id":["app11188324"],"URL":"https:\/\/doi.org\/10.3390\/app11188324","relation":{},"ISSN":["2076-3417"],"issn-type":[{"type":"electronic","value":"2076-3417"}],"subject":[],"published":{"date-parts":[[2021,9,8]]}}}