{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T20:56:30Z","timestamp":1776200190589,"version":"3.50.1"},"reference-count":252,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2023,2,15]],"date-time":"2023-02-15T00:00:00Z","timestamp":1676419200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Institute for Information &amp; communications Technology Promotion (IITP)","award":["IITP-2022-2021-0-00859"],"award-info":[{"award-number":["IITP-2022-2021-0-00859"]}]},{"name":"Institute for Information &amp; communications Technology Promotion (IITP)","award":["2021-0-02068"],"award-info":[{"award-number":["2021-0-02068"]}]},{"name":"Korea Government (MSIT) (Artificial Intelligence Innovation Hub)","award":["IITP-2022-2021-0-00859"],"award-info":[{"award-number":["IITP-2022-2021-0-00859"]}]},{"name":"Korea Government (MSIT) (Artificial Intelligence Innovation Hub)","award":["2021-0-02068"],"award-info":[{"award-number":["2021-0-02068"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Human action recognition systems use data collected from a wide range of sensors to accurately identify and interpret human actions. One of the most challenging issues for computer vision is the automatic and precise identification of human activities. A significant increase in feature learning-based representations for action recognition has emerged in recent years, due to the widespread use of deep learning-based features. This study presents an in-depth analysis of human activity recognition that investigates recent developments in computer vision. 
Augmented reality, human\u2013computer interaction, cybersecurity, home monitoring, and surveillance cameras are all examples of computer vision applications that often go in conjunction with human action detection. We give a taxonomy-based, rigorous study of human activity recognition techniques, discussing the best ways to acquire human action features, derived using RGB and depth data, as well as the latest research on deep learning and hand-crafted techniques. We also explain a generic architecture to recognize human actions in the real world and its current prominent research topic. At long last, we are able to offer some study analysis concepts and proposals for academics. In-depth researchers of human action recognition will find this review an effective tool.<\/jats:p>","DOI":"10.3390\/s23042182","type":"journal-article","created":{"date-parts":[[2023,2,15]],"date-time":"2023-02-15T03:09:21Z","timestamp":1676430561000},"page":"2182","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":128,"title":["Human Action Recognition: A Taxonomy-Based Survey, Updates, and Opportunities"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6262-6952","authenticated-orcid":false,"given":"Md Golam","family":"Morshed","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, Kyung Hee University, Global Campus, Yongin-si 17104, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3896-5591","authenticated-orcid":false,"given":"Tangina","family":"Sultana","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Kyung Hee University, Global Campus, Yongin-si 17104, Republic of Korea"},{"name":"Department of Electronics and Communication Engineering, Hajee Mohammad Danesh Science & Technology University, Dinajpur 5200, 
Bangladesh"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9222-2468","authenticated-orcid":false,"given":"Aftab","family":"Alam","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Kyung Hee University, Global Campus, Yongin-si 17104, Republic of Korea"},{"name":"Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha P.O. Box 34110, Qatar"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2314-5395","authenticated-orcid":false,"given":"Young-Koo","family":"Lee","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Kyung Hee University, Global Campus, Yongin-si 17104, Republic of Korea"}]}],"member":"1968","published-online":{"date-parts":[[2023,2,15]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"3585","DOI":"10.1109\/JSEN.2017.2697077","article-title":"Radar and RGB-depth sensors for fall detection: A review","volume":"17","author":"Cippitelli","year":"2017","journal-title":"IEEE Sens. J."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1508","DOI":"10.1109\/JSEN.2018.2877662","article-title":"Sensing-enhanced therapy system for assessing children with autism spectrum disorders: A feasibility study","volume":"19","author":"Cai","year":"2018","journal-title":"IEEE Sens. J."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Kong, Y., and Fu, Y. (2014, January 6\u201312). Modeling supporting regions for close human interaction recognition. 
Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-16181-5_3"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"86","DOI":"10.1016\/j.patcog.2016.05.019","article-title":"RGB-D-based action recognition datasets: A survey","volume":"60","author":"Zhang","year":"2016","journal-title":"Pattern Recognit."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1995","DOI":"10.1016\/j.patrec.2013.02.006","article-title":"A survey of human motion analysis using depth imagery","volume":"34","author":"Chen","year":"2013","journal-title":"Pattern Recognit. Lett."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1555008","DOI":"10.1142\/S0218001415550083","article-title":"A survey of applications and human motion recognition with microsoft kinect","volume":"29","author":"Lun","year":"2015","journal-title":"Int. J. Pattern Recognit. Artif. Intell."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"130","DOI":"10.1016\/j.patcog.2015.11.019","article-title":"3D skeleton-based human action classification: A survey","volume":"53","author":"Presti","year":"2016","journal-title":"Pattern Recognit."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1016\/j.cviu.2017.01.011","article-title":"Space-time representation of people based on 3D skeletal data: A review","volume":"158","author":"Han","year":"2017","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_9","unstructured":"Ye, M., Zhang, Q., Wang, L., Zhu, J., Yang, R., and Gall, J. (2013). Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications, Springer."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1016\/j.patrec.2014.04.011","article-title":"Human activity recognition from 3d data: A review","volume":"48","author":"Aggarwal","year":"2014","journal-title":"Pattern Recognit. 
Lett."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1016\/j.imavis.2016.06.007","article-title":"From handcrafted to learned representations for human action recognition: A survey","volume":"55","author":"Zhu","year":"2016","journal-title":"Image Vis. Comput."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1922649.1922653","article-title":"Human activity analysis: A review","volume":"43","author":"Aggarwal","year":"2011","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1007\/s00371-015-1066-2","article-title":"A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector","volume":"32","author":"Dawn","year":"2016","journal-title":"Vis. Comput."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Liu, S., Liu, S., Han, L., Shao, Y., and Zhou, W. (2014, January 14\u201315). Human action recognition using salient region detection in complex scenes. Proceedings of the Third International Conference on Communications, Signal Processing, and Systems, Hohhot, Inner Mongolia, China.","DOI":"10.1007\/978-3-319-08991-1_58"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1109\/TCSVT.2014.2333151","article-title":"STAP: Spatial-temporal attention-aware pooling for action recognition","volume":"25","author":"Nguyen","year":"2014","journal-title":"IEEE Trans. Circuits Syst. 
Video Technol."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s40064-016-2876-z","article-title":"Multi-surface analysis for human action recognition in video","volume":"5","author":"Zhang","year":"2016","journal-title":"SpringerPlus"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1007\/s11760-014-0672-1","article-title":"Instantaneous threat detection based on a semantic representation of activities, zones and trajectories","volume":"8","author":"Burghouts","year":"2014","journal-title":"Signal Image Video Process."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wang, H., and Schmid, C. (2013, January 1\u20138). Action recognition with improved trajectories. Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia.","DOI":"10.1109\/ICCV.2013.441"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Oreifej, O., and Liu, Z. (2013, January 23\u201328). Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA.","DOI":"10.1109\/CVPR.2013.98"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Li, M., Leung, H., and Shum, H.P. (2016, January 10\u201312). Human action recognition via skeletal and depth based feature fusion. Proceedings of the 9th International Conference on Motion in Games, Burlingame, CA, USA.","DOI":"10.1145\/2994258.2994268"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1016\/j.jvcir.2013.03.001","article-title":"Effective 3d action recognition using eigenjoints","volume":"25","author":"Yang","year":"2014","journal-title":"J. Vis. Commun. 
Image Represent."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1007\/s11554-013-0370-1","article-title":"Real-time human action recognition based on depth motion maps","volume":"12","author":"Chen","year":"2016","journal-title":"J. Real-Time Image Process."},{"key":"ref_23","unstructured":"(2023, February 06). Azure Kinect DK. Available online: https:\/\/azure.microsoft.com\/en-us\/products\/kinect-dk\/."},{"key":"ref_24","unstructured":"Simonyan, K., and Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst., 27."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Liu, J., Shahroudy, A., Xu, D., and Wang, G. (2016, January 8\u201316). Spatio-temporal lstm with trust gates for 3d human action recognition. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46487-9_50"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"498","DOI":"10.1109\/THMS.2015.2504550","article-title":"Action recognition from depth maps using deep convolutional neural networks","volume":"46","author":"Wang","year":"2015","journal-title":"IEEE Trans. Hum.-Mach. Syst."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"G\u00fcler, R.A., Neverova, N., and Kokkinos, I. (2018, January 18\u201322). Densepose: Dense human pose estimation in the wild. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00762"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Fang, H.S., Xie, S., Tai, Y.W., and Lu, C. (2017, January 22\u201329). Rmpe: Regional multi-person pose estimation. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.256"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Cao, Z., Simon, T., Wei, S.E., and Sheikh, Y. (2017, January 21\u201326). Realtime multi-person 2d pose estimation using part affinity fields. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.143"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Yan, S., Xiong, Y., and Lin, D. (2018, January 4\u20136). Spatial temporal graph convolutional networks for skeleton-based action recognition. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., and Lin, D. (2017, January 22\u201329). Temporal action detection with structured segment networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.317"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Morshed, M.G., and Lee, Y.K. (2022, January 17\u201320). MNSSD: A Real-time DNN based Companion Image Data Annotation using MobileNet and Single Shot Multibox Detector. 
Proceedings of the 2022 IEEE International Conference on Big Data and Smart Computing (BigComp), Daegu, Republic of Korea.","DOI":"10.1109\/BigComp54360.2022.00055"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"512","DOI":"10.1109\/TMM.2015.2404779","article-title":"Learning spatial and temporal extents of human actions for action detection","volume":"17","author":"Zhou","year":"2015","journal-title":"IEEE Trans. Multimed."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"705","DOI":"10.1007\/s11760-013-0501-y","article-title":"Locating and recognizing multiple human actions by searching for maximum score subsequences","volume":"9","author":"Zhang","year":"2015","journal-title":"Signal Image Video Process."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Shu, Z., Yun, K., and Samaras, D. (2014, January 6\u201312). Action detection with improved dense trajectories and sliding window. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-16178-5_38"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Oneata, D., Verbeek, J., and Schmid, C. (2014, January 23\u201328). Efficient action localization with approximately normalized fisher vectors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.326"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014, January 23\u201328). Large-scale video classification with convolutional neural networks. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.223"},{"key":"ref_39","unstructured":"De la Torre, F., Hodgins, J., Bargteil, A., Martin, X., Macey, J., Collado, A., and Beltran, P. (2009). 
Guide to the Carnegie Mellon University Multimodal Activity (Cmu-Mmac) Database, Citeseer."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Steil, J., and Bulling, A. (2015, January 7\u201311). Discovery of everyday human activities from long-term visual behaviour using topic models. Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, Osaka, Japan.","DOI":"10.1145\/2750858.2807520"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Baradel, F., Wolf, C., Mille, J., and Taylor, G.W. (2018, January 18\u201322). Glimpse clouds: Human activity recognition from unstructured feature points. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00056"},{"key":"ref_42","unstructured":"Takizawa, K., Aoyagi, T., Takada, J.i., Katayama, N., Yekeh, K., Takehiko, Y., and Kohno, K.R. (2008, January 20\u201325). Channel models for wireless body area networks. Proceedings of the 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Vancouver, BC, Canada."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Ohn-Bar, E., and Trivedi, M. (2013, January 23\u201328). Joint angles similarities and HOG2 for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA.","DOI":"10.1109\/CVPRW.2013.76"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Shi, L., Zhang, Y., Cheng, J., and Lu, H. (2019, January 15\u201320). Two-stream adaptive graph convolutional networks for skeleton-based action recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01230"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Tenorth, M., Bandouch, J., and Beetz, M. (October, January 27). 
The TUM kitchen data set of everyday manipulation activities for motion tracking and action recognition. Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, Kyoto, Japan.","DOI":"10.1109\/ICCVW.2009.5457583"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1016\/j.cviu.2006.07.013","article-title":"Free viewpoint action recognition using motion history volumes","volume":"104","author":"Weinland","year":"2006","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3158645","article-title":"Activity recognition with evolving data streams: A review","volume":"51","author":"Abdallah","year":"2018","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1016\/j.imavis.2017.01.010","article-title":"Going deeper into action recognition: A survey","volume":"60","author":"Herath","year":"2017","journal-title":"Image Vis. Comput."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"295","DOI":"10.1016\/j.patcog.2016.08.003","article-title":"Robust human activity recognition from depth video using spatiotemporal multi-fused features","volume":"61","author":"Jalal","year":"2017","journal-title":"Pattern Recognit."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"1028","DOI":"10.1109\/TPAMI.2016.2565479","article-title":"Super normal vector for human activity recognition with depth cameras","volume":"39","author":"Yang","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"494","DOI":"10.1016\/j.patcog.2017.08.009","article-title":"Hand action detection from ego-centric depth sequences with error-correcting Hough transform","volume":"72","author":"Xu","year":"2017","journal-title":"Pattern Recognit."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"1384","DOI":"10.1109\/JIOT.2018.2846359","article-title":"A hybrid hierarchical framework for gym physical activity recognition and measurement using wearable sensors","volume":"6","author":"Qi","year":"2018","journal-title":"IEEE Internet Things J."},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"1","DOI":"10.4236\/etsn.2017.61001","article-title":"RFID systems in healthcare settings and activity of daily living in smart homes: A review","volume":"6","author":"Alsinglawi","year":"2017","journal-title":"E-Health Telecommun. Syst. Netw."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"1192","DOI":"10.1109\/SURV.2012.110112.00192","article-title":"A survey on human activity recognition using wearable sensors","volume":"15","author":"Lara","year":"2012","journal-title":"IEEE Commun. Surv. Tutor."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"386","DOI":"10.1109\/JSEN.2016.2628346","article-title":"A survey on activity detection and classification using wearable sensors","volume":"17","author":"Cornacchia","year":"2016","journal-title":"IEEE Sens. J."},{"key":"ref_56","first-page":"5","article-title":"Sensors, vision and networks: From video surveillance to activity recognition and health monitoring","volume":"11","author":"Prati","year":"2019","journal-title":"J. Ambient Intell. Smart Environ."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"3543","DOI":"10.1007\/s11042-018-6034-1","article-title":"Human activity recognition in egocentric video using HOG, GiST and color features","volume":"79","author":"Kumar","year":"2020","journal-title":"Multimed. 
Tools Appl."},{"key":"ref_58","unstructured":"Roy, P.K., and Om, H. (2018). Advances in Soft Computing and Machine Learning in Image Processing, Springer."},{"key":"ref_59","unstructured":"Thyagarajmurthy, A., Ninad, M., Rakesh, B., Niranjan, S., and Manvi, B. (2019). Emerging Research in Electronics, Computer Science and Technology, Springer."},{"key":"ref_60","first-page":"1550147719853987","article-title":"A concise review on sensor signal acquisition and transformation applied to human activity recognition and human\u2013robot interaction","volume":"15","author":"Ponce","year":"2019","journal-title":"Int. J. Distrib. Sens. Netw."},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.patcog.2018.07.028","article-title":"Asymmetric 3d convolutional neural networks for action recognition","volume":"85","author":"Yang","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_62","doi-asserted-by":"crossref","first-page":"80","DOI":"10.1016\/j.patcog.2017.10.033","article-title":"Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition","volume":"76","author":"Nunez","year":"2018","journal-title":"Pattern Recognit."},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Li, W., Zhang, Z., and Liu, Z. (2018, January 13\u201318). Action recognition based on a bag of 3d points. Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA.","DOI":"10.1109\/CVPRW.2010.5543273"},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Bulbul, M.F., Jiang, Y., and Ma, J. (2015, January 20\u201322). Human action recognition based on dmms, hogs and contourlet transform. 
Proceedings of the 2015 IEEE International Conference on Multimedia Big Data, Beijing, China.","DOI":"10.1109\/BigMM.2015.82"},{"key":"ref_65","doi-asserted-by":"crossref","first-page":"22590","DOI":"10.1109\/ACCESS.2017.2759058","article-title":"Multi-temporal depth motion maps-based local binary patterns for 3-D human action recognition","volume":"5","author":"Chen","year":"2017","journal-title":"IEEE Access"},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"4648","DOI":"10.1109\/TIP.2017.2718189","article-title":"Action recognition using 3D histograms of texture and a multi-class boosting classifier","volume":"26","author":"Zhang","year":"2017","journal-title":"IEEE Trans. Image Process."},{"key":"ref_67","unstructured":"Yang, X., Zhang, C., and Tian, Y. (November, January 29). Recognizing actions using depth motion maps-based histograms of oriented gradients. Proceedings of the 20th ACM International Conference on Multimedia, Nara, Japan."},{"key":"ref_68","doi-asserted-by":"crossref","unstructured":"Lai, K., Bo, L., Ren, X., and Fox, D. (2011, January 9\u201313). A large-scale hierarchical multi-view rgb-d object dataset. Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China.","DOI":"10.1109\/ICRA.2011.5980382"},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Yang, X., and Tian, Y. (2014, January 23\u201328). Super normal vector for activity recognition using depth sequences. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.108"},{"key":"ref_70","doi-asserted-by":"crossref","unstructured":"Slama, R., Wannous, H., and Daoudi, M. (2014, January 24\u201328). Grassmannian representation of motion depth for 3D human gesture and action recognition. 
Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden.","DOI":"10.1109\/ICPR.2014.602"},{"key":"ref_71","doi-asserted-by":"crossref","unstructured":"Wang, J., Liu, Z., Chorowski, J., Chen, Z., and Wu, Y. (2012, January 7\u201313). Robust 3d action recognition with random occupancy patterns. Proceedings of the European Conference on Computer Vision, Florence, Italy.","DOI":"10.1007\/978-3-642-33709-3_62"},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Xia, L., and Aggarwal, J. (2013, January 23\u201328). Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA.","DOI":"10.1109\/CVPR.2013.365"},{"key":"ref_73","doi-asserted-by":"crossref","first-page":"747","DOI":"10.1016\/j.neucom.2015.11.005","article-title":"Depth context: A new descriptor for human activity recognition by using sole depth sequences","volume":"175","author":"Liu","year":"2016","journal-title":"Neurocomputing"},{"key":"ref_74","doi-asserted-by":"crossref","first-page":"1932","DOI":"10.1109\/TMM.2017.2786868","article-title":"Robust 3D action recognition through sampling local appearances and global distributions","volume":"20","author":"Liu","year":"2017","journal-title":"IEEE Trans. Multimed."},{"key":"ref_75","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.sigpro.2017.08.016","article-title":"Skeleton embedded motion body partition for human action recognition using depth sequences","volume":"143","author":"Ji","year":"2018","journal-title":"Signal Process."},{"key":"ref_76","unstructured":"Gowayyed, M.A., Torki, M., Hussein, M.E., and El-Saban, M. (2013, January 3\u20139). Histogram of oriented displacements (HOD): Describing trajectories of human joints for action recognition. 
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China."},{"key":"ref_77","doi-asserted-by":"crossref","first-page":"202","DOI":"10.1016\/j.patcog.2017.01.015","article-title":"Learning discriminative trajectorylet detector sets for accurate skeleton-based action recognition","volume":"66","author":"Qiao","year":"2017","journal-title":"Pattern Recognit."},{"key":"ref_78","doi-asserted-by":"crossref","first-page":"1340","DOI":"10.1109\/TCYB.2014.2350774","article-title":"3-d human action recognition by shape analysis of motion trajectories on riemannian manifold","volume":"45","author":"Devanne","year":"2014","journal-title":"IEEE Trans. Cybern."},{"key":"ref_79","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1016\/j.patcog.2017.10.034","article-title":"DSRF: A flexible trajectory descriptor for articulated human action recognition","volume":"76","author":"Guo","year":"2018","journal-title":"Pattern Recognit."},{"key":"ref_80","doi-asserted-by":"crossref","unstructured":"Duan, H., Zhao, Y., Chen, K., Lin, D., and Dai, B. (2022, January 18\u201324). Revisiting skeleton-based action recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00298"},{"key":"ref_81","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1023\/B:VISI.0000029664.99615.94","article-title":"Distinctive image features from scale-invariant keypoints","volume":"60","author":"Lowe","year":"2004","journal-title":"Int. J. Comput. Vis."},{"key":"ref_82","unstructured":"Doll\u00e1r, P., Rabaud, V., Cottrell, G., and Belongie, S. (2005, January 15\u201316). Behavior recognition via sparse spatio-temporal features. 
Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China."},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"786","DOI":"10.1016\/j.eswa.2013.08.009","article-title":"Evolutionary joint selection to improve human action recognition with RGB-D devices","volume":"41","author":"Chaaraoui","year":"2014","journal-title":"Expert Syst. Appl."},{"key":"ref_84","doi-asserted-by":"crossref","unstructured":"Vemulapalli, R., Arrate, F., and Chellappa, R. (2014, January 23\u201328). Human action recognition by representing 3d skeletons as points in a lie group. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.82"},{"key":"ref_85","doi-asserted-by":"crossref","first-page":"108360","DOI":"10.1016\/j.patcog.2021.108360","article-title":"Skeleton-based relational reasoning for group activity analysis","volume":"122","author":"Perez","year":"2022","journal-title":"Pattern Recognit."},{"key":"ref_86","doi-asserted-by":"crossref","first-page":"914","DOI":"10.1109\/TPAMI.2013.198","article-title":"Learning actionlet ensemble for 3D human action recognition","volume":"36","author":"Wang","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_87","unstructured":"Wang, J., Liu, Z., Wu, Y., and Yuan, J. (2012, January 16\u201321). Mining actionlet ensemble for action recognition with depth cameras. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA."},{"key":"ref_88","doi-asserted-by":"crossref","first-page":"163","DOI":"10.1016\/j.neucom.2016.03.024","article-title":"Activity recognition using a supervised non-parametric hierarchical HMM","volume":"199","author":"Raman","year":"2016","journal-title":"Neurocomputing"},{"key":"ref_89","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Chen, W., and Guo, G. (2013, January 23\u201328). 
Fusing spatiotemporal features and joints for 3d action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA.","DOI":"10.1109\/CVPRW.2013.78"},{"key":"ref_90","unstructured":"Sung, J., Ponce, C., Selman, B., and Saxena, A. (2012, January 4\u201318). Unstructured human activity detection from rgbd images. Proceedings of the 2012 IEEE International Conference on Robotics and Automation, St Paul, MN, USA."},{"key":"ref_91","doi-asserted-by":"crossref","first-page":"74","DOI":"10.1016\/j.sigpro.2014.08.038","article-title":"Coupled hidden conditional random fields for RGB-D human action recognition","volume":"112","author":"Liu","year":"2015","journal-title":"Signal Process."},{"key":"ref_92","doi-asserted-by":"crossref","unstructured":"Kong, Y., and Fu, Y. (2015, January 7\u201312). Bilinear heterogeneous information machine for RGB-D action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298708"},{"key":"ref_93","doi-asserted-by":"crossref","first-page":"350","DOI":"10.1007\/s11263-016-0982-6","article-title":"Max-margin heterogeneous information machine for RGB-D action recognition","volume":"123","author":"Kong","year":"2017","journal-title":"Int. J. Comput. Vis."},{"key":"ref_94","doi-asserted-by":"crossref","first-page":"104465","DOI":"10.1016\/j.imavis.2022.104465","article-title":"Handcrafted localized phase features for human action recognition","volume":"123","author":"Hejazi","year":"2022","journal-title":"Image Vis. 
Comput."},{"key":"ref_95","doi-asserted-by":"crossref","first-page":"82686","DOI":"10.1109\/ACCESS.2021.3085708","article-title":"Making sense of neuromorphic event data for human action recognition","volume":"9","author":"Abhayaratne","year":"2021","journal-title":"IEEE Access"},{"key":"ref_96","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1016\/j.patcog.2017.01.001","article-title":"Graph formulation of video activities for abnormal activity recognition","volume":"65","author":"Singh","year":"2017","journal-title":"Pattern Recognit."},{"key":"ref_97","doi-asserted-by":"crossref","first-page":"1569","DOI":"10.1109\/TIP.2014.2302677","article-title":"Evaluation of color spatio-temporal interest points for human action recognition","volume":"23","author":"Everts","year":"2014","journal-title":"IEEE Trans. Image Process."},{"key":"ref_98","doi-asserted-by":"crossref","first-page":"453","DOI":"10.1016\/j.imavis.2014.04.005","article-title":"Evaluating spatiotemporal interest point features for depth-based action recognition","volume":"32","author":"Zhu","year":"2014","journal-title":"Image Vis. Comput."},{"key":"ref_99","doi-asserted-by":"crossref","unstructured":"Chakraborty, B., Holte, M.B., Moeslund, T.B., Gonzalez, J., and Roca, F.X. (2011, January 6\u201313). A selective spatio-temporal interest point detector for human action recognition in complex scenes. Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain.","DOI":"10.1109\/ICCV.2011.6126443"},{"key":"ref_100","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1016\/j.robot.2015.11.013","article-title":"A proposed unified framework for the recognition of human activity by exploiting the characteristics of action dynamics","volume":"77","author":"Vishwakarma","year":"2016","journal-title":"Robot. Auton. 
Syst."},{"key":"ref_101","doi-asserted-by":"crossref","first-page":"660","DOI":"10.1016\/j.compeleceng.2018.01.037","article-title":"Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition","volume":"72","author":"Nazir","year":"2018","journal-title":"Comput. Electr. Eng."},{"key":"ref_102","unstructured":"Miao, Y., and Song, J. (2014, January 29\u201330). Abnormal event detection based on SVM in video surveillance. Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA), Ottawa, ON, Canada."},{"key":"ref_103","doi-asserted-by":"crossref","unstructured":"Xu, D., Xiao, X., Wang, X., and Wang, J. (2016, January 11\u201312). Human action recognition based on Kinect and PSO-SVM by representing 3D skeletons as points in lie group. Proceedings of the 2016 International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, China.","DOI":"10.1109\/ICALIP.2016.7846646"},{"key":"ref_104","doi-asserted-by":"crossref","first-page":"158","DOI":"10.1109\/TCYB.2015.2399172","article-title":"Learning spatio-temporal representations for action recognition: A genetic programming approach","volume":"46","author":"Liu","year":"2015","journal-title":"IEEE Trans. Cybern."},{"key":"ref_105","doi-asserted-by":"crossref","first-page":"6957","DOI":"10.1016\/j.eswa.2015.04.039","article-title":"Hybrid classifier based human activity recognition using the silhouette and cells","volume":"42","author":"Vishwakarma","year":"2015","journal-title":"Expert Syst. Appl."},{"key":"ref_106","doi-asserted-by":"crossref","first-page":"2238","DOI":"10.4304\/jsw.8.9.2238-2245","article-title":"Human Action Recognition Using APJ3D and Random Forests","volume":"8","author":"Gan","year":"2013","journal-title":"J. 
Softw."},{"key":"ref_107","doi-asserted-by":"crossref","first-page":"1843","DOI":"10.1109\/TCE.2011.6131162","article-title":"Abnormal human activity recognition system based on R-transform and kernel discriminant technique for elderly home care","volume":"57","author":"Khan","year":"2011","journal-title":"IEEE Trans. Consum. Electron."},{"key":"ref_108","doi-asserted-by":"crossref","first-page":"116","DOI":"10.1016\/j.engappai.2013.10.003","article-title":"Optimizing human action recognition based on a cooperative coevolutionary algorithm","volume":"31","author":"Chaaraoui","year":"2014","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_109","doi-asserted-by":"crossref","unstructured":"Chen, C., Jafari, R., and Kehtarnavaz, N. (2015, January 5\u20139). Action recognition from depth sequences using depth motion maps-based local binary patterns. Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV.2015.150"},{"key":"ref_110","doi-asserted-by":"crossref","first-page":"624","DOI":"10.1109\/LSP.2017.2678539","article-title":"Joint distance maps based action recognition with convolutional neural networks","volume":"24","author":"Li","year":"2017","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_111","doi-asserted-by":"crossref","unstructured":"Ke, Q., Bennamoun, M., An, S., Sohel, F., and Boussaid, F. (2017, January 21\u201326). A new representation of skeleton sequences for 3d action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.486"},{"key":"ref_112","unstructured":"Liu, J., Akhtar, N., and Mian, A. (2019, January 16\u201320). Skepxels: Spatio-temporal Image Representation of Human Skeleton Joints for Action Recognition. 
Proceedings of the CVPR Workshops, Long Beach, CA, USA."},{"key":"ref_113","doi-asserted-by":"crossref","first-page":"807","DOI":"10.1109\/TCSVT.2016.2628339","article-title":"Skeleton optical spectra-based action recognition using convolutional neural networks","volume":"28","author":"Hou","year":"2016","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_114","doi-asserted-by":"crossref","unstructured":"Xie, C., Li, C., Zhang, B., Chen, C., Han, J., Zou, C., and Liu, J. (2018). Memory attention networks for skeleton-based action recognition. arXiv.","DOI":"10.24963\/ijcai.2018\/227"},{"key":"ref_115","doi-asserted-by":"crossref","unstructured":"Huang, Z., Wan, C., Probst, T., and Van Gool, L. (2017, January 21\u201326). Deep learning on lie groups for skeleton-based action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.137"},{"key":"ref_116","doi-asserted-by":"crossref","unstructured":"Vemulapalli, R., and Chellapa, R. (2016, January 27\u201330). Rolling rotations for recognizing human actions from 3d skeletal data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.484"},{"key":"ref_117","doi-asserted-by":"crossref","unstructured":"Liu, M., and Yuan, J. (2018, January 18\u201322). Recognizing human actions as the evolution of pose estimation maps. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00127"},{"key":"ref_118","first-page":"1","article-title":"Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition","volume":"18","author":"Tang","year":"2022","journal-title":"ACM Trans. Multimed. Comput. Commun. Appl. (TOMM)"},{"key":"ref_119","doi-asserted-by":"crossref","unstructured":"Li, X., Liu, C., Shuai, B., Zhu, Y., Chen, H., and Tighe, J. (2022, January 3\u20138). 
Nuta: Non-uniform temporal aggregation for action recognition. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV51458.2022.00090"},{"key":"ref_120","doi-asserted-by":"crossref","unstructured":"Xu, Y., Wei, F., Sun, X., Yang, C., Shen, Y., Dai, B., Zhou, B., and Lin, S. (2022, January 18\u201324). Cross-model pseudo-labeling for semi-supervised action recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00297"},{"key":"ref_121","doi-asserted-by":"crossref","unstructured":"Qian, Y., Kang, G., Yu, L., Liu, W., and Hauptmann, A.G. (2022, January 3\u20138). Trm: Temporal relocation module for video recognition. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACVW54805.2022.00021"},{"key":"ref_122","doi-asserted-by":"crossref","unstructured":"Yu, L., Qian, Y., Liu, W., and Hauptmann, A.G. (2022, January 3\u20138). Argus++: Robust real-time activity detection for unconstrained video streams with overlapping cube proposals. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACVW54805.2022.00017"},{"key":"ref_123","doi-asserted-by":"crossref","unstructured":"Wang, L., Tong, Z., Ji, B., and Wu, G. (2021, January 20\u201325). Tdn: Temporal difference networks for efficient action recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00193"},{"key":"ref_124","doi-asserted-by":"crossref","unstructured":"Gowda, S.N., Rohrbach, M., and Sevilla-Lara, L. (2020). SMART Frame Selection for Action Recognition. 
arXiv.","DOI":"10.1609\/aaai.v35i2.16235"},{"key":"ref_125","doi-asserted-by":"crossref","first-page":"1510","DOI":"10.1109\/TMM.2017.2666540","article-title":"Sequential deep trajectory descriptor for action recognition with three-stream CNN","volume":"19","author":"Shi","year":"2017","journal-title":"IEEE Trans. Multimed."},{"key":"ref_126","doi-asserted-by":"crossref","first-page":"346","DOI":"10.1016\/j.patcog.2017.02.030","article-title":"Enhanced skeleton visualization for view invariant human action recognition","volume":"68","author":"Liu","year":"2017","journal-title":"Pattern Recognit."},{"key":"ref_127","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","article-title":"3D convolutional neural networks for human action recognition","volume":"35","author":"Ji","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_128","doi-asserted-by":"crossref","unstructured":"Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., and Tian, Q. (2019, January 15\u201320). Actional-structural graph convolutional networks for skeleton-based action recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00371"},{"key":"ref_129","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016, January 27\u201330). Convolutional two-stream network fusion for video action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.213"},{"key":"ref_130","doi-asserted-by":"crossref","first-page":"386","DOI":"10.1016\/j.future.2019.01.029","article-title":"Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments","volume":"96","author":"Ullah","year":"2019","journal-title":"Future Gener. Comput. 
Syst."},{"key":"ref_131","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1016\/j.patcog.2016.01.012","article-title":"Human action recognition using genetic algorithms and convolutional neural networks","volume":"59","author":"Ijjina","year":"2016","journal-title":"Pattern Recognit."},{"key":"ref_132","doi-asserted-by":"crossref","unstructured":"Akilan, T., Wu, Q.J., Safaei, A., and Jiang, W. (2017, January 5\u20138). A late fusion approach for harnessing multi-CNN model high-level features. Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada.","DOI":"10.1109\/SMC.2017.8122666"},{"key":"ref_133","doi-asserted-by":"crossref","unstructured":"Kim, T.S., and Reiter, A. (2017, January 21\u201326). Interpretable 3d human action analysis with temporal convolutional networks. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA.","DOI":"10.1109\/CVPRW.2017.207"},{"key":"ref_134","first-page":"3100","article-title":"Encoding pose features to images with data augmentation for 3-D action recognition","volume":"16","author":"Hua","year":"2019","journal-title":"IEEE Trans. Ind. Inform."},{"key":"ref_135","doi-asserted-by":"crossref","unstructured":"Gowda, S.N. (2017, January 21\u201326). Human activity recognition using combinatorial Deep Belief Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA.","DOI":"10.1109\/CVPRW.2017.203"},{"key":"ref_136","unstructured":"Li, C., Wang, P., Wang, S., Hou, Y., and Li, W. (2017, January 10\u201314). Skeleton-based action recognition using LSTM and CNN. Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China."},{"key":"ref_137","doi-asserted-by":"crossref","unstructured":"Das, S., Chaudhary, A., Bremond, F., and Thonnat, M. (2019, January 7\u201311). 
Where to focus on for human action recognition?. Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA.","DOI":"10.1109\/WACV.2019.00015"},{"key":"ref_138","doi-asserted-by":"crossref","unstructured":"Veeriah, V., Zhuang, N., and Qi, G.J. (2015, January 7\u201313). Differential recurrent neural networks for action recognition. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.460"},{"key":"ref_139","unstructured":"Du, Y., Wang, W., and Wang, L. (2015, January 7\u201312). Hierarchical recurrent neural network for skeleton based action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_140","doi-asserted-by":"crossref","first-page":"3010","DOI":"10.1109\/TIP.2016.2552404","article-title":"Representation learning of temporal dynamics for skeleton-based action recognition","volume":"25","author":"Du","year":"2016","journal-title":"IEEE Trans. Image Process."},{"key":"ref_141","doi-asserted-by":"crossref","unstructured":"Zhang, S., Liu, X., and Xiao, J. (2017, January 24\u201331). On geometric features for skeleton-based action recognition using multilayer lstm networks. Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA.","DOI":"10.1109\/WACV.2017.24"},{"key":"ref_142","doi-asserted-by":"crossref","unstructured":"Shahroudy, A., Liu, J., Ng, T.T., and Wang, G. (2016, January 27\u201330). NTU RGB+D: A large scale dataset for 3d human activity analysis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.115"},{"key":"ref_143","doi-asserted-by":"crossref","unstructured":"Mahasseni, B., and Todorovic, S. (2016, January 27\u201330). Regularizing long short term memory with 3D human-skeleton sequences for action recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.333"},{"key":"ref_144","doi-asserted-by":"crossref","unstructured":"Zhu, W., Lan, C., Xing, J., Zeng, W., Li, Y., Shen, L., and Xie, X. (2016, January 12\u201317). Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA.","DOI":"10.1609\/aaai.v30i1.10451"},{"key":"ref_145","doi-asserted-by":"crossref","unstructured":"Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., and Zheng, N. (2017, January 22\u201329). View adaptive recurrent neural networks for high performance human action recognition from skeleton data. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.233"},{"key":"ref_146","doi-asserted-by":"crossref","first-page":"1586","DOI":"10.1109\/TIP.2017.2785279","article-title":"Skeleton-based human action recognition with global context-aware attention LSTM networks","volume":"27","author":"Liu","year":"2017","journal-title":"IEEE Trans. Image Process."},{"key":"ref_147","doi-asserted-by":"crossref","unstructured":"Song, S., Lan, C., Xing, J., Zeng, W., and Liu, J. (2017, January 4\u20139). An end-to-end spatio-temporal attention model for human action recognition from skeleton data. Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.","DOI":"10.1609\/aaai.v31i1.11212"},{"key":"ref_148","doi-asserted-by":"crossref","unstructured":"Wang, H., and Wang, L. (2017, January 21\u201326). Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.387"},{"key":"ref_149","doi-asserted-by":"crossref","unstructured":"Si, C., Jing, Y., Wang, W., Wang, L., and Tan, T. (2018, January 8\u201314). Skeleton-based action recognition with spatial reasoning and temporal stack learning. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01246-5_7"},{"key":"ref_150","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1016\/j.neucom.2013.09.055","article-title":"Autoencoder for words","volume":"139","author":"Liou","year":"2014","journal-title":"Neurocomputing"},{"key":"ref_151","doi-asserted-by":"crossref","first-page":"504","DOI":"10.1126\/science.1127647","article-title":"Reducing the dimensionality of data with neural networks","volume":"313","author":"Hinton","year":"2006","journal-title":"Science"},{"key":"ref_152","doi-asserted-by":"crossref","unstructured":"Zhang, J., Shan, S., Kan, M., and Chen, X. (2014, January 6\u201312). Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10605-2_1"},{"key":"ref_153","doi-asserted-by":"crossref","unstructured":"Jiang, X., Zhang, Y., Zhang, W., and Xiao, X. (2013, January 19\u201321). A novel sparse auto-encoder for deep unsupervised learning. Proceedings of the 2013 Sixth International Conference on Advanced Computational Intelligence (ICACI), Hangzhou, China.","DOI":"10.1109\/ICACI.2013.6748512"},{"key":"ref_154","unstructured":"Zhou, Y., Arpit, D., Nwogu, I., and Govindaraju, V. (2014). Is joint training better for deep auto-encoders?. arXiv."},{"key":"ref_155","doi-asserted-by":"crossref","unstructured":"Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.A. (2008, January 5\u20139). Extracting and composing robust features with denoising autoencoders. 
Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland.","DOI":"10.1145\/1390156.1390294"},{"key":"ref_156","first-page":"3371","article-title":"Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion","volume":"11","author":"Vincent","year":"2010","journal-title":"J. Mach. Learn. Res."},{"key":"ref_157","doi-asserted-by":"crossref","first-page":"3170","DOI":"10.1109\/TII.2018.2808910","article-title":"An efficient deep learning model to predict cloud workload for industry informatics","volume":"14","author":"Zhang","year":"2018","journal-title":"IEEE Trans. Ind. Inform."},{"key":"ref_158","doi-asserted-by":"crossref","unstructured":"Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., and Baskurt, A. (2012, January 3\u20137). Spatio-Temporal Convolutional Sparse Auto-Encoder for Sequence Classification. Proceedings of the BMVC, Surrey, UK.","DOI":"10.5244\/C.26.124"},{"key":"ref_159","first-page":"2","article-title":"Learning and relearning in Boltzmann machines","volume":"1","author":"Hinton","year":"1986","journal-title":"Parallel Distrib. Process. Explor. Microstruct. Cogn."},{"key":"ref_160","unstructured":"Carreira-Perpinan, M.A., and Hinton, G.E. (2005, January 6\u20138). On contrastive divergence learning. Proceedings of the Aistats, Bridgetown, Barbados."},{"key":"ref_161","unstructured":"Hinton, G.E. (2012). Neural Networks: Tricks of the Trade, Springer."},{"key":"ref_162","unstructured":"Cho, K., Raiko, T., and Ilin, A. (July, January 28). Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. Proceedings of the ICML, Bellevue, WA, USA."},{"key":"ref_163","unstructured":"Nair, V., and Hinton, G.E. (2010, January 21\u201324). Rectified linear units improve restricted boltzmann machines. 
Proceedings of the ICML, Haifa, Israel."},{"key":"ref_164","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_165","unstructured":"Zeiler, M.D., and Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. arXiv."},{"key":"ref_166","doi-asserted-by":"crossref","first-page":"1527","DOI":"10.1162\/neco.2006.18.7.1527","article-title":"A fast learning algorithm for deep belief nets","volume":"18","author":"Hinton","year":"2006","journal-title":"Neural Comput."},{"key":"ref_167","unstructured":"Chen, B. (2010). Deep Learning of Invariant Spatio-Temporal Features from Video. [Ph.D. Thesis, University of British Columbia]."},{"key":"ref_168","doi-asserted-by":"crossref","unstructured":"Zhang, L., Zhu, G., Shen, P., Song, J., Afaq Shah, S., and Bennamoun, M. (2017, January 22\u201329). Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition. Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy.","DOI":"10.1109\/ICCVW.2017.369"},{"key":"ref_169","doi-asserted-by":"crossref","first-page":"1806","DOI":"10.1109\/TSMC.2018.2850149","article-title":"Deep convolutional neural networks for human action recognition using depth maps and postures","volume":"49","author":"Kamel","year":"2018","journal-title":"IEEE Trans. Syst. Man, Cybern. 
Syst."},{"key":"ref_170","doi-asserted-by":"crossref","first-page":"323","DOI":"10.3390\/s22010323","article-title":"Human activity recognition via hybrid deep learning based model","volume":"22","author":"Khan","year":"2022","journal-title":"Sensors"},{"key":"ref_171","doi-asserted-by":"crossref","first-page":"1583","DOI":"10.1109\/TPAMI.2016.2537340","article-title":"Deep dynamic neural networks for multimodal gesture segmentation and recognition","volume":"38","author":"Wu","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_172","doi-asserted-by":"crossref","unstructured":"Wang, P., Li, W., Gao, Z., Zhang, Y., Tang, C., and Ogunbona, P. (2017, January 21\u201326). Scene flow to action map: A new representation for rgb-d based action recognition with convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.52"},{"key":"ref_173","doi-asserted-by":"crossref","unstructured":"Shi, Z., and Kim, T.K. (2017, January 21\u201326). Learning and refining of privileged information-based RNNs for action recognition from depth sequences. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.498"},{"key":"ref_174","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1016\/j.imavis.2016.04.004","article-title":"3D-based deep convolutional neural network for action recognition with depth sequences","volume":"55","author":"Liu","year":"2016","journal-title":"Image Vis. Comput."},{"key":"ref_175","doi-asserted-by":"crossref","unstructured":"Wang, X., Zhang, S., Qing, Z., Tang, M., Zuo, Z., Gao, C., Jin, R., and Sang, N. (2022, January 18\u201324). Hybrid relation guided set matching for few-shot action recognition. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01932"},{"key":"ref_176","doi-asserted-by":"crossref","first-page":"1474","DOI":"10.1109\/TPAMI.2022.3157033","article-title":"Constructing stronger and faster baselines for skeleton-based action recognition","volume":"45","author":"Song","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_177","doi-asserted-by":"crossref","unstructured":"Duan, H., Wang, J., Chen, K., and Lin, D. (2022, January 10\u201314). Pyskl: Towards good practices for skeleton action recognition. Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal.","DOI":"10.1145\/3503161.3548546"},{"key":"ref_178","unstructured":"Wang, M., Xing, J., and Liu, Y. (2021). Actionclip: A new paradigm for video action recognition. arXiv."},{"key":"ref_179","doi-asserted-by":"crossref","unstructured":"Gao, R., Oh, T.H., Grauman, K., and Torresani, L. (2020, January 13\u201319). Listen to look: Action recognition by previewing audio. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01047"},{"key":"ref_180","doi-asserted-by":"crossref","unstructured":"Si, C., Chen, W., Wang, W., Wang, L., and Tan, T. (2019, January 15\u201320). An attention enhanced graph convolutional lstm network for skeleton-based action recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00132"},{"key":"ref_181","doi-asserted-by":"crossref","unstructured":"Das, S., Koperski, M., Bremond, F., and Francesca, G. (2018, January 27\u201330). Deep-temporal lstm for daily living action recognition. 
Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand.","DOI":"10.1109\/AVSS.2018.8639122"},{"key":"ref_182","unstructured":"Sharma, S., Kiros, R., and Salakhutdinov, R. (2015). Action recognition using visual attention. arXiv."},{"key":"ref_183","doi-asserted-by":"crossref","unstructured":"Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. (2016, January 11\u201314). Temporal segment networks: Towards good practices for deep action recognition. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"ref_184","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1016\/j.neucom.2018.03.077","article-title":"Deep key frame extraction for sport training","volume":"328","author":"Jian","year":"2019","journal-title":"Neurocomputing"},{"key":"ref_185","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Sun, X., Zha, Z.J., and Zeng, W. (2018, January 18\u201322). Mict: Mixed 3d\/2d convolutional tube for human action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00054"},{"key":"ref_186","doi-asserted-by":"crossref","unstructured":"Foggia, P., Saggese, A., Strisciuglio, N., and Vento, M. (2014, January 26\u201329). Exploiting the deep learning paradigm for recognizing human actions. Proceedings of the 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Seoul, Republic of Korea.","DOI":"10.1109\/AVSS.2014.6918650"},{"key":"ref_187","unstructured":"Ahsan, U., Sun, C., and Essa, I. (2018). Discrimnet: Semi-supervised action recognition from videos using generative adversarial networks. 
arXiv."},{"key":"ref_188","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1016\/j.image.2011.05.002","article-title":"Human action recognition using pose-based discriminant embedding","volume":"27","author":"Saghafi","year":"2012","journal-title":"Signal Process. Image Commun."},{"key":"ref_189","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015, January 7\u20139). Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the International Conference on Machine Learning, PMLR, Lille, France."},{"key":"ref_190","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_191","doi-asserted-by":"crossref","unstructured":"Guo, H., Wang, H., and Ji, Q. (2022, January 18\u201324). Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01942"},{"key":"ref_192","doi-asserted-by":"crossref","unstructured":"Liu, Z., Tian, Y., and Wang, Z. (2017, January 17\u201320). Improving human action recognition by temporal attention. Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China.","DOI":"10.1109\/ICIP.2017.8296405"},{"key":"ref_193","doi-asserted-by":"crossref","first-page":"574","DOI":"10.1016\/j.asoc.2017.06.007","article-title":"First and second order dynamics in a hierarchical SOM system for action recognition","volume":"59","author":"Gharaee","year":"2017","journal-title":"Appl. Soft Comput."},{"key":"ref_194","doi-asserted-by":"crossref","unstructured":"Chen, J., Mittal, G., Yu, Y., Kong, Y., and Chen, M. (2022, January 18\u201324). GateHUB: Gated History Unit with Background Suppression for Online Action Detection. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01930"},{"key":"ref_195","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_196","doi-asserted-by":"crossref","unstructured":"Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S.R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv.","DOI":"10.18653\/v1\/W18-5446"},{"key":"ref_197","doi-asserted-by":"crossref","unstructured":"Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. arXiv.","DOI":"10.18653\/v1\/D16-1264"},{"key":"ref_198","doi-asserted-by":"crossref","unstructured":"Zellers, R., Bisk, Y., Schwartz, R., and Choi, Y. (2018). Swag: A large-scale adversarial dataset for grounded commonsense inference. arXiv.","DOI":"10.18653\/v1\/D18-1009"},{"key":"ref_199","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_200","unstructured":"Rae, J.W., Potapenko, A., Jayakumar, S.M., and Lillicrap, T.P. (2019). Compressive transformers for long-range sequence modelling. arXiv."},{"key":"ref_201","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_202","doi-asserted-by":"crossref","unstructured":"Wei, Y., Liu, H., Xie, T., Ke, Q., and Guo, Y. (2022, January 3\u20138). Spatial-temporal transformer for 3d point cloud sequences. 
Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV51458.2022.00073"},{"key":"ref_203","doi-asserted-by":"crossref","unstructured":"Chen, J., and Ho, C.M. (2022, January 3\u20138). MM-ViT: Multi-modal video transformer for compressed video action recognition. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACV51458.2022.00086"},{"key":"ref_204","doi-asserted-by":"crossref","unstructured":"Wu, C.Y., Li, Y., Mangalam, K., Fan, H., Xiong, B., Malik, J., and Feichtenhofer, C. (2022, January 18\u201324). Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01322"},{"key":"ref_205","doi-asserted-by":"crossref","unstructured":"Yan, S., Xiong, X., Arnab, A., Lu, Z., Zhang, M., Sun, C., and Schmid, C. (2022, January 18\u201324). Multiview transformers for video recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00333"},{"key":"ref_206","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_207","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv."},{"key":"ref_208","unstructured":"Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid, C. (November, January 27). Videobert: A joint model for video and language representation learning. 
Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_209","doi-asserted-by":"crossref","unstructured":"Xu, H., Ghosh, G., Huang, P.Y., Arora, P., Aminzadeh, M., Feichtenhofer, C., Metze, F., and Zettlemoyer, L. (2021). VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding. arXiv.","DOI":"10.18653\/v1\/2021.findings-acl.370"},{"key":"ref_210","unstructured":"Akbari, H., Yuan, L., Qian, R., Chuang, W.H., Chang, S.F., Cui, Y., and Gong, B. (2021). Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. arXiv."},{"key":"ref_211","unstructured":"Sun, C., Baradel, F., Murphy, K., and Schmid, C. (2019). Learning video representations using contrastive bidirectional transformer. arXiv."},{"key":"ref_212","unstructured":"Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the limits of language modeling. arXiv."},{"key":"ref_213","doi-asserted-by":"crossref","unstructured":"Zhu, L., and Yang, Y. (2020, January 13\u201319). Actbert: Learning global-local video-text representations. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00877"},{"key":"ref_214","unstructured":"Luo, H., Ji, L., Shi, B., Huang, H., Duan, N., Li, T., Li, J., Bharti, T., and Zhou, M. (2020). Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv."},{"key":"ref_215","doi-asserted-by":"crossref","unstructured":"Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., and Hu, H. (2022, January 18\u201324). Video swin transformer. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00320"},{"key":"ref_216","doi-asserted-by":"crossref","unstructured":"Marszalek, M., Laptev, I., and Schmid, C. (2009, January 22\u201324). Actions in context. 
Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206557"},{"key":"ref_217","doi-asserted-by":"crossref","first-page":"971","DOI":"10.1007\/s00138-012-0450-4","article-title":"Recognizing 50 human action categories of web videos","volume":"24","author":"Reddy","year":"2013","journal-title":"Mach. Vis. Appl."},{"key":"ref_218","unstructured":"Li, W., Wong, Y., Liu, A.A., Li, Y., Su, Y.T., and Kankanhalli, M. (2016). Multi-camera action dataset (MCAD): A dataset for studying non-overlapped cross-camera action recognition. arXiv."},{"key":"ref_219","doi-asserted-by":"crossref","unstructured":"Bhardwaj, R., and Singh, P.K. (2016, January 14\u201315). Analytical review on human activity recognition in video. Proceedings of the 2016 6th International Conference-Cloud System and Big Data Engineering (Confluence), Noida, India.","DOI":"10.1109\/CONFLUENCE.2016.7508177"},{"key":"ref_220","unstructured":"Chahuara, P., Fleury, A., Vacher, M., and Portet, F. (2012, January 22\u201324). M\u00e9thodes SVM et MLN pour la reconnaissance automatique d\u2019activit\u00e9s humaines dans les habitats perceptifs: Tests et perspectives. Proceedings of the RFIA 2012 (Reconnaissance des Formes et Intelligence Artificielle), Lyon, France."},{"key":"ref_221","doi-asserted-by":"crossref","unstructured":"Nguyen-Duc-Thanh, N., Stonier, D., Lee, S., and Kim, D.H. (2011, January 22\u201324). A new approach for human-robot interaction using human body language. Proceedings of the International Conference on Hybrid Information Technology, Daejeon, Republic of Korea.","DOI":"10.1007\/978-3-642-24082-9_92"},{"key":"ref_222","doi-asserted-by":"crossref","unstructured":"Mollet, N., and Chellali, R. (2005, January 27\u201331). D\u00e9tection et interpr\u00e9tation des Gestes de la Main. 
Proceedings of the 2005 3rd International Conference on SETIT, Sousse, Tunisia.","DOI":"10.1016\/S0338-9898(05)80195-7"},{"key":"ref_223","first-page":"339","article-title":"Continuous gesture trajectory recognition system based on computer vision","volume":"6","author":"Wenkai","year":"2012","journal-title":"Int. J. Appl. Math. Inf. Sci."},{"key":"ref_224","doi-asserted-by":"crossref","first-page":"763","DOI":"10.3837\/tiis.2015.02.016","article-title":"A novel method for hand posture recognition based on depth information descriptor","volume":"9","author":"Xu","year":"2015","journal-title":"KSII Trans. Internet Inf. Syst. (TIIS)"},{"key":"ref_225","unstructured":"Youssef, M.B., Trabelsi, I., and Bouhlel, M.S. (2016). Human action analysis for assistance with daily activities. Int. J. Hum. Mach. Interact., 7."},{"key":"ref_226","doi-asserted-by":"crossref","unstructured":"Shao, J., Kang, K., Change Loy, C., and Wang, X. (2015, January 7\u201312). Deeply learned attributes for crowded scene understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299097"},{"key":"ref_227","unstructured":"Shu, T., Xie, D., Rothrock, B., Todorovic, S., and Chun Zhu, S. (2015, January 7\u201312). Joint inference of groups, events and human roles in aerial videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_228","doi-asserted-by":"crossref","unstructured":"Ryoo, M.S., and Aggarwal, J.K. (October, January 29). Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. 
Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan.","DOI":"10.1109\/ICCV.2009.5459361"},{"key":"ref_229","doi-asserted-by":"crossref","first-page":"28","DOI":"10.3389\/frobt.2015.00028","article-title":"A review of human activity recognition methods","volume":"2","author":"Vrigkas","year":"2015","journal-title":"Front. Robot. AI"},{"key":"ref_230","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1109\/ICPR.2004.1334462","article-title":"Recognizing human actions: A local SVM approach","volume":"Volume 3","author":"Schuldt","year":"2004","journal-title":"Proceedings of the 17th International Conference on Pattern Recognition 2004, ICPR 2004"},{"key":"ref_231","doi-asserted-by":"crossref","first-page":"2684","DOI":"10.1109\/TPAMI.2019.2916873","article-title":"NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding","volume":"42","author":"Liu","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_232","doi-asserted-by":"crossref","unstructured":"Singh, S., Velastin, S.A., and Ragheb, H. (September, January 29). Muhavi: A multicamera human action video dataset for the evaluation of action recognition methods. Proceedings of the 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, Boston, MA, USA.","DOI":"10.1109\/AVSS.2010.63"},{"key":"ref_233","doi-asserted-by":"crossref","unstructured":"Caba Heilbron, F., Escorcia, V., Ghanem, B., and Carlos Niebles, J. (2015, January 7\u201312). Activitynet: A large-scale video benchmark for human activity understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"ref_234","unstructured":"Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., and Natsev, P. (2017). The kinetics human action video dataset. 
arXiv."},{"key":"ref_235","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011, January 6\u201313). HMDB: A large video database for human motion recognition. Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain.","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"ref_236","doi-asserted-by":"crossref","unstructured":"Laptev, I., Marszalek, M., Schmid, C., and Rozenfeld, B. (2008, January 23\u201328). Learning realistic human actions from movies. Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA.","DOI":"10.1109\/CVPR.2008.4587756"},{"key":"ref_237","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv."},{"key":"ref_238","first-page":"141","article-title":"Performance metrics and evaluation issues for continuous activity recognition","volume":"4","author":"Minnen","year":"2006","journal-title":"Perform. Metrics Intell. Syst."},{"key":"ref_239","unstructured":"Wang, Y., Wu, H., Zhang, J., Gao, Z., Wang, J., Yu, P.S., and Long, M. (2021). PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning. arXiv."},{"key":"ref_240","doi-asserted-by":"crossref","unstructured":"Paoletti, G., Cavazza, J., Beyan, C., and Del Bue, A. (2021, January 10\u201315). Subspace Clustering for Action Recognition with Covariance Representations and Temporal Pruning. 
Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Virtual.","DOI":"10.1109\/ICPR48806.2021.9412060"},{"key":"ref_241","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1016\/j.neucom.2019.12.151","article-title":"Conflux LSTMs network: A novel approach for multi-view action recognition","volume":"435","author":"Ullah","year":"2021","journal-title":"Neurocomputing"},{"key":"ref_242","doi-asserted-by":"crossref","first-page":"1045","DOI":"10.1109\/TPAMI.2017.2691321","article-title":"Deep multimodal feature analysis for action recognition in RGB+D videos","volume":"40","author":"Shahroudy","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_243","unstructured":"Lan, Z., Lin, M., Li, X., Hauptmann, A.G., and Raj, B. (2015, January 7\u201312). Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_244","unstructured":"Wu, W., Sun, Z., and Ouyang, W. (2023, January 7\u20138). Revisiting classifier: Transferring vision-language models for video recognition. Proceedings of the AAAI, Washington, DC, USA."},{"key":"ref_245","unstructured":"Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z., Zhang, H., Xu, J., Liu, Y., and Wang, Z. (2022). InternVideo: General Video Foundation Models via Generative and Discriminative Learning. arXiv."},{"key":"ref_246","doi-asserted-by":"crossref","unstructured":"Wang, L., and Koniusz, P. (2021, January 20\u201324). Self-supervising action recognition by statistical moment and subspace descriptors. 
Proceedings of the 29th ACM International Conference on Multimedia, Virtual.","DOI":"10.1145\/3474085.3475572"},{"key":"ref_247","doi-asserted-by":"crossref","first-page":"107102","DOI":"10.1016\/j.asoc.2021.107102","article-title":"Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications","volume":"103","author":"Ullah","year":"2021","journal-title":"Appl. Soft Comput."},{"key":"ref_248","doi-asserted-by":"crossref","unstructured":"Negin, F., Koperski, M., Crispim, C.F., Bremond, F., Co\u015far, S., and Avgerinakis, K. (2016, January 23\u201326). A hybrid framework for online recognition of activities of daily living in real-world settings. Proceedings of the 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Colorado Springs, CO, USA.","DOI":"10.1109\/AVSS.2016.7738021"},{"key":"ref_249","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s10462-012-9356-9","article-title":"Vision based hand gesture recognition for human computer interaction: A survey","volume":"43","author":"Rautaray","year":"2015","journal-title":"Artif. Intell. Rev."},{"key":"ref_250","doi-asserted-by":"crossref","unstructured":"Xu, K., Qin, Z., and Wang, G. (2016, January 11\u201315). Recognize human activities from multi-part missing videos. Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA.","DOI":"10.1109\/ICME.2016.7552941"},{"key":"ref_251","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1016\/j.inffus.2018.06.002","article-title":"Data fusion and multiple classifier systems for human activity detection and health monitoring: Review and open research directions","volume":"46","author":"Nweke","year":"2019","journal-title":"Inf. Fusion"},{"key":"ref_252","unstructured":"Akansha, U.A., Shailendra, M., and Singh, N. (2016, January 16\u201318). Analytical review on video-based human activity recognition. 
Proceedings of the 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/4\/2182\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:36:08Z","timestamp":1760121368000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/4\/2182"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,15]]},"references-count":252,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2023,2]]}},"alternative-id":["s23042182"],"URL":"https:\/\/doi.org\/10.3390\/s23042182","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,15]]}}}