{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T02:52:55Z","timestamp":1760237575013,"version":"build-2065373602"},"reference-count":67,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2020,6,1]],"date-time":"2020-06-01T00:00:00Z","timestamp":1590969600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61672150, 61702092, 61907007, 61602221"],"award-info":[{"award-number":["61672150, 61702092, 61907007, 61602221"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Fund of the Jilin Provincial Science and Technology Department","award":["20190201305JC, 20180201089GX"],"award-info":[{"award-number":["20190201305JC, 20180201089GX"]}]},{"name":"Fund of Education Department of Jilin Province","award":["JJKH20190355KJ, JJKH20190294KJ, JJKH20190291KJ"],"award-info":[{"award-number":["JJKH20190355KJ, JJKH20190294KJ, JJKH20190291KJ"]}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"publisher","award":["2412019FZ049"],"award-info":[{"award-number":["2412019FZ049"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Action recognition is a significant and challenging topic in the field of sensor and computer vision. Two-stream convolutional neural networks (CNNs) and 3D CNNs are two mainstream deep learning architectures for video action recognition. 
To combine them into one framework and further improve performance, we propose a novel deep network, named the spatiotemporal interaction residual network with pseudo3D (STINP). The STINP possesses three advantages. First, the STINP consists of two branches built on residual networks (ResNets) to simultaneously learn the spatial and temporal information of the video. Second, the STINP integrates the pseudo3D block into the residual units of the spatial branch, which ensures that the spatial branch can not only learn the appearance features of the objects and scenes in the video, but also capture the potential interaction information among consecutive frames. Finally, the STINP adopts a simple but effective multiplication operation to fuse the spatial and temporal branches, which guarantees that the learned spatial and temporal representations can interact with each other during the entire process of training the STINP. Experiments were conducted on two classic action recognition datasets, UCF101 and HMDB51. 
The experimental results show that our proposed STINP can provide better performance for video recognition than other state-of-the-art algorithms.<\/jats:p>","DOI":"10.3390\/s20113126","type":"journal-article","created":{"date-parts":[[2020,6,2]],"date-time":"2020-06-02T09:19:27Z","timestamp":1591089567000},"page":"3126","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Spatiotemporal Interaction Residual Networks with Pseudo3D for Video Action Recognition"],"prefix":"10.3390","volume":"20","author":[{"given":"Jianyu","family":"Chen","sequence":"first","affiliation":[{"name":"College of Information Sciences and Technology, Northeast Normal University, Changchun 130117, China"}]},{"given":"Jun","family":"Kong","sequence":"additional","affiliation":[{"name":"Institute for Intelligent Elderly Care, College of Humanities &amp; Sciences of Northeast Normal University, Changchun 130117, China"},{"name":"Key Laboratory of Applied Statistics of MOE, Northeast Normal University, Changchun 130024, China"}]},{"given":"Hui","family":"Sun","sequence":"additional","affiliation":[{"name":"Institute for Intelligent Elderly Care, College of Humanities &amp; Sciences of Northeast Normal University, Changchun 130117, China"}]},{"given":"Hui","family":"Xu","sequence":"additional","affiliation":[{"name":"College of Information Sciences and Technology, Northeast Normal University, Changchun 130117, China"}]},{"given":"Xiaoli","family":"Liu","sequence":"additional","affiliation":[{"name":"Department of Chemical &amp; Biomolecular Engineering, National University of Singapore, Singapore 117585, Singapore"}]},{"given":"Yinghua","family":"Lu","sequence":"additional","affiliation":[{"name":"Institute for Intelligent Elderly Care, College of Humanities &amp; Sciences of Northeast Normal University, Changchun 130117, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4028-6149","authenticated-orcid":false,"given":"Caixia","family":"Zheng","sequence":"additional","affiliation":[{"name":"College of Information Sciences and Technology, Northeast Normal University, Changchun 130117, China"},{"name":"Key Laboratory of Applied Statistics of MOE, Northeast Normal University, Changchun 130024, China"}]}],"member":"1968","published-online":{"date-parts":[[2020,6,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"773","DOI":"10.1109\/TPAMI.2016.2558148","article-title":"Rank Pooling for Action Recognition","volume":"39","author":"Fernando","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Zhu, H., Vial, R., and Lu, S. (2017, January 22\u201329). TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.619"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Papadopoulos, G.T., Axenopoulos, A., and Daras, P. (2014, January 8\u201310). Real-Time Skeleton-Tracking-Based Human Action Recognition Using Kinect Data. Proceedings of the International Conference on Multimedia Modeling, Dublin, Ireland.","DOI":"10.1007\/978-3-319-04114-8_40"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"2329","DOI":"10.1016\/j.patcog.2015.03.006","article-title":"Semantic human activity recognition: A literature review","volume":"48","author":"Ziaeefard","year":"2015","journal-title":"Pattern Recognit."},{"key":"ref_5","unstructured":"Kong, Y., and Fu, Y. (2018). Action Recognition and Prediction: A Survey Human. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Papadopoulos, K., Demisse, G., Ghorbel, E., Antunes, M., Aouada, D., and Ottersten, B. (2019). Localized Trajectories for 2D and 3D Action Recognition. 
Sensors, 19.","DOI":"10.3390\/s19163503"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Yao, T., and Mei, T. (2017, January 22\u201329). Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.590"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Li, F.-F. (2014, January 24\u201327). Large-Scale Video Classification with Convolutional Neural Networks. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.223"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Nazir, S., Yousaf, M.H., Nebel, J.-C., and Velastin, S.A. (2019). Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) Model for Human Action Recognition. Sensors, 19.","DOI":"10.3390\/s19122790"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wei, H., Jafari, R., and Kehtarnavaz, N. (2019). Fusion of Video and Inertial Sensing for Deep Learning\u2013Based Human Action Recognition. Sensors, 19.","DOI":"10.3390\/s19173680"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, January 7\u201312). Fully convolutional networks for semantic segmentation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (July, January 26). Deep Residual Learning for Image Recognition. 
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1023\/B:VISI.0000011202.85607.00","article-title":"Object Detection Using the Statistics of Parts","volume":"56","author":"Schneiderman","year":"2004","journal-title":"Int. J. Comput. Vis."},{"key":"ref_14","unstructured":"Li, C., Wang, P., Wang, S., Hou, Y., and Li, W. (2017, January 10\u201314). Skeleton-based action recognition using LSTM and CNN. Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Park, E., Han, X., Berg, T.L., and Berg, A.C. (2016, January 7\u20139). Combining multiple sources of knowledge in deep CNNs for action recognition. Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA.","DOI":"10.1109\/WACV.2016.7477589"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Wildes, R.P. (2017, January 21\u201326). Temporal Residual Networks for Dynamic Scene Recognition. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.786"},{"key":"ref_17","unstructured":"Simonyan, K., and Zisserman, A. (2014, January 8\u201313). Two-Stream convolutional networks for action recognition in videos. Proceedings of the Advances in Neural Information Processing Systems, Cambridge, MA, USA."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","article-title":"3D Convolutional Neural Networks for Human Action Recognition","volume":"35","author":"Ji","year":"2012","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Wildes, R.P. (2017, January 21\u201326). Spatiotemporal Multiplier Networks for Video Action Recognition. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.787"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., and Baskurt, A. (2011, January 16). Sequential Deep Learning for Human Action Recognition. Proceedings of the Applications of Evolutionary Computation, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-642-25446-8_4"},{"key":"ref_21","unstructured":"Yunpeng, C., Kalantidis, Y., Li, J., Yan, S., and Feng, J. (2018, January 8\u201314). Multi-fiber Networks for Video Recognition. Proceedings of the Applications of Evolutionary Computation, Munich, Germany."},{"key":"ref_22","first-page":"1","article-title":"A Review on Human Activity Recognition Using Vision-Based Method","volume":"2017","author":"Zhang","year":"2017","journal-title":"J. Heal. Eng."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Ali, S., Basharat, A., and Shah, M. (2007, January 14\u201320). Chaotic Invariants for Human Action Recognition. Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil.","DOI":"10.1109\/ICCV.2007.4409046"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1109\/34.910878","article-title":"The recognition of human movement using temporal templates","volume":"23","author":"Bobick","year":"2001","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"2247","DOI":"10.1109\/TPAMI.2007.70711","article-title":"Actions as Space-Time Shapes","volume":"29","author":"Gorelick","year":"2007","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1007\/s11263-005-1838-7","article-title":"On Space-Time Interest Points","volume":"64","author":"Laptev","year":"2005","journal-title":"Int. J. Comput. Vis."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Willems, G., Tuytelaars, T., and Van Gool, L. (2008, January 12\u201318). An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector. Proceedings of the European Conference on Computer Vision, Marseille, France.","DOI":"10.1007\/978-3-540-88688-4_48"},{"key":"ref_28","unstructured":"Doll\u00e1r, P., Rabaud, V., Cottrell, G., and Belongie, S. (2005, January 15\u201316). Behavior Recognition via Sparse Spatio-Temporal Features. Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Rodriguez, M.D., Ahmed, J., and Shah, M. (2008, January 23\u201328). Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition. Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA.","DOI":"10.1109\/CVPR.2008.4587727"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Niebles, J.C., and Li., F.-F. (2007, January 17\u201322). A Hierarchical Model of Shape and Appearance for Human Action Classification. Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MI, USA.","DOI":"10.1109\/CVPR.2007.383132"},{"key":"ref_31","unstructured":"Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014, January 8\u201313). Generative Adversarial Networks. Proceedings of the Advances in Neural Information Processing Systems, Cambridge, MA, USA."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Lv, F., and Nevatia, R. 
(2006, January 7\u201313). Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost. Proceedings of the European Conference on Computer Vision, Graz, Austria.","DOI":"10.1007\/11744085_28"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Savarese, S., Delpozo, A., Niebles, J.C., and Li., F.-F. (2008, January 8\u20139). Spatial-Temporal correlatons for unsupervised action classification. Proceedings of the 2008 IEEE Workshop on Motion and video Computing, Copper Mountain, CO, USA.","DOI":"10.1109\/WMVC.2008.4544068"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"1612","DOI":"10.1109\/JSEN.2017.2784425","article-title":"Fisherposes for Human Action Recognition Using Kinect Sensor Data","volume":"18","author":"Ghojogh","year":"2018","journal-title":"IEEE Sens. J."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"ImageNet Large Scale Visual Recognition Challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1145\/3065386","article-title":"Pdf ImageNet classification with deep convolutional neural networks","volume":"60","author":"Krizhevsky","year":"2017","journal-title":"Commun. ACM"},{"key":"ref_38","unstructured":"Lee, C.-Y., Gallagher, P.W., and Tu, Z. (2016, January 9\u201311). Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. Proceedings of the Artificial intelligence and statistics, Cadiz, Spain."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Xu, Z., Yang, Y., and Hauptmann, A.G. (2015, January 7\u201312). 
A discriminative CNN video representation for event detection. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298789"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Girdhar, R., Ramanan, D., Gupta, A., Sivic, J., and Russell, B. (2017, January 21\u201326). ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.337"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"2673","DOI":"10.1109\/78.650093","article-title":"Bidirectional recurrent neural networks","volume":"45","author":"Schuster","year":"1997","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201313). Learning Spatiotemporal Features with 3D Convolutional Networks. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201326). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_44","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015, January 7\u201312). Going deeper with convolutions. 
Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"ref_46","unstructured":"Ioffe, S., and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Zisserman, A. (July, January 26). Convolutional Two-Stream Network Fusion for Video Action Recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.213"},{"key":"ref_48","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (2014). UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011, January 6\u201313). HMDB: A large video database for human motion recognition. Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain.","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Li., F.-F. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Wang, X., Farhadi, A., and Gupta, A. (2016, January 27\u201330). Actions~transformations. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.291"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Sun, L., Jia, K., Yeung, D.-Y., and Shi, B.E. (2015, January 7\u201313). Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks. 
Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.522"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Wang, H., and Schmid, C. (2013, January 1\u20138). Action Recognition with Improved Trajectories. Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia.","DOI":"10.1109\/ICCV.2013.441"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"677","DOI":"10.1109\/TPAMI.2016.2599174","article-title":"Long-Term Recurrent Convolutional Networks for Visual Recognition and Description","volume":"39","author":"Donahue","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_55","unstructured":"Srivastava, N., Mansimov, E., and Salakhudinov, R. (2015, January 6\u201311). Unsupervised learning of video representations using lstms. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_56","unstructured":"Ng, J.Y.-H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., and Toderici, G. (2015, January 7\u201312). Beyond short snippets: Deep networks for video classification. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_57","unstructured":"Tran, D., Ray, J., Shou, Z., Chang, S.-F., and Paluri, M. (2017). Convnet architecture search for spatiotemporal feature learning. arXiv."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., and Gould, S. (July, January 26). Dynamic Image Networks for Action Recognition. 
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.331"},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.patcog.2018.07.028","article-title":"Asymmetric 3d convolutional neural networks for action recognition","volume":"85","author":"Yang","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_60","unstructured":"Diba, A., Fayyaz, M., Sharma, V., Karami, A.H., Arzani, M.M., Yousefzadeh, R., and Van Gool, L. (2017). Temporal 3d Convnets: New Architecture and Transfer Learning for Video Classification. arXiv."},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Wang, L., Qiao, Y., and Tang, X. (2015, January 7\u201312). Action recognition with trajectory-pooled deep-convolutional descriptors. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299059"},{"key":"ref_62","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1016\/j.cviu.2017.10.011","article-title":"VideoLSTM convolves, attends and flows for action recognition","volume":"166","author":"Li","year":"2018","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_63","unstructured":"Wang, Y., Wang, S., Tang, J., O\u2019Hare, N., Chang, Y., and Li, B. (2016). Hierarchical Attention Network for Action Recognition in Videos. arXiv."},{"key":"ref_64","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1016\/j.neucom.2018.06.071","article-title":"Action recognition using spatial-optical data organization and sequential learning framework","volume":"315","author":"Yuan","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and Van Gool, L. (2016, January 8\u201316). Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. 
Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"57267","DOI":"10.1109\/ACCESS.2019.2910604","article-title":"A Spatiotemporal Heterogeneous Two-Stream Network for Action Recognition","volume":"7","author":"Chen","year":"2019","journal-title":"IEEE Access"},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Sun, S., Kuang, Z., Sheng, L., Ouyang, W., and Zhang, W. (2018, January 18\u201322). Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake, UT, USA.","DOI":"10.1109\/CVPR.2018.00151"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/11\/3126\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:34:35Z","timestamp":1760175275000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/11\/3126"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,6,1]]},"references-count":67,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2020,6]]}},"alternative-id":["s20113126"],"URL":"https:\/\/doi.org\/10.3390\/s20113126","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2020,6,1]]}}}