{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T18:44:26Z","timestamp":1767984266514,"version":"3.49.0"},"reference-count":36,"publisher":"MDPI AG","issue":"17","license":[{"start":{"date-parts":[[2022,9,1]],"date-time":"2022-09-01T00:00:00Z","timestamp":1661990400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>As a sub-field of video content analysis, action recognition has received extensive attention in recent years, which aims to recognize human actions in videos. Compared with a single image, video has a temporal dimension. Therefore, it is of great significance to extract the spatio-temporal information from videos for action recognition. In this paper, an efficient network to extract spatio-temporal information with relatively low computational load (dubbed MEST) is proposed. Firstly, a motion encoder to capture short-term motion cues between consecutive frames is developed, followed by a channel-wise spatio-temporal module to model long-term feature information. Moreover, the weight standardization method is applied to the convolution layers followed by batch normalization layers to expedite the training process and facilitate convergence. Experiments are conducted on five public datasets of action recognition, Something-Something-V1 and -V2, Jester, UCF101 and HMDB51, where MEST exhibits competitive performance compared to other popular methods. The results demonstrate the effectiveness of our network in terms of accuracy, computational cost and network scales.<\/jats:p>","DOI":"10.3390\/s22176595","type":"journal-article","created":{"date-parts":[[2022,9,1]],"date-time":"2022-09-01T03:55:38Z","timestamp":1662004538000},"page":"6595","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["MEST: An Action Recognition Network with Motion Encoder and Spatio-Temporal Module"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2028-048X","authenticated-orcid":false,"given":"Yi","family":"Zhang","sequence":"first","affiliation":[{"name":"Department of Computer Science, Sichuan University, Chengdu 610017, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,9,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Xiao, J., Jing, L., Zhang, L., He, J., She, Q., Zhou, Z., Yuille, A., and Li, Y. (2022, January 18\u201322). Learning from temporal gradient for semi-supervised action recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52688.2022.00325"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016, January 11\u201316). Convolutional Two-Stream Network Fusion for Video Action Recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2016.213"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Yan, A., Wang, Y., Li, Z., and Qiao, Y. (2019, January 15\u201320). PA3D: Pose-Action 3D Machine for Video Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00811"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Tran, D., Wang, H., Torresani1, L., Ray, J., LeCun, Y., and Paluri, M. (2018, January 18\u201322). A Closer Look at Spatiotemporal Convolutions for Action Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake, UT, USA.","DOI":"10.1109\/CVPR.2018.00675"},{"key":"ref_5","unstructured":"Ji, L., Gan, C., and Han, S. (November, January 27). TSM: Temporal Shift Module for Efficient Video Understanding. Proceedings of the 2019 IEEE\/CVF International Conference on Computer Vision (ICCV), Seoul, Korea."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Li, Y., Ji, B., and Shi, X. (2020, January 13\u201319). TEA: Temporal excitation and aggregation for action recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00099"},{"key":"ref_7","unstructured":"Luo, C., and Yuille, A.L. (November, January 27). Grouped spatial-temporal aggregation for efficient action recognition. Proceedings of the International Conference of Computer Vision (ICCV), Seoul, Korea."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Zolfaghari, M., Singh, K., and Brox, T. (2018, January 8\u201314). ECO: Efficient convolutional network for online video understanding. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01216-8_43"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Carreira, J., and Zisserman, A. (2017, January 21\u201327). Quo vadis, Action recognition? A new model and the kinetics dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.502"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wang, X., Girshick, R., Gupta, A., and He, K. (2018, January 18\u201320). Non-local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake, UT, USA.","DOI":"10.1109\/CVPR.2018.00813"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Wang, W.L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., and van Gool, L. (2016). Temporal Segment Networks: Towards good practices for deep action recognition. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Douglas Chai, M.B.S. (2021). RGB-D Data-Based Action Recognition: A Review. Sensors, 21.","DOI":"10.3390\/s21124246"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Wang, S., Guan, S., Lin, H., Huang, J., Long, F., and Yao, J. (2022). Micro-Expression Recognition Based on Optical Flow and PCANet+. Sensors, 22.","DOI":"10.3390\/s22114296"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7\u201312). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_15","unstructured":"Tran, D., Ray, J., and Shou, Z. (2017). Convnet architecture search for spatiotemporal feature learning. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Huang, L., Li, Y., Wang, X., Wang, H., and Chaddad, A.B.A. (2022). Gaze Estimation Approach Using Deep Differential Residual Network. Sensors, 22.","DOI":"10.3390\/s22145462"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Qiu, Z., Yao, T., and Mei, T. (2017, January 22\u201329). Learning Spatio-temporal representation with pseudo-3d residual networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.590"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Dong, M., Fang, Z., Li, Y., Bi, S., and Chen, J. (2021). AR3D: Attention Residual 3D Network for Human Action Recognition. Sensors, 21.","DOI":"10.3390\/s21051656"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Sun, S., Lei, L., Liu, H., and Xie, H. (2021). STAC: Spatial-Temporal Attention on Compensation Information for Activity Recognition in FPV. Sensors, 21.","DOI":"10.3390\/s21041106"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Po, L.-M., Xiong, J., Rehman, Y.A.U., and Cheung, K.W. (2021). ASNet: Auto-Augmented Siamese Neural Network for Action Recognition. Sensors, 21.","DOI":"10.3390\/s21144720"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"728","DOI":"10.1109\/TETCI.2021.3079966","article-title":"Evolutionary Dual-Ensemble Class Imbalance Learning for Human Activity Recognition","volume":"6","author":"Guo","year":"2022","journal-title":"IEEE Trans. Emerg. Top. Comput. Intell."},{"key":"ref_22","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 7\u20139). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning (PMLR), Lille, France."},{"key":"ref_23","unstructured":"Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). Layer normalization. arXiv."},{"key":"ref_24","unstructured":"Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv."},{"key":"ref_25","unstructured":"Qiao, S., Wang, H., Liu, C., Shen, W., and Yuille, A. (2019). Micro-Batch Training with Batch-Channel Normalization and Weight Standardization. arXiv."},{"key":"ref_26","unstructured":"Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018, January 3\u20138). How does batch normalization help optimization?. Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, QC, Canada."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Goyal, R., Kahou, S.E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Frund, I., Yianilos, P., and Freitag, M. (2017, January 22\u201329). The \u201cSomething Something\u201d video database for learning and evaluating visual common sense. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.622"},{"key":"ref_28","unstructured":"Materzynska, J., Berger, G., Bax, I., and Memisevic, R. (November, January 27). The Jester dataset: A large-scale video dataset of human gestures. Proceedings of the IEEE International Conference on Computer Vision Workshops, Seoul, Korea."},{"key":"ref_29","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (2012). A Dataset of 101 Human Action Classes from Videos in the Wild, Center for Research in Computer Vision."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Kuehne, H., Jhuang, H., and Garrote, E. (2011, January 25\u201327). HMDB: A large video database for human motion recognition. Proceedings of the 2011 International Conference on Computer Vision, Tokyo, Japan.","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Zhou, B., Andonian, A., Oliva, A., and Torralba, A. (2018, January 8\u201314). Temporal relational reasoning in videos. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01246-5_49"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Liu, Z., Wang, L., Wu, W., Qian, C., and Lu, T. (2020). TAM: Temporal adaptive module for video recognition. arXiv.","DOI":"10.1109\/ICCV48922.2021.01345"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Li, X., Wang, Y., and Zhou, Z. (2020, January 13\u201319). SmallBignet: Integrating core and contextual views for video classification. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00117"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Liu, X., Lee, J.Y., and Jin, H. (2019, January 15\u201320). Learning video representations from correspondence proposals. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00440"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Wang, L., Li, W., Li, W., and van Gool, L. (2018, January 18\u201322). Appearance-and-relation networks for video classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake, UT, USA.","DOI":"10.1109\/CVPR.2018.00155"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"336","DOI":"10.1007\/s11263-019-01228-7","article-title":"Grad-Cam: Visual explanations from deep networks via gradient-based localization","volume":"128","author":"Selvaraju","year":"2020","journal-title":"Int. J. Comput. Vis."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/17\/6595\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:21:25Z","timestamp":1760142085000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/17\/6595"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,1]]},"references-count":36,"journal-issue":{"issue":"17","published-online":{"date-parts":[[2022,9]]}},"alternative-id":["s22176595"],"URL":"https:\/\/doi.org\/10.3390\/s22176595","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,9,1]]}}}