{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,8]],"date-time":"2026-02-08T06:26:47Z","timestamp":1770532007030,"version":"3.49.0"},"reference-count":30,"publisher":"SAGE Publications","issue":"3","license":[{"start":{"date-parts":[[2021,2,16]],"date-time":"2021-02-16T00:00:00Z","timestamp":1613433600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Journal of Intelligent &amp; Fuzzy Systems"],"published-print":{"date-parts":[[2021,10,14]]},"abstract":"<jats:p>In deep learning-based video action recognition, the neural network must acquire spatial information, motion information, and the associations between these two kinds of information over uneven time spans. This paper puts forward a network that extracts video sequence semantic information through deep integration of local Spatial-Temporal information. The network uses a 2D Convolutional Neural Network (2DCNN) and a Multi Spatial-Temporal scale 3D Convolutional Neural Network (MST_3DCNN) to extract spatial information and motion information, respectively. Spatial information and motion information from the same time quantum are fused by 3D convolution to generate the temporary Spatial-Temporal information of a single moment. The Spatial-Temporal information of multiple single moments then enters a Temporal Pyramid Net (TPN) to generate local Spatial-Temporal information at multiple time scales. Finally, a bidirectional recurrent neural network acts on all the local Spatial-Temporal information to acquire context information spanning the entire video, endowing the network with video context extraction capability.
Experiments on three common video action recognition data sets, UCF101, UCF11, and UCFSports, show that the Spatial-Temporal information deep fusion network proposed in this paper achieves a high recognition rate on the video action recognition task.<\/jats:p>","DOI":"10.3233\/jifs-189714","type":"journal-article","created":{"date-parts":[[2021,2,16]],"date-time":"2021-02-16T11:46:11Z","timestamp":1613475971000},"page":"4533-4545","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":3,"title":["Multi-scale spatial-temporal information deep fusion network with temporal pyramid mechanism for video action recognition"],"prefix":"10.1177","volume":"41","author":[{"given":"Hongshi","family":"Ou","sequence":"first","affiliation":[{"name":"South China University of Technology, School of Electronic and Information Engineering, Guangzhou, China"}]},{"given":"Jifeng","family":"Sun","sequence":"additional","affiliation":[{"name":"South China University of Technology, School of Electronic and Information Engineering, Guangzhou, China"}]}],"member":"179","published-online":{"date-parts":[[2021,2,16]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2016.04.004"},{"key":"e_1_3_2_3_2","doi-asserted-by":"crossref","unstructured":"KarA. RaiN. SikkaK. and SharmaG. AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos in CVPR (2017) 5699\u20135708.","DOI":"10.1109\/CVPR.2017.604"},{"key":"e_1_3_2_4_2","unstructured":"DibaA. PazandehA.M. and GoolL.V. Efficient Two-Stream Motion and Appearance 3D CNNs for Video Classification CS.CV. (2016) 1\u20134."},{"key":"e_1_3_2_5_2","unstructured":"ZhuY. LanZ.Z. NewsamS. and HauptmannA.G. Hidden Two-Stream Convolutional Networks for Action Recognition CS.CV. (2017) 1\u201312."},{"key":"e_1_3_2_6_2","unstructured":"MaC.-Y. ChenM.-H. KiraZ. and AlRegibG. 
TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition CS.CV. (2017) 1011\u20131020."},{"key":"e_1_3_2_7_2","first-page":"1","article-title":"Learning realistic human actions from movies","volume":"1","author":"Laptev I.","year":"2008","unstructured":"LaptevI., Marsza\u0142ekM., SchmidC. and RozenfeldB., Learning realistic human actions from movies, In Proc. CVPR1 (2008), 1\u20138.","journal-title":"In Proc. CVPR"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2016.06.007"},{"key":"e_1_3_2_9_2","doi-asserted-by":"crossref","unstructured":"FeichtenhoferC. PinzA. and ZissermanA. Convolutional two-stream network fusion for video action recognition CVPR (2016) 1933\u20131941.","DOI":"10.1109\/CVPR.2016.213"},{"key":"e_1_3_2_10_2","doi-asserted-by":"crossref","unstructured":"WangY. SongJ. WangL. GoolL.V. and HilligesO. Two-stream sr-cnns for action recognition in videos In BMVC (2016) 108.1\u2013108.12.","DOI":"10.5244\/C.30.108"},{"key":"e_1_3_2_11_2","doi-asserted-by":"crossref","unstructured":"TranD. BourdevL. FergusR. TorresaniL. and PaluriM. Learning spatiotemporal features with 3D convolutional networks In Proc. ICCV (2015) 4489\u20134497.","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_12_2","doi-asserted-by":"crossref","unstructured":"SunL. JiaK. YeungD.-Y. and ShiB. Human action recognition using factorized spatio-temporal convolutional networks In Proc. ICCV (2015) 815\u2013823.","DOI":"10.1109\/ICCV.2015.522"},{"key":"e_1_3_2_13_2","unstructured":"SimonyanK. and ZissermanA. Two-stream convolutional networks for action recognition in videos CS.CV. (2014) 1\u201311."},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","unstructured":"ZhuC. ZhengY. LuuK. LeT.H.N. BhagavatulaC. and SavvidesM. 
Weakly supervised facial analysis with dense hyper-column features In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2016) 93\u2013101.","DOI":"10.1109\/CVPRW.2016.19"},{"key":"e_1_3_2_15_2","doi-asserted-by":"crossref","unstructured":"KarA. RaiN. SikkaK. and SharmaG. AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos in CVPR (2017) 5699\u20135708.","DOI":"10.1109\/CVPR.2017.604"},{"key":"e_1_3_2_16_2","unstructured":"SharmaS. KirosR. and SalakhutdinovR. Action recognition using visual attention CS.CV (2015) 186\u2013192. 73"},{"key":"e_1_3_2_17_2","unstructured":"YaoL. TorabiA. ChoK. BallasN. PalC.J. LarochelleH. and CourvilleA.C. Describing videos by exploiting temporal structure In ICCV pages 4507\u20134515. IEEE Computer Society (2015) 4507\u20134515."},{"key":"e_1_3_2_18_2","doi-asserted-by":"crossref","unstructured":"KarpathyA. TodericiG. ShettyS. LeungT. SukthankarR. and LiF.-F. Large-scale video classification with convolutional neural networks In CVPR 1725\u20131732. IEEE Computer Society 2014.","DOI":"10.1109\/CVPR.2014.223"},{"key":"e_1_3_2_19_2","unstructured":"Human Motion Recognition ICCV 2011."},{"key":"e_1_3_2_20_2","doi-asserted-by":"crossref","unstructured":"SoomroK. and ZamirA.R. Action recognition in realistic sports videos In Computer Vision in Sports Springer 2014.","DOI":"10.1007\/978-3-319-09396-3_9"},{"key":"e_1_3_2_21_2","unstructured":"SoomroK. ZamirA.R. and ShahM. Ucf101: A dataset of 101 human actions classes from videos in the wild CS.CV (2012) 6\u20138."},{"key":"e_1_3_2_22_2","unstructured":"SimonyanK. and ZissermanA. Very deep convolutional networks for large-scale image recognition CoRR abs\/1409.1556 2014."},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","unstructured":"DonahueJ. HendricksL.A. GuadarramaS. RohrbachM. VenugopalanS. SaenkoK. and DarrellT. 
Long-term recurrent convolutional networks for visual recognition and description In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 2625\u20132634.","DOI":"10.1109\/CVPR.2015.7298878"},{"key":"e_1_3_2_24_2","unstructured":"WangH. Kl\u00e4serA. SchmidC. and LiuC.-L. Action recognition by dense trajectories In CVPR 3169\u20133176. IEEE Computer Society 2011."},{"key":"e_1_3_2_25_2","unstructured":"SrivastavaN. MansimovE. and SalakhutdinovR. Unsupervised learning of video representations using LSTMs In Proc. ICML (2015) 843\u2013852."},{"key":"e_1_3_2_26_2","doi-asserted-by":"crossref","unstructured":"TranD. BourdevL. FergusR. TorresaniL. and PaluriM. Learning spatio-temporal features with 3D convolutional networks In Proc. ICCV (2015) 4489\u20134497.","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_27_2","unstructured":"RavanbakhshM. MousaviH. RastegariM. MurinoV. and DavisL.S. Action recognition with image based CNN features CS.CV (2015) 1512\u20131521."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2013.12.004"},{"key":"e_1_3_2_29_2","doi-asserted-by":"crossref","unstructured":"WeinzaepfelP. HarchaouiZ. and SchmidC. Learning to track for spatio-temporal action localization In ICCV 2015 \u2013 IEEE International Conference on Computer Vision Santiago Chile (2015) 3164\u20133172.","DOI":"10.1109\/ICCV.2015.362"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1007\/s41095-016-0033-9"},{"key":"e_1_3_2_31_2","doi-asserted-by":"crossref","unstructured":"ZhuJ.G. ZouW. and ZhuZ. 
End-to-end Video-level Representation Learning for Action Recognition Computer Vision and Pattern Recognition 2017.","DOI":"10.1109\/ICPR.2018.8545710"}],"container-title":["Journal of Intelligent &amp; Fuzzy Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/JIFS-189714","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.3233\/JIFS-189714","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.3233\/JIFS-189714","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T00:58:48Z","timestamp":1769993928000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.3233\/JIFS-189714"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,2,16]]},"references-count":30,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,10,14]]}},"alternative-id":["10.3233\/JIFS-189714"],"URL":"https:\/\/doi.org\/10.3233\/jifs-189714","relation":{},"ISSN":["1064-1246","1875-8967"],"issn-type":[{"value":"1064-1246","type":"print"},{"value":"1875-8967","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,2,16]]}}}