{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,19]],"date-time":"2025-09-19T07:59:25Z","timestamp":1758268765257,"version":"3.37.3"},"reference-count":42,"publisher":"Springer Science and Business Media LLC","issue":"23","license":[{"start":{"date-parts":[[2021,7,11]],"date-time":"2021-07-11T00:00:00Z","timestamp":1625961600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,7,11]],"date-time":"2021-07-11T00:00:00Z","timestamp":1625961600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput &amp; Applic"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In the study of human action recognition, two-stream networks have made excellent progress recently. However, there remain challenges in distinguishing similar human actions in videos. This paper proposes a novel local-aware spatio-temporal attention network with multi-stage feature fusion based on compact bilinear pooling for human action recognition. To elaborate, taking two-stream networks as our essential backbones, the spatial network first employs multiple spatial transformer networks in a parallel manner to locate the discriminative regions related to human actions. Then, we perform feature fusion between the local and global features to enhance the human action representation. Furthermore, the output of the spatial network and the temporal information are fused at a particular layer to learn the pixel-wise correspondences. After that, we bring together three outputs to generate the global descriptors of human actions. 
To verify the efficacy of the proposed approach, comparison experiments are conducted against the traditional hand-engineered IDT algorithm, classical machine learning methods (e.g., SVM) and state-of-the-art deep learning methods (e.g., spatio-temporal multiplier networks). Our approach achieves the best performance among existing works, with accuracies of 95.3% and 72.9% on UCF101 and HMDB51, respectively. The experimental results thus demonstrate the superiority of the proposed architecture for human action recognition.<\/jats:p>","DOI":"10.1007\/s00521-021-06239-5","type":"journal-article","created":{"date-parts":[[2021,7,11]],"date-time":"2021-07-11T04:02:22Z","timestamp":1625976142000},"page":"16439-16450","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition"],"prefix":"10.1007","volume":"33","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9929-2650","authenticated-orcid":false,"given":"Yaqing","family":"Hou","sequence":"first","affiliation":[]},{"given":"Hua","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Dongsheng","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Pengfei","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Hongwei","family":"Ge","sequence":"additional","affiliation":[]},{"given":"Jianxin","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Qiang","family":"Zhang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,7,11]]},"reference":[{"key":"6239_CR1","doi-asserted-by":"crossref","unstructured":"Ch\u00e9ron G, Laptev I, Schmid C (2015) P-cnn: Pose-based cnn features for action recognition. 
In: Proceedings of the IEEE international conference on computer vision, pp. 3218\u20133226","DOI":"10.1109\/ICCV.2015.368"},{"key":"6239_CR2","doi-asserted-by":"crossref","unstructured":"Dai H, Shahzad M, Liu AX, Zhong Y (2016) Finding persistent items in data streams. Proceedings of the VLDB Endowment 10(4):289\u2013300","DOI":"10.14778\/3025111.3025112"},{"key":"6239_CR3","doi-asserted-by":"publisher","DOI":"10.1007\/11744047_33","volume-title":"Human detection using oriented histograms of flow and appearance","author":"N Dalal","year":"2006","unstructured":"Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. Springer, Berlin"},{"key":"6239_CR4","doi-asserted-by":"crossref","unstructured":"Deng J, Dong W, Socher R, Li L, Li K, Feifei L (2009) Imagenet: a large-scale hierarchical image database pp. 248\u2013255","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"6239_CR5","doi-asserted-by":"crossref","unstructured":"Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Darrell T, Saenko K (2015) Long-term recurrent convolutional networks for visual recognition and description pp. 2625\u20132634","DOI":"10.21236\/ADA623249"},{"issue":"99","key":"6239_CR6","first-page":"1347","volume":"27","author":"W Du","year":"2017","unstructured":"Du W, Wang Y, Yu Q (2017) Recurrent spatial-temporal attention network for action recognition in videos. IEEE Trans Image Process 27(99):1347\u20131360","journal-title":"IEEE Trans Image Process"},{"key":"6239_CR7","doi-asserted-by":"crossref","unstructured":"Feichtenhofer C, Pinz A, Wildes RP (2017) Spatiotemporal multiplier networks for video action recognition pp. 7445\u20137454","DOI":"10.1109\/CVPR.2017.787"},{"key":"6239_CR8","doi-asserted-by":"crossref","unstructured":"Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition pp. 
1933\u20131941","DOI":"10.1109\/CVPR.2016.213"},{"key":"6239_CR9","doi-asserted-by":"crossref","unstructured":"Gao Y, Beijbom O, Zhang N, Darrell T (2016) Compact bilinear pooling pp. 317\u2013326","DOI":"10.1109\/CVPR.2016.41"},{"issue":"14","key":"6239_CR10","doi-asserted-by":"publisher","first-page":"20533","DOI":"10.1007\/s11042-019-7404-z","volume":"78","author":"H Ge","year":"2019","unstructured":"Ge H, Yan Z, Yu W, Sun L (2019) An attention mechanism based convolutional lstm network for video action recognition. Multim Tools Appl 78(14):20533\u201320556","journal-title":"Multim Tools Appl"},{"key":"6239_CR11","unstructured":"Girdhar R, Ramanan D (2017) Attentional pooling for action recognition pp. 34\u201345"},{"key":"6239_CR12","doi-asserted-by":"crossref","unstructured":"Girdhar R, Ramanan D, Gupta A, Sivic J, Russell BC (2017) Actionvlad: learning spatio-temporal aggregation for action classification pp. 3165\u20133174","DOI":"10.1109\/CVPR.2017.337"},{"key":"6239_CR13","doi-asserted-by":"crossref","unstructured":"Hara K, Kataoka H, Satoh Y (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet pp. 6546\u20136555","DOI":"10.1109\/CVPR.2018.00685"},{"key":"6239_CR14","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition pp. 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"6239_CR15","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks pp. 630\u2013645","DOI":"10.1007\/978-3-319-46493-0_38"},{"key":"6239_CR16","unstructured":"Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv: Learning"},{"key":"6239_CR17","unstructured":"Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K (2015) Spatial transformer networks pp. 
2017\u20132025"},{"issue":"1","key":"6239_CR18","doi-asserted-by":"publisher","first-page":"221","DOI":"10.1109\/TPAMI.2012.59","volume":"35","author":"S Ji","year":"2013","unstructured":"Ji S, Xu W, Yang M, Yu K (2013) 3d convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221\u2013231","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"6239_CR19","doi-asserted-by":"crossref","unstructured":"Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Feifei L (2014) Large-scale video classification with convolutional neural networks pp. 1725\u20131732","DOI":"10.1109\/CVPR.2014.223"},{"key":"6239_CR20","doi-asserted-by":"crossref","unstructured":"Klaser A, Marszalek M, Schmid C (2008) A spatio-temporal descriptor based on 3d-gradients pp. 1\u201310","DOI":"10.5244\/C.22.99"},{"key":"6239_CR21","unstructured":"Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks pp. 1097\u20131105"},{"key":"6239_CR22","doi-asserted-by":"crossref","unstructured":"Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) Hmdb: a large video database for human motion recognition pp. 2556\u20132563","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"6239_CR23","doi-asserted-by":"crossref","unstructured":"Kuen J, Wang Z, Wang G (2016) Recurrent attentional networks for saliency detection pp. 3668\u20133677","DOI":"10.1109\/CVPR.2016.399"},{"key":"6239_CR24","doi-asserted-by":"crossref","unstructured":"Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies pp. 1\u20138","DOI":"10.1109\/CVPR.2008.4587756"},{"key":"6239_CR25","unstructured":"Li C, Zhong Q, Xie D, Pu S (2017) Skeleton-based action recognition with convolutional neural networks"},{"key":"6239_CR26","doi-asserted-by":"crossref","unstructured":"Lin T, Roychowdhury A, Maji S (2015) Bilinear cnn models for fine-grained visual recognition pp. 
1449\u20131457","DOI":"10.1109\/ICCV.2015.170"},{"issue":"3","key":"6239_CR27","doi-asserted-by":"publisher","first-page":"284","DOI":"10.3390\/robotics4030284","volume":"4","author":"S Mohammad","year":"2015","unstructured":"Mohammad S, Mircea N, Monica N, Banafsheh R (2015) Intent understanding using an activation spreading architecture. Robotics 4(3):284\u2013315","journal-title":"Robotics"},{"key":"6239_CR28","doi-asserted-by":"crossref","unstructured":"Perronnin F, Sanchez J, Mensink T (2010) Improving the fisher kernel for large-scale image classification 6314:143\u2013156","DOI":"10.1007\/978-3-642-15561-1_11"},{"key":"6239_CR29","doi-asserted-by":"crossref","unstructured":"Pham N, Pagh R (2013) Fast and scalable polynomial kernels via explicit feature maps pp. 239\u2013247","DOI":"10.1145\/2487575.2487591"},{"key":"6239_CR30","unstructured":"Shi X, Chen Z, Wang H, Yeung D, Wong W, Woo W (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting pp. 802\u2013810"},{"key":"6239_CR31","doi-asserted-by":"crossref","unstructured":"Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional lstm network for skeleton-based action recognition","DOI":"10.1109\/CVPR.2019.00132"},{"key":"6239_CR32","unstructured":"Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos pp. 568\u2013576"},{"key":"6239_CR33","unstructured":"Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition"},{"key":"6239_CR34","unstructured":"Soomro K, Zamir AR, Shah M (2012) Ucf101: A dataset of 101 human actions classes from videos in the wild. Computer Science"},{"key":"6239_CR35","unstructured":"Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using lstms pp. 
843\u2013852"},{"key":"6239_CR36","doi-asserted-by":"crossref","unstructured":"Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions pp. 1\u20139","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"6239_CR37","doi-asserted-by":"crossref","unstructured":"Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks","DOI":"10.1109\/ICCV.2015.510"},{"key":"6239_CR38","doi-asserted-by":"crossref","unstructured":"Wang H, Schmid C (2013) Action recognition with improved trajectories pp. 3551\u20133558","DOI":"10.1109\/ICCV.2013.441"},{"key":"6239_CR39","doi-asserted-by":"crossref","unstructured":"Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors pp. 4305\u20134314","DOI":"10.1109\/CVPR.2015.7299059"},{"key":"6239_CR40","doi-asserted-by":"crossref","unstructured":"Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van\u00a0Gool L (2016) Temporal segment networks: towards good practices for deep action recognition pp. 20\u201336","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"6239_CR41","doi-asserted-by":"crossref","unstructured":"Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks pp. 7794\u20137803","DOI":"10.1109\/CVPR.2018.00813"},{"key":"6239_CR42","doi-asserted-by":"crossref","unstructured":"Wang Y, Long M, Wang J, Yu PS (2017) Spatiotemporal pyramid network for video action recognition pp. 
2097\u20132106","DOI":"10.1109\/CVPR.2017.226"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-021-06239-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-021-06239-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-021-06239-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,11,3]],"date-time":"2021-11-03T18:19:41Z","timestamp":1635963581000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-021-06239-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,11]]},"references-count":42,"journal-issue":{"issue":"23","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["6239"],"URL":"https:\/\/doi.org\/10.1007\/s00521-021-06239-5","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"type":"print","value":"0941-0643"},{"type":"electronic","value":"1433-3058"}],"subject":[],"published":{"date-parts":[[2021,7,11]]},"assertion":[{"value":"13 January 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 June 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 July 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of 
interest"}}]}}