{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T17:37:20Z","timestamp":1772818640291,"version":"3.50.1"},"reference-count":48,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2024,3,27]],"date-time":"2024-03-27T00:00:00Z","timestamp":1711497600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,3,27]],"date-time":"2024-03-27T00:00:00Z","timestamp":1711497600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Research and Application of edge computing Technology Based on TinyML","award":["KYP0222010"],"award-info":[{"award-number":["KYP0222010"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Process Lett"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>3D CNN networks model the temporal dimension of large action recognition datasets well and have made great progress in RGB-based video action recognition. However, previous 3D CNN models still face several difficulties: the convolutional kernels used for video feature extraction are typically designed and fixed in each layer of the network, which may not suit the diversity of data in action recognition tasks. In this paper, a new model called <jats:italic>Multipath Attention and Adaptive Gating Network<\/jats:italic> (MAAGN) is proposed. 
The core idea of MAAGN is to apply the <jats:italic>spatial difference module<\/jats:italic> (SDM) and the <jats:italic>multi-angle temporal attention module<\/jats:italic> (MTAM) in parallel at each layer of the multipath network to obtain spatial and temporal features, respectively, and then to fuse these spatial-temporal features dynamically with the <jats:italic>adaptive gating module<\/jats:italic> (AGM). SDM explores the spatial domain of action videos using attention-based difference operators, while MTAM explores the temporal domain in terms of both global and local timing. AGM is built on an adaptive gate unit whose value is determined by the input of each layer and is unique to that layer, dynamically fusing the spatial and temporal features in the paths of each layer of the multipath network. We construct the temporal network MAAGN, which performs competitively with or better than state-of-the-art methods in video action recognition, and we provide exhaustive experiments on several large datasets to demonstrate the effectiveness of our approach.<\/jats:p>","DOI":"10.1007\/s11063-024-11591-3","type":"journal-article","created":{"date-parts":[[2024,3,27]],"date-time":"2024-03-27T11:03:24Z","timestamp":1711537404000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Multipath Attention and Adaptive Gating Network for Video Action 
Recognition"],"prefix":"10.1007","volume":"56","author":[{"given":"Haiping","family":"Zhang","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1789-0575","authenticated-orcid":false,"given":"Zepeng","family":"Hu","sequence":"additional","affiliation":[]},{"given":"Dongjin","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Liming","family":"Guan","sequence":"additional","affiliation":[]},{"given":"Xu","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Conghao","family":"Ma","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,3,27]]},"reference":[{"key":"11591_CR1","doi-asserted-by":"crossref","unstructured":"Feichtenhofer C, Fan H, Malik J, He K (2019) Slowfast networks for video recognition. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 6202\u20136211","DOI":"10.1109\/ICCV.2019.00630"},{"key":"11591_CR2","doi-asserted-by":"crossref","unstructured":"Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7794\u20137803","DOI":"10.1109\/CVPR.2018.00813"},{"key":"11591_CR3","doi-asserted-by":"crossref","unstructured":"Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1725\u20131732","DOI":"10.1109\/CVPR.2014.223"},{"key":"11591_CR4","unstructured":"Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Adv Neural Inf Process Syst 27"},{"key":"11591_CR5","first-page":"20","volume-title":"European conference on computer vision","author":"L Wang","year":"2016","unstructured":"Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Gool LV (2016) Temporal segment networks: towards good practices for deep action recognition. 
European conference on computer vision. Springer, Cham, pp 20\u201336"},{"key":"11591_CR6","doi-asserted-by":"crossref","unstructured":"Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4305\u20134314","DOI":"10.1109\/CVPR.2015.7299059"},{"key":"11591_CR7","doi-asserted-by":"crossref","unstructured":"Goyal R, Ebrahimi Kahou S, Michalski V, Materzynska J, Westphal S, Kim H, Memisevic, R (2017) The something something video database for learning and evaluating visual common sense. In: Proceedings of the IEEE international conference on computer vision, pp 5842\u20135850","DOI":"10.1109\/ICCV.2017.622"},{"key":"11591_CR8","doi-asserted-by":"crossref","unstructured":"Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299\u20136308","DOI":"10.1109\/CVPR.2017.502"},{"key":"11591_CR9","doi-asserted-by":"crossref","unstructured":"Feichtenhofer C, Pinz A, Wildes RP (2017) Spatiotemporal multiplier networks for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4768\u20134777","DOI":"10.1109\/CVPR.2017.787"},{"key":"11591_CR10","doi-asserted-by":"crossref","unstructured":"Wang H, Tran D, Torresani L, Feiszli M (2020) Video modeling with correlation networks. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 352\u2013361","DOI":"10.1109\/CVPR42600.2020.00043"},{"key":"11591_CR11","doi-asserted-by":"crossref","unstructured":"Yang C, Xu Y, Shi J, Dai B, Zhou B (2020) Temporal pyramid network for action recognition. 
In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 591\u2013600","DOI":"10.1109\/CVPR42600.2020.00067"},{"key":"11591_CR12","doi-asserted-by":"crossref","unstructured":"Lin J, Gan C, Han S (2019) Tsm: temporal shift module for efficient video understanding. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 7083\u20137093","DOI":"10.1109\/ICCV.2019.00718"},{"key":"11591_CR13","doi-asserted-by":"crossref","unstructured":"Sudhakaran S, Escalera S, Lanz O (2020) Gate-shift networks for video action recognition. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 1102\u20131111","DOI":"10.1109\/CVPR42600.2020.00118"},{"key":"11591_CR14","doi-asserted-by":"crossref","unstructured":"Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6450\u20136459","DOI":"10.1109\/CVPR.2018.00675"},{"key":"11591_CR15","doi-asserted-by":"crossref","unstructured":"Xie S, Sun C, Huang J, Tu Z, Murphy K (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Proceedings of the European conference on computer vision (ECCV), pp 305\u2013321","DOI":"10.1007\/978-3-030-01267-0_19"},{"key":"11591_CR16","doi-asserted-by":"crossref","unstructured":"Zhou B, Andonian A, Oliva A, Torralba A (2018) Temporal relational reasoning in videos. In: Proceedings of the European conference on computer vision (ECCV), pp 803\u2013818","DOI":"10.1007\/978-3-030-01246-5_49"},{"key":"11591_CR17","doi-asserted-by":"crossref","unstructured":"Jiang B, Wang M, Gan W, Wu W, Yan J (2019) Stm: spatiotemporal and motion encoding for action recognition. 
In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 2000\u20132009","DOI":"10.1109\/ICCV.2019.00209"},{"key":"11591_CR18","doi-asserted-by":"publisher","first-page":"9532","DOI":"10.1109\/TIP.2020.3028207","volume":"29","author":"L Shi","year":"2020","unstructured":"Shi L, Zhang Y, Cheng J, Lu H (2020) Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Trans Image Process 29:9532\u20139545","journal-title":"IEEE Trans Image Process"},{"key":"11591_CR19","doi-asserted-by":"crossref","unstructured":"Wang L, Tong Z, Ji B, Wu G (2021) Tdn: temporal difference networks for efficient action recognition. In Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 1895\u20131904","DOI":"10.1109\/CVPR46437.2021.00193"},{"key":"11591_CR20","doi-asserted-by":"crossref","unstructured":"Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489\u20134497","DOI":"10.1109\/ICCV.2015.510"},{"key":"11591_CR21","doi-asserted-by":"crossref","unstructured":"Li Y, Ji B, Shi X, Zhang J, Kang B, Wang L (2020) Tea: temporal excitation and aggregation for action recognition. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 909\u2013918","DOI":"10.1109\/CVPR42600.2020.00099"},{"key":"11591_CR22","doi-asserted-by":"crossref","unstructured":"Liu Z, Luo D, Wang Y, Wang L, Tai Y, Wang C, Lu T (2020) Teinet: towards an efficient architecture for video recognition. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, No 07, pp 11669\u201311676","DOI":"10.1609\/aaai.v34i07.6836"},{"key":"11591_CR23","doi-asserted-by":"crossref","unstructured":"Chen Y, Dai X, Liu M, Chen D, Yuan L, Liu Z (2020) Dynamic convolution: attention over convolution kernels. 
In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 11030\u201311039","DOI":"10.1109\/CVPR42600.2020.01104"},{"key":"11591_CR24","unstructured":"Yang B, Bender G, Le QV, Ngiam J (2019) Condconv: conditionally parameterized convolutions for efficient inference. Adv Neural Inf Process Syst, 32"},{"key":"11591_CR25","doi-asserted-by":"crossref","unstructured":"Liu Z, Wang L, Wu W, Qian C, Lu T (2021) Tam: temporal adaptive module for video recognition. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 13708\u201313718","DOI":"10.1109\/ICCV48922.2021.01345"},{"key":"11591_CR26","unstructured":"Huang Z, Zhang S, Pan L, Qing Z, Tang M, Liu Z, Ang Jr MH (2021) TAda! temporally-adaptive convolutions for video understanding. arXiv preprint arXiv:2110.06178"},{"issue":"3","key":"11591_CR27","doi-asserted-by":"publisher","first-page":"201","DOI":"10.1038\/nrn755","volume":"3","author":"M Corbetta","year":"2002","unstructured":"Corbetta M, Shulman GL (2002) Control of goal-directed and stimulus-driven attention in the brain. Nat Rev Neurosci 3(3):201\u2013215","journal-title":"Nat Rev Neurosci"},{"issue":"11","key":"11591_CR28","doi-asserted-by":"publisher","first-page":"1254","DOI":"10.1109\/34.730558","volume":"20","author":"L Itti","year":"1998","unstructured":"Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell 20(11):1254\u20131259","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"11591_CR29","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin, I (2017) Attention is all you need. Adv Neural Inf Process Syst, 30"},{"key":"11591_CR30","doi-asserted-by":"crossref","unstructured":"Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132\u20137141","DOI":"10.1109\/CVPR.2018.00745"},{"key":"11591_CR31","doi-asserted-by":"crossref","unstructured":"Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X (2017) Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156\u20133164","DOI":"10.1109\/CVPR.2017.683"},{"key":"11591_CR32","doi-asserted-by":"crossref","unstructured":"Woo S, Park J, Lee JY, Kweon IS (2018) Cbam: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3\u201319","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"11591_CR33","doi-asserted-by":"crossref","unstructured":"Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2016) Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2921\u20132929","DOI":"10.1109\/CVPR.2016.319"},{"key":"11591_CR34","first-page":"818","volume-title":"European conference on computer vision","author":"MD Zeiler","year":"2014","unstructured":"Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. European conference on computer vision. Springer, Cham, pp 818\u2013833"},{"key":"11591_CR35","unstructured":"Wu F, Fan A, Baevski A, Dauphin YN, Auli M (2019) Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430"},{"key":"11591_CR36","doi-asserted-by":"crossref","unstructured":"Zolfaghari M, Singh K, Brox T (2018) Eco: efficient convolutional network for online video understanding. In: Proceedings of the European conference on computer vision (ECCV), pp 695\u2013712","DOI":"10.1007\/978-3-030-01216-8_43"},{"key":"11591_CR37","unstructured":"Bertasius G, Wang H, Torresani L (2021) Is space-time attention all you need for video understanding?. 
In: ICML, vol 2, No 3"},{"key":"11591_CR38","unstructured":"Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556"},{"key":"11591_CR39","unstructured":"Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR (2012)Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580"},{"key":"11591_CR40","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"11591_CR41","doi-asserted-by":"crossref","unstructured":"Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, pp 248\u2013255. IEEE","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"11591_CR42","doi-asserted-by":"publisher","first-page":"5491","DOI":"10.1109\/TIP.2020.2985219","volume":"29","author":"J Zhang","year":"2020","unstructured":"Zhang J, Shen F, Xu X, Shen HT (2020) Temporal reasoning graph for activity recognition. IEEE Trans Image Process 29:5491\u20135506","journal-title":"IEEE Trans Image Process"},{"key":"11591_CR43","doi-asserted-by":"crossref","unstructured":"Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich, A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1\u20139","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"11591_CR44","unstructured":"Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, pp 448-456. 
PMLR"},{"key":"11591_CR45","doi-asserted-by":"crossref","unstructured":"Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp 618\u2013626","DOI":"10.1109\/ICCV.2017.74"},{"key":"11591_CR46","doi-asserted-by":"crossref","unstructured":"Luo C, Yuille AL (2019) Grouped spatial-temporal aggregation for efficient action recognition. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 5512\u20135521","DOI":"10.1109\/ICCV.2019.00561"},{"key":"11591_CR47","doi-asserted-by":"crossref","unstructured":"Wang Z, She Q, Smolic A (2021) Action-net: multipath excitation for action recognition. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 13214\u201313223","DOI":"10.1109\/CVPR46437.2021.01301"},{"key":"11591_CR48","doi-asserted-by":"crossref","unstructured":"Lee M, Lee S, Son S, Park G, Kwak N (2018) Motion feature network: fixed motion filter for action recognition. 
In: Proceedings of the European conference on computer vision (ECCV), pp 387\u2013403","DOI":"10.1007\/978-3-030-01249-6_24"}],"container-title":["Neural Processing Letters"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11591-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11063-024-11591-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11591-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,16]],"date-time":"2024-05-16T20:42:59Z","timestamp":1715892179000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11063-024-11591-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,27]]},"references-count":48,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2024,4]]}},"alternative-id":["11591"],"URL":"https:\/\/doi.org\/10.1007\/s11063-024-11591-3","relation":{},"ISSN":["1573-773X"],"issn-type":[{"value":"1573-773X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,27]]},"assertion":[{"value":"1 March 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 March 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"124"}}