{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,16]],"date-time":"2026-02-16T15:33:28Z","timestamp":1771256008005,"version":"3.50.1"},"reference-count":47,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2024,10,31]],"date-time":"2024-10-31T00:00:00Z","timestamp":1730332800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Neurorobot."],"abstract":"<jats:p>Over the past few years, a growing number of researchers have dedicated their efforts to focusing on temporal modeling. The advent of transformer-based methods has notably advanced the field of 2D image-based vision tasks. However, with respect to 3D video tasks such as action recognition, applying temporal transformations directly to video data significantly increases both computational and memory demands. This surge in resource consumption is due to the multiplication of data patches and the added complexity of self-aware computations. Accordingly, building efficient and precise 3D self-attentive models for video content represents as a major challenge for transformers. In our research, we introduce an Long and Short-term Temporal Difference Vision Transformer (LS-VIT). This method incorporates short-term motion details into images by weighting the difference across several consecutive frames, thereby equipping the original image with the ability to model short-term motions. Concurrently, we integrate a module designed to understand long-term motion details. This module enhances the model's capacity for long-term motion modeling by directly integrating temporal differences from various segments via motion excitation. Our thorough analysis confirms that the LS-VIT achieves high recognition accuracy across multiple benchmarks (e.g., UCF101, HMDB51, Kinetics-400). These research results indicate that LS-VIT has the potential for further optimization, which can improve real-time performance and action prediction capabilities.<\/jats:p>","DOI":"10.3389\/fnbot.2024.1457843","type":"journal-article","created":{"date-parts":[[2024,10,31]],"date-time":"2024-10-31T06:10:56Z","timestamp":1730355056000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference"],"prefix":"10.3389","volume":"18","author":[{"given":"Dong","family":"Chen","sequence":"first","affiliation":[]},{"given":"Peisong","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Mingdong","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Mengtao","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Tao","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Chuanqi","family":"Li","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2024,10,31]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"2496","DOI":"10.1109\/TNNLS.2022.3190367","article-title":"An effective video transformer with synchronized spatiotemporal and spatial self-attention for action recognition","volume":"35","author":"Alfasly","year":"2024","journal-title":"IEEE Trans. Neural Netw. Learn. 
Syst"},{"key":"B2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00676","article-title":"\u201cVivit: a video vision transformer,\u201d","author":"Arnab","year":"2021","journal-title":"2021 IEEE\/CVF International Conference on Computer Vision (ICCV)"},{"key":"B3","first-page":"813","article-title":"\u201cIs space-time attention all you need for video understanding?\u201d","author":"Bertasius","year":"2021","journal-title":"Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research"},{"key":"B4","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502","article-title":"\u201cQuo vadis, action recognition? A new model and the kinetics dataset,\u201d","author":"Carreira","year":"2017","journal-title":"2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"B5","doi-asserted-by":"publisher","first-page":"1061","DOI":"10.3390\/app14031061","article-title":"A multi-scale video longformer network for action recognition","volume":"14","author":"Chen","year":"2024","journal-title":"Appl. Sci"},{"key":"B6","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00041","article-title":"\u201cCrossvit: cross-attention multi-scale vision transformer for image classification,\u201d","author":"Chen","year":"2021","journal-title":"2021 IEEE\/CVF International Conference on Computer Vision (ICCV)"},{"key":"B7","first-page":"578","article-title":"\u201cTVM: an automated End-to-End optimizing compiler for deep learning,\u201d","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Chen","year":"2018"},{"key":"B8","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848","article-title":"\u201cImagenet: a large-scale hierarchical image database,\u201d","author":"Deng","year":"2009","journal-title":"2009 IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B9","article-title":"\u201cAn image is worth 16x16 words: transformers for image recognition at scale,\u201d","author":"Dosovitskiy","year":"2021","journal-title":"9th International Conference on Learning Representations, ICLR 2021"},{"key":"B10","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00630","article-title":"\u201cSlowfast networks for video recognition,\u201d","author":"Feichtenhofer","year":"2018","journal-title":"2019 IEEE\/CVF International Conference on Computer Vision (ICCV)"},{"key":"B11","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.787","article-title":"\u201cSpatiotemporal residual networks for video action recognition,\u201d","author":"Feichtenhofer","year":"","journal-title":"Advances in Neural Information Processing Systems"},{"key":"B12","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.213","article-title":"\u201cConvolutional two-stream network fusion for video action recognition,\u201d","author":"Feichtenhofer","year":"","journal-title":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"B13","first-page":"15908","article-title":"\u201cTransformer in transformer,\u201d","author":"Han","year":"2021","journal-title":"Advances in Neural Information Processing Systems"},{"key":"B14","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00209","article-title":"\u201cStm: spatiotemporal and motion encoding for action recognition,\u201d","author":"Jiang","year":"2019","journal-title":"2019 IEEE\/CVF International Conference on Computer Vision 
(ICCV)"},{"key":"B15","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.223","article-title":"\u201cLarge-scale video classification with convolutional neural networks,\u201d","author":"Karpathy","year":"2014","journal-title":"2014 IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B16","doi-asserted-by":"publisher","first-page":"571","DOI":"10.1007\/978-3-642-33374-3_41","article-title":"\u201cHmdb51: a large video database for human motion recognition,\u201d","author":"Kuehne","year":"2013","journal-title":"High Performance Computing in Science and Engineering"},{"key":"B17","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00099","article-title":"\u201cTea: temporal excitation and aggregation for action recognition,\u201d","author":"Li","year":"2020","journal-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"B18","doi-asserted-by":"publisher","first-page":"5174","DOI":"10.1109\/TCSVT.2023.3250646","article-title":"Spatio-temporal adaptive network with bidirectional temporal difference for action recognition","volume":"33","author":"Li","year":"2023","journal-title":"IEEE Trans. Circ. Syst. Video Technol"},{"key":"B19","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00718","article-title":"\u201cTsm: temporal shift module for efficient video understanding,\u201d","author":"Lin","year":"2019","journal-title":"2019 IEEE\/CVF International Conference on Computer Vision (ICCV)"},{"key":"B20","doi-asserted-by":"publisher","first-page":"4104","DOI":"10.1109\/TIP.2022.3180585","article-title":"Motion-driven visual tempo learning for video-based action recognition","volume":"31","author":"Liu","year":"2022","journal-title":"IEEE Trans. Image Proc"},{"key":"B21","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986","article-title":"\u201cSwin transformer: Hierarchical vision transformer using shifted windows,\u201d","author":"Liu","year":"2021","journal-title":"2021 IEEE\/CVF International Conference on Computer Vision (ICCV)"},{"key":"B22","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6836","article-title":"\u201cTeinet: towards an efficient architecture for video recognition,\u201d","author":"Liu","year":"2020","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"B23","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00320","article-title":"\u201cVideo swin transformer,\u201d","author":"Liu","year":"2022","journal-title":"2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"B24","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00558","article-title":"\u201cAction recognition with spatial-temporal discriminative filter banks,\u201d","author":"Mart\u00ednez","year":"2019","journal-title":"2019 IEEE\/CVF International Conference on Computer Vision (ICCV)"},{"key":"B25","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01018","article-title":"\u201cRepresentation flow for action recognition,\u201d","author":"Piergiovanni","year":"2018","journal-title":"2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"B26","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.590","article-title":"\u201cLearning spatio-temporal representation with pseudo-3D residual networks,\u201d","author":"Qiu","year":"2017","journal-title":"2017 IEEE International Conference on Computer Vision 
(ICCV)"},{"key":"B27","doi-asserted-by":"publisher","first-page":"346","DOI":"10.1007\/s11263-015-0851-8","article-title":"Recognizing fine-grained and composite activities using hand-centric features and script data","volume":"119","author":"Rohrbach","year":"2015","journal-title":"Int. J. Comput. Vis"},{"key":"B28","first-page":"568","article-title":"\u201cTwo-stream convolutional networks for action recognition in videos,\u201d","author":"Simonyan","year":"2014","journal-title":"Proceedings of the 27th International Conference on Neural Information Processing Systems"},{"key":"B29","article-title":"Ucf101: a dataset of 101 human actions classes from videos in the wild","author":"Soomro","year":"2012","journal-title":"ArXiv, abs\/1212.0402"},{"key":"B30","article-title":"\u201cTraining data-efficient image transformers &distillation through attention,\u201d","author":"Touvron","year":"2020","journal-title":"International Conference on Machine Learning"},{"key":"B31","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510","article-title":"\u201cLearning spatiotemporal features with 3D convolutional networks,\u201d","author":"Tran","year":"2015","journal-title":"2015 IEEE International Conference on Computer Vision (ICCV)"},{"key":"B32","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00675","article-title":"\u201cA closer look at spatiotemporal convolutions for action recognition,\u201d","author":"Tran","year":"2017","journal-title":"2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B33","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00193","article-title":"\u201cTDN: temporal difference networks for efficient action recognition,\u201d","author":"Wang","year":"2021","journal-title":"2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"B34","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1007\/978-3-319-46484-8_2","article-title":"\u201cTemporal segment networks: towards good practices for deep action recognition,\u201d","author":"Wang","year":"2016","journal-title":"Computer Vision-ECCV 2016"},{"key":"B35","doi-asserted-by":"publisher","first-page":"2740","DOI":"10.1109\/TPAMI.2018.2868668","article-title":"Temporal segment networks for action recognition in videos","volume":"41","author":"Wang","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell"},{"key":"B36","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00813","article-title":"\u201cNon-local neural networks,\u201d","author":"Wang","year":"2018","journal-title":"2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B37","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20062-5_36","article-title":"\u201cSpatiotemporal self-attention modeling with temporal patch shift for action recognition,\u201d","author":"Xiang","year":"2022","journal-title":"Computer Vision-ECCV 2022"},{"key":"B38","doi-asserted-by":"publisher","first-page":"318","DOI":"10.1007\/978-3-030-01267-0_19","article-title":"\u201cRethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification,\u201d","author":"Xie","year":"2018","journal-title":"Computer Vision-ECCV 2018"},{"key":"B39","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1142\/S0218001423560116","article-title":"Cls-net: an action recognition algorithm based on channel-temporal information modeling","volume":"2356011","author":"Xue","year":"2023","journal-title":"Int. J. Pattern Recognit. Artif. 
Intell"},{"key":"B40","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00067","article-title":"\u201cTemporal pyramid network for action recognition,\u201d","author":"Yang","year":"2020","journal-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)"},{"key":"B41","doi-asserted-by":"publisher","first-page":"e11401","DOI":"10.1016\/j.heliyon.2022.e11401","article-title":"Human action recognition method based on motion excitation and temporal aggregation module","volume":"8","author":"Ye","year":"2022","journal-title":"Heliyon"},{"key":"B42","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00060","article-title":"\u201cTokens-to-token vit: Training vision transformers from scratch on imagenet,\u201d","author":"Yuan","year":"2021","journal-title":"2021 IEEE\/CVF International Conference on Computer Vision (ICCV)"},{"key":"B43","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475272","article-title":"\u201cToken shift transformer for video classification,\u201d","author":"Zhang","year":"2021","journal-title":"Proceedings of the 29th ACM International Conference on Multimedia, MM '21"},{"key":"B44","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01332","article-title":"\u201cVIDTR: video transformer without convolutions,\u201d","author":"Zhang","year":"2021","journal-title":"2021 IEEE\/CVF International Conference on Computer Vision (ICCV)"},{"key":"B45","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00687","article-title":"\u201cRecognize actions by disentangling components of dynamics,\u201d","author":"Zhao","year":"2018","journal-title":"2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B46","doi-asserted-by":"publisher","first-page":"7970","DOI":"10.1109\/TIP.2020.3007826","article-title":"Dynamic sampling networks for efficient action recognition in videos","volume":"29","author":"Zheng","year":"2020","journal-title":"IEEE Trans. Image Proc"},{"key":"B47","doi-asserted-by":"publisher","first-page":"831","DOI":"10.1007\/978-3-030-01246-5_49","article-title":"\u201cTemporal relational reasoning in videos,\u201d","author":"Zhou","year":"2018","journal-title":"Computer Vision-ECCV"}],"container-title":["Frontiers in Neurorobotics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2024.1457843\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,31]],"date-time":"2024-10-31T06:11:01Z","timestamp":1730355061000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2024.1457843\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,31]]},"references-count":47,"alternative-id":["10.3389\/fnbot.2024.1457843"],"URL":"https:\/\/doi.org\/10.3389\/fnbot.2024.1457843","relation":{},"ISSN":["1662-5218"],"issn-type":[{"value":"1662-5218","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,10,31]]},"article-number":"1457843"}}