{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T17:24:08Z","timestamp":1761672248108,"version":"build-2065373602"},"reference-count":57,"publisher":"Institution of Engineering and Technology (IET)","issue":"11","license":[{"start":{"date-parts":[[2023,7,10]],"date-time":"2023-07-10T00:00:00Z","timestamp":1688947200000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":["ietresearch.onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["IET Image Processing"],"published-print":{"date-parts":[[2023,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Skeleton\u2010based neural networks have been considered a focus for human action recognition (HAR). It is noteworthy that the existing skeleton\u2010based methods are not capable of combining the spatial and temporal features reasonably to derive more effective high\u2010level representations, and it continues to be a challenging task of learning and representing the skeleton action discriminatively. In this study, a novel two\u2010stream spatiotemporal network (TSTN) is proposed, which is capable of processing the spatial and temporal features respectively and collectively to achieve a better representation and understanding of human action. The temporal branch stacks three gate recurrent unit (GRU) blocks in a new architecture to encode the temporal correlations from different aspects of human action, achieving high\u2010level temporal semantic feature expressions. The spatial branch encodes the spatial features with multi\u2010stacked graph convolutional network (GCN) blocks. Self\u2010attention mechanisms incorporated with the graph structure of the skeleton are explored to add weight influence and structural hints to further enhance the performance. 
The experimental results verify the effectiveness and superiority of the proposed model in skeleton action recognition; the model reaches state\u2010of\u2010the\u2010art on specific datasets.<\/jats:p>","DOI":"10.1049\/ipr2.12868","type":"journal-article","created":{"date-parts":[[2023,7,10]],"date-time":"2023-07-10T10:46:02Z","timestamp":1688985962000},"page":"3358-3370","update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Two\u2010stream spatiotemporal networks for skeleton action recognition"],"prefix":"10.1049","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0912-6059","authenticated-orcid":false,"given":"Lei","family":"Wang","sequence":"first","affiliation":[{"name":"School of Aeronautics and Astronautics Sichuan University Chengdu China"}]},{"given":"Jianwei","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Computer Science Sichuan University Chengdu China"}]},{"given":"Shanmin","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Computer Science Chengdu University of Information Technology Chengdu China"}]},{"given":"Song","family":"Gu","sequence":"additional","affiliation":[{"name":"School of Aeronautical Manufacturing Industry Chengdu Aeronautic Vocational and Technical College Chengdu China"}]}],"member":"265","published-online":{"date-parts":[[2023,7,10]]},"reference":[{"key":"e_1_2_9_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2014.04.011"},{"key":"e_1_2_9_3_1","doi-asserted-by":"crossref","unstructured":"Tsunoda T. et\u00a0al.:Football action recognition using hierarchical LSTM. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops(2017)","DOI":"10.1109\/CVPRW.2017.25"},{"key":"e_1_2_9_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-020-09904-8"},{"key":"e_1_2_9_5_1","doi-asserted-by":"crossref","unstructured":"Tran D. 
et\u00a0al.:A closer look at spatiotemporal convolutions for action recognition. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(2018)","DOI":"10.1109\/CVPR.2018.00675"},{"key":"e_1_2_9_6_1","doi-asserted-by":"crossref","unstructured":"Goyal R. et\u00a0al.:The \u201csomething something\u201d video database for learning and evaluating visual common sense. In:Proceedings of the IEEE International Conference on Computer Vision(2017)","DOI":"10.1109\/ICCV.2017.622"},{"key":"e_1_2_9_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2021.116424"},{"key":"e_1_2_9_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-018-6875-7"},{"key":"e_1_2_9_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-021-01470-y"},{"key":"e_1_2_9_10_1","unstructured":"Hussein M.E. et\u00a0al.:Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations. In:Twenty\u2010third International Joint Conference on Artificial Intelligence(2013)"},{"key":"e_1_2_9_11_1","doi-asserted-by":"crossref","unstructured":"Song S. et\u00a0al.:An end\u2010to\u2010end spatio\u2010temporal attention model for human action recognition from skeleton data. In:Proceedings of the AAAI Conference on Artificial Intelligence vol.31(1) (2017)","DOI":"10.1609\/aaai.v31i1.11212"},{"key":"e_1_2_9_12_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2017.02.030"},{"key":"e_1_2_9_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2015.11.019"},{"key":"e_1_2_9_14_1","doi-asserted-by":"crossref","unstructured":"Si C. et\u00a0al.:An attention enhanced graph convolutional LSTM network for skeleton\u2010based action recognition. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2019)","DOI":"10.1109\/CVPR.2019.00132"},{"key":"e_1_2_9_15_1","doi-asserted-by":"crossref","unstructured":"Yan S. Xiong Y. 
Lin D.:Spatial temporal graph convolutional networks for skeleton\u2010based action recognition. In:Proceedings of the AAAI Conference on Artificial Intelligence vol.32(1) (2018)","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"e_1_2_9_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2771306"},{"key":"e_1_2_9_17_1","doi-asserted-by":"crossref","unstructured":"Vemulapalli R. Chellappa R.:Rolling rotations for recognizing human actions from 3d skeletal data. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(2016)","DOI":"10.1109\/CVPR.2016.484"},{"key":"e_1_2_9_18_1","doi-asserted-by":"crossref","unstructured":"Vemulapalli R. Arrate F. Chellappa R.:Human action recognition by representing 3d skeletons as points in a lie group. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(2014)","DOI":"10.1109\/CVPR.2014.82"},{"key":"e_1_2_9_19_1","unstructured":"Du Y. Wang W. Wang L.:Hierarchical recurrent neural network for skeleton based action recognition. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(2015)"},{"key":"e_1_2_9_20_1","unstructured":"Thakkar K. Narayanan P.J.:Part\u2010based graph convolutional network for action recognition. arXiv preprint arXiv:1809.04983 (2018)"},{"key":"e_1_2_9_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2837386"},{"key":"e_1_2_9_22_1","doi-asserted-by":"crossref","unstructured":"Peng W. et\u00a0al.:Learning graph convolutional network for skeleton\u2010based human action recognition by neural searching. In:Proceedings of the AAAI Conference on Artificial Intelligence vol.34(03) (2020)","DOI":"10.1609\/aaai.v34i03.5652"},{"key":"e_1_2_9_23_1","doi-asserted-by":"crossref","unstructured":"Li M. et\u00a0al.:Actional\u2010structural graph convolutional networks for skeleton\u2010based action recognition. 
In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2019)","DOI":"10.1109\/CVPR.2019.00371"},{"key":"e_1_2_9_24_1","doi-asserted-by":"crossref","unstructured":"Liu J. et\u00a0al.:Spatio\u2010temporal LSTM with trust gates for 3d human action recognition. In:Computer Vision\u2013ECCV 2016: 14th European Conference Amsterdam The Netherlands 11\u201314 October 2016 Proceedings Part III 14.Springer International Publishing(2016)","DOI":"10.1007\/978-3-319-46487-9_50"},{"key":"e_1_2_9_25_1","doi-asserted-by":"crossref","unstructured":"Qi S. et\u00a0al.:Learning human\u2010object interactions by graph parsing neural networks. In:Proceedings of the European Conference on Computer Vision (ECCV)(2018)","DOI":"10.1007\/978-3-030-01240-3_25"},{"key":"e_1_2_9_26_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.image.2018.09.003"},{"key":"e_1_2_9_27_1","doi-asserted-by":"crossref","unstructured":"Zhou B. et\u00a0al.:Temporal relational reasoning in videos. In:Proceedings of the European Conference on Computer Vision (ECCV)(2018)","DOI":"10.1007\/978-3-030-01246-5_49"},{"key":"e_1_2_9_28_1","doi-asserted-by":"crossref","unstructured":"Kundu J.N. et\u00a0al.:Unsupervised feature learning of human actions as trajectories in pose embedding manifold. In:2019 IEEE Winter Conference on Applications of Computer Vision (WACV).IEEE(2019)","DOI":"10.1109\/WACV.2019.00160"},{"key":"e_1_2_9_29_1","doi-asserted-by":"crossref","unstructured":"Caetano C. et\u00a0al.:Skelemotion: A new representation of skeleton joint sequences based on motion information for 3d action recognition. In:2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) IEEE(2019)","DOI":"10.1109\/AVSS.2019.8909840"},{"key":"e_1_2_9_30_1","doi-asserted-by":"crossref","unstructured":"Si C. et\u00a0al.:Skeleton\u2010based action recognition with spatial reasoning and temporal stack learning. 
In:Proceedings of the European Conference on Computer Vision (ECCV)(2018)","DOI":"10.1007\/978-3-030-01246-5_7"},{"key":"e_1_2_9_31_1","doi-asserted-by":"crossref","unstructured":"Wang H. Wang L.:Modeling temporal dynamics and spatial configurations of actions using two\u2010stream recurrent neural networks. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(2017)","DOI":"10.1109\/CVPR.2017.387"},{"key":"e_1_2_9_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2023.3247075"},{"key":"e_1_2_9_33_1","doi-asserted-by":"crossref","unstructured":"Zhang J. et\u00a0al.:Mixste: Seq2seq mixed spatio\u2010temporal encoder for 3d human pose estimation in video. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2022)","DOI":"10.1109\/CVPR52688.2022.01288"},{"key":"e_1_2_9_34_1","doi-asserted-by":"publisher","DOI":"10.1049\/cit2.12012"},{"key":"e_1_2_9_35_1","doi-asserted-by":"crossref","unstructured":"Cho S. et\u00a0al.:Self\u2010attention network for skeleton\u2010based human action recognition. In:Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision(2020)","DOI":"10.1109\/WACV45572.2020.9093639"},{"key":"e_1_2_9_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3168137"},{"key":"e_1_2_9_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3175605"},{"key":"e_1_2_9_38_1","doi-asserted-by":"crossref","unstructured":"Su K. Liu X. Shlizerman E.:Predict & cluster: Unsupervised skeleton based action recognition. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2020)","DOI":"10.1109\/CVPR42600.2020.00965"},{"key":"e_1_2_9_39_1","doi-asserted-by":"crossref","unstructured":"Sun K. et\u00a0al.:Deep high\u2010resolution representation learning for human pose estimation. 
In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2019)","DOI":"10.1109\/CVPR.2019.00584"},{"key":"e_1_2_9_40_1","doi-asserted-by":"crossref","unstructured":"Cho K. et\u00a0al.:Learning phrase representations using RNN encoder\u2010decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)","DOI":"10.3115\/v1\/D14-1179"},{"key":"e_1_2_9_41_1","unstructured":"Veli\u010dkovi\u0107 P. et\u00a0al.:Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)"},{"key":"e_1_2_9_42_1","unstructured":"Kipf T. et\u00a0al.:Neural relational inference for interacting systems. In:International Conference on Machine Learning (PMLR)(2018)"},{"key":"e_1_2_9_43_1","doi-asserted-by":"crossref","unstructured":"Shahroudy A. et\u00a0al.:NTU RGB+D: A large scale dataset for 3d human activity analysis. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(2016)","DOI":"10.1109\/CVPR.2016.115"},{"key":"e_1_2_9_44_1","unstructured":"Kay W. et\u00a0al.:The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)"},{"key":"e_1_2_9_45_1","doi-asserted-by":"crossref","unstructured":"Zhang P. et\u00a0al.:View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In:Proceedings of the IEEE International Conference on Computer Vision(2017)","DOI":"10.1109\/ICCV.2017.233"},{"key":"e_1_2_9_46_1","doi-asserted-by":"crossref","unstructured":"Wang C. et\u00a0al.:Mancs: A multi\u2010task attentional network with curriculum sampling for person re\u2010identification. In:Proceedings of the European Conference on Computer Vision (ECCV)(2018)","DOI":"10.1007\/978-3-030-01225-0_23"},{"key":"e_1_2_9_47_1","doi-asserted-by":"crossref","unstructured":"Shi L. et\u00a0al.:Two\u2010stream adaptive graph convolutional networks for skeleton\u2010based action recognition. 
In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2019)","DOI":"10.1109\/CVPR.2019.01230"},{"key":"e_1_2_9_48_1","doi-asserted-by":"crossref","unstructured":"Shi L. et\u00a0al.:Skeleton\u2010based action recognition with directed graph neural networks. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2019)","DOI":"10.1109\/CVPR.2019.00810"},{"key":"e_1_2_9_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.3028207"},{"key":"e_1_2_9_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2818328"},{"key":"e_1_2_9_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/THMS.2018.2883001"},{"key":"e_1_2_9_52_1","doi-asserted-by":"crossref","unstructured":"Zhang P. et\u00a0al.:Semantics\u2010guided neural networks for efficient skeleton\u2010based human action recognition. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2020)","DOI":"10.1109\/CVPR42600.2020.00119"},{"key":"e_1_2_9_53_1","doi-asserted-by":"crossref","unstructured":"Cheng K. et\u00a0al.:Skeleton\u2010based action recognition with shift graph convolutional network. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2020)","DOI":"10.1109\/CVPR42600.2020.00026"},{"key":"e_1_2_9_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3129117"},{"key":"e_1_2_9_55_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2021.107921"},{"key":"e_1_2_9_56_1","doi-asserted-by":"crossref","unstructured":"Liu Z. et\u00a0al.:Disentangling and unifying graph convolutions for skeleton\u2010based action recognition. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2020)","DOI":"10.1109\/CVPR42600.2020.00022"},{"key":"e_1_2_9_57_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2021.02.001"},{"key":"e_1_2_9_58_1","doi-asserted-by":"crossref","unstructured":"Chi H.\u2010G. 
et\u00a0al.:Infogcn: Representation learning for human skeleton\u2010based action recognition. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2022)","DOI":"10.1109\/CVPR52688.2022.01955"}],"container-title":["IET Image Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/ipr2.12868","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T17:09:03Z","timestamp":1761671343000},"score":1,"resource":{"primary":{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/10.1049\/ipr2.12868"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,10]]},"references-count":57,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2023,9]]}},"alternative-id":["10.1049\/ipr2.12868"],"URL":"https:\/\/doi.org\/10.1049\/ipr2.12868","archive":["Portico"],"relation":{},"ISSN":["1751-9659","1751-9667"],"issn-type":[{"type":"print","value":"1751-9659"},{"type":"electronic","value":"1751-9667"}],"subject":[],"published":{"date-parts":[[2023,7,10]]},"assertion":[{"value":"2023-03-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-06-28","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-07-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}