{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T02:59:02Z","timestamp":1774321142876,"version":"3.50.1"},"reference-count":70,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2022,3,4]],"date-time":"2022-03-04T00:00:00Z","timestamp":1646352000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2022,8,31]]},"abstract":"<jats:p>Action recognition has been a heated topic in computer vision for its wide application in vision systems. Previous approaches achieve improvement by fusing the modalities of the skeleton sequence and RGB video. However, such methods pose a dilemma between the accuracy and efficiency for the high complexity of the RGB video network. To solve the problem, we propose a multi-modality feature fusion network to combine the modalities of the skeleton sequence and RGB frame instead of the RGB video, as the key information contained by the combination of the skeleton sequence and RGB frame is close to that of the skeleton sequence and RGB video. In this way, complementary information is retained while the complexity is reduced by a large margin. To better explore the correspondence of the two modalities, a two-stage fusion framework is introduced in the network. In the early fusion stage, we introduce a skeleton attention module that projects the skeleton sequence on the single RGB frame to help the RGB frame focus on the limb movement regions. In the late fusion stage, we propose a cross-attention module to fuse the skeleton feature and the RGB feature by exploiting the correlation. Experiments on two benchmarks, NTU RGB+D and SYSU, show that the proposed model achieves competitive performance compared with the state-of-the-art methods while reducing the complexity of the network.<\/jats:p>","DOI":"10.1145\/3491228","type":"journal-article","created":{"date-parts":[[2022,3,4]],"date-time":"2022-03-04T10:26:32Z","timestamp":1646389592000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":37,"title":["Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9554-2133","authenticated-orcid":false,"given":"Xiaoguang","family":"Zhu","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Minhang, Shanghai, China"}]},{"given":"Ye","family":"Zhu","sequence":"additional","affiliation":[{"name":"Illinois Institute of Technology, Chicago, Illinois, U.S.A."}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4314-6099","authenticated-orcid":false,"given":"Haoyu","family":"Wang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Minhang, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4314-6099","authenticated-orcid":false,"given":"Honglin","family":"Wen","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Minhang, Shanghai, China"}]},{"given":"Yan","family":"Yan","sequence":"additional","affiliation":[{"name":"Illinois Institute of Technology, Chicago, Illinois, U.S.A."}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5321-2336","authenticated-orcid":false,"given":"Peilin","family":"Liu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Minhang, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2022,3,4]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"crossref","unstructured":"Fabien Baradel Christian Wolf Julien Mille and Graham W. Taylor. 2018. Glimpse clouds: Human activity recognition from unstructured feature points. In Computer Vision Foundation Salt Lake City UT USA June 18-22 . IEEE Computer Society 469\u2013478.","DOI":"10.1109\/CVPR.2018.00056"},{"issue":"4","key":"e_1_3_1_3_2","first-page":"119:1\u2013119:24","article-title":"Am I done? Predicting action progress in videos","volume":"16","author":"Becattini Federico","year":"2021","unstructured":"Federico Becattini, Tiberio Uricchio, Lorenzo Seidenari, Lamberto Ballan, and Alberto Del Bimbo. 2021. Am I done? Predicting action progress in videos. ACM Trans. Multim. Comput. Commun. Appl. 16, 4 (2021), 119:1\u2013119:24.","journal-title":"ACM Trans. Multim. Comput. Commun. Appl."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/SIBGRAPI.2019.00011"},{"key":"e_1_3_1_5_2","unstructured":"Jinmiao Cai Nianjuan Jiang Xiaoguang Han Kui Jia and Jiangbo Lu. 2021. JOLO-GCN: Mining joint-centered light-weight information for skeleton-based action recognition. In WACV Waikoloa HI USA January 3-8 . IEEE 2734\u20132743."},{"key":"e_1_3_1_6_2","doi-asserted-by":"crossref","unstructured":"Jo\u00e3o Carreira and Andrew Zisserman. 2017. Quo vadis action recognition? A new model and the kinetics dataset. In CVPR Honolulu HI USA July 21-26 . IEEE Computer Society 4724\u20134733.","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_1_7_2","doi-asserted-by":"crossref","unstructured":"Yunpeng Chen Yannis Kalantidis Jianshu Li Shuicheng Yan and Jiashi Feng. 2018. Multi-fiber networks for video recognition. In ECCV Munich Germany September 8-14 Vittorio Ferrari Martial Hebert Cristian Sminchisescu and Yair Weiss (Eds.).","DOI":"10.1007\/978-3-030-01246-5_22"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107321"},{"key":"e_1_3_1_9_2","doi-asserted-by":"crossref","unstructured":"Francois Chollet. 2017. Xception: Deep learning with depthwise separable convolutions. In CVPR Honolulu HI USA July 21-26 . IEEE Computer Society 1800\u20131807.","DOI":"10.1109\/CVPR.2017.195"},{"key":"e_1_3_1_10_2","doi-asserted-by":"crossref","unstructured":"Zewei Ding Pichao Wang Philip O. Ogunbona and Wanqing Li. 2017. Investigation of different skeleton features for CNN-based 3D action recognition. In ICME Workshops Hong Kong China July 10-14 . IEEE Computer Society 617\u2013622.","DOI":"10.1109\/ICMEW.2017.8026286"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2599174"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_1_13_2","unstructured":"Jianfang Hu Wei-Shi Zheng Jian-Huang Lai and Jianguo Zhang. 2015. Jointly learning heterogeneous features for RGB-D activity recognition. In CVPR Boston MA USA June 7-12 . IEEE Computer Society 5344\u20135352."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2640292"},{"key":"e_1_3_1_15_2","doi-asserted-by":"crossref","unstructured":"Jian-Fang Hu Wei-Shi Zheng Jiahui Pan Jianhuang Lai and Jianguo Zhang. 2018. Deep bilinear learning for RGB-D action recognition. In ECCV Munich Germany September 8-14 . Springer 346\u2013362.","DOI":"10.1007\/978-3-030-01234-2_21"},{"key":"e_1_3_1_16_2","doi-asserted-by":"crossref","unstructured":"Junqin Huang Zhenhuan Huang Xiang Xiang Xuan Gong and Baochang Zhang. 2020. Long-short graph memory network for skeleton-based action recognition. In WACV Snowmass Village CO USA March 1-5 . IEEE 634\u2013641.","DOI":"10.1109\/WACV45572.2020.9093598"},{"key":"e_1_3_1_17_2","unstructured":"Yanli Ji Feixiang Xu Yang Yang Ning Xie Heng Tao Shen and Tatsuya Harada. 2019. Attention transfer (ANT) network for view-invariant action recognition. In ACM MM Nice France October 21-25 . ACM 574\u2013582."},{"key":"e_1_3_1_18_2","doi-asserted-by":"crossref","unstructured":"Hamid Reza Vaezi Joze Amirreza Shaban Michael L. Iuzzolino and Kazuhito Koishida. 2020. MMTM: Multimodal transfer module for CNN fusion. In CVPR Seattle WA USA June 13-19 . Computer Vision Foundation\/IEEE 13286\u201313296.","DOI":"10.1109\/CVPR42600.2020.01330"},{"key":"e_1_3_1_19_2","unstructured":"Qiuhong Ke Mohammed Bennamoun Senjian An Ferdous Sohel and Farid Boussaid. 2017. A new representation of skeleton sequences for 3D action recognition. In CVPR Honolulu HI USA July 21-26 . IEEE Computer Society 4570\u20134579."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2937757"},{"key":"e_1_3_1_21_2","unstructured":"Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR Toulon France April 24-26 . OpenReview.net."},{"key":"e_1_3_1_22_2","unstructured":"Inwoong Lee Doyoung Kim Seoungyoon Kang and Sanghoon Lee. 2017. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks. In ICCV Venice Italy October 22\u201329 . IEEE Computer Society 1012\u20131020."},{"key":"e_1_3_1_23_2","volume-title":"IJCAI, 2018, Stockholm, Sweden, July 13\u201319.","author":"Li Chao","year":"2018","unstructured":"Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. 2018. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In IJCAI, 2018, Stockholm, Sweden, July 13\u201319. ijcai.org, 786\u2013792."},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107356"},{"key":"e_1_3_1_25_2","unstructured":"Maosen Li Siheng Chen Xu Chen Ya Zhang Yanfeng Wang and Qi Tian. 2019. Actional-structural graph convolutional networks for skeleton-based action recognition. In CVPR Long Beach CA USA June 16-20 . Computer Vision Foundation\/IEEE 3595\u20133603."},{"key":"e_1_3_1_26_2","unstructured":"Shuai Li Wanqing Li Chris Cook Ce Zhu and Yanbo Gao. 2018. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In CVPR Salt Lake City UT USA June 18-22 . Computer Vision Foundation\/IEEE Computer Society 5457\u20135466."},{"key":"e_1_3_1_27_2","doi-asserted-by":"crossref","unstructured":"Y. Li R. Xia X. Liu and Q. Huang. 2019. Learning shape-motion representations from geometric algebra spatio-temporal model for skeleton-based action recognition. In ICME Shanghai China July 8-12 . IEEE 1066\u20131071.","DOI":"10.1109\/ICME.2019.00187"},{"key":"e_1_3_1_28_2","doi-asserted-by":"crossref","unstructured":"Jun Liu Amir Shahroudy Dong Xu and Gang Wang. 2016. Spatio-temporal LSTM with trust gates for 3D human action recognition. In ECCV Amsterdam The Netherlands October 11-14 Vol. 9907. Springer 816\u2013833.","DOI":"10.1007\/978-3-319-46487-9_50"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3365212"},{"key":"e_1_3_1_30_2","doi-asserted-by":"crossref","unstructured":"Jun Liu Gang Wang Ping Hu Ling-Yu Duan and Alex C. Kot. 2017. Global context-aware attention LSTM networks for 3D action recognition. In CVPR Honolulu HI USA July 21-26 . IEEE Computer Society 3671\u20133680.","DOI":"10.1109\/CVPR.2017.391"},{"key":"e_1_3_1_31_2","unstructured":"Mengyuan Liu and Junsong Yuan. 2018. Recognizing human actions as the evolution of pose estimation maps. In CVPR Salt Lake City UT USA June 18-22 . Computer Vision Foundation\/IEEE Computer Society 1159\u20131168."},{"key":"e_1_3_1_32_2","doi-asserted-by":"crossref","unstructured":"Ziyu Liu Hongwen Zhang Zhenghao Chen Zhiyong Wang and Wanli Ouyang. 2020. Disentangling and unifying graph convolutions for skeleton-based action recognition. In CVPR Seattle WA USA June 13-19 . Computer Vision Foundation\/IEEE 140\u2013149.","DOI":"10.1109\/CVPR42600.2020.00022"},{"key":"e_1_3_1_33_2","doi-asserted-by":"crossref","unstructured":"Diogo C. Luvizon David Picard and Hedi Tabia. 2018. 2D\/3D pose estimation and action recognition using multitask deep learning. In CVPR Salt Lake City UT USA June 18-22 . Computer Vision Foundation\/IEEE Computer Society 5137\u20135146.","DOI":"10.1109\/CVPR.2018.00539"},{"key":"e_1_3_1_34_2","doi-asserted-by":"crossref","unstructured":"Juan-Manuel Perez-Rua Valentin Vielzeuf St\u00e9phane Pateux Moez Baccouche and Fr\u00e9d\u00e9ric Jurie. 2019. MFAS: Multimodal fusion architecture search. In CVPR Long Beach CA USA June 16-20 . Computer Vision Foundation\/IEEE 6966\u20136975.","DOI":"10.1109\/CVPR.2019.00713"},{"key":"e_1_3_1_35_2","doi-asserted-by":"crossref","unstructured":"Hossein Rahmani and Mohammed Bennamoun. 2017. Learning action recognition model from depth and skeleton videos. In ICCV Venice Italy October 22-29 . IEEE Computer Society 5833\u20135842.","DOI":"10.1109\/ICCV.2017.621"},{"key":"e_1_3_1_36_2","doi-asserted-by":"crossref","unstructured":"Konrad Schindler and Luc Van Gool. 2008. Action snippets: How many frames does human action recognition require? In CVPR Anchorage Alaska USA 24-26 June . IEEE Computer Society.","DOI":"10.1109\/CVPR.2008.4587730"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/78.650093"},{"key":"e_1_3_1_38_2","doi-asserted-by":"crossref","unstructured":"Amir Shahroudy Jun Liu Tian-Tsong Ng and Gang Wang. 2016. NTU RGB+D: A large scale dataset for 3D human activity analysis. In CVPR Las Vegas NV USA June 27-30 . IEEE Computer Society 1010\u20131019.","DOI":"10.1109\/CVPR.2016.115"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2691321"},{"key":"e_1_3_1_40_2","doi-asserted-by":"crossref","unstructured":"Lei Shi Yifan Zhang Jian Cheng and Hanqing Lu. 2019. Skeleton-based action recognition with directed graph neural networks. In CVPR Long Beach CA USA June 16-20 . Computer Vision Foundation\/IEEE 7912\u20137921.","DOI":"10.1109\/CVPR.2019.00810"},{"key":"e_1_3_1_41_2","doi-asserted-by":"crossref","unstructured":"L. Shi Y. Zhang J. Cheng and H. Lu. 2019. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In CVPR Long Beach CA USA June 16-20 . Computer Vision Foundation\/IEEE 12026\u201312035.","DOI":"10.1109\/CVPR.2019.01230"},{"key":"e_1_3_1_42_2","unstructured":"Chenyang Si Wentao Chen Wei Wang Liang Wang and Tieniu Tan. 2019. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In CVPR Long Beach CA USA June 16-20 . Computer Vision Foundation\/IEEE 1227\u20131236."},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2818328"},{"key":"e_1_3_1_44_2","doi-asserted-by":"crossref","unstructured":"Yansong Tang Yi Tian Jiwen Lu Peiyang Li and Jie Zhou. 2018. Deep progressive reinforcement learning for skeleton-based action recognition. In CVPR Salt Lake City UT USA June 18-22 . Computer Vision Foundation\/IEEE Computer Society 5323\u20135332.","DOI":"10.1109\/CVPR.2018.00558"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3300937"},{"key":"e_1_3_1_46_2","doi-asserted-by":"crossref","unstructured":"Du Tran Lubomir Bourdev Rob Fergus Lorenzo Torresani and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In ICCV Santiago Chile December 7-13 . IEEE Computer Society 4489\u20134497.","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_1_47_2","doi-asserted-by":"crossref","unstructured":"Vivek Veeriah Naifan Zhuang and Guo-Jun Qi. 2015. Differential recurrent neural networks for action recognition. In ICCV Santiago Chile December 7-13 . IEEE Computer Society 4041\u20134049.","DOI":"10.1109\/ICCV.2015.460"},{"key":"e_1_3_1_48_2","doi-asserted-by":"crossref","unstructured":"Raviteja Vemulapalli Felipe Arrate and Rama Chellappa. 2014. Human action recognition by representing 3D skeletons as points in a lie group. In CVPR Columbus OH USA June 23-28 . IEEE Computer Society 588\u2013595.","DOI":"10.1109\/CVPR.2014.82"},{"key":"e_1_3_1_49_2","doi-asserted-by":"crossref","unstructured":"Hongsong Wang and Liang Wang. 2017. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In CVPR Honolulu HI USA July 21-26 . IEEE Computer Society 3633\u20133642.","DOI":"10.1109\/CVPR.2017.387"},{"key":"e_1_3_1_50_2","doi-asserted-by":"crossref","unstructured":"Junwu Weng Mengyuan Liu Xudong Jiang and Junsong Yuan. 2018. Deformable pose traversal convolution for 3D action and gesture recognition. In ECCV Munich Germany September 8-14 Vol. 11211. Springer 142\u2013157.","DOI":"10.1007\/978-3-030-01234-2_9"},{"key":"e_1_3_1_51_2","doi-asserted-by":"crossref","unstructured":"Di Wu and Ling Shao. 2014. Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In CVPR Columbus OH USA June 23-28 . IEEE Computer Society 724\u2013731.","DOI":"10.1109\/CVPR.2014.98"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.image.2020.116098"},{"key":"e_1_3_1_53_2","unstructured":"Zuxuan Wu Yu-Gang Jiang Xi Wang Hao Ye and Xiangyang Xue. 2016. Multi-stream multi-class fusion of deep networks for video classification. In ACM MM Amsterdam The Netherlands October 15-19 . ACM 791\u2013800."},{"key":"e_1_3_1_54_2","doi-asserted-by":"crossref","unstructured":"Lu Xia Chia-Chih Chen and J. K. Aggarwal. 2012. View invariant human action recognition using histograms of 3D joints. In CVPR Workshops Providence RI USA June 16-21 . IEEE Computer Society 20\u201327.","DOI":"10.1109\/CVPRW.2012.6239233"},{"key":"e_1_3_1_55_2","volume-title":"IJCAI\u201918","author":"Xie Chunyu","year":"2018","unstructured":"Chunyu Xie, Ce Li, Baochang Zhang, Chen Chen, Jungong Han, and Jianzhuang Liu. 2018. Memory attention networks for skeleton-based action recognition. In IJCAI\u201918."},{"key":"e_1_3_1_56_2","unstructured":"Kelvin Xu Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C. Courville Ruslan Salakhutdinov Richard S. Zemel and Yoshua Bengio. 2015. Show attend and tell: Neural image caption generation with visual attention. In ICML Lille France 6-11 July Vol. 37. JMLR.org 2048\u20132057."},{"key":"e_1_3_1_57_2","doi-asserted-by":"crossref","unstructured":"Sijie Yan Yuanjun Xiong and Dahua Lin. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI New Orleans Louisiana USA February 2-7 Sheila A. McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press 7444\u20137452.","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3321511"},{"key":"e_1_3_1_59_2","doi-asserted-by":"crossref","unstructured":"Pengfei Zhang Cuiling Lan Junliang Xing Wenjun Zeng Jianru Xue and Nanning Zheng. 2017. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In ICCV Venice Italy October 22-29 . IEEE Computer Society 2136\u20132145.","DOI":"10.1109\/ICCV.2017.233"},{"key":"e_1_3_1_60_2","doi-asserted-by":"crossref","unstructured":"Pengfei Zhang Cuiling Lan Wenjun Zeng Junliang Xing Jianru Xue and Nanning Zheng. 2020. Semantics-guided neural networks for efficient skeleton-based human action recognition. In CVPR Seattle WA USA June 13-19 . Computer Vision Foundation \/ IEEE 1109\u20131118.","DOI":"10.1109\/CVPR42600.2020.00119"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2937724"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2802648"},{"key":"e_1_3_1_63_2","doi-asserted-by":"crossref","unstructured":"Xikun Zhang Chang Xu and Dacheng Tao. 2020. Context aware graph convolution for skeleton-based action recognition. In CVPR Seattle WA USA June 13-19 . Computer Vision Foundation\/IEEE 14321\u201314330.","DOI":"10.1109\/CVPR42600.2020.01434"},{"key":"e_1_3_1_64_2","unstructured":"Yu Zhang and Dit-Yan Yeung. 2011. Multi-task learning in heterogeneous feature spaces. In AAAI San Francisco California USA August 7-11 . AAAI Press."},{"key":"e_1_3_1_65_2","doi-asserted-by":"crossref","unstructured":"Liming Zhao Xi Li Yueting Zhuang and Jingdong Wang. 2017. Deeply-learned part-aligned representations for person re-identification. In ICCV Venice Italy October 22-29 . IEEE Computer Society 3239\u20133248.","DOI":"10.1109\/ICCV.2017.349"},{"key":"e_1_3_1_66_2","doi-asserted-by":"crossref","unstructured":"R. Zhao H. Ali and P. van der Smagt. 2017. Two-stream RNN\/CNN for action recognition in 3D videos. In IROS Vancouver BC Canada September 24-28 . IEEE 4260\u20134267.","DOI":"10.1109\/IROS.2017.8206288"},{"issue":"4","key":"e_1_3_1_67_2","first-page":"112:1\u2013112:20","article-title":"Unsupervised learning of human action categories in still images with deep representations","volume":"15","author":"Zheng Yunpeng","year":"2020","unstructured":"Yunpeng Zheng, Xuelong Li, and Xiaoqiang Lu. 2020. Unsupervised learning of human action categories in still images with deep representations. ACM Trans. Multim. Comput. Commun. Appl. 15, 4 (2020), 112:1\u2013112:20.","journal-title":"ACM Trans. Multim. Comput. Commun. Appl."},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2962304"},{"key":"e_1_3_1_69_2","doi-asserted-by":"crossref","unstructured":"Wentao Zhu Cuiling Lan Junliang Xing Wenjun Zeng Yanghao Li Li Shen and Xiaohui Xie. 2016. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In AAAI Phoenix Arizona USA February 12-17 . AAAI Press 3697\u20133704.","DOI":"10.1609\/aaai.v30i1.10451"},{"key":"e_1_3_1_70_2","doi-asserted-by":"crossref","unstructured":"Xiaoguang Zhu Siran Huang Wenjing Fan Yuhao Cheng Huaqing Shao and Peilin Liu. 2021. SDAN: Stacked diverse attention network for video action recognition. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS) Daegu South Korea May 22-28 . IEEE 1\u20135.","DOI":"10.1109\/ISCAS51556.2021.9401289"},{"key":"e_1_3_1_71_2","doi-asserted-by":"crossref","unstructured":"Mohammadreza Zolfaghari Gabriel L. Oliveira Nima Sedaghat and Thomas Brox. 2017. Chained multi-stream networks exploiting pose motion and appearance for action classification and detection. In ICCV Venice Italy October 22-29 . IEEE Computer Society 2923\u20132932.","DOI":"10.1109\/ICCV.2017.316"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3491228","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3491228","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:19Z","timestamp":1750183759000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3491228"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,4]]},"references-count":70,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2022,8,31]]}},"alternative-id":["10.1145\/3491228"],"URL":"https:\/\/doi.org\/10.1145\/3491228","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,3,4]]},"assertion":[{"value":"2021-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-03-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}