{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T04:00:51Z","timestamp":1775016051224,"version":"3.50.1"},"reference-count":70,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,8,24]],"date-time":"2023-08-24T00:00:00Z","timestamp":1692835200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Major Science and Technology Innovation 2030 \u201cNew Generation Artificial Intelligence\u201d key project","award":["2021ZD0111700"],"award-info":[{"award-number":["2021ZD0111700"]}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities of China","doi-asserted-by":"crossref","award":["N2304012"],"award-info":[{"award-number":["N2304012"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Nature Science Foundation of China","doi-asserted-by":"crossref","award":["61773117, 61972397, 62276061, 62002090"],"award-info":[{"award-number":["61773117, 61972397, 62276061, 62002090"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,1,31]]},"abstract":"<jats:p>\n            The problem of long-tailed visual recognition has been receiving increasing research attention. However, the long-tailed distribution problem remains underexplored for video-based visual recognition. To address this issue, in this article we propose a compositional learning based solution for video-based human action recognition. 
Our method, named Attentional Composition Networks (ACN), first learns verb-like and preposition-like components, then shuffles these components in the feature space to generate additional samples for the tail classes. Specifically, during training, we represent each action video by a graph that captures the spatial-temporal relations (edges) among detected human\/object instances (nodes). Then, ACN uses the position information to decompose each action into a set of verb and preposition representations using the edge features in the graph. After that, the verb and preposition features from different videos are combined via an attention structure to synthesize feature representations for tail classes. This enriches the data for the tail classes and consequently improves action recognition for these classes. To evaluate compositional human action recognition, we further contribute a new human action recognition dataset, namely NEU-Interaction (NEU-I). Experimental results on both Something-Something V2 and the proposed NEU-I demonstrate the effectiveness of the proposed method for long-tailed, few-shot, and zero-shot problems in human action recognition. 
Source code and the NEU-I dataset are available at\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/YajieW99\/ACN\">https:\/\/github.com\/YajieW99\/ACN<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3603253","type":"journal-article","created":{"date-parts":[[2023,6,9]],"date-time":"2023-06-09T11:57:23Z","timestamp":1686311843000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Attentional Composition Networks for Long-Tailed Human Action Recognition"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6350-5645","authenticated-orcid":false,"given":"Haoran","family":"Wang","sequence":"first","affiliation":[{"name":"Northeastern University, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8792-6088","authenticated-orcid":false,"given":"Yajie","family":"Wang","sequence":"additional","affiliation":[{"name":"Northeastern University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0761-7893","authenticated-orcid":false,"given":"Baosheng","family":"Yu","sequence":"additional","affiliation":[{"name":"The University of Sydney, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3180-0484","authenticated-orcid":false,"given":"Yibing","family":"Zhan","sequence":"additional","affiliation":[{"name":"JD Explore Academy, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2219-4961","authenticated-orcid":false,"given":"Chunfeng","family":"Yuan","sequence":"additional","affiliation":[{"name":"Chinese Academy of Sciences, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6385-6776","authenticated-orcid":false,"given":"Wankou","family":"Yang","sequence":"additional","affiliation":[{"name":"Southeast University, China"}]}],"member":"320","published-online":{"date-parts":[[2023,8,24]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"6548","volume-title":"Proceedings of the 
IEEE Conference on Computer Vision and Pattern Recognition","author":"Alfassy Amit","year":"2019","unstructured":"Amit Alfassy, Leonid Karlinsky, Amit Aides, Joseph Shtok, Sivan Harary, Rogerio Feris, Raja Giryes, and Alex M. Bronstein. 2019. LaSO: Label-set operations networks for multi-label few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 6548\u20136557."},{"key":"e_1_3_2_3_2","first-page":"39","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Andreas Jacob","year":"2016","unstructured":"Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 39\u201348."},{"key":"e_1_3_2_4_2","first-page":"813","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Bertasius Gedas","year":"2021","unstructured":"Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning. 813\u2013824."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2018.07.011"},{"key":"e_1_3_2_6_2","first-page":"872","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Byrd Jonathon","year":"2019","unstructured":"Jonathon Byrd and Zachary Lipton. 2019. What is the effect of importance weighting in deep learning? In Proceedings of the International Conference on Machine Learning. 872\u2013881."},{"key":"e_1_3_2_7_2","first-page":"6299","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Carreira Joao","year":"2017","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 6299\u20136308."},{"key":"e_1_3_2_8_2","first-page":"5157","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Chen Yue","year":"2019","unstructured":"Yue Chen, Yalong Bai, Wei Zhang, and Tao Mei. 2019. Destruction and construction learning for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 5157\u20135166."},{"key":"e_1_3_2_9_2","first-page":"3379","volume-title":"Proceedings of the 33rd AAAI Conference on Artificial Intelligence","author":"Chen Zitian","year":"2019","unstructured":"Zitian Chen, Yanwei Fu, Kaiyu Chen, and Yugang Jiang. 2019. Image block augmentation for one-shot learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence. 3379\u20133386."},{"key":"e_1_3_2_10_2","first-page":"694","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Chu Peng","year":"2020","unstructured":"Peng Chu, Xiao Bian, Shaopeng Liu, and Haibin Ling. 2020. Feature space augmentation for long-tailed data. In Proceedings of the European Conference on Computer Vision. 694\u2013710."},{"key":"e_1_3_2_11_2","first-page":"1300","volume-title":"Proceedings of the 35th AAAI Conference on Artificial Intelligence","author":"Fang Haoshu","year":"2021","unstructured":"Haoshu Fang, Yichen Xie, Dian Shao, Yonglu Li, and Cewu Lu. 2021. DecAug: Augmenting HOI detection via decomposition. In Proceedings of the 35th AAAI Conference on Artificial Intelligence. 1300\u20131308."},{"key":"e_1_3_2_12_2","first-page":"4768","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Feichtenhofer Christoph","year":"2017","unstructured":"Christoph Feichtenhofer, Axel Pinz, and Richard P. Wildes. 2017. 
Spatiotemporal multiplier networks for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 4768\u20134777."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.213"},{"key":"e_1_3_2_14_2","first-page":"8359","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Gkioxari Georgia","year":"2018","unstructured":"Georgia Gkioxari, Ross Girshick, Piotr Doll\u00e1r, and Kaiming He. 2018. Detecting and recognizing human-object interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 8359\u20138367."},{"key":"e_1_3_2_15_2","first-page":"9211","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Gong Liyu","year":"2019","unstructured":"Liyu Gong and Qiang Cheng. 2019. Exploiting edge features for graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 9211\u20139219."},{"key":"e_1_3_2_16_2","first-page":"5842","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Goyal Raghav","year":"2017","unstructured":"Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, et\u00a0al. 2017. The \u201csomething something\u201d video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Los Alamitos, CA, 5842\u20135850."},{"issue":"10","key":"e_1_3_2_17_2","doi-asserted-by":"crossref","first-page":"1775","DOI":"10.1109\/TPAMI.2009.83","article-title":"Observing human-object interactions: Using spatial and functional compatibility for recognition","volume":"31","author":"Gupta Abhinav","year":"2009","unstructured":"Abhinav Gupta, Aniruddha Kembhavi, and Larry S. 
Davis. 2009. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 10 (2009), 1775\u20131789.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2008.239"},{"key":"e_1_3_2_19_2","first-page":"584","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Hou Zhi","year":"2020","unstructured":"Zhi Hou, Xiaojiang Peng, Yu Qiao, and Dacheng Tao. 2020. Visual compositional learning for human-object interaction detection. In Proceedings of the European Conference on Computer Vision. 584\u2013600."},{"key":"e_1_3_2_20_2","first-page":"495","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Hou Zhi","year":"2021","unstructured":"Zhi Hou, Baosheng Yu, Yu Qiao, Xiaojiang Peng, and Dacheng Tao. 2021. Affordance transfer learning for human-object interaction detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 495\u2013504."},{"key":"e_1_3_2_21_2","first-page":"14646","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Hou Zhi","year":"2021","unstructured":"Zhi Hou, Baosheng Yu, Yu Qiao, Xiaojiang Peng, and Dacheng Tao. 2021. Detecting human-object interaction via fabricated compositional learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 14646\u201314655."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00378"},{"key":"e_1_3_2_23_2","first-page":"804","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Hu Ronghang","year":"2017","unstructured":"Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. 
Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Los Alamitos, CA, 804\u2013813."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.580"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.3233\/IDA-2002-6504"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.59"},{"issue":"7","key":"e_1_3_2_27_2","doi-asserted-by":"crossref","first-page":"2129","DOI":"10.1109\/TCSVT.2019.2914137","article-title":"Action recognition scheme based on skeleton representation with DS-LSTM network","volume":"30","author":"Jiang Xinghao","year":"2019","unstructured":"Xinghao Jiang, Ke Xu, and Tanfeng Sun. 2019. Action recognition scheme based on skeleton representation with DS-LSTM network. IEEE Transactions on Circuits and Systems for Video Technology 30, 7 (2019), 2129\u20132140.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298990"},{"key":"e_1_3_2_29_2","first-page":"1725","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Karpathy Andrej","year":"2014","unstructured":"Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 1725\u20131732."},{"key":"e_1_3_2_30_2","first-page":"234","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Kato Keizo","year":"2018","unstructured":"Keizo Kato, Yin Li, and Abhinav Gupta. 2018. Compositional learning for human object interaction. In Proceedings of the European Conference on Computer Vision. 
234\u2013251."},{"key":"e_1_3_2_31_2","article-title":"The kinetics human action video dataset","author":"Kay Will","year":"2017","unstructured":"Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, et\u00a0al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).","journal-title":"arXiv preprint arXiv:1705.06950"},{"key":"e_1_3_2_32_2","first-page":"8940","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Kortylewski Adam","year":"2020","unstructured":"Adam Kortylewski, Ju He, Qing Liu, and Alan L. Yuille. 2020. Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 8940\u20138949."},{"key":"e_1_3_2_33_2","first-page":"2556","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Kuehne Hildegard","year":"2011","unstructured":"Hildegard Kuehne, Hueihan Jhuang, Est\u00edbaliz Garrote, Tomaso Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Los Alamitos, CA, 2556\u20132563."},{"key":"e_1_3_2_34_2","first-page":"909","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Li Yan","year":"2020","unstructured":"Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. 2020. TEA: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
IEEE, Los Alamitos, CA, 909\u2013918."},{"key":"e_1_3_2_35_2","first-page":"1","volume-title":"Advances in Neural Information Processing Systems","author":"Li Yonglu","year":"2020","unstructured":"Yonglu Li, Xinpeng Liu, Xiaoqian Wu, Yizhuo Li, and Cewu Lu. 2020. HOI analysis: Integrating and decomposing human-object interaction. In Advances in Neural Information Processing Systems. 1\u201312."},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00718"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3180585"},{"key":"e_1_3_2_38_2","first-page":"3192","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Long Fuchen","year":"2022","unstructured":"Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, and Tao Mei. 2022. Stand-alone inter-frame attention in video models. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 3192\u20133201."},{"key":"e_1_3_2_39_2","first-page":"1049","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Materzynska Joanna","year":"2020","unstructured":"Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. 2020. Something-else: Compositional action recognition with spatial-temporal interaction networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 1049\u20131059."},{"key":"e_1_3_2_40_2","first-page":"1792","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Misra Ishan","year":"2017","unstructured":"Ishan Misra, Abhinav Gupta, and Martial Hebert. 2017. From red wine to red tomato: Composition with context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
IEEE, Los Alamitos, CA, 1792\u20131801."},{"key":"e_1_3_2_41_2","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1016\/j.neucom.2017.04.007","article-title":"Dual-layer kernel extreme learning machine for action recognition","volume":"260","author":"Nguyen Tam V.","year":"2017","unstructured":"Tam V. Nguyen and Bilal Mirza. 2017. Dual-layer kernel extreme learning machine for action recognition. Neurocomputing 260 (2017), 123\u2013130.","journal-title":"Neurocomputing"},{"key":"e_1_3_2_42_2","first-page":"401","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Qi Siyuan","year":"2018","unstructured":"Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. 2018. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision. 401\u2013417."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.590"},{"key":"e_1_3_2_44_2","first-page":"1100","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Ramanathan Vignesh","year":"2015","unstructured":"Vignesh Ramanathan, Congcong Li, Jia Deng, Wei Han, Zhen Li, Kunlong Gu, Yang Song, Samy Bengio, Charles Rosenberg, and Li Fei-Fei. 2015. Learning semantic relationships for better action retrieval in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 1100\u20131109."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"e_1_3_2_46_2","volume-title":"Advances in Neural Information Processing Systems","author":"Santoro Adam","year":"2017","unstructured":"Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. 2017. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems. 
1\u201310."},{"key":"e_1_3_2_47_2","first-page":"467","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Shen Li","year":"2016","unstructured":"Li Shen, Zhouchen Lin, and Qingming Huang. 2016. Relay backpropagation for effective learning of deep convolutional neural networks. In Proceedings of the European Conference on Computer Vision. 467\u2013482."},{"issue":"3","key":"e_1_3_2_48_2","first-page":"1","article-title":"Shuffle-invariant network for action recognition in videos","volume":"18","author":"Shi Qinghongya","year":"2022","unstructured":"Qinghongya Shi, Hong-Bo Zhang, Zhe Li, Ji-Xiang Du, Qing Lei, and Jing-Hua Liu. 2022. Shuffle-invariant network for action recognition in videos. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 3 (2022), 1\u201318.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_2_49_2","volume-title":"Advances in Neural Information Processing Systems","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems. 1\u20139."},{"key":"e_1_3_2_50_2","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Soomro Khurram","year":"2012","unstructured":"Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA."},{"key":"e_1_3_2_51_2","first-page":"19958","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Thatipelli Anirudh","year":"2022","unstructured":"Anirudh Thatipelli, Sanath Narayan, Salman Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Bernard Ghanem. 2022. 
Spatio-temporal relation modeling for few-shot action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 19958\u201319967."},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_53_2","first-page":"6450","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Tran Du","year":"2018","unstructured":"Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 6450\u20136459."},{"key":"e_1_3_2_54_2","first-page":"20030","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Truong Thanh-Dat","year":"2022","unstructured":"Thanh-Dat Truong, Quoc-Huy Bui, Chi Nhan Duong, Han-Seok Seo, Son Lam Phung, Xin Li, and Khoa Luu. 2022. DirecFormer: A directed attention in transformer approach to robust action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 20030\u201320040."},{"key":"e_1_3_2_55_2","first-page":"2635","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Tulsiani Shubham","year":"2017","unstructured":"Shubham Tulsiani, Hao Su, Leonidas J. Guibas, Alexei A. Efros, and Jitendra Malik. 2017. Learning shape abstractions by assembling volumetric primitives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 2635\u20132643."},{"key":"e_1_3_2_56_2","first-page":"12645","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Wang Angtian","year":"2020","unstructured":"Angtian Wang, Yihong Sun, Adam Kortylewski, and Alan L. Yuille. 2020. 
Robust object detection under occlusion with context-aware CompositionalNets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 12645\u201312654."},{"key":"e_1_3_2_57_2","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Wang Limin","year":"2021","unstructured":"Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. 2021. TDN: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA."},{"key":"e_1_3_2_58_2","first-page":"20","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Wang Limin","year":"2016","unstructured":"Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision. 20\u201336."},{"issue":"12","key":"e_1_3_2_59_2","doi-asserted-by":"crossref","first-page":"2613","DOI":"10.1109\/TCSVT.2016.2576761","article-title":"Temporal pyramid pooling-based convolutional neural network for action recognition","volume":"27","author":"Wang Peng","year":"2016","unstructured":"Peng Wang, Yuanzhouhan Cao, Chunhua Shen, Lingqiao Liu, and Heng Tao Shen. 2016. Temporal pyramid pooling-based convolutional neural network for action recognition. IEEE Transactions on Circuits and Systems for Video Technology 27, 12 (2016), 2613\u20132622.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00813"},{"key":"e_1_3_2_61_2","first-page":"399","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Wang Xiaolong","year":"2018","unstructured":"Xiaolong Wang and Abhinav Gupta. 2018. Videos as space-time region graphs. 
In Proceedings of the European Conference on Computer Vision. 399\u2013417."},{"key":"e_1_3_2_62_2","first-page":"7032","volume-title":"Proceedings of the International Conference on Neural Information Processing Systems","author":"Wang Yu-Xiong","year":"2017","unstructured":"Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. 2017. Learning to model the tail. In Proceedings of the International Conference on Neural Information Processing Systems. 7032\u20137042."},{"key":"e_1_3_2_63_2","first-page":"3919","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Xiao Tete","year":"2019","unstructured":"Tete Xiao, Quanfu Fan, Dan Gutfreund, Mathew Monfort, Aude Oliva, and Bolei Zhou. 2019. Reasoning about human-object interactions through dual attention networks. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Los Alamitos, CA, 3919\u20133928."},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_19"},{"issue":"4","key":"e_1_3_2_65_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3450410","article-title":"Dual-stream structured graph convolution network for skeleton-based action recognition","volume":"17","author":"Xu Chunyan","year":"2021","unstructured":"Chunyan Xu, Rong Liu, Tong Zhang, Zhen Cui, Jian Yang, and Chunlong Hu. 2021. Dual-stream structured graph convolution network for skeleton-based action recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4 (2021), 1\u201322.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_2_66_2","article-title":"Exploiting attention-consistency loss for spatial-temporal stream action recognition","author":"Xu Haotian","year":"2022","unstructured":"Haotian Xu, Xiaobo Jin, Qiufeng Wang, Amir Hussain, and Kaizhu Huang. 2022. Exploiting attention-consistency loss for spatial-temporal stream action recognition. 
ACM Transactions on Multimedia Computing, Communications, and Applications 18, 2s (2022), Article 19, 15 pages.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_2_67_2","first-page":"17","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Yao Bangpeng","year":"2010","unstructured":"Bangpeng Yao and Li Fei-Fei. 2010. Modeling mutual context of object and human pose in human-object interaction activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Los Alamitos, CA, 17\u201324."},{"key":"e_1_3_2_68_2","doi-asserted-by":"publisher","DOI":"10.1145\/3478642"},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_49"},{"key":"e_1_3_2_70_2","first-page":"843","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Zhou Penghao","year":"2019","unstructured":"Penghao Zhou and Mingmin Chi. 2019. Relation parsing neural network for human-object interaction detection. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, Los Alamitos, CA, 843\u2013851."},{"key":"e_1_3_2_71_2","first-page":"2827","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Zhou Tianfei","year":"2020","unstructured":"Tianfei Zhou, Wenguan Wang, Qiyuan Qi, Haibin Ling, and Jianbing Shen. 2020. Cascaded human-object interaction recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
IEEE, Los Alamitos, CA, 2827\u20132840."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3603253","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3603253","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:49:11Z","timestamp":1750286951000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3603253"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,24]]},"references-count":70,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,1,31]]}},"alternative-id":["10.1145\/3603253"],"URL":"https:\/\/doi.org\/10.1145\/3603253","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,8,24]]},"assertion":[{"value":"2022-10-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-05-22","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}