{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,17]],"date-time":"2026-02-17T02:17:54Z","timestamp":1771294674015,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":86,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation for Distinguished Young Scholars of China","award":["61925204"],"award-info":[{"award-number":["61925204"]}]},{"name":"National Natural Science Foundation of China","award":["62072245"],"award-info":[{"award-number":["62072245"]}]},{"name":"National Key Research and Development Program of China","award":["2018AAA0102002"],"award-info":[{"award-number":["2018AAA0102002"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547862","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:35Z","timestamp":1665416555000},"page":"3666-3675","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Look Less Think More: Rethinking Compositional Action Recognition"],"prefix":"10.1145","author":[{"given":"Rui","family":"Yan","sequence":"first","affiliation":[{"name":"Nanjing University of Science and Technology, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Peng","family":"Huang","sequence":"additional","affiliation":[{"name":"Nanjing University of Science and Technology, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiangbo","family":"Shu","sequence":"additional","affiliation":[{"name":"Nanjing University of Science and Technology, 
Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Junhao","family":"Zhang","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yonghua","family":"Pan","sequence":"additional","affiliation":[{"name":"Nanjing University of Science and Technology, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jinhui","family":"Tang","sequence":"additional","affiliation":[{"name":"Nanjing University of Science and Technology, Nanjing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00676"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00175"},{"key":"e_1_3_2_2_3_1","volume-title":"International Conference on Machine Learning","volume":"2","author":"Bertasius Gedas","year":"2021","unstructured":"Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Understanding? In International Conference on Machine Learning, Vol. 2. 4."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3063297"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_3_2_2_6_1","volume-title":"Rui Yan, Xudong Lin, Ying Shan, Lianghua He, Xiaohu Qie, Jianping Wu, and Mike Zheng Shou.","author":"Cai Guanyu","year":"2022","unstructured":"Guanyu Cai, Yixiao Ge, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, Xiaohu Qie, Jianping Wu, and Mike Zheng Shou. 2022. 
Revitalize Region Feature for Democratizing Video-Language Pre-training. arXiv preprint arXiv:2203.07720 (2022)."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206821"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01081"},{"key":"e_1_3_2_2_10_1","unstructured":"MMAction2 Contributors. 2020. OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark. https:\/\/github.com\/open-mmlab\/mmaction2."},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2005.177"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_2_13_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. 
arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00630"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.213"},{"key":"e_1_3_2_2_16_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 16167--16176","author":"Ge Yuying","year":"2022","unstructured":"Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, Xiaohu Qie, and Ping Luo. 2022. BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 16167--16176."},{"key":"e_1_3_2_2_17_1","volume-title":"Mutant: A training paradigm for out-of-distribution generalization in visual question answering. arXiv preprint arXiv:2009.08566","author":"Gokhale Tejas","year":"2020","unstructured":"Tejas Gokhale, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. 2020. Mutant: A training paradigm for out-of-distribution generalization in visual question answering. arXiv preprint arXiv:2009.08566 (2020)."},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.622"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00685"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"crossref","unstructured":"
Donald D Hoffman and Whitman A Richards. 1984. Parts of recognition. In Cognition. 65--96.","DOI":"10.1016\/0010-0277(84)90022-2"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01025"},{"key":"e_1_3_2_2_23_1","volume-title":"International Conference on Machine Learning. 4904--4916","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. 4904--4916."},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2670560"},{"key":"e_1_3_2_2_25_1","volume-title":"Ruart: A novel text-centered solution for text-based visual question answering","author":"Jin Zan-Xia","year":"2021","unstructured":"Zan-Xia Jin, Heran Wu, Chun Yang, Fang Zhou, Jingyan Qin, Lei Xiao, and Xu-Cheng Yin. 2021. Ruart: A novel text-centered solution for text-based visual question answering. 
IEEE Transactions on Multimedia (2021)."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2020.05.110"},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.223"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_15"},{"key":"e_1_3_2_2_29_1","unstructured":"Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)."},{"key":"e_1_3_2_2_30_1","volume-title":"SAFCAR: Structured Attention Fusion for Compositional Action Recognition. arXiv preprint arXiv:2012.02109","author":"Kim Tae Soo","year":"2020","unstructured":"Tae Soo Kim and Gregory D Hager. 2020. SAFCAR: Structured Attention Fusion for Compositional Action Recognition. arXiv preprint arXiv:2012.02109 (2020)."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01283"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2011.6126543"},{"key":"e_1_3_2_2_33_1","volume-title":"On information and sufficiency. The annals of mathematical statistics 22, 1","author":"Kullback Solomon","year":"1951","unstructured":"Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. 
The annals of mathematical statistics 22, 1 (1951), 79--86."},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00725"},{"key":"e_1_3_2_2_35_1","first-page":"1404","article-title":"TA2N: Two-Stage Action Alignment Network for Few-shot Action Recognition","volume":"36","author":"Fei Mengjuan","year":"2021","unstructured":"Shuyuan Li, Huabin Liu, Rui Qian, Yuxi Li, John See, Mengjuan Fei, Xiaoyuan Yu, and Weiyao Lin. 2021. TA2N: Two-Stage Action Alignment Network for Few-shot Action Recognition. In Association for the Advancement of Artificial Intelligence, Vol. 36. 1404--1411.","journal-title":"Association for the Advancement of Artificial Intelligence"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01318"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58621-8_25"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01600"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_8"},{"key":"e_1_3_2_2_40_1","volume-title":"Finding action tubes with a sparse-to-dense framework","author":"Qian Rui","unstructured":"Yuxi Li, Weiyao Lin, Tao Wang, John See, Rui Qian, Ning Xu, Limin Wang, and Shugong Xu. 2020. Finding action tubes with a sparse-to-dense framework. In Association for the Advancement of Artificial Intelligence. 
11466--11473."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00718"},{"key":"e_1_3_2_2_42_1","volume-title":"Skeleton-based action recognition using spatio-temporal LSTM network with trust gates","author":"Liu Jun","year":"2017","unstructured":"Jun Liu, Amir Shahroudy, Dong Xu, Alex C Kot, and Gang Wang. 2017. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017), 3007--3021."},{"key":"e_1_3_2_2_43_1","volume-title":"Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)."},{"key":"e_1_3_2_2_44_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3202--3211","author":"Liu Ze","year":"2021","unstructured":"Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2021. Video swin transformer. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3202--3211."},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.1999.790410"},{"key":"e_1_3_2_2_46_1","volume-title":"Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860","author":"Luo Huaishao","year":"2021","unstructured":"Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2021. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860 (2021)."},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00113"},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW54120.2021.00355"},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_38"},{"key":"e_1_3_2_2_50_1","volume-title":"International Conference on Machine Learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748--8763."},{"key":"e_1_3_2_2_51_1","volume-title":"International Conference on Machine Learning. 
PMLR, 8821--8831","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821--8831."},{"key":"e_1_3_2_2_52_1","volume-title":"a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)."},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2942030"},{"key":"e_1_3_2_2_54_1","volume-title":"Spatiotemporal co-attention recurrent neural networks for human-skeleton motion prediction","author":"Shu Xiangbo","year":"2021","unstructured":"Xiangbo Shu, Liyan Zhang, Guo-Jun Qi, Wei Liu, and Jinhui Tang. 2021. Spatiotemporal co-attention recurrent neural networks for human-skeleton motion prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021)."},{"key":"e_1_3_2_2_55_1","volume-title":"Two-stream convolutional networks for action recognition in videos. 
Advances in Neural Information Processing Systems","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems (2014)."},{"key":"e_1_3_2_2_56_1","volume-title":"Amir Roshan Zamir, and Mubarak Shah","author":"Soomro Khurram","year":"2012","unstructured":"Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)."},{"key":"e_1_3_2_2_57_1","unstructured":"Pengzhan Sun, Bo Wu, Xunsong Li, Wen Li, Lixin Duan, and Chuang Gan. 2021. Counterfactual Debiasing Inference for Compositional Action Recognition. In ACMMM. 3220--3228."},{"key":"e_1_3_2_2_58_1","volume-title":"BlockMix: Meta Regularization and Self-Calibrated Inference for Metric-Based Meta-Learning. In ACM international conference on Multimedia. 610--618","author":"Tang Hao","year":"2020","unstructured":"Hao Tang, Zechao Li, Zhimao Peng, and Jinhui Tang. 2020. BlockMix: Meta Regularization and Self-Calibrated Inference for Metric-Based Meta-Learning. In ACM international conference on Multimedia. 
610--618."},{"key":"e_1_3_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2022.108792"},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00675"},{"key":"e_1_3_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.160"},{"key":"e_1_3_2_2_63_1","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008."},{"key":"e_1_3_2_2_64_1","volume-title":"All in One: Exploring Unified Video-Language Pre-training. arXiv preprint arXiv:2203.07303","author":"Wang Alex Jinpeng","year":"2022","unstructured":"Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. 2022. All in One: Exploring Unified Video-Language Pre-training. arXiv preprint arXiv:2203.07303 (2022)."},{"key":"e_1_3_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.441"},{"key":"e_1_3_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00331"},{"key":"e_1_3_2_2_67_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4305--4314","author":"Qiao Yu","year":"2015","unstructured":"Limin Wang, Yu Qiao, and Xiaoou Tang. 2015. Action recognition with trajectory-pooled deep-convolutional descriptors. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4305--4314."},{"key":"e_1_3_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_2"},{"key":"e_1_3_2_2_69_1","volume-title":"Temporal segment networks for action recognition in videos","author":"Wang Limin","unstructured":"Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2018. Temporal segment networks for action recognition in videos. In IEEE Transactions on Pattern Analysis and Machine Intelligence. 2740--2755."},{"key":"e_1_3_2_2_70_1","volume-title":"Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472","author":"Xing Jiazheng","year":"2021","unstructured":"Mengmeng Wang, Jiazheng Xing, and Yong Liu. 2021. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472 (2021)."},{"key":"e_1_3_2_2_71_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7794--7803","author":"Girshick Ross","year":"2018","unstructured":"Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7794--7803."},{"key":"e_1_3_2_2_72_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01228-1_25"},{"key":"e_1_3_2_2_73_1","volume-title":"Transformers: State-of-the-Art Natural Language Processing. In Conference on Empirical Methods in Natural Language Processing. 38--45","author":"Debut Lysandre","unstructured":"Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R\u00e9mi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Conference on Empirical Methods in Natural Language Processing. 38--45."},{"key":"e_1_3_2_2_74_1","volume-title":"ACM international conference on Multimedia. 791--800","author":"Wu Zuxuan","year":"2016","unstructured":"Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. 2016. Multi-stream multi-class fusion of deep networks for video classification. In ACM international conference on Multimedia. 
791--800."},{"key":"e_1_3_2_2_75_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806222"},{"key":"e_1_3_2_2_76_1","volume-title":"Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, and Jinhui Tang.","author":"Yan Rui","year":"2021","unstructured":"Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, and Jinhui Tang. 2021. Video-Text Pre-training with Learned Regions. arXiv preprint arXiv:2112.01194 (2021)."},{"key":"e_1_3_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240572"},{"key":"e_1_3_2_2_78_1","volume-title":"Interactive Fusion of Multi-level Features for Compositional Activity Recognition. arXiv preprint arXiv:2012.05689","author":"Yan Rui","year":"2020","unstructured":"Rui Yan, Lingxi Xie, Xiangbo Shu, and Jinhui Tang. 2020. Interactive Fusion of Multi-level Features for Compositional Activity Recognition. arXiv preprint arXiv:2012.05689 (2020)."},{"key":"e_1_3_2_2_79_1","volume-title":"HiGCIN: Hierarchical graph-based cross inference network for group activity recognition","author":"Yan Rui","year":"2020","unstructured":"Rui Yan, Lingxi Xie, Jinhui Tang, Xiangbo Shu, and Qi Tian. 2020. HiGCIN: Hierarchical graph-based cross inference network for group activity recognition. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020)."},{"key":"e_1_3_2_2_80_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58598-3_13"},{"key":"e_1_3_2_2_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00067"},{"key":"e_1_3_2_2_82_1","unstructured":"Dong Zhang, Hanwang Zhang, Jinhui Tang, Xian-Sheng Hua, and Qianru Sun. 2020. Causal intervention for weakly-supervised semantic segmentation. In Advances in Neural Information Processing Systems. 655--666."},{"key":"e_1_3_2_2_83_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58604-1_20"},{"key":"e_1_3_2_2_84_1","volume-title":"Morphmlp: A self-attention free, mlp-like backbone for image and video. arXiv preprint arXiv:2111.12527","author":"Zhang David Junhao","year":"2021","unstructured":"David Junhao Zhang, Kunchang Li, Yunpeng Chen, Yali Wang, Shashwat Chandra, Yu Qiao, Luoqi Liu, and Mike Zheng Shou. 2021. Morphmlp: A self-attention free, mlp-like backbone for image and video. 
arXiv preprint arXiv:2111.12527 (2021)."},{"key":"e_1_3_2_2_85_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3109517"},{"key":"e_1_3_2_2_86_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_49"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547862","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547862","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:35Z","timestamp":1750186955000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547862"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":86,"alternative-id":["10.1145\/3503161.3547862","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547862","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}