{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T16:16:11Z","timestamp":1775578571261,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":44,"publisher":"ACM","license":[{"start":{"date-parts":[[2024,10,28]],"date-time":"2024-10-28T00:00:00Z","timestamp":1730073600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"the Fundamental Research Funds for the Central Universities"},{"DOI":"10.13039\/https:\/\/doi.org\/10.13039\/501100011151","name":"Key Laboratory of Computer Network and Information Integration","doi-asserted-by":"publisher","award":["93K-9"],"award-info":[{"award-number":["93K-9"]}],"id":[{"id":"10.13039\/https:\/\/doi.org\/10.13039\/501100011151","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Jiangsu Provincial Key Laboratory of Network and Information Security","award":["BM2003201"],"award-info":[{"award-number":["BM2003201"]}]},{"name":"the National Key Research and Development Program for the 14th-Five-Year Plan of China","award":["2023YFC3804104 in 2023YFC3804100"],"award-info":[{"award-number":["2023YFC3804104 in 2023YFC3804100"]}]},{"DOI":"10.13039\/https:\/\/doi.org\/10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62072099, 62232004, 62373194, 62276063"],"award-info":[{"award-number":["62072099, 62232004, 62373194, 62276063"]}],"id":[{"id":"10.13039\/https:\/\/doi.org\/10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,10,28]]},"DOI":"10.1145\/3664647.3681062","type":"proceedings-article","created":{"date-parts":[[2024,10,26]],"date-time":"2024-10-26T06:59:33Z","timestamp":1729925973000},"page":"4572-4580","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6664-1172","authenticated-orcid":false,"given":"Wenbo","family":"Huang","sequence":"first","affiliation":[{"name":"Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9067-7896","authenticated-orcid":false,"given":"Jinghui","family":"Zhang","sequence":"additional","affiliation":[{"name":"Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3321-8327","authenticated-orcid":false,"given":"Xuwei","family":"Qian","sequence":"additional","affiliation":[{"name":"Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-6221-280X","authenticated-orcid":false,"given":"Zhen","family":"Wu","sequence":"additional","affiliation":[{"name":"Southeast University, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2293-1709","authenticated-orcid":false,"given":"Meng","family":"Wang","sequence":"additional","affiliation":[{"name":"Tongji University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8749-7459","authenticated-orcid":false,"given":"Lei","family":"Zhang","sequence":"additional","affiliation":[{"name":"Nanjing Normal University, Nanjing, China"}]}],"member":"320","published-online":{"date-parts":[[2024,10,28]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Kaidi Cao Jingwei Ji Zhangjie Cao Chien-Yi Chang and Juan Carlos Niebles. 2020. Few-shot video classification via temporal alignment. In CVPR. 10618--10627.","DOI":"10.1109\/CVPR42600.2020.01063"},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo vadis action recognition? a new model and the kinetics dataset. In CVPR. 6299--6308.","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"crossref","unstructured":"Chun-Fu Richard Chen Rameswar Panda Kandan Ramakrishnan Rogerio Feris John Cohn Aude Oliva and Quanfu Fan. 2021. Deep analysis of cnn-based spatio-temporal representations for action recognition. In CVPR. 6165--6175.","DOI":"10.1109\/CVPR46437.2021.00610"},{"key":"e_1_3_2_1_4_1","volume-title":"Teachtext: Crossmodal generalized distillation for text-video retrieval. In ICCV. 11583--11593.","author":"Croitoru Ioana","year":"2021","unstructured":"Ioana Croitoru, Simion-Vlad Bogolin, Marius Leordeanu, Hailin Jin, Andrew Zisserman, Samuel Albanie, and Yang Liu. 2021. Teachtext: Crossmodal generalized distillation for text-video retrieval. In ICCV. 11583--11593."},{"key":"e_1_3_2_1_5_1","volume-title":"Imagenet: A large-scale hierarchical image database","author":"Deng Jia","year":"2009","unstructured":"Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Feifei Li. 2009. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE, 248--255."},{"key":"e_1_3_2_1_6_1","volume-title":"Words: Transformers for Image Recognition at Scale. In ICLR.","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR."},{"key":"e_1_3_2_1_7_1","unstructured":"Chelsea Finn Pieter Abbeel and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML. PMLR 1126--1135."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"crossref","unstructured":"Yuqian Fu Li Zhang Junke Wang Yanwei Fu and Yu-Gang Jiang. 2020. Depth guided adaptive meta-fusion network for few-shot video recognition. In ACM MM. ACM 1142--1151.","DOI":"10.1145\/3394171.3413502"},{"key":"e_1_3_2_1_9_1","volume-title":"Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al.","author":"Goyal Raghav","year":"2017","unstructured":"Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. 2017. The \"something something\" video database for learning and evaluating visual common sense. In ICCV. 5842--5850."},{"key":"e_1_3_2_1_10_1","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770--778."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"crossref","unstructured":"Jie Hu Li Shen and Gang Sun. 2018. Squeeze-and-excitation networks. In CVPR. 7132--7141.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"e_1_3_2_1_12_1","volume-title":"HMDB: a large video database for human motion recognition","author":"Kuehne Hildegard","unstructured":"Hildegard Kuehne, Hueihan Jhuang, Est\u00edbaliz Garrote, Tomaso Poggio, and Thomas Serre. 2011. HMDB: a large video database for human motion recognition. In ICCV. IEEE, 2556--2563."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i2.20029"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Baolong Liu Tianyi Zheng Peng Zheng Daizong Liu Xiaoye Qu Junyu Gao Jianfeng Dong and Xun Wang. 2023. Lite-MKD: A Multi-modal Knowledge Distillation Framework for Lightweight Few-shot Action Recognition. In ACM MM. 7283--7294.","DOI":"10.1145\/3581783.3612279"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"crossref","unstructured":"Kun Liu and Huadong Ma. 2019. Exploring background-bias for anomaly detection in surveillance videos. In ACM MM. 1490--1499.","DOI":"10.1145\/3343031.3350998"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Wenyang Luo Yufan Liu Bing Li Weiming Hu Yanan Miao and Yangxi Li. 2022. Long-Short Term Cross-Transformer in Compressed Domain for Few-Shot Video Classification.. In IJCAI. 1247--1253.","DOI":"10.24963\/ijcai.2022\/174"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"crossref","unstructured":"Panda Pan Yang Zhao Yuan Chen Wei Jia Zhao Zhang and Ronggang Wang. 2023. Cross-view Resolution and Frame Rate Joint Enhancement for Binocular Video. In ACM MM. 8367--8375.","DOI":"10.1145\/3581783.3612213"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Toby Perrett Alessandro Masullo Tilo Burghardt Majid Mirmehdi and Dima Damen. 2021. Temporal-relational crosstransformers for few-shot action recognition. In CVPR. 475--484.","DOI":"10.1109\/CVPR46437.2021.00054"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"crossref","unstructured":"AJ Piergiovanni and Michael S Ryoo. 2019. Representation flow for action recognition. In CVPR. 9945--9953.","DOI":"10.1109\/CVPR.2019.01018"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i10.21370"},{"key":"e_1_3_2_1_21_1","volume-title":"Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV. 618--626.","author":"Selvaraju Ramprasaath R","year":"2017","unstructured":"Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV. 618--626."},{"key":"e_1_3_2_1_22_1","unstructured":"Jake Snell Kevin Swersky and Richard Zemel. 2017. Prototypical networks for few-shot learning. In NeurIPS."},{"key":"e_1_3_2_1_23_1","volume-title":"Amir Roshan Zamir, and Mubarak Shah","author":"Soomro Khurram","year":"2012","unstructured":"Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)."},{"key":"e_1_3_2_1_24_1","volume-title":"Fahad Shahbaz Khan, and Bernard Ghanem.","author":"Thatipelli Anirudh","year":"2022","unstructured":"Anirudh Thatipelli, Sanath Narayan, Salman Khan, Rao Muhammad Anwer, Fahad Shahbaz Khan, and Bernard Ghanem. 2022. Spatio-temporal relation modeling for few-shot action recognition. In CVPR. 19958--19967."},{"key":"e_1_3_2_1_25_1","article-title":"Visualizing data using t-SNE","volume":"9","author":"der Maaten Laurens Van","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, Vol. 9, 11 (2008).","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_1_26_1","volume-title":"Tdn: Temporal difference networks for efficient action recognition. In CVPR. 1895--1904.","author":"Wang Limin","year":"2021","unstructured":"Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. 2021. Tdn: Temporal difference networks for efficient action recognition. In CVPR. 1895--1904."},{"key":"e_1_3_2_1_27_1","volume-title":"Temporal segment networks: Towards good practices for deep action recognition","author":"Wang Limin","unstructured":"Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In ECCV. Springer, 20--36."},{"key":"e_1_3_2_1_28_1","volume-title":"International Journal of Computer Vision","author":"Wang Xiang","year":"2023","unstructured":"Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli Zhao, and Nong Sang. 2023. CLIP-guided Prototype Modulating for Few-shot Action Recognition. International Journal of Computer Vision (2023)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"crossref","unstructured":"Xiang Wang Shiwei Zhang Zhiwu Qing Changxin Gao Yingya Zhang Deli Zhao and Nong Sang. 2023. MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition. In CVPR. 18011--18021.","DOI":"10.1109\/CVPR52729.2023.01727"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"crossref","unstructured":"Xiang Wang Shiwei Zhang Zhiwu Qing Mingqian Tang Zhengrong Zuo Changxin Gao Rong Jin and Nong Sang. 2022. Hybrid relation guided set matching for few-shot action recognition. In CVPR. 19948--19957.","DOI":"10.1109\/CVPR52688.2022.01932"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Yuyang Wanyan Xiaoshan Yang Chaofan Chen and Changsheng Xu. 2023. Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition. In CVPR. 6492--6502.","DOI":"10.1109\/CVPR52729.2023.00628"},{"key":"e_1_3_2_1_32_1","volume-title":"Muzammal Naseer, Salman Khan, Mubarak Shah, and Fahad Shahbaz Khan.","author":"Wasim Syed Talal","year":"2023","unstructured":"Syed Talal Wasim, Muhammad Uzair Khattak, Muzammal Naseer, Salman Khan, Mubarak Shah, and Fahad Shahbaz Khan. 2023. Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition. In ICCV. 13778--13789."},{"key":"e_1_3_2_1_33_1","unstructured":"Jiamin Wu Tianzhu Zhang Zhe Zhang Feng Wu and Yongdong Zhang. 2022. Motion-modulated temporal fragment alignment network for few-shot action recognition. In CVPR. 9151--9160."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Jiazheng Xing Mengmeng Wang Yong Liu and Boyu Mu. 2023. Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition. In AAAI. 3001--3009.","DOI":"10.1609\/aaai.v37i3.25403"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"crossref","unstructured":"Jiazheng Xing Mengmeng Wang Yudi Ruan Bofan Chen Yaowei Guo Boyu Mu Guang Dai Jingdong Wang and Yong Liu. 2023. Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching. In ICCV. 1740--1750.","DOI":"10.1109\/ICCV51070.2023.00167"},{"key":"e_1_3_2_1_36_1","volume-title":"Tpcn: Temporal point cloud networks for motion forecasting. In CVPR. 11318--11327.","author":"Ye Maosheng","year":"2021","unstructured":"Maosheng Ye, Tongyi Cao, and Qifeng Chen. 2021. Tpcn: Temporal point cloud networks for motion forecasting. In CVPR. 11318--11327."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Tianwei Yu Peng Chen Yuanjie Dang Ruohong Huan and Ronghua Liang. 2023. Multi-Speed Global Contextual Subspace Matching for Few-Shot Action Recognition. In ACM MM. 2344--2352.","DOI":"10.1145\/3581783.3612380"},{"key":"e_1_3_2_1_38_1","volume-title":"Philip HS Torr, and Piotr Koniusz","author":"Zhang Hongguang","year":"2020","unstructured":"Hongguang Zhang, Li Zhang, Xiaojuan Qi, Hongdong Li, Philip HS Torr, and Piotr Koniusz. 2020. Few-shot action recognition with permutation-invariant attention. In ECCV. Springer, 525--542."},{"key":"e_1_3_2_1_39_1","volume-title":"Metagan: An adversarial approach to few-shot learning. In NeurIPS.","author":"Zhang Ruixiang","year":"2018","unstructured":"Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. 2018. Metagan: An adversarial approach to few-shot learning. In NeurIPS."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"crossref","unstructured":"Songyang Zhang Jiale Zhou and Xuming He. 2021. Learning implicit temporal alignment for few-shot video classification. In IJCAI. 1309--1315.","DOI":"10.24963\/ijcai.2021\/181"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Yilun Zhang Yuqian Fu Xingjun Ma Lizhe Qi Jingjing Chen Zuxuan Wu and Yu-Gang Jiang. 2023. On the Importance of Spatial Relations for Few-shot Action Recognition. In ACM MM. 2243--2251.","DOI":"10.1145\/3581783.3612192"},{"key":"e_1_3_2_1_42_1","volume-title":"Few-shot action recognition with hierarchical matching and contrastive learning","author":"Zheng Sipeng","unstructured":"Sipeng Zheng, Shizhe Chen, and Qin Jin. 2022. Few-shot action recognition with hierarchical matching and contrastive learning. In ECCV. Springer, 297--313."},{"key":"e_1_3_2_1_43_1","unstructured":"Linchao Zhu and Yi Yang. 2018. Compound memory networks for few-shot video classification. In ECCV. 751--766."},{"key":"e_1_3_2_1_44_1","first-page":"273","article-title":"Label independent memory for semi-supervised few-shot video classification","volume":"44","author":"Zhu Linchao","year":"2020","unstructured":"Linchao Zhu and Yi Yang. 2020. Label independent memory for semi-supervised few-shot video classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, 1 (2020), 273--285.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"}],"event":{"name":"MM '24: The 32nd ACM International Conference on Multimedia","location":"Melbourne VIC Australia","acronym":"MM '24","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 32nd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3664647.3681062","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3664647.3681062","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:57:52Z","timestamp":1750294672000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3664647.3681062"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,28]]},"references-count":44,"alternative-id":["10.1145\/3664647.3681062","10.1145\/3664647"],"URL":"https:\/\/doi.org\/10.1145\/3664647.3681062","relation":{},"subject":[],"published":{"date-parts":[[2024,10,28]]},"assertion":[{"value":"2024-10-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}