{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T19:28:30Z","timestamp":1762543710337,"version":"3.41.0"},"reference-count":76,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,9,27]],"date-time":"2023-09-27T00:00:00Z","timestamp":1695772800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"NSFC","doi-asserted-by":"crossref","award":["62102092"],"award-info":[{"award-number":["62102092"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Shanghai Science and Technology Program","award":["21JC1400600"],"award-info":[{"award-number":["21JC1400600"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,2,29]]},"abstract":"<jats:p>Videos are multimodal in nature. Conventional video recognition pipelines typically fuse multimodal features for improved performance. However, this is not only computationally expensive but also neglects the fact that different videos rely on different modalities for predictions. This article introduces Hierarchical and Conditional Modality Selection (HCMS), a simple yet efficient multimodal learning framework for efficient video recognition. HCMS operates on a low-cost modality, i.e., audio clues, by default, and dynamically decides on-the-fly whether to use computationally expensive modalities, including appearance and motion clues, on a per-input basis. This is achieved by the collaboration of three LSTMs that are organized in a hierarchical manner. In particular, LSTMs that operate on high-cost modalities contain a gating module, which takes as inputs lower-level features and historical information to adaptively determine whether to activate its corresponding modality; otherwise, it simply reuses historical information. We conduct extensive experiments on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate the proposed approach can effectively explore multimodal information for improved classification performance while requiring much less computation.<\/jats:p>","DOI":"10.1145\/3572776","type":"journal-article","created":{"date-parts":[[2022,12,2]],"date-time":"2022-12-02T13:46:50Z","timestamp":1669988810000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["HCMS: Hierarchical and Conditional Modality Selection for Efficient Video Recognition"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9706-6484","authenticated-orcid":false,"given":"Zejia","family":"Weng","sequence":"first","affiliation":[{"name":"Shanghai Key Lab of Intelligent Info. Processing, School of CS, Fudan University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8689-5807","authenticated-orcid":false,"given":"Zuxuan","family":"Wu","sequence":"additional","affiliation":[{"name":"Shanghai Key Lab of Intelligent Info. 
Processing, School of CS, Fudan University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5314-6853","authenticated-orcid":false,"given":"Hengduo","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Maryland, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3148-264X","authenticated-orcid":false,"given":"Jingjing","family":"Chen","sequence":"additional","affiliation":[{"name":"Shanghai Key Lab of Intelligent Info. Processing, School of CS, Fudan University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1907-8567","authenticated-orcid":false,"given":"Yu-Gang","family":"Jiang","sequence":"additional","affiliation":[{"name":"Shanghai Key Lab of Intelligent Info. Processing, School of CS, Fudan University, China"}]}],"member":"320","published-online":{"date-parts":[[2023,9,27]]},"reference":[{"key":"e_1_3_2_2_2","article-title":"MMSUM digital twins: A multi-view multi-modality summarization framework for sporting events","author":"Aloufi Samah","year":"2022","unstructured":"Samah Aloufi and Abdulmotaleb El Saddik. 2022. MMSUM digital twins: A multi-view multi-modality summarization framework for sporting events. ACM Trans. Multim. Comput. Commun. Applic. (2022).","journal-title":"ACM Trans. Multim. Comput. Commun. Applic."},{"key":"e_1_3_2_3_2","volume-title":"ICCV","author":"Arandjelovic Relja","year":"2017","unstructured":"Relja Arandjelovic and Andrew Zisserman. 2017. Look, listen and learn. In ICCV."},{"key":"e_1_3_2_4_2","volume-title":"ICML","author":"Bolukbasi Tolga","year":"2017","unstructured":"Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. 2017. Adaptive neural networks for fast test-time prediction. In ICML."},{"key":"e_1_3_2_5_2","volume-title":"ECCV","author":"Carion Nicolas","year":"2020","unstructured":"Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV."},{"key":"e_1_3_2_6_2","volume-title":"CVPR","author":"Carreira Joao","year":"2017","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR."},{"key":"e_1_3_2_7_2","volume-title":"CVPR","author":"Feichtenhofer Christoph","year":"2020","unstructured":"Christoph Feichtenhofer. 2020. X3d: Expanding architectures for efficient video recognition. In CVPR."},{"key":"e_1_3_2_8_2","volume-title":"ICCV","author":"Feichtenhofer Christoph","year":"2019","unstructured":"Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In ICCV."},{"key":"e_1_3_2_9_2","volume-title":"CVPR","author":"Feichtenhofer C.","year":"2016","unstructured":"C. Feichtenhofer, A. Pinz, and A. Zisserman. 2016. Convolutional two-stream network fusion for video action recognition. In CVPR."},{"key":"e_1_3_2_10_2","volume-title":"CVPR","author":"Gao Ruohan","year":"2020","unstructured":"Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, and Lorenzo Torresani. 2020. Listen to look: Action recognition by previewing audio. In CVPR."},{"key":"e_1_3_2_11_2","article-title":"Exploring deep learning for view-based 3D model retrieval","author":"Gao Zan","year":"2020","unstructured":"Zan Gao, Yinming Li, and Shaohua Wan. 2020. Exploring deep learning for view-based 3D model retrieval. ACM Trans. Multim. Comput. Commun. Applic. (2020).","journal-title":"ACM Trans. Multim. Comput. Commun. 
Applic."},{"key":"e_1_3_2_12_2","volume-title":"ICASSP","author":"Gemmeke Jort F.","year":"2017","unstructured":"Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP."},{"key":"e_1_3_2_13_2","volume-title":"CVPR","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR."},{"key":"e_1_3_2_14_2","volume-title":"CVPR","author":"Heilbron Fabian Caba","year":"2015","unstructured":"Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR."},{"key":"e_1_3_2_15_2","volume-title":"ACM Multimedia","author":"Hou Zhijian","year":"2021","unstructured":"Zhijian Hou, Chong-Wah Ngo, and Wing Kwong Chan. 2021. Conquer: Contextual query-aware ranking for video corpus moment retrieval. In ACM Multimedia."},{"key":"e_1_3_2_16_2","article-title":"MobileNets: Efficient convolutional neural networks for mobile vision applications","author":"Howard Andrew G.","year":"2017","unstructured":"Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).","journal-title":"arXiv preprint arXiv:1704.04861"},{"key":"e_1_3_2_17_2","volume-title":"ICLR","author":"Jang Eric","year":"2017","unstructured":"Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. In ICLR."},{"key":"e_1_3_2_18_2","article-title":"Discovering joint audio-visual codewords for video event detection","author":"Jhuo I-Hong","year":"2014","unstructured":"I-Hong Jhuo, Guangnan Ye, Shenghua Gao, Dong Liu, Yu-Gang Jiang, D. T. Lee, and Shih-Fu Chang. 2014. Discovering joint audio-visual codewords for video event detection. Mach. Vis. Appl. (2014).","journal-title":"Mach. Vis. Appl."},{"key":"e_1_3_2_19_2","article-title":"Super fast event recognition in internet videos","author":"Jiang Yu-Gang","year":"2015","unstructured":"Yu-Gang Jiang, Qi Dai, Tao Mei, Yong Rui, and Shih-Fu Chang. 2015. Super fast event recognition in internet videos. IEEE Trans. Multim. (2015).","journal-title":"IEEE Trans. Multim."},{"key":"e_1_3_2_20_2","article-title":"Modeling multimodal clues in a hybrid deep learning framework for video classification","author":"Jiang Yu-Gang","year":"2018","unstructured":"Yu-Gang Jiang, Zuxuan Wu, Jinhui Tang, Zechao Li, Xiangyang Xue, and Shih-Fu Chang. 2018. Modeling multimodal clues in a hybrid deep learning framework for video classification. IEEE Trans. Multim. (2018).","journal-title":"IEEE Trans. Multim."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2670560"},{"key":"e_1_3_2_22_2","volume-title":"ECCV","author":"Kang Sunghun","year":"2018","unstructured":"Sunghun Kang, Junyeong Kim, Hyunsoo Choi, Sungjin Kim, and Chang D. Yoo. 2018. Pivot correlational neural network for multimodal video categorization. In ECCV."},{"key":"e_1_3_2_23_2","volume-title":"ICCV","author":"Korbar Bruno","year":"2019","unstructured":"Bruno Korbar, Du Tran, and Lorenzo Torresani. 2019. SCSampler: Sampling salient clips from video for efficient action recognition. 
In ICCV."},{"key":"e_1_3_2_24_2","volume-title":"ICCV","author":"Li Hao","year":"2019","unstructured":"Hao Li, Hong Zhang, Xiaojuan Qi, Ruigang Yang, and Gao Huang. 2019. Improved techniques for training adaptive deep networks. In ICCV."},{"key":"e_1_3_2_25_2","article-title":"VideoLSTM convolves, attends and flows for action recognition","author":"Li Zhenyang","year":"2018","unstructured":"Zhenyang Li, Kirill Gavrilyuk, Efstratios Gavves, Mihir Jain, and Cees G. M. Snoek. 2018. VideoLSTM convolves, attends and flows for action recognition. Comput. Vis. Image Underst. (2018).","journal-title":"Comput. Vis. Image Underst."},{"key":"e_1_3_2_26_2","volume-title":"ICCV","author":"Lin Ji","year":"2019","unstructured":"Ji Lin, Chuang Gan, and Song Han. 2019. TSM: Temporal shift module for efficient video understanding. In ICCV."},{"key":"e_1_3_2_27_2","article-title":"Selective feature compression for efficient activity recognition inference","author":"Liu Chunhui","year":"2021","unstructured":"Chunhui Liu, Xinyu Li, Hao Chen, Davide Modolo, and Joseph Tighe. 2021. Selective feature compression for efficient activity recognition inference. arXiv preprint arXiv:2104.00179 (2021).","journal-title":"arXiv preprint arXiv:2104.00179"},{"key":"e_1_3_2_28_2","volume-title":"AAAI","author":"Long Xiang","year":"2018","unstructured":"Xiang Long, Chuang Gan, Gerard Melo, Xiao Liu, Yandong Li, Fu Li, and Shilei Wen. 2018. Multimodal keyless attention fusion for video classification. In AAAI."},{"key":"e_1_3_2_29_2","volume-title":"ICLR","author":"Maddison Chris J.","year":"2017","unstructured":"Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR."},{"key":"e_1_3_2_30_2","volume-title":"ECCV","author":"Meng Yue","year":"2020","unstructured":"Yue Meng, Chung-Ching Lin, Rameswar Panda, Prasanna Sattigeri, Leonid Karlinsky, Aude Oliva, Kate Saenko, and Rogerio Feris. 2020. AR-Net: Adaptive frame resolution for efficient action recognition. In ECCV."},{"key":"e_1_3_2_31_2","volume-title":"ICLR","author":"Meng Yue","year":"2021","unstructured":"Yue Meng, Rameswar Panda, Chung-Ching Lin, Prasanna Sattigeri, Leonid Karlinsky, Kate Saenko, Aude Oliva, and Rogerio Feris. 2021. AdaFuse: Adaptive temporal fusion network for efficient action recognition. In ICLR."},{"key":"e_1_3_2_32_2","volume-title":"CVPR","author":"Ng Joe Yue-Hei","year":"2015","unstructured":"Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. In CVPR."},{"key":"e_1_3_2_33_2","volume-title":"ECCV","author":"Owens Andrew","year":"2018","unstructured":"Andrew Owens and Alexei A. Efros. 2018. Audio-visual scene analysis with self-supervised multisensory features. In ECCV."},{"key":"e_1_3_2_34_2","volume-title":"ICLR","author":"Pan Bowen","year":"2021","unstructured":"Bowen Pan, Rameswar Panda, Camilo Fosco, Chung-Ching Lin, Alex Andonian, Yue Meng, Kate Saenko, Aude Oliva, and Rogerio Feris. 2021. VA-RED \\(^{2}\\) : Video adaptive redundancy reduction. In ICLR."},{"key":"e_1_3_2_35_2","article-title":"YAMNet","author":"Plakal Manoj","year":"2020","unstructured":"Manoj Plakal and Dan Ellis. 2020. YAMNet. 
Retrieved from https:\/\/github.com\/tensorflow\/models\/tree\/master\/research\/audioset\/yamnet.","journal-title":"Retrieved from"},{"key":"e_1_3_2_36_2","volume-title":"ACM Multimedia","author":"Qi Zhaobo","year":"2020","unstructured":"Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, Qingming Huang, and Qi Tian. 2020. Towards more explainability: Concept knowledge mining network for event recognition. In ACM Multimedia."},{"key":"e_1_3_2_37_2","article-title":"Deep quantization: Encoding convolutional activations with deep generative model","author":"Qiu Zhaofan","year":"2016","unstructured":"Zhaofan Qiu, Ting Yao, and Tao Mei. 2016. Deep quantization: Encoding convolutional activations with deep generative model. arXiv preprint arXiv:1611.09502 (2016).","journal-title":"arXiv preprint arXiv:1611.09502"},{"key":"e_1_3_2_38_2","volume-title":"CVPR","author":"Qiu Zhaofan","year":"2017","unstructured":"Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Deep quantization: Encoding convolutional activations with deep generative model. In CVPR."},{"key":"e_1_3_2_39_2","volume-title":"CVPR","author":"Sandler Mark","year":"2018","unstructured":"Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR."},{"key":"e_1_3_2_40_2","article-title":"Learning to localize sound sources in visual scenes: Analysis and applications","author":"Senocak Arda","year":"2019","unstructured":"Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. 2019. Learning to localize sound sources in visual scenes: Analysis and applications. IEEE Trans. Pattern Anal. Mach. Intell. (2019).","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"e_1_3_2_41_2","volume-title":"NIPS","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In NIPS."},{"key":"e_1_3_2_42_2","article-title":"Modality compensation network: Cross-modal adaptation for action recognition","author":"Song Sijie","year":"2020","unstructured":"Sijie Song, Jiaying Liu, Yanghao Li, and Zongming Guo. 2020. Modality compensation network: Cross-modal adaptation for action recognition. IEEE Trans. Image Process. (2020).","journal-title":"IEEE Trans. Image Process."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.5555\/551283"},{"key":"e_1_3_2_44_2","volume-title":"ICML","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML."},{"key":"e_1_3_2_45_2","volume-title":"ECCV","author":"Tian Yapeng","year":"2018","unstructured":"Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. 2018. Audio-visual event localization in unconstrained videos. In ECCV."},{"key":"e_1_3_2_46_2","volume-title":"ICCV","author":"Tran Du","year":"2015","unstructured":"Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. C3D: Generic features for video analysis. In ICCV."},{"key":"e_1_3_2_47_2","volume-title":"CVPR","author":"Tran Du","year":"2018","unstructured":"Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. 
In CVPR."},{"key":"e_1_3_2_48_2","article-title":"Activity recognition using temporal optical flow convolutional features and multilayer LSTM","author":"Ullah Amin","year":"2018","unstructured":"Amin Ullah, Khan Muhammad, Javier Del Ser, Sung Wook Baik, and Victor Hugo C. de Albuquerque. 2018. Activity recognition using temporal optical flow convolutional features and multilayer LSTM. IEEE Trans. Multim. (2018).","journal-title":"IEEE Trans. Multim."},{"key":"e_1_3_2_49_2","volume-title":"CVPR","author":"Uzkent Burak","year":"2020","unstructured":"Burak Uzkent and Stefano Ermon. 2020. Learning when and where to zoom with deep reinforcement learning. In CVPR."},{"key":"e_1_3_2_50_2","volume-title":"ICCV","author":"Wang Heng","year":"2013","unstructured":"Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In ICCV."},{"key":"e_1_3_2_51_2","article-title":"Multi-level temporal dilated dense prediction for action recognition","author":"Wang Jinpeng","year":"2021","unstructured":"Jinpeng Wang, Yiqi Lin, Manlin Zhang, Yuan Gao, and Andy J. Ma. 2021. Multi-level temporal dilated dense prediction for action recognition. IEEE Trans. Multim. (2021).","journal-title":"IEEE Trans. Multim."},{"key":"e_1_3_2_52_2","article-title":"Temporal segment networks for action recognition in videos","author":"Wang Limin","year":"2018","unstructured":"Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2018. Temporal segment networks for action recognition in videos. IEEE Trans. Pattern Anal. Mach. Intell. (2018).","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"e_1_3_2_53_2","volume-title":"CVPR","author":"Wang Xiaolong","year":"2018","unstructured":"Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In CVPR."},{"key":"e_1_3_2_54_2","volume-title":"ECCV","author":"Wang Xin","year":"2018","unstructured":"Xin Wang, Fisher Yu, Zi-Yi Dou, and Joseph E. Gonzalez. 2018. SkipNet: Learning dynamic routing in convolutional networks. In ECCV."},{"key":"e_1_3_2_55_2","article-title":"Concept-driven multi-modality fusion for video search","author":"Wei Xiao-Yong","year":"2011","unstructured":"Xiao-Yong Wei, Yu-Gang Jiang, and Chong-Wah Ngo. 2011. Concept-driven multi-modality fusion for video search. IEEE Trans. Circ. Syst. Vid. Technol. (2011).","journal-title":"IEEE Trans. Circ. Syst. Vid. Technol."},{"key":"e_1_3_2_56_2","article-title":"Semi-supervised vision transformers","author":"Weng Zejia","year":"2021","unstructured":"Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, and Yu-Gang Jiang. 2021. Semi-supervised vision transformers. arXiv preprint arXiv:2111.11067 (2021).","journal-title":"arXiv preprint arXiv:2111.11067"},{"key":"e_1_3_2_57_2","volume-title":"ICCV","author":"Wu Wenhao","year":"2019","unstructured":"Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, and Shilei Wen. 2019. Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition. In ICCV."},{"key":"e_1_3_2_58_2","volume-title":"ACM Multimedia","author":"Wu Zuxuan","year":"2016","unstructured":"Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. 2016. Multi-stream multi-class fusion of deep networks for video classification. In ACM Multimedia."},{"key":"e_1_3_2_59_2","article-title":"A dynamic frame selection framework for fast video recognition","author":"Wu Zuxuan","year":"2020","unstructured":"Zuxuan Wu, Hengduo Li, Caiming Xiong, Yu-Gang Jiang, and Larry Steven Davis. 2020. 
A dynamic frame selection framework for fast video recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2020).","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"e_1_3_2_60_2","article-title":"Audiovisual slowfast networks for video recognition","author":"Xiao Fanyi","year":"2020","unstructured":"Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer. 2020. Audiovisual slowfast networks for video recognition. arXiv preprint arXiv:2001.08740 (2020).","journal-title":"arXiv preprint arXiv:2001.08740"},{"key":"e_1_3_2_61_2","volume-title":"ECCV","author":"Xie Saining","year":"2018","unstructured":"Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. 2018. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV."},{"key":"e_1_3_2_62_2","article-title":"Dense dilated network for video action recognition","author":"Xu Baohan","year":"2019","unstructured":"Baohan Xu, Hao Ye, Yingbin Zheng, Heng Wang, Tianyu Luwang, and Yu-Gang Jiang. 2019. Dense dilated network for video action recognition. IEEE Trans. Image Process. (2019).","journal-title":"IEEE Trans. Image Process."},{"key":"e_1_3_2_63_2","article-title":"STA-CNN: Convolutional spatial-temporal attention learning for action recognition","author":"Yang Hao","year":"2020","unstructured":"Hao Yang, Chunfeng Yuan, Li Zhang, Yunda Sun, Weiming Hu, and Stephen J. Maybank. 2020. STA-CNN: Convolutional spatial-temporal attention learning for action recognition. IEEE Trans. Image Process. (2020).","journal-title":"IEEE Trans. Image Process."},{"key":"e_1_3_2_64_2","volume-title":"CVPR","author":"Yang Le","year":"2020","unstructured":"Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, and Gao Huang. 2020. Resolution adaptive networks for efficient inference. In CVPR."},{"key":"e_1_3_2_65_2","volume-title":"ACM Multimedia","author":"Yang Xiaodong","year":"2016","unstructured":"Xiaodong Yang, Pavlo Molchanov, and Jan Kautz. 2016. Multilayer and multimodal fusion of deep neural networks for video classification. In ACM Multimedia."},{"key":"e_1_3_2_66_2","volume-title":"ACM ICMR","author":"Ye Hao","year":"2015","unstructured":"Hao Ye, Zuxuan Wu, Rui-Wei Zhao, Xi Wang, Yu-Gang Jiang, and Xiangyang Xue. 2015. Evaluating two-stream CNN for video classification. In ACM ICMR."},{"key":"e_1_3_2_67_2","volume-title":"CVPR","author":"Zhang Bowen","year":"2016","unstructured":"Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao, and Hanli Wang. 2016. Real-time action recognition with enhanced motion vector CNNs. In CVPR."},{"key":"e_1_3_2_68_2","volume-title":"CVPR","author":"Zhang Xiangyu","year":"2018","unstructured":"Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR."},{"key":"e_1_3_2_69_2","article-title":"Local correlation ensemble with GCN based on attention features for cross-domain person Re-ID","author":"Zhang Yue","year":"2022","unstructured":"Yue Zhang, Fanghui Zhang, Yi Jin, Yigang Cen, Viacheslav Voronin, and Shaohua Wan. 2022. Local correlation ensemble with GCN based on attention features for cross-domain person Re-ID. ACM Trans. Multim. Comput. Commun. Applic. (2022).","journal-title":"ACM Trans. Multim. Comput. Commun. 
Applic."},{"key":"e_1_3_2_70_2","article-title":"Visual content recognition by exploiting semantic feature map with attention and multi-task learning","author":"Zhao Rui-Wei","year":"2019","unstructured":"Rui-Wei Zhao, Qi Zhang, Zuxuan Wu, Jianguo Li, and Yu-Gang Jiang. 2019. Visual content recognition by exploiting semantic feature map with attention and multi-task learning. ACM Trans. Multim. Comput. Commun. Applic. (2019).","journal-title":"ACM Trans. Multim. Comput. Commun. Applic."},{"key":"e_1_3_2_71_2","volume-title":"CVPR","author":"Zheng Liang","year":"2015","unstructured":"Liang Zheng, Shengjin Wang, Lu Tian, Fei He, Ziqiong Liu, and Qi Tian. 2015. Query-adaptive late fusion for image search and person re-identification. In CVPR."},{"key":"e_1_3_2_72_2","doi-asserted-by":"crossref","DOI":"10.1145\/3501404","article-title":"Clustering matters: Sphere feature for fully unsupervised person re-identification","author":"Zheng Yi","year":"2022","unstructured":"Yi Zheng, Yong Zhou, Jiaqi Zhao, Ying Chen, Rui Yao, Bing Liu, and Abdulmotaleb El Saddik. 2022. Clustering matters: Sphere feature for fully unsupervised person re-identification. ACM Trans. Multim. Comput. Commun. Applic. (2022).","journal-title":"ACM Trans. Multim. Comput. Commun. Applic."},{"key":"e_1_3_2_73_2","volume-title":"ECCV","author":"Zhou Bolei","year":"2018","unstructured":"Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. 2018. Temporal relational reasoning in videos. In ECCV."},{"key":"e_1_3_2_74_2","volume-title":"ICCV","author":"Zhou Hang","year":"2019","unstructured":"Hang Zhou, Ziwei Liu, Xudong Xu, Ping Luo, and Xiaogang Wang. 2019. Vision-infused deep audio inpainting. In ICCV."},{"key":"e_1_3_2_75_2","volume-title":"CVPR","author":"Zhou Yipin","year":"2018","unstructured":"Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, and Tamara L. Berg. 2018. Visual to sound: Generating natural sound for videos in the wild. In CVPR."},{"key":"e_1_3_2_76_2","volume-title":"ECCV","author":"Zhu Chen","year":"2018","unstructured":"Chen Zhu, Xiao Tan, Feng Zhou, Xiao Liu, Kaiyu Yue, Errui Ding, and Yi Ma. 2018. Fine-grained video categorization with redundancy reduction attention. In ECCV."},{"key":"e_1_3_2_77_2","volume-title":"AAAI","author":"Zhu Linchao","year":"2020","unstructured":"Linchao Zhu, Du Tran, Laura Sevilla-Lara, Yi Yang, Matt Feiszli, and Heng Wang. 2020. Faster recurrent networks for efficient video classification. 
In AAAI."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3572776","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3572776","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:08Z","timestamp":1750182668000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3572776"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,27]]},"references-count":76,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,2,29]]}},"alternative-id":["10.1145\/3572776"],"URL":"https:\/\/doi.org\/10.1145\/3572776","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2023,9,27]]},"assertion":[{"value":"2022-03-24","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-10-20","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-09-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
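The abstract above describes a per-step gating mechanism: LSTMs on high-cost modalities carry a gating module that reads lower-level features plus historical information and decides whether to activate the expensive modality or reuse its history. The following is a minimal PyTorch sketch of that idea, not the authors' implementation: every name and dimension is a hypothetical placeholder, and the straight-through Gumbel-Softmax gate is an assumption motivated by the Jang et al. (ICLR 2017) entry in the reference list (key e_1_3_2_17_2) rather than anything confirmed by this record.

# A minimal sketch of HCMS-style conditional gating; all names/dims are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedModalityLSTM(nn.Module):
    """One high-cost branch (e.g., appearance or motion): an LSTM cell
    guarded by a binary gate that decides, per time step, whether to
    spend compute on this modality or reuse historical information."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        # The gate sees the low-cost (audio-level) feature plus this
        # branch's own history, mirroring the abstract's description.
        self.gate = nn.Sequential(
            nn.Linear(feat_dim + hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 2),  # logits over {skip, activate}
        )

    def forward(self, low_feat, high_feat, h, c):
        logits = self.gate(torch.cat([low_feat, h], dim=-1))
        # Hard 0/1 decision that remains differentiable in the backward
        # pass via the straight-through Gumbel-Softmax relaxation.
        use = F.gumbel_softmax(logits, tau=1.0, hard=True)[:, 1:2]
        h_new, c_new = self.cell(high_feat, (h, c))
        # Activate the expensive update, or carry the old state forward.
        # (For clarity the cell is always evaluated here; at inference one
        # would skip the cell and the modality's CNN backbone when use == 0,
        # which is where the computational savings would actually come from.)
        h = use * h_new + (1.0 - use) * h
        c = use * c_new + (1.0 - use) * c
        return h, c, use

# Toy usage with hypothetical dimensions: batch of 4, 128-d features.
# A base LSTM over the always-on audio modality is omitted here.
branch = GatedModalityLSTM(feat_dim=128, hidden_dim=256)
h = torch.zeros(4, 256)
c = torch.zeros(4, 256)
low = torch.randn(4, 128)   # low-cost audio feature for this step
high = torch.randn(4, 128)  # expensive appearance/motion feature
h, c, used = branch(low, high, h, c)  # used[i] == 1 where the branch fired

In a hierarchical arrangement as the abstract describes, one such gated branch per expensive modality would sit above the always-on audio LSTM, so each video spends compute only on the modalities its gates select.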