{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:11:06Z","timestamp":1750219866728,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":40,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3552458.3556449","type":"proceedings-article","created":{"date-parts":[[2022,10,4]],"date-time":"2022-10-04T22:08:06Z","timestamp":1664921286000},"page":"25-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Multi-level Multi-modal Feature Fusion for Action Recognition in Videos"],"prefix":"10.1145","author":[{"given":"Xinghang","family":"Hu","sequence":"first","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yanli","family":"Ji","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gedamu Alemu","family":"Kumie","sequence":"additional","affiliation":[{"name":"Sichuan Artificial intelligence Research Institute, Yibin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"ViViT: A Video Vision Transformer. In 2021 IEEE\/CVF International Conference on Computer Vision (ICCV). 
6816--6826","author":"Arnab Anurag","year":"2021","unstructured":"Anurag Arnab , Mostafa Dehghani , Georg Heigold , Chen Sun , Mario Lui , and Cordelia Schmid . 2021 . ViViT: A Video Vision Transformer. In 2021 IEEE\/CVF International Conference on Computer Vision (ICCV). 6816--6826 . Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lui, and Cordelia Schmid. 2021. ViViT: A Video Vision Transformer. In 2021 IEEE\/CVF International Conference on Computer Vision (ICCV). 6816--6826."},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2798607"},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00548"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00086"},{"key":"e_1_3_2_2_6_1","volume-title":"Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray.","author":"Damen Dima","year":"2020","unstructured":"Dima Damen , Hazel Doughty , Giovanni Maria Farinella , Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. 2020 . Rescaling Egocentric Vision. ArXiv , Vol. abs\/ 2006 .13256 (2020). Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. 2020. Rescaling Egocentric Vision. ArXiv , Vol. abs\/2006.13256 (2020)."},{"key":"e_1_3_2_2_7_1","volume-title":"International Conference on Learning Representations.","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy , Lucas Beyer , Alexander Kolesnikov , Dirk Weissenborn , Xiaohua Zhai , Thomas Unterthiner , Mostafa Dehghani , Matthias Minderer , Georg Heigold , Sylvain Gelly , Jakob Uszkoreit , and Neil Houlsby . 2021 . 
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale . In International Conference on Learning Representations. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_8_1","volume-title":"SlowFast Networks for Video Recognition. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV). 6201--6210","author":"Feichtenhofer Christoph","year":"2019","unstructured":"Christoph Feichtenhofer , Haoqi Fan , Jitendra Malik , and Kaiming He . 2019 . SlowFast Networks for Video Recognition. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV). 6201--6210 . Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. SlowFast Networks for Video Recognition. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV). 6201--6210."},{"key":"e_1_3_2_2_9_1","volume-title":"Anticipating Egocentric Actions With Rolling-Unrolling LSTMs and Modality Attention. 2019 IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Furnari Antonino","year":"2019","unstructured":"Antonino Furnari and Giovanni Maria Farinella . 2019 . What Would You Expect? Anticipating Egocentric Actions With Rolling-Unrolling LSTMs and Modality Attention. 2019 IEEE\/CVF International Conference on Computer Vision (ICCV) (2019), 6251--6260. Antonino Furnari and Giovanni Maria Farinella. 2019. What Would You Expect? Anticipating Egocentric Actions With Rolling-Unrolling LSTMs and Modality Attention. 
2019 IEEE\/CVF International Conference on Computer Vision (ICCV) (2019), 6251--6260."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58548-8_13"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01047"},{"key":"e_1_3_2_2_12_1","volume-title":"Cees G. M. Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Krishna, S. Buch, and Cuong Duc Dao.","author":"Ghanem Bernard","year":"2018","unstructured":"Bernard Ghanem , Juan Carlos Niebles , Cees G. M. Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Krishna, S. Buch, and Cuong Duc Dao. 2018 . The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary. ArXiv , Vol. abs\/ 1808 .03766 (2018). Bernard Ghanem, Juan Carlos Niebles, Cees G. M. Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Krishna, S. Buch, and Cuong Duc Dao. 2018. The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary. ArXiv , Vol. abs\/1808.03766 (2018)."},{"key":"e_1_3_2_2_13_1","volume-title":"Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV). 1440--1448","author":"Girshick Ross","year":"2015","unstructured":"Ross Girshick . 2015 . Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV). 1440--1448 . Ross Girshick. 2015. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV). 1440--1448."},{"key":"e_1_3_2_2_14_1","volume-title":"Proceedings of the 30th International Conference on Neural Information Processing Systems","author":"Harwath David","year":"2016","unstructured":"David Harwath , Antonio Torralba , and James R. Glass . 2016. Unsupervised Learning of Spoken Language with Visual Context . In Proceedings of the 30th International Conference on Neural Information Processing Systems ( Barcelona, Spain) (NIPS'16). Curran Associates Inc., Red Hook, NY, USA , 1866 --1874. David Harwath, Antonio Torralba, and James R. Glass. 2016. 
Unsupervised Learning of Spoken Language with Visual Context. In Proceedings of the 30th International Conference on Neural Information Processing Systems (Barcelona, Spain) (NIPS'16). Curran Associates Inc., Red Hook, NY, USA, 1866--1874."},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00315"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2823900"},{"key":"e_1_3_2_2_18_1","volume-title":"Large-Scale Video Classification with Convolutional Neural Networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition. 1725--1732","author":"Karpathy Andrej","year":"2014","unstructured":"Andrej Karpathy , George Toderici , Sanketh Shetty , Thomas Leung , Rahul Sukthankar , and Li Fei-Fei . 2014 . Large-Scale Video Classification with Convolutional Neural Networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition. 1725--1732 . Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-Scale Video Classification with Convolutional Neural Networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition. 1725--1732."},{"key":"e_1_3_2_2_19_1","unstructured":"Evangelos Kazakos Jaesung Huh Arsha Nagrani Andrew Zisserman and Dima Damen. 2021a. With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition. In BMVC. Evangelos Kazakos Jaesung Huh Arsha Nagrani Andrew Zisserman and Dima Damen. 2021a. With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition. In BMVC."},{"key":"e_1_3_2_2_20_1","volume-title":"EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV). 
5491--5500","author":"Kazakos Evangelos","year":"2019","unstructured":"Evangelos Kazakos , Arsha Nagrani , Andrew Zisserman , and Dima Damen . 2019 . EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV). 5491--5500 . Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. 2019. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. In 2019 IEEE\/CVF International Conference on Computer Vision (ICCV). 5491--5500."},{"key":"e_1_3_2_2_21_1","volume-title":"Slow-Fast Auditory Streams for Audio Recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 855--859","author":"Kazakos Evangelos","year":"2021","unstructured":"Evangelos Kazakos , Arsha Nagrani , Andrew Zisserman , and Dima Damen . 2021 b. Slow-Fast Auditory Streams for Audio Recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 855--859 . Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. 2021b. Slow-Fast Auditory Streams for Audio Recognition. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 855--859."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01576"},{"key":"e_1_3_2_2_23_1","volume-title":"Multi-modal Dense Video Captioning. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 4117--4126","author":"Esa Rahtu Vladimir","year":"2020","unstructured":"Vladimir lashin and Esa Rahtu . 2020 . Multi-modal Dense Video Captioning. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 4117--4126 . Vladimir lashin and Esa Rahtu. 2020. Multi-modal Dense Video Captioning. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 
4117--4126."},{"key":"e_1_3_2_2_24_1","volume-title":"Fine-Grained Visual Classification of Aircraft. ArXiv","author":"Maji Subhransu","year":"2013","unstructured":"Subhransu Maji , Esa Rahtu , Juho Kannala , Matthew B. Blaschko , and Andrea Vedaldi . 2013. Fine-Grained Visual Classification of Aircraft. ArXiv , Vol. abs\/ 1306 .5151 ( 2013 ). Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew B. Blaschko, and Andrea Vedaldi. 2013. Fine-Grained Visual Classification of Aircraft. ArXiv , Vol. abs\/1306.5151 (2013)."},{"key":"e_1_3_2_2_25_1","unstructured":"Arsha Nagrani Shan Yang Anurag Arnab Aren Jansen Cordelia Schmid and Chen Sun. 2021. Attention Bottlenecks for Multimodal Fusion. In NeurIPS. Arsha Nagrani Shan Yang Anurag Arnab Aren Jansen Cordelia Schmid and Chen Sun. 2021. Attention Bottlenecks for Multimodal Fusion. In NeurIPS."},{"volume-title":"Proceedings of the European Conference on Computer Vision (ECCV).","author":"Owens Andrew","key":"e_1_3_2_2_26_1","unstructured":"Andrew Owens and Alexei A. Efros . 2018. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features . In Proceedings of the European Conference on Computer Vision (ECCV). Andrew Owens and Alexei A. Efros. 2018. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. In Proceedings of the European Conference on Computer Vision (ECCV)."},{"key":"e_1_3_2_2_27_1","volume-title":"Christoph Feichtenhofer, Andrea Vedaldi, and Jo ao F. Henriques.","author":"Patrick Mandela","year":"2021","unstructured":"Mandela Patrick , Dylan Campbell , Yuki M. Asano , Ishan Misra Florian Metze , Christoph Feichtenhofer, Andrea Vedaldi, and Jo ao F. Henriques. 2021 . Keeping Your Eye on the Ball : Trajectory Attention in Video Transformers. In NeurIPS. Mandela Patrick, Dylan Campbell, Yuki M. Asano, Ishan Misra Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, and Jo ao F. Henriques. 2021. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. 
In NeurIPS."},{"key":"e_1_3_2_2_28_1","volume-title":"Technical Report: Temporal Aggregate Representations. ArXiv","author":"Sener Fadime","year":"2021","unstructured":"Fadime Sener , Dibyadip Chatterjee , and Angela Yao . 2021 . Technical Report: Temporal Aggregate Representations. ArXiv , Vol. abs\/ 2106 .03152 (2021). Fadime Sener, Dibyadip Chatterjee, and Angela Yao. 2021. Technical Report: Temporal Aggregate Representations. ArXiv , Vol. abs\/2106.03152 (2021)."},{"key":"e_1_3_2_2_29_1","volume-title":"Proceedings of the 27th International Conference on Neural Information Processing Systems -","volume":"1","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman . 2014 . Two-Stream Convolutional Networks for Action Recognition in Videos . In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1 (Montreal, Canada) (NIPS'14). MIT Press, Cambridge, MA, USA, 568--576. Karen Simonyan and Andrew Zisserman. 2014. Two-Stream Convolutional Networks for Action Recognition in Videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1 (Montreal, Canada) (NIPS'14). MIT Press, Cambridge, MA, USA, 568--576."},{"key":"e_1_3_2_2_30_1","unstructured":"Chen Sun Fabien Baradel Kevin Murphy and Cordelia Schmid. 2020. Learning Video Representations using Contrastive Bidirectional Transformer. Chen Sun Fabien Baradel Kevin Murphy and Cordelia Schmid. 2020. 
Learning Video Representations using Contrastive Bidirectional Transformer."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2868668"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01271"},{"key":"e_1_3_2_2_36_1","volume-title":"Non-local Neural Networks. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 7794--7803","author":"Wang Xiaolong","year":"2018","unstructured":"Xiaolong Wang , Ross Girshick , Abhinav Gupta , and Kaiming He . 2018 . Non-local Neural Networks. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 7794--7803 . Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local Neural Networks. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 7794--7803."},{"key":"e_1_3_2_2_37_1","volume-title":"Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer.","author":"Xiao Fanyi","year":"2020","unstructured":"Fanyi Xiao , Yong Jae Lee , Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer. 2020 . Audiovisual SlowFast Networks for Video Recognition. ArXiv , Vol. abs\/ 2001 .08740 (2020). Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, and Christoph Feichtenhofer. 2020. Audiovisual SlowFast Networks for Video Recognition. ArXiv , Vol. abs\/2001.08740 (2020)."},{"key":"e_1_3_2_2_38_1","volume-title":"A Multimodal Multiview Transformer Ensemble. arXiv preprint arXiv:2206.09852","author":"Xiong Xuehan","year":"2022","unstructured":"Xuehan Xiong , Anurag Arnab , Arsha Nagrani , and Cordelia Schmid . 2022. M& M Mix : A Multimodal Multiview Transformer Ensemble. arXiv preprint arXiv:2206.09852 ( 2022 ). 
Xuehan Xiong, Anurag Arnab, Arsha Nagrani, and Cordelia Schmid. 2022. M&M Mix: A Multimodal Multiview Transformer Ensemble. arXiv preprint arXiv:2206.09852 (2022)."},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"crossref","unstructured":"Huapeng Xu Guilin Qi Jingjing Li Meng Wang Kang Xu and Huan Gao. 2018. Fine-grained Image Classification by Visual-Semantic Embedding.. In IJCAI. 1043--1049. Huapeng Xu Guilin Qi Jingjing Li Meng Wang Kang Xu and Huan Gao. 2018. Fine-grained Image Classification by Visual-Semantic Embedding.. In IJCAI. 1043--1049.","DOI":"10.24963\/ijcai.2018\/145"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00333"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Lisboa Portugal","acronym":"MM '22"},"container-title":["Proceedings of the 3rd International Workshop on Human-Centric Multimedia Analysis"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3552458.3556449","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3552458.3556449","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:47:42Z","timestamp":1750178862000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3552458.3556449"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":40,"alternative-id":["10.1145\/3552458.3556449","10.1145\/3552458"],"URL":"https:\/\/doi.org\/10.1145\/3552458.3556449","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}