{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T17:45:19Z","timestamp":1772300719613,"version":"3.50.1"},"reference-count":161,"publisher":"Association for Computing Machinery (ACM)","issue":"8","license":[{"start":{"date-parts":[[2024,7,9]],"date-time":"2024-07-09T00:00:00Z","timestamp":1720483200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-sa\/4.0\/"}],"funder":[{"name":"Edith Cowan University (ECU) and the Higher Education Commission (HEC) of Pakistan","award":["PM\/HRDI-UESTPs\/UETs-I\/Phase-1\/Batch-VI\/2018"],"award-info":[{"award-number":["PM\/HRDI-UESTPs\/UETs-I\/Phase-1\/Batch-VI\/2018"]}]},{"name":"Office of National Intelligence National Intelligence Postdoctoral","award":["NIPG-2021-001"],"award-info":[{"award-number":["NIPG-2021-001"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,8,31]]},"abstract":"<jats:p>Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance as compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the past decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of \u2018fusing\u2019 the features of the individual data modalities. 
Hence, we specifically focus on the fusion design aspects of the MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting the popular trends in the adoption of CNN and Transformer building blocks for the overall problem. In particular, we emphasize recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss Human Action Recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices to train practicable models. We also provide an outlook on the multimodal datasets from their scale and evaluation viewpoint. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.<\/jats:p>","DOI":"10.1145\/3664815","type":"journal-article","created":{"date-parts":[[2024,5,13]],"date-time":"2024-05-13T11:08:51Z","timestamp":1715598531000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":34,"title":["From CNNs to Transformers in Multimodal Human Action Recognition: A Survey"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9042-5018","authenticated-orcid":false,"given":"Muhammad Bilal","family":"Shaikh","sequence":"first","affiliation":[{"name":"School of Engineering, Edith Cowan University, Joondalup, Australia and Molycop, Balcatta, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9004-7608","authenticated-orcid":false,"given":"Douglas","family":"Chai","sequence":"additional","affiliation":[{"name":"School of Engineering, Edith Cowan University, Joondalup, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3200-2903","authenticated-orcid":false,"given":"Syed Muhammad Shamsul","family":"Islam","sequence":"additional","affiliation":[{"name":"School of Science, Edith Cowan University, Joondalup, 
Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3406-673X","authenticated-orcid":false,"given":"Naveed","family":"Akhtar","sequence":"additional","affiliation":[{"name":"The University of Melbourne, Melbourne, Australia"}]}],"member":"320","published-online":{"date-parts":[[2024,7,9]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"crossref","first-page":"304","DOI":"10.1007\/3-540-48157-5_29","volume-title":"Proceedings of the International Symposium on Handheld and Ubiquitous Computing","author":"Abowd Gregory D.","year":"1999","unstructured":"Gregory D. Abowd, Anind K. Dey, Peter J. Brown et\u00a0al. 1999. Towards a better understanding of context and context-awareness. In Proceedings of the International Symposium on Handheld and Ubiquitous Computing. Springer, 304\u2013307."},{"key":"e_1_3_1_3_2","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1016\/j.patrec.2014.04.011","article-title":"Human activity recognition from 3D data: A review","volume":"48","author":"Aggarwal J. K.","year":"2014","unstructured":"J. K. Aggarwal and Lu Xia. 2014. Human activity recognition from 3D data: A review. Pattern Recogn. Lett. 48 (2014), 70\u201380.","journal-title":"Pattern Recogn. Lett."},{"key":"e_1_3_1_4_2","doi-asserted-by":"crossref","first-page":"3808","DOI":"10.3390\/s19173808","article-title":"Multi-sensor fusion for activity recognition: A survey","volume":"19","author":"Aguileta Antonio A.","year":"2019","unstructured":"Antonio A. Aguileta, Ramon F. Brena, Oscar Mayora, Erik Molino-Minero-Re, and Luis A. Trejo. 2019. Multi-sensor fusion for activity recognition: A survey. Sensors 19 (Sept.2019), 3808.","journal-title":"Sensors"},{"issue":"2","key":"e_1_3_1_5_2","first-page":"39","article-title":"How deep features have improved event recognition in multimedia: A survey","volume":"15","author":"Ahmad Kashif","year":"2019","unstructured":"Kashif Ahmad and Nicola Conci. 2019. 
How deep features have improved event recognition in multimedia: A survey. ACM Trans. Multimedia Comput. Commun. Appl. 15, 2, Article 39 (June2019), 27 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_6_2","first-page":"24206","article-title":"Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text","volume":"34","author":"Akbari Hassan","year":"2021","unstructured":"Hassan Akbari, Liangzhe Yuan, Rui Qian et\u00a0al. 2021. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Adv. Neural Info. Process. Syst. 34 (2021), 24206\u201324221.","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_1_7_2","first-page":"4575","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Alayrac Jean-Baptiste","year":"2016","unstructured":"Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal et\u00a0al. 2016. Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 4575\u20134583."},{"key":"e_1_3_1_8_2","first-page":"6836","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Arnab Anurag","year":"2021","unstructured":"Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lu\u010di\u0107, and Cordelia Schmid. 2021. ViViT: A video vision transformer. In Proceedings of the International Conference on Computer Vision. IEEE, 6836\u20136846."},{"key":"e_1_3_1_9_2","doi-asserted-by":"crossref","first-page":"345","DOI":"10.1007\/s00530-010-0182-0","article-title":"Multimodal fusion for multimedia analysis: A survey","volume":"16","author":"Atrey Pradeep K.","year":"2010","unstructured":"Pradeep K. Atrey, M. Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S. Kankanhalli. 2010. Multimodal fusion for multimedia analysis: A survey. 
Multimedia Systems 16 (2010), 345\u2013379.","journal-title":"Multimedia Systems"},{"key":"e_1_3_1_10_2","first-page":"469","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Baradel Fabien","year":"2018","unstructured":"Fabien Baradel, Christian Wolf, Julien Mille et\u00a0al. 2018. Glimpse clouds: Human activity recognition from unstructured feature points. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 469\u2013478."},{"key":"e_1_3_1_11_2","doi-asserted-by":"crossref","first-page":"103893","DOI":"10.1109\/ACCESS.2019.2931804","article-title":"Action recognition from thermal videos","volume":"7","author":"Batchuluun Ganbayar","year":"2019","unstructured":"Ganbayar Batchuluun, Dat Tien Nguyen, Tuyen Danh Pham, Chanhum Park, and Kang Ryoung Park. 2019. Action recognition from thermal videos. IEEE Access 7 (2019), 103893\u2013103917.","journal-title":"IEEE Access"},{"key":"e_1_3_1_12_2","first-page":"4","volume-title":"Proceedings of the International Conference on Machine Learning","volume":"2","author":"Bertasius Gedas","year":"2021","unstructured":"Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding?. In Proceedings of the International Conference on Machine Learning, Vol. 2. 4."},{"key":"e_1_3_1_13_2","first-page":"1","volume-title":"Proceedings of the IEEE International Conference on Pervasive Computing and Communication Workshops","author":"Bhattacharya Sourav","year":"2016","unstructured":"Sourav Bhattacharya and Nicholas D. Lane. 2016. From smart to deep: Robust activity recognition on smartwatches using deep learning. In Proceedings of the IEEE International Conference on Pervasive Computing and Communication Workshops. 
IEEE, 1\u20136."},{"key":"e_1_3_1_14_2","first-page":"178","article-title":"Vision-based daily routine recognition for healthcare with transfer learning","volume":"14","author":"Bruce X. B.","year":"2020","unstructured":"X. B. Bruce, Yan Liu, and Keith C. C. Chan. 2020. Vision-based daily routine recognition for healthcare with transfer learning. Int. J. Biomed. Biol. Eng. 14 (2020), 178\u2013186.","journal-title":"Int. J. Biomed. Biol. Eng."},{"key":"e_1_3_1_15_2","first-page":"301","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Camgoz Necati Cihan","year":"2020","unstructured":"Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020. Multi-channel transformers for multi-articulatory sign language translation. In Proceedings of the European Conference on Computer Vision. Springer, 301\u2013319."},{"issue":"3","key":"e_1_3_1_16_2","doi-asserted-by":"crossref","first-page":"489","DOI":"10.3390\/s19030489","article-title":"Metrological and critical characterization of the intel D415 stereo depth camera","volume":"19","author":"Carfagni Monica","year":"2019","unstructured":"Monica Carfagni, Rocco Furferi, Lapo Governi et\u00a0al. 2019. Metrological and critical characterization of the intel D415 stereo depth camera. Sensors 19, 3 (Jan.2019), 489.","journal-title":"Sensors"},{"key":"e_1_3_1_17_2","article-title":"A short note about kinetics-600","author":"Carreira Joao","year":"2018","unstructured":"Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. 2018. A short note about kinetics-600. Retrieved from https:\/\/arXiv:1808.01340","journal-title":"Retrieved from https:\/\/arXiv:1808.01340"},{"key":"e_1_3_1_18_2","article-title":"A short note on the kinetics-700 human action dataset","author":"Carreira Joao","year":"2019","unstructured":"Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. 2019. A short note on the kinetics-700 human action dataset. 
Retrieved from https:\/\/arXiv:1907.06987","journal-title":"Retrieved from https:\/\/arXiv:1907.06987"},{"key":"e_1_3_1_19_2","first-page":"6299","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Carreira Joao","year":"2017","unstructured":"Joao Carreira and Andrew Zisserman. 2017. Quo Vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. 6299\u20136308."},{"key":"e_1_3_1_20_2","first-page":"168","volume-title":"Proceedings of the International Conference on Image Processing","author":"Chen C.","year":"2015","unstructured":"C. Chen, R. Jafari, and N. Kehtarnavaz. 2015. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In Proceedings of the International Conference on Image Processing. IEEE, 168\u2013172."},{"key":"e_1_3_1_21_2","doi-asserted-by":"crossref","first-page":"4405","DOI":"10.1007\/s11042-015-3177-1","article-title":"A survey of depth and inertial sensor fusion for human action recognition","volume":"76","author":"Chen Chen","year":"2017","unstructured":"Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz. 2017. A survey of depth and inertial sensor fusion for human action recognition. Multimedia Tools Appl. 76 (2017), 4405\u20134425.","journal-title":"Multimedia Tools Appl."},{"key":"e_1_3_1_22_2","first-page":"1597","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning. 
PMLR, 1597\u20131607."},{"key":"e_1_3_1_23_2","first-page":"2177","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Curto David","year":"2021","unstructured":"David Curto, Albert Clap\u00e9s, Javier Selva, Sorina Smeureanu, Julio Junior, C. S. Jacques, David Gallardo-Pujol, Georgina Guilera, David Leiva, Thomas B. Moeslund et\u00a0al. 2021. Dyadformer: A multi-modal transformer for long-range modeling of dyadic interactions. In Proceedings of the International Conference on Computer Vision. IEEE, 2177\u20132188."},{"issue":"1","key":"e_1_3_1_24_2","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1007\/s11263-021-01531-2","article-title":"Rescaling egocentric vision","volume":"130","author":"Damen Dima","year":"2022","unstructured":"Dima Damen, Hazel Doughty, Giovanni Maria Farinella, , Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. 2022. Rescaling egocentric vision. Int. J. Comput. Vision 130, 1 (2022), 33\u201355.","journal-title":"Int. J. Comput. Vision"},{"key":"e_1_3_1_25_2","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Damen Dima","year":"2018","unstructured":"Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. 2018. Scaling egocentric vision: The EPIC-KITCHENS dataset. In Proceedings of the European Conference on Computer Vision. Springer."},{"key":"e_1_3_1_26_2","first-page":"833","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Das Srijan","year":"2019","unstructured":"Srijan Das, Rui Dai, Michal Koperski, Luca Minciullo, Lorenzo Garattoni, Francois Bremond, and Gianpiero Francesca. 2019. Toyota smarthome: Real-world activities of daily living. In Proceedings of the IEEE International Conference on Computer Vision. 
833\u2013842."},{"key":"e_1_3_1_27_2","first-page":"72","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Das Srijan","year":"2020","unstructured":"Srijan Das, Saurav Sharma, Rui Dai et\u00a0al. 2020. VPN: Learning video-pose embedding for activities of daily living. In Proceedings of the European Conference on Computer Vision. Springer, 72\u201390."},{"key":"e_1_3_1_28_2","doi-asserted-by":"crossref","first-page":"168297","DOI":"10.1109\/ACCESS.2020.3023599","article-title":"Infrared and 3D skeleton feature fusion for RGB-D action recognition","volume":"8","author":"Boissiere A. M. De","year":"2020","unstructured":"A. M. De Boissiere and R. Noumeir. 2020. Infrared and 3D skeleton feature fusion for RGB-D action recognition. IEEE Access 8 (2020), 168297\u2013168308.","journal-title":"IEEE Access"},{"key":"e_1_3_1_29_2","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. Retrieved from https:\/\/arXiv:1810.04805","journal-title":"Retrieved from https:\/\/arXiv:1810.04805"},{"key":"e_1_3_1_30_2","first-page":"961","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Escorcia Bernard Ghanem Fabian Caba Heilbron, Victor","year":"2015","unstructured":"Bernard Ghanem Fabian Caba Heilbron, Victor Escorcia et\u00a0al. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. 
IEEE, 961\u2013970."},{"key":"e_1_3_1_31_2","first-page":"6824","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Fan Haoqi","year":"2021","unstructured":"Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. 2021. Multiscale vision transformers. In Proceedings of the International Conference on Computer Vision. IEEE, 6824\u20136835."},{"key":"e_1_3_1_32_2","first-page":"538","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Fang Kuan","year":"2019","unstructured":"Kuan Fang, Alexander Toshev, Li Fei-Fei, and Silvio Savarese. 2019. Scene memory transformer for embodied agents in long-horizon tasks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 538\u2013547."},{"key":"e_1_3_1_33_2","first-page":"157","volume-title":"Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video","author":"Fassold Hannes","year":"2019","unstructured":"Hannes Fassold and Barnabas Takacs. 2019. Towards automatic cinematography and annotation for 360 \\(^{\\circ }\\) video. In Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video. ACM, 157\u2013166."},{"key":"e_1_3_1_34_2","first-page":"6202","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Feichtenhofer Christoph","year":"2019","unstructured":"Christoph Feichtenhofer et\u00a0al. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 6202\u20136211."},{"key":"e_1_3_1_35_2","first-page":"1933","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Feichtenhofer Christoph","year":"2016","unstructured":"Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. 
2016. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 1933\u20131941."},{"key":"e_1_3_1_36_2","doi-asserted-by":"crossref","first-page":"548","DOI":"10.1016\/j.neucom.2016.09.063","article-title":"Learning deep event models for crowd anomaly detection","volume":"219","author":"Feng Yachuang","year":"2017","unstructured":"Yachuang Feng, Yuan Yuan, and Xiaoqiang Lu. 2017. Learning deep event models for crowd anomaly detection. Neurocomputing 219 (2017), 548\u2013556.","journal-title":"Neurocomputing"},{"key":"e_1_3_1_37_2","first-page":"3636","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Fernando Basura","year":"2017","unstructured":"Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. 2017. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 3636\u20133645."},{"key":"e_1_3_1_38_2","first-page":"214","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Gabeur Valentin","year":"2020","unstructured":"Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2020. Multi-modal transformer for video retrieval. In Proceedings of the European Conference on Computer Vision. Springer, 214\u2013229."},{"key":"e_1_3_1_39_2","first-page":"839","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Gavrilyuk Kirill","year":"2020","unstructured":"Kirill Gavrilyuk, Ryan Sanford, Mehrsan Javan, and Cees G. M. Snoek. 2020. Actor-transformers for group activity recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. 
IEEE, 839\u2013848."},{"key":"e_1_3_1_40_2","first-page":"244","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Girdhar Rohit","year":"2019","unstructured":"Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. 2019. Video action transformer network. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 244\u2013253."},{"issue":"10","key":"e_1_3_1_41_2","doi-asserted-by":"crossref","first-page":"3343","DOI":"10.1016\/j.patcog.2014.04.018","article-title":"A Survey on still-image-based human action recognition","volume":"47","author":"Guo Guodong","year":"2014","unstructured":"Guodong Guo and Alice Lai. 2014. A Survey on still-image-based human action recognition. Pattern Recogn. 47, 10 (2014), 3343\u20133361.","journal-title":"Pattern Recogn."},{"key":"e_1_3_1_42_2","first-page":"85","article-title":"Space-time representation of people based on 3D skeletal data: A review","volume":"158","author":"Han Fei","year":"2017","unstructured":"Fei Han, Brian Reily, William Hoff, and Hao Zhang. 2017. Space-time representation of people based on 3D skeletal data: A review. J. Vision Commun. Image Represent. 158 (2017), 85\u2013105.","journal-title":"J. Vision Commun. Image Represent."},{"key":"e_1_3_1_43_2","first-page":"1483","volume-title":"Proceedings of the International Conference on Computer Vision Workshops","author":"Han Tengda","year":"2019","unstructured":"Tengda Han, Weidi Xie, and Andrew Zisserman. 2019. Video representation learning by dense predictive coding. In Proceedings of the International Conference on Computer Vision Workshops. IEEE, 1483\u20131492."},{"key":"e_1_3_1_44_2","first-page":"770","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"He Kaiming","year":"2016","unstructured":"Kaiming He et\u00a0al. 2016. Deep residual learning for image recognition. 
In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. 770\u2013778."},{"key":"e_1_3_1_45_2","article-title":"Pretrained transformers improve out-of-distribution robustness","author":"Hendrycks Dan","year":"2020","unstructured":"Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. 2020. Pretrained transformers improve out-of-distribution robustness. Retrieved from https:\/\/arXiv:2004.06100","journal-title":"Retrieved from https:\/\/arXiv:2004.06100"},{"key":"e_1_3_1_46_2","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1016\/j.imavis.2017.01.010","article-title":"Going deeper into action recognition: A survey","volume":"60","author":"Herath Samitha","year":"2017","unstructured":"Samitha Herath, Mehrtash Harandi, and Fatih Porikli. 2017. Going deeper into action recognition: A survey. Image Vision Comput. 60 (2017), 4\u201321.","journal-title":"Image Vision Comput."},{"key":"e_1_3_1_47_2","article-title":"Carnegie Mellon University, CMU Graphics Lab, Motion Capture Library","author":"Hodgins Jessica","year":"2021","unstructured":"Jessica Hodgins. 2021. Carnegie Mellon University, CMU Graphics Lab, Motion Capture Library. Retrieved January 28, 2021 from http:\/\/mocap.cs.cmu.edu\/","journal-title":"R"},{"issue":"4","key":"e_1_3_1_48_2","doi-asserted-by":"crossref","first-page":"133","DOI":"10.1145\/3409332","article-title":"Knowledge-driven egocentric multimodal activity recognition","volume":"16","author":"Huang Yi","year":"2020","unstructured":"Yi Huang, Xiaoshan Yang, Junyu Gao, Jitao Sang, and Changsheng Xu. 2020. Knowledge-driven egocentric multimodal activity recognition. ACM Trans. Multimedia Comput. Commun. Appl. 16, 4, Article 133 (2020), 133 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. 
Appl."},{"key":"e_1_3_1_49_2","article-title":"A better use of audio-visual cues: Dense video captioning with bi-modal transformer","author":"Iashin Vladimir","year":"2020","unstructured":"Vladimir Iashin and Esa Rahtu. 2020. A better use of audio-visual cues: Dense video captioning with bi-modal transformer. Retrieved from https:\/\/arXiv:2005.08271","journal-title":"Retrieved from https:\/\/arXiv:2005.08271"},{"key":"e_1_3_1_50_2","first-page":"958","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition Workshops","author":"Iashin Vladimir","year":"2020","unstructured":"Vladimir Iashin and Esa Rahtu. 2020. Multi-modal dense video captioning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 958\u2013959."},{"key":"e_1_3_1_51_2","article-title":"Xtion PRO LIVE| 3D Sensor | ASUS USA","author":"Inc ASUSTeK Computer","year":"2020","unstructured":"ASUSTeK Computer Inc. 2020. Xtion PRO LIVE| 3D Sensor | ASUS USA. Retrieved July 1, 2022 from https:\/\/www.asus.com\/us\/3D-Sensor\/Xtion_PRO_LIVE\/","journal-title":"R"},{"key":"e_1_3_1_52_2","first-page":"10285","volume-title":"Proceedings of the IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS\u201920)","author":"Islam Md Mofijul","year":"2020","unstructured":"Md Mofijul Islam and Tariq Iqbal. 2020. Hamlet: A hierarchical multimodal attention-based human activity recognition algorithm. In Proceedings of the IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS\u201920). IEEE, 10285\u201310292."},{"key":"e_1_3_1_53_2","first-page":"4651","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Jaegle Andrew","year":"2021","unstructured":"Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. 
In Proceedings of the International Conference on Machine Learning. 4651\u20134664."},{"key":"e_1_3_1_54_2","first-page":"731","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Kalfaoglu M.","year":"2020","unstructured":"M. Kalfaoglu, Sinan Kalkan, and A. Aydin Alatan. 2020. Late temporal modeling in 3D CNN architectures with BERT for action recognition. In Proceedings of the European Conference on Computer Vision. Springer, 731\u2013747."},{"key":"e_1_3_1_55_2","article-title":"The kinetics human action video dataset","author":"Kay Will","year":"2017","unstructured":"Will Kay, Joao Carreira, Karen Simonyan et\u00a0al. 2017. The kinetics human action video dataset. Retrieved from https:\/\/arXiv:1705.06950","journal-title":"R"},{"key":"e_1_3_1_56_2","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1016\/j.patrec.2018.04.035","article-title":"Combining CNN streams of RGB-D and skeletal data for human activity recognition","volume":"115","author":"Khaire Pushpajit","year":"2018","unstructured":"Pushpajit Khaire, Praveen Kumar, and Javed Imran. 2018. Combining CNN streams of RGB-D and skeletal data for human activity recognition. Pattern Recogn. Lett. 115 (2018), 107\u2013116.","journal-title":"Pattern Recogn. Lett."},{"key":"e_1_3_1_57_2","first-page":"1155","volume-title":"Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing","author":"Khan Aftab","year":"2015","unstructured":"Aftab Khan, Sebastian Mellor, Eugen Berlin et\u00a0al. 2015. Beyond activity recognition: Skill assessment from accelerometer data. In Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing. 
1155\u20131166."},{"key":"e_1_3_1_58_2","first-page":"1","article-title":"Human action recognition using fusion of multiview and deep features: An application to video surveillance","author":"Khan Muhammad Attique","year":"2020","unstructured":"Muhammad Attique Khan, Kashif Javed, Sajid Ali Khan, Tanzila Saba, Usman Habib, Junaid Ali Khan, and Aaqif Afzaal Abbasi. 2020. Human action recognition using fusion of multiview and deep features: An application to video surveillance. Multimedia Tools Appl. (2020), 1\u201327.","journal-title":"Multimedia Tools Appl."},{"key":"e_1_3_1_59_2","first-page":"8545","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"33","author":"Kim Dahun","year":"2019","unstructured":"Dahun Kim, Donghyeon Cho, and In So Kweon. 2019. Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8545\u20138552."},{"key":"e_1_3_1_60_2","first-page":"673","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Kim Kyung-Min","year":"2018","unstructured":"Kyung-Min Kim, Seong-Ho Choi, Jin-Hwa Kim, and Byoung-Tak Zhang. 2018. Multimodal dual attention memory for video story question answering. In Proceedings of the European Conference on Computer Vision. Springer, 673\u2013688."},{"issue":"3","key":"e_1_3_1_61_2","first-page":"302","article-title":"Lapformer: Surgical tool detection in laparoscopic surgical video using transformer architecture","volume":"9","author":"Kondo Satoshi","year":"2021","unstructured":"Satoshi Kondo. 2021. Lapformer: Surgical tool detection in laparoscopic surgical video using transformer architecture. Comput. Methods Biomech. Biomed. Eng.: Imag. Visual. 9, 3 (2021), 302\u2013307.","journal-title":"Comput. Methods Biomech. Biomed. Eng.: Imag. 
Visual."},{"key":"e_1_3_1_62_2","first-page":"8658","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Kong Quan","year":"2019","unstructured":"Quan Kong, Ziming Wu, Ziwei Deng, Martin Klinkigt, Bin Tong, and Tomokazu Murakami. 2019. MMAct: A large-scale dataset for cross modal human action understanding. In Proceedings of the International Conference on Computer Vision. IEEE, 8658\u20138667."},{"key":"e_1_3_1_63_2","first-page":"6231","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Korbar Bruno","year":"2019","unstructured":"Bruno Korbar, Du Tran, and Lorenzo Torresani. 2019. SCSampler: Sampling salient clips from video for efficient action recognition. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 6231\u20136241."},{"key":"e_1_3_1_64_2","first-page":"2556","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Kuehne Hilde","year":"2011","unstructured":"Hilde Kuehne, Hueihan Jhuang, Estibaliz Garrote, Tomaso Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2556\u20132563."},{"key":"e_1_3_1_65_2","first-page":"283","volume-title":"Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing","author":"Lane Nicholas D.","year":"2015","unstructured":"Nicholas D. Lane, Petko Georgiev, and Lorena Qendro. 2015. Deepear: Robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing. 283\u2013294."},{"key":"e_1_3_1_66_2","article-title":"Parameter efficient multimodal transformers for video representation learning","author":"Lee Sangho","year":"2020","unstructured":"Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, and Yale Song. 2020. 
Parameter efficient multimodal transformers for video representation learning. Retrieved from https:\/\/arXiv:2012.04124","journal-title":"Retrieved from https:\/\/arXiv:2012.04124"},{"key":"e_1_3_1_67_2","article-title":"Hero: Hierarchical encoder for video+ language omni-representation pre-training","author":"Li Linjie","year":"2020","unstructured":"Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. Hero: Hierarchical encoder for video+ language omni-representation pre-training. Retrieved from https:\/\/arXiv:2005.00200","journal-title":"Retrieved from https:\/\/arXiv:2005.00200"},{"key":"e_1_3_1_68_2","first-page":"13668","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Li Shuaicheng","year":"2021","unstructured":"Shuaicheng Li, Qianggang Cao, Lingbo Liu, Kunlin Yang, Shinan Liu, Jun Hou, and Shuai Yi. 2021. Groupformer: Group activity recognition with clustered spatial-temporal transformer. In Proceedings of the International Conference on Computer Vision. IEEE, 13668\u201313677."},{"key":"e_1_3_1_69_2","doi-asserted-by":"crossref","first-page":"501","DOI":"10.1016\/j.neucom.2018.10.104","article-title":"A sequential deep learning application for recognising human activities in smart homes","volume":"396","author":"Liciotti Daniele","year":"2020","unstructured":"Daniele Liciotti, Michele Bernardini, Luca Romeo, and Emanuele Frontoni. 2020. A sequential deep learning application for recognising human activities in smart homes. Neurocomputing 396 (2020), 501\u2013513.","journal-title":"Neurocomputing"},{"key":"e_1_3_1_70_2","article-title":"PKU-MMD: A large scale benchmark for continuous multi-modal human action understanding","author":"Liu Chunhui","year":"2017","unstructured":"Chunhui Liu, Yueyu Hu, Yanghao Li, Sijie Song, and Jiaying Liu. 2017. PKU-MMD: A large scale benchmark for continuous multi-modal human action understanding. 
Retrieved from https:\/\/arXiv:1903.11314","journal-title":"Retrieved from https:\/\/arXiv:1903.11314"},{"key":"e_1_3_1_71_2","first-page":"2684","article-title":"NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding","author":"Liu Jun","year":"2019","unstructured":"Jun Liu, Amir Shahroudy, Mauricio Lisboa Perez, Gang Wang, Ling-Yu Duan, and Alex Kot Chichung. 2019. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. (2019), 2684\u20132701.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"e_1_3_1_72_2","first-page":"1159","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Liu Mengyuan","year":"2018","unstructured":"Mengyuan Liu and Junsong Yuan. 2018. Recognizing human actions as the evolution of pose estimation maps. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 1159\u20131168."},{"key":"e_1_3_1_73_2","first-page":"11915","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Liu Song","year":"2021","unstructured":"Song Liu, Haoqi Fan, Shengsheng Qian, Yiru Chen, Wenkui Ding, and Zhongyuan Wang. 2021. Hit: Hierarchical transformer with momentum contrast for video-text retrieval. In Proceedings of the International Conference on Computer Vision. IEEE, 11915\u201311925."},{"key":"e_1_3_1_74_2","first-page":"3202","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Liu Ze","year":"2022","unstructured":"Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video swin transformer. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. 
IEEE, 3202\u20133211."},{"key":"e_1_3_1_75_2","first-page":"7202","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Long Xiang","year":"2018","unstructured":"Xiang Long, Chuang Gan, Gerard de Melo, Xiao Liu, Yandong Li, Fu Li, and Shilei Wen. 2018. Multimodal keyless attention fusion for video classification. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, 7202\u20137209."},{"key":"e_1_3_1_76_2","article-title":"VilBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks","volume":"32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. VilBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Info. Process. Syst. 32 (2019).","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_1_77_2","article-title":"Pretrained transformers as universal computation engines","author":"Lu Kevin","year":"2021","unstructured":"Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. 2021. Pretrained transformers as universal computation engines. Retrieved from https:\/\/arXiv:2103.05247","journal-title":"R"},{"key":"e_1_3_1_78_2","first-page":"5137","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Luvizon D. C.","year":"2018","unstructured":"D. C. Luvizon, D. Picard, and H. Tabia. 2018. 2D\/3D pose estimation and action recognition using multitask deep learning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 5137\u20135146."},{"key":"e_1_3_1_79_2","article-title":"Gimme signals: Discriminative signal encoding for multimodal activity recognition","author":"Memmesheimer Raphael","year":"2020","unstructured":"Raphael Memmesheimer, Nick Theisen, and Dietrich Paulus. 2020. 
Gimme signals: Discriminative signal encoding for multimodal activity recognition. Retrieved from https:\/\/arXiv:2003.06156","journal-title":"Retrieved from https:\/\/arXiv:2003.06156"},{"issue":"1","key":"e_1_3_1_80_2","first-page":"6","article-title":"Fixation prediction through multimodal analysis","volume":"13","author":"Min Xiongkuo","year":"2016","unstructured":"Xiongkuo Min, Guangtao Zhai, Ke Gu, and Xiaokang Yang. 2016. Fixation prediction through multimodal analysis. ACM Trans. Multimedia Comput. Commun. Appl. 13, 1, Article 6 (2016), 23 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"issue":"1","key":"e_1_3_1_81_2","first-page":"1","article-title":"Artificial intelligence-based methods for fusion of electronic health records and imaging data","volume":"12","author":"Mohsen Farida","year":"2022","unstructured":"Farida Mohsen, Hazrat Ali, Nady El Hajj, and Zubair Shah. 2022. Artificial intelligence-based methods for fusion of electronic health records and imaging data. Sci. Rep. 12, 1 (2022), 1\u201316.","journal-title":"Sci. Rep."},{"key":"e_1_3_1_82_2","first-page":"675","volume-title":"Proceedings of the International Conference on Control Systems and Computer Science","author":"Nan Mihai","year":"2019","unstructured":"Mihai Nan, Alexandra Stefania Ghi\u021b\u0103, Alexandru-Florin Gavril, Mihai Trascau, Alexandru Sorici, Bogdan Cramariuc, and Adina Magda Florea. 2019. Human action recognition for social robots. In Proceedings of the International Conference on Control Systems and Computer Science. IEEE, 675\u2013681."},{"key":"e_1_3_1_83_2","first-page":"1147","volume-title":"Proceedings of the IEEE International Conference on Computer Vision Workshops","author":"Ni Bingbing","year":"2011","unstructured":"Bingbing Ni, Gang Wang, and Pierre Moulin. 2011. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 
IEEE, 1147\u20131153."},{"issue":"4","key":"e_1_3_1_84_2","first-page":"131","article-title":"MMFN: Multimodal information fusion networks for 3D model classification and retrieval","volume":"16","author":"Nie Weizhi","year":"2020","unstructured":"Weizhi Nie, Qi Liang, Yixin Wang, Xing Wei, and Yuting Su. 2020. MMFN: Multimodal information fusion networks for 3D model classification and retrieval. ACM Trans. Multimedia Comput. Commun. Appl. 16, 4, Article 131 (2020), 22 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_85_2","first-page":"53","volume-title":"Proceedings of the Workshop on Applications of Computer Vision","author":"Ofli F.","year":"2013","unstructured":"F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. 2013. Berkeley MHAD: A comprehensive multimodal human action database. In Proceedings of the Workshop on Applications of Computer Vision. IEEE, 53\u201360."},{"key":"e_1_3_1_86_2","unstructured":"OpenAI. 2023. GPT-4 Technical Report. Retrieved from https:\/\/arxiv:2303.08774"},{"key":"e_1_3_1_87_2","first-page":"716","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Oreifej Omar","year":"2013","unstructured":"Omar Oreifej and Zicheng Liu. 2013. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 716\u2013723."},{"key":"e_1_3_1_88_2","first-page":"15942","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Pashevich Alexander","year":"2021","unstructured":"Alexander Pashevich, Cordelia Schmid, and Chen Sun. 2021. Episodic transformer for vision-and-language navigation. In Proceedings of the IEEE International Conference on Computer Vision. 
IEEE, 15942\u201315952."},{"key":"e_1_3_1_89_2","doi-asserted-by":"crossref","first-page":"284","DOI":"10.1016\/j.compeleceng.2016.06.004","article-title":"Human action recognition using fusion of features for unconstrained video sequences","volume":"70","author":"Patel Chirag I.","year":"2018","unstructured":"Chirag I. Patel, Sanjay Garg, Tanish Zaveri, Asim Banerjee, and Ripal Patel. 2018. Human action recognition using fusion of features for unconstrained video sequences. Comput. Electr. Eng. 70 (2018), 284\u2013301.","journal-title":"Comput. Electr. Eng."},{"key":"e_1_3_1_90_2","first-page":"10560","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Patrick Mandela","year":"2021","unstructured":"Mandela Patrick, Po-Yao Huang, Ishan Misra et\u00a0al. 2021. Space-time crop & attend: Improving cross-modal video representation learning. In Proceedings of the International Conference on Computer Vision. IEEE, 10560\u201310572."},{"key":"e_1_3_1_91_2","first-page":"6966","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Perez-Rua Juan-Manuel","year":"2019","unstructured":"Juan-Manuel Perez-Rua, Valentin Vielzeuf, Stephane Pateux et\u00a0al. 2019. MFAS: Multimodal fusion architecture search. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 6966\u20136975."},{"key":"e_1_3_1_92_2","first-page":"475","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Perrett Toby","year":"2021","unstructured":"Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, and Dima Damen. 2021. Temporal-relational crosstransformers for few-shot action recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. 
IEEE, 475\u2013484."},{"issue":"7638","key":"e_1_3_1_93_2","doi-asserted-by":"crossref","first-page":"532","DOI":"10.1038\/nature21054","article-title":"A solution to the single-question crowd wisdom problem","volume":"541","author":"Prelec Dra\u017een","year":"2017","unstructured":"Dra\u017een Prelec, H. Sebastian Seung, and John McCoy. 2017. A solution to the single-question crowd wisdom problem. Nature 541, 7638 (2017), 532\u2013535.","journal-title":"Nature"},{"key":"e_1_3_1_94_2","first-page":"961","volume-title":"Proceedings of the International Conference on Computer Vision Workshops","author":"Purwanto Didik","year":"2019","unstructured":"Didik Purwanto, Rizard Renanda Adhi Pramono, Yie-Tarng Chen, and Wen-Hsien Fang. 2019. Extreme low resolution action recognition with spatial-temporal multi-head self-attention and knowledge distillation. In Proceedings of the International Conference on Computer Vision Workshops. IEEE, 961\u2013969."},{"issue":"2","key":"e_1_3_1_95_2","first-page":"549","article-title":"StagNet: An attentive semantic RNN for group activity and individual action recognition","volume":"30","author":"Qi Mengshi","year":"2019","unstructured":"Mengshi Qi, Yunhong Wang, Jie Qin et\u00a0al. 2019. StagNet: An attentive semantic RNN for group activity and individual action recognition. IEEE Trans. Circ. Syst. Video Technol. 30, 2 (2019), 549\u2013565.","journal-title":"IEEE Trans. Circ. Syst. Video Technol."},{"issue":"1","key":"e_1_3_1_96_2","first-page":"18","article-title":"A multimodal, multimedia point-of-care deep learning framework for COVID-19 diagnosis","volume":"17","author":"Rahman MD Abdur","year":"2021","unstructured":"MD Abdur Rahman, M. Shamim Hossain, Nabil A. Alrajeh, and B. B. Gupta. 2021. A multimodal, multimedia point-of-care deep learning framework for COVID-19 diagnosis. ACM Trans. Multimedia Comput. Commun. Appl. 17, 1s, Article 18 (2021), 24 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. 
Appl."},{"key":"e_1_3_1_97_2","first-page":"8821","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning. PMLR, 8821\u20138831."},{"key":"e_1_3_1_98_2","first-page":"1255","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Recasens Adria","year":"2021","unstructured":"Adria Recasens, Pauline Luc, Jean-Baptiste Alayrac et\u00a0al. 2021. Broaden your views for self-supervised video learning. In Proceedings of the International Conference on Computer Vision. IEEE, 1255\u20131265."},{"key":"e_1_3_1_99_2","first-page":"91","volume-title":"Proceedings of the Conference on Neural Information Processing Systems (NeurIPS\u201915)","volume":"28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren et\u00a0al. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS\u201915), Vol. 28. 91\u201399."},{"key":"e_1_3_1_100_2","first-page":"198","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition Workshops","author":"Roitberg Alina","year":"2019","unstructured":"Alina Roitberg, Tim Pollert, Monica Haurilet et\u00a0al. 2019. Analysis of deep fusion strategies for multi-modal gesture recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition Workshops. 
IEEE, 198\u2013206."},{"key":"e_1_3_1_101_2","first-page":"10684","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Rombach Robin","year":"2022","unstructured":"Robin Rombach, Andreas Blattmann, Dominik Lorenz et\u00a0al. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 10684\u201310695."},{"key":"e_1_3_1_102_2","article-title":"Video transformers: A survey","author":"Selva Javier","year":"2022","unstructured":"Javier Selva, Anders S. Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund, and Albert Clap\u00e9s. 2022. Video transformers: A survey. Retrieved from https:\/\/arXiv:2201.05991","journal-title":"Retrieved from https:\/\/arXiv:2201.05991"},{"key":"e_1_3_1_103_2","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Shahroudy Amir","year":"2016","unstructured":"Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. 2016. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_1_104_2","article-title":"Deep multimodal feature analysis for action recognition in RGB-D Videos","author":"Shahroudy Amir","year":"2016","unstructured":"Amir Shahroudy, Tian-Tsong Ng, Yihong Gong, and Gang Wang. 2016. Deep multimodal feature analysis for action recognition in RGB-D Videos. Retrieved from https:\/\/arXiv:1603.07120","journal-title":"Retrieved from https:\/\/arXiv:1603.07120"},{"issue":"12","key":"e_1_3_1_105_2","doi-asserted-by":"crossref","first-page":"4246","DOI":"10.3390\/s21124246","article-title":"RGB-D data-based action recognition: A review","volume":"21","author":"Shaikh Muhammad Bilal","year":"2021","unstructured":"Muhammad Bilal Shaikh and Douglas Chai. 2021. RGB-D data-based action recognition: A review. 
Sensors 21, 12 (2021), 4246.","journal-title":"Sensors"},{"key":"e_1_3_1_106_2","first-page":"1","volume-title":"Proceedings of the International Conference on Visual Communications and Image Processing (VCIP\u201922)","author":"Shaikh Muhammad Bilal","year":"2022","unstructured":"Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, and Naveed Akhtar. 2022. MAiVAR: Multimodal audio-image and video action recognizer. In Proceedings of the International Conference on Visual Communications and Image Processing (VCIP\u201922). IEEE, 1\u20135."},{"key":"e_1_3_1_107_2","first-page":"1","article-title":"Multimodal fusion for audio-image and video action recognition","author":"Shaikh Muhammad Bilal","year":"2024","unstructured":"Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, and Naveed Akhtar. 2024. Multimodal fusion for audio-image and video action recognition. Neural Comput. Appl. (2024), 1\u201314.","journal-title":"Neural Comput. Appl."},{"key":"e_1_3_1_108_2","first-page":"1","volume-title":"Proceedings of the 11th European Workshop on Visual Information Processing (EUVIP\u201923)","author":"Shaikh Muhammad Bilal","year":"2023","unstructured":"Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, and Naveed Akhtar. 2023. MAiVAR-T: Multimodal audio-image and video action recognizer using transformers. In Proceedings of the 11th European Workshop on Visual Information Processing (EUVIP\u201923). 1\u20136."},{"key":"e_1_3_1_109_2","doi-asserted-by":"crossref","first-page":"281","DOI":"10.1007\/s10044-019-00789-0","article-title":"Human action recognition: A framework of statistical weighted segmentation and rank correlation-based selection","volume":"23","author":"Sharif Muhammad","year":"2020","unstructured":"Muhammad Sharif, Muhammad Attique Khan, Farooq Zahid et\u00a0al. 2020. Human action recognition: A framework of statistical weighted segmentation and rank correlation-based selection. Pattern Anal. Appl. 
23 (2020), 281\u2013294.","journal-title":"Pattern Anal. Appl."},{"key":"e_1_3_1_110_2","first-page":"568","volume-title":"Proceedings of the International Conference on Neural Information Process. Systems (NIPS\u201914)","volume":"1","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Proceedings of the International Conference on Neural Information Process. Systems (NIPS\u201914), Vol. 1. MIT Press, 568\u2013576."},{"key":"e_1_3_1_111_2","first-page":"10389","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Singh Ankit","year":"2021","unstructured":"Ankit Singh, Omprakash Chakraborty, Ashutosh Varshney et\u00a0al. 2021. Semi-supervised action recognition with temporal contrastive learning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 10389\u201310399."},{"key":"e_1_3_1_112_2","first-page":"9787","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Song Xiaolin","year":"2021","unstructured":"Xiaolin Song, Sicheng Zhao, Jingyu Yang et\u00a0al. 2021. Spatio-temporal contrastive domain adaptation for action recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 9787\u20139795."},{"key":"e_1_3_1_113_2","article-title":"UCF101: A dataset of 101 human actions classes from videos in the wild","author":"Soomro Khurram","year":"2012","unstructured":"Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. 
Retrieved from https:\/\/arXiv:1212.0402","journal-title":"Retrieved from https:\/\/arXiv:1212.0402"},{"key":"e_1_3_1_114_2","first-page":"1533","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Su Rui","year":"2021","unstructured":"Rui Su, Qian Yu, and Dong Xu. 2021. STVGbert: A visual-linguistic transformer based framework for spatio-temporal video grounding. In Proceedings of the International Conference on Computer Vision. IEEE, 1533\u20131542."},{"key":"e_1_3_1_115_2","article-title":"Learning video representations using contrastive bidirectional transformer","author":"Sun Chen","year":"2019","unstructured":"Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019. Learning video representations using contrastive bidirectional transformer. Retrieved from https:\/\/arXiv:1906.05743","journal-title":"Retrieved from https:\/\/arXiv:1906.05743"},{"key":"e_1_3_1_116_2","first-page":"7464","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Sun Chen","year":"2019","unstructured":"Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. In Proceedings of the International Conference on Computer Vision. IEEE, 7464\u20137473."},{"key":"e_1_3_1_117_2","first-page":"8834","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Sun Chen","year":"2021","unstructured":"Chen Sun, Arsha Nagrani, Yonglong Tian, and Cordelia Schmid. 2021. Composable augmentation encoding for video representation learning. In Proceedings of the International Conference on Computer Vision. IEEE, 8834\u20138844."},{"key":"e_1_3_1_118_2","first-page":"1","volume-title":"Proceedings of the IEEE International Symposium on Medical Information and Communication Technology","author":"Sun Han","year":"2022","unstructured":"Han Sun and Yu Chen. 2022. 
Real-time elderly monitoring for senior safety by lightweight human action recognition. In Proceedings of the IEEE International Symposium on Medical Information and Communication Technology. IEEE, 1\u20136."},{"key":"e_1_3_1_119_2","first-page":"1","article-title":"Human action recognition from various data modalities: A review","author":"Sun Zehua","year":"2022","unstructured":"Zehua Sun, Qiuhong Ke, Hossein Rahmani et\u00a0al. 2022. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. (2022), 1\u201320.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"e_1_3_1_120_2","doi-asserted-by":"crossref","first-page":"101","DOI":"10.1049\/trit.2019.0002","article-title":"New shape descriptor in the context of edge continuity","volume":"4","author":"Susan S.","year":"2019","unstructured":"S. Susan, P. Agrawal, M. Mittal, and S. Bansal. 2019. New shape descriptor in the context of edge continuity. CAAI Trans. Intell. Technol. 4 (2019), 101\u2013109.","journal-title":"CAAI Trans. Intell. Technol."},{"key":"e_1_3_1_121_2","first-page":"247","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Tian Yapeng","year":"2018","unstructured":"Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. 2018. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision. Springer, 247\u2013263."},{"key":"e_1_3_1_122_2","doi-asserted-by":"crossref","first-page":"122","DOI":"10.1049\/trit.2019.0017","article-title":"Three-stage network for age estimation","volume":"4","author":"Tingting Y.","year":"2019","unstructured":"Y. Tingting, W. Junqian, W. Lintai, and X. Yong. 2019. Three-stage network for age estimation. CAAI Trans. Intell. Technol. 4 (2019), 122\u2013126.","journal-title":"CAAI Trans. Intell. 
Technol."},{"key":"e_1_3_1_123_2","first-page":"6450","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Tran Du","year":"2018","unstructured":"Du Tran, Heng Wang, Lorenzo Torresani et\u00a0al. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 6450\u20136459."},{"key":"e_1_3_1_124_2","doi-asserted-by":"crossref","first-page":"386","DOI":"10.1016\/j.future.2019.01.029","article-title":"Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments","volume":"96","author":"Ullah Amin","year":"2019","unstructured":"Amin Ullah, Khan Muhammad, Ijaz Ul Haq, and Sung Wook Baik. 2019. Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments. Future Gen. Comput. Syst. 96 (2019), 386\u2013397.","journal-title":"Future Gen. Comput. Syst."},{"key":"e_1_3_1_125_2","first-page":"13289","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Joze Hamid Reza Vaezi","year":"2020","unstructured":"Hamid Reza Vaezi Joze, Amirreza Shaban, Michael L. Iuzzolino, and Kazuhito Koishida. 2020. MMTM: Multimodal transfer module for CNN fusion. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, Virtual, 13289\u201313299."},{"key":"e_1_3_1_126_2","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar et\u00a0al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 
30."},{"key":"e_1_3_1_127_2","first-page":"1","volume-title":"Proceedings of the IEEE International Conference on Wearable and Implantable Body Sensor Networks (BSN \u201915)","author":"Vepakomma Praneeth","year":"2015","unstructured":"Praneeth Vepakomma, Debraj De, Sajal K. Das, and Shekhar Bhansali. 2015. A-Wristocracy: Deep learning on wrist-worn sensing for recognition of user complex activities. In Proceedings of the IEEE International Conference on Wearable and Implantable Body Sensor Networks (BSN \u201915). IEEE, 1\u20136."},{"key":"e_1_3_1_128_2","first-page":"1129","volume-title":"Proceedings of the IEEE International Conference on Data Mining","author":"Wang Jindong","year":"2017","unstructured":"Jindong Wang, Yiqiang Chen, Shuji Hao et\u00a0al. 2017. Balanced distribution adaptation for transfer learning. In Proceedings of the IEEE International Conference on Data Mining. IEEE, 1129\u20131134."},{"key":"e_1_3_1_129_2","first-page":"4006","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Wang Jiangliu","year":"2019","unstructured":"Jiangliu Wang, Jianbo Jiao, Linchao Bao et\u00a0al. 2019. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 4006\u20134015."},{"key":"e_1_3_1_130_2","first-page":"1290","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Wang Jiang","year":"2012","unstructured":"Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. 2012. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. 
IEEE, 1290\u20131297."},{"key":"e_1_3_1_131_2","first-page":"2649","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Wang Jiang","year":"2014","unstructured":"Jiang Wang, Xiaohan Nie, Yin Xia, Ying Wu, and Song-Chun Zhu. 2014. Cross-view action modeling, learning and recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 2649\u20132656."},{"key":"e_1_3_1_132_2","first-page":"2649","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Wang Jiang","year":"2014","unstructured":"Jiang Wang, Xiaohan Nie, Yin Xia, Ying Wu, and Song-Chun Zhu. 2014. Cross-view action modelling, learning and recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. Columbus, USA, 2649\u20132656."},{"issue":"10","key":"e_1_3_1_133_2","doi-asserted-by":"crossref","first-page":"3349","DOI":"10.1109\/TPAMI.2020.2983686","article-title":"Deep high-resolution representation learning for visual recognition","volume":"43","author":"Wang Jingdong","year":"2020","unstructured":"Jingdong Wang, Ke Sun, Tianheng Cheng et\u00a0al. 2020. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43, 10 (2020), 3349\u20133364.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"e_1_3_1_134_2","first-page":"20","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Wang Limin","year":"2016","unstructured":"Limin Wang et\u00a0al. 2016. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision. 
Springer, 20\u201336."},{"key":"e_1_3_1_135_2","doi-asserted-by":"crossref","first-page":"118","DOI":"10.1016\/j.cviu.2018.04.007","article-title":"RGB-D-based human motion recognition with deep learning: A survey","volume":"171","author":"Wang Pichao","year":"2018","unstructured":"Pichao Wang, Wanqing Li, Philip Ogunbona, Jun Wan, and Sergio Escalera. 2018. RGB-D-based human motion recognition with deep learning: A survey. Comput. Vision Image Understand. 171 (2018), 118\u2013139.","journal-title":"Comput. Vision Image Understand."},{"key":"e_1_3_1_136_2","first-page":"12695","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Wang Weiyao","year":"2020","unstructured":"Weiyao Wang, Du Tran, and Matt Feiszli. 2020. What makes training multi-modal classification networks hard?. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 12695\u201312705."},{"issue":"1","key":"e_1_3_1_137_2","first-page":"10","article-title":"Survey on deep multi-modal data analytics: Collaboration, rivalry, and fusion","volume":"17","author":"Wang Yang","year":"2021","unstructured":"Yang Wang. 2021. Survey on deep multi-modal data analytics: Collaboration, rivalry, and fusion. ACM Trans. Multimedia Comput. Commun. Appl. 17, 1s, Article 10 (2021), 25 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_138_2","first-page":"212","volume-title":"Proceedings of the International Conference on Information Technology: IoT and Smart City","author":"Wang Zhen","year":"2019","unstructured":"Zhen Wang, Shixian Luo, He Sun et\u00a0al. 2019. An efficient non-local attention network for video-based person re-identification. In Proceedings of the International Conference on Information Technology: IoT and Smart City. 
ACM, 212\u2013217."},{"key":"e_1_3_1_139_2","first-page":"1","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition Workshops","author":"Wang Zihao W.","year":"2019","unstructured":"Zihao W. Wang, Vibhav Vineet, Francesco Pittaluga et\u00a0al. 2019. Privacy-preserving action recognition using coded aperture videos. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition Workshops. 1\u201310."},{"key":"e_1_3_1_140_2","first-page":"10033","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Wu Kan","year":"2021","unstructured":"Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, and Hongyang Chao. 2021. Rethinking and improving relative position encoding for vision transformer. In Proceedings of the International Conference on Computer Vision. IEEE, 10033\u201310041."},{"key":"e_1_3_1_141_2","first-page":"20","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Xia L.","year":"2012","unstructured":"L. Xia, C. C. Chen, and J. K. Aggarwal. 2012. View invariant human action recognition using histograms of 3D joints. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 20\u201327."},{"key":"e_1_3_1_142_2","article-title":"Rethinking spatiotemporal feature learning for video understanding","author":"Xie Saining","year":"2017","unstructured":"Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. 2017. Rethinking spatiotemporal feature learning for video understanding. 
Retrieved from https:\/\/arxiv.org\/abs\/1712.04851","journal-title":"Retrieved from https:\/\/arxiv.org\/abs\/1712.04851"},{"issue":"1","key":"e_1_3_1_143_2","first-page":"23","article-title":"Socializing the videos: A multimodal approach for social relation recognition","volume":"17","author":"Xu Tong","year":"2021","unstructured":"Tong Xu, Peilun Zhou, Linkang Hu, Xiangnan He, Yao Hu, and Enhong Chen. 2021. Socializing the videos: A multimodal approach for social relation recognition. ACM Trans. Multimedia Comput. Commun. Appl. 17, 1, Article 23 (2021), 23 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_144_2","doi-asserted-by":"crossref","first-page":"4275","DOI":"10.1109\/JSEN.2015.2416651","article-title":"Evaluating and improving the depth accuracy of Kinect for Windows v2","volume":"15","author":"Yang Lin","year":"2015","unstructured":"Lin Yang, Longyu Zhang, Haiwei Dong, Abdulhameed Alelaiwi, and Abdulmotaleb El Saddik. 2015. Evaluating and improving the depth accuracy of Kinect for Windows v2. IEEE Sensors 15 (2015), 4275\u20134285.","journal-title":"IEEE Sensors"},{"issue":"3","key":"e_1_3_1_145_2","first-page":"78","article-title":"A multimodal framework for large-scale emotion recognition by fusing music and electrodermal activity signals","volume":"18","author":"Yin Guanghao","year":"2022","unstructured":"Guanghao Yin, Shouqian Sun, Dian Yu, Dejian Li, and Kejun Zhang. 2022. A multimodal framework for large-scale emotion recognition by fusing music and electrodermal activity signals. ACM Trans. Multimedia Comput. Commun. Appl. 18, 3, Article 78 (2022), 23 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_146_2","first-page":"8188","volume-title":"Proceedings of the International Conference on Computer Vision","author":"Yu Bingyao","year":"2021","unstructured":"Bingyao Yu, Wanhua Li, Xiu Li, Jiwen Lu, and Jie Zhou. 2021. Frequency-aware spatiotemporal transformers for video inpainting detection.
In Proceedings of the International Conference on Computer Vision. IEEE, 8188\u20138197."},{"key":"e_1_3_1_147_2","first-page":"28","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition Workshops","author":"Yun Kiwon","year":"2012","unstructured":"Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay et\u00a0al. 2012. Two-person interaction detection using body-pose features and multiple instance learning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 28\u201335."},{"key":"e_1_3_1_148_2","first-page":"450","volume-title":"Proceedings of the International Conference on Artificial Intelligence for Communications and Networks","author":"Zahin Abrar","year":"2019","unstructured":"Abrar Zahin, Rose Qingyang Hu et\u00a0al. 2019. Sensor-based human activity recognition for smart healthcare: A semi-supervised machine learning. In Proceedings of the International Conference on Artificial Intelligence for Communications and Networks. Springer, 450\u2013472."},{"key":"e_1_3_1_149_2","first-page":"12310","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Zbontar Jure","year":"2021","unstructured":"Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and St\u00e9phane Deny. 2021. Barlow twins: Self-supervised learning via redundancy reduction. In Proceedings of the International Conference on Machine Learning. PMLR, 12310\u201312320."},{"key":"e_1_3_1_150_2","first-page":"11384","article-title":"Shifted chunk transformer for spatio-temporal representational learning","volume":"34","author":"Zha Xuefan","year":"2021","unstructured":"Xuefan Zha, Wentao Zhu, Lv Xun, Sen Yang, and Ji Liu. 2021. Shifted chunk transformer for spatio-temporal representational learning. Adv. Neural Info. Process. Syst. 34 (2021), 11384\u201311396.","journal-title":"Adv. Neural Info. Process. 
Syst."},{"key":"e_1_3_1_151_2","first-page":"7277","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Zhang Chongzhi","year":"2022","unstructured":"Chongzhi Zhang, Mingyuan Zhang, Zhang et\u00a0al. 2022. Delving deep into the generalization of vision transformers under distribution shifts. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 7277\u20137286."},{"issue":"5","key":"e_1_3_1_152_2","doi-asserted-by":"crossref","first-page":"1005","DOI":"10.3390\/s19051005","article-title":"A comprehensive survey of vision-based human action recognition methods","volume":"19","author":"Zhang Hong Bo","year":"2019","unstructured":"Hong Bo Zhang, Yi Xiang Zhang, Bineng Zhong et\u00a0al. 2019. A comprehensive survey of vision-based human action recognition methods. Sensors 19, 5 (2019), 1005.","journal-title":"Sensors"},{"key":"e_1_3_1_153_2","doi-asserted-by":"crossref","first-page":"86","DOI":"10.1016\/j.patcog.2016.05.019","article-title":"RGB-D-based action recognition datasets: A survey","volume":"60","author":"Zhang Jing","year":"2016","unstructured":"Jing Zhang, Wanqing Li, Philip O. Ogunbona et\u00a0al. 2016. RGB-D-based action recognition datasets: A survey. Pattern Recogn. 60 (2016), 86\u2013105.","journal-title":"Pattern Recogn."},{"key":"e_1_3_1_154_2","first-page":"12669","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Zhang Mingxing","year":"2021","unstructured":"Mingxing Zhang, Yang Yang, Xinghan Chen et\u00a0al. 2021. Multi-stage aggregated transformer network for temporal language localization in videos. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. 
IEEE, 12669\u201312678."},{"issue":"1","key":"e_1_3_1_155_2","first-page":"2","article-title":"Deep learning\u2013based multimedia analytics: A review","volume":"15","author":"Zhang Wei","year":"2019","unstructured":"Wei Zhang, Ting Yao, Shiai Zhu, and Abdulmotaleb El Saddik. 2019. Deep learning\u2013based multimedia analytics: A review. ACM Trans. Multimedia Comput. Commun. Appl. 15, 1s, Article 2 (2019), 26 pages.","journal-title":"ACM Trans. Multimedia Comput. Commun. Appl."},{"key":"e_1_3_1_156_2","first-page":"3780","volume-title":"Proceedings of the Chinese Automation Congress (CAC\u201917)","author":"Zhang Z.","year":"2017","unstructured":"Z. Zhang, X. Ma, R. Song, X. Rong, X. Tian, G. Tian, and Y. Li. 2017. Deep learning-based human action recognition: A survey. In Proceedings of the Chinese Automation Congress (CAC\u201917). IEEE, 3780\u20133785."},{"key":"e_1_3_1_157_2","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1016\/j.imavis.2016.06.007","article-title":"From handcrafted to learned representations for human action recognition: A survey","volume":"55","author":"Zhu Fan","year":"2016","unstructured":"Fan Zhu, Ling Shao, Jin Xie, and Yi Fang. 2016. From handcrafted to learned representations for human action recognition: A survey. Image Vision Comput. 55 (2016), 42\u201352.","journal-title":"Image Vision Comput."},{"key":"e_1_3_1_158_2","first-page":"19","volume-title":"Proceedings of the International Conference on Pattern Recognition","author":"Zhu G.","year":"2016","unstructured":"G. Zhu, L. Zhang, L. Mei, Jie Shao, Juan Song, and Peiyi Shen. 2016. Large-scale isolated gesture recognition using pyramidal 3D convolutional networks. In Proceedings of the International Conference on Pattern Recognition. IEEE, 19\u201324."},{"key":"e_1_3_1_159_2","article-title":"Action machine: Rethinking action recognition in trimmed videos","author":"Zhu Jiagang","year":"2018","unstructured":"Jiagang Zhu, Wei Zou, Liang Xu et\u00a0al. 2018. 
Action machine: Rethinking action recognition in trimmed videos. Retrieved from https:\/\/arxiv.org\/abs\/1812.05770","journal-title":"Retrieved from https:\/\/arxiv.org\/abs\/1812.05770"},{"key":"e_1_3_1_160_2","first-page":"8746","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Zhu Linchao","year":"2020","unstructured":"Linchao Zhu and Yi Yang. 2020. ActBERT: Learning global-local video-text representations. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 8746\u20138755."},{"key":"e_1_3_1_161_2","first-page":"1878","volume-title":"Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition","author":"Zhuang Bohan","year":"2017","unstructured":"Bohan Zhuang, Lingqiao Liu, Yao Li, Chunhua Shen, and Ian Reid. 2017. Attend in groups: A weakly supervised deep learning framework for learning from web data. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE, 1878\u20131887."},{"key":"e_1_3_1_162_2","doi-asserted-by":"crossref","first-page":"103451","DOI":"10.1016\/j.robot.2020.103451","article-title":"I-Support: A robotic platform of an assistive bathing robot for the elderly population","volume":"126","author":"Zlatintsi Athanasia","year":"2020","unstructured":"Athanasia Zlatintsi, A. C. Dometios, Nikolaos Kardaris et\u00a0al. 2020. I-Support: A robotic platform of an assistive bathing robot for the elderly population. Robot. Auton. Syst. 126 (2020), 103451.","journal-title":"Robot. Auton.
Syst."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3664815","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3664815","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:17:29Z","timestamp":1750295849000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3664815"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,9]]},"references-count":161,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2024,8,31]]}},"alternative-id":["10.1145\/3664815"],"URL":"https:\/\/doi.org\/10.1145\/3664815","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,9]]},"assertion":[{"value":"2023-04-17","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-16","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-07-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}