{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,29]],"date-time":"2025-08-29T10:31:01Z","timestamp":1756463461774,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":36,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3551575","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:12Z","timestamp":1665416592000},"page":"7040-7044","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Two stage Multi-Modal Modeling for Video Interaction Analysis in Deep Video Understanding Challenge"],"prefix":"10.1145","author":[{"given":"Siyang","family":"Sun","sequence":"first","affiliation":[{"name":"Alibaba Group, Beijing, China"}]},{"given":"Xiong","family":"Xiong","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}]},{"given":"Yun","family":"Zheng","sequence":"additional","affiliation":[{"name":"Alibaba Group, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3479220"},{"key":"e_1_3_2_2_2_1","volume-title":"Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755","author":"Ba Jimmy","year":"2014","unstructured":"Jimmy Ba , Volodymyr Mnih , and Koray Kavukcuoglu . 2014. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755 ( 2014 ). Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. 2014. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755 (2014)."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-97909-0_46"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00967"},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3372278.3390742"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2019.00228"},{"key":"e_1_3_2_2_9_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_10_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly etal 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).  Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_2_11_1","volume-title":"PYSKL: Towards Good Practices for Skeleton Action Recognition. arXiv preprint arXiv:2205.09443","author":"Duan Haodong","year":"2022","unstructured":"Haodong Duan , Jiaqi Wang , Kai Chen , and Dahua Lin . 2022 . PYSKL: Towards Good Practices for Skeleton Action Recognition. arXiv preprint arXiv:2205.09443 (2022). Haodong Duan, Jiaqi Wang, Kai Chen, and Dahua Lin. 2022. PYSKL: Towards Good Practices for Skeleton Action Recognition. arXiv preprint arXiv:2205.09443 (2022)."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00298"},{"key":"e_1_3_2_2_13_1","volume-title":"PFLD: A practical facial landmark detector. arXiv preprint arXiv:1902.10859","author":"Guo Xiaojie","year":"2019","unstructured":"Xiaojie Guo , Siyuan Li , Jinke Yu , Jiawan Zhang , Jiayi Ma , Lin Ma , Wei Liu , and Haibin Ling . 2019 . PFLD: A practical facial landmark detector. arXiv preprint arXiv:1902.10859 (2019). Xiaojie Guo, Siyuan Li, Jinke Yu, Jiawan Zhang, Jiayi Ma, Lin Ma, Wei Liu, and Haibin Ling. 2019. PFLD: A practical facial landmark detector. arXiv preprint arXiv:1902.10859 (2019)."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_15_1","unstructured":"Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev etal 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).  Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)."},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.bspc.2021.102893"},{"key":"e_1_3_2_2_17_1","volume-title":"Hero: Hierarchical encoder for video language omni-representation pre-training. arXiv preprint arXiv:2005.00200","author":"Li Linjie","year":"2020","unstructured":"Linjie Li , Yen-Chun Chen , Yu Cheng , Zhe Gan , Licheng Yu , and Jingjing Liu . 2020 . Hero: Hierarchical encoder for video language omni-representation pre-training. arXiv preprint arXiv:2005.00200 (2020). Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. Hero: Hierarchical encoder for video language omni-representation pre-training. arXiv preprint arXiv:2005.00200 (2020)."},{"key":"e_1_3_2_2_18_1","volume-title":"In International Symposium on Music Information Retrieval. Citeseer.","author":"Logan Beth","year":"2000","unstructured":"Beth Logan . 2000 . Mel frequency cepstral coefficients for music modeling . In In International Symposium on Music Information Retrieval. Citeseer. Beth Logan. 2000. Mel frequency cepstral coefficients for music modeling. In In International Symposium on Music Information Retrieval. Citeseer."},{"key":"e_1_3_2_2_19_1","volume-title":"Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353","author":"Luo Huaishao","year":"2020","unstructured":"Huaishao Luo , Lei Ji , Botian Shi , Haoyang Huang , Nan Duan , Tianrui Li , Jason Li , Taroon Bharti , and Ming Zhou . 2020 . Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353 (2020). Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. 2020. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353 (2020)."},{"key":"e_1_3_2_2_20_1","volume-title":"Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860","author":"Luo Huaishao","year":"2021","unstructured":"Huaishao Luo , Lei Ji , Ming Zhong , Yang Chen , Wen Lei , Nan Duan , and Tianrui Li. 2021. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860 ( 2021 ). Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2021. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860 (2021)."},{"key":"e_1_3_2_2_21_1","volume-title":"M-SENA: An Integrated Platform for Multimodal Sentiment Analysis. arXiv preprint arXiv:2203.12441","author":"Mao Huisheng","year":"2022","unstructured":"Huisheng Mao , Ziqi Yuan , Hua Xu , Wenmeng Yu , Yihe Liu , and Kai Gao . 2022. M-SENA: An Integrated Platform for Multimodal Sentiment Analysis. arXiv preprint arXiv:2203.12441 ( 2022 ). Huisheng Mao, Ziqi Yuan, Hua Xu, Wenmeng Yu, Yihe Liu, and Kai Gao. 2022. M-SENA: An Integrated Platform for Multimodal Sentiment Analysis. arXiv preprint arXiv:2203.12441 (2022)."},{"key":"e_1_3_2_2_22_1","volume-title":"International Conference on Machine Learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong Wook Kim , Chris Hallacy , Aditya Ramesh , Gabriel Goh , Sandhini Agarwal , Girish Sastry , Amanda Askell , Pamela Mishkin , Jack Clark , 2021 . Learning transferable visual models from natural language supervision . In International Conference on Machine Learning. PMLR, 8748--8763 . Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748--8763."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01016"},{"key":"e_1_3_2_2_24_1","volume-title":"Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross Girshick , and Jian Sun . 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 ( 2015 ). Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)."},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/SISY52375.2021.9582508"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123380"},{"key":"e_1_3_2_2_27_1","volume-title":"Conference on Computer Vision and Pattern Recognition Workshop","volume":"2","author":"Tang Jiasheng","year":"2020","unstructured":"Jiasheng Tang , Xiong Xiong , Chenwei Xie , Yanhao Zhang , Pichao Wang , Fan Wang , Fei Du , Liang Han , Yun Zheng , Pan Pan , 2020 . Min-cost network flow and trajectory fix for multiple objects tracking . In Conference on Computer Vision and Pattern Recognition Workshop , Vol. 2 . 3. Jiasheng Tang, Xiong Xiong, Chenwei Xie, Yanhao Zhang, Pichao Wang, Fan Wang, Fei Du, Liang Han, Yun Zheng, Pan Pan, et al. 2020. Min-cost network flow and trajectory fix for multiple objects tracking. In Conference on Computer Vision and Pattern Recognition Workshop, Vol. 2. 3."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3479207"},{"key":"e_1_3_2_2_29_1","volume-title":"Speechpy-a library for speech processing and recognition. arXiv preprint arXiv:1803.01094","author":"Torfi Amirsina","year":"2018","unstructured":"Amirsina Torfi . 2018. Speechpy-a library for speech processing and recognition. arXiv preprint arXiv:1803.01094 ( 2018 ). Amirsina Torfi. 2018. Speechpy-a library for speech processing and recognition. arXiv preprint arXiv:1803.01094 (2018)."},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_2_31_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , Lukasz Kaiser , and Illia Polosukhin . 2017. Attention is all you need. Advances in neural information processing systems 30 ( 2017 ). Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3478324"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"crossref","unstructured":"Jingdong Wang Ke Sun Tianheng Cheng Borui Jiang Chaorui Deng Yang Zhao Dong Liu Yadong Mu Mingkui Tan Xinggang Wang etal 2020. Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence 43 10 (2020) 3349--3364.  Jingdong Wang Ke Sun Tianheng Cheng Borui Jiang Chaorui Deng Yang Zhao Dong Liu Yadong Mu Mingkui Tan Xinggang Wang et al. 2020. Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence 43 10 (2020) 3349--3364.","DOI":"10.1109\/TPAMI.2020.2983686"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12328"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01589"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3479214"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Lisboa Portugal","acronym":"MM '22"},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3551575","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3551575","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:18Z","timestamp":1750182558000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3551575"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":36,"alternative-id":["10.1145\/3503161.3551575","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3551575","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}