{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T17:38:07Z","timestamp":1777657087599,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":56,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61836011,62021001"],"award-info":[{"award-number":["61836011,62021001"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,17]]},"DOI":"10.1145\/3474085.3475431","type":"proceedings-article","created":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T04:59:18Z","timestamp":1634533158000},"page":"2567-2576","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":29,"title":["Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training"],"prefix":"10.1145","author":[{"given":"Chenyi","family":"Lei","sequence":"first","affiliation":[{"name":"University of Science and Technology of China & Alibaba Group, Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shixian","family":"Luo","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yong","family":"Liu","sequence":"additional","affiliation":[{"name":"Nanyang Technological University, Singapore, 
Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wanggui","family":"He","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jiamang","family":"Wang","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guoxin","family":"Wang","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Haihong","family":"Tang","sequence":"additional","affiliation":[{"name":"Alibaba Group, Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chunyan","family":"Miao","sequence":"additional","affiliation":[{"name":"Nanyang Technological University, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Houqiang","family":"Li","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,10,17]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Hugo Larochelle Aaron Courville Atousa Torabi Christopher Pal. 2015. Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research. arxiv: 1503.01070  Hugo Larochelle Aaron Courville Atousa Torabi Christopher Pal. 2015. Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research. arxiv: 1503.01070"},{"key":"e_1_3_2_1_2_1","volume-title":"2020 a. A Simple Framework for Contrastive Learning of Visual Representations","author":"Chen Ting","unstructured":"Ting Chen , Simon Kornblith , Mohammad Norouzi , and Geoffrey Hinton . 2020 a. A Simple Framework for Contrastive Learning of Visual Representations . In ICML. IEEE. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 
2020 a. A Simple Framework for Contrastive Learning of Visual Representations. In ICML. IEEE."},{"key":"e_1_3_2_1_3_1","volume-title":"Transactions on Multimedia","author":"Chen Xusong","unstructured":"Xusong Chen , Chenyi Lei , Dong Liu , Guoxin Wang , Haihong Tang , Zheng-Jun Zha , and Houqiang Li. 2021. E-Commerce Storytelling Recommendation Using Attentional Domain-Transfer Network and Adversarial Pre-Training . In Transactions on Multimedia . IEEE. Xusong Chen, Chenyi Lei, Dong Liu, Guoxin Wang, Haihong Tang, Zheng-Jun Zha, and Houqiang Li. 2021. E-Commerce Storytelling Recommendation Using Attentional Domain-Transfer Network and Adversarial Pre-Training. In Transactions on Multimedia. IEEE."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3356051"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"crossref","unstructured":"Yen-Chun Chen Linjie Li Licheng Yu Ahmed El Kholy Faisal Ahmed Zhe Gan Yu Cheng and Jingjing Liu. 2020 b. UNITER: UNiversal Image-TExt Representation Learning. In ECCV.  Yen-Chun Chen Linjie Li Licheng Yu Ahmed El Kholy Faisal Ahmed Zhe Gan Yu Cheng and Jingjing Liu. 2020 b. UNITER: UNiversal Image-TExt Representation Learning. In ECCV.","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.340"},{"key":"e_1_3_2_1_7_1","volume-title":"ImageNet: A large-scale hierarchical image database","author":"Deng Jia","unstructured":"Jia Deng , Wei Dong , Richard Socher , Li-Jia Li , Kai Li , and Li Fei-Fei . 2009. ImageNet: A large-scale hierarchical image database . In CVPR. IEEE. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR. IEEE."},{"key":"e_1_3_2_1_8_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 
arxiv","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv : 1810.04805 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv: 1810.04805"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Valentin Gabeur Chen Sun Karteek Alahari and Cordelia Schmid. 2020. Multi-modal Transformer for Video Retrieval. In ECCV.  Valentin Gabeur Chen Sun Karteek Alahari and Cordelia Schmid. 2020. Multi-modal Transformer for Video Retrieval. In ECCV.","DOI":"10.1007\/978-3-030-58548-8_13"},{"key":"e_1_3_2_1_10_1","volume-title":"COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. In NIPS.","author":"Ging Simon","year":"2020","unstructured":"Simon Ging , Mohammadreza Zolfaghari , Hamed Pirsiavash , and Thomas Brox . 2020 . COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. In NIPS. Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, and Thomas Brox. 2020. COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. In NIPS."},{"key":"e_1_3_2_1_11_1","volume-title":"XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. In ICML.","author":"Hu Junjie","year":"2020","unstructured":"Junjie Hu , Sebastian Ruder , Aditya Siddhant , Graham Neubig , Orhan Firat , and Melvin Johnson . 2020 . XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. In ICML. Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. 
In ICML."},{"key":"e_1_3_2_1_12_1","volume-title":"Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. arxiv","author":"Huang Zhicheng","year":"2004","unstructured":"Zhicheng Huang , Zhaoyang Zeng , Bei Liu , Dongmei Fu , and Jianlong Fu. 2020. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. arxiv : 2004 .00849 Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. arxiv: 2004.00849"},{"key":"e_1_3_2_1_13_1","unstructured":"Yuqi Huo Manli Zhang Guangzhen Liu Haoyu Lu Yizhao Gao Guoxing Yang Jingyuan Wen Heng Zhang Baogui Xu Weihao Zheng Zongzheng Xi and etal 2021. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training. arxiv: 2103.06561  Yuqi Huo Manli Zhang Guangzhen Liu Haoyu Lu Yizhao Gao Guoxing Yang Jingyuan Wen Heng Zhang Baogui Xu Weihao Zheng Zongzheng Xi and et al. 2021. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training. arxiv: 2103.06561"},{"key":"e_1_3_2_1_14_1","volume-title":"Momentum Contrast for Unsupervised Visual Representation Learning","author":"Saining Xie Ross Girshick Yuxin Wu","unstructured":"Yuxin Wu Saining Xie Ross Girshick Kaiming He , Haoqi Fan . 2020. Momentum Contrast for Unsupervised Visual Representation Learning . In CVPR. IEEE. Yuxin Wu Saining Xie Ross Girshick Kaiming He, Haoqi Fan. 2020. Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR. IEEE."},{"key":"e_1_3_2_1_15_1","unstructured":"Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola and etal 2017. The Kinetics Human Action Video Dataset. arxiv: 1705.06950  Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola and et al. 2017. The Kinetics Human Action Video Dataset. 
arxiv: 1705.06950"},{"key":"e_1_3_2_1_16_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_1_17_1","volume-title":"Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arxiv: 1602.07332","author":"Krishna Ranjay","year":"2016","unstructured":"Ranjay Krishna , Yuke Zhu , Oliver Groth , Justin Johnson , Kenji Hata , Joshua Kravitz , Stephanie Chen , Yannis Kalantidis , Li-Jia Li , David A. Shamma , Michael S. Bernstein , and Fei-Fei Li . 2016 . Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arxiv: 1602.07332 Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arxiv: 1602.07332"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467189"},{"key":"e_1_3_2_1_19_1","unstructured":"Chenyi Lei Lei Wu Dong Liu Zhao Li Guoxin Wang Haihong Tang and Houqiang Li. 2020. Multi-Question Learning for Visual Question Answering. In AAAI.  Chenyi Lei Lei Wu Dong Liu Zhao Li Guoxin Wang Haihong Tang and Houqiang Li. 2020. Multi-Question Learning for Visual Question Answering. In AAAI."},{"key":"e_1_3_2_1_20_1","volume-title":"2021 a. Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling","author":"Lei Jie","unstructured":"Jie Lei , Linjie Li , Luowei Zhou , Zhe Gan , Tamara L. Berg , Mohit Bansal , and Jingjing Liu . 2021 a. 
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling . In CVPR. IEEE. Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, and Jingjing Liu. 2021 a. Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. In CVPR. IEEE."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Gen Li Nan Duan Yuejian Fang Ming Gong Daxin Jiang and Ming Zhou. 2020 b. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. In AAAI.  Gen Li Nan Duan Yuejian Fang Ming Gong Daxin Jiang and Ming Zhou. 2020 b. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. In AAAI.","DOI":"10.1609\/aaai.v34i07.6795"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"crossref","unstructured":"Linjie Li Yen-Chun Chen Yu Cheng Zhe Gan Licheng Yu and Jingjing Liu. 2020 a. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. In EMNLP.  Linjie Li Yen-Chun Chen Yu Cheng Zhe Gan Licheng Yu and Jingjing Liu. 2020 a. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. In EMNLP.","DOI":"10.18653\/v1\/2020.emnlp-main.161"},{"key":"e_1_3_2_1_23_1","unstructured":"Xiujun Li Xi Yin Chunyuan Li Pengchuan Zhang Xiaowei Hu Lei Zhang Lijuan Wang Houdong Hu Li Dong Furu Wei Yejin Choi and Jianfeng Gao. 2020 c. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In ECCV.  Xiujun Li Xi Yin Chunyuan Li Pengchuan Zhang Xiaowei Hu Lei Zhang Lijuan Wang Houdong Hu Li Dong Furu Wei Yejin Choi and Jianfeng Gao. 2020 c. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In ECCV."},{"key":"e_1_3_2_1_24_1","unstructured":"Yehao Li Yingwei Pan Ting Yao Jingwen Chen and Tao Mei. 2021. Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network. In AAAI.  Yehao Li Yingwei Pan Ting Yao Jingwen Chen and Tao Mei. 2021. 
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network. In AAAI."},{"key":"e_1_3_2_1_25_1","volume-title":"TGIF: A New Dataset and Benchmark on Animated GIF Description. arxiv: 1604.02748","author":"Li Yuncheng","year":"2016","unstructured":"Yuncheng Li , Yale Song , Liangliang Cao , Joel Tetreault , Larry Goldberg , Alejandro Jaimes , and Jiebo Luo . 2016 . TGIF: A New Dataset and Benchmark on Animated GIF Description. arxiv: 1604.02748 Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. 2016. TGIF: A New Dataset and Benchmark on Animated GIF Description. arxiv: 1604.02748"},{"key":"e_1_3_2_1_26_1","unstructured":"Junyang Lin Rui Men An Yang Chang Zhou Ming Ding Yichang Zhang Peng Wang and etal 2021. M6: A Chinese Multimodal Pretrainer. arxiv: 2103.00823  Junyang Lin Rui Men An Yang Chang Zhou Ming Ding Yichang Zhang Peng Wang and et al. 2021. M6: A Chinese Multimodal Pretrainer. arxiv: 2103.00823"},{"key":"e_1_3_2_1_27_1","unstructured":"Tsung-Yi Lin Michael Maire Serge Belongie Lubomir Bourdev Ross Girshick James Hays Pietro Perona Deva Ramanan C. Lawrence Zitnick and Piotr Doll\u00e1r. 2014. Microsoft COCO: Common Objects in Context. arxiv: 1405.0312  Tsung-Yi Lin Michael Maire Serge Belongie Lubomir Bourdev Ross Girshick James Hays Pietro Perona Deva Ramanan C. Lawrence Zitnick and Piotr Doll\u00e1r. 2014. Microsoft COCO: Common Objects in Context. arxiv: 1405.0312"},{"key":"e_1_3_2_1_28_1","volume-title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach. arxiv","author":"Liu Yinhan","year":"1907","unstructured":"Yinhan Liu , Myle Ott , Naman Goyal , Jingfei Du , Mandar Joshi , Danqi Chen , Omer Levy , Mike Lewis , Luke Zettlemoyer , and Veselin Stoyanov . 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 
arxiv : 1907 .11692 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arxiv: 1907.11692"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3454289"},{"key":"e_1_3_2_1_30_1","volume-title":"UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arxiv","author":"Luo Huaishao","year":"1906","unstructured":"Huaishao Luo , Lei Ji , Botian Shi , Haoyang Huang , Nan Duan , Tianrui Li , Jason Li , Taroon Bharti , and Ming Zhou . 2020. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arxiv : 1906 .05743 Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. 2020. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arxiv: 1906.05743"},{"key":"e_1_3_2_1_31_1","volume-title":"HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips","author":"Miech Antoine","unstructured":"Antoine Miech , Dimitri Zhukov , Jean-Baptiste Alayrac , Makarand Tapaswi , Ivan Laptev , and Josef Sivic . 2019. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips . In ICCV. IEEE. Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV. IEEE."},{"key":"e_1_3_2_1_32_1","volume-title":"Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training. arxiv","author":"Pan Yingwei","year":"2007","unstructured":"Yingwei Pan , Yehao Li , Jianjie Luo , Jun Xu , Ting Yao , and Tao Mei . 2020. Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training. 
arxiv : 2007 .02375 Yingwei Pan, Yehao Li, Jianjie Luo, Jun Xu, Ting Yao, and Tao Mei. 2020. Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training. arxiv: 2007.02375"},{"key":"e_1_3_2_1_33_1","unstructured":"Mandela Patrick Po-Yao Huang Yuki Asano Florian Metze Alexander Hauptmann Jo\u00e3o Henriques and Andrea Vedaldi. 2021. Support-Set Bottlenecks for Video-Text Representation Learning. In ICLR.  Mandela Patrick Po-Yao Huang Yuki Asano Florian Metze Alexander Hauptmann Jo\u00e3o Henriques and Andrea Vedaldi. 2021. Support-Set Bottlenecks for Video-Text Representation Learning. In ICLR."},{"key":"e_1_3_2_1_34_1","volume-title":"Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, and et al.","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong Wook Kim , Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, and et al. 2021 . Learning Transferable Visual Models From Natural Language Supervision . arxiv: 2103.00020 Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, and et al. 2021. Learning Transferable Visual Models From Natural Language Supervision. arxiv: 2103.00020"},{"key":"e_1_3_2_1_35_1","volume":"201","author":"Raffel Colin","unstructured":"Colin Raffel , Noam Shazeer , Adam Roberts , Katherine Lee , Sharan Narang , Michael Matena , Yanqi Zhou , Wei Li , and Peter J. Liu. 201 9. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arxiv: 1910.10683 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arxiv: 1910.10683","journal-title":"Peter J. Liu."},{"key":"e_1_3_2_1_36_1","unstructured":"Aditya Ramesh Mikhail Pavlov Gabriel Goh Scott Gray Chelsea Voss Alec Radford Mark Chen and Ilya Sutskever. 2021. 
Zero-Shot Text-to-Image Generation. arxiv: 2102.12092  Aditya Ramesh Mikhail Pavlov Gabriel Goh Scott Gray Chelsea Voss Alec Radford Mark Chen and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. arxiv: 2102.12092"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Anna Rohrbach Marcus Rohrbach Niket Tandon and Bernt Schiele. 2015. A Dataset for Movie Description. arxiv: 1501.02530  Anna Rohrbach Marcus Rohrbach Niket Tandon and Bernt Schiele. 2015. A Dataset for Movie Description. arxiv: 1501.02530","DOI":"10.1109\/CVPR.2015.7298940"},{"key":"e_1_3_2_1_38_1","volume-title":"Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta.","author":"Sigurdsson Gunnar A.","year":"2016","unstructured":"Gunnar A. Sigurdsson , Gu l Varol , Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016 . Hollywood in Homes Crowdsourcing Data Collection for Activity Understanding. In ECCV. Gunnar A. Sigurdsson, Gu l Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in Homes Crowdsourcing Data Collection for Activity Understanding. In ECCV."},{"key":"e_1_3_2_1_39_1","unstructured":"Weijie Su Xizhou Zhu Yue Cao Bin Li Lewei Lu Furu Wei and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR.  Weijie Su Xizhou Zhu Yue Cao Bin Li Lewei Lu Furu Wei and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR."},{"key":"e_1_3_2_1_40_1","volume-title":"2019 a. Learning Video Representations using Contrastive Bidirectional Transformer. arxiv","author":"Sun Chen","year":"2002","unstructured":"Chen Sun , Fabien Baradel , Kevin Murphy , and Cordelia Schmid . 2019 a. Learning Video Representations using Contrastive Bidirectional Transformer. arxiv : 2002 .06353 Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019 a. Learning Video Representations using Contrastive Bidirectional Transformer. 
arxiv: 2002.06353"},{"key":"e_1_3_2_1_41_1","volume-title":"Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019 b. Videobert: A joint model for video and language representation learning","author":"Sun Chen","unstructured":"Chen Sun , Austin Myers , Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019 b. Videobert: A joint model for video and language representation learning . In ICCV. IEEE. Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019 b. Videobert: A joint model for video and language representation learning. In ICCV. IEEE."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"crossref","unstructured":"Siqi Sun Yen-Chun Chen Linjie Li Shuohang Wang Yuwei Fang and Jingjing Liu. 2021. LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval. In NAACL.  Siqi Sun Yen-Chun Chen Linjie Li Shuohang Wang Yuwei Fang and Jingjing Liu. 2021. LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval. In NAACL.","DOI":"10.18653\/v1\/2021.naacl-main.77"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"crossref","unstructured":"Yu Sun Shuohuan Wang Yukun Li Shikun Feng Hao Tian Hua Wu and Haifeng Wang. 2020. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding. In AAAI.  Yu Sun Shuohuan Wang Yukun Li Shikun Feng Hao Tian Hua Wu and Haifeng Wang. 2020. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding. In AAAI.","DOI":"10.1609\/aaai.v34i05.6428"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"crossref","unstructured":"Christian Szegedy Sergey Ioffe Vincent Vanhoucke and Alex Alemi. 2016. Inception-v4 Inception-ResNet and the Impact of Residual Connections on Learning. arxiv: 1602.07261  Christian Szegedy Sergey Ioffe Vincent Vanhoucke and Alex Alemi. 2016. Inception-v4 Inception-ResNet and the Impact of Residual Connections on Learning. 
arxiv: 1602.07261","DOI":"10.1609\/aaai.v31i1.11231"},{"key":"e_1_3_2_1_45_1","volume-title":"LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP.","author":"Tan Hao","year":"2019","unstructured":"Hao Tan and Mohit Bansal . 2019 . LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP. Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP."},{"key":"e_1_3_2_1_46_1","volume-title":"Representation Learning with Contrastive Predictive Coding. arxiv","author":"van den Oord Aaron","year":"1807","unstructured":"Aaron van den Oord , Yazhe Li , and Oriol Vinyals . 2018. Representation Learning with Contrastive Predictive Coding. arxiv : 1807 .03748 Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. arxiv: 1807.03748"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Xin Wang Jiawei Wu Junkun Chen Lei Li Yuan-Fang Wang and William Yang Wang. 2019. VaTeX: A Large-Scale High-Quality Multilingual Dataset for Video-and-Language Research. In ICCV.  Xin Wang Jiawei Wu Junkun Chen Lei Li Yuan-Fang Wang and William Yang Wang. 2019. VaTeX: A Large-Scale High-Quality Multilingual Dataset for Video-and-Language Research. In ICCV.","DOI":"10.1109\/ICCV.2019.00468"},{"key":"e_1_3_2_1_49_1","unstructured":"Yonghui Wu Mike Schuster Zhifeng Chen Quoc V. Le Mohammad Norouzi Wolfgang Macherey Maxim Krikun and etal 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arxiv: 1609.08144  Yonghui Wu Mike Schuster Zhifeng Chen Quoc V. Le Mohammad Norouzi Wolfgang Macherey Maxim Krikun and et al. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. 
arxiv: 1609.08144"},{"key":"e_1_3_2_1_50_1","volume-title":"MSR-VTT: A Large Video Description Dataset for Bridging Video and Language","author":"Xu Jun","unstructured":"Jun Xu , Tao Mei , Ting Yao , and Yong Rui . 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language . In CVPR. IEEE. Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In CVPR. IEEE."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3454804"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"crossref","unstructured":"Guorui Zhou Na Mou Ying Fan Qi Pi Weijie Bian Chang Zhou Xiaoqiang Zhu and Kun Gai. 2018a. Deep Interest Evolution Network for Click-Through Rate Prediction. In KDD. ACM.  Guorui Zhou Na Mou Ying Fan Qi Pi Weijie Bian Chang Zhou Xiaoqiang Zhu and Kun Gai. 2018a. Deep Interest Evolution Network for Click-Through Rate Prediction. In KDD. ACM.","DOI":"10.1145\/3219819.3219823"},{"key":"e_1_3_2_1_53_1","unstructured":"Luowei Zhou Jingjing Liu Yu Cheng Zhe Gan and Lei Zhang. 2021 a. CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning. arxiv: 2104.00285  Luowei Zhou Jingjing Liu Yu Cheng Zhe Gan and Lei Zhang. 2021 a. CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning. arxiv: 2104.00285"},{"key":"e_1_3_2_1_54_1","volume":"2018","author":"Zhou Luowei","unstructured":"Luowei Zhou , Chenliang Xu , and Jason J Corso. 2018 b. Towards Automatic Learning of Procedures from Web Instructional Videos. In AAAI. Luowei Zhou, Chenliang Xu, and Jason J Corso. 2018b. Towards Automatic Learning of Procedures from Web Instructional Videos. In AAAI.","journal-title":"Jason J Corso."},{"key":"e_1_3_2_1_55_1","volume-title":"End-to-End Dense Video Captioning with Masked Transformer","author":"Zhou Luowei","unstructured":"Luowei Zhou , Yingbo Zhou , Jason J. Corso , Richard Socher , and Caiming Xiong . 2018c. 
End-to-End Dense Video Captioning with Masked Transformer . In CVPR. IEEE. Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong. 2018c. End-to-End Dense Video Captioning with Masked Transformer. In CVPR. IEEE."},{"key":"e_1_3_2_1_56_1","doi-asserted-by":"crossref","unstructured":"Mingyang Zhou Luowei Zhou Shuohang Wang Yu Cheng Linjie Li Zhou Yu and Jingjing Liu. 2021 b. UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training. In CVPR.  Mingyang Zhou Luowei Zhou Shuohang Wang Yu Cheng Linjie Li Zhou Yu and Jingjing Liu. 2021 b. UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training. In CVPR.","DOI":"10.1109\/CVPR46437.2021.00414"}],"event":{"name":"MM '21: ACM Multimedia Conference","location":"Virtual Event China","acronym":"MM '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 29th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475431","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3474085.3475431","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:33Z","timestamp":1750193313000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475431"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,17]]},"references-count":56,"alternative-id":["10.1145\/3474085.3475431","10.1145\/3474085"],"URL":"https:\/\/doi.org\/10.1145\/3474085.3475431","relation":{},"subject":[],"published":{"date-parts":[[2021,10,17]]},"assertion":[{"value":"2021-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}