{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,15]],"date-time":"2025-11-15T10:28:55Z","timestamp":1763202535824,"version":"3.41.0"},"reference-count":100,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2021,7,20]],"date-time":"2021-07-20T00:00:00Z","timestamp":1626739200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Knowl. Discov. Data"],"published-print":{"date-parts":[[2022,2,28]]},"abstract":"<jats:p>\n            Vision-and-language (V-L) tasks require the system to understand both vision content and natural language, thus learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models are proposed to learn V-L representations and achieve improved results in many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are\n            <jats:italic>entangled<\/jats:italic>\n            in\n            <jats:italic>one common latent space<\/jats:italic>\n            . To tackle this problem, we propose DiMBERT (short for\n            <jats:bold>Di<\/jats:bold>\n            sentangled\n            <jats:bold>M<\/jats:bold>\n            ultimodal-Attention\n            <jats:bold>BERT<\/jats:bold>\n            ), which is a novel framework that applies separated attention spaces for vision and language, and the representations of multi-modalities can thus be disentangled explicitly. To enhance the correlation between vision and language in disentangled spaces, we introduce the visual concepts to DiMBERT which represent visual information in textual format. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large amount of image\u2013sentence pairs on two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-train, DiMBERT is further fine-tuned for the downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for\n            <jats:bold>Di<\/jats:bold>\n            sentangled\n            <jats:bold>M<\/jats:bold>\n            ultimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, up to a 5% increase on the representative task. 
Finally, we conduct a systematic analysis and demonstrate the effectiveness of our DiM and the introduced visual concepts.\n          <\/jats:p>","DOI":"10.1145\/3447685","type":"journal-article","created":{"date-parts":[[2021,7,20]],"date-time":"2021-07-20T21:06:18Z","timestamp":1626815178000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention"],"prefix":"10.1145","volume":"16","author":[{"given":"Fenglin","family":"Liu","sequence":"first","affiliation":[{"name":"ADSPLAB, School of ECE, Peking University, Shenzhen, Guangdong, China"}]},{"given":"Xian","family":"Wu","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, China"}]},{"given":"Shen","family":"Ge","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, China"}]},{"given":"Xuancheng","family":"Ren","sequence":"additional","affiliation":[{"name":"MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University, Beijing, China"}]},{"given":"Wei","family":"Fan","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, China"}]},{"given":"Xu","family":"Sun","sequence":"additional","affiliation":[{"name":"School of EECS, Peking University and Center for Data Science, Peking University, Beijing, China"}]},{"given":"Yuexian","family":"Zou","sequence":"additional","affiliation":[{"name":"ADSPLAB, School of ECE, Peking University and Peng Cheng Laboratory, Shenzhen, Guangdong, China"}]}],"member":"320","published-online":{"date-parts":[[2021,7,20]]},"reference":[
{"key":"e_1_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. 2019. Fusion of detected objects in text for visual question answering. In EMNLP.","DOI":"10.18653\/v1\/D19-1219"},
{"key":"e_1_2_1_2_1","volume-title":"SPICE: Semantic propositional image caption evaluation. In ECCV.","author":"Anderson Peter","year":"2016","unstructured":"Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In ECCV."},
{"key":"e_1_2_1_3_1","doi-asserted-by":"crossref","unstructured":"Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and VQA. In CVPR.","DOI":"10.1109\/CVPR.2018.00636"},
{"key":"e_1_2_1_4_1","volume-title":"Hinton","author":"Ba Lei Jimmy","year":"2016","unstructured":"Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},
{"key":"e_1_2_1_5_1","volume-title":"METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL.","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL."},
{"key":"e_1_2_1_6_1","volume-title":"Multimodal pretraining unmasked: Unifying the vision and language BERTs. arXiv preprint arXiv:2011.15124","author":"Bugliarello Emanuele","year":"2020","unstructured":"Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, and Desmond Elliott. 2020. Multimodal pretraining unmasked: Unifying the vision and language BERTs. arXiv preprint arXiv:2011.15124 (2020)."},
{"key":"e_1_2_1_7_1","volume-title":"Plummer","author":"Burns Andrea","year":"2020","unstructured":"Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, and Bryan A. Plummer. 2020. Learning to scale multilingual representations for vision-language tasks. In ECCV."},
{"key":"e_1_2_1_8_1","volume-title":"Pranava Madhyastha, Erkut Erdem, Aykut Erdem, and Lucia Specia.","author":"Caglayan Ozan","year":"2021","unstructured":"Ozan Caglayan, Menekse Kuyu, Mustafa Sercan Amac, Pranava Madhyastha, Erkut Erdem, Aykut Erdem, and Lucia Specia. 2021. Cross-lingual visual pre-training for multimodal machine translation. In EACL."},
{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. 2020. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In ECCV.","DOI":"10.1007\/978-3-030-58539-6_34"},
{"key":"e_1_2_1_10_1","volume-title":"VisualGPT: Data-efficient image captioning by balancing visual input and linguistic knowledge from pretraining. arXiv preprint arXiv:2102.10407","author":"Chen Jun","year":"2021","unstructured":"Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. 2021. VisualGPT: Data-efficient image captioning by balancing visual input and linguistic knowledge from pretraining. arXiv preprint arXiv:2102.10407 (2021)."},
{"key":"e_1_2_1_11_1","volume-title":"Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325","author":"Chen Xinlei","year":"2015","unstructured":"Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)."},
{"key":"e_1_2_1_12_1","volume-title":"Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu.","author":"Chen Yen-Chun","year":"2020","unstructured":"Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Learning universal image-text representations. In ECCV."},
{"key":"e_1_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Aniruddha Kembhavi. 2020. X-LXMERT: Paint, caption and answer questions with multi-modal transformers. In EMNLP.","DOI":"10.18653\/v1\/2020.emnlp-main.707"},
{"key":"e_1_2_1_14_1","doi-asserted-by":"crossref","unstructured":"Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.","DOI":"10.1109\/CVPR.2009.5206848"},
{"key":"e_1_2_1_15_1","volume-title":"BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT."},
{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3455457"},
{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1207\/s15516709cog1402_1"},
{"key":"e_1_2_1_18_1","volume-title":"Li Deng, Piotr Doll\u00e1r, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C.","author":"Fang Hao","year":"2015","unstructured":"Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Kumar Srivastava, Li Deng, Piotr Doll\u00e1r, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. 2015. From captions to visual concepts and back. In CVPR."},
{"key":"e_1_2_1_19_1","doi-asserted-by":"crossref","unstructured":"Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages. In EMNLP (Findings).","DOI":"10.18653\/v1\/2020.findings-emnlp.139"},
{"key":"e_1_2_1_20_1","unstructured":"Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-scale adversarial training for vision-and-language representation learning. In NeurIPS."},
{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401430"},
{"key":"e_1_2_1_22_1","doi-asserted-by":"crossref","unstructured":"Fran\u00e7ois Gard\u00e8res, Maryam Ziaeefard, Baptiste Abeloos, and Freddy L\u00e9cu\u00e9. 2020. ConceptBert: Concept-aware representation for visual question answering. In EMNLP (Findings).","DOI":"10.18653\/v1\/2020.findings-emnlp.44"},
{"key":"e_1_2_1_23_1","volume-title":"COOT: Cooperative hierarchical transformer for video-text representation learning. In NeurIPS.","author":"Ging Simon","year":"2020","unstructured":"Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, and Thomas Brox. 2020. COOT: Cooperative hierarchical transformer for video-text representation learning. In NeurIPS."},
{"key":"e_1_2_1_24_1","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR."},
{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},
{"key":"e_1_2_1_26_1","volume-title":"M3P: Learning universal representations via multitask multilingual multimodal pre-training. arXiv preprint arXiv:2006.02635","author":"Huang Haoyang","year":"2020","unstructured":"Haoyang Huang, Lin Su, Di Qi, Nan Duan, Edward Cui, Taroon Bharti, Lei Zhang, Lijuan Wang, Jianfeng Gao, Bei Liu, Jianlong Fu, Dongdong Zhang, Xin Liu, and Ming Zhou. 2020. M3P: Learning universal representations via multitask multilingual multimodal pre-training. arXiv preprint arXiv:2006.02635 (2020)."},
{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Lun Huang, Wenmin Wang, Jie Chen, and Xiaoyong Wei. 2019. Attention on attention for image captioning. In ICCV.","DOI":"10.1109\/ICCV.2019.00473"},
{"key":"e_1_2_1_28_1","unstructured":"Ting-Hao (Kenneth) Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross B. Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual storytelling. In HLT-NAACL."},
{"key":"e_1_2_1_29_1","volume-title":"Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849","author":"Huang Zhicheng","year":"2020","unstructured":"Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020)."},
{"key":"e_1_2_1_30_1","volume-title":"Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918 (2021)."},
{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Wenhao Jiang, Lin Ma, Yu-Gang Jiang, Wei Liu, and Tong Zhang. 2018. Recurrent fusion network for image captioning. In ECCV.","DOI":"10.1007\/978-3-030-01216-8_31"},
{"key":"e_1_2_1_32_1","doi-asserted-by":"crossref","unstructured":"Andrej Karpathy and Fei-Fei Li. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.","DOI":"10.1109\/CVPR.2015.7298932"},
{"key":"e_1_2_1_33_1","volume-title":"Berg","author":"Kazemzadeh Sahar","year":"2014","unstructured":"Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. 2014. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP."},
{"key":"e_1_2_1_34_1","volume-title":"Niels da Vitoria Lobo, and Mubarak Shah","author":"Khan Aisha Urooj","year":"2020","unstructured":"Aisha Urooj Khan, Amir Mazaheri, Niels da Vitoria Lobo, and Mubarak Shah. 2020. MMFT-BERT: Multimodal fusion transformer with BERT encodings for visual question answering. In EMNLP (Findings)."},
{"key":"e_1_2_1_35_1","volume-title":"ViLT: Vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334 (2021)."},
{"key":"e_1_2_1_36_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR."},
{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},
{"key":"e_1_2_1_38_1","doi-asserted-by":"crossref","unstructured":"Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, and Jingjing Liu. 2021. Less is more: ClipBERT for video-and-language learning via sparse sampling. In CVPR.","DOI":"10.1109\/CVPR46437.2021.00725"},
{"key":"e_1_2_1_39_1","doi-asserted-by":"crossref","unstructured":"Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2020. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In AAAI.","DOI":"10.1609\/aaai.v34i07.6795"},
{"volume-title":"Entangled transformer for image captioning","author":"Li Guang","key":"e_1_2_1_40_1","unstructured":"Guang Li, Linchao Zhu, Ping Liu, and Yi Yang. 2019. Entangled transformer for image captioning. In ICCV. IEEE, 8927\u20138936."},
{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350918"},
{"key":"e_1_2_1_42_1","volume-title":"HERO: Hierarchical encoder for video+language omni-representation pre-training. In EMNLP.","author":"Li Linjie","year":"2020","unstructured":"Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. HERO: Hierarchical encoder for video+language omni-representation pre-training. In EMNLP."},
{"key":"e_1_2_1_43_1","volume-title":"A closer look at the robustness of vision-and-language pre-trained models. arXiv preprint arXiv:2012.08673","author":"Li Linjie","year":"2020","unstructured":"Linjie Li, Zhe Gan, and Jingjing Liu. 2020. A closer look at the robustness of vision-and-language pre-trained models. arXiv preprint arXiv:2012.08673 (2020)."},
{"key":"e_1_2_1_44_1","volume-title":"VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arxiv:1908.03557","author":"Li Liunian Harold","year":"2019","unstructured":"Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arxiv:1908.03557 (2019)."},
{"key":"e_1_2_1_45_1","volume-title":"Weakly-supervised VisualBERT: Pre-training without parallel images and captions. arXiv preprint arXiv:2010.12831","author":"Li Liunian Harold","year":"2020","unstructured":"Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, and Kai-Wei Chang. 2020. Weakly-supervised VisualBERT: Pre-training without parallel images and captions. arXiv preprint arXiv:2010.12831 (2020)."},
{"key":"e_1_2_1_46_1","volume-title":"UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409","author":"Li Wei","year":"2020","unstructured":"Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2020. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409 (2020)."},
{"key":"e_1_2_1_47_1","volume-title":"Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV.","author":"Li Xiujun","year":"2020","unstructured":"Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV."},
{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073445.1073465"},
{"key":"e_1_2_1_49_1","volume-title":"ROUGE: A package for automatic evaluation of summaries. In ACL.","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL."},
{"key":"e_1_2_1_50_1","unstructured":"Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV."},
{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240632"},
{"key":"e_1_2_1_52_1","unstructured":"Fenglin Liu, Meng Gao, Tianhao Zhang, and Yuexian Zou. 2019. Exploring semantic relationships for image captioning without parallel data. In ICDM."},
{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3454902"},
{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.5555\/3367722.3367754"},
{"key":"e_1_2_1_55_1","unstructured":"Fenglin Liu, Xuancheng Ren, Yuanxin Liu, Houfeng Wang, and Xu Sun. 2018. simNet: Stepwise image-topic merging network for generating detailed and comprehensive image captions. In EMNLP."},
{"key":"e_1_2_1_56_1","unstructured":"Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, and Yuexian Zou. 2020. Federated learning for vision-and-language grounding problems. In AAAI."},
{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3414004"},
{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3454289"},
{"key":"e_1_2_1_59_1","unstructured":"Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In CVPR."},
{"key":"e_1_2_1_60_1","unstructured":"Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2018. Neural baby talk. In CVPR."},
{"key":"e_1_2_1_61_1","volume-title":"UniViLM: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353","author":"Luo Huaishao","year":"2020","unstructured":"Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Xilin Chen, and Ming Zhou. 2020. UniViLM: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353 (2020)."},
{"key":"e_1_2_1_62_1","volume-title":"lamBERT: Language and action learning using multimodal BERT. arXiv preprint arXiv:2004.07093","author":"Miyazawa Kazuki","year":"2020","unstructured":"Kazuki Miyazawa, Tatsuya Aoki, Takato Horii, and Takayuki Nagai. 2020. lamBERT: Language and action learning using multimodal BERT. arXiv preprint arXiv:2004.07093 (2020)."},
{"key":"e_1_2_1_63_1","doi-asserted-by":"crossref","unstructured":"Vishvak Murahari, Dhruv Batra, Devi Parikh, and Abhishek Das. 2020. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In ECCV.","DOI":"10.1007\/978-3-030-58523-5_20"},
{"key":"e_1_2_1_64_1","unstructured":"Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. 2017. Dual attention networks for multimodal reasoning and matching. In CVPR."},
{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.5555\/2986459.2986587"},
{"key":"e_1_2_1_66_1","unstructured":"Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. 2017. Video captioning with transferred semantic attributes. In CVPR."},
{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},
{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969239.2969250"},
{"key":"e_1_2_1_69_1","doi-asserted-by":"crossref","unstructured":"Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In CVPR.","DOI":"10.1109\/CVPR.2017.131"},
{"key":"e_1_2_1_70_1","doi-asserted-by":"crossref","unstructured":"Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL.","DOI":"10.18653\/v1\/P18-1238"},
{"key":"e_1_2_1_71_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},
{"key":"e_1_2_1_72_1","unstructured":"Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. In ICLR."},
{"key":"e_1_2_1_73_1","volume-title":"Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arxiv:1906.05743","author":"Sun Chen","year":"2019","unstructured":"Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arxiv:1906.05743 (2019)."},
{"key":"e_1_2_1_74_1","volume-title":"Carl Vondrick, Kevin Murphy, and Cordelia Schmid.","author":"Sun Chen","year":"2019","unstructured":"Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. In ICCV."},
{"key":"e_1_2_1_75_1","doi-asserted-by":"crossref","unstructured":"Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In CVPR.","DOI":"10.1109\/CVPR.2015.7298594"},
{"key":"e_1_2_1_76_1","volume-title":"LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP.","author":"Tan Hao","year":"2019","unstructured":"Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP."},
{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},
{"key":"e_1_2_1_78_1","doi-asserted-by":"crossref","unstructured":"Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In CVPR.","DOI":"10.1109\/CVPR.2015.7299087"},
{"key":"e_1_2_1_79_1","volume-title":"MiniVLM: A smaller and faster vision-language model. arXiv preprint arXiv:2012.06946","author":"Wang Jianfeng","year":"2020","unstructured":"Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. 2020. MiniVLM: A smaller and faster vision-language model. arXiv preprint arXiv:2012.06946 (2020)."},
{"key":"e_1_2_1_80_1","doi-asserted-by":"crossref","unstructured":"Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang. 2018. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL.","DOI":"10.18653\/v1\/P18-1083"},
{"key":"e_1_2_1_81_1","doi-asserted-by":"crossref","unstructured":"Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony R. Dick, and Anton van den Hengel. 2016. What value do explicit high level concepts have in vision to language problems? In CVPR.","DOI":"10.1109\/CVPR.2016.29"},
{"key":"e_1_2_1_82_1","volume-title":"Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144","author":"Wu Yonghui","year":"2016","unstructured":"Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)."},
{"key":"e_1_2_1_83_1","volume-title":"XGPT: Cross-modal generative pre-training for image captioning. arXiv preprint arXiv:2003.01473","author":"Xia Qiaolin","year":"2020","unstructured":"Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, and Ming Zhou. 2020. XGPT: Cross-modal generative pre-training for image captioning. arXiv preprint arXiv:2003.01473 (2020)."},
{"key":"e_1_2_1_84_1","unstructured":"Saining Xie, Ross B. Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In CVPR."},
{"key":"e_1_2_1_85_1","volume-title":"KM-BART: Knowledge enhanced multimodal BART for visual commonsense generation. arXiv preprint arXiv:2101.00419","author":"Xing Yiran","year":"2021","unstructured":"Yiran Xing, Zai Shi, Zhao Meng, Yunpu Ma, and Roger Wattenhofer. 2021. KM-BART: Knowledge enhanced multimodal BART for visual commonsense generation. arXiv preprint arXiv:2101.00419 (2021)."},
{"key":"e_1_2_1_86_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045118.3045336"},
{"key":"e_1_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.5555\/3367722.3367790"},
{"key":"e_1_2_1_88_1","doi-asserted-by":"crossref","unstructured":"Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. 2019. Auto-encoding scene graphs for image captioning. In CVPR.","DOI":"10.1109\/CVPR.2019.01094"},
{"key":"e_1_2_1_89_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3454804"},
{"key":"e_1_2_1_90_1","doi-asserted-by":"crossref","unstructured":"Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In ECCV.","DOI":"10.1007\/978-3-030-01264-9_42"},
{"key":"e_1_2_1_91_1","doi-asserted-by":"crossref","unstructured":"Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2017. Boosting image captioning with attributes. In ICCV.","DOI":"10.1109\/ICCV.2017.524"},
{"key":"e_1_2_1_92_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"},
{"key":"e_1_2_1_93_1","volume-title":"ERNIE-ViL: Knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934","author":"Yu Fei","year":"2020","unstructured":"Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934 (2020)."},
{"key":"e_1_2_1_94_1","volume-title":"Berg","author":"Yu Licheng","year":"2018","unstructured":"Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. 2018. MAttNet: Modular attention network for referring expression comprehension. In CVPR. IEEE Computer Society, 1307\u20131315."},
{"key":"e_1_2_1_95_1","volume-title":"Viola","author":"Zhang Cha","year":"2006","unstructured":"Cha Zhang, John C. Platt, and Paul A. Viola. 2006. Multiple instance boosting for object detection. In NIPS."},
{"key":"e_1_2_1_96_1","volume-title":"VinVL: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529","author":"Zhang Pengchuan","year":"2021","unstructured":"Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. VinVL: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529 (2021)."},
{"key":"e_1_2_1_97_1","volume-title":"MUSE: Parallel multi-scale attention for sequence to sequence learning. arXiv preprint arxiv:1911.09483","author":"Zhao Guangxiang","year":"2019","unstructured":"Guangxiang Zhao, Xu Sun, Jingjing Xu, Zhiyuan Zhang, and Liangchen Luo. 2019. MUSE: Parallel multi-scale attention for sequence to sequence learning. arXiv preprint arxiv:1911.09483 (2019)."},
{"key":"e_1_2_1_98_1","doi-asserted-by":"crossref","unstructured":"Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, and Marcus Rohrbach. 2019. Grounded video description. In CVPR.","DOI":"10.1109\/CVPR.2019.00674"},
{"key":"e_1_2_1_99_1","doi-asserted-by":"crossref","unstructured":"Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and VQA. In AAAI.","DOI":"10.1609\/aaai.v34i07.7005"},
{"key":"e_1_2_1_100_1","unstructured":"Linchao Zhu and Yi Yang. 2020. ActBERT: Learning global-local video-text representations.
In CVPR."}],"container-title":["ACM Transactions on Knowledge Discovery from Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3447685","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3447685","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:41:10Z","timestamp":1750200070000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3447685"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,20]]},"references-count":100,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,2,28]]}},"alternative-id":["10.1145\/3447685"],"URL":"https:\/\/doi.org\/10.1145\/3447685","relation":{},"ISSN":["1556-4681","1556-472X"],"issn-type":[{"type":"print","value":"1556-4681"},{"type":"electronic","value":"1556-472X"}],"subject":[],"published":{"date-parts":[[2021,7,20]]},"assertion":[{"value":"2020-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-07-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}