{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,30]],"date-time":"2025-12-30T08:49:19Z","timestamp":1767084559997,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":42,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,9,14]],"date-time":"2022-09-14T00:00:00Z","timestamp":1663113600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100010666","name":"H2020 Research Infrastructures","doi-asserted-by":"publisher","award":["GA 951911"],"award-info":[{"award-number":["GA 951911"]}],"id":[{"id":"10.13039\/100010666","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100009888","name":"Regione Toscana","doi-asserted-by":"publisher","award":["CUP B15J19001040004"],"award-info":[{"award-number":["CUP B15J19001040004"]}],"id":[{"id":"10.13039\/501100009888","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100021856","name":"Ministero dell'Universit\u00e0 e della Ricerca","doi-asserted-by":"publisher","award":["CUP B87G22000460001"],"award-info":[{"award-number":["CUP B87G22000460001"]}],"id":[{"id":"10.13039\/501100021856","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100006601","name":"Ministero degli Affari Esteri e della Cooperazione Internazionale","doi-asserted-by":"publisher","award":["Artificial Intelligence for Cultural Heritage (AI4CH)"],"award-info":[{"award-number":["Artificial Intelligence for Cultural Heritage (AI4CH)"]}],"id":[{"id":"10.13039\/501100006601","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,9,14]]},"DOI":"10.1145\/3549555.3549576","type":"proceedings-article","created":{"date-parts":[[2022,10,7]],"date-time":"2022-10-07T16:14:01Z","timestamp":1665159241000},"page":"64-70","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":29,"title":["ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval"],"prefix":"10.1145","author":[{"given":"Nicola","family":"Messina","sequence":"first","affiliation":[{"name":"ISTI-CNR, Italy"}]},{"given":"Matteo","family":"Stefanini","sequence":"additional","affiliation":[{"name":"Department of Engineering, University of Modena and Reggio Emilia, Italy"}]},{"given":"Marcella","family":"Cornia","sequence":"additional","affiliation":[{"name":"Department of Education and Humanities, University of Modena and Reggio Emilia, Italy"}]},{"given":"Lorenzo","family":"Baraldi","sequence":"additional","affiliation":[{"name":"Department of Engineering, University of Modena and Reggio Emilia, Italy"}]},{"given":"Fabrizio","family":"Falchi","sequence":"additional","affiliation":[{"name":"ISTI-CNR, Italy"}]},{"given":"Giuseppe","family":"Amato","sequence":"additional","affiliation":[{"name":"ISTI-CNR, Italy"}]},{"given":"Rita","family":"Cucchiara","sequence":"additional","affiliation":[{"name":"Department of Engineering, University of Modena and Reggio Emilia, Italy"}]}],"member":"320","published-online":{"date-parts":[[2022,10,7]]},"reference":[{"doi-asserted-by":"crossref","unstructured":"Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.  Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.","key":"e_1_3_2_1_1_1","DOI":"10.1109\/CVPR.2018.00636"},{"unstructured":"Rohan Anil Gabriel Pereyra Alexandre Passos Robert Ormandi George\u00a0E Dahl and Geoffrey\u00a0E Hinton. 2018. Large scale distributed neural network training through online distillation. In ICLR.  Rohan Anil Gabriel Pereyra Alexandre Passos Robert Ormandi George\u00a0E Dahl and Geoffrey\u00a0E Hinton. 2018. Large scale distributed neural network training through online distillation. In ICLR.","key":"e_1_3_2_1_2_1"},{"doi-asserted-by":"crossref","unstructured":"Pratyay Banerjee Tejas Gokhale Yezhou Yang and Chitta Baral. 2021. Weakly Supervised Relative Spatial Reasoning for Visual Question Answering. In ICCV.  Pratyay Banerjee Tejas Gokhale Yezhou Yang and Chitta Baral. 2021. Weakly Supervised Relative Spatial Reasoning for Visual Question Answering. In ICCV.","key":"e_1_3_2_1_3_1","DOI":"10.1109\/ICCV48922.2021.00192"},{"doi-asserted-by":"crossref","unstructured":"Manuele Barraco Matteo Stefanini Marcella Cornia Silvia Cascianelli Lorenzo Baraldi and Rita Cucchiara. 2022. CaMEL: Mean Teacher Learning for Image Captioning. In ICPR.  Manuele Barraco Matteo Stefanini Marcella Cornia Silvia Cascianelli Lorenzo Baraldi and Rita Cucchiara. 2022. CaMEL: Mean Teacher Learning for Image Captioning. In ICPR.","key":"e_1_3_2_1_4_1","DOI":"10.1109\/ICPR56361.2022.9955644"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_5_1","DOI":"10.1145\/3442381.3449794"},{"doi-asserted-by":"crossref","unstructured":"Zhe Cao Tao Qin Tie-Yan Liu Ming-Feng Tsai and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In ICML.  Zhe Cao Tao Qin Tie-Yan Liu Ming-Feng Tsai and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In ICML.","key":"e_1_3_2_1_6_1","DOI":"10.1145\/1273496.1273513"},{"doi-asserted-by":"crossref","unstructured":"Mathilde Caron Hugo Touvron Ishan Misra Herv\u00e9 J\u00e9gou Julien Mairal Piotr Bojanowski and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In ICCV.  Mathilde Caron Hugo Touvron Ishan Misra Herv\u00e9 J\u00e9gou Julien Mairal Piotr Bojanowski and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In ICCV.","key":"e_1_3_2_1_7_1","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"e_1_3_2_1_8_1","volume-title":"UNITER: UNiversal Image-TExt Representation Learning. In ECCV.","author":"Chen Yen-Chun","year":"2020","unstructured":"Yen-Chun Chen , Linjie Li , Licheng Yu , Ahmed El\u00a0Kholy , Faisal Ahmed , Zhe Gan , Yu Cheng , and Jingjing Liu . 2020 . UNITER: UNiversal Image-TExt Representation Learning. In ECCV. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El\u00a0Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. In ECCV."},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_9_1","DOI":"10.1007\/s11042-020-09251-4"},{"doi-asserted-by":"crossref","unstructured":"Marcella Cornia Matteo Stefanini Lorenzo Baraldi and Rita Cucchiara. 2020. Meshed-memory transformer for image captioning. In CVPR.  Marcella Cornia Matteo Stefanini Lorenzo Baraldi and Rita Cucchiara. 2020. Meshed-memory transformer for image captioning. In CVPR.","key":"e_1_3_2_1_10_1","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"e_1_3_2_1_11_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL."},{"unstructured":"Fartash Faghri David\u00a0J Fleet Jamie\u00a0Ryan Kiros and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In BMVC.  Fartash Faghri David\u00a0J Fleet Jamie\u00a0Ryan Kiros and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In BMVC.","key":"e_1_3_2_1_12_1"},{"doi-asserted-by":"publisher","key":"e_1_3_2_1_13_1","DOI":"10.1109\/TPAMI.2018.2883466"},{"unstructured":"Chao Jia Yinfei Yang Ye Xia Yi-Ting Chen Zarana Parekh Hieu Pham Quoc Le Yun-Hsuan Sung Zhen Li and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML.  Chao Jia Yinfei Yang Ye Xia Yi-Ting Chen Zarana Parekh Hieu Pham Quoc Le Yun-Hsuan Sung Zhen Li and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML.","key":"e_1_3_2_1_14_1"},{"doi-asserted-by":"crossref","unstructured":"Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.  Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.","key":"e_1_3_2_1_15_1","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_1_16_1","volume-title":"NeurIPS Workshops.","author":"Kiros Ryan","year":"2014","unstructured":"Ryan Kiros , Ruslan Salakhutdinov , and Richard\u00a0 S Zemel . 2014 . Unifying visual-semantic embeddings with multimodal neural language models . In NeurIPS Workshops. Ryan Kiros, Ruslan Salakhutdinov, and Richard\u00a0S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. In NeurIPS Workshops."},{"unstructured":"Kuang-Huei Lee Hamid Palangi Xi Chen Houdong Hu and Jianfeng Gao. 2019. Learning visual relation priors for image-text matching and image captioning with neural scene graph generators. arXiv preprint arXiv:1909.09953(2019).  Kuang-Huei Lee Hamid Palangi Xi Chen Houdong Hu and Jianfeng Gao. 2019. Learning visual relation priors for image-text matching and image captioning with neural scene graph generators. arXiv preprint arXiv:1909.09953(2019).","key":"e_1_3_2_1_17_1"},{"key":"e_1_3_2_1_18_1","volume-title":"BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL.","author":"Lewis Mike","year":"2020","unstructured":"Mike Lewis , Yinhan Liu , Naman Goyal , Marjan Ghazvininejad , Abdelrahman Mohamed , Omer Levy , Ves Stoyanov , and Luke Zettlemoyer . 2020 . BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL."},{"key":"e_1_3_2_1_19_1","volume-title":"Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI.","author":"Li Gen","year":"2020","unstructured":"Gen Li , Nan Duan , Yuejian Fang , Ming Gong , and Daxin Jiang . 2020 . Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI. Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI."},{"unstructured":"Kunpeng Li Yulun Zhang Kai Li Yuanyuan Li and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In ICCV.  Kunpeng Li Yulun Zhang Kai Li Yuanyuan Li and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In ICCV.","key":"e_1_3_2_1_20_1"},{"key":"e_1_3_2_1_21_1","volume-title":"Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV.","author":"Li Xiujun","year":"2020","unstructured":"Xiujun Li , Xi Yin , Chunyuan Li , Pengchuan Zhang , Xiaowei Hu , Lei Zhang , Lijuan Wang , Houdong Hu , Li Dong , Furu Wei , 2020 . Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV. Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV."},{"unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692(2019).  Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692(2019).","key":"e_1_3_2_1_22_1"},{"unstructured":"Jiasen Lu Dhruv Batra Devi Parikh and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS.  Jiasen Lu Dhruv Batra Devi Parikh and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS.","key":"e_1_3_2_1_23_1"},{"unstructured":"Jiasen Lu Vedanuj Goswami Marcus Rohrbach Devi Parikh and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In CVPR.  Jiasen Lu Vedanuj Goswami Marcus Rohrbach Devi Parikh and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In CVPR.","key":"e_1_3_2_1_24_1"},{"key":"e_1_3_2_1_25_1","volume-title":"Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM TOMM 17, 4","author":"Messina Nicola","year":"2021","unstructured":"Nicola Messina , Giuseppe Amato , Andrea Esuli , Fabrizio Falchi , Claudio Gennaro , and St\u00e9phane Marchand-Maillet . 2021. Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM TOMM 17, 4 ( 2021 ). Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, and St\u00e9phane Marchand-Maillet. 2021. Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM TOMM 17, 4 (2021)."},{"doi-asserted-by":"crossref","unstructured":"Nicola Messina Giuseppe Amato Fabrizio Falchi Claudio Gennaro and St\u00e9phane Marchand-Maillet. 2021. Towards efficient cross-modal visual textual retrieval using transformer-encoder deep features. In CBMI.  Nicola Messina Giuseppe Amato Fabrizio Falchi Claudio Gennaro and St\u00e9phane Marchand-Maillet. 2021. Towards efficient cross-modal visual textual retrieval using transformer-encoder deep features. In CBMI.","key":"e_1_3_2_1_26_1","DOI":"10.1109\/CBMI50038.2021.9461890"},{"doi-asserted-by":"crossref","unstructured":"Nicola Messina Fabrizio Falchi Andrea Esuli and Giuseppe Amato. 2021. Transformer reasoning network for image-text matching and retrieval. In ICPR.  Nicola Messina Fabrizio Falchi Andrea Esuli and Giuseppe Amato. 2021. Transformer reasoning network for image-text matching and retrieval. In ICPR.","key":"e_1_3_2_1_27_1","DOI":"10.1109\/ICPR48806.2021.9413172"},{"unstructured":"Przemys\u0142aw Pobrotyn Tomasz Bartczak Miko\u0142aj Synowiec Rados\u0142aw Bia\u0142obrzeski and Jaros\u0142aw Bojar. 2020. Context-aware learning to rank with self-attention. arXiv preprint arXiv:2005.10084(2020).  Przemys\u0142aw Pobrotyn Tomasz Bartczak Miko\u0142aj Synowiec Rados\u0142aw Bia\u0142obrzeski and Jaros\u0142aw Bojar. 2020. Context-aware learning to rank with self-attention. arXiv preprint arXiv:2005.10084(2020).","key":"e_1_3_2_1_28_1"},{"unstructured":"Di Qi Lin Su Jia Song Edward Cui Taroon Bharti and Arun Sacheti. 2020. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data. arXiv preprint arXiv:2001.07966(2020).  Di Qi Lin Su Jia Song Edward Cui Taroon Bharti and Arun Sacheti. 2020. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data. arXiv preprint arXiv:2001.07966(2020).","key":"e_1_3_2_1_29_1"},{"unstructured":"Leigang Qu Meng Liu Da Cao Liqiang Nie and Qi Tian. 2020. Context-Aware Multi-View Summarization Network for Image-Text Matching. In ACM Multimedia.  Leigang Qu Meng Liu Da Cao Liqiang Nie and Qi Tian. 2020. Context-Aware Multi-View Summarization Network for Image-Text Matching. In ACM Multimedia.","key":"e_1_3_2_1_30_1"},{"unstructured":"Alec Radford Jong\u00a0Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark 2021. Learning transferable visual models from natural language supervision. In ICML.  Alec Radford Jong\u00a0Wook Kim Chris Hallacy Aditya Ramesh Gabriel Goh Sandhini Agarwal Girish Sastry Amanda Askell Pamela Mishkin Jack Clark 2021. Learning transferable visual models from natural language supervision. In ICML.","key":"e_1_3_2_1_31_1"},{"doi-asserted-by":"crossref","unstructured":"Nikolaos Sarafianos Xiang Xu and Ioannis\u00a0A Kakadiaris. 2019. Adversarial representation learning for text-to-image matching. In ICCV.  Nikolaos Sarafianos Xiang Xu and Ioannis\u00a0A Kakadiaris. 2019. Adversarial representation learning for text-to-image matching. In ICCV.","key":"e_1_3_2_1_32_1","DOI":"10.1109\/ICCV.2019.00591"},{"key":"e_1_3_2_1_33_1","volume-title":"From Show to Tell: A Survey on Deep Learning-based Image Captioning","author":"Stefanini Matteo","year":"2022","unstructured":"Matteo Stefanini , Marcella Cornia , Lorenzo Baraldi , Silvia Cascianelli , Giuseppe Fiameni , and Rita Cucchiara . 2022. From Show to Tell: A Survey on Deep Learning-based Image Captioning . IEEE Trans. PAMI ( 2022 ). Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. 2022. From Show to Tell: A Survey on Deep Learning-based Image Captioning. IEEE Trans. PAMI (2022)."},{"doi-asserted-by":"crossref","unstructured":"Matteo Stefanini Marcella Cornia Lorenzo Baraldi and Rita Cucchiara. 2021. A novel attention-based aggregation function to combine vision and language. In ICPR.  Matteo Stefanini Marcella Cornia Lorenzo Baraldi and Rita Cucchiara. 2021. A novel attention-based aggregation function to combine vision and language. In ICPR.","key":"e_1_3_2_1_34_1","DOI":"10.1109\/ICPR48806.2021.9413269"},{"unstructured":"Weijie Su Xizhou Zhu Yue Cao Bin Li Lewei Lu Furu Wei and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR.  Weijie Su Xizhou Zhu Yue Cao Bin Li Lewei Lu Furu Wei and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR.","key":"e_1_3_2_1_35_1"},{"key":"e_1_3_2_1_36_1","volume-title":"Learning Dual Semantic Relations with Graph Attention for Image-Text Matching","author":"Wen Keyu","year":"2020","unstructured":"Keyu Wen , Xiaodong Gu , and Qingrong Cheng . 2020. Learning Dual Semantic Relations with Graph Attention for Image-Text Matching . IEEE Transactions on Circuits and Systems for Video Technology ( 2020 ). Keyu Wen, Xiaodong Gu, and Qingrong Cheng. 2020. Learning Dual Semantic Relations with Graph Attention for Image-Text Matching. IEEE Transactions on Circuits and Systems for Video Technology (2020)."},{"unstructured":"Yiling Wu Shuhui Wang Guoli Song and Qingming Huang. 2019. Learning fragment self-attention embeddings for image-text matching. In ACM Multimedia.  Yiling Wu Shuhui Wang Guoli Song and Qingming Huang. 2019. Learning fragment self-attention embeddings for image-text matching. In ACM Multimedia.","key":"e_1_3_2_1_37_1"},{"doi-asserted-by":"crossref","unstructured":"Qizhe Xie Minh-Thang Luong Eduard Hovy and Quoc\u00a0V Le. 2020. Self-Training With Noisy Student Improves ImageNet Classification. In CVPR.  Qizhe Xie Minh-Thang Luong Eduard Hovy and Quoc\u00a0V Le. 2020. Self-Training With Noisy Student Improves ImageNet Classification. In CVPR.","key":"e_1_3_2_1_38_1","DOI":"10.1109\/CVPR42600.2020.01070"},{"doi-asserted-by":"crossref","unstructured":"Linfeng Zhang Jiebo Song Anni Gao Jingwei Chen Chenglong Bao and Kaisheng Ma. 2019. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In ICCV.  Linfeng Zhang Jiebo Song Anni Gao Jingwei Chen Chenglong Bao and Kaisheng Ma. 2019. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In ICCV.","key":"e_1_3_2_1_39_1","DOI":"10.1109\/ICCV.2019.00381"},{"doi-asserted-by":"crossref","unstructured":"Pengchuan Zhang Xiujun Li Xiaowei Hu Jianwei Yang Lei Zhang Lijuan Wang Yejin Choi and Jianfeng Gao. 2021. VinVL: Revisiting Visual Representations in Vision-Language Models. In CVPR.  Pengchuan Zhang Xiujun Li Xiaowei Hu Jianwei Yang Lei Zhang Lijuan Wang Yejin Choi and Jianfeng Gao. 2021. VinVL: Revisiting Visual Representations in Vision-Language Models. In CVPR.","key":"e_1_3_2_1_40_1","DOI":"10.1109\/CVPR46437.2021.00553"},{"doi-asserted-by":"crossref","unstructured":"Luowei Zhou Hamid Palangi Lei Zhang Houdong Hu Jason\u00a0J Corso and Jianfeng Gao. 2020. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI.  Luowei Zhou Hamid Palangi Lei Zhang Houdong Hu Jason\u00a0J Corso and Jianfeng Gao. 2020. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI.","key":"e_1_3_2_1_41_1","DOI":"10.1609\/aaai.v34i07.7005"},{"doi-asserted-by":"crossref","unstructured":"Yuanen Zhou Meng Wang Daqing Liu Zhenzhen Hu and Hanwang Zhang. 2020. More grounded image captioning by distilling image-text matching model. In CVPR.  Yuanen Zhou Meng Wang Daqing Liu Zhenzhen Hu and Hanwang Zhang. 2020. More grounded image captioning by distilling image-text matching model. In CVPR.","key":"e_1_3_2_1_42_1","DOI":"10.1109\/CVPR42600.2020.00483"}],"event":{"acronym":"CBMI 2022","name":"CBMI 2022: International Conference on Content-based Multimedia Indexing","location":"Graz Austria"},"container-title":["Proceedings of the 19th International Conference on Content-based Multimedia Indexing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3549555.3549576","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3549555.3549576","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:12Z","timestamp":1750186812000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3549555.3549576"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,14]]},"references-count":42,"alternative-id":["10.1145\/3549555.3549576","10.1145\/3549555"],"URL":"https:\/\/doi.org\/10.1145\/3549555.3549576","relation":{},"subject":[],"published":{"date-parts":[[2022,9,14]]},"assertion":[{"value":"2022-10-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}