{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T17:10:22Z","timestamp":1772298622842,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":43,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100014219","name":"National Science Fund for Distinguished Young Scholars","doi-asserted-by":"publisher","award":["62125601"],"award-info":[{"award-number":["62125601"]}],"id":[{"id":"10.13039\/501100014219","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62072032, 62076024"],"award-info":[{"award-number":["62072032, 62076024"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2020AAA09701"],"award-info":[{"award-number":["2020AAA09701"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548320","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:12Z","timestamp":1665416592000},"page":"4957-4966","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["CAliC: Accurate and Efficient Image-Text Retrieval via Contrastive Alignment and Visual Contexts 
Modeling"],"prefix":"10.1145","author":[{"given":"Hongyu","family":"Gao","sequence":"first","affiliation":[{"name":"University of Science and Technology Beijing, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chao","family":"Zhu","sequence":"additional","affiliation":[{"name":"University of Science and Technology Beijing, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mengyin","family":"Liu","sequence":"additional","affiliation":[{"name":"University of Science and Technology Beijing, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Weibo","family":"Gu","sequence":"additional","affiliation":[{"name":"Tencent, Shenzhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hongfa","family":"Wang","sequence":"additional","affiliation":[{"name":"Tencent, Shenzhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wei","family":"Liu","sequence":"additional","affiliation":[{"name":"Tencent, Shenzhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xu-cheng","family":"Yin","sequence":"additional","affiliation":[{"name":"University of Science and Technology Beijing, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","unstructured":"Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. In NeurIPS."},{"key":"e_1_3_2_2_2_1","volume-title":"BEiT: BERT Pre-Training of Image Transformers. CoRR","author":"Bao Hangbo","year":"2021","unstructured":"Hangbo Bao, Li Dong, and Furu Wei. 2021. BEiT: BERT Pre-Training of Image Transformers. CoRR, Vol. abs\/2106.08254 (2021). 
https:\/\/arxiv.org\/abs\/2106.08254"},{"key":"e_1_3_2_2_3_1","volume-title":"Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu.","author":"Chen Yen-Chun","year":"2020","unstructured":"Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings. Springer, 104--120."},{"key":"e_1_3_2_2_4_1","volume-title":"2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020","author":"Cubuk Ekin D.","year":"2020","unstructured":"Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, June 14--19, 2020. 
Computer Vision Foundation \/ IEEE, 3008--3017."},{"key":"e_1_3_2_2_5_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2--7, 2019, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 4171--4186."},{"key":"e_1_3_2_2_6_1","volume-title":"9th International Conference on Learning Representations, ICLR 2021","author":"Dosovitskiy Alexey","year":"2021","unstructured":"
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3--7, 2021. OpenReview.net."},{"key":"e_1_3_2_2_7_1","unstructured":"Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, and Michael Zeng. 2021. An Empirical Study of Training End-to-End Vision-and-Language Transformers. (2021). https:\/\/arxiv.org\/abs\/2111.02387"},{"key":"e_1_3_2_2_8_1","volume-title":"Jamie Ryan Kiros, and Sanja Fidler","author":"Faghri Fartash","year":"2018","unstructured":"Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3--6, 2018. BMVA Press, 12."},{"key":"e_1_3_2_2_9_1","volume-title":"Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems","author":"Frome Andrea","year":"2013","unstructured":"Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tom\u00e1\u0161 Mikolov. 2013. 
DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5--8, 2013, Lake Tahoe, Nevada, United States. 2121--2129."},{"key":"e_1_3_2_2_10_1","unstructured":"Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-Scale Adversarial Training for Vision-and-Language Representation Learning. In NeurIPS."},{"key":"e_1_3_2_2_11_1","volume-title":"Piotr Doll\u00e1r, and Ross B. Girshick","author":"He Kaiming","year":"2021","unstructured":"Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll\u00e1r, and Ross B. Girshick. 2021. Masked Autoencoders Are Scalable Vision Learners. CoRR, Vol. abs\/2111.06377 (2021). https:\/\/arxiv.org\/abs\/2111.06377"},{"key":"e_1_3_2_2_12_1","volume-title":"Momentum Contrast for Unsupervised Visual Representation Learning. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020","author":"He Kaiming","year":"2020","unstructured":"Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. 2020. 
Momentum Contrast for Unsupervised Visual Representation Learning. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13--19, 2020. Computer Vision Foundation \/ IEEE, 9726--9735."},{"key":"e_1_3_2_2_13_1","volume-title":"Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. CoRR","author":"Huang Zhicheng","year":"2020","unstructured":"Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. CoRR, Vol. abs\/2004.00849 (2020). https:\/\/arxiv.org\/abs\/2004.00849"},{"key":"e_1_3_2_2_14_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24","volume":"139","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. 
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24 July 2021, Virtual Event, Vol. 139. PMLR, 4904--4916."},{"key":"e_1_3_2_2_15_1","volume-title":"Dhruv Batra and Stefan Lee","author":"Jiasen Lu Devi Parikh","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS."},{"key":"e_1_3_2_2_16_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24 July 2021, Virtual Event. PMLR, 5583--5594."},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_2_18_1","volume-title":"Proceedings, Part IV (Lecture Notes in Computer Science","volume":"228","author":"Lee Kuang-Huei","year":"2018","unstructured":"
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part IV (Lecture Notes in Computer Science, Vol. 11208). Springer, 212--228."},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6795"},{"key":"e_1_3_2_2_20_1","volume-title":"Shafiq Joty, Caiming Xiong, and Steven Hoi.","author":"Li Junnan","year":"2021","unstructured":"Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. 2021b. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In NeurIPS."},{"key":"e_1_3_2_2_21_1","volume-title":"Visual Semantic Reasoning for Image-Text Matching. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019","author":"Li Kunpeng","year":"2019","unstructured":"Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2019. Visual Semantic Reasoning for Image-Text Matching. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 4653--4661."},{"key":"e_1_3_2_2_22_1","volume-title":"VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR","author":"Li Liunian Harold","year":"2019","unstructured":"Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. 
VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR, Vol. abs\/1908.03557 (2019). http:\/\/arxiv.org\/abs\/1908.03557"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.202"},{"key":"e_1_3_2_2_24_1","volume-title":"Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In Computer Vision - ECCV 2020 - 16th European Conference","author":"Li Xiujun","year":"2020","unstructured":"Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020b. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings. Springer, 121--137."},{"key":"e_1_3_2_2_25_1","volume-title":"Proceedings, Part V. Springer, 740--755","author":"Lin Tsung-Yi","unstructured":"Tsung-Yi Lin, Michael Maire, Serge J. 
Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6--12, 2014, Proceedings, Part V. Springer, 740--755."},{"key":"e_1_3_2_2_26_1","volume-title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, Vol. abs\/1907.11692 (2019). http:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_3_2_2_27_1","volume-title":"Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6--9, 2019. OpenReview.net. https:\/\/openreview.net\/forum?id=Bkg6RiCqY7"},{"key":"e_1_3_2_2_28_1","volume-title":"1st International Conference on Learning Representations, ICLR","author":"Mikolov Tom\u00e1\u0161","year":"2013","unstructured":"Tom\u00e1\u0161 Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. 
Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2--4, 2013, Workshop Track Proceedings."},{"key":"e_1_3_2_2_29_1","volume-title":"Advances in Neural Information Processing Systems","volume":"24","author":"Ordonez Vicente","year":"2011","unstructured":"Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2Text: Describing Images Using 1 Million Captioned Photographs. In Advances in Neural Information Processing Systems, Vol. 24. Curran Associates, Inc."},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.303"},{"key":"e_1_3_2_2_31_1","volume-title":"ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data. CoRR","author":"Qi Di","year":"2020","unstructured":"Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. 2020. ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data. CoRR, Vol. abs\/2001.07966 (2020). https:\/\/arxiv.org\/abs\/2001.07966"},{"key":"e_1_3_2_2_32_1","volume-title":"Context-Aware Multi-View Summarization Network for Image-Text Matching. 
In MM '20: The 28th ACM International Conference on Multimedia, Virtual Event \/ Seattle, WA, USA, October 12--16","author":"Qu Leigang","year":"2020","unstructured":"Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, and Qi Tian. 2020. Context-Aware Multi-View Summarization Network for Image-Text Matching. In MM '20: The 28th ACM International Conference on Multimedia, Virtual Event \/ Seattle, WA, USA, October 12--16, 2020. ACM, 1047--1055. https:\/\/doi.org\/10.1145\/3394171.3413961"},{"key":"e_1_3_2_2_33_1","volume-title":"Dynamic Modality Interaction Modeling for Image-Text Retrieval. In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Qu Leigang","year":"2021","unstructured":"Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang Nie. 2021. Dynamic Modality Interaction Modeling for Image-Text Retrieval. In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11--15, 2021. 
ACM, 1104--1113."},{"key":"e_1_3_2_2_34_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24 July 2021, Virtual Event. PMLR, 8748--8763."},{"key":"e_1_3_2_2_35_1","series-title":"Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL","volume-title":"Long Papers. 2556--2565.","author":"Sharma Piyush","year":"2018","unstructured":"Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. 
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15--20, 2018, Volume 1: Long Papers. 2556--2565."},{"key":"e_1_3_2_2_36_1","volume-title":"Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer.","author":"Shen Sheng","year":"2021","unstructured":"Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. 2021. How Much Can CLIP Benefit Vision-and-Language Tasks? CoRR, Vol. abs\/2107.06383 (2021). https:\/\/arxiv.org\/abs\/2107.06383"},{"key":"e_1_3_2_2_37_1","volume-title":"VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In 8th International Conference on Learning Representations, ICLR 2020","author":"Su Weijie","year":"2020","unstructured":"Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26--30, 2020. OpenReview.net. 
https:\/\/openreview.net\/forum?id=SygXPaEYvH"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_2_2_39_1","volume-title":"Multi-Modality Cross Attention Network for Image and Sentence Matching. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020","author":"Wei Xi","year":"2020","unstructured":"Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, and Feng Wu. 2020. Multi-Modality Cross Attention Network for Image and Sentence Matching. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13--19, 2020. Computer Vision Foundation \/ IEEE, 10938--10947. https:\/\/openaccess.thecvf.com\/content_CVPR_2020\/html\/Wei_Multi-Modality_Cross_Attention_Network_for_Image_and_Sentence_Matching_CVPR_2020_paper.html"},{"key":"e_1_3_2_2_40_1","volume-title":"Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training. In NeurIPS.","author":"Xue Hongwei","year":"2021","unstructured":"Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, and Jiebo Luo. 2021. Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training. 
In NeurIPS."},{"key":"e_1_3_2_2_41_1","volume-title":"VinVL: Revisiting Visual Representations in Vision-Language Models. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021","author":"Zhang Pengchuan","year":"2021","unstructured":"Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. VinVL: Revisiting Visual Representations in Vision-Language Models. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19--25, 2021. Computer Vision Foundation \/ IEEE, 5579--5588."},{"key":"e_1_3_2_2_42_1","volume-title":"Context-Aware Attention Network for Image-Text Retrieval. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020","author":"Zhang Qi","year":"2020","unstructured":"Qi Zhang, Zhen Lei, Zhaoxiang Zhang, and Stan Z. Li. 2020. Context-Aware Attention Network for Image-Text Retrieval. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13--19, 2020. 
Computer Vision Foundation \/ IEEE, 3533--3542."},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3383184"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548320","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548320","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:43Z","timestamp":1750186843000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548320"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":43,"alternative-id":["10.1145\/3503161.3548320","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548320","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}