{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T05:24:30Z","timestamp":1755926670327,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":36,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,6,27]],"date-time":"2022-06-27T00:00:00Z","timestamp":1656288000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Science and Technology Commission of Shanghai Municipality Grant","award":["No.20dz1200600, 21QA1400600, GWV-1.1, 21511101000"],"award-info":[{"award-number":["No.20dz1200600, 21QA1400600, GWV-1.1, 21511101000"]}]},{"name":"Zhejiang Lab","award":["No.2019KD0AD01"],"award-info":[{"award-number":["No.2019KD0AD01"]}]},{"name":"Natural Science Foundation of China","award":["No.6217020551"],"award-info":[{"award-number":["No.6217020551"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,6,27]]},"DOI":"10.1145\/3512527.3531368","type":"proceedings-article","created":{"date-parts":[[2022,6,23]],"date-time":"2022-06-23T22:23:32Z","timestamp":1656023012000},"page":"137-145","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval"],"prefix":"10.1145","author":[{"given":"Zhihao","family":"Fan","sequence":"first","affiliation":[{"name":"Fudan University, Shanghai, UNK, China"}]},{"given":"Zhongyu","family":"Wei","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, UNK, China"}]},{"given":"Zejun","family":"Li","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, UNK, China"}]},{"given":"Siyuan","family":"Wang","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, UNK, China"}]},{"given":"Haijun","family":"Shan","sequence":"additional","affiliation":[{"name":"Zhejiang Lab, Shanghai, China"}]},{"given":"Xuanjing","family":"Huang","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, UNK, China"}]},{"given":"Jianqing","family":"Fan","sequence":"additional","affiliation":[{"name":"Princeton University, Princeton, NJ, USA"}]}],"member":"320","published-online":{"date-parts":[[2022,6,27]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Spice: Semantic propositional image caption evaluation. In ECCV.","author":"Anderson Peter","year":"2016","unstructured":"Peter Anderson , Basura Fernando , Mark Johnson , and Stephen Gould . 2016 . Spice: Semantic propositional image caption evaluation. In ECCV. Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In ECCV."},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"crossref","unstructured":"Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.  Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. 
In CVPR.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_2_3_1","volume-title":"Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu.","author":"Chen Yen-Chun","year":"2020","unstructured":"Yen-Chun Chen , Linjie Li , Licheng Yu , Ahmed El Kholy , Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020 . Uniter : Universal image-text representation learning. In ECCV. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. In ECCV."},{"key":"e_1_3_2_2_4_1","volume-title":"Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems 29","author":"Defferrard Michael","year":"2016","unstructured":"Michael Defferrard , Xavier Bresson , and Pierre Vandergheynst . 2016. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems 29 ( 2016 ), 3844--3852. Michael Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems 29 (2016), 3844--3852."},{"key":"e_1_3_2_2_5_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , Volume 1 (Long and Short Papers). 4171--4186. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171--4186."},{"key":"e_1_3_2_2_6_1","volume-title":"Proceedings of the British Machine Vision Conference (BMVC). https:\/\/github.com\/fartashf\/vsepp","author":"Faghri Fartash","year":"2018","unstructured":"Fartash Faghri , David J Fleet , Jamie Ryan Kiros , and Sanja Fidler . 2018 . VSE++: Improving Visual-Semantic Embeddings with Hard Negatives . In Proceedings of the British Machine Vision Conference (BMVC). https:\/\/github.com\/fartashf\/vsepp Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In Proceedings of the British Machine Vision Conference (BMVC). https:\/\/github.com\/fartashf\/vsepp"},{"key":"e_1_3_2_2_7_1","volume-title":"Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval. arXiv preprint arXiv:2111.03349","author":"Fan Zhihao","year":"2021","unstructured":"Zhihao Fan , Zhongyu Wei , Zejun Li , Siyuan Wang , and Jianqing Fan . 2021. Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval. arXiv preprint arXiv:2111.03349 ( 2021 ). Zhihao Fan, Zhongyu Wei, Zejun Li, Siyuan Wang, and Jianqing Fan. 2021. Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval. 
arXiv preprint arXiv:2111.03349 (2021)."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1652"},{"key":"e_1_3_2_2_9_1","volume-title":"TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. arXiv preprint arXiv:2106.10936","author":"Fan Zhihao","year":"2021","unstructured":"Zhihao Fan , Zhongyu Wei , Siyuan Wang , Ruize Wang , Zejun Li , Haijun Shan , and Xuanjing Huang . 2021 . TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. arXiv preprint arXiv:2106.10936 (2021). Zhihao Fan, Zhongyu Wei, Siyuan Wang, Ruize Wang, Zejun Li, Haijun Shan, and Xuanjing Huang. 2021. TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. arXiv preprint arXiv:2106.10936 (2021)."},{"key":"e_1_3_2_2_10_1","volume-title":"Marc' Aurelio Ranzato, and Tomas Mikolov","author":"Frome Andrea","year":"2013","unstructured":"Andrea Frome , Greg S Corrado , Jon Shlens , Samy Bengio , Jeff Dean , Marc' Aurelio Ranzato, and Tomas Mikolov . 2013 . DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26 . Curran Associates, Inc ., 2121--2129. https:\/\/proceedings.neurips.cc\/paper\/2013\/file\/7cce53cf90577442771720a370c3c723-Paper.pdf Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc' Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 26. Curran Associates, Inc., 2121--2129. https:\/\/proceedings.neurips.cc\/paper\/2013\/file\/7cce53cf90577442771720a370c3c723-Paper.pdf"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00585"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"crossref","unstructured":"Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.  Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_2_13_1","volume-title":"Adam: A Method for Stochastic Optimization. In ICLR (Poster).","author":"Kingma Diederik P","year":"2015","unstructured":"Diederik P Kingma and Jimmy Ba . 2015 . Adam: A Method for Stochastic Optimization. In ICLR (Poster). Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR (Poster)."},{"key":"e_1_3_2_2_14_1","volume-title":"Unifying visual-semantic embeddings with multimodal neural language models. CoRR abs\/1411.2539","author":"Kiros Ryan","year":"2014","unstructured":"Ryan Kiros , Ruslan Salakhutdinov , and Richard S Zemel . 2014. Unifying visual-semantic embeddings with multimodal neural language models. CoRR abs\/1411.2539 ( 2014 ). Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. CoRR abs\/1411.2539 (2014)."},{"key":"e_1_3_2_2_15_1","volume-title":"Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV","author":"Krishna Ranjay","year":"2017","unstructured":"Ranjay Krishna , Yuke Zhu , Oliver Groth , Justin Johnson , Kenji Hata , Joshua Kravitz , Stephanie Chen , Yannis Kalantidis , Li-Jia Li , David A Shamma , 2017 . 
Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV (2017). Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV (2017)."},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6795"},{"key":"e_1_3_2_2_18_1","unstructured":"Kunpeng Li Yulun Zhang Kai Li Yuanyuan Li and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In ICCV.  Kunpeng Li Yulun Zhang Kai Li Yuanyuan Li and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In ICCV."},{"key":"e_1_3_2_2_19_1","unstructured":"Tsung-Yi Lin Michael Maire Serge Belongie James Hays Pietro Perona Deva Ramanan Piotr Doll\u00e1r and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV.  Tsung-Yi Lin Michael Maire Serge Belongie James Hays Pietro Perona Deva Ramanan Piotr Doll\u00e1r and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV."},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350869"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01093"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.232"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.303"},{"key":"e_1_3_2_2_24_1","volume-title":"100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250","author":"Rajpurkar Pranav","year":"2016","unstructured":"Pranav Rajpurkar , Jian Zhang , Konstantin Lopyrev , and Percy Liang . 2016. Squad : 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 ( 2016 ). Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)."},{"key":"e_1_3_2_2_25_1","unstructured":"Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS.  Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS."},{"key":"e_1_3_2_2_26_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS.  Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS."},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093614"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"crossref","unstructured":"Yaxiong Wang Hao Yang Xueming Qian Lin Ma Jing Lu Biao Li and Xin Fan. 2019. Position focused attention network for image-text matching. In IJCAI.  Yaxiong Wang Hao Yang Xueming Qian Lin Ma Jing Lu Biao Li and Xin Fan. 2019. Position focused attention network for image-text matching. In IJCAI.","DOI":"10.24963\/ijcai.2019\/526"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00586"},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"crossref","unstructured":"Jiwei Wei Xing Xu Zheng Wang and Guoqing Wang. 2021. 
Meta Self-Paced Learning for Cross-Modal Matching. In ACM MM.  Jiwei Wei Xing Xu Zheng Wang and Guoqing Wang. 2021. Meta Self-Paced Learning for Cross-Modal Matching. In ACM MM.","DOI":"10.1145\/3474085.3475451"},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01095"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.3301387"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"crossref","unstructured":"Xu Yang Kaihua Tang Hanwang Zhang and Jianfei Cai. 2019. Auto-encoding scene graphs for image captioning. In CVPR.  Xu Yang Kaihua Tang Hanwang Zhang and Jianfei Cai. 2019. Auto-encoding scene graphs for image captioning. In CVPR.","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00601"},{"key":"e_1_3_2_2_35_1","volume-title":"Ernie-VIL: Knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934","author":"Yu Fei","year":"2020","unstructured":"Fei Yu , Jiji Tang , Weichong Yin , Yu Sun , Hao Tian , Hua Wu , and Haifeng Wang . 2020. Ernie-VIL: Knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934 ( 2020 ). Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. Ernie-VIL: Knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934 (2020)."},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00553"}],"event":{"name":"ICMR '22: International Conference on Multimedia Retrieval","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Newark NJ USA","acronym":"ICMR '22"},"container-title":["Proceedings of the 2022 International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3512527.3531368","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3512527.3531368","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:12Z","timestamp":1750188612000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3512527.3531368"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,27]]},"references-count":36,"alternative-id":["10.1145\/3512527.3531368","10.1145\/3512527"],"URL":"https:\/\/doi.org\/10.1145\/3512527.3531368","relation":{},"subject":[],"published":{"date-parts":[[2022,6,27]]},"assertion":[{"value":"2022-06-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
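A minimal sketch, assuming Python 3 with the requests package installed, of how a Crossref work record like the one above can be fetched and read; the endpoint is the public Crossref REST API (api.crossref.org/works/{DOI}), and every field name used below appears in the record itself:

import requests

# Fetch the Crossref work record for this paper by its DOI.
doi = "10.1145/3512527.3531368"
resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]  # the payload sits under the "message" key

# Read a few of the fields deposited for this record.
print(work["title"][0])              # Crossref stores titles as a list
print(work["DOI"], work["page"])     # "10.1145/3512527.3531368", "137-145"
print(work["references-count"])      # 36
for author in work["author"]:        # ordered author list with affiliations
    print(author["given"], author["family"])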