{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T02:38:05Z","timestamp":1774579085347,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":34,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"the National Natural Science Foundation of China","award":["62102384"],"award-info":[{"award-number":["62102384"]}]},{"name":"the Science Fund for Creative Research Groups","award":["62121002"],"award-info":[{"award-number":["62121002"]}]},{"name":"the National Key Research and Development Program of China","award":["2020YFB1406603"],"award-info":[{"award-number":["2020YFB1406603"]}]},{"name":"the Microsoft Research Asia"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547883","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:35Z","timestamp":1665416555000},"page":"4365-4373","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":15,"title":["Fine-tuning with Multi-modal Entity Prompts for News Image Captioning"],"prefix":"10.1145","author":[{"given":"Jingjing","family":"Zhang","sequence":"first","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]},{"given":"Shancheng","family":"Fang","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]},{"given":"Zhendong","family":"Mao","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, Hefei, China"}]},{"given":"Zhiwei","family":"Zhang","sequence":"additional","affiliation":[{"name":"Kuaishou Technology, Beijing, China"}]},{"given":"Yongdong","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China &amp; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_2_2_1","volume-title":"Contrastive Language-Image Pre-training for the Italian Language. arXiv preprint arXiv:2108.08688","author":"Bianchi Federico","year":"2021","unstructured":"Federico Bianchi , Giuseppe Attanasio , Raphael Pisoni , Silvia Terragni , Gabriele Sarti , and Sri Lakshmi . 2021. Contrastive Language-Image Pre-training for the Italian Language. arXiv preprint arXiv:2108.08688 ( 2021 ). Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Silvia Terragni, Gabriele Sarti, and Sri Lakshmi. 2021. Contrastive Language-Image Pre-training for the Italian Language. arXiv preprint arXiv:2108.08688 (2021)."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01275"},{"key":"e_1_3_2_2_4_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell etal 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901.  Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_3_2_2_5_1","volume-title":"Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu.","author":"Chen Yen-Chun","year":"2019","unstructured":"Yen-Chun Chen , Linjie Li , Licheng Yu , Ahmed El Kholy , Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019 . Uniter : Learning universal image-text representations. (2019). Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Uniter: Learning universal image-text representations. (2019)."},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3348"},{"key":"e_1_3_2_2_7_1","volume-title":"Stylegan-nada: Clip-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946","author":"Gal Rinon","year":"2021","unstructured":"Rinon Gal , Or Patashnik , Haggai Maron , Gal Chechik , and Daniel Cohen-Or . 2021 . Stylegan-nada: Clip-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946 (2021). Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. 2021. Stylegan-nada: Clip-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946 (2021)."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413576"},{"key":"e_1_3_2_2_9_1","volume-title":"International Conference on Machine Learning. PMLR, 4904--4916","author":"Jia Chao","year":"2021","unstructured":"Chao Jia , Yinfei Yang , Ye Xia , Yi-Ting Chen , Zarana Parekh , Hieu Pham , Quoc Le , Yun-Hsuan Sung , Zhen Li , and Tom Duerig . 2021 . Scaling up visual and visionlanguage representation learning with noisy text supervision . In International Conference on Machine Learning. PMLR, 4904--4916 . Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and visionlanguage representation learning with noisy text supervision. In International Conference on Machine Learning. PMLR, 4904--4916."},{"key":"e_1_3_2_2_10_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_2_11_1","volume-title":"The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691","author":"Lester Brian","year":"2021","unstructured":"Brian Lester , Rami Al-Rfou , and Noah Constant . 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 ( 2021 ). Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021)."},{"key":"e_1_3_2_2_12_1","volume-title":"Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461","author":"Lewis Mike","year":"2019","unstructured":"Mike Lewis , Yinhan Liu , Naman Goyal , Marjan Ghazvininejad , Abdelrahman Mohamed , Omer Levy , Ves Stoyanov , and Luke Zettlemoyer . 2019 . Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019). Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)."},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6795"},{"key":"e_1_3_2_2_14_1","volume-title":"Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190","author":"Li Xiang Lisa","year":"2021","unstructured":"Xiang Lisa Li and Percy Liang . 2021 . Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021). Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021)."},{"key":"e_1_3_2_2_15_1","volume-title":"Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74--81.","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin . 2004 . Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74--81. Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74--81."},{"key":"e_1_3_2_2_16_1","volume-title":"Visual News: Benchmark and Challenges in News Image Captioning. arXiv preprint arXiv:2010.03743","author":"Liu Fuxiao","year":"2020","unstructured":"Fuxiao Liu , Yinghan Wang , Tianlu Wang , and Vicente Ordonez . 2020 . Visual News: Benchmark and Challenges in News Image Captioning. arXiv preprint arXiv:2010.03743 (2020). Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Ordonez. 2020. Visual News: Benchmark and Challenges in News Image Captioning. arXiv preprint arXiv:2010.03743 (2020)."},{"key":"e_1_3_2_2_17_1","volume-title":"prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586","author":"Liu Pengfei","year":"2021","unstructured":"Pengfei Liu , Weizhe Yuan , Jinlan Fu , Zhengbao Jiang , Hiroaki Hayashi , and Graham Neubig . 2021. Pre-train , prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586 ( 2021 ). Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586 (2021)."},{"key":"e_1_3_2_2_18_1","volume-title":"Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu , Dhruv Batra , Devi Parikh , and Stefan Lee . 2019 . Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32 (2019). Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_2_2_19_1","volume-title":"Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734","author":"Mokady Ron","year":"2021","unstructured":"Ron Mokady , Amir Hertz , and Amit H Bermano . 2021 . Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734 (2021). Ron Mokady, Amir Hertz, and Amit H Bermano. 2021. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734 (2021)."},{"key":"e_1_3_2_2_20_1","volume-title":"fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038","author":"Ott Myle","year":"2019","unstructured":"Myle Ott , Sergey Edunov , Alexei Baevski , Angela Fan , Sam Gross , Nathan Ng , David Grangier , and Michael Auli . 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038 ( 2019 ). Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038 (2019)."},{"key":"e_1_3_2_2_21_1","volume-title":"Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311--318","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni , Salim Roukos , Todd Ward , and Wei-Jing Zhu . 2002 . Bleu: a method for automatic evaluation of machine translation . In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311--318 . Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311--318."},{"key":"e_1_3_2_2_22_1","unstructured":"Adam Paszke Sam Gross Soumith Chintala Gregory Chanan Edward Yang Zachary DeVito Zeming Lin Alban Desmaison Luca Antiga and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).  Adam Paszke Sam Gross Soumith Chintala Gregory Chanan Edward Yang Zachary DeVito Zeming Lin Alban Desmaison Luca Antiga and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017)."},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00209"},{"key":"e_1_3_2_2_24_1","volume-title":"Language models as knowledge bases? arXiv preprint arXiv:1909.01066","author":"Petroni Fabio","year":"2019","unstructured":"Fabio Petroni , Tim Rockt\u00e4schel , Patrick Lewis , Anton Bakhtin , Yuxiang Wu , Alexander H Miller , and Sebastian Riedel . 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066 ( 2019 ). Fabio Petroni, Tim Rockt\u00e4schel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066 (2019)."},{"key":"e_1_3_2_2_25_1","unstructured":"Alec Radford JeffreyWu Rewon Child David Luan Dario Amodei Ilya Sutskever etal 2019. Language models are unsupervised multitask learners. OpenAI blog 1 8 (2019) 9.  Alec Radford JeffreyWu Rewon Child David Luan Dario Amodei Ilya Sutskever et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1 8 (2019) 9."},{"key":"e_1_3_2_2_26_1","volume-title":"Breakingnews: Article annotation by image and text processing","author":"Ramisa Arnau","year":"2017","unstructured":"Arnau Ramisa , Fei Yan , Francesc Moreno-Noguer , and Krystian Mikolajczyk . 2017 . Breakingnews: Article annotation by image and text processing . IEEE transactions on pattern analysis and machine intelligence 40, 5 (2017), 1072--1085. Arnau Ramisa, Fei Yan, Francesc Moreno-Noguer, and Krystian Mikolajczyk. 2017. Breakingnews: Article annotation by image and text processing. IEEE transactions on pattern analysis and machine intelligence 40, 5 (2017), 1072--1085."},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_2_2_28_1","volume-title":"Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676","author":"Schick Timo","year":"2020","unstructured":"Timo Schick and Hinrich Sch\u00fctze . 2020. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676 ( 2020 ). Timo Schick and Hinrich Sch\u00fctze. 2020. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676 (2020)."},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01305"},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_2_32_1","volume-title":"Journalistic Guidelines Aware News Image Captioning. arXiv preprint arXiv:2109.02865","author":"Yang Xuewen","year":"2021","unstructured":"Xuewen Yang , Svebor Karaman , Joel Tetreault , and Alex Jaimes . 2021. Journalistic Guidelines Aware News Image Captioning. arXiv preprint arXiv:2109.02865 ( 2021 ). Xuewen Yang, Svebor Karaman, Joel Tetreault, and Alex Jaimes. 2021. Journalistic Guidelines Aware News Image Captioning. arXiv preprint arXiv:2109.02865 (2021)."},{"key":"e_1_3_2_2_33_1","volume-title":"International Conference on Machine Learning. PMLR, 11328--11339","author":"Zhang Jingqing","year":"2020","unstructured":"Jingqing Zhang , Yao Zhao , Mohammad Saleh , and Peter Liu . 2020 . Pegasus: Pre-training with extracted gap-sentences for abstractive summarization . In International Conference on Machine Learning. PMLR, 11328--11339 . Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning. PMLR, 11328--11339."},{"key":"e_1_3_2_2_34_1","volume-title":"Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph. arXiv preprint arXiv:2107.11970","author":"Zhao Wentian","year":"2021","unstructured":"Wentian Zhao , Yao Hu , Heda Wang , Xinxiao Wu , and Jiebo Luo . 2021. Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph. arXiv preprint arXiv:2107.11970 ( 2021 ). Wentian Zhao, Yao Hu, Heda Wang, Xinxiao Wu, and Jiebo Luo. 2021. Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph. arXiv preprint arXiv:2107.11970 (2021)."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547883","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547883","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:30Z","timestamp":1750186830000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547883"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":34,"alternative-id":["10.1145\/3503161.3547883","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547883","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}