{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T19:17:54Z","timestamp":1762543074340,"version":"3.41.0"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"2s","license":[{"start":{"date-parts":[[2019,4,30]],"date-time":"2019-04-30T00:00:00Z","timestamp":1556582400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Key Research Program of Frontier Sciences, CAS","award":["QYZDJSSWJSC039"],"award-info":[{"award-number":["QYZDJSSWJSC039"]}]},{"name":"Research Program of National Laboratory of Pattern Recognition","award":["Z-2018007"],"award-info":[{"award-number":["Z-2018007"]}]},{"name":"CCF-Tencent Open Fund"},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61702511, 61720106006, 61620106003, 61432019, 61632007, U1705262, and U1836220"],"award-info":[{"award-number":["61702511, 61720106006, 61620106003, 61432019, 61632007, U1705262, and U1836220"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"National Key Research and Development Program of China","award":["2017YFB1002804"],"award-info":[{"award-number":["2017YFB1002804"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,4,30]]},"abstract":"<jats:p>Image captioning and visual question answering are typical tasks that connect computer vision and natural language processing. Both of them need to effectively represent the visual content using computer vision methods and smoothly process the text sentence using natural language processing skills. The key problem of these two tasks is to infer the target result based on the interactive understanding of the word sequence and the image. Though they practically use similar algorithms, they are studied independently in the past few years. In this article, we attempt to exploit the mutual correlation between these two tasks. We propose the first VQA-improved image-captioning method that transfers the knowledge learned from the VQA corpora to the image-captioning task. A VQA model is first pretrained on image--question--answer instances. Then, the pretrained VQA model is used to extract VQA-grounded semantic representations according to selected free-form open-ended visual question--answer pairs. The VQA-grounded features are complementary to the visual features, because they interpret images from a different perspective. We incorporate the VQA model into the image-captioning model by adaptively fusing the VQA-grounded feature and the attended visual feature. We show that such simple VQA-improved image-captioning (VQA-IIC) models perform better than conventional image-captioning methods on large-scale public datasets.<\/jats:p>","DOI":"10.1145\/3313873","type":"journal-article","created":{"date-parts":[[2019,7,19]],"date-time":"2019-07-19T13:17:14Z","timestamp":1563542234000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Image Captioning by Asking Questions"],"prefix":"10.1145","volume":"15","author":[{"given":"Xiaoshan","family":"Yang","sequence":"first","affiliation":[{"name":"Institute of Automation, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China"}]},{"given":"Changsheng","family":"Xu","sequence":"additional","affiliation":[{"name":"Institute of Automation, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2019,7,19]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-1181"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.12"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization@ACL","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization@ACL 2005. 65--72."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/944919.944965"},{"key":"e_1_2_1_7_1","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'16)","author":"Chen Kan","year":"2015","unstructured":"Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. 2015. ABC-CNN: An attention based convolutional neural network for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'16). 1--10."},{"key":"e_1_2_1_8_1","volume-title":"Microsoft COCO captions: Data collection and evaluation server. CoRR abs\/1504.00325","author":"Chen Xinlei","year":"2015","unstructured":"Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR abs\/1504.00325 (2015). arxiv:1504.00325 http:\/\/arxiv.org\/abs\/1504.00325"},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201915)","author":"Chen Xinlei","key":"e_1_2_1_9_1","unstructured":"Xinlei Chen and C. Lawrence Zitnick. 2015. Mind\u2019s eye: A recurrent visual representation for image caption generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201915). 2422--2431."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/645318.649254"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/1888089.1888092"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999792.2999849"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1044"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969442.2969496"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.1422953112"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0658-4"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2614132"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2710635"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2043612.2043613"},{"key":"e_1_2_1_21_1","volume-title":"Compositional memory for visual question answering. CoRR abs\/1511.05676","author":"Jiang Aiwen","year":"2015","unstructured":"Aiwen Jiang, Fang Wang, Fatih Porikli, and Yi Li. 2015. Compositional memory for visual question answering. CoRR abs\/1511.05676 (2015)."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969038"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/3157096.3157137"},{"key":"e_1_2_1_25_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs\/1412.6980 (2014)."},{"key":"e_1_2_1_26_1","volume-title":"Zemel","author":"Kiros Ryan","year":"2014","unstructured":"Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. CoRR abs\/1411.2539 (2014)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00188"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045118.3045340"},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the Workshop on Text Summarization Branches Out (WAS\u201904)","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS\u201904)."},{"volume-title":"Proceedings of the 13th European Conference on Computer Vision (ECCV\u201914)","author":"Lin Tsung-Yi","key":"e_1_2_1_30_1","unstructured":"Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision (ECCV\u201914). 740--755."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46475-6_17"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.5555\/3016387.3016405"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/2968826.2969014"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.9"},{"key":"e_1_2_1_36_1","volume-title":"Yuille","author":"Mao Junhua","year":"2014","unstructured":"Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). CoRR abs\/1412.6632 (2014)."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.11"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.303"},{"key":"e_1_2_1_40_1","volume-title":"Zemel","author":"Ren Mengye","year":"2015","unstructured":"Mengye Ren, Ryan Kiros, and Richard S. Zemel. 2015. Image question answering: A visual semantic embedding model and a new dataset. CoRR abs\/1505.02074 (2015)."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.128"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.499"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00177"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2014.29"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.130"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.5555\/3171642.3171825"},{"key":"e_1_2_1_50_1","volume-title":"Dick","author":"Wang Peng","year":"2016","unstructured":"Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, and Anthony R. Dick. 2016. FVQA: Fact-based visual question answering. CoRR abs\/1606.05433 (2016)."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2017.05.001"},{"volume-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"Wu Qi","key":"e_1_2_1_52_1","unstructured":"Qi Wu, Peng Wang, Chunhua Shen, Anthony R. Dick, and Anton van den Hengel. 2016. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916). 4622--4630."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240640"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045390.3045643"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46478-7_28"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.5555\/3045118.3045336"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2846664"},{"volume-title":"Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"Yang Zichao","key":"e_1_2_1_58_1","unstructured":"Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alexander J. Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916). 21--29."},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.559"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.540"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3313873","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3313873","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:23:46Z","timestamp":1750202626000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3313873"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,4,30]]},"references-count":62,"journal-issue":{"issue":"2s","published-print":{"date-parts":[[2019,4,30]]}},"alternative-id":["10.1145\/3313873"],"URL":"https:\/\/doi.org\/10.1145\/3313873","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2019,4,30]]},"assertion":[{"value":"2018-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-07-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}