{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T16:01:01Z","timestamp":1774540861195,"version":"3.50.1"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2019,1,23]],"date-time":"2019-01-23T00:00:00Z","timestamp":1548201600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Science and Technology Program of Guangzhou of China","award":["201704020180 and 201604020024"],"award-info":[{"award-number":["201704020180 and 201604020024"]}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities of China","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61673402, 61273270, and 60802069"],"award-info":[{"award-number":["61673402, 61273270, and 60802069"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100003453","name":"Natural Science Foundation of Guangdong Province","doi-asserted-by":"crossref","award":["2017A030311029, 2016B010123005 and 2017B090909005"],"award-info":[{"award-number":["2017A030311029, 2016B010123005 and 2017B090909005"]}],"id":[{"id":"10.13039\/501100003453","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2019,2,28]]},"abstract":"<jats:p>In this article, we propose a novel Visual-Semantic Double Attention (VSDA) model for image captioning. In our approach, VSDA consists of two parts: a modified visual attention model is used to extract sub-region image features, then a new SEmantic Attention (SEA) model is proposed to distill semantic features. Traditional attribute-based models always neglect the distinctive importance of each attribute word and fuse all of them into recurrent neural networks, resulting in abundant irrelevant semantic features. In contrast, at each timestep, our model selects the most relevant word that aligns with current context. In other words, the real power of VSDA lies in the ability of not only leveraging semantic features but also eliminating the influence of irrelevant attribute words to make the semantic guidance more precise. Furthermore, our approach solves the problem that visual attention models cannot boost generating non-visual words. Considering that visual and semantic features are complementary to each other, our model can leverage both of them to strengthen the generations of visual and non-visual words. Extensive experiments are conducted on famous datasets: MS COCO and Flickr30k. 
The results show that VSDA outperforms other methods and achieves promising performance.<\/jats:p>","DOI":"10.1145\/3292058","type":"journal-article","created":{"date-parts":[[2019,1,23]],"date-time":"2019-01-23T13:02:14Z","timestamp":1548248534000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":21,"title":["Image Captioning With Visual-Semantic Double Attention"],"prefix":"10.1145","volume":"15","author":[{"given":"Chen","family":"He","sequence":"first","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-Sen University, Guangdong, People's Republic of China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4884-323X","authenticated-orcid":false,"given":"Haifeng","family":"Hu","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-Sen University, Guangdong, People's Republic of China"}]}],"member":"320","published-online":{"date-parts":[[2019,1,23]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. 65--72","author":"Banerjee Satanjeev","year":"2005"},{"key":"e_1_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, et al. 2015. Language models for image captioning: The quirks and what works. arXiv:1505.01809.","DOI":"10.3115\/v1\/P15-2017"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1292--1302","author":"Elliott Desmond","year":"2013"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/1888089.1888092"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.138"},{"key":"e_1_2_1_7_1","volume-title":"Image captioning with text-based visual attention. Neural Processing Letters","author":"He Chen","year":"2018"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_2_1_10_1","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.","author":"Kingma Diederik P.","year":"2014"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.162"},{"key":"e_1_2_1_12_1","unstructured":"R\u00e9mi Lebret, Pedro O. Pinheiro, and Ronan Collobert. 2014. Simple image description generator via a linear phrase-based approach. arXiv:1412.8419."},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL","author":"Lin Chin-Yew","year":"2004"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.100"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_2_1_17_1","unstructured":"Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems. 289--297."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.291"},{"key":"e_1_2_1_19_1","unstructured":"Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632."},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. 747--756","author":"Mitchell Margaret","year":"2012"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.445"},{"key":"e_1_2_1_24_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the International Conference on Machine Learning. 2397--2406","author":"Xiong Caiming","year":"2016"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the International Conference on Machine Learning.
2048--2057","author":"Xu Kelvin","year":"2015"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.10"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.524"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3292058","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3292058","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:57:52Z","timestamp":1750208272000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3292058"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,1,23]]},"references-count":32,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,2,28]]}},"alternative-id":["10.1145\/3292058"],"URL":"https:\/\/doi.org\/10.1145\/3292058","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,1,23]]},"assertion":[{"value":"2018-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-01-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}