{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T08:56:25Z","timestamp":1774947385548,"version":"3.50.1"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2021,5,11]],"date-time":"2021-05-11T00:00:00Z","timestamp":1620691200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2021,5,31]]},"abstract":"<jats:p>Most existing image captioning methods use only the visual information of the image to guide the generation of captions, lack the guidance of effective scene semantic information, and the current visual attention mechanism cannot adjust the focus intensity on the image. In this article, we first propose an improved visual attention model. At each timestep, we calculated the focus intensity coefficient of the attention mechanism through the context information of the model, then automatically adjusted the focus intensity of the attention mechanism through the coefficient to extract more accurate visual information. In addition, we represented the scene semantic knowledge of the image through topic words related to the image scene, then added them to the language model. We used the attention mechanism to determine the visual information and scene semantic information that the model pays attention to at each timestep and combined them to enable the model to generate more accurate and scene-specific captions. Finally, we evaluated our model on Microsoft COCO (MSCOCO) and Flickr30k standard datasets. 
The experimental results show that our approach generates more accurate captions and outperforms many recent advanced models in various evaluation metrics.<\/jats:p>","DOI":"10.1145\/3439734","type":"journal-article","created":{"date-parts":[[2021,5,12]],"date-time":"2021-05-12T00:56:03Z","timestamp":1620780963000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":51,"title":["Integrating Scene Semantic Knowledge into Image Captioning"],"prefix":"10.1145","volume":"17","author":[{"given":"Haiyang","family":"Wei","sequence":"first","affiliation":[{"name":"Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5313-6134","authenticated-orcid":false,"given":"Zhixin","family":"Li","sequence":"additional","affiliation":[{"name":"Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Feicheng","family":"Huang","sequence":"additional","affiliation":[{"name":"Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Canlong","family":"Zhang","sequence":"additional","affiliation":[{"name":"Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Huifang","family":"Ma","sequence":"additional","affiliation":[{"name":"College of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhongzhi","family":"Shi","sequence":"additional","affiliation":[{"name":"Key Laboratory of Intelligent Information 
Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,5,11]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_2_1_3_1","volume-title":"Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473","author":"Bahdanau Dzmitry","year":"2014"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. 65--72","author":"Banerjee Satanjeev","year":"2005"},{"key":"e_1_2_1_5_1","first-page":"993","article-title":"Latent Dirichlet allocation","author":"Blei David M.","year":"2003","journal-title":"J. Mach. Learn. Res. 3"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.667"},{"key":"e_1_2_1_7_1","volume-title":"Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio.","author":"Cho Kyunghyun","year":"2014"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.323"},{"key":"e_1_2_1_9_1","unstructured":"Jifeng Dai Yi Li Kaiming He and Jian Sun. 2016. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems. 379--387."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2642953"},{"key":"e_1_2_1_11_1","unstructured":"Ian Goodfellow Jean Pouget-Abadie Mehdi Mirza Bing Xu David Warde-Farley Sherjil Ozair Aaron Courville and Yoshua Bengio. 2014. Generative adversarial nets. 
In Advances in Neural Information Processing Systems. 2672--2680."},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence. 6837--6844","author":"Gu Jiuxiang","year":"2018"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.138"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201918)","author":"Jiang Y. G."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_2_1_17_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2014"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the ACL Workshop on Text Summarization Branches Out. 74--81","author":"Lin Chin-Yew","year":"2004"},{"key":"e_1_2_1_20_1","volume-title":"Learning when to concentrate or divert attention: Self-adaptive attention temperature for neural machine translation. arXiv preprint arXiv:1808.07374","author":"Lin Junyang","year":"2018"},{"key":"e_1_2_1_21_1","volume-title":"Context-aware visual policy network for sequence-level image captioning. 
arXiv preprint arXiv:1808.05864","author":"Liu Daqing","year":"2018"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.100"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_21"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00754"},{"key":"e_1_2_1_26_1","volume-title":"Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632","author":"Mao Junhua","year":"2014"},{"key":"e_1_2_1_27_1","volume-title":"Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784","author":"Mirza Mehdi","year":"2014"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the 40th Meeting on Association for Computational Linguistics. 311--318","author":"Papineni Kishore","year":"2002"},{"key":"e_1_2_1_29_1","volume-title":"Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732","author":"Ranzato Marc\u2019Aurelio","year":"2015"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_2_1_31_1","volume-title":"Le","author":"Sutskever Ilya","year":"2014"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2708709"},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the International Conference on Machine Learning. 
2048--2057","author":"Xu Kelvin","year":"2015"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00435"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.524"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3439734","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3439734","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:01:52Z","timestamp":1750197712000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3439734"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,11]]},"references-count":40,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,5,31]]}},"alternative-id":["10.1145\/3439734"],"URL":"https:\/\/doi.org\/10.1145\/3439734","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,5,11]]},"assertion":[{"value":"2019-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-05-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}