{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,7]],"date-time":"2026-01-07T07:59:28Z","timestamp":1767772768701,"version":"build-2065373602"},"reference-count":47,"publisher":"MDPI AG","issue":"23","license":[{"start":{"date-parts":[[2021,11,30]],"date-time":"2021-11-30T00:00:00Z","timestamp":1638230400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"the National Key Research and Development Program of China","award":["No. 2021YFB2206200"],"award-info":[{"award-number":["No. 2021YFB2206200"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Image captioning generates written descriptions of an image. In recent image captioning research, attention regions seldom cover all objects, and generated captions may lack the details of objects and may remain far from reality. In this paper, we propose a word guided attention (WGA) method for image captioning. First, WGA extracts word information using the embedded word and memory cell by applying transformation and multiplication. Then, WGA applies word information to the attention results and obtains the attended feature vectors via elementwise multiplication. Finally, we apply WGA with the words from different time steps to obtain previous word guided attention (PW) and current word attention (CW) in the decoder. 
Experiments on the MSCOCO dataset show that our proposed WGA can achieve competitive performance against state-of-the-art methods, with PW results of a 39.1 Bilingual Evaluation Understudy score (BLEU-4) and a 127.6 Consensus-Based Image Description Evaluation score (CIDEr-D); and CW results of a 39.1 BLEU-4 score and a 127.2 CIDEr-D score on a Karpathy test split.<\/jats:p>","DOI":"10.3390\/s21237982","type":"journal-article","created":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T01:45:02Z","timestamp":1638323102000},"page":"7982","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Attention-Guided Image Captioning through Word Information"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6780-8432","authenticated-orcid":false,"given":"Ziwei","family":"Tang","sequence":"first","affiliation":[{"name":"School of Printing and Packaging, Wuhan University, Wuhan 430072, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yaohua","family":"Yi","sequence":"additional","affiliation":[{"name":"School of Printing and Packaging, Wuhan University, Wuhan 430072, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hao","family":"Sheng","sequence":"additional","affiliation":[{"name":"School of Printing and Packaging, Wuhan University, Wuhan 430072, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,11,30]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"ImageNet Large Scale Visual Recognition Challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. 
Vis."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., and Berg, T.L. (2011, January 20\u201325). Baby talk: Understanding and generating simple image descriptions. Proceedings of the CVPR 2011, Colorado Springs, CO, USA.","DOI":"10.1109\/CVPR.2011.5995466"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Fang, H., Gupta, S., Iandola, F., Srivastava, R.K., Deng, L., Doll\u00e1r, P., Gao, J., He, X., Mitchell, M., and Platt, J.C. (2015, January 7\u201312). From captions to visual concepts and back. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"ref_5","unstructured":"Yang, Y., Teo, C.L., Daum\u00e9, H., and Aloimonos, Y. (2011, January 27\u201331). Corpus-guided sentence generation of natural images. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK."},{"key":"ref_6","unstructured":"Mitchell, M., Han, X., Dodge, J., Mensch, A., Goyal, A., Berg, A., Yamaguchi, K., Berg, T., Stratos, K., and Daum\u00e9, H. (2012, January 23\u201327). Midge: Generating image descriptions from computer vision detections. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"104126","DOI":"10.1016\/j.imavis.2021.104126","article-title":"Image captioning via proximal policy optimization","volume":"108","author":"Zhang","year":"2021","journal-title":"Image Vis. 
Comput."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Cho, K., Merrienboer, B.v., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv Preprint.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_9","unstructured":"Sutskever, I., Vinyals, O., and Le, Q.V. (2014, January 8\u201313). Sequence to sequence learning with neural networks. Proceedings of the 27th International Conference on Neural Information Processing Systems\u2014Volume 2, Cambridge, MA, USA."},{"key":"ref_10","unstructured":"Yang, Z., Yuan, Y., Wu, Y., Cohen, W.W., and Salakhutdinov, R.R. (2016, January 5\u201310). Review networks for caption generation. Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., Carin, L., and Deng, L. (2017, January 21\u201326). Semantic Compositional Networks for Visual Captioning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.127"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Chen, Y., Wang, S., Zhang, W., and Huang, Q. (2018). Less Is More: Picking Informative Frames for Video Captioning, Springer International Publishing.","DOI":"10.1007\/978-3-030-01261-8_22"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Gan, C., Gan, Z., He, X., Gao, J., and Deng, L. (2017, January 21\u201326). StyleNet: Generating Attractive Visual Captions with Styles. 
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.108"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"92","DOI":"10.1016\/j.neucom.2020.02.041","article-title":"Dual-CNN: A Convolutional language decoder for paragraph image captioning","volume":"396","author":"Li","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1016\/j.neucom.2020.03.087","article-title":"Evolutionary recurrent neural network for image captioning","volume":"401","author":"Wang","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_16","unstructured":"Xu, K., Ba, J.L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R.S., and Bengio, Y. (2015, January 7\u20139). Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the 32nd International Conference on International Conference on Machine Learning\u2014Volume 37, Lille, France."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"322","DOI":"10.1016\/j.neucom.2019.06.085","article-title":"DAA: Dual LSTMs with adaptive attention for image captioning","volume":"364","author":"Xiao","year":"2019","journal-title":"Neurocomputing"},{"key":"ref_18","unstructured":"Dauphin, Y.N., Fan, A., Auli, M., and Grangier, D. (2017, January 6\u201311). Language Modeling with Gated Convolutional Networks. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia."},{"key":"ref_19","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. 
(2015, January 7\u201312). Show and tell: A Neural Image Caption Generator. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Wu, Q., Shen, C., Liu, L., Dick, A., and Hengel, A.V.D. (2016, January 27\u201330). What Value Do Explicit High Level Concepts Have in Vision to Language Problems? Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.29"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, January 18\u201323). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Yang, X., Tang, K., Zhang, H., and Cai, J. (2019, January 15\u201320). Auto-Encoding Scene Graphs for Image Captioning. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01094"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Yao, T., Pan, Y., Li, Y., and Mei, T. (2018). Exploring Visual Relationship for Image Captioning, Springer International Publishing.","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Yao, T., Pan, Y., Li, Y., Qiu, Z., and Mei, T. (2017, January 22\u201329). Boosting Image Captioning with Attributes. 
Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy.","DOI":"10.1109\/ICCV.2017.524"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1038\/nrn755","article-title":"Control of goal-directed and stimulus-driven attention in the brain","volume":"3","author":"Corbetta","year":"2002","journal-title":"Nat. Rev. Neurosci."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Lu, J., Xiong, C., Parikh, D., and Socher, R. (2017, January 21\u201326). Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.345"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., and Chua, T. (2017, January 21\u201326). SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.667"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Wu, L., Tian, F., Zhao, L., Lai, J., and Liu, T. (2018). Word Attention for Sequence to Sequence Text Understanding, AAAI.","DOI":"10.1609\/aaai.v32i1.11971"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. (2017, January 21\u201326). Self-Critical Sequence Training for Image Captioning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.131"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Zitnick, C.L., and Parikh, D. (2015, January 7\u201312). CIDEr: Consensus-based image description evaluation. 
Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, C.L. (2014, January 6\u201312). Microsoft COCO: Common objects in context. Proceedings of the 13th European Conference on Computer Vision, ECCV 2014, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Karpathy, A., and Fei-Fei, L. (2015, January 7\u201312). Deep visual-semantic alignments for generating image descriptions. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002, January 7\u201312). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_35","unstructured":"Satanjeev, B. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, ACL."},{"key":"ref_36","unstructured":"Anderson, P., Fernando, B., Johnson, M., and Gould, S. (2014, January 3\u20137). SPICE: Semantic propositional image caption evaluation. Proceedings of the 21st ACM Conference on Computer and Communications Security, CCS 2014, Scottsdale, AZ, USA."},{"key":"ref_37","unstructured":"Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries, Association for Computational Linguistics."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L., Kai, L., and Li, F.-F. (2009, January 20\u201325). ImageNet: A large-scale hierarchical image database. 
Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","article-title":"Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations","volume":"123","author":"Krishna","year":"2017","journal-title":"Int. J. Comput. Vis."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Jiang, W., Ma, L., Jiang, Y.-G., Liu, W., and Zhang, T. (2018). Recurrent Fusion Network for Image Captioning, Springer International Publishing.","DOI":"10.1007\/978-3-030-01216-8_31"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"7615","DOI":"10.1109\/TIP.2020.3004729","article-title":"Spatio-Temporal Memory Attention for Image Captioning","volume":"29","author":"Ji","year":"2020","journal-title":"IEEE Trans. Image Process."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1016\/j.neucom.2020.06.112","article-title":"Image captioning with semantic-enhanced features and extremely hard negative examples","volume":"413","author":"Cai","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"2149","DOI":"10.1109\/TMM.2019.2951226","article-title":"Show, Tell, and Polish: Ruminant Decoding for Image Captioning","volume":"22","author":"Guo","year":"2020","journal-title":"IEEE Trans. 
Multimed."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"107812","DOI":"10.1016\/j.patcog.2020.107812","article-title":"Linguistically-aware attention for reducing the semantic gap in vision-language tasks","volume":"112","author":"Kv","year":"2021","journal-title":"Pattern Recognit."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"104146","DOI":"10.1016\/j.imavis.2021.104146","article-title":"Exploring region relationships implicitly: Image captioning with visual relationship attention","volume":"109","author":"Zhang","year":"2021","journal-title":"Image Vis. Comput."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Fei, Z. (2021, January 2\u20139). Memory-Augmented Image Captioning. Proceedings of the AAAI Conference on Artificial Intelligence, Online.","DOI":"10.1609\/aaai.v35i2.16220"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Yan, C., Hao, Y., Li, L., Yin, J., Liu, A., Mao, Z., Chen, Z., and Gao, X. (2021). Task-Adaptive Attention for Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology (Early Access), IEEE.","DOI":"10.1109\/TCSVT.2021.3067449"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/23\/7982\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:37:39Z","timestamp":1760168259000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/23\/7982"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,11,30]]},"references-count":47,"journal-issue":{"issue":"23","published-online":{"date-parts":[[2021,12]]}},"alternative-id":["s21237982"],"URL":"https:\/\/doi.org\/10.3390\/s21237982","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2021,11,30]]}}}