{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T19:20:38Z","timestamp":1762543238690,"version":"3.41.0"},"reference-count":64,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2020,7,5]],"date-time":"2020-07-05T00:00:00Z","timestamp":1593907200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2018YFB0203904"],"award-info":[{"award-number":["2018YFB0203904"]}]},{"name":"Hunan Key R&D Program","award":["2017GK2224"],"award-info":[{"award-number":["2017GK2224"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61502157, 61502158, and 61502137"],"award-info":[{"award-number":["61502157, 61502158, and 61502137"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2020,8,31]]},"abstract":"<jats:p>The attention mechanism has been established as an effective method for generating caption words in image captioning; it explores one noticed subregion in an image to predict a related caption word. However, even though the attention mechanism could offer accurate subregions to train a model, the learned captioner may predict incorrectly, especially for visual concept words, which are the most important parts to understand an image. To tackle the preceding problem, in this article we propose Visual Concept Enhanced Captioner, which employs a joint attention mechanism with visual concept samples to strengthen prediction abilities for visual concepts in image captioning. 
Different from traditional attention approaches that adopt one LSTM to explore one noticed subregion each time, Visual Concept Enhanced Captioner introduces multiple virtual LSTMs in parallel to simultaneously receive multiple subregions from visual concept samples. Then, the model could update parameters by jointly exploring these subregions according to a composite loss function. Technically, this joint learning is helpful in finding the common characters of a visual concept, and thus it enhances the prediction accuracy for visual concepts. Moreover, by integrating diverse visual concept samples from different domains, our model can be extended to bridge visual bias in cross-domain learning for image captioning, which saves the cost for labeling captions. Extensive experiments have been conducted on two image datasets (MSCOCO and Flickr30K), and superior results are reported when comparing to state-of-the-art approaches. It is impressive that our approach could significantly increase BLEU-1 and F1 scores, which demonstrates an accuracy improvement for visual concepts in image captioning.<\/jats:p>","DOI":"10.1145\/3394955","type":"journal-article","created":{"date-parts":[[2020,7,6]],"date-time":"2020-07-06T04:16:30Z","timestamp":1594008990000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Image Captioning with a Joint Attention Mechanism by Visual Concept Samples"],"prefix":"10.1145","volume":"16","author":[{"given":"Jin","family":"Yuan","sequence":"first","affiliation":[{"name":"Hunan University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lei","family":"Zhang","sequence":"additional","affiliation":[{"name":"Hunan University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Songrui","family":"Guo","sequence":"additional","affiliation":[{"name":"Hunan University, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yi","family":"Xiao","sequence":"additional","affiliation":[{"name":"Hunan University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhiyong","family":"Li","sequence":"additional","affiliation":[{"name":"Hunan University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,7,5]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the International Conference on Digital Image Computing: Techniques and Applications. 1--8.","author":"Ali Baig Mirza Muhammad","year":"2018","unstructured":"Mirza Muhammad Ali Baig , Mian Ihtisham Shah , Muhammad Abdullah Wajahat , Nauman Zafar , and Omar Arif . 2018 . Image caption generator with novel object injection . In Proceedings of the International Conference on Digital Image Computing: Techniques and Applications. 1--8. Mirza Muhammad Ali Baig, Mian Ihtisham Shah, Muhammad Abdullah Wajahat, Nauman Zafar, and Omar Arif. 2018. Image caption generator with novel object injection. In Proceedings of the International Conference on Digital Image Computing: Techniques and Applications. 1--8."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2018.2831447"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/84"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence. 6706--6713","author":"Chen Hui","year":"2018","unstructured":"Hui Chen , Guiguang Ding , Sicheng Zhao , and Jungong Han . 2018 . Temporal-difference learning with sampling baseline for image captioning . In Proceedings of the AAAI Conference on Artificial Intelligence. 6706--6713 . Hui Chen, Guiguang Ding, Sicheng Zhao, and Jungong Han. 2018. Temporal-difference learning with sampling baseline for image captioning. 
In Proceedings of the AAAI Conference on Artificial Intelligence. 6706--6713."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.667"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.64"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3177745"},{"volume-title":"Proceedings of the Workshop on Statistical Machine Translation. 376--380","author":"Michael","key":"e_1_2_1_9_1","unstructured":"Michael J. Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language . In Proceedings of the Workshop on Statistical Machine Translation. 376--380 . Michael J. Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Workshop on Statistical Machine Translation. 376--380."},{"volume-title":"Proceedings of the European Conference on Computer Vision. 15--29","author":"Farhadi Ali","key":"e_1_2_1_10_1","unstructured":"Ali Farhadi , Seyyed Mohammad Mohsen Hejrati , Mohammad Amin Sadeghi , Peter Young , Cyrus Rashtchian , Julia Hockenmaier , and David A. Forsyth . 2010. Every picture tells a story: Generating sentences from images . In Proceedings of the European Conference on Computer Vision. 15--29 . Ali Farhadi, Seyyed Mohammad Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David A. Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proceedings of the European Conference on Computer Vision. 15--29."},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the Workshop on Text Summarization Branches Out.","author":"Flick Carlos","year":"2004","unstructured":"Carlos Flick . 2004 . ROUGE: A package for automatic evaluation of summaries . In Proceedings of the Workshop on Text Summarization Branches Out. Carlos Flick. 2004. ROUGE: A package for automatic evaluation of summaries. 
In Proceedings of the Workshop on Text Summarization Branches Out."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018320"},{"volume-title":"Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence","author":"Graves Alex","key":"e_1_2_1_13_1","unstructured":"Alex Graves . 2012. Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence , Vol. 385 . Springer , Berlin, Germany . Alex Graves. 2012. Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence, Vol. 385. Springer, Berlin, Germany."},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence. 6837--6844","author":"Gu Jiuxiang","year":"2018","unstructured":"Jiuxiang Gu , Jianfei Cai , Gang Wang , and Tsuhan Chen . 2018 . Stack-captioning: Coarse-to-fine learning for image captioning . In Proceedings of the AAAI Conference on Artificial Intelligence. 6837--6844 . Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. 2018. Stack-captioning: Coarse-to-fine learning for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence. 6837--6844."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.8"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence. 6959--6966","author":"Jiang Wenhao","year":"2018","unstructured":"Wenhao Jiang , Lin Ma , Xinpeng Chen , Hanwang Zhang , and Wei Liu . 2018 . Learning to guide decoding for image captioning . In Proceedings of the AAAI Conference on Artificial Intelligence. 6959--6966 . Wenhao Jiang, Lin Ma, Xinpeng Chen, Hanwang Zhang, and Wei Liu. 2018. Learning to guide decoding for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence. 
6959--6966."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"volume-title":"Proceedings of the International Conference on Learning Representations.","author":"Diederik","key":"e_1_2_1_19_1","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization . In Proceedings of the International Conference on Learning Representations. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.162"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00475"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2017.2751140"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2896516"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01278"},{"volume-title":"Proceedings of the European Conference on Computer Vision. 740--755","author":"Lin Tsung-Yi","key":"e_1_2_1_26_1","unstructured":"Tsung-Yi Lin , Michael Maire , Serge J. Belongie , James Hays , Pietro Perona , Deva Ramanan , Piotr Doll\u00e1r , and C. Lawrence Zitnick . 2014. Microsoft COCO: Common objects in context . In Proceedings of the European Conference on Computer Vision. 740--755 . Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision. 740--755."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/114"},{"volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence. 
4176--4182","author":"Liu Chenxi","key":"e_1_2_1_28_1","unstructured":"Chenxi Liu , Junhua Mao , Fei Sha , and Alan L. Yuille . 2017. Attention correctness in neural image captioning . In Proceedings of the AAAI Conference on Artificial Intelligence. 4176--4182 . Chenxi Liu, Junhua Mao, Fei Sha, and Alan L. Yuille. 2017. Attention correctness in neural image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence. 4176--4182."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240632"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_21"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1435"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00754"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/592"},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics. 311--318","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni , Salim Roukos , Todd Ward , and Wei-Jing Zhu . 2002 . BLEU: A method for automatic evaluation of machine translation . In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 311--318 . Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 311--318."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.140"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the Annual Conference on Neural Information Processing Systems. 91--99","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross B. Girshick , and Jian Sun . 2015 . 
Faster R-CNN: Towards real-time object detection with region proposal networks . In Proceedings of the Annual Conference on Neural Information Processing Systems. 91--99 . Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Annual Conference on Neural Information Processing Systems. 91--99."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2018.2851077"},{"key":"e_1_2_1_40_1","unstructured":"Jingkuan Song Xiangpeng Li Lianli Gao and Heng Tao Shen. 2018. Hierarchical LSTMs with adaptive attention for visual captioning. arXiv:1812.11004.  Jingkuan Song Xiangpeng Li Lianli Gao and Heng Tao Shen. 2018. Hierarchical LSTMs with adaptive attention for visual captioning. arXiv:1812.11004."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018885"},{"volume-title":"Proceedings of the Annual Conference on Neural Information Processing Systems. 3104--3112","author":"Sutskever Ilya","key":"e_1_2_1_42_1","unstructured":"Ilya Sutskever , Oriol Vinyals , and Quoc V. Le . 2014. Sequence to sequence learning with neural networks . In Proceedings of the Annual Conference on Neural Information Processing Systems. 3104--3112 . Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems. 
3104--3112."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.130"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2012.2207397"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018957"},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7378--7387","author":"Wang Yufei","key":"e_1_2_1_48_1","unstructured":"Yufei Wang , Zhe Lin , Xiaohui Shen , Scott Cohen , and Garrison W. Cottrell . 2017. Skeleton key: Image captioning by skeleton-attribute decomposition . In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7378--7387 . Yufei Wang, Zhe Lin, Xiaohui Shen, Scott Cohen, and Garrison W. Cottrell. 2017. Skeleton key: Image captioning by skeleton-attribute decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7378--7387."},{"volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 203--212","author":"Wu Qi","key":"e_1_2_1_49_1","unstructured":"Qi Wu , Chunhua Shen , Lingqiao Liu , Anthony R. Dick , and Anton van den Hengel. 2016. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 203--212 . Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony R. Dick, and Anton van den Hengel. 2016. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 203--212."},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the International Conference on Machine Learning. 
2048--2057","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu , Jimmy Ba , Ryan Kiros , Kyunghyun Cho , Aaron C. Courville , Ruslan Salakhutdinov , Richard S. Zemel , and Yoshua Bengio . 2015 . Show, attend and tell: Neural image caption generation with visual attention . In Proceedings of the International Conference on Machine Learning. 2048--2057 . Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. 2048--2057."},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2869276"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2855422"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.559"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00271"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.524"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2018.2855406"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2888822"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/168"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00483"},{"key":"e_1_2_1_64_1","volume-title":"Proceedings of the British Machine Vision Conference. 82","author":"Zhu Zhihao","year":"2018","unstructured":"Zhihao Zhu , Zhan Xue , and Zejian Yuan . 2018 . 
Think and tell: Preview network for image captioning . In Proceedings of the British Machine Vision Conference. 82 . Zhihao Zhu, Zhan Xue, and Zejian Yuan. 2018. Think and tell: Preview network for image captioning. In Proceedings of the British Machine Vision Conference. 82."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394955","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3394955","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:00Z","timestamp":1750193280000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394955"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,5]]},"references-count":64,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,8,31]]}},"alternative-id":["10.1145\/3394955"],"URL":"https:\/\/doi.org\/10.1145\/3394955","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2020,7,5]]},"assertion":[{"value":"2019-11-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-07-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}