{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T08:02:53Z","timestamp":1769155373346,"version":"3.49.0"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2021,11,12]],"date-time":"2021-11-12T00:00:00Z","timestamp":1636675200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62076262, 61673402, 61273270, and 60802069"],"award-info":[{"award-number":["62076262, 61673402, 61273270, and 60802069"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"National Key Research and Development Program of China","award":["2018YFB1601101 and 2018YFB1601100"],"award-info":[{"award-number":["2018YFB1601101 and 2018YFB1601100"]}]},{"DOI":"10.13039\/501100003453","name":"Natural Science Foundation of Guangdong Province","doi-asserted-by":"crossref","award":["2017A030311029"],"award-info":[{"award-number":["2017A030311029"]}],"id":[{"id":"10.13039\/501100003453","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2021,11,30]]},"abstract":"<jats:p>\n            Image Captioning, which automatically describes an image with natural language, is regarded as a fundamental challenge in computer vision. In recent years, significant advance has been made in image captioning through improving attention mechanism. However, most existing methods construct attention mechanisms based on singular visual features, such as patch features or object features, which limits the accuracy of generated captions. In this article, we propose a\n            <jats:bold>Bidirectional Co-Attention Network (BCAN)<\/jats:bold>\n            that combines multiple visual features to provide information from different aspects. Different features are associated with predicting different words, and there are\n            <jats:italic>a priori<\/jats:italic>\n            relations between these multiple visual features. Based on this, we further propose a bottom-up and top-down bi-directional co-attention mechanism to extract discriminative attention information. Furthermore, most existing methods do not exploit an effective multimodal integration strategy, generally using addition or concatenation to combine features. To solve this problem, we adopt the\n            <jats:bold>Multivariate Residual Module (MRM)<\/jats:bold>\n            to integrate multimodal attention features. Meanwhile, we further propose a Vertical MRM to integrate features of the same category, and a Horizontal MRM to combine features of the different categories, which can balance the contribution of the bottom-up co-attention and the top-down co-attention. In contrast to the existing methods, the BCAN is able to obtain complementary information from multiple visual features via the bi-directional co-attention strategy, and integrate multimodal information via the improved multivariate residual strategy. We conduct a series of experiments on two benchmark datasets (MSCOCO and Flickr30k), and the results indicate that the proposed BCAN achieves the superior performance.\n          <\/jats:p>","DOI":"10.1145\/3460474","type":"journal-article","created":{"date-parts":[[2021,11,12]],"date-time":"2021-11-12T21:16:06Z","timestamp":1636751766000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":38,"title":["Bi-Directional Co-Attention Network for Image Captioning"],"prefix":"10.1145","volume":"17","author":[{"given":"Weitao","family":"Jiang","sequence":"first","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangdong, People\u2019s Republic of China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Weixuan","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangdong, People\u2019s Republic of China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Haifeng","family":"Hu","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangdong, People\u2019s Republic of China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,11,12]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_4_2","first-page":"65","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for mt evaluation with improved correalation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. 65\u201372."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.285"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.667"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.5555\/1888089.1888092"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2642953"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.127"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3292058"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00473"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01216-8_31"},{"key":"e_1_3_2_17_2","doi-asserted-by":"crossref","first-page":"381","DOI":"10.1007\/978-3-030-68780-9_32","volume-title":"Pattern Recognition. ICPR International Workshops and Challenges","author":"Kalimuthu Marimuthu","year":"2021","unstructured":"Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, editor=\u201cDel Bimbo Alberto Klakow, Dietrich\u201d, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani. 2021. Fusion models for improved image captioning. In Pattern Recognition. ICPR International Workshops and Challenges. 381\u2013395."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969038"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00898"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.5555\/3157096.3157137"},{"key":"e_1_3_2_22_2","article-title":"Hadamard product for low-rank bilinear pooling","author":"Kim Jin-Hwa","year":"2016","unstructured":"Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. 2016. Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325 (2016).","journal-title":"arXiv preprint arXiv:1610.04325"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.162"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00902"},{"key":"e_1_3_2_26_2","first-page":"74","volume-title":"Proceedings of the Workshop on Text Summarization Branches Out","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out. 74\u201381."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.100"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00754"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.5555\/3504035.3504919"},{"key":"e_1_3_2_32_2","article-title":"Deep captioning with multimodal recurrent neural networks (m-RNN)","author":"Mao Junhua","year":"2014","unstructured":"Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632 (2014).","journal-title":"arXiv preprint arXiv:1412.6632"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01098"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.303"},{"key":"e_1_3_2_36_2","article-title":"Sequence level training with recurrent neural networks","author":"Ranzato Marc\u2019Aurelio","year":"2015","unstructured":"Marc\u2019Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015).","journal-title":"arXiv preprint arXiv:1511.06732"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.5555\/2969239.2969250"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_42_2","first-page":"587","volume-title":"Asian Conference on Computer Vision","author":"Wang Weixuan","year":"2018","unstructured":"Weixuan Wang, Zhihong Chen, and Haifeng Hu. 2018. Multivariate attention network for image captioning. In Asian Conference on Computer Vision. Springer, 587\u2013602."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018957"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.780"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.29"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.5555\/3045118.3045336"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3386725"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00271"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_3_2_52_2","article-title":"Actor-critic sequence training for image captioning","author":"Zhang Li","year":"2017","unstructured":"Li Zhang, Flood Sung, Feng Liu, Tao Xiang, Shaogang Gong, Yongxin Yang, and Timothy M. Hospedales. 2017. Actor-critic sequence training for image captioning. arXiv preprint arXiv:1706.09601 (2017).","journal-title":"arXiv preprint arXiv:1706.09601"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58568-6_13"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460474","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3460474","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:17:04Z","timestamp":1750191424000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460474"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,11,12]]},"references-count":52,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2021,11,30]]}},"alternative-id":["10.1145\/3460474"],"URL":"https:\/\/doi.org\/10.1145\/3460474","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,11,12]]},"assertion":[{"value":"2020-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-11-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}