{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T09:03:44Z","timestamp":1773824624855,"version":"3.50.1"},"reference-count":64,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2024,1,22]],"date-time":"2024-01-22T00:00:00Z","timestamp":1705881600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62222212"],"award-info":[{"award-number":["62222212"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["U19A2057"],"award-info":[{"award-number":["U19A2057"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100003999","name":"Science Fund for Creative Research Groups","doi-asserted-by":"crossref","award":["62121002"],"award-info":[{"award-number":["62121002"]}],"id":[{"id":"10.13039\/501100003999","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62302474"],"award-info":[{"award-number":["62302474"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,5,31]]},"abstract":"<jats:p>\n            Image captioning (IC), bringing vision to language, has drawn extensive attention. A crucial aspect of IC is the accurate depiction of visual relations among image objects. Visual relations encompass two primary facets: content relations and structural relations. Content relations, which comprise geometric positions content (i.e., distances and sizes) and semantic interactions content (i.e., actions and possessives), unveil the mutual correlations between objects. In contrast, structural relations pertain to the topological connectivity of object regions. Existing Transformer-based methods typically resort to geometric positions to enhance the visual relations, yet only using the shallow geometric content is unable to precisely cover actional content correlations and structural connection relations. In this article, we adopt a comprehensive perspective to examine the correlations between objects, incorporating both content relations (i.e., geometric and semantic relations) and structural relations, with the aim of generating plausible captions. To achieve this, first, we construct a geometric graph from bounding box features and a semantic graph from the scene graph parser to model the content relations. Innovatively, we construct a topology graph that amalgamates the sparsity characteristics of the geometric and semantic graphs, enabling the representation of image structural relations. Second, we propose a novel unified approach to enrich image relation representations by integrating semantic, geometric, and structural relations into self-attention. Finally, in the language decoding stage, we further leverage the semantic relation as prior knowledge to generate accurate words. 
Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our model, improving CIDEr from 128.6% to 136.6%. Code has been released at\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/CrossmodalGroup\/ER-SAN\/tree\/main\/VG-Cap\">https:\/\/github.com\/CrossmodalGroup\/ER-SAN\/tree\/main\/VG-Cap<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3638558","type":"journal-article","created":{"date-parts":[[2023,12,25]],"date-time":"2023-12-25T11:39:17Z","timestamp":1703504357000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9561-7550","authenticated-orcid":false,"given":"Jingyu","family":"Li","sequence":"first","affiliation":[{"name":"University of Science and Technology of China, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5739-8126","authenticated-orcid":false,"given":"Zhendong","family":"Mao","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China and the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-9621-0925","authenticated-orcid":false,"given":"Hao","family":"Li","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2774-2875","authenticated-orcid":false,"given":"Weidong","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1151-1792","authenticated-orcid":false,"given":"Yongdong","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China and the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China"}]}],"member":"320","published-online":{"date-parts":[[2024,1,22]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_1_4_2","first-page":"65","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. 65\u201372."},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3178844"},{"key":"e_1_3_1_6_2","first-page":"213","volume-title":"Proceedings of the 16th European Conference on Computer Vision (ECCV\u201920)","author":"Carion Nicolas","year":"2020","unstructured":"Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision (ECCV\u201920). Springer, 213\u2013229."},{"key":"e_1_3_1_7_2","unstructured":"Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. 
Generating long sequences with sparse transformers. Retrieved from https:\/\/arxiv.org\/abs\/1904.10509 (2019)."},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"e_1_3_1_9_2","unstructured":"Zhihao Fan, Zhongyu Wei, Siyuan Wang, Ruize Wang, Zejun Li, Haijun Shan, and Xuanjing Huang. 2021. TCIC: Theme concepts learning cross language and vision for image captioning. Retrieved from https:\/\/arxiv.org\/abs\/2106.10936"},{"key":"e_1_3_1_10_2","first-page":"607","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"36","author":"Fei Zhengcong","year":"2022","unstructured":"Zhengcong Fei. 2022. Attention-aligned transformer for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 607\u2013615."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2020.2965966"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350943"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01034"},{"key":"e_1_3_1_14_2","article-title":"Image captioning: Transforming objects into words","volume":"32","author":"Herdade Simao","year":"2019","unstructured":"Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. Image captioning: Transforming objects into words. Adv. Neural Info. Process. Syst. 32 (2019).","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_1_15_2","first-page":"1945","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201922)","author":"Huang Feicheng","year":"2022","unstructured":"Feicheng Huang and Zhixin Li. 2022. Improve image captioning via relation modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201922). IEEE, 1945\u20131949."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00473"},{"key":"e_1_3_1_17_2","article-title":"Adaptively aligned image captioning via adaptive attention time","volume":"32","author":"Huang Lun","year":"2019","unstructured":"Lun Huang, Wenmin Wang, Yaxian Xia, and Jie Chen. 2019. Adaptively aligned image captioning via adaptive attention time. Adv. Neural Info. Process. Syst. 32 (2019).","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16258"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","unstructured":"Jiayi Ji, Yiwei Ma, Xiaoshuai Sun, Yiyi Zhou, Yongjian Wu, and Rongrong Ji. 2022. Knowing what to learn: A metric-oriented focal mechanism for image captioning. IEEE Transactions on Image Processing 31 (2022), 4321\u20134335. DOI:10.1109\/TIP.2022.3183434","DOI":"10.1109\/TIP.2022.3183434"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01028"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3460474"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3181490"},{"key":"e_1_3_1_23_2","first-page":"3128","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition Conference (CVPR\u201915)","author":"Karpathy Andrej","year":"2015","unstructured":"Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Computer Vision and Pattern Recognition Conference (CVPR\u201915). 3128\u20133137."},{"key":"e_1_3_1_24_2","unstructured":"Thomas N. Kipf and Max Welling. 2016. 
Semi-supervised classification with graph convolutional networks. Retrieved from https:\/\/arxiv.org\/abs\/1609.02907"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_1_26_2","first-page":"1056","volume-title":"Proceedings of the 31st International Joint Conference on Artificial Intelligence","author":"Li Jingyu","year":"2022","unstructured":"Jingyu Li, Zhendong Mao, Shancheng Fang, and Hao Li. 2022. ER-SAN: Enhanced-adaptive relation self-attention network for image captioning. In Proceedings of the 31st International Joint Conference on Artificial Intelligence. 1056\u20131062."},{"key":"e_1_3_1_27_2","unstructured":"Zhongli Li, Qingyu Zhou, Chao Li, Ke Xu, and Yunbo Cao. 2020. Improving BERT with syntax-aware local attention. Retrieved from https:\/\/arxiv.org\/abs\/2012.15150"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","unstructured":"An-An Liu, Chenxi Huang, Ning Xu, Hongshuo Tian, Jing Liu, and Yongdong Zhang. 2023. Counterfactual visual dialog: Robust commonsense knowledge learning from unbiased training. IEEE Transactions on Multimedia (2023), 1\u201313. DOI:10.1109\/TMM.2023.3284594","DOI":"10.1109\/TMM.2023.3284594"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","unstructured":"An-An Liu, Haochun Lu, Heyu Zhou, Tianbao Li, and Mohan Kankanhalli. 2024. Balanced class-incremental 3D object classification and retrieval. IEEE Transactions on Knowledge and Data Engineering 36, 1 (2024), 35\u201348. DOI:10.1109\/TKDE.2023.3284032","DOI":"10.1109\/TKDE.2023.3284032"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2537337"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","unstructured":"An-An Liu, Yingchen Zhai, Ning Xu, Weizhi Nie, Wenhui Li, and Yongdong Zhang. 2022. Region-aware image captioning via interaction learning. IEEE Transactions on Circuits and Systems for Video Technology 32, 6 (2022), 3685\u20133696. DOI:10.1109\/TCSVT.2021.3107035","DOI":"10.1109\/TCSVT.2021.3107035"},{"key":"e_1_3_1_33_2","unstructured":"Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. 2022. DAB-DETR: Dynamic anchor boxes are better queries for DETR. Retrieved from https:\/\/arxiv.org\/abs\/2201.12329"},{"issue":"4","key":"e_1_3_1_34_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3409388","article-title":"Adaptive attention-based high-level semantic introduction for image caption","volume":"16","author":"Liu Xiaoxiao","year":"2020","unstructured":"Xiaoxiao Liu and Qingyang Xu. 2020. Adaptive attention-based high-level semantic introduction for image caption. ACM Trans. Multimedia Comput., Commun. Appl. 16, 4 (2020), 1\u201322.","journal-title":"ACM Trans. Multimedia Comput., Commun. Appl."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_3_1_36_2","unstructured":"Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, Chia-Wen Lin, and Rongrong Ji. 2021. Dual-level collaborative transformer for image captioning. Retrieved from https:\/\/arxiv.org\/abs\/2101.06462"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01098"},{"key":"e_1_3_1_38_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. 
BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311\u2013318."},{"key":"e_1_3_1_39_2","first-page":"91","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Info. Process. Syst. 28 (2015), 91\u201399.","journal-title":"Adv. Neural Info. Process. Syst."},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_1_41_2","volume-title":"Proceedings of Workshop on Text Summarization of ACL","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of Workshop on Text Summarization of ACL."},{"key":"e_1_3_1_42_2","doi-asserted-by":"crossref","unstructured":"Zhan Shi, Xu Zhou, Xipeng Qiu, and Xiaodan Zhu. 2020. Improving image captioning with better use of captions. Retrieved from https:\/\/arxiv.org\/abs\/2006.11807","DOI":"10.18653\/v1\/2020.acl-main.664"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475607"},{"key":"e_1_3_1_44_2","first-page":"5998","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998\u20136008."},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00271"},{"key":"e_1_3_1_47_2","first-page":"19048","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Wang Xiaohan","year":"2023","unstructured":"Xiaohan Wang, Wenguan Wang, Jiayi Shao, and Yi Yang. 2023. LANA: A language-capable navigator for instruction following and generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19048\u201319058."},{"issue":"7","key":"e_1_3_1_48_2","doi-asserted-by":"crossref","first-page":"4417","DOI":"10.1109\/TCSVT.2021.3121062","article-title":"High-order interaction learning for image captioning","volume":"32","author":"Wang Yanhui","year":"2021","unstructured":"Yanhui Wang, Ning Xu, An-An Liu, Wenhui Li, and Yongdong Zhang. 2021. High-order interaction learning for image captioning. IEEE Trans. Circ. Syst. Video Technol. 32, 7 (2021), 4417\u20134430.","journal-title":"IEEE Trans. Circ. Syst. Video Technol."},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3478024"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.634"},{"key":"e_1_3_1_51_2","unstructured":"Yang Xu, Li Li, Haiyang Xu, Songfang Huang, Fei Huang, and Jianfei Cai. 2022. Image captioning in the transformer age. 
Retrieved from https:\/\/arxiv.org\/abs\/2204.07374"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3067449"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_3_1_54_2","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201919)","author":"Yang Xu","year":"2019","unstructured":"Xu Yang, Hanwang Zhang, and Jianfei Cai. 2019. Learning to collocate neural modules for image captioning. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201919)."},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1631\/FITEE.2100463"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2019.2947482"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3532627"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3307554"},{"key":"e_1_3_1_60_2","article-title":"Unified adaptive relevance distinguishable attention network for image-text matching","author":"Zhang Kun","year":"2022","unstructured":"Kun Zhang, Zhendong Mao, Anan Liu, and Yongdong Zhang. 2022. Unified adaptive relevance distinguishable attention network for image-text matching. IEEE Trans. Multimedia 25 (Jan. 2022), 1320\u20131332.","journal-title":"IEEE Trans. Multimedia"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01521"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01521"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6511"},{"key":"e_1_3_1_64_2","unstructured":"Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, and Xu Sun. 2019. Explicit sparse transformer: Concentrated attention through explicit selection. Retrieved from https:\/\/arxiv.org\/abs\/1912.11637"},{"issue":"6","key":"e_1_3_1_65_2","doi-asserted-by":"crossref","first-page":"2827","DOI":"10.1109\/TPAMI.2021.3049156","article-title":"Cascaded parsing of human-object interaction recognition","volume":"44","author":"Zhou Tianfei","year":"2021","unstructured":"Tianfei Zhou, Siyuan Qi, Wenguan Wang, Jianbing Shen, and Song-Chun Zhu. 2021. Cascaded parsing of human-object interaction recognition. IEEE Trans. Pattern Anal. Mach. Intell. 44, 6 (2021), 2827\u20132840.","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3638558","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3638558","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:06:13Z","timestamp":1750291573000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3638558"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,22]]},"references-count":64,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2024,5,31]]}},"alternative-id":["10.1145\/3638558"],"URL":"https:\/\/doi.org\/10.1145\/3638558","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,22]]},"assertion":[{"value":"2023-05-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-12-20","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}