{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T22:11:46Z","timestamp":1766268706146,"version":"3.41.0"},"reference-count":81,"publisher":"Association for Computing Machinery (ACM)","issue":"12","license":[{"start":{"date-parts":[[2024,11,25]],"date-time":"2024-11-25T00:00:00Z","timestamp":1732492800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62337001"],"award-info":[{"award-number":["62337001"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"crossref","award":["226-2024-00058"],"award-info":[{"award-number":["226-2024-00058"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]},{"name":"HKUST Special","award":["F0927"],"award-info":[{"award-number":["F0927"]}]},{"name":"HKUST Sports Science and Technology Research","award":["SSTRG24EG04"],"award-info":[{"award-number":["SSTRG24EG04"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,12,31]]},"abstract":"<jats:p>\n            Distinctive Image Captioning (DIC)\u2014generating distinctive captions that describe the unique details of a target image\u2014has received considerable attention over the last few years. A recent DIC method proposes to generate distinctive captions by comparing the target image with a set of semantic-similar reference images, i.e., reference-Based DIC (Ref-DIC). It aims to force the generated captions to distinguish between the target image and the reference image. 
Unfortunately, reference images used by existing Ref-DIC works are easy to distinguish:\n            <jats:italic>these reference images only resemble the target image at scene-level and have few common objects, such that a Ref-DIC model can trivially generate distinctive captions even without considering the reference images.<\/jats:italic>\n            For example, if the target image contains objects \u201c\n            <jats:monospace>towel<\/jats:monospace>\n            \u201d and \u201c\n            <jats:monospace>toilet<\/jats:monospace>\n            \u201d while all reference images are without them, then a simple caption \u201c\n            <jats:monospace>A bathroom with a towel and a toilet<\/jats:monospace>\n            \u201d is distinctive enough to tell apart target and reference images. To ensure Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism, which strictly controls the similarity between the target and reference images at the object-\/attribute-level (vs. scene-level). Second, to generate distinctive captions, we develop a Transformer-based Ref-DIC baseline\n            <jats:italic>TransDIC<\/jats:italic>\n            . It not only extracts visual features from the target image but also encodes the differences between objects in the target and reference images. Taking one step further, we propose a stronger\n            <jats:italic>TransDIC<\/jats:italic>\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\({++}\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            , which consists of an extra contrastive learning module to make full use of the reference images. This new module is model-agnostic, which can be easily incorporated into various Ref-DIC architectures. 
Finally, for more trustworthy benchmarking, we propose a new evaluation metric named\n            <jats:italic>DisCIDEr<\/jats:italic>\n            for Ref-DIC, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that our\n            <jats:italic>TransDIC<\/jats:italic>\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\({++}\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            can generate distinctive captions. Besides, it outperforms several state-of-the-art models on the two new benchmarks over different metrics.\n          <\/jats:p>","DOI":"10.1145\/3694683","type":"journal-article","created":{"date-parts":[[2024,9,24]],"date-time":"2024-09-24T14:47:31Z","timestamp":1727189251000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7759-6015","authenticated-orcid":false,"given":"Yangjun","family":"Mao","sequence":"first","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0303-134X","authenticated-orcid":false,"given":"Jun","family":"Xiao","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4543-2179","authenticated-orcid":false,"given":"Dong","family":"Zhang","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology, Kowloon, Hong Kong 
SAR"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8946-4228","authenticated-orcid":false,"given":"Meng","family":"Cao","sequence":"additional","affiliation":[{"name":"Peking University, Shenzhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7842-7616","authenticated-orcid":false,"given":"Jian","family":"Shao","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9017-2508","authenticated-orcid":false,"given":"Yueting","family":"Zhuang","sequence":"additional","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6148-9709","authenticated-orcid":false,"given":"Long","family":"Chen","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology, Kowloon, Hong Kong SAR"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,11,25]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_4_2","unstructured":"J. Lei Ba J. R. Kiros G. E. Hinton. 2016. Layer Normalization. arXiv: 1607.06450. Retrieved from https:\/\/arxiv.org\/abs\/1607.06450"},{"key":"e_1_3_2_5_2","first-page":"65","volume-title":"Association for Computational Linguistics Workshop","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. 
In Association for Computational Linguistics Workshop, 65\u201372."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW56347.2022.00512"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00146"},{"key":"e_1_3_2_8_2","first-page":"16846","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Chen Long","year":"2021","unstructured":"Long Chen, Zhihong Jiang, Jun Xiao, and Wei Liu. 2021. Human-like controllable image captioning with verb-specific semantic roles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 16846\u201316856."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.displa.2023.102377"},{"key":"e_1_3_2_10_2","first-page":"4613","article-title":"Counterfactual critic multi-agent training for scene graph generation","author":"Chen Long","year":"2019","unstructured":"Long Chen, Hanwang Zhang, Jun Xiao, Xiangnan He, Shiliang Pu, and Shih-Fu Chang. 2019. Counterfactual critic multi-agent training for scene graph generation. In ICCV, 4613\u20134623.","journal-title":"ICCV"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.667"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3290012"},{"key":"e_1_3_2_13_2","first-page":"9962","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Chen Shizhe","year":"2020","unstructured":"Shizhe Chen, Qin Jin, Peng Wang, and Qi Wu. 2020. Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9962\u20139971"},{"key":"e_1_3_2_14_2","first-page":"1597","volume-title":"ICML","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. 
A simple framework for contrastive learning of visual representations. In ICML. PMLR, 1597\u20131607."},{"key":"e_1_3_2_15_2","unstructured":"Xinlei Chen Hao Fang Tsung-Yi Lin Ramakrishna Vedantam Saurabh Gupta Piotr Doll\u00e1r and C. Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv:1504.00325. Retrieved from https:\/\/arxiv.org\/abs\/1504.00325"},{"key":"e_1_3_2_16_2","first-page":"15750","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Chen Xinlei","year":"2021","unstructured":"Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 15750\u201315758."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_2_18_2","first-page":"10578","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Cornia M.","year":"2020","unstructured":"M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara. 2020. Meshed-memory transformer for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 10578\u201310587."},{"key":"e_1_3_2_19_2","first-page":"2970","volume-title":"International Conference on Computer Vision","author":"Dai Bo","year":"2017","unstructured":"Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. 2017. Towards diverse and natural image descriptions via a conditional gan. In International Conference on Computer Vision, 2970\u20132979."},{"key":"e_1_3_2_20_2","first-page":"898","article-title":"Contrastive learning for image captioning","author":"Dai Bo","year":"2017","unstructured":"Bo Dai and Dahua Lin. 2017. Contrastive learning for image captioning. 
In Advances in Neural Information Processing Systems, 898\u2013907.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_21_2","unstructured":"Fartash Faghri David J. Fleet Jamie Ryan Kiros and Sanja Fidler. 2017. Vse++: Improving visual-semantic embeddings with hard negatives. arXiv:1707.05612. Retrieved from https:\/\/arxiv.org\/abs\/1707.05612"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i1.19940"},{"key":"e_1_3_2_23_2","first-page":"614","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"37","author":"Fei Zhengcong","year":"2023","unstructured":"Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang, Xiaoming Wei, and Xiaolin Wei. 2023. Uncertainty-aware image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 614\u2013622."},{"key":"e_1_3_2_24_2","first-page":"266","volume-title":"European Conference on Computer Vision","author":"Gao Mingfei","year":"2022","unstructured":"Mingfei Gao, Chen Xing, Juan Carlos Niebles, Junnan Li, Ran Xu, Wenhao Liu, and Caiming Xiong. 2022. Open vocabulary object detection with pseudo bounding-box labels. In European Conference on Computer Vision, 266\u2013282."},{"key":"e_1_3_2_25_2","first-page":"21271","article-title":"Bootstrap your own latent-a new approach to self-supervised learning","volume":"33","author":"Grill Jean-Bastien","year":"2020","unstructured":"Jean-Bastien Grill, Florian Strub, Florent Altch\u00e9, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, R\u00e9mi Munos, and Michal Valko. 2020. Bootstrap your own latent-a new approach to self-supervised learning. In NeurIPS, Vol. 
33, 21271\u201321284.","journal-title":"NeurIPS"},{"key":"e_1_3_2_26_2","first-page":"118","article-title":"Face recognition with contrastive convolution","author":"Han Chunrui","year":"2018","unstructured":"Chunrui Han, Shiguang Shan, Meina Kan, Shuzhe Wu, and Xilin Chen. 2018. Face recognition with contrastive convolution. In ECCV, 118\u2013134.","journal-title":"ECCV"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV56688.2023.00118"},{"key":"e_1_3_2_29_2","first-page":"4634","volume-title":"International Conference on Computer Vision","author":"Huang Lun","year":"2019","unstructured":"Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. 2019. Attention on attention for image captioning. In International Conference on Computer Vision, 4634\u20134643."},{"key":"e_1_3_2_30_2","doi-asserted-by":"crossref","first-page":"2367","DOI":"10.1109\/TMM.2023.3295098","article-title":"Memory-based augmentation network for video captioning","volume":"26","author":"Jing Shuaiqi","year":"2023","unstructured":"Shuaiqi Jing, Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, and Heng Tao Shen. 2023. Memory-based augmentation network for video captioning. IEEE Transactions on Multimedia 26 (2023), 2367\u20132379.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.494"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_33_2","first-page":"8928","volume-title":"International Conference on Computer Vision","author":"Li Guang","year":"2019","unstructured":"Guang Li, Linchao Zhu, Ping Liu, and Yi Yang. 2019. Entangled transformer for image captioning. 
In International Conference on Computer Vision, 8928\u20138937."},{"key":"e_1_3_2_34_2","first-page":"21685","volume-title":"International Conference on Computer Vision","author":"Li Lin","year":"2023","unstructured":"Lin Li, Guikun Chen, Jun Xiao, Yi Yang, Chunping Wang, and Long Chen. 2023. Compositional feature augmentation for unbiased scene graph generation. In International Conference on Computer Vision, 21685\u201321695."},{"key":"e_1_3_2_35_2","first-page":"18869","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Li Lin","year":"2022","unstructured":"Lin Li, Long Chen, Yifeng Huang, Zhimeng Zhang, Songyang Zhang, and Jun Xiao. 2022. The devil is in the labels: Noisy label correction for robust scene graph generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 18869\u201318878."},{"key":"e_1_3_2_36_2","first-page":"50105","article-title":"Zero-shot visual relation detection via composite visual cues from large language models","volume":"36","author":"Li Lin","year":"2023","unstructured":"Lin Li, Jun Xiao, Guikun Chen, Jian Shao, Yueting Zhuang, and Long Chen. 2023. Zero-shot visual relation detection via composite visual cues from large language models. In Advances in Neural Information Processing Systems, Vol. 36, 50105\u201350116.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_37_2","first-page":"3440","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Li Zhuowan","year":"2020","unstructured":"Zhuowan Li, Quan Tran, Long Mai, Zhe Lin, and Alan L. Yuille. 2020. Context-aware group captioning via self-attention and contrastive features. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3440\u20133450."},{"key":"e_1_3_2_38_2","first-page":"74","volume-title":"Association for Computational Linguistics Workshop","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Association for Computational Linguistics Workshop, 74\u201381."},{"issue":"12","key":"e_1_3_2_39_2","first-page":"7655","article-title":"Toward region-aware attention learning for scene graph generation","volume":"33","author":"Liu An-An","year":"2021","unstructured":"An-An Liu, Hongshuo Tian, Ning Xu, Weizhi Nie, Yongdong Zhang, and Mohan Kankanhalli. 2021. Toward region-aware attention learning for scene graph generation. IEEE Transactions on Neural Networks and Learning Systems 33, 12 (2021), 7655\u20137666.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3107035"},{"key":"e_1_3_2_41_2","first-page":"4240","volume-title":"International Conference on Computer Vision","author":"Liu Lixin","year":"2019","unstructured":"Lixin Liu, Jiajun Tang, Xiaojun Wan, and Zongming Guo. 2019. Generating diverse and descriptive image captions using visual paraphrases. In International Conference on Computer Vision, 4240\u20134249."},{"key":"e_1_3_2_42_2","first-page":"873","volume-title":"International Conference on Computer Vision","author":"Liu Siqi","year":"2017","unstructured":"Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2017. Improved image captioning via policy gradient optimization of spider. In International Conference on Computer Vision, 873\u2013881."},{"key":"e_1_3_2_43_2","first-page":"338","volume-title":"European Conference on Computer Vision","author":"Liu Xihui","year":"2018","unstructured":"Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, and Xiaogang Wang. 2018. 
Show, tell and discriminate: Image captioning by self-retrieval with partially labeled data. In European Conference on Computer Vision, 338\u2013354."},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00728"},{"key":"e_1_3_2_45_2","first-page":"4374","volume-title":"ACM International Conference on Multimedia","author":"Mao Yangjun","year":"2022","unstructured":"Yangjun Mao, Long Chen, Zhihong Jiang, Dong Zhang, Zhimeng Zhang, Jian Shao, and Jun Xiao. 2022. Rethinking the reference-based distinctive image captioning. In ACM International Conference on Multimedia, 4374\u20134384."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01098"},{"key":"e_1_3_2_47_2","first-page":"311","volume-title":"Meeting of Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Meeting of Association for Computational Linguistics, 311\u2013318."},{"key":"e_1_3_2_48_2","first-page":"4624","volume-title":"International Conference on Computer Vision","author":"Park Dong Huk","year":"2019","unstructured":"Dong Huk Park, Trevor Darrell, and Anna Rohrbach. 2019. Robust change captioning. In International Conference on Computer Vision, 4624\u20134633."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.303"},{"key":"e_1_3_2_50_2","first-page":"1971","volume-title":"International Conference on Computer Vision","author":"Qiu Yue","year":"2021","unstructured":"Yue Qiu, Shintaro Yamamoto, Kodai Nakashima, Ryota Suzuki, Kenji Iwata, Hirokatsu Kataoka, and Yutaka Satoh. 2021. Describing and localizing multiple changes with transformers. 
In International Conference on Computer Vision, 1971\u20131980."},{"key":"e_1_3_2_51_2","first-page":"8748","volume-title":"International Conference on Computer Vision","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In International Conference on Computer Vision, 8748\u20138763."},{"key":"e_1_3_2_52_2","unstructured":"Marc\u2019Aurelio Ranzato Sumit Chopra Michael Auli and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv:1511.06732. Retrieved from https:\/\/arxiv.org\/abs\/1511.06732"},{"key":"e_1_3_2_53_2","first-page":"1137","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Ren Shaoqing","year":"2016","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1137\u20131149."},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_2_55_2","first-page":"618","volume-title":"International Conference on Computer Vision","author":"Selvaraju Ramprasaath R.","year":"2017","unstructured":"Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision, 618\u2013626."},{"key":"e_1_3_2_56_2","first-page":"4135","volume-title":"International Conference on Computer Vision","author":"Shetty Rakshith","year":"2017","unstructured":"Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. 2017. 
Speaking the same language: Matching machine to human captions by adversarial training. In International Conference on Computer Vision, 4135\u20134144."},{"key":"e_1_3_2_57_2","doi-asserted-by":"crossref","unstructured":"Alane Suhr Stephanie Zhou Ally Zhang Iris Zhang Huajun Bai and Yoav Artzi. 2018. A corpus for reasoning about natural language grounded in photographs. arXiv:1811.00491. Retrieved from https:\/\/arxiv.org\/abs\/1811.00491","DOI":"10.18653\/v1\/P19-1644"},{"key":"e_1_3_2_58_2","doi-asserted-by":"crossref","unstructured":"Hao Tan Franck Dernoncourt Zhe Lin Trung Bui and Mohit Bansal. 2019. Expressing visual relationships via language. arXiv:1906.07689. Retrieved from https:\/\/arxiv.org\/abs\/1906.07689","DOI":"10.18653\/v1\/P19-1182"},{"key":"e_1_3_2_59_2","first-page":"6000","article-title":"Attention is all you need","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 6000\u20136010.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.120"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_63_2","doi-asserted-by":"crossref","first-page":"2966","DOI":"10.1109\/TMM.2022.3154149","article-title":"A text-guided generation and refinement model for image captioning","volume":"25","author":"Wang Depeng","year":"2022","unstructured":"Depeng Wang, Zhenzhen Hu, Yuanen Zhou, Richang Hong, and Meng Wang. 2022. A text-guided generation and refinement model for image captioning. 
IEEE Transactions on Multimedia 25 (2022), 2966\u20132977.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_64_2","first-page":"370","volume-title":"European Conference on Computer Vision","author":"Wang Jiuniu","year":"2020","unstructured":"Jiuniu Wang, Wenjia Xu, Qingzhong Wang, and Antoni B. Chan. 2020. Compare and reweight: Distinctive image captioning using similar images sets. In European Conference on Computer Vision, 370\u2013386."},{"key":"e_1_3_2_65_2","first-page":"5020","volume-title":"ACM International Conference on Multimedia","author":"Wang Jiuniu","year":"2021","unstructured":"Jiuniu Wang, Wenjia Xu, Qingzhong Wang, and Antoni B. Chan. 2021. Group-based distinctive image captioning with memory attention. In ACM International Conference on Multimedia, 5020\u20135028."},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3121062"},{"key":"e_1_3_2_67_2","first-page":"113","volume-title":"European Conference on Computer Vision","author":"Wang Zhen","year":"2022","unstructured":"Zhen Wang, Long Chen, Wenbo Ma, Guangxing Han, Yulei Niu, Jian Shao, and Jun Xiao. 2022. Explicit image caption editing. In European Conference on Computer Vision. Springer, 113\u2013129."},{"key":"e_1_3_2_68_2","first-page":"629","volume-title":"European Conference on Computer Vision","author":"Wang Zeyu","year":"2020","unstructured":"Zeyu Wang, Berthy Feng, Karthik Narasimhan, and Olga Russakovsky. 2020. Towards unique and informative captioning of images. In European Conference on Computer Vision, 629\u2013644."},{"key":"e_1_3_2_69_2","first-page":"365","volume-title":"European Conference on Computer Vision","author":"Wang Zhen","year":"2024","unstructured":"Zhen Wang, Jun Xiao, Tao Chen, and Long Chen. 2024. DECap: Towards generalized explicit caption editing via diffusion mechanism. 
In European Conference on Computer Vision, 365\u2013381."},{"key":"e_1_3_2_70_2","first-page":"1","article-title":"Learning combinatorial prompts for universal controllable image captioning","author":"Wang Zhen","year":"2024","unstructured":"Zhen Wang, Jun Xiao, Yueting Zhuang, Fei Gao, Jian Shao, and Long Chen. 2024. Learning combinatorial prompts for universal controllable image captioning. International Journal of Computer Vision (2024), 1\u201322.","journal-title":"International Journal of Computer Vision"},{"key":"e_1_3_2_71_2","first-page":"2048","volume-title":"International Conference on Machine Learning","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2048\u20132057."},{"key":"e_1_3_2_72_2","first-page":"1031","article-title":"Scene graph inference via multi-scale context modeling","author":"Xu Ning","year":"2020","unstructured":"Ning Xu, An-An Liu, Yongkang Wong, Weizhi Nie, Yuting Su, and Mohan Kankanhalli. 2020. Scene graph inference via multi-scale context modeling. IEEE Transactions on Circuits and Systems for Video Technology (2020), 1031\u20131041.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_2_73_2","first-page":"1372","article-title":"Multi-level policy and reward-based deep reinforcement learning framework for image captioning","author":"Xu Ning","year":"2019","unstructured":"Ning Xu, Hanwang Zhang, An-An Liu, Weizhi Nie, Yuting Su, Jie Nie, and Yongdong Zhang. 2019. Multi-level policy and reward-based deep reinforcement learning framework for image captioning. 
IEEE Transactions on Multimedia (2019), 1372\u20131383.","journal-title":"IEEE Transactions on Multimedia"},{"issue":"2","key":"e_1_3_2_74_2","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1007\/s13735-023-00307-3","article-title":"PSNet: Position-shift alignment network for image caption","volume":"12","author":"Xue Lixia","year":"2023","unstructured":"Lixia Xue, Awen Zhang, Ronggui Wang, and Juan Yang. 2023. PSNet: Position-shift alignment network for image caption. International Journal of Multimedia Information Retrieval 12, 2 (2023), 42.","journal-title":"International Journal of Multimedia Information Retrieval"},{"key":"e_1_3_2_75_2","doi-asserted-by":"crossref","unstructured":"An Yan Xin Eric Wang Tsu-Jui Fu and William Yang Wang. 2021. L2C: Describing visual differences needs semantic understanding of individuals. arXiv:2102.01860. Retrieved from https:\/\/arxiv.org\/abs\/2102.01860","DOI":"10.18653\/v1\/2021.eacl-main.196"},{"issue":"1","key":"e_1_3_2_76_2","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1007\/s00530-023-01230-7","article-title":"BENet: Bi-directional enhanced network for image captioning","volume":"30","author":"Yan Peixin","year":"2024","unstructured":"Peixin Yan, Zuoyong Li, Rong Hu, and Xinrong Cao. 2024. BENet: Bi-directional enhanced network for image captioning. Multimedia Systems 30, 1 (2024), 48.","journal-title":"Multimedia Systems"},{"key":"e_1_3_2_77_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_3_2_78_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_3_2_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299064"},{"key":"e_1_3_2_80_2","first-page":"1608","article-title":"S2 transformer for image captioning","author":"Zeng Pengpeng","year":"2022","unstructured":"Pengpeng Zeng, Haonan Zhang, Jingkuan Song, and Lianli Gao. 2022. S2 transformer for image captioning. 
In IJCAI, 1608\u20131614.","journal-title":"IJCAI"},{"key":"e_1_3_2_81_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3336371"},{"key":"e_1_3_2_82_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01521"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3694683","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3694683","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:05:47Z","timestamp":1750291547000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3694683"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,25]]},"references-count":81,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2024,12,31]]}},"alternative-id":["10.1145\/3694683"],"URL":"https:\/\/doi.org\/10.1145\/3694683","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2024,11,25]]},"assertion":[{"value":"2023-06-18","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-25","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}