{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T15:54:29Z","timestamp":1776441269119,"version":"3.51.2"},"reference-count":72,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,1,11]],"date-time":"2024-01-11T00:00:00Z","timestamp":1704931200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2022YFB4500600"],"award-info":[{"award-number":["2022YFB4500600"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["72188101, 62020106007, 62272144, U20A20183, 62272435, and U22A2094"],"award-info":[{"award-number":["72188101, 62020106007, 62272144, U20A20183, 62272435, and U22A2094"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Major Project of Anhui Province","award":["202203a05020011"],"award-info":[{"award-number":["202203a05020011"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,4,30]]},"abstract":"<jats:p>Generating image captions in different languages is worth exploring and essential for non-native speakers. Nevertheless, collecting paired annotation for every language is time-consuming and impractical, particularly for minor languages. To this end, the cross-lingual image captioning task is proposed, which leverages existing image-source caption annotation data and wild unrelated target corpus to generate satisfactory caption in the target language. 
Current methods perform a two-step translation process of image-to-pivot (source) and pivot-to-target. The distinct two-step process comes with certain caption issues, such as the weak semantic alignment between the image and the generated caption and the generated caption\u2019s non-target language style. To address these issues, we propose an end-to-end reinforcement learning framework with Visual-linguistic-stylistic Triple Reward named TriR. In TriR, we jointly consider the visual, linguistic, and stylistic alignments to generate factual, fluent, and natural captions in the target language. To be specific, the image-source caption annotation provides factual semantic guidance, whereas the unrelated target corpus guides the language style of the generated caption. To achieve this, we construct a visual reward module to measure the cross-modal semantic embedding of the image and the target caption, a linguistic reward module to measure the cross-linguistic embedding of the source and target captions, and a stylistic reward module to imitate the presentation style of the target corpus. TriR can be implemented with either the classical CNN-LSTM or the prevalent Transformer architecture. Extensive experiments are conducted with four cross-lingual settings, i.e., Chinese-to-English, English-to-Chinese, English-to-German, and English-to-French. 
Experimental results demonstrate the remarkable superiority of our method, and sufficient ablation experiments validate the beneficial impact of every reward.<\/jats:p>","DOI":"10.1145\/3634917","type":"journal-article","created":{"date-parts":[[2023,11,28]],"date-time":"2023-11-28T12:16:25Z","timestamp":1701173785000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Visual-linguistic-stylistic Triple Reward for Cross-lingual Image Captioning"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-1590-5886","authenticated-orcid":false,"given":"Jing","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2594-254X","authenticated-orcid":false,"given":"Dan","family":"Guo","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, School of Artificial Intelligence, Hefei University of Technology (HFUT), Intelligent Interconnected Systems Laboratory of Anhui Province(HFUT), Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0201-1638","authenticated-orcid":false,"given":"Xun","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Information Science and Technology, University of Science and Technology of China, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6764-3375","authenticated-orcid":false,"given":"Peipei","family":"Song","sequence":"additional","affiliation":[{"name":"School of Information Science and Technology, University of Science and Technology of China, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3094-7735","authenticated-orcid":false,"given":"Meng","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Information Engineering, Hefei University of Technology, China"}]}],"member":"320","published-online":{"date-parts":[[2024,1,11]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2021.3060948"},{"key":"e_1_3_2_4_2","article-title":"Doubly-attentive decoder for multi-modal neural machine translation","author":"Calixto Iacer","year":"2017","unstructured":"Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural machine translation. arXiv preprint arXiv:1702.01287 (2017).","journal-title":"arXiv preprint arXiv:1702.01287"},{"key":"e_1_3_2_5_2","article-title":"Incorporating global visual features into attention-based neural machine translation","author":"Calixto Iacer","year":"2017","unstructured":"Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Incorporating global visual features into attention-based neural machine translation. arXiv preprint arXiv:1701.06521 (2017).","journal-title":"arXiv preprint arXiv:1701.06521"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_7_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Chen Shizhe","year":"2020","unstructured":"Shizhe Chen, Qin Jin, Peng Wang, and Qi Wu. 2020. Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)."},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11976"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W14-3348"},{"issue":"8","key":"e_1_3_2_10_2","first-page":"4065","article-title":"Dual encoding for video retrieval by text","volume":"44","author":"Dong Jianfeng","year":"2021","unstructured":"Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, and Meng Wang. 2021. Dual encoding for video retrieval by text. IEEE Trans. Pattern Anal. Mach. Intell. 44, 8 (2021), 4065\u20134080.","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"e_1_3_2_11_2","doi-asserted-by":"crossref","unstructured":"Desmond Elliott Stella Frank Khalil Sima\u2019an and Lucia Specia. 2016. Multi30K: Multilingual English-German Image Descriptions. arxiv:1605.00459","DOI":"10.18653\/v1\/W16-3210"},{"key":"e_1_3_2_12_2","article-title":"Imagination improves multimodal translation","author":"Elliott Desmond","year":"2017","unstructured":"Desmond Elliott and Akos K\u00e1d\u00e1r. 2017. Imagination improves multimodal translation. arXiv preprint arXiv:1705.04350 (2017).","journal-title":"arXiv preprint arXiv:1705.04350"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00425"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i10.21310"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_31"},{"key":"e_1_3_2_16_2","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201919)","author":"Gu Jiuxiang","year":"2019","unstructured":"Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, and Gang Wang. 2019. Unpaired image captioning via scene graph alignments. 
In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201919)."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350881"},{"key":"e_1_3_2_18_2","volume-title":"Proceedings of the International Joint Conference on Artificial Intelligence","author":"Guo Dan","year":"2020","unstructured":"Dan Guo, Yang Wang, Peipei Song, and Meng Wang. 2020. Recurrent relational memory network for unsupervised image captioning. In Proceedings of the International Joint Conference on Artificial Intelligence."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12235"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_22_2","unstructured":"Yushi Hu Hang Hua Zhengyuan Yang Weijia Shi Noah A. Smith and Jiebo Luo. 2023. PromptCap: Prompt-Guided Task-aware Image Captioning. arxiv:2211.09699"},{"key":"e_1_3_2_23_2","article-title":"Unsupervised multimodal neural machine translation with pseudo visual pivoting","author":"Huang Po-Yao","year":"2020","unstructured":"Po-Yao Huang, Junjie Hu, Xiaojun Chang, and Alexander Hauptmann. 2020. Unsupervised multimodal neural machine translation with pseudo visual pivoting. arXiv preprint arXiv:2005.03119 (2020).","journal-title":"arXiv preprint arXiv:2005.03119"},{"key":"e_1_3_2_24_2","article-title":"Distilling translations with visual awareness","author":"Ive Julia","year":"2019","unstructured":"Julia Ive, Pranava Madhyastha, and Lucia Specia. 2019. Distilling translations with visual awareness. arXiv preprint arXiv:1906.07701 (2019).","journal-title":"arXiv preprint arXiv:1906.07701"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_26_2","unstructured":"Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. 
arxiv:1412.6980"},{"key":"e_1_3_2_27_2","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201919)","author":"Laina Iro","year":"2019","unstructured":"Iro Laina, Christian Rupprecht, and Nassir Navab. 2019. Towards unsupervised image captioning with shared multimodal embeddings. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201919)."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123366"},{"key":"e_1_3_2_29_2","first-page":"10648","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Li Guozhang","year":"2023","unstructured":"Guozhang Li, De Cheng, Xinpeng Ding, Nannan Wang, Xiaoyu Wang, and Xinbo Gao. 2023. Boosting weakly-supervised temporal action localization with text information. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 10648\u201310657."},{"issue":"2","key":"e_1_3_2_30_2","first-page":"1","article-title":"Guided graph attention learning for video-text matching","volume":"18","author":"Li Kunpeng","year":"2023","unstructured":"Kunpeng Li, Chang Liu, Mike Stopa, Jun Amano, and Yun Fu. 2023. Guided graph attention learning for video-text matching. ACM Trans. Multim. Comput., Commun. Applic. 18, 2s (2023), 1\u201323.","journal-title":"ACM Trans. Multim. Comput., Commun. Applic."},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/2911996.2912049"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2896494"},{"key":"e_1_3_2_33_2","first-page":"17990","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201922)","author":"Li Yehao","year":"2022","unstructured":"Yehao Li, Yingwei Pan, Ting Yao, and Tao Mei. 2022. Comprehending and ordering semantics for image captioning. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201922). 17990\u201317999."},{"key":"e_1_3_2_34_2","first-page":"5216","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Li Yi","year":"2022","unstructured":"Yi Li, Rameswar Panda, Yoon Kim, Chun-Fu Richard Chen, Rogerio S. Feris, David Cox, and Nuno Vasconcelos. 2022. VALHALLA: Visual hallucination for machine translation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 5216\u20135226."},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2022.3218921"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2019.00054"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_21"},{"key":"e_1_3_2_39_2","volume-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201917)","author":"Liu Yu","year":"2017","unstructured":"Yu Liu, Yanming Guo, Erwin M. Bakker, and Michael S. Lew. 2017. Learning a recurrent residual fusion network for multimodal matching. In Proceedings of the IEEE International Conference on Computer Vision (ICCV\u201917)."},{"key":"e_1_3_2_40_2","article-title":"Generative imagination elevates machine translation","author":"Long Quanyu","year":"2020","unstructured":"Quanyu Long, Mingxuan Wang, and Lei Li. 2020. Generative imagination elevates machine translation. arXiv preprint arXiv:2009.09654 (2020).","journal-title":"arXiv preprint arXiv:2009.09654"},{"key":"e_1_3_2_41_2","first-page":"1043","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV\u201923)","author":"Malla Srikanth","year":"2023","unstructured":"Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. 2023. 
DRAMA: Joint risk localization and captioning in driving. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV\u201923). 1043\u20131052."},{"key":"e_1_3_2_42_2","first-page":"1780","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics","author":"Miyazaki Takashi","year":"2016","unstructured":"Takashi Miyazaki and Nobuyuki Shimizu. 2016. Cross-lingual image caption generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 1780\u20131790."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10590-017-9197-z"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475604"},{"key":"e_1_3_2_45_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311\u2013318."},{"key":"e_1_3_2_46_2","article-title":"Sequence level training with recurrent neural networks","volume":"1511","author":"Ranzato Marc\u2019Aurelio","year":"2015","unstructured":"Marc\u2019Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR abs\/1511.06732 (2015).","journal-title":"CoRR"},{"key":"e_1_3_2_47_2","volume-title":"Advances in Neural Information Processing Systems","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. 
Curran Associates, Inc. Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2015\/file\/14bfa6bb14875e45bba028a21ed38046-Paper.pdf"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2023.3241517"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature16961"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2022.3175012"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350996"},{"key":"e_1_3_2_53_2","first-page":"543","volume-title":"Proceedings of the 1st Conference on Machine Translation","author":"Specia Lucia","year":"2016","unstructured":"Lucia Specia, Stella Frank, Khalil Sima\u2019an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the 1st Conference on Machine Translation. 543\u2013553."},{"key":"e_1_3_2_54_2","first-page":"10482","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Su Yuanhang","year":"2019","unstructured":"Yuanhang Su, Kai Fan, Nguyen Bach, C.-C. Jay Kuo, and Fei Huang. 2019. Unsupervised multi-modal neural machine translation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 10482\u201310491."},{"key":"e_1_3_2_55_2","volume-title":"Advances in Neural Information Processing Systems","author":"Sutton Richard S.","year":"1999","unstructured":"Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, S. Solla, T. Leen, and K. M\u00fcller (Eds.), Vol. 12. MIT Press. 
Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/1999\/file\/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf"},{"issue":"11","key":"e_1_3_2_56_2","article-title":"Visualizing data using t-SNE.","volume":"9","author":"Maaten Laurens Van der","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 11 (2008).","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_2_57_2","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240671"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/icme.2019.00256"},{"key":"e_1_3_2_62_2","first-page":"975","volume-title":"Proceedings of the International Joint Conferences on Artificial Intelligence (IJCAI\u201919)","author":"Wu Siying","year":"2019","unstructured":"Siying Wu, Zheng-Jun Zha, Zilei Wang, Houqiang Li, and Feng Wu. 2019. Densely supervised hierarchical policy-value network for image paragraph generation. In Proceedings of the International Joint Conferences on Artificial Intelligence (IJCAI\u201919). 
975\u2013981."},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2941820"},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3191841"},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401151"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3462823"},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413610"},{"key":"e_1_3_2_68_2","first-page":"5098","volume-title":"Proceedings of the 29th International Conference on Computational Linguistics","author":"Ye Junjie","year":"2022","unstructured":"Junjie Ye, Junjun Guo, Yan Xiang, Kaiwen Tan, and Zhengtao Yu. 2022. Noise-robust cross-modal interactive learning with text2image mask for multi-modal neural machine translation. In Proceedings of the 29th International Conference on Computational Linguistics. 5098\u20135108."},{"key":"e_1_3_2_69_2","article-title":"Cross-view language modeling: Towards unified cross-lingual cross-modal pre-training","author":"Zeng Yan","year":"2022","unstructured":"Yan Zeng, Wangchunshu Zhou, Ao Luo, and Xinsong Zhang. 2022. Cross-view language modeling: Towards unified cross-lingual cross-modal pre-training. arXiv preprint arXiv:2206.00621 (2022).","journal-title":"arXiv preprint arXiv:2206.00621"},{"key":"e_1_3_2_70_2","doi-asserted-by":"publisher","DOI":"10.1145\/3383972.3384072"},{"key":"e_1_3_2_71_2","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3502092"},{"key":"e_1_3_2_72_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)","author":"Zhou Yuanen","year":"2020","unstructured":"Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, and Hanwang Zhang. 2020. More grounded image captioning by distilling image-text matching model. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201920)."},{"key":"e_1_3_2_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3214090"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3634917","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3634917","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:51:07Z","timestamp":1750287067000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3634917"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,11]]},"references-count":72,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,4,30]]}},"alternative-id":["10.1145\/3634917"],"URL":"https:\/\/doi.org\/10.1145\/3634917","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,11]]},"assertion":[{"value":"2023-04-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-24","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}