{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T19:50:23Z","timestamp":1769629823328,"version":"3.49.0"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2022,3,4]],"date-time":"2022-03-04T00:00:00Z","timestamp":1646352000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Chinese Scientific and Technical Innovation Project 2030","award":["2018AAA0102100"],"award-info":[{"award-number":["2018AAA0102100"]}]},{"DOI":"10.13039\/100017053","name":"NSFC-Xinjiang Joint Fund","doi-asserted-by":"crossref","award":["U1903128"],"award-info":[{"award-number":["U1903128"]}],"id":[{"id":"10.13039\/100017053","id-type":"DOI","asserted-by":"crossref"}]},{"name":"NSFC-General Technology Joint Fund for Basic Research","award":["U1936206"],"award-info":[{"award-number":["U1936206"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2022,8,31]]},"abstract":"<jats:p>Image captioning for low-resource languages has attracted much attention recently. Researchers propose to augment the low-resource caption dataset into (image, rich-resource language, and low-resource language) triplets and develop the dual attention mechanism to exploit the existence of triplets in training to improve the performance. However, datasets in triplet form are usually small due to their high collecting cost. On the other hand, there are already many large-scale datasets, which contain one pair from the triplet, such as caption datasets in the rich-resource language and translation datasets from the rich-resource language to the low-resource language. In this article, we revisit the caption-translation pipeline of the translation-based approach to utilize not only the triplet dataset but also large-scale paired datasets in training. The caption-translation pipeline is composed of two models, one caption model of the rich-resource language and one translation model from the rich-resource language to the low-resource language. Unfortunately, it is not trivial to fully benefit from incorporating both the triplet dataset and paired datasets into the pipeline, due to the gap between the training and testing phases and the instability in the training process. We propose to jointly optimize the two models of the pipeline in an end-to-end manner to bridge the training and testing gap, and introduce two auxiliary training objectives to stabilize the training process. Experimental results show that the proposed method improves significantly over the state-of-the-art methods.<\/jats:p>","DOI":"10.1145\/3492325","type":"journal-article","created":{"date-parts":[[2022,3,4]],"date-time":"2022-03-04T10:26:32Z","timestamp":1646389592000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["When Pairs Meet Triplets: Improving Low-Resource Captioning via Multi-Objective Optimization"],"prefix":"10.1145","volume":"18","author":[{"given":"Yike","family":"Wu","sequence":"first","affiliation":[{"name":"College of Computer Science, Nankai University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shiwan","family":"Zhao","sequence":"additional","affiliation":[{"name":"IBM Research - China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ying","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Computer Science, Nankai University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaojie","family":"Yuan","sequence":"additional","affiliation":[{"name":"College of Computer Science, Nankai University, Tianjin, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhong","family":"Su","sequence":"additional","affiliation":[{"name":"IBM Research - China, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,3,4]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_3_4_2","volume-title":"Proceedings of the 3rd International Conference on Learning Representations. 2015","author":"Bahdanau Dzmitry","year":"2015","unstructured":"Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations. 2015."},{"key":"e_1_3_3_5_2","first-page":"65","volume-title":"Proceedings of the Acl Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Acl Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. 65\u201372."},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00998"},{"key":"e_1_3_3_7_2","unstructured":"Kyunghyun Cho Bart van Merri\u00ebnboer Dzmitry Bahdanau and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder\u2013decoder approaches. In Proceedings of SSST-8 8th Workshop on Syntax Semantics and Structure in Statistical Translation 103."},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2018.10.054"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2019.04.095"},{"key":"e_1_3_3_11_2","article-title":"Multilingual image description with neural sequence models","author":"Elliott Desmond","year":"2015","unstructured":"Desmond Elliott, Stella Frank, and Eva Hasler. 2015. Multilingual image description with neural sequence models. arXiv:1510.04709 . Retrieved from https:\/\/arxiv.org\/abs\/1510.04709.","journal-title":"arXiv:1510.04709"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W16-3210"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12016"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_31"},{"key":"e_1_3_3_15_2","article-title":"Statistical theory of extreme values and some practical applications","volume":"33","author":"Gumbel Emil Julius","year":"1954","unstructured":"Emil Julius Gumbel. 1954. Statistical theory of extreme values and some practical applications. NBS Applied Mathematics Series 33 (1954), 51.","journal-title":"NBS Applied Mathematics Series"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1083"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-4750"},{"key":"e_1_3_3_20_2","article-title":"Categorical reparameterization with gumbel-softmax","author":"Jang Eric","year":"2016","unstructured":"Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv:1611.01144. Retrieved from https:\/\/arxiv.org\/abs\/1611.01144.","journal-title":"arXiv:1611.01144"},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00898"},{"key":"e_1_3_3_22_2","article-title":"Adam: A method for stochastic optimization","author":"Kingma Diederik P.","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https:\/\/arxiv.org\/abs\/1412.6980.","journal-title":"arXiv:1412.6980"},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123366"},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/2911996.2912049"},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00754"},{"key":"e_1_3_3_28_2","unstructured":"Chris J. Maddison Andriy Mnih and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the International Conference on Learning Representations . International Conference on Learning Representations."},{"key":"e_1_3_3_29_2","first-page":"3086","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Maddison Chris J","year":"2014","unstructured":"Chris J Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. In Proceedings of the Advances in Neural Information Processing Systems. 3086\u20133094."},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1168"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01098"},{"key":"e_1_3_3_32_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting on Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 311\u2013318."},{"key":"e_1_3_3_33_2","first-page":"91","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems. 91\u201399."},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.130"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2019.00070"},{"key":"e_1_3_3_39_2","first-page":"2048","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Xu Kelvin","year":"2015","unstructured":"Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. 2048\u20132057."},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00271"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.503"},{"key":"e_1_3_3_44_2","doi-asserted-by":"crossref","unstructured":"Peter Young Alice Lai Micah Hodosh and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014) 67\u201378.","DOI":"10.1162\/tacl_a_00166"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3492325","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3492325","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:31:09Z","timestamp":1750188669000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3492325"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,4]]},"references-count":43,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2022,8,31]]}},"alternative-id":["10.1145\/3492325"],"URL":"https:\/\/doi.org\/10.1145\/3492325","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,3,4]]},"assertion":[{"value":"2020-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-03-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}