{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:13:25Z","timestamp":1750220005240,"version":"3.41.0"},"reference-count":58,"publisher":"Association for Computing Machinery (ACM)","issue":"1s","license":[{"start":{"date-parts":[[2023,2,3]],"date-time":"2023-02-03T00:00:00Z","timestamp":1675382400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key Research and Development Project","award":["2020AAA0106200"],"award-info":[{"award-number":["2020AAA0106200"]}]},{"DOI":"10.13039\/501100001809","name":"National Nature Science Foundation of China","doi-asserted-by":"crossref","award":["61936005, 61872424"],"award-info":[{"award-number":["61936005, 61872424"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"crossref","award":["BK20200037, and BK20210595"],"award-info":[{"award-number":["BK20200037, and BK20210595"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,2,28]]},"abstract":"<jats:p>\n            Image caption editing, which aims at editing the inaccurate descriptions of the images, is an interdisciplinary task of computer vision and natural language processing. As the task requires encoding the image and its corresponding inaccurate caption simultaneously and decoding to generate an accurate image caption, the encoder-decoder framework is widely adopted for image caption editing. However, existing methods mostly focus on the decoder, yet ignore a big challenge on the encoder: the semantic inconsistency between image and caption. To this end, we propose a novel\n            <jats:bold>A<\/jats:bold>\n            daptive\n            <jats:bold>T<\/jats:bold>\n            ext\n            <jats:bold>D<\/jats:bold>\n            enoising\n            <jats:bold>Net<\/jats:bold>\n            work (ATD-Net) to filter out noises at the word level and improve the model\u2019s robustness at sentence level. Specifically, at the word level, we design a cross-attention mechanism called Textual Attention Mechanism (TAM), to differentiate the misdescriptive words. The TAM is designed to encode the inaccurate caption word by word based on the content of both image and caption. At the sentence level, in order to minimize the influence of misdescriptive words on the semantic of an entire caption, we introduce a Bidirectional Encoder to extract the correct semantic representation from the raw caption. The Bidirectional Encoder is able to model the global semantics of the raw caption, which enhances the robustness of the framework. 
We extensively evaluate our proposals on the MS-COCO image captioning dataset and demonstrate the effectiveness of our method when compared with state-of-the-art methods.\n          <\/jats:p>","DOI":"10.1145\/3532627","type":"journal-article","created":{"date-parts":[[2022,7,14]],"date-time":"2022-07-14T11:17:31Z","timestamp":1657797451000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Adaptive Text Denoising Network for Image Caption Editing"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9183-3267","authenticated-orcid":false,"given":"Mengqi","family":"Yuan","sequence":"first","affiliation":[{"name":"Nanjing University of Posts and Telecommunications, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5956-831X","authenticated-orcid":false,"given":"Bing-Kun","family":"Bao","sequence":"additional","affiliation":[{"name":"Nanjing University of Posts and Telecommunications, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1209-2817","authenticated-orcid":false,"given":"Zhiyi","family":"Tan","sequence":"additional","affiliation":[{"name":"Nanjing University of Posts and Telecommunications, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8343-9665","authenticated-orcid":false,"given":"Changsheng","family":"Xu","sequence":"additional","affiliation":[{"name":"Peng Cheng Laboratory; University of Chinese Academy of Sciences; NLPR, Institute of Automation, CAS, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2023,2,3]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"e_1_3_1_3_2","doi-asserted-by":"crossref","unstructured":"A. Karpathy and L. Fei-Fei. 2017. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 39, 4 (2017), 664\u2013676.","DOI":"10.1109\/TPAMI.2016.2598339"},{"key":"e_1_3_1_4_2","first-page":"2048","volume-title":"Proceedings of the International Conference on Machine Learning.","author":"Xu K.","year":"2015","unstructured":"K. Xu et\u00a0al. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. 2048\u20132057."},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.131"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_1_7_2","volume-title":"Proceedings of the British Machine Vision Conference","author":"Sammani F.","year":"2019","unstructured":"F. Sammani and M. Elsayed. 2019. Look and modify: Modification networks for image captioning. In Proceedings of the British Machine Vision Conference. 75."},{"key":"e_1_3_1_8_2","first-page":"4807","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Sammani F.","year":"2020","unstructured":"F. Sammani and M. Elsayed. 2020. Show, edit and tell: A framework for editing image captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4807\u20134815."},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2931815"},{"key":"e_1_3_1_10_2","first-page":"2149","article-title":"Show, tell and polish: Ruminant decoding for image captioning","author":"Guo L.","year":"2019","unstructured":"L. Guo, J. Liu, S. Lu, and H. Lu. 2019. Show, tell and polish: Ruminant decoding for image captioning. 
IEEE Transactions on Multimedia. 2149\u20132162.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01034"},{"key":"e_1_3_1_12_2","first-page":"5998","volume-title":"Proceedings of the Advances in Neural Information Processing Systems.","author":"Vaswani A.","year":"2017","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. 5998\u20136008."},{"key":"e_1_3_1_13_2","first-page":"711","volume-title":"Proceedings of the European Conference on Computer Vision.","author":"Yao T.","year":"2019","unstructured":"T. Yao, Y. Pan, Y. Li, and T. Mei. 2019. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision. 711\u2013727."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01098"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00473"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.667"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.345"},{"key":"e_1_3_1_18_2","first-page":"3104","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Sutskever I.","year":"2014","unstructured":"I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems. 3104\u20133112."},{"key":"e_1_3_1_19_2","volume-title":"Proceedings of the 3rd Int. Conf. Learn. Representations","author":"Bahdanau D.","year":"2015","unstructured":"D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd Int. Conf. Learn. Representations."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.3004729"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2969330"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2941820"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00998"},{"key":"e_1_3_1_24_2","first-page":"694","volume-title":"IEEE Transactions on Image Processing.","author":"Zhou L.","year":"2019","unstructured":"L. Zhou, Y. Zhang, Y. Jiang, T. Zhang, and W. Fan. 2019. Re-caption: Saliency-enhanced image captioning through two-phase learning. IEEE Transactions on Image Processing. 694\u2013709."},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.3042192"},{"key":"e_1_3_1_27_2","volume-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence","author":"Zha Z.","year":"2019","unstructured":"Z. Zha, D. Liu, H. Zhang, Y. Zhang, and F. Wu. 2019. Context-aware visual policy network for fine-grained image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00433"},{"key":"e_1_3_1_29_2","first-page":"4171","volume-title":"Proceedings of the North American Chapter of the Association for Computational Linguistics","author":"Devlin J.","year":"2018","unstructured":"J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2018. 
BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics. 4171\u20134186."},{"key":"e_1_3_1_30_2","first-page":"2556","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.","author":"Soricut R.","year":"2018","unstructured":"R. Soricut, N. Ding, P. Sharma, and S. Goodman. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2556\u20132565."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_1_33_2","first-page":"664","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.","author":"Karpathy A.","year":"2015","unstructured":"A. Karpathy and L. Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 664\u2013676."},{"key":"e_1_3_1_34_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.","author":"Papineni K.","year":"2002","unstructured":"K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311\u2013318."},{"key":"e_1_3_1_35_2","first-page":"1","volume-title":"Proceedings of the Text Summarization Branches Out.","author":"Lin C.","year":"2004","unstructured":"C. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out. 1\u20138."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_1_38_2","first-page":"7","volume-title":"Proceedings of the 3rd Int. Conf. Learn. Representations.","author":"Kingma D. P.","year":"2015","unstructured":"D. P. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd Int. Conf. Learn. Representations. 7\u20139."},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01216-8_31"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_3_1_41_2","first-page":"1724","volume-title":"Proceedings of the Conf. Empirical Methods Natural Lang. Process.","author":"Cho K.","year":"2014","unstructured":"K. Cho, B. van Merri\u00ebnboer, \u00c7. G\u00fcl\u00e7ehre, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conf. Empirical Methods Natural Lang. Process. 1724\u20131734."},{"key":"e_1_3_1_42_2","volume-title":"Proceedings of the 3rd Int. Conf. Learn. Representations.","author":"Bahdanau D.","year":"2015","unstructured":"D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd Int. Conf. Learn. Representations."},
{"key":"e_1_3_1_43_2","first-page":"1","volume-title":"Proceedings of the Assoc. Comput. Linguistics.","author":"Jean S.","year":"2015","unstructured":"S. Jean, K. Cho, R. Memisevic, and Y. Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the Assoc. Comput. Linguistics. 1\u201310."},{"key":"e_1_3_1_44_2","first-page":"1412","volume-title":"Proceedings of the Conf. Empirical Methods Natural Lang. Process.","author":"Luong T.","year":"2015","unstructured":"T. Luong, H. Pham, and C. D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conf. Empirical Methods Natural Lang. Process. 1412\u20131421."},{"key":"e_1_3_1_45_2","first-page":"5053","volume-title":"Proceedings of the Conf. Empirical Methods Natural Lang. Process.","author":"Malmi E.","year":"2019","unstructured":"E. Malmi, S. Krause, S. Rothe, D. Mirylenka, and A. Severyn. 2019. Encode, tag, realize: High-precision text editing. In Proceedings of the Conf. Empirical Methods Natural Lang. Process. 5053\u20135064."},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.159"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1420"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1412"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_1_50_2","first-page":"8940","volume-title":"Advances in Neural Information Processing Systems.","author":"Huang L.","year":"2019","unstructured":"L. Huang, W. Wang, Y. Xia, and J. Chen. 2019. Adaptively aligned image captioning via adaptive attention time. In Advances in Neural Information Processing Systems. 8940\u20138949."},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00856"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6898"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i3.16328"},{"key":"e_1_3_1_54_2","volume-title":"IEEE Transactions on Multimedia","author":"Liu A.","year":"2020","unstructured":"A. Liu, Y. Wang, N. Xu, W. Nie, J. Nie, and Y. Zhang. 2020. Adaptively clustering-driven learning for visual relationship detection. IEEE Transactions on Multimedia."},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01602"},{"key":"e_1_3_1_56_2","volume-title":"IEEE Transactions on Multimedia","author":"Wang J.","year":"2021","unstructured":"J. Wang, B. Bao, and C. Xu. 2021. DualVGR: A dual-visual graph reasoning unit for video question answering. 
IEEE Transactions on Multimedia 14, 8 (2021)."},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME51207.2021.9428120"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3414012"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01355"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3532627","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3532627","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:04Z","timestamp":1750182664000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3532627"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,3]]},"references-count":58,"journal-issue":{"issue":"1s","published-print":{"date-parts":[[2023,2,28]]}},"alternative-id":["10.1145\/3532627"],"URL":"https:\/\/doi.org\/10.1145\/3532627","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2023,2,3]]},"assertion":[{"value":"2022-01-05","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-04-17","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-03","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}