{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,15]],"date-time":"2025-11-15T10:34:46Z","timestamp":1763202886481,"version":"3.41.0"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"10","license":[{"start":{"date-parts":[[2024,9,12]],"date-time":"2024-09-12T00:00:00Z","timestamp":1726099200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62376111, U23A20388, U21B2027, 62322211"],"award-info":[{"award-number":["62376111, U23A20388, U21B2027, 62322211"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Yunnan High-tech Industry Development Project","award":["201606"],"award-info":[{"award-number":["201606"]}]},{"name":"Yunnan Key Research and Development Plan","award":["202303AP140008, 202302AD080003, 202401BC070021, 202103AA080015"],"award-info":[{"award-number":["202303AP140008, 202302AD080003, 202401BC070021, 202103AA080015"]}]},{"name":"Reserve Talents for Academic and Technological Leaders in Yunnan Province","award":["202105AC160018"],"award-info":[{"award-number":["202105AC160018"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,10,31]]},"abstract":"<jats:p>Change captioning aims to describe the difference within an image pair in natural language, which combines visual comprehension and language generation. Although significant progress has been achieved, it remains a key challenge of perceiving the object change from different perspectives, especially the severe situation with drastic viewpoint change. 
In this article, we propose a novel full-attentive network, namely Multi-grained Representation Aggregating Transformer (MURAT), to distinguish the actual change from viewpoint change. Specifically, the Pair Encoder first captures similar semantics between pairwise objects in a multi-level manner, which are regarded as the semantic cues of distinguishing the irrelevant change. Next, a novel Multi-grained Representation Aggregator (MRA) is designed to construct the reliable difference representation by employing both coarse- and fine-grained semantic cues. Finally, the language decoder generates a description of the change based on the output of MRA. Besides, the Gating Cycle Mechanism is introduced to facilitate the semantic consistency between difference representation learning and language generation with a reverse manipulation, so as to bridge the semantic gap between change features and text features. Extensive experiments demonstrate that the proposed MURAT can greatly improve the ability to describe the actual change in the distraction of irrelevant change and achieves state-of-the-art performance on three benchmarks, CLEVR-Change, CLEVR-DC, and Spot-the-Diff.<\/jats:p>","DOI":"10.1145\/3660346","type":"journal-article","created":{"date-parts":[[2024,4,22]],"date-time":"2024-04-22T10:54:00Z","timestamp":1713783240000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Multi-Grained Representation Aggregating Transformer with Gating Cycle for Change Captioning"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6764-1756","authenticated-orcid":false,"given":"Shengbin","family":"Yue","sequence":"first","affiliation":[{"name":"Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9525-9060","authenticated-orcid":false,"given":"Yunbin","family":"Tu","sequence":"additional","affiliation":[{"name":"University of Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1943-8219","authenticated-orcid":false,"given":"Liang","family":"Li","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2980-8420","authenticated-orcid":false,"given":"Shengxiang","family":"Gao","sequence":"additional","affiliation":[{"name":"Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China and Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4012-461X","authenticated-orcid":false,"given":"Zhengtao","family":"Yu","sequence":"additional","affiliation":[{"name":"Kunming University of Science and Technology, Kunming, China"}]}],"member":"320","published-online":{"date-parts":[[2024,9,12]]},"reference":[{"key":"e_1_3_1_2_2","volume-title":"Proceedings of the Neural Information Processing Systems","author":"Paszke Adam","year":"2017","unstructured":"Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Proceedings of the Neural Information Processing Systems."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_3_1_4_2","unstructured":"Jimmy Lei Ba Jamie Ryan Kiros and Geoffrey E. Hinton. 2016. Layer normalization. arXiv:1607.06450. 
Retrieved from https:\/\/arxiv.org\/abs\/1607.06450"},{"key":"e_1_3_1_5_2","first-page":"65","volume-title":"Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization. 65\u201372."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1016\/S1053-8119(03)00406-3"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00307"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548206"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01059"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2021.3063423"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298878"},{"issue":"8","key":"e_1_3_1_13_2","first-page":"4065","article-title":"Dual encoding for video retrieval by text","volume":"44","author":"Dong Jianfeng","year":"2021","unstructured":"Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, and Meng Wang. 2021. Dual encoding for video retrieval by text. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 8 (2021), 4065\u20134080.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2011.2170702"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475619"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3157136"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_18_2","unstructured":"Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv:1606.08415. Retrieved from https:\/\/arxiv.org\/abs\/1606.08415"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683307"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00275"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","unstructured":"Genc Hoxha Seloua Chouaf Farid Melgani and Youcef Smara. 2022. Change captioning: A new paradigm for multitemporal remote sensing image analysis. IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1\u201314. DOI:10.1109\/TGRS.2022.3195692","DOI":"10.1109\/TGRS.2022.3195692"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2021.3074803"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1436"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3460474"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.215"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00210"},{"key":"e_1_3_1_27_2","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. 
Retrieved from https:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.3031173"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","unstructured":"Liang Li Xingyu Gao Jincan Deng Yunbin Tu Zheng-Jun Zha and Qingming Huang. 2022. Long short-term relation transformer with global gating for video captioning. IEEE Transactions on Image Processing 31 (2022) 2726\u20132738. DOI:10.1109\/TIP.2022.3158546","DOI":"10.1109\/TIP.2022.3158546"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2015.04.108"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3359753"},{"issue":"3","key":"e_1_3_1_32_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3489142","article-title":"Inner knowledge-based Img2Doc scheme for visual question answering","volume":"18","author":"Li Qun","year":"2022","unstructured":"Qun Li, Fu Xiao, Bir Bhanu, Biyun Sheng, and Richang Hong. 2022. Inner knowledge-based Img2Doc scheme for visual question answering. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18, 3 (2022), 1\u201321.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)"},{"key":"e_1_3_1_33_2","first-page":"74","volume-title":"Proceedings of the Workshop on Text Summarization Branches Out","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. ROUGE : A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out. 74\u201381."},{"issue":"3","key":"e_1_3_1_34_2","first-page":"3003","article-title":"Entity-enhanced adaptive reconstruction network for weakly supervised referring expression grounding","volume":"45","author":"Liu Xuejing","year":"2022","unstructured":"Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Zechao Li, Qi Tian, and Qingming Huang. 2022. 
Entity-enhanced adaptive reconstruction network for weakly supervised referring expression grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2022), 3003\u20133018.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"issue":"4","key":"e_1_3_1_35_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3409388","article-title":"Adaptive attention-based high-level semantic introduction for image caption","volume":"16","author":"Liu Xiaoxiao","year":"2020","unstructured":"Xiaoxiao Liu and Qingyang Xu. 2020. Adaptive attention-based high-level semantic introduction for image caption. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 4 (2020), 1\u201322.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i3.16328"},{"key":"e_1_3_1_37_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 
311\u2013318."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00472"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00198"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00681"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2020.105920"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58568-6_34"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/34.868677"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1002\/int.22821"},{"key":"e_1_3_1_45_2","doi-asserted-by":"crossref","unstructured":"Hao Tan Franck Dernoncourt Zhe Lin Trung Bui and Mohit Bansal. 2019. Expressing visual relationships via language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 1873\u20131883.","DOI":"10.18653\/v1\/P19-1182"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","unstructured":"Yunbin Tu Liang Li Li Su Junping Du Ke Lu and Qingming Huang. 2023. Viewpoint-adaptive representation disentanglement network for change captioning. IEEE Transactions on Image Processing 32 (2023) 2620\u20132635. DOI:10.1109\/TIP.2023.3268004","DOI":"10.1109\/TIP.2023.3268004"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","unstructured":"Yunbin Tu Liang Li Li Su Shengxiang Gao Chenggang Yan Zheng-Jun Zha Zhengtao Yu and Qingming Huang. 2022. I2Transformer: Intra- and inter-relation embedding transformer for TV show captioning. IEEE Transactions on Image Processing 31 (2022) 3565\u20133577. DOI:10.1109\/TIP.2022.3159472","DOI":"10.1109\/TIP.2022.3159472"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","unstructured":"Yunbin Tu Liang Li Li Su Ke Lu and Qingming Huang. 2023. Neighborhood contrastive transformer for change captioning. IEEE Transactions on Multimedia 25 (2023) 9518\u20139529. 
DOI:10.1109\/TMM.2023.3254162","DOI":"10.1109\/TMM.2023.3254162"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3365104"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00263"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.735"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.6"},{"key":"e_1_3_1_53_2","doi-asserted-by":"crossref","unstructured":"Yunbin Tu Chang Zhou Junjun Guo Huafeng Li Shengxiang Gao and Zhengtao Yu. 2023. Relation-aware attention for video captioning via graph learning. Pattern Recognition 136 (2023) 109204.","DOI":"10.1016\/j.patcog.2022.109204"},{"key":"e_1_3_1_54_2","first-page":"5998","volume-title":"Proceedings of the International Conference on Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the International Conference on Neural Information Processing Systems. 5998\u20136008."},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00660"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00267"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3446618"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i3.20218"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","unstructured":"Litao Yu Jian Zhang and Qiang Wu. 2022. Dual attention on pyramid feature maps for image captioning. IEEE Transactions on Multimedia 24 (2022) 1775\u20131786. DOI:10.1109\/TMM.2021.3072479","DOI":"10.1109\/TMM.2021.3072479"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","unstructured":"Shengbin Yue Yunbin Tu Liang Li Ying Yang Shengxiang Gao and Zhengtao Yu. 2023. 
I3N: Intra- and Inter-representation interaction network for change captioning. IEEE Transactions on Multimedia 25 (2023) 8828\u20138841. DOI:10.1109\/TMM.2023.3242142","DOI":"10.1109\/TMM.2023.3242142"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2022\/224"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3478642"},{"key":"e_1_3_1_63_2","doi-asserted-by":"crossref","unstructured":"Qian Zhang Wei Feng Yi-Bo Shi and Di Lin. 2022. Fast and robust active camera relocalization in the wild for fine-grained change detection. Neurocomputing 495 (2022) 11\u201325.","DOI":"10.1016\/j.neucom.2022.04.102"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.244"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3660346","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3660346","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:56:49Z","timestamp":1750291009000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3660346"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,12]]},"references-count":63,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2024,10,31]]}},"alternative-id":["10.1145\/3660346"],"URL":"https:\/\/doi.org\/10.1145\/3660346","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2024,9,12]]},"assertion":[{"value":"2023-07-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2024-04-07","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-09-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}