{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T20:11:17Z","timestamp":1759176677590,"version":"3.41.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2024,5,10]],"date-time":"2024-05-10T00:00:00Z","timestamp":1715299200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62102187"],"award-info":[{"award-number":["62102187"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"crossref","award":["BK20210639"],"award-info":[{"award-number":["BK20210639"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["U20B2061"],"award-info":[{"award-number":["U20B2061"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62162031"],"award-info":[{"award-number":["62162031"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100012154","name":"Graduate Research and Innovation Projects of Jiangsu Province","doi-asserted-by":"crossref","award":["SJCX23_0406"],"award-info":[{"award-number":["SJCX23_0406"]}],"id":[{"id":"10.13039\/501100012154","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. 
Lang. Inf. Process."],"published-print":{"date-parts":[[2024,5,31]]},"abstract":"<jats:p>This article focuses on the task of Multi-Modal Summarization with Multi-Modal Output for China JD.COM e-commerce product description containing both source text and source images. In the context learning of multi-modal (text and image) input, there exists a semantic gap between text and image, especially in the cross-modal semantics of text and image. As a result, capturing shared cross-modal semantics earlier becomes crucial for multi-modal summarization. However, when generating the multi-modal summarization, based on the different contributions of input text and images, the relevance and irrelevance of multi-modal contexts to the target summary should be considered, so as to optimize the process of learning cross-modal context to guide the summary generation process and to emphasize the significant semantics within each modality. To address the aforementioned challenges, Multization has been proposed to enhance multi-modal semantic information by multi-contextually relevant and irrelevant attention alignment. Specifically, a Semantic Alignment Enhancement mechanism is employed to capture shared semantics between different modalities (text and image), so as to enhance the importance of crucial multi-modal information in the encoding stage. Additionally, the IR-Relevant Multi-Context Learning mechanism is utilized to observe the summary generation process from both relevant and irrelevant perspectives, so as to form a multi-modal context that incorporates both text and image semantic information. The experimental results in the China JD.COM e-commerce dataset demonstrate that the proposed Multization method effectively captures the shared semantics between the input source text and source images, and highlights essential semantics. 
It also successfully generates the multi-modal summary (including image and text) that comprehensively considers the semantics information of both text and image.<\/jats:p>","DOI":"10.1145\/3651983","type":"journal-article","created":{"date-parts":[[2024,3,9]],"date-time":"2024-03-09T09:42:26Z","timestamp":1709977346000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment"],"prefix":"10.1145","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0542-1827","authenticated-orcid":false,"given":"Huan","family":"Rong","sequence":"first","affiliation":[{"name":"School of Artificial Intelligence, Nanjing University of Information Science & Technology, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-6723-8422","authenticated-orcid":false,"given":"Zhongfeng","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Nanjing University of Information Science & Technology, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5066-4716","authenticated-orcid":false,"given":"Zhenyu","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Nanjing University of Information Science & Technology, Nanjing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7477-9331","authenticated-orcid":false,"given":"Fan","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4960-174X","authenticated-orcid":false,"given":"Victor S","family":"Sheng","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Texas Tech University, Lubbock, 
USA"}]}],"member":"320","published-online":{"date-parts":[[2024,5,10]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2798607"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2014.2384912"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1438"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612408"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.473"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00855"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413678"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01428"},{"key":"e_1_3_2_11_2","doi-asserted-by":"crossref","unstructured":"Lifeng Hua Xiaojun Wan and Lei Li. 2018. Overview of the NLPCC 2017 shared task: Single document summarization. In Natural Language Processing and Chinese Computing. Lecture Notes in Computer Science Vol. 10619. 
Springer 942\u2013947.","DOI":"10.1007\/978-3-319-73618-1_84"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00069"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3584700"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3561819"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.nlpbt-1.7"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIFS.2018.2866319"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6332"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.5555\/3304222.3304347"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1114"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.coling-main.496"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.752"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1210"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-2031"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413715"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00749"},{"key":"e_1_3_2_26_2","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang Long","year":"2022","unstructured":"Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. 
Advances in Neural Information Processing Systems 35 (2022), 27730\u201327744.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1659"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.748"},{"key":"e_1_3_2_29_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. 8748\u20138763."},{"issue":"1","key":"e_1_3_2_30_2","first-page":"1","article-title":"Denigrate comment detection in low-resource Hindi language using attention-based residual networks","volume":"21","author":"Sangwan Saurabh R.","year":"2021","unstructured":"Saurabh R. Sangwan and M. P. S. Bhatia. 2021. Denigrate comment detection in low-resource Hindi language using attention-based residual networks. Transactions on Asian and Low-Resource Language Information Processing 21, 1 (2021), 1\u201314.","journal-title":"Transactions on Asian and Low-Resource Language Information Processing"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1099"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2013.08.015"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSS.2022.3158605"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSS.2021.3082942"},{"key":"e_1_3_2_35_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. 
Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017), 1\u201311.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2019.10.103"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSS.2021.3072153"},{"key":"e_1_3_2_38_2","article-title":"CFSum: A coarse-to-fine contribution network for multimodal summarization","author":"Xiao Min","year":"2023","unstructured":"Min Xiao, Junnan Zhu, Haitao Lin, Yu Zhou, and Chengqing Zong. 2023. CFSum: A coarse-to-fine contribution network for multimodal summarization. arXiv preprint arXiv:2307.02716 (2023).","journal-title":"arXiv preprint arXiv:2307.02716"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSS.2020.2986778"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2018.2876317"},{"key":"e_1_3_2_41_2","article-title":"A novel graph-based multi-modal fusion encoder for neural machine translation","author":"Yin Yongjing","year":"2020","unstructured":"Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, and Jiebo Luo. 2020. A novel graph-based multi-modal fusion encoder for neural machine translation. 
arXiv preprint arXiv:2007.08742 (2020).","journal-title":"arXiv preprint arXiv:2007.08742"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2023.103986"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i10.21422"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i10.21431"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3596219"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1101"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1448"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3445794"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6525"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3651983","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3651983","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T17:49:13Z","timestamp":1750268953000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3651983"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,10]]},"references-count":48,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2024,5,31]]}},"alternative-id":["10.1145\/3651983"],"URL":"https:\/\/doi.org\/10.1145\/3651983","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2024,5,10]]},"assertion":[{"value":"2023-09-07","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2024-03-05","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-05-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}