{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,8]],"date-time":"2025-11-08T13:44:42Z","timestamp":1762609482469},"publisher-location":"California","reference-count":0,"publisher":"International Joint Conferences on Artificial Intelligence Organization","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:p>Multimodal abstractive summarization for videos (MAS) requires generating a concise textual summary to describe the highlights of a video according to multimodal resources, in our case, the video content and its transcript. Inspired by the success of the large-scale generative pre-trained language model (GPLM) in generating high-quality textual content (e.g., summary), recent MAS methods have proposed to adapt the GPLM to this task by equipping it with the visual information, which is often obtained through a general-purpose visual feature extractor. However, the generally extracted visual features may overlook some summary-worthy visual information, which impedes model performance. In this work, we propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization. Our method exploits the summary-worthy information from both the cross-modal transcript data and the knowledge that distills from the pseudo summary. Extensive experiments on three public multimodal datasets show that our method outperforms all competing baselines. Furthermore, with the advantages of summary-worthy visual information, our model can have a significant improvement on small datasets or even datasets with limited training data.<\/jats:p>","DOI":"10.24963\/ijcai.2023\/582","type":"proceedings-article","created":{"date-parts":[[2023,8,11]],"date-time":"2023-08-11T08:31:30Z","timestamp":1691742690000},"page":"5242-5250","source":"Crossref","is-referenced-by-count":7,"title":["Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video"],"prefix":"10.24963","author":[{"given":"Zenan","family":"Xu","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University"}]},{"given":"Xiaojun","family":"Meng","sequence":"additional","affiliation":[{"name":"Noah's Ark Lab, Huawei Technologies"}]},{"given":"Yasheng","family":"Wang","sequence":"additional","affiliation":[{"name":"Noah's Ark Lab, Huawei Technologies"}]},{"given":"Qinliang","family":"Su","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University"},{"name":"Guangdong Key Laboratory of Big Data Analysis and Processing"}]},{"given":"Zexuan","family":"Qiu","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong"}]},{"given":"Xin","family":"Jiang","sequence":"additional","affiliation":[{"name":"Noah's Ark Lab, Huawei Technologies"}]},{"given":"Qun","family":"Liu","sequence":"additional","affiliation":[{"name":"Noah's Ark Lab, Huawei Technologies"}]}],"member":"10584","event":{"number":"32","sponsor":["International Joint Conferences on Artificial Intelligence Organization (IJCAI)"],"acronym":"IJCAI-2023","name":"Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}","start":{"date-parts":[[2023,8,19]]},"theme":"Artificial Intelligence","location":"Macau, SAR China","end":{"date-parts":[[2023,8,25]]}},"container-title":["Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence"],"original-title":[],"deposited":{"date-parts":[[2023,8,11]],"date-time":"2023-08-11T08:51:58Z","timestamp":1691743918000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.ijcai.org\/proceedings\/2023\/582"}},"subtitle":[],"proceedings-subject":"Artificial Intelligence Research Articles","short-title":[],"issued":{"date-parts":[[2023,8]]},"references-count":0,"URL":"https:\/\/doi.org\/10.24963\/ijcai.2023\/582","relation":{},"subject":[],"published":{"date-parts":[[2023,8]]}}}