{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T17:07:28Z","timestamp":1778692048142,"version":"3.51.4"},"reference-count":212,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>\n            The proliferation of information-sharing platforms and the ease of access to diverse resources have led to an overwhelming volume of multimodal data that is increasingly difficult to process effectively. The integration of multiple data types, including text, images, video, and audio, highlights the growing importance of Multimodal Text Summarization (MMTS). Collecting and synthesizing existing research on this topic can provide a comprehensive foundation for advancing the field. Following a Systematic Literature Review (SLR) methodology, we addressed three pivotal research questions concerning methodologies, evaluation measures, and datasets in MMTS. Through a\n            <jats:bold>systematic analysis of 132 papers<\/jats:bold>\n            , we examined the strategies employed to address MMTS challenges, assessed the evaluation methods used to quantify performance, and compiled a detailed list of available datasets along with their limitations. 
This review offers critical insights and identifies future research directions, aiming to inform and guide continued innovation in this dynamic and evolving domain.\n          <\/jats:p>","DOI":"10.1145\/3763245","type":"journal-article","created":{"date-parts":[[2025,8,20]],"date-time":"2025-08-20T09:48:57Z","timestamp":1755683337000},"page":"1-38","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["A Systematic Literature Review on Multimodal Text Summarization"],"prefix":"10.1145","volume":"58","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0691-1695","authenticated-orcid":false,"given":"Abid","family":"Ali","sequence":"first","affiliation":[{"name":"School of Computing, Macquarie University","place":["Sydney, Australia"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4973-0963","authenticated-orcid":false,"given":"Diego","family":"Molla","sequence":"additional","affiliation":[{"name":"School of Computing, Macquarie University","place":["Sydney, Australia"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,9,29]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","unstructured":"Laith Abualigah Mohammad Qassem Bashabsheh Hamzeh Alabool and Mohammad Shehab. 2020. Text summarization: A brief review. In Recent Advances in NLP: The Case of Arabic Language Mohamed Abd Elaziz Mohammed A. A. Al-qaness Ahmed A. Ewees and Abdelghani Dahou (Eds.). Springer International Publishing Cham 1\u201315. 
DOI:10.1007\/978-3-030-34614-0_1","DOI":"10.1007\/978-3-030-34614-0_1"},{"key":"e_1_3_3_3_2","doi-asserted-by":"crossref","first-page":"1934","DOI":"10.1109\/COMPSAC61105.2024.00307","volume-title":"2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)","author":"Alam Md Jahangir","year":"2024","unstructured":"Md Jahangir Alam, Ismail Hossain, Sai Puppala, and Sajedul Talukder. 2024. Advancements in multimodal social media post summarization: Integrating GPT-4 for enhanced understanding. In 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 1934\u20131940."},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3462777"},{"key":"e_1_3_3_5_2","volume-title":"Towards Subjective Multimedia Summarization Framework for Sporting Event in the Context of Digital Twins","author":"Aloufi Samah Bader","year":"2020","unstructured":"Samah Bader Aloufi. 2020. Towards Subjective Multimedia Summarization Framework for Sporting Event in the Context of Digital Twins. Ph.D. Dissertation. Universit\u00e9 d\u2019Ottawa\/University of Ottawa."},{"key":"e_1_3_3_6_2","first-page":"1","volume-title":"2024 28th International Conference on Information Technology (IT)","author":"Altundogan Turan Goktug","year":"2024","unstructured":"Turan Goktug Altundogan, Mehmet Karakose, and Senem Tanberk. 2024. Transformer based multimodal summarization and highlight abstraction approach for texts and speech audios. In 2024 28th International Conference on Information Technology (IT). IEEE, 1\u20134."},{"key":"e_1_3_3_7_2","article-title":"Creating multimedia summaries using tweets and videos","author":"Andy Anietie","year":"2022","unstructured":"Anietie Andy, Siyi Liu, Daphne Ippolito, Reno Kriz, Chris Callison-Burch, and Derry Wijaya. 2022. Creating multimedia summaries using tweets and videos. 
arXiv preprint arXiv:2203.08931 (2022).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2021.3117472"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","unstructured":"Dakshata Argade Vaishali Khairnar Deepali Vora Shruti Patil Ketan Kotecha and Sultan Alfarhood. 2024. Multimodal Abstractive Summarization using bidirectional encoder representations from transformers with attention mechanism. Heliyon 10 4 (2024) e26162. DOI:10.1016\/j.heliyon.2024.e26162","DOI":"10.1016\/j.heliyon.2024.e26162"},{"key":"e_1_3_3_10_2","volume-title":"AIP Conference Proceedings","volume":"2802","author":"Aruneshwari R. R.","year":"2024","unstructured":"R. R. Aruneshwari, K. M. Anandkumar, and D. Kavitha. 2024. A comprehensive review of text summarization. In AIP Conference Proceedings, Vol. 2802. AIP Publishing."},{"key":"e_1_3_3_11_2","first-page":"797","volume-title":"Proceedings of the 20th International Conference on Natural Language Processing (ICON)","author":"Atharva Kumbhar","year":"2023","unstructured":"Kumbhar Atharva, Kulkarni Harsh, Mali Atmaja, Sonawane Sheetal, and Mulay Prathamesh. 2023. The current landscape of multimodal summarization. In Proceedings of the 20th International Conference on Natural Language Processing (ICON). 797\u2013806."},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2021.107152"},{"key":"e_1_3_3_13_2","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. 
Advances in Neural Information Processing Systems 33 (2020), 12449\u201312460.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_14_2","article-title":"Neural machine translation by jointly learning to align and translate","author":"Bahdanau Dzmitry","year":"2014","unstructured":"Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_15_2","doi-asserted-by":"crossref","first-page":"1166","DOI":"10.1109\/ICCES51350.2021.9488968","volume-title":"2021 6th International Conference on Communication and Electronics Systems (ICCES)","author":"Banerjee Shreya","year":"2021","unstructured":"Shreya Banerjee, Rachana B. Karennavar, Prerana Sirigeri, et\u00a0al. 2021. Multimedia text summary generator for visually impaired. In 2021 6th International Conference on Communication and Electronics Systems (ICCES). IEEE, 1166\u20131173."},{"key":"e_1_3_3_16_2","article-title":"A survey on bias and fairness in natural language processing","author":"Bansal Rajas","year":"2022","unstructured":"Rajas Bansal. 2022. A survey on bias and fairness in natural language processing. arXiv preprint arXiv:2204.09591 (2022).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_17_2","article-title":"Procedures for undertaking systematic reviews","author":"Barbara Kitchenham","year":"2004","unstructured":"Kitchenham Barbara and C. Stuart. 2004. Procedures for undertaking systematic reviews. Computer Science Department, Keele University (TRISE-0401) and National ICT Australia Ltd (0400011T. 1) Joint Technical Report (2004). Google Scholar Google Scholar Reference (2004).","journal-title":"Computer Science Department, Keele University (TRISE-0401) and National ICT Australia Ltd (0400011T. 1) Joint Technical Report (2004). 
Google Scholar Google Scholar Reference"},{"key":"e_1_3_3_18_2","doi-asserted-by":"crossref","first-page":"541","DOI":"10.1145\/2578726.2582623","volume-title":"Proceedings of International Conference on Multimedia Retrieval","author":"Batko Michal","year":"2014","unstructured":"Michal Batko, Petra Budikova, Petr Elias, and Pavel Zezula. 2014. CLAN photo presenter: Multi-modal summarization tool for image collections. In Proceedings of International Conference on Multimedia Retrieval. 541\u2013542."},{"key":"e_1_3_3_19_2","volume-title":"arXiv preprint","author":"Beltagy Iz","year":"2020","unstructured":"Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. In arXiv preprint arXiv:2004.05150."},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/2505515.2505652"},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2014.2384912"},{"key":"e_1_3_3_22_2","first-page":"993","article-title":"Latent dirichlet allocation","volume":"3","author":"Blei David M.","year":"2003","unstructured":"David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning Research 3, Jan (2003), 993\u20131022.","journal-title":"Journal of machine Learning Research"},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2013.03.012"},{"issue":"12","key":"e_1_3_3_24_2","first-page":"01","article-title":"Survey of automated text document summarization tools: Approaches and trends","volume":"5","author":"Borde Ravi","year":"2023","unstructured":"Ravi Borde. 2023. Survey of automated text document summarization tools: Approaches and trends. 
The American Journal of Applied Sciences 5, 12 (2023), 01\u201305.","journal-title":"The American Journal of Applied Sciences"},{"key":"e_1_3_3_25_2","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et\u00a0al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877\u20131901.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_26_2","volume-title":"A Multimodal Approach: Acoustic-Linguistic Modelling for Neural Extractive Speech Summarisation on Podcasts","author":"Calik Berk","year":"2023","unstructured":"Berk Calik. 2023. A Multimodal Approach: Acoustic-Linguistic Modelling for Neural Extractive Speech Summarisation on Podcasts. Master\u2019s thesis."},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/290941.291025"},{"key":"e_1_3_3_28_2","unstructured":"Brandon Castellano. 2014. PySceneDetect. Retrieved from https:\/\/www.scenedetect.com\/"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICALT.2011.19"},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1438"},{"key":"e_1_3_3_31_2","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1109\/SKG.2018.00033","article-title":"Extractive text-image summarization using multi-modal RNN","author":"Chen Jingqiang","year":"2018","unstructured":"Jingqiang Chen and Hai Zhuge. 2018. Extractive text-image summarization using multi-modal RNN. 
2018 14th International Conference on Semantics, Knowledge and Grids (SKG), 245\u2013248.","journal-title":"2018 14th International Conference on Semantics, Knowledge and Grids (SKG)"},{"key":"e_1_3_3_32_2","doi-asserted-by":"crossref","first-page":"8742","DOI":"10.18653\/v1\/2024.acl-long.474","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Chen Ting-Chih","year":"2024","unstructured":"Ting-Chih Chen, Chia-Wei Tang, and Chris Thomas. 2024. MetaSumPerceiver: Multimodal multi-document evidence summarization for fact-checking. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 8742\u20138757."},{"key":"e_1_3_3_33_2","volume-title":"NIPS 2014 Workshop on Deep Learning, December 2014","author":"Chung Junyoung","year":"2014","unstructured":"Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014."},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00521-024-09908-3"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2005.177"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASSP.1980.1163420"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.bionlp-1.33"},{"key":"e_1_3_3_38_2","first-page":"4171","volume-title":"Proceedings of NAACL","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL. 
4171\u20134186."},{"key":"e_1_3_3_39_2","volume-title":"International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et\u00a0al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations."},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2020.113679"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.5555\/1622487.1622501"},{"key":"e_1_3_3_42_2","first-page":"17245","volume-title":"Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)","author":"Faheem Ali","year":"2024","unstructured":"Ali Faheem, Faizad Ullah, Muhammad Sohaib Ayub, and Asim Karim. 2024. UrduMASD: A multimodal abstractive summarization dataset for Urdu. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 17245\u201317253."},{"key":"e_1_3_3_43_2","article-title":"Multi-modal summarization for video-containing documents","author":"Fu Xiyan","year":"2020","unstructured":"Xiyan Fu, Jun Wang, and Zhenglu Yang. 2020. Multi-modal summarization for video-containing documents. 
arXiv preprint arXiv:2009.08018 (2020).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1044"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i20.30206"},{"key":"e_1_3_3_46_2","article-title":"MedSumm: A multimodal approach to summarizing code-mixed hindi-english clinical queries","author":"Ghosh Akash","year":"2024","unstructured":"Akash Ghosh, Arkadeep Acharya, Prince Jha, Aniket Gaudgaul, Rajdeep Majumdar, Sriparna Saha, Aman Chadha, Raghav Jain, Setu Sinha, and Shivani Agarwal. 2024. MedSumm: A multimodal approach to summarizing code-mixed hindi-english clinical queries. arXiv preprint arXiv:2401.01596 (2024).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_47_2","doi-asserted-by":"crossref","first-page":"11546","DOI":"10.18653\/v1\/2024.findings-emnlp.675","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2024","author":"Ghosh Akash","year":"2024","unstructured":"Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Gaurav Pandey, Dinesh Raghu, and Setu Sinha. 2024. HealthAlignSumm: Utilizing alignment for multimodal summarization of code-mixed healthcare dialogues. In Findings of the Association for Computational Linguistics: EMNLP 2024. 11546\u201311560."},{"key":"e_1_3_3_48_2","doi-asserted-by":"crossref","first-page":"13117","DOI":"10.18653\/v1\/2024.acl-long.708","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Ghosh Akash","year":"2024","unstructured":"Akash Ghosh, Mohit Tomar, Abhisek Tiwari, Sriparna Saha, Jatin Salve, and Setu Sinha. 2024. From sights to insights: Towards summarization of multimodal clinical documents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 
13117\u201313129."},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3617233.3617238"},{"key":"e_1_3_3_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2916887"},{"key":"e_1_3_3_51_2","first-page":"678","volume-title":"International Conference on Medical Image Computing and Computer-Assisted Intervention","author":"Guo Xiaoqing","year":"2024","unstructured":"Xiaoqing Guo, Qianhui Men, and J. Alison Noble. 2024. MMSummary: Multimodal summary generation for fetal ultrasound video. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 678\u2013688."},{"key":"e_1_3_3_52_2","article-title":"Survey on sociodemographic bias in natural language processing","author":"Gupta Vipul","year":"2023","unstructured":"Vipul Gupta, Pranav Narayanan Venkit, Shomir Wilson, and Rebecca J. Passonneau. 2023. Survey on sociodemographic bias in natural language processing. arXiv preprint arXiv:2306.08158 (2023).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00685"},{"key":"e_1_3_3_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01428"},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_3_56_2","first-page":"290","volume-title":"CCF International Conference on Natural Language Processing and Chinese Computing","author":"He Rui","year":"2024","unstructured":"Rui He, Minjie Qiang, Hongling Wang, and Zhongqing Wang. 2024. Sequential structured fusion of image and text for enhanced multimodal abstractive summarization. In CCF International Conference on Natural Language Processing and Chinese Computing. 
Springer, 290\u2013302."},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952132"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.595"},{"key":"e_1_3_3_59_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_3_60_2","article-title":"Mobilenets: Efficient convolutional neural networks for mobile vision applications","author":"Howard Andrew G.","year":"2017","unstructured":"Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_61_2","doi-asserted-by":"publisher","DOI":"10.3390\/app9050987"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107567"},{"key":"e_1_3_3_63_2","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1007\/978-3-030-45442-5_24","volume-title":"Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14\u201317, 2020, Proceedings, Part II 42","author":"Jangra Anubhav","year":"2020","unstructured":"Anubhav Jangra, Adam Jatowt, Mohammad Hasanuzzaman, and Sriparna Saha. 2020. Text-image-video summary generation using joint integer linear programming. In Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14\u201317, 2020, Proceedings, Part II 42. Springer, 190\u2013198."},{"key":"e_1_3_3_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3584700"},{"key":"e_1_3_3_65_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3462877"},{"key":"e_1_3_3_66_2","unstructured":"Stefanie Jegelka Francis Bach and Suvrit Sra. 2013. Reflection methods for user-friendly submodular optimization. 
In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1 (NIPS\u201913) Curran Associates Inc. Lake Tahoe Nevada 1313\u20131321."},{"key":"e_1_3_3_67_2","doi-asserted-by":"crossref","unstructured":"Xiankai Jiang and Jingqiang Chen. 2025. Heterogeneous graphormer for extractive multimodal summarization. Journal of Intelligent Information Systems 63 2 (2025) 355\u2013373.","DOI":"10.1007\/s10844-024-00886-5"},{"key":"e_1_3_3_68_2","doi-asserted-by":"crossref","first-page":"5042","DOI":"10.18653\/v1\/2024.findings-emnlp.290","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2024","author":"Jing Liqiang","year":"2024","unstructured":"Liqiang Jing, Ruosen Li, Yunmo Chen, and Xinya Du. 2024. FaithScore: Fine-grained evaluations of hallucinations in large vision-language models. In Findings of the Association for Computational Linguistics: EMNLP 2024. 5042\u20135063."},{"key":"e_1_3_3_69_2","doi-asserted-by":"crossref","unstructured":"Liqiang Jing Yiren Li Junhao Xu Yongcan Yu Pei Shen and Xuemeng Song. 2023. Vision enhanced generative pre-trained language model for multimodal sentence summarization. Machine Intelligence Research 20 2 (2023) 289\u2013298.","DOI":"10.1007\/s11633-022-1372-x"},{"key":"e_1_3_3_70_2","doi-asserted-by":"crossref","unstructured":"Ambedkar Kanapala Sukomal Pal and Rajendra Pamula. 2019. Text summarization from legal documents: A survey. Artificial Intelligence Review 51 3 (2019) 371\u2013402.","DOI":"10.1007\/s10462-017-9566-2"},{"key":"e_1_3_3_71_2","article-title":"The kinetics human action video dataset","author":"Kay Will","year":"2017","unstructured":"Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et\u00a0al. 2017. The kinetics human action video dataset. 
arXiv preprint arXiv:1705.06950 (2017).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_72_2","doi-asserted-by":"publisher","DOI":"10.7717\/peerj-cs.2463"},{"key":"e_1_3_3_73_2","article-title":"Fairness-aware summarization for justified decision-making","author":"Keymanesh Moniba","year":"2021","unstructured":"Moniba Keymanesh, Tanya Berger-Wolf, Micha Elsner, and Srinivasan Parthasarathy. 2021. Fairness-aware summarization for justified decision-making. arXiv preprint arXiv:2107.06243 (2021).","journal-title":"arXiv preprint"},{"issue":"4","key":"e_1_3_3_74_2","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1007\/s12046-023-02284-z","article-title":"Multimodal text summarization with evaluation approaches","volume":"48","author":"Khilji Abdullah Faiz Ur Rahman","year":"2023","unstructured":"Abdullah Faiz Ur Rahman Khilji, Utkarsh Sinha, Pintu Singh, Adnan Ali, Sahinur Rahman Laskar, Pankaj Dadure, Riyanka Manna, Partha Pakray, Benoit Favre, and Sivaji Bandyopadhyay. 2023. Multimodal text summarization with evaluation approaches. S\u0101dhan\u0101 48, 4 (2023), 226.","journal-title":"S\u0101dhan\u0101"},{"key":"e_1_3_3_75_2","doi-asserted-by":"crossref","first-page":"60","DOI":"10.18653\/v1\/2020.nlpbt-1.7","volume-title":"Proceedings of the First International Workshop on Natural Language Processing Beyond Text","author":"Khullar Aman","year":"2020","unstructured":"Aman Khullar and Udit Arora. 2020. MAST: Multimodal abstractive summarization with trimodal hierarchical attention. In Proceedings of the First International Workshop on Natural Language Processing Beyond Text. 60\u201369."},{"key":"e_1_3_3_76_2","unstructured":"Barbara Kitchenham and Stuart Charters. 2007. Guidelines for performing systematic literature reviews in software engineering technical report. 
Software Engineering Group EBSE Technical Report Keele University and Department of Computer Science University of Durham 2 (2007)."},{"key":"e_1_3_3_77_2","doi-asserted-by":"crossref","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM 60 6 (2017) 84\u201390.","DOI":"10.1145\/3065386"},{"key":"e_1_3_3_78_2","doi-asserted-by":"publisher","DOI":"10.1145\/3065386"},{"key":"e_1_3_3_79_2","first-page":"880","volume-title":"Findings of the Association for Computational Linguistics: EACL 2023","author":"Krubi\u0144ski Mateusz","year":"2023","unstructured":"Mateusz Krubi\u0144ski and Pavel Pecina. 2023. MLASK: Multimodal summarization of video-based news articles. In Findings of the Association for Computational Linguistics: EACL 2023. 880\u2013894."},{"key":"e_1_3_3_80_2","first-page":"264","volume-title":"International Conference on Document Analysis and Recognition","author":"Kumar Raghvendra","year":"2024","unstructured":"Raghvendra Kumar, Deepak Prakash, Sriparna Saha, and Shubham Sharma. 2024. IndicBART alongside visual element: multimodal summarization in diverse indian languages. In International Conference on Document Analysis and Recognition. Springer, 264\u2013280."},{"key":"e_1_3_3_81_2","doi-asserted-by":"publisher","unstructured":"Raghvendra Kumar Ritika Sinha Sriparna Saha and Adam Jatowt. 2024. Extracting the full story: A multimodal approach and dataset to crisis summarization in Tweets. IEEE Transactions on Computational Social Systems 11 6 (2024) 7846\u20137856. 
DOI:10.1109\/TCSS.2024.3436690","DOI":"10.1109\/TCSS.2024.3436690"},{"key":"e_1_3_3_82_2","first-page":"10790","volume-title":"Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)","author":"Kumar Sandeep","year":"2024","unstructured":"Sandeep Kumar, Guneet Singh Kohli, Tirthankar Ghosal, and Asif Ekbal. 2024. Longform multimodal lay summarization of scientific papers: Towards automatically generating science blogs from research articles. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 10790\u201310801."},{"issue":"7","key":"e_1_3_3_83_2","doi-asserted-by":"crossref","first-page":"922","DOI":"10.1121\/1.1936476","article-title":"Attenuation of torsional waves in teflon","volume":"32","author":"Leonard R. W.","year":"1960","unstructured":"R. W. Leonard. 1960. Attenuation of torsional waves in teflon. The Journal of the Acoustical Society of America 32, 7 (1960), 922\u2013922.","journal-title":"The Journal of the Acoustical Society of America"},{"key":"e_1_3_3_84_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"e_1_3_3_85_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6332"},{"key":"e_1_3_3_86_2","first-page":"4152","volume-title":"IJCAI","author":"Li Haoran","year":"2018","unstructured":"Haoran Li, Junnan Zhu, Tianshang Liu, Jiajun Zhang, Chengqing Zong, et\u00a0al. 2018. Multi-modal sentence summarization with modality attention and image filtering. In IJCAI. 
4152\u20134158."},{"key":"e_1_3_3_87_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1114"},{"issue":"5","key":"e_1_3_3_88_2","first-page":"996","article-title":"Read, watch, listen, and summarize: Multi-modal summarization for asynchronous text, image, audio and video","volume":"31","author":"Li Haoran","year":"2018","unstructured":"Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang, and Chengqing Zong. 2018. Read, watch, listen, and summarize: Multi-modal summarization for asynchronous text, image, audio and video. IEEE Transactions on Knowledge and Data Engineering 31, 5 (2018), 996\u20131009.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_3_3_89_2","volume-title":"Proceedings of the 39th International Conference on Machine Learning","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning. PMLR."},{"key":"e_1_3_3_90_2","article-title":"VMSMO: Learning to generate multimodal summary for video-based news articles","author":"Li Mingzhe","year":"2020","unstructured":"Mingzhe Li, Xiuying Chen, Shen Gao, Zhangming Chan, Dongyan Zhao, and Rui Yan. 2020. VMSMO: Learning to generate multimodal summary for video-based news articles. arXiv preprint arXiv:2010.05406 (2020).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_91_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1210"},{"key":"e_1_3_3_92_2","doi-asserted-by":"crossref","unstructured":"Zechao Li. 2017. Understanding-oriented multimedia news summarization. In Understanding-Oriented Multimedia Content Analysis. 
Springer 131\u2013153.","DOI":"10.1007\/978-981-10-3689-7_6"},{"key":"e_1_3_3_93_2","article-title":"Towards debiasing sentence representations","author":"Liang Paul Pu","year":"2020","unstructured":"Paul Pu Liang, Irene Mengze Li, Emily Zheng, Yao Chong Lim, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Towards debiasing sentence representations. arXiv preprint arXiv:2007.08100 (2020).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_94_2","article-title":"Modeling paragraph-level vision-language semantic alignment for multi-modal summarization","author":"Liang Xinnian","year":"2022","unstructured":"Xinnian Liang, Chenhao Cui, Shuangzhi Wu, Jiali Zeng, Yufan Jiang, and Zhoujun Li. 2022. Modeling paragraph-level vision-language semantic alignment for multi-modal summarization. arXiv preprint arXiv:2208.11303 (2022).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_95_2","volume-title":"EMNLP (Findings)","author":"Liang Yunlong","year":"2023","unstructured":"Yunlong Liang, Fandong Meng, Jiaan Wang, Jinan Xu, Yufeng Chen, and Jie Zhou. 2023. D \\(^2\\) TV: Dual knowledge distillation and target-oriented vision modeling for many-to-many multimodal summarization. In EMNLP (Findings)."},{"key":"e_1_3_3_96_2","first-page":"1","volume-title":"ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Lin Chen","year":"2023","unstructured":"Chen Lin, Ye Liu, Siyu An, and Di Yin. 2023. Unsupervised extractive summarization with heterogeneous graph embeddings for chinese documents. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1\u20135."},{"key":"e_1_3_3_97_2","first-page":"74","volume-title":"Text Summarization Branches Out","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 
74\u201381."},{"key":"e_1_3_3_98_2","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3591633"},{"key":"e_1_3_3_99_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.144"},{"key":"e_1_3_3_100_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2021.04.072"},{"key":"e_1_3_3_101_2","first-page":"6959","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Liu Nayu","year":"2022","unstructured":"Nayu Liu, Kaiwen Wei, Xian Sun, Hongfeng Yu, Fanglong Yao, Li Jin, Guo Zhi, and Guangluan Xu. 2022. Assist non-native viewers: Multimodal cross-lingual summarization for how2 videos. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 6959\u20136969."},{"key":"e_1_3_3_102_2","doi-asserted-by":"publisher","DOI":"10.1145\/3696409.3700234"},{"key":"e_1_3_3_103_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_3_104_2","doi-asserted-by":"crossref","unstructured":"David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 2 (2004) 91\u2013110.","DOI":"10.1023\/B:VISI.0000029664.99615.94"},{"issue":"20","key":"e_1_3_3_105_2","doi-asserted-by":"crossref","first-page":"9184","DOI":"10.3390\/app14209184","article-title":"A modality-enhanced multi-channel attention network for multi-modal dialogue summarization","volume":"14","author":"Lu Ming","year":"2024","unstructured":"Ming Lu, Yang Liu, and Xiaoming Zhang. 2024. A modality-enhanced multi-channel attention network for multi-modal dialogue summarization. 
Applied Sciences 14, 20 (2024), 9184.","journal-title":"Applied Sciences"},{"key":"e_1_3_3_106_2","doi-asserted-by":"publisher","DOI":"10.3390\/app14209563"},{"key":"e_1_3_3_107_2","first-page":"594","article-title":"MTCA: A multimodal summarization model based on two-stream cross attention","author":"Lu Qiduo","year":"2022","unstructured":"Qiduo Lu, Xia Ye, and Chenhao Zhu. 2022. MTCA: A multimodal summarization model based on two-stream cross attention. 2022 2nd International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), 594\u2013601.","journal-title":"2022 2nd International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI)"},{"key":"e_1_3_3_108_2","first-page":"882","volume-title":"2022 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA)","author":"Lu Qiduo","year":"2022","unstructured":"Qiduo Lu, Chenhao Zhu, and Xia Ye. 2022. Research on Multimodal Summarization by Integrating Visual and Text Modal Information. In 2022 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA). IEEE, 882\u2013889."},{"key":"e_1_3_3_109_2","doi-asserted-by":"publisher","DOI":"10.1147\/rd.22.0159"},{"key":"e_1_3_3_110_2","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3358104"},{"key":"e_1_3_3_111_2","doi-asserted-by":"crossref","first-page":"1858","DOI":"10.1109\/BigData47090.2019.9005659","volume-title":"2019 IEEE International Conference on Big Data (Big Data)","author":"Mahoney Christian J.","year":"2019","unstructured":"Christian J. Mahoney, Jianping Zhang, Nathaniel Huber-Fliflet, Peter Gronvall, and Haozhen Zhao. 2019. A framework for explainable text classification in legal document review. In 2019 IEEE International Conference on Big Data (Big Data). 
IEEE, 1858\u20131867."},{"key":"e_1_3_3_112_2","volume-title":"Rhetorical Structure Theory: A Theory of Text Organization","author":"Mann William C.","year":"1987","unstructured":"William C. Mann and Sandra A. Thompson. 1987. Rhetorical Structure Theory: A Theory of Text Organization. University of Southern California, Information Sciences Institute Los Angeles."},{"key":"e_1_3_3_113_2","article-title":"A rhetorical relations-based framework for tailored multimedia document summarization","author":"Maredj Azze-Eddine","year":"2024","unstructured":"Azze-Eddine Maredj and Madjid Sadallah. 2024. A rhetorical relations-based framework for tailored multimedia document summarization. arXiv preprint arXiv:2412.19133 (2024).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_114_2","doi-asserted-by":"publisher","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"e_1_3_3_115_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2022.105667"},{"key":"e_1_3_3_116_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2015.7404790"},{"key":"e_1_3_3_117_2","first-page":"404","volume-title":"Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing","author":"Mihalcea Rada","year":"2004","unstructured":"Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 404\u2013411."},{"key":"e_1_3_3_118_2","doi-asserted-by":"crossref","first-page":"340","DOI":"10.1007\/978-3-319-48743-4_27","volume-title":"Web Information Systems Engineering\u2013WISE 2016: 17th International Conference, Shanghai, China, November 8-10, 2016, Proceedings, Part II 17","author":"Modani Natwar","year":"2016","unstructured":"Natwar Modani, Pranav Maneriker, Gaurush Hiranandani, Atanu R Sinha, Utpal, Vaishnavi Subramanian, and Shivani Gupta. 2016. Summarizing multimedia content. 
In Web Information Systems Engineering\u2013WISE 2016: 17th International Conference, Shanghai, China, November 8-10, 2016, Proceedings, Part II 17. Springer, 340\u2013348."},{"key":"e_1_3_3_119_2","unstructured":"Xiyu Wu Qimai Chen Hai Liu and Chaobo He. 2018. Collaborative filtering recommendation algorithm based on representation learning of knowledge graph. Computer Engineering 44 2 (2018) 226\u2013232."},{"key":"e_1_3_3_120_2","doi-asserted-by":"crossref","first-page":"387","DOI":"10.18653\/v1\/2022.findings-aacl.36","volume-title":"Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022","author":"Mukherjee Sourajit","year":"2022","unstructured":"Sourajit Mukherjee, Anubhav Jangra, Sriparna Saha, and Adam Jatowt. 2022. Topic-aware multimodal summarization. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022. 387\u2013398."},{"key":"e_1_3_3_121_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.10958"},{"issue":"1","key":"e_1_3_3_122_2","first-page":"121","article-title":"A survey on automatic text summarization","volume":"7","author":"Nazari Narges","year":"2019","unstructured":"Narges Nazari and MA Mahdavi. 2019. A survey on automatic text summarization. Journal of AI and Data Mining 7, 1 (2019), 121\u2013135.","journal-title":"Journal of AI and Data Mining"},{"key":"e_1_3_3_123_2","article-title":"Diversity driven attention model for query-based abstractive summarization","author":"Nema Preksha","year":"2017","unstructured":"Preksha Nema, Mitesh Khapra, Anirban Laha, and Balaraman Ravindran. 2017. Diversity driven attention model for query-based abstractive summarization. arXiv preprint arXiv:1704.08300 (2017).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_124_2","article-title":"LoRaLay: A multilingual and multimodal dataset for long range and layout-aware summarization","author":"Nguyen Laura","year":"2023","unstructured":"Laura Nguyen, Thomas Scialom, Benjamin Piwowarski, and Jacopo Staiano. 
2023. LoRaLay: A multilingual and multimodal dataset for long range and layout-aware summarization. arXiv preprint arXiv:2301.11312 (2023).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_125_2","doi-asserted-by":"publisher","DOI":"10.1145\/3411763.3443441"},{"key":"e_1_3_3_126_2","volume-title":"A Multimodal, Multispeaker Abstractive Summarization Dataset of Discussion Threads","author":"Overbay Keili Shay","year":"2023","unstructured":"Keili Shay Overbay. 2023. A Multimodal, Multispeaker Abstractive Summarization Dataset of Discussion Threads. Ph.D. Dissertation. Seoul National University Graduate School."},{"key":"e_1_3_3_127_2","article-title":"Multimodal abstractive summarization for how2 videos","author":"Palaskar Shruti","year":"2019","unstructured":"Shruti Palaskar, Jindrich Libovick\u1ef3, Spandana Gella, and Florian Metze. 2019. Multimodal abstractive summarization for how2 videos. arXiv preprint arXiv:1906.07901 (2019).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_128_2","first-page":"791","volume-title":"Interspeech","author":"Palaskar Shruti","year":"2021","unstructured":"Shruti Palaskar, Ruslan Salakhutdinov, Alan W. Black, and Florian Metze. 2021. Multimodal speech summarization through semantic concept learning. In Interspeech. 791\u2013795."},{"key":"e_1_3_3_129_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 
311\u2013318."},{"key":"e_1_3_3_130_2","doi-asserted-by":"crossref","first-page":"13773","DOI":"10.18653\/v1\/2024.acl-long.743","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Patil Vaidehi","year":"2024","unstructured":"Vaidehi Patil, Leonardo Ribeiro, Mengwen Liu, Mohit Bansal, and Markus Dreyer. 2024. REFINESUMM: Self-refining MLLM for generating a multimodal summarization dataset. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13773\u201313786."},{"key":"e_1_3_3_131_2","doi-asserted-by":"crossref","first-page":"539","DOI":"10.1109\/ASRU.2015.7404842","volume-title":"2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)","author":"Peddinti Vijayaditya","year":"2015","unstructured":"Vijayaditya Peddinti, Guoguo Chen, Vimal Manohar, Tom Ko, Daniel Povey, and Sanjeev Khudanpur. 2015. Jhu ASpIRE system: Robust lvcsr with tdnns, ivector adaptation and RNN-LMS. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 539\u2013546."},{"key":"e_1_3_3_132_2","doi-asserted-by":"publisher","unstructured":"Siginamsetty Phani Ashu Abdul M. Krishna Siva Prasad and Hiren Kumar Deva Sarma. 2024. MMSFT: Multilingual multimodal summarization by fine-tuning transformers. IEEE Access 12 (2024) 129673\u2013129689. DOI:10.1109\/ACCESS.2024.3454382","DOI":"10.1109\/ACCESS.2024.3454382"},{"key":"e_1_3_3_133_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.118"},{"key":"e_1_3_3_134_2","volume-title":"IEEE 2011 Workshop on Automatic Speech Recognition and Understanding","author":"Povey Daniel","year":"2011","unstructured":"Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et\u00a0al. 2011. The kaldi speech recognition toolkit. 
In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society."},{"key":"e_1_3_3_135_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2024.106417"},{"key":"e_1_3_3_136_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2018.10.028"},{"key":"e_1_3_3_137_2","first-page":"21909","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Qiu Jielin","year":"2024","unstructured":"Jielin Qiu, Jiacheng Zhu, William Han, Aditesh Kumar, Karthik Mittal, Claire Jin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Ding Zhao, et\u00a0al. 2024. Mmsum: A dataset for multimodal summarization and thumbnail generation of videos. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 21909\u201321921."},{"key":"e_1_3_3_138_2","article-title":"Mhms: Multimodal hierarchical multimedia summarization","author":"Qiu Jielin","year":"2022","unstructured":"Jielin Qiu, Jiacheng Zhu, Mengdi Xu, Franck Dernoncourt, Trung Bui, Zhaowen Wang, Bo Li, Ding Zhao, and Hailin Jin. 2022. Mhms: Multimodal hierarchical multimedia summarization. arXiv preprint arXiv:2204.03734 (2022).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_139_2","first-page":"8748","volume-title":"International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et\u00a0al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_3_140_2","first-page":"28492","volume-title":"International Conference on Machine Learning","author":"Radford Alec","year":"2023","unstructured":"Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. 
Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning. PMLR, 28492\u201328518."},{"issue":"8","key":"e_1_3_3_141_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et\u00a0al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_3_142_2","doi-asserted-by":"publisher","DOI":"10.5555\/3455716.3455856"},{"key":"e_1_3_3_143_2","first-page":"141","volume-title":"2023 10th International Conference on Soft Computing & Machine Intelligence (ISCMI)","author":"Rafi Shaik","year":"2023","unstructured":"Shaik Rafi and Ranjita Das. 2023. Abstractive text summarization using multimodal information. In 2023 10th International Conference on Soft Computing & Machine Intelligence (ISCMI). IEEE, 141\u2013145."},{"key":"e_1_3_3_144_2","doi-asserted-by":"publisher","unstructured":"Shaik Rafi and Ranjita Das. 2025. Topic-guided abstractive multimodal summarization with multimodal output. Neural Computing and Applications 37 18 (2025) 11619\u201311634. DOI:10.1007\/s00521-023-08821-5","DOI":"10.1007\/s00521-023-08821-5"},{"key":"e_1_3_3_145_2","doi-asserted-by":"publisher","DOI":"10.1145\/3645029"},{"key":"e_1_3_3_146_2","doi-asserted-by":"crossref","unstructured":"Riya Mol Raji Merin Ann Philipose Julie Jose Kuruthukulangara and Lata Ragha. 2022. Abstractive text summarization for multimodal data. In 2022 International Conference on Computing Communication Security and Intelligent Systems (IC3SIS). 
IEEE 1\u20136.","DOI":"10.1109\/IC3SIS54991.2022.9885342"},{"key":"e_1_3_3_147_2","doi-asserted-by":"crossref","first-page":"365","DOI":"10.1145\/3201064.3202917","volume-title":"Proceedings of the 10th ACM Conference on Web Science","author":"Kashyap Abhinav Ramesh","year":"2018","unstructured":"Abhinav Ramesh Kashyap, Christian von der Weth, Zhiyong Cheng, and Mohan Kankanhalli. 2018. EPICURE-aspect-based multimodal review summarization. In Proceedings of the 10th ACM Conference on Web Science. 365\u2013369."},{"key":"e_1_3_3_148_2","first-page":"1","volume-title":"Recommender Systems Handbook","author":"Ricci Francesco","year":"2010","unstructured":"Francesco Ricci, Lior Rokach, and Bracha Shapira. 2010. Introduction to recommender systems handbook. In Recommender Systems Handbook. Springer, 1\u201335."},{"key":"e_1_3_3_149_2","doi-asserted-by":"publisher","DOI":"10.1145\/3651983"},{"key":"e_1_3_3_150_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6638949"},{"key":"e_1_3_3_151_2","article-title":"How2: A large-scale dataset for multimodal language understanding","author":"Sanabria Ramon","year":"2018","unstructured":"Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Lo\u00efc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347 (2018).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_152_2","first-page":"840","volume-title":"Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)","author":"Schluter Natalie","year":"2015","unstructured":"Natalie Schluter and Anders S\u00f8gaard. 2015. Unsupervised extractive summarization via coverage maximization with syntactic and semantic concepts. 
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 840\u2013844."},{"key":"e_1_3_3_153_2","article-title":"Get to the point: Summarization with pointer-generator networks","author":"See Abigail","year":"2017","unstructured":"Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368 (2017).","journal-title":"arXiv preprint"},{"issue":"1","key":"e_1_3_3_154_2","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1007\/s42979-022-01446-w","article-title":"Automatic text summarization methods: A comprehensive review","volume":"4","author":"Sharma Grishma","year":"2022","unstructured":"Grishma Sharma and Deepak Sharma. 2022. Automatic text summarization methods: A comprehensive review. SN Computer Science 4, 1 (2022), 33.","journal-title":"SN Computer Science"},{"key":"e_1_3_3_155_2","first-page":"1","volume-title":"2021 IEEE 8th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON)","author":"Sheik Reshma","year":"2021","unstructured":"Reshma Sheik and S. Jaya Nirmala. 2021. Deep learning techniques for legal text summarization. In 2021 IEEE 8th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON). IEEE, 1\u20135."},{"key":"e_1_3_3_156_2","first-page":"273","volume-title":"China National Conference on Chinese Computational Linguistics","author":"Shi Xiaorui","year":"2023","unstructured":"Xiaorui Shi. 2023. MCLS: A large-scale multimodal cross-lingual summarization dataset. In China National Conference on Chinese Computational Linguistics. 
Springer, 273\u2013288."},{"key":"e_1_3_3_157_2","first-page":"424","volume-title":"Chinese Conference on Pattern Recognition and Computer Vision (PRCV)","author":"Shi Xiaorui","year":"2024","unstructured":"Xiaorui Shi. 2024. Towards making the most of knowledge across languages for multimodal cross-lingual summarization. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 424\u2013438."},{"key":"e_1_3_3_158_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3680978"},{"key":"e_1_3_3_159_2","volume-title":"3rd International Conference on Learning Representations (ICLR 2015)","author":"Simonyan K.","year":"2015","unstructured":"K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR 2015). Computational and Biological Learning Society."},{"key":"e_1_3_3_160_2","article-title":"Very deep convolutional networks for large-scale image recognition","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings (2015).","journal-title":"3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings"},{"key":"e_1_3_3_161_2","doi-asserted-by":"crossref","first-page":"992","DOI":"10.1145\/3477495.3532076","volume-title":"Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Song Xuemeng","year":"2022","unstructured":"Xuemeng Song, Liqiang Jing, Dengtian Lin, Zhongzhou Zhao, Haiqing Chen, and Liqiang Nie. 2022. V2P: Vision-to-prompt based multi-modal product summary generation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 
992\u20131001."},{"key":"e_1_3_3_162_2","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1109\/CSCWD61410.2024.10580245","volume-title":"2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD)","author":"Song Yutao","year":"2024","unstructured":"Yutao Song, Nankai Lin, Lingbao Li, and Shengyi Jiang. 2024. A vision enhanced framework for indonesian multimodal abstractive text-image summarization. In 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 61\u201366."},{"key":"e_1_3_3_163_2","doi-asserted-by":"publisher","unstructured":"Tipu Sultan Mohammad Abu Tareq Rony Mohammad Shariful Islam Samah Alshathri and Walid El-Shafai. 2025. SumGPT: A multimodal framework for radiology report summarization to improve clinical performance. IEEE Access 13 (2025) 15929\u201315945. DOI:10.1109\/ACCESS.2025.3528335","DOI":"10.1109\/ACCESS.2025.3528335"},{"key":"e_1_3_3_164_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3052783"},{"key":"e_1_3_3_165_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_3_3_166_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_3_167_2","article-title":"Lxmert: Learning cross-modality encoder representations from transformers","author":"Tan Hao","year":"2019","unstructured":"Hao Tan and Mohit Bansal. 2019. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_168_2","first-page":"263","volume-title":"Proceedings of the 31st International Conference on Computational Linguistics: Industry Track","author":"Tan Zusheng","year":"2025","unstructured":"Zusheng Tan, Xinyi Zhong, Jing-Yu Ji, Wei Jiang, and Billy Chiu. 2025. Enhancing large language models for scientific multimodal summarization with multimodal output. 
In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track. 263\u2013275."},{"key":"e_1_3_3_169_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2024.128270"},{"issue":"3","key":"e_1_3_3_170_2","doi-asserted-by":"crossref","first-page":"1469","DOI":"10.1109\/TCSVT.2023.3296196","article-title":"TLDW: Extreme multimodal summarization of news videos","volume":"34","author":"Tang Peggy","year":"2023","unstructured":"Peggy Tang, Kun Hu, Lei Zhang, Jiebo Luo, and Zhiyong Wang. 2023. TLDW: Extreme multimodal summarization of news videos. IEEE Transactions on Circuits and Systems for Video Technology 34, 3 (2023), 1469\u20131480.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_3_171_2","first-page":"5657","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Tang Xiangru","year":"2022","unstructured":"Xiangru Tang, Arjun Nair, Borui Wang, Bingyao Wang, Jai Desai, Aaron Wade, Haoran Li, Asli Celikyilmaz, Yashar Mehdad, and Dragomir Radev. 2022. CONFIT: Toward faithful dialogue summarization with linguistically-informed contrastive fine-tuning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 5657\u20135668."},{"key":"e_1_3_3_172_2","doi-asserted-by":"publisher","DOI":"10.1145\/3115433"},{"key":"e_1_3_3_173_2","article-title":"Russian-language multimodal dataset for automatic summarization of scientific papers","author":"Tsanda Alena","year":"2024","unstructured":"Alena Tsanda and Elena Bruches. 2024. Russian-language multimodal dataset for automatic summarization of scientific papers. 
arXiv preprint arXiv:2405.07886 (2024).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_174_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477314.3507106"},{"key":"e_1_3_3_175_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS\u201917) Curran Associates Inc. Long Beach California USA 6000\u20136010."},{"key":"e_1_3_3_176_2","article-title":"Graph attention networks","author":"Veli\u010dkovi\u0107 Petar","year":"2017","unstructured":"Petar Veli\u010dkovi\u0107, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_177_2","article-title":"Large scale multi-lingual multi-modal summarization dataset","author":"Verma Yash","year":"2023","unstructured":"Yash Verma, Anubhav Jangra, Raghvendra Kumar, and Sriparna Saha. 2023. Large scale multi-lingual multi-modal summarization dataset. arXiv preprint arXiv:2302.06560 (2023).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_178_2","first-page":"9632","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Wan David","year":"2022","unstructured":"David Wan and Mohit Bansal. 2022. Evaluating and improving factuality in multimodal abstractive summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 9632\u20139648."},{"key":"e_1_3_3_179_2","first-page":"13933","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"35","author":"Wang Haonan","year":"2021","unstructured":"Haonan Wang, Yang Gao, Yu Bai, Mirella Lapata, and Heyan Huang. 2021. Exploring explainable selection to control abstractive summarization. 
In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 13933\u201313941."},{"key":"e_1_3_3_180_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2023.109890"},{"key":"e_1_3_3_181_2","doi-asserted-by":"crossref","first-page":"381","DOI":"10.1007\/978-3-319-11746-1_27","volume-title":"Web Information Systems Engineering\u2013WISE 2014: 15th International Conference, Thessaloniki, Greece, October 12-14, 2014, Proceedings, Part II 15","author":"Wang Ting","year":"2014","unstructured":"Ting Wang and Changqing Bai. 2014. Understand the city better: Multimodal aspect-opinion summarization for travel. In Web Information Systems Engineering\u2013WISE 2014: 15th International Conference, Thessaloniki, Greece, October 12-14, 2014, Proceedings, Part II 15. Springer, 381\u2013394."},{"issue":"1","key":"e_1_3_3_182_2","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1007\/s42979-023-02343-6","article-title":"A comparative survey of text summarization techniques","volume":"5","author":"Watanangura Patcharapruek","year":"2023","unstructured":"Patcharapruek Watanangura, Sukit Vanichrudee, On Minteer, Theeranat Sringamdee, Nattapong Thanngam, and Thitirat Siriborvornratanakul. 2023. A comparative survey of text summarization techniques. SN Computer Science 5, 1 (2023), 47.","journal-title":"SN Computer Science"},{"key":"e_1_3_3_183_2","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1007\/978-3-031-78495-8_3","volume-title":"International Conference on Pattern Recognition","author":"Weng Yu","year":"2025","unstructured":"Yu Weng, Xuming Ye, Tianjiao Xing, Zheng Liu, Xuan Liu, et\u00a0al. 2025. Facet-aware multimodal summarization via cross-modal alignment. In International Conference on Pattern Recognition. 
Springer, 37\u201352."},{"key":"e_1_3_3_184_2","article-title":"Efficient streaming language models with attention sinks","author":"Xiao Guangxuan","year":"2023","unstructured":"Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_185_2","first-page":"19297","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"38","author":"Xiao Min","year":"2024","unstructured":"Min Xiao, Junnan Zhu, Feifei Zhai, Yu Zhou, and Chengqing Zong. 2024. DIUSum: Dynamic image utilization for multimodal summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19297\u201319305."},{"key":"e_1_3_3_186_2","first-page":"1","volume-title":"2024 7th International Conference on Machine Learning and Natural Language Processing (MLNLP)","author":"Xu Renxin","year":"2024","unstructured":"Renxin Xu, Yongqi Shao, Zihan Wang, Shijie Yang, Tao Fang, and Hong Huo. 2024. AliSum: Multimodal summarization with multimodal output boosted by multimodal alignment. In 2024 7th International Conference on Machine Learning and Natural Language Processing (MLNLP). IEEE, 1\u20139."},{"key":"e_1_3_3_187_2","doi-asserted-by":"publisher","DOI":"10.1145\/2461466.2461480"},{"key":"e_1_3_3_188_2","first-page":"3248","volume-title":"Findings of the Association for Computational Linguistics: NAACL 2024","author":"Yan Haolong","year":"2024","unstructured":"Haolong Yan, Binghao Tang, Boda Lin, Gang Zhao, and Si Li. 2024. Visual enhanced entity-level interaction network for multimodal summarization. In Findings of the Association for Computational Linguistics: NAACL 2024. 
3248\u20133260."},{"key":"e_1_3_3_189_2","doi-asserted-by":"publisher","DOI":"10.1145\/2396761.2396799"},{"key":"e_1_3_3_190_2","doi-asserted-by":"crossref","first-page":"745","DOI":"10.1145\/2009916.2010016","volume-title":"Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Yan Rui","year":"2011","unstructured":"Rui Yan, Xiaojun Wan, Jahna Otterbacher, Liang Kong, Xiaoming Li, and Yan Zhang. 2011. Evolutionary timeline summarization: A balanced optimization framework via iterative substitution. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 745\u2013754."},{"key":"e_1_3_3_191_2","first-page":"012070","volume-title":"Journal of Physics: Conference Series","volume":"1856","author":"Ye Xia","year":"2021","unstructured":"Xia Ye, Zengying Yue, and Ruiheng Liu. 2021. MBA: A multimodal bilinear attention model with residual connection for abstractive multimodal summarization. In Journal of Physics: Conference Series, Vol. 1856. IOP Publishing, 012070."},{"key":"e_1_3_3_192_2","first-page":"238","volume-title":"2021 2nd International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE)","author":"Ye Xia","year":"2021","unstructured":"Xia Ye, Zengying Yue, Ruiheng Liu, and Qiduo Lu. 2021. MTMS: A fact-corrected summarization model based on multitask learning and multimodal fusion. In 2021 2nd International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE). IEEE, 238\u2013247."},{"key":"e_1_3_3_193_2","first-page":"28877","article-title":"Do transformers really perform badly for graph representation?","volume":"34","author":"Ying Chengxuan","year":"2021","unstructured":"Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. 2021. Do transformers really perform badly for graph representation? 
Advances in Neural Information Processing Systems 34 (2021), 28877\u201328888.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_194_2","doi-asserted-by":"crossref","unstructured":"Jingshu Yuan Jing Yun Bofei Zheng Lei Jiao and Limin Liu. 2023. MCR: Multilayer cross-fusion with reconstructor for multimodal abstractive summarisation. IET Computer Vision 17 4 (2023) 389\u2013403.","DOI":"10.1049\/cvi2.12173"},{"key":"e_1_3_3_195_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657753"},{"key":"e_1_3_3_196_2","first-page":"11328","volume-title":"International Conference on Machine Learning","author":"Zhang Jingqing","year":"2020","unstructured":"Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning. PMLR, 11328\u201311339."},{"key":"e_1_3_3_197_2","doi-asserted-by":"crossref","first-page":"370","DOI":"10.1137\/1.9781611977653.ch42","volume-title":"Proceedings of the 2023 SIAM International Conference on Data Mining (SDM)","author":"Zhang Litian","year":"2023","unstructured":"Litian Zhang, Xiaoming Zhang, Ziming Guo, and Zhipeng Liu. 2023. CISum: Learning cross-modality interaction to enhance multimodal semantic coverage for multimodal summarization. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). SIAM, 370\u2013378."},{"key":"e_1_3_3_198_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2024.103693"},{"key":"e_1_3_3_199_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i10.21422"},{"key":"e_1_3_3_200_2","doi-asserted-by":"crossref","unstructured":"Mengli Zhang Gang Zhou Wanting Yu Ningbo Huang and Wenfen Liu. 2022. A comprehensive survey of abstractive text summarization based on deep learning. 
Computational Intelligence and Neuroscience 2022 1 (2022) 7132226.","DOI":"10.1155\/2022\/7132226"},{"key":"e_1_3_3_201_2","doi-asserted-by":"publisher","DOI":"10.1145\/3158369"},{"key":"e_1_3_3_202_2","article-title":"BERTScore: Evaluating text generation with bert","author":"Zhang Tianyi","year":"2019","unstructured":"Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675 (2019).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_203_2","doi-asserted-by":"crossref","first-page":"9851","DOI":"10.18653\/v1\/2024.findings-acl.587","volume-title":"Findings of the Association for Computational Linguistics ACL 2024","author":"Zhang Yanghai","year":"2024","unstructured":"Yanghai Zhang, Ye Liu, Shiwei Wu, Kai Zhang, Xukai Liu, Qi Liu, and Enhong Chen. 2024. Leveraging entity information for cross-modality correlation learning: The entity-guided multimodal summarization. In Findings of the Association for Computational Linguistics ACL 2024. 9851\u20139862."},{"key":"e_1_3_3_204_2","doi-asserted-by":"crossref","first-page":"3404","DOI":"10.18653\/v1\/2024.naacl-long.187","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)","author":"Zhang Yusen","year":"2024","unstructured":"Yusen Zhang, Nan Zhang, Yixin Liu, Alexander Richard Fabbri, Junru Liu, Ryo Kamoi, Xiaoxin Lu, Caiming Xiong, Jieyu Zhao, Dragomir Radev, et\u00a0al. 2024. Fair abstractive summarization of diverse perspectives. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 
3404\u20133426."},{"key":"e_1_3_3_205_2","article-title":"Fine-grained and explainable factuality evaluation for multimodal summarization","author":"Zhang Yue","year":"2024","unstructured":"Yue Zhang, Jingxuan Zuo, and Liqiang Jing. 2024. Fine-grained and explainable factuality evaluation for multimodal summarization. arXiv preprint arXiv:2402.11414 (2024).","journal-title":"arXiv preprint"},{"key":"e_1_3_3_206_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i10.21431"},{"key":"e_1_3_3_207_2","doi-asserted-by":"crossref","first-page":"362","DOI":"10.1007\/978-3-031-46664-9_25","volume-title":"International Conference on Advanced Data Mining and Applications","author":"Zhang Zhicheng","year":"2023","unstructured":"Zhicheng Zhang, Yibo Sun, and Shiyan Su. 2023. Multimodal learning for automatic summarization: A survey. In International Conference on Advanced Data Mining and Applications. Springer, 362\u2013376."},{"key":"e_1_3_3_208_2","doi-asserted-by":"crossref","first-page":"12037","DOI":"10.18653\/v1\/2022.emnlp-main.825","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Zhao Nan","year":"2022","unstructured":"Nan Zhao, Haoran Li, Youzheng Wu, and Xiaodong He. 2022. JDDC 2.1: A multimodal Chinese dialogue dataset with joint tasks of query rewriting, response generation, discourse parsing, and summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 12037\u201312051."},{"key":"e_1_3_3_209_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2024.112908"},{"key":"e_1_3_3_210_2","doi-asserted-by":"publisher","unstructured":"Junnan Zhu Haoran Li Tianshang Liu Yu Zhou Jiajun Zhang and Chengqing Zong. 2018. MSMO: Multimodal summarization with multimodal output. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics Brussels Belgium 4154\u20134164. 
DOI:10.18653\/v1\/D18-1448","DOI":"10.18653\/v1\/D18-1448"},{"key":"e_1_3_3_211_2","doi-asserted-by":"crossref","unstructured":"Junnan Zhu Lu Xiang Yu Zhou Jiajun Zhang and Chengqing Zong. 2021. Graph-based multimodal ranking models for multimodal summarization. Transactions on Asian and Low-Resource Language Information Processing 20 4 (2021) 1\u201321.","DOI":"10.1145\/3445794"},{"key":"e_1_3_3_212_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6525"},{"issue":"1","key":"e_1_3_3_213_2","first-page":"1","article-title":"COGNIMUSE: A multimodal video database annotated with saliency, events, semantics and emotion with application to summarization","volume":"2017","author":"Zlatintsi Athanasia","year":"2017","unstructured":"Athanasia Zlatintsi, Petros Koutras, Georgios Evangelopoulos, Nikolaos Malandrakis, Niki Efthymiou, Katerina Pastra, Alexandros Potamianos, and Petros Maragos. 2017. COGNIMUSE: A multimodal video database annotated with saliency, events, semantics and emotion with application to summarization. 
EURASIP Journal on Image and Video Processing 2017, 1 (2017), 1\u201324.","journal-title":"EURASIP Journal on Image and Video Processing"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3763245","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T15:23:00Z","timestamp":1759159380000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3763245"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,29]]},"references-count":212,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3763245"],"URL":"https:\/\/doi.org\/10.1145\/3763245","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,29]]},"assertion":[{"value":"2024-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-08-14","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-29","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}