{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T22:05:54Z","timestamp":1780524354831,"version":"3.54.1"},"reference-count":198,"publisher":"Association for Computing Machinery (ACM)","issue":"13s","license":[{"start":{"date-parts":[[2023,7,13]],"date-time":"2023-07-13T00:00:00Z","timestamp":1689206400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2023,12,31]]},"abstract":"<jats:p>The new era of technology has brought us to the point where it is convenient for people to share their opinions over an abundance of platforms. These platforms have a provision for the users to express themselves in multiple forms of representations, including text, images, videos, and audio. This, however, makes it difficult for users to obtain all the key information about a topic, making the task of automatic multi-modal summarization (MMS) essential. In this article, we present a comprehensive survey of the existing research in the area of MMS, covering various modalities such as text, image, audio, and video. 
Apart from highlighting the different evaluation metrics and datasets used for the MMS task, our work also discusses the current challenges and future directions in this field.<\/jats:p>","DOI":"10.1145\/3584700","type":"journal-article","created":{"date-parts":[[2023,2,21]],"date-time":"2023-02-21T11:22:45Z","timestamp":1676978565000},"page":"1-36","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":54,"title":["A Survey on Multi-modal Summarization"],"prefix":"10.1145","volume":"55","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5571-6098","authenticated-orcid":false,"given":"Anubhav","family":"Jangra","sequence":"first","affiliation":[{"name":"Department of Computer Science, Indian Institute of Technology Patna, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1980-2735","authenticated-orcid":false,"given":"Sourajit","family":"Mukherjee","sequence":"additional","affiliation":[{"name":"Department of Mathematics, Indian Institute of Technology Patna, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7235-0665","authenticated-orcid":false,"given":"Adam","family":"Jatowt","sequence":"additional","affiliation":[{"name":"Department of Informatics & DiSC, University of Innsbruck, Austria"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5458-9381","authenticated-orcid":false,"given":"Sriparna","family":"Saha","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Indian Institute of Technology Patna, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1838-0091","authenticated-orcid":false,"given":"Mohammad","family":"Hasanuzzaman","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Cork Institute of Technology, 
Ireland"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,7,13]]},"reference":[{"issue":"2","key":"e_1_3_2_2_2","first-page":"105","article-title":"Multi-document summarization model based on integer linear programming","volume":"1","author":"Alguliev Rasim","year":"2010","unstructured":"Rasim Alguliev, Ramiz Aliguliyev, and Makrufa Hajirahimova. 2010. Multi-document summarization model based on integer linear programming. Intell. Contr. Autom. 1, 2 (2010), 105.","journal-title":"Intell. Contr. Autom."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10844-018-0521-8"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2019.00061"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00530-010-0182-0"},{"key":"e_1_3_2_6_2","unstructured":"Dzmitry Bahdanau Kyunghyun Cho and Yoshua Bengio. 2016. Neural Machine Translation by Jointly Learning to Align and Translate. arxiv:cs.CL\/1409.0473."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2798607"},{"key":"e_1_3_2_8_2","article-title":"Multimodal emoji prediction","author":"Barbieri Francesco","year":"2018","unstructured":"Francesco Barbieri, Miguel Ballesteros, Francesco Ronzano, and Horacio Saggion. 2018. Multimodal emoji prediction. arXiv preprint arXiv:1803.02392.","journal-title":"arXiv preprint arXiv:1803.02392"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3355398"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-1086"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/2505515.2505652"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2014.2384912"},{"key":"e_1_3_2_13_2","first-page":"993","article-title":"Latent Dirichlet allocation","volume":"3","author":"Blei David M.","year":"2003","unstructured":"David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. 
Latent Dirichlet allocation. J. Mach. Learn. Res. 3, Jan. (2003), 993\u20131022.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1134"},{"key":"e_1_3_2_15_2","article-title":"Probing the need for visual context in multimodal machine translation","author":"Caglayan Ozan","year":"2019","unstructured":"Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Lo\u00efc Barrault. 2019. Probing the need for visual context in multimodal machine translation. arXiv preprint arXiv:1903.08678.","journal-title":"arXiv preprint arXiv:1903.08678"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2010.5582561"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1438"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/SKG.2018.00033"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/SKG49510.2019.00029"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.5721"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330725"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1063"},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Yen-Chun Chen Linjie Li Licheng Yu Ahmed El Kholy Faisal Ahmed Zhe Gan Yu Cheng and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. 
arxiv:cs.CV\/1909.11740.","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1179"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1264"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.4000\/books.aaccademia.4595"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2019.8793868"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.bionlp-1.33"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_3_2_31_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et\u00a0al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1329"},{"key":"e_1_3_2_33_2","article-title":"Findings of the second shared task on multimodal machine translation and multilingual image description","author":"Elliott Desmond","year":"2017","unstructured":"Desmond Elliott, Stella Frank, Lo\u00efc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. 
arXiv preprint arXiv:1710.07177 (2017).","journal-title":"arXiv preprint arXiv:1710.07177"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1613\/jair.1523"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2019.04.001"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2003.1221239"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.compstruc.2012.07.010"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2008.4712308"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2013.2267205"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2009.4960393"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00373"},{"key":"e_1_3_2_42_2","unstructured":"Fangxiaoyu Feng Yinfei Yang Daniel Cer Naveen Arivazhagan and Wei Wang. 2020. Language-agnostic BERT Sentence Embedding. arxiv:cs.CL\/2007.01852."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2004.11.012"},{"key":"e_1_3_2_44_2","article-title":"Multi-modal summarization for video-containing documents","author":"Fu Xiyan","year":"2020","unstructured":"Xiyan Fu, Jun Wang, and Zhenglu Yang. 2020. Multi-modal summarization for video-containing documents. arXiv preprint arXiv:2009.08018.","journal-title":"arXiv preprint arXiv:2009.08018"},{"key":"e_1_3_2_45_2","first-page":"911","volume-title":"Proceedings of the International Conference on Computational Linguistics (COLING\u201912)","author":"Galanis Dimitrios","year":"2012","unstructured":"Dimitrios Galanis, Gerasimos Lampouras, and Ion Androutsopoulos. 2012. Extractive multi-document summarization with integer linear programming and support vector regression. In Proceedings of the International Conference on Computational Linguistics (COLING\u201912). 
911\u2013926."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-016-9475-9"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.124"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1001\/jamaophthalmol.2019.2004"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCSP.2016.7754131"},{"issue":"3","key":"e_1_3_2_50_2","first-page":"258","article-title":"A survey of text summarization extractive techniques","volume":"2","author":"Gupta Vishal","year":"2010","unstructured":"Vishal Gupta and Gurpreet Singh Lehal. 2010. A survey of text summarization extractive techniques. J. Emerg. Technol. Web Intell. 2, 3 (2010), 258\u2013268.","journal-title":"J. Emerg. Technol. Web Intell."},{"key":"e_1_3_2_51_2","article-title":"Exploring explainable selection to control abstractive summarization","author":"Haonan Wang","year":"2020","unstructured":"Wang Haonan, Gao Yang, Bai Yu, Mirella Lapata, and Huang Heyan. 2020. Exploring explainable selection to control abstractive summarization. arXiv preprint arXiv:2004.11779.","journal-title":"arXiv preprint arXiv:2004.11779"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00685"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.5555\/2566972.2566993"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.450"},{"key":"e_1_3_2_57_2","first-page":"2528","volume-title":"Proceedings of the CVPR Workshops","author":"Hori Chiori","year":"2018","unstructured":"Chiori Hori, Takaaki Hori, Gordon Wichern, Jue Wang, Teng-yok Lee, Anoop Cherian, and Tim K. Marks. 2018. Multimodal attention for fusion of audio and spatiotemporal features for video description. In Proceedings of the CVPR Workshops. 
2528\u20132531."},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2016.12.002"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W16-2360"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01219-9_11"},{"key":"e_1_3_2_61_2","article-title":"Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers","author":"Huang Zhicheng","year":"2020","unstructured":"Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.","journal-title":"arXiv preprint arXiv:2004.00849"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107567"},{"key":"e_1_3_2_63_2","first-page":"99","volume-title":"Proceedings of the Workshop on Multimodal User Authentication","author":"Indovina M.","year":"2003","unstructured":"M. Indovina, U. Uludag, R. Snelick, A. Mink, and A. Jain. 2003. Multimodal biometric authentication methods: A COTS approach. In Proceedings of the Workshop on Multimodal User Authentication. Citeseer, 99\u2013106."},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2006.10.019"},{"key":"e_1_3_2_65_2","article-title":"A survey on medical document summarization","author":"Jain Raghav","year":"2022","unstructured":"Raghav Jain, Anubhav Jangra, Sriparna Saha, and Adam Jatowt. 2022. A survey on medical document summarization. arXiv preprint arXiv:2212.01669 (2022).","journal-title":"arXiv preprint arXiv:2212.01669"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-99736-6_21"},{"key":"e_1_3_2_67_2","first-page":"191","volume-title":"Proceedings of the 17th International Conference on Natural Language Processing (ICON)","author":"Jangra Anubhav","year":"2020","unstructured":"Anubhav Jangra, Raghav Jain, Vaibhav Mavi, Sriparna Saha, and Pushpak Bhattacharyya. 2020. 
Semantic extractor-paraphraser based abstractive summarization. In Proceedings of the 17th International Conference on Natural Language Processing (ICON). 191\u2013199."},{"key":"e_1_3_2_68_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-45442-5_24"},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401232"},{"key":"e_1_3_2_70_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3462877"},{"key":"e_1_3_2_71_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-16-6893-7_54"},{"key":"e_1_3_2_72_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548299"},{"key":"e_1_3_2_73_2","first-page":"1889","volume-title":"Proceedings of the International Conference on Advances in Neural Information Processing Systems","author":"Karpathy Andrej","year":"2014","unstructured":"Andrej Karpathy, Armand Joulin, and Li F. Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 1889\u20131897."},{"key":"e_1_3_2_74_2","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1007\/978-981-15-5554-1_5","volume-title":"Evaluating Information Retrieval and Access Tasks","author":"Kato Tsuneaki","year":"2021","unstructured":"Tsuneaki Kato. 2021. Multi-modal summarization. In Evaluating Information Retrieval and Access Tasks. Springer, Singapore, 71\u201382."},{"key":"e_1_3_2_75_2","article-title":"The kinetics human action video dataset","author":"Kay Will","year":"2017","unstructured":"Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et\u00a0al. 2017. The kinetics human action video dataset. 
arXiv preprint arXiv:1705.06950.","journal-title":"arXiv preprint arXiv:1705.06950"},{"key":"e_1_3_2_76_2","article-title":"Transformers in vision: A survey","author":"Khan Salman","year":"2021","unstructured":"Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2021. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169.","journal-title":"arXiv preprint arXiv:2101.01169"},{"key":"e_1_3_2_77_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475692"},{"key":"e_1_3_2_78_2","article-title":"MAST: Multimodal abstractive summarization with trimodal hierarchical attention","author":"Khullar Aman","year":"2020","unstructured":"Aman Khullar and Udit Arora. 2020. MAST: Multimodal abstractive summarization with trimodal hierarchical attention. arXiv preprint arXiv:2010.08021.","journal-title":"arXiv preprint arXiv:2010.08021"},{"key":"e_1_3_2_79_2","first-page":"361","article-title":"Multimodal residual learning for visual QA","volume":"29","author":"Kim Jin-Hwa","year":"2016","unstructured":"Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. 2016. Multimodal residual learning for visual QA. Adv. Neural Inf. Process. Syst. 29 (2016), 361\u2013369.","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_2_80_2","article-title":"Hadamard product for low-rank bilinear pooling","author":"Kim Jin-Hwa","year":"2016","unstructured":"Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. 2016. Hadamard product for low-rank bilinear pooling. 
arXiv preprint arXiv:1610.04325.","journal-title":"arXiv preprint arXiv:1610.04325"},{"key":"e_1_3_2_81_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-5018"},{"key":"e_1_3_2_82_2","doi-asserted-by":"publisher","DOI":"10.1109\/i-PACT44901.2019.8960003"},{"key":"e_1_3_2_83_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0085060"},{"key":"e_1_3_2_84_2","article-title":"Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation","author":"Klein Benjamin","year":"2014","unstructured":"Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. 2014. Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399.","journal-title":"arXiv preprint arXiv:1411.7399"},{"key":"e_1_3_2_85_2","doi-asserted-by":"publisher","DOI":"10.3389\/fict.2017.00011"},{"key":"e_1_3_2_86_2","unstructured":"Yaniv Leviathan and Yossi Matias. 2018. Google Duplex: An AI system for accomplishing real-world tasks over the phone."},{"key":"e_1_3_2_87_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6332"},{"key":"e_1_3_2_88_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/577"},{"key":"e_1_3_2_89_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1114"},{"key":"e_1_3_2_90_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2848260"},{"key":"e_1_3_2_91_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340531.3412879"},{"key":"e_1_3_2_92_2","article-title":"VisualBERT: A simple and performant baseline for vision and language","author":"Li Liunian Harold","year":"2019","unstructured":"Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. 
arXiv preprint arXiv:1908.03557.","journal-title":"arXiv preprint arXiv:1908.03557"},{"key":"e_1_3_2_93_2","article-title":"VMSMO: Learning to generate multimodal summary for video-based news articles","author":"Li Mingzhe","year":"2020","unstructured":"Mingzhe Li, Xiuying Chen, Shen Gao, Zhangming Chan, Dongyan Zhao, and Rui Yan. 2020. VMSMO: Learning to generate multimodal summary for video-based news articles. arXiv preprint arXiv:2010.05406.","journal-title":"arXiv preprint arXiv:2010.05406"},{"key":"e_1_3_2_94_2","volume-title":"Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL)","author":"Libovick\u1ef3 Jindrich","year":"2018","unstructured":"Jindrich Libovick\u1ef3, Shruti Palaskar, Spandana Gella, and Florian Metze. 2018. Multimodal abstractive summarization for open-domain videos. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL)."},{"key":"e_1_3_2_95_2","first-page":"74","volume-title":"Text Summarization Branches Out","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74\u201381. Retrieved from https:\/\/www.aclweb.org\/anthology\/W04-1013."},{"key":"e_1_3_2_96_2","first-page":"912","volume-title":"Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Lin Hui","year":"2010","unstructured":"Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 
912\u2013920."},{"key":"e_1_3_2_97_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00693"},{"key":"e_1_3_2_98_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-1004"},{"key":"e_1_3_2_99_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2015.2405553"},{"key":"e_1_3_2_100_2","article-title":"Learn to combine modalities in multimodal deep learning","author":"Liu Kuan","year":"2018","unstructured":"Kuan Liu, Yanen Li, Ning Xu, and Prem Natarajan. 2018. Learn to combine modalities in multimodal deep learning. arXiv preprint arXiv:1805.11730.","journal-title":"arXiv preprint arXiv:1805.11730"},{"key":"e_1_3_2_101_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.144"},{"key":"e_1_3_2_102_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2020.12.014"},{"key":"e_1_3_2_103_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.207"},{"key":"e_1_3_2_104_2","first-page":"13","volume-title":"Proceedings of the International Conference on Advances in Neural Information Processing Systems","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. VilBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 13\u201323."},{"key":"e_1_3_2_105_2","doi-asserted-by":"publisher","DOI":"10.1147\/rd.22.0159"},{"key":"e_1_3_2_106_2","unstructured":"Congbo Ma Wei Emma Zhang Mingyu Guo Hu Wang and Quan Z. Sheng. 2020. Multi-document Summarization via Deep Learning Techniques: A Survey. arxiv:cs.CL\/2011.04843."},{"key":"e_1_3_2_107_2","article-title":"On faithfulness and factuality in abstractive summarization","author":"Maynez Joshua","year":"2020","unstructured":"Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. 
arXiv preprint arXiv:2005.00661.","journal-title":"arXiv preprint arXiv:2005.00661"},{"key":"e_1_3_2_108_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2019.102123"},{"key":"e_1_3_2_109_2","doi-asserted-by":"publisher","DOI":"10.3115\/1219044.1219064"},{"key":"e_1_3_2_110_2","first-page":"404","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing","author":"Mihalcea Rada","year":"2004","unstructured":"Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 404\u2013411."},{"key":"e_1_3_2_111_2","first-page":"3111","volume-title":"Proceedings of the International Conference on Advances in Neural Information Processing Systems","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 3111\u20133119."},{"key":"e_1_3_2_112_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.advengsoft.2013.12.007"},{"key":"e_1_3_2_113_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-26190-4_12"},{"key":"e_1_3_2_114_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-48743-4_27"},{"key":"e_1_3_2_115_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jvcir.2007.04.002"},{"key":"e_1_3_2_116_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1186"},{"key":"e_1_3_2_117_2","article-title":"Multimodal named entity recognition for short social media posts","author":"Moon Seungwhan","year":"2018","unstructured":"Seungwhan Moon, Leonardo Neves, and Vitor Carvalho. 2018. Multimodal named entity recognition for short social media posts. 
arXiv preprint arXiv:1802.07862.","journal-title":"arXiv preprint arXiv:1802.07862"},{"key":"e_1_3_2_118_2","doi-asserted-by":"publisher","DOI":"10.1145\/2070481.2070509"},{"key":"e_1_3_2_119_2","first-page":"387","volume-title":"Findings of the Association for Computational Linguistics (AACL-IJCNLP\u201922)","author":"Mukherjee Sourajit","year":"2022","unstructured":"Sourajit Mukherjee, Anubhav Jangra, Sriparna Saha, and Adam Jatowt. 2022. Topic-aware multimodal summarization. In Findings of the Association for Computational Linguistics (AACL-IJCNLP\u201922). 387\u2013398."},{"key":"e_1_3_2_120_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF01588971"},{"key":"e_1_3_2_121_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-3223-4_3"},{"key":"e_1_3_2_122_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1222"},{"key":"e_1_3_2_123_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10462-012-9332-4"},{"key":"e_1_3_2_124_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2003.06.005"},{"key":"e_1_3_2_125_2","article-title":"Multimodal abstractive summarization for how2 videos","author":"Palaskar Shruti","year":"2019","unstructured":"Shruti Palaskar, Jindrich Libovick\u1ef3, Spandana Gella, and Florian Metze. 2019. Multimodal abstractive summarization for how2 videos. arXiv preprint arXiv:1906.07901.","journal-title":"arXiv preprint arXiv:1906.07901"},{"key":"e_1_3_2_126_2","first-page":"4055","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Parmar Niki","year":"2018","unstructured":"Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer. In Proceedings of the International Conference on Machine Learning. 
PMLR, 4055\u20134064."},{"key":"e_1_3_2_127_2","article-title":"Support-set bottlenecks for video-text representation learning","author":"Patrick Mandela","year":"2020","unstructured":"Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander Hauptmann, Joao Henriques, and Andrea Vedaldi. 2020. Support-set bottlenecks for video-text representation learning. arXiv preprint arXiv:2010.02824.","journal-title":"arXiv preprint arXiv:2010.02824"},{"key":"e_1_3_2_128_2","doi-asserted-by":"publisher","DOI":"10.1080\/14786440109462720"},{"key":"e_1_3_2_129_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_2_130_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-4510"},{"key":"e_1_3_2_131_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2018.10.028"},{"key":"e_1_3_2_132_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2014.2369731"},{"key":"e_1_3_2_133_2","article-title":"Multimodal machine translation with reinforcement learning","author":"Qian Xin","year":"2018","unstructured":"Xin Qian, Ziyi Zhong, and Jieli Zhou. 2018. Multimodal machine translation with reinforcement learning. arXiv preprint arXiv:1805.02356.","journal-title":"arXiv preprint arXiv:1805.02356"},{"key":"e_1_3_2_134_2","doi-asserted-by":"publisher","DOI":"10.1007\/s40747-019-0115-2"},{"key":"e_1_3_2_135_2","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2017.2738401"},{"key":"e_1_3_2_136_2","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2984066"},{"key":"e_1_3_2_137_2","unstructured":"Aditya Ramesh Mikhail Pavlov Gabriel Goh Scott Gray Chelsea Voss Alec Radford Mark Chen and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. 
arxiv:cs.CV\/2102.12092."},{"key":"e_1_3_2_138_2","first-page":"139","volume-title":"Proceedings of the NAACL HLT Workshop on Creating Speech and Language Data with Amazon\u2019s Mechanical Turk","author":"Rashtchian Cyrus","year":"2010","unstructured":"Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon\u2019s Mechanical Turk. In Proceedings of the NAACL HLT Workshop on Creating Speech and Language Data with Amazon\u2019s Mechanical Turk. 139\u2013147."},{"key":"e_1_3_2_139_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2013.9"},{"key":"e_1_3_2_140_2","doi-asserted-by":"publisher","DOI":"10.1061\/(ASCE)1090-0241(2004)130:6(636)"},{"key":"e_1_3_2_141_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-28569-1_1"},{"key":"e_1_3_2_142_2","doi-asserted-by":"publisher","DOI":"10.1145\/2509916.2509925"},{"key":"e_1_3_2_143_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSS.2021.3110819"},{"key":"e_1_3_2_144_2","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0223477"},{"key":"e_1_3_2_145_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2018.10.021"},{"key":"e_1_3_2_146_2","doi-asserted-by":"publisher","DOI":"10.5555\/77013"},{"key":"e_1_3_2_147_2","doi-asserted-by":"publisher","DOI":"10.1145\/3347318.3355524"},{"key":"e_1_3_2_148_2","article-title":"How2: A large-scale dataset for multimodal language understanding","author":"Sanabria Ramon","year":"2018","unstructured":"Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Lo\u00efc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A large-scale dataset for multimodal language understanding. 
arXiv preprint arXiv:1811.00347.","journal-title":"arXiv preprint arXiv:1811.00347"},{"key":"e_1_3_2_149_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413752"},{"issue":"13","key":"e_1_3_2_150_2","first-page":"30","article-title":"A survey on video summarization techniques","volume":"132","author":"Sebastian Tinumol","year":"2015","unstructured":"Tinumol Sebastian and Jiby J. Puthiyidam. 2015. A survey on video summarization techniques. Int. J. Comput. Appl 132, 13 (2015), 30\u201332.","journal-title":"Int. J. Comput. Appl"},{"key":"e_1_3_2_151_2","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1117\/12.600746","volume-title":"Internet Imaging VI","author":"Sebe Nicu","year":"2005","unstructured":"Nicu Sebe, Ira Cohen, Theo Gevers, and Thomas S. Huang. 2005. Multimodal approaches for emotion recognition: A survey. In Internet Imaging VI, Vol. 5670. International Society for Optics and Photonics, 56\u201367."},{"key":"e_1_3_2_152_2","article-title":"Get to the point: Summarization with pointer-generator networks","volume":"1704","author":"See Abigail","year":"2017","unstructured":"Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. CoRR abs\/1704.04368.","journal-title":"CoRR"},{"key":"e_1_3_2_153_2","doi-asserted-by":"publisher","DOI":"10.1145\/3485447.3512257"},{"key":"e_1_3_2_154_2","doi-asserted-by":"publisher","DOI":"10.1145\/2484028.2484045"},{"key":"e_1_3_2_155_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. 
In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_156_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.418"},{"key":"e_1_3_2_157_2","doi-asserted-by":"publisher","DOI":"10.5555\/2380816.2380846"},{"key":"e_1_3_2_158_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2005.57"},{"key":"e_1_3_2_159_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2017.08.003"},{"issue":"12","key":"e_1_3_2_160_2","first-page":"1","article-title":"A machine learning ensemble classifier for early prediction of diabetic retinopathy","volume":"41","author":"Somasundaram S. K.","year":"2017","unstructured":"S. K. Somasundaram and P. Alli. 2017. A machine learning ensemble classifier for early prediction of diabetic retinopathy. J. Med. Syst. 41, 12 (2017), 1\u201312.","journal-title":"J. Med. Syst."},{"key":"e_1_3_2_161_2","unstructured":"Lucia Specia. 2018. Multi-modal context modelling for machine translation. (2018). https:\/\/rua.ua.es\/dspace\/handle\/10045\/76101."},{"key":"e_1_3_2_162_2","first-page":"114101","article-title":"Why pay more? A simple and efficient named entity recognition system for tweets","author":"Suman Chanchal","year":"2020","unstructured":"Chanchal Suman, Saichethan Miriyala Reddy, Sriparna Saha, and Pushpak Bhattacharyya. 2020. Why pay more? A simple and efficient named entity recognition system for tweets. Exp. Syst. Applic. (2020), 114101.","journal-title":"Exp. Syst. Applic."},{"key":"e_1_3_2_163_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00756"},{"key":"e_1_3_2_164_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_3_2_165_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-20161-5_18"},{"key":"e_1_3_2_166_2","doi-asserted-by":"publisher","DOI":"10.17261\/Pressacademia.2017.591"},{"key":"e_1_3_2_167_2","article-title":"What makes a good summary? 
Reconsidering the focus of automatic summarization","author":"Hoeve Maartje ter","year":"2020","unstructured":"Maartje ter Hoeve, Julia Kiseleva, and Maarten de Rijke. 2020. What makes a good summary? Reconsidering the focus of automatic summarization. arXiv preprint arXiv:2012.07619.","journal-title":"arXiv preprint arXiv:2012.07619"},{"key":"e_1_3_2_168_2","doi-asserted-by":"publisher","DOI":"10.1145\/3115433"},{"key":"e_1_3_2_169_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2011.5711541"},{"key":"e_1_3_2_170_2","article-title":"Going deeper with image transformers","author":"Touvron Hugo","year":"2021","unstructured":"Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Herv\u00e9 J\u00e9gou. 2021. Going deeper with image transformers. arXiv preprint arXiv:2103.17239.","journal-title":"arXiv preprint arXiv:2103.17239"},{"key":"e_1_3_2_171_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.tourman.2020.104122"},{"key":"e_1_3_2_172_2","doi-asserted-by":"publisher","DOI":"10.1145\/1943403.1943412"},{"key":"e_1_3_2_173_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.192"},{"key":"e_1_3_2_174_2","first-page":"5998","volume-title":"Proceedings of the International Conference on Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 5998\u20136008."},{"key":"e_1_3_2_175_2","first-page":"6170","volume-title":"Proceedings of the 13th Language Resources and Evaluation Conference","author":"Verma Yash","year":"2022","unstructured":"Yash Verma, Anubhav Jangra, Sriparna Saha, Adam Jatowt, and Dwaipayan Roy. 2022. MAKED: Multi-lingual automatic keyword extraction dataset. 
In Proceedings of the 13th Language Resources and Evaluation Conference. 6170\u20136179."},{"key":"e_1_3_2_176_2","doi-asserted-by":"publisher","DOI":"10.1007\/s41809-019-00047-z"},{"key":"e_1_3_2_177_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.541"},{"key":"e_1_3_2_178_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11889"},{"key":"e_1_3_2_179_2","doi-asserted-by":"publisher","DOI":"10.1145\/2461466.2461480"},{"key":"e_1_3_2_180_2","article-title":"A deep multi-level attentive network for multimodal sentiment analysis","author":"Yadav Ashima","year":"2020","unstructured":"Ashima Yadav and Dinesh Kumar Vishwakarma. 2020. A deep multi-level attentive network for multimodal sentiment analysis. arXiv preprint arXiv:2012.08256.","journal-title":"arXiv preprint arXiv:2012.08256"},{"key":"e_1_3_2_181_2","doi-asserted-by":"publisher","DOI":"10.1145\/2396761.2396799"},{"key":"e_1_3_2_182_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-017-1042-4"},{"key":"e_1_3_2_183_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_3_2_184_2","unstructured":"Jianfei Yu, Jing Jiang, Li Yang, and Rui Xia. 2020. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. Association for Computational Linguistics. https:\/\/aclanthology.org\/2020.acl-main.306\/."},{"key":"e_1_3_2_185_2","first-page":"1113","volume-title":"Proceedings of the 26th International Conference on Computational Linguistics (COLING\u201916)","author":"Yu Naitong","year":"2016","unstructured":"Naitong Yu, Minlie Huang, Yuanyuan Shi, and Xiaoyan Zhu. 2016. Product review summarization by exploiting phrase properties. In Proceedings of the 26th International Conference on Computational Linguistics (COLING\u201916). 
1113\u20131124."},{"key":"e_1_3_2_186_2","article-title":"Tensor fusion network for multimodal sentiment analysis","author":"Zadeh Amir","year":"2017","unstructured":"Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.","journal-title":"arXiv preprint arXiv:1707.07250"},{"key":"e_1_3_2_187_2","article-title":"Is a picture worth a thousand words? A Deep Multi-modal Fusion Architecture for Product Classification in e-commerce","author":"Zahavy Tom","year":"2016","unstructured":"Tom Zahavy, Alessandro Magnani, Abhinandan Krishnan, and Shie Mannor. 2016. Is a picture worth a thousand words? A Deep Multi-modal Fusion Architecture for Product Classification in e-commerce. arXiv preprint arXiv:1611.09534.","journal-title":"arXiv preprint arXiv:1611.09534"},{"key":"e_1_3_2_188_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2007.12.039"},{"key":"e_1_3_2_189_2","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654903"},{"key":"e_1_3_2_190_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11962"},{"key":"e_1_3_2_191_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Zhang Tianyi","year":"2020","unstructured":"Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=SkeHuCVFDr."},{"key":"e_1_3_2_192_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1053"},{"key":"e_1_3_2_193_2","article-title":"Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward","author":"Zhou Kaiyang","year":"2017","unstructured":"Kaiyang Zhou, Yu Qiao, and Tao Xiang. 2017. 
Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. arXiv preprint arXiv:1801.00054.","journal-title":"arXiv preprint arXiv:1801.00054"},{"key":"e_1_3_2_194_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.7005"},{"key":"e_1_3_2_195_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1448"},{"key":"e_1_3_2_196_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6525"},{"key":"e_1_3_2_197_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6525"},{"key":"e_1_3_2_198_2","volume-title":"Multi-dimensional Summarization in Cyber-physical Society","author":"Zhuge Hai","year":"2016","unstructured":"Hai Zhuge. 2016. Multi-dimensional Summarization in Cyber-physical Society. Morgan Kaufmann."},{"issue":"1","key":"e_1_3_2_199_2","first-page":"1","article-title":"COGNIMUSE: A multimodal video database annotated with saliency, events, semantics and emotion with application to summarization","volume":"2017","author":"Zlatintsi Athanasia","year":"2017","unstructured":"Athanasia Zlatintsi, Petros Koutras, Georgios Evangelopoulos, Nikolaos Malandrakis, Niki Efthymiou, Katerina Pastra, Alexandros Potamianos, and Petros Maragos. 2017. COGNIMUSE: A multimodal video database annotated with saliency, events, semantics and emotion with application to summarization. EURASIP J. Image Vid. Process. 2017, 1 (2017), 1\u201324.","journal-title":"EURASIP J. Image Vid. 
Process."}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3584700","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3584700","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:37Z","timestamp":1750182697000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3584700"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,13]]},"references-count":198,"journal-issue":{"issue":"13s","published-print":{"date-parts":[[2023,12,31]]}},"alternative-id":["10.1145\/3584700"],"URL":"https:\/\/doi.org\/10.1145\/3584700","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,13]]},"assertion":[{"value":"2021-01-05","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-12-12","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-07-13","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}