{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T05:06:33Z","timestamp":1765343193853,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":35,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100000780","name":"European Commission","doi-asserted-by":"publisher","award":["101070250"],"award-info":[{"award-number":["101070250"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Austrian Research Promotion Agency","award":["FO999902665"],"award-info":[{"award-number":["FO999902665"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3758224","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T07:37:21Z","timestamp":1761377841000},"page":"12822-12828","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["A Dataset and Metric for Textual Video Content Description"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-2710-9425","authenticated-orcid":false,"given":"Stefan J.","family":"Arzberger","sequence":"first","affiliation":[{"name":"DIGITAL, JOANNEUM RESEARCH, Graz, Styria, Austria"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-2190-7785","authenticated-orcid":false,"given":"Paul","family":"Raith","sequence":"additional","affiliation":[{"name":"DIGITAL, JOANNEUM RESEARCH, Graz, Styria, Austria"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2442-4900","authenticated-orcid":false,"given":"Werner","family":"Bailer","sequence":"additional","affiliation":[{"name":"DIGITAL, JOANNEUM RESEARCH, Graz, Styria, 
Austria"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-2582-997X","authenticated-orcid":false,"given":"Marion","family":"Jaks","sequence":"additional","affiliation":[{"name":"Austrian Mediathek, Vienna, Austria"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Moloud Abdar Meenakshi Kollati Swaraja Kuraparthi Farhad Pourpanah Daniel McDuff Mohammad Ghavamzadeh Shuicheng Yan Abduallah Mohamed Abbas Khosravi Erik Cambria et al. 2024. A review of deep learning for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).","DOI":"10.1109\/TPAMI.2024.3522295"},{"key":"e_1_3_2_1_2_1","volume-title":"Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al.","author":"Agrawal Pravesh","year":"2024","unstructured":"Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al., 2024. Pixtral 12B. arXiv preprint arXiv:2410.07073 (2024)."},{"key":"e_1_3_2_1_3_1","volume-title":"Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization. 65-72","author":"Banerjee Satanjeev","year":"2005","unstructured":"Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and\/or summarization. 
65-72."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/2002472.2002497"},{"key":"e_1_3_2_1_6_1","first-page":"72842","article-title":"Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset","volume":"36","author":"Chen Sihan","year":"2023","unstructured":"Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. 2023. Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset. Advances in Neural Information Processing Systems, Vol. 36 (2023), 72842-72866.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_7_1","unstructured":"Zesen Cheng Sicong Leng Hang Zhang Yifei Xin Xin Li Guanzheng Chen Yongxin Zhu Wenqi Zhang Ziyang Luo Deli Zhao et al. 2024. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476 (2024)."},{"key":"e_1_3_2_1_8_1","unstructured":"Marta R Costa-Juss\u00e0 James Cross Onur \u00c7elebi Maha Elbayad Kenneth Heafield Kevin Heffernan Elahe Kalbassi Janice Lam Daniel Licht Jean Maillard et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672 (2022)."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.340"},{"key":"e_1_3_2_1_10_1","unstructured":"Aaron Grattafiori et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICISPC63824.2024.00013"},{"key":"e_1_3_2_1_12_1","first-page":"48955","article-title":"Miradata: A large-scale video dataset with long durations and structured captions","volume":"37","author":"Ju Xuan","year":"2024","unstructured":"Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. 2024. 
Miradata: A large-scale video dataset with long durations and structured captions. Advances in Neural Information Processing Systems, Vol. 37 (2024), 48955-48970.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02095"},{"key":"e_1_3_2_1_15_1","volume-title":"Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74-81.","author":"Lin Chin-Yew","year":"2004","unstructured":"Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74-81."},{"key":"e_1_3_2_1_16_1","volume-title":"European conference on computer vision. Springer, 216-233","author":"Liu Yuan","year":"2024","unstructured":"Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al., 2024. Mmbench: Is your multi-modal model an all-around player?. In European conference on computer vision. Springer, 216-233."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00272"},{"key":"e_1_3_2_1_18_1","volume-title":"Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311-318","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311-318."},{"key":"e_1_3_2_1_19_1","volume-title":"International conference on machine learning. PmLR, 8748-8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., 2021. 
Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748-8763."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1410"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24947-6_17"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2016.7532983"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00756"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"crossref","unstructured":"Yunlong Tang Jing Bi Siting Xu Luchuan Song Susan Liang Teng Wang Daoan Zhang Jie An Jingyang Lin Rongyi Zhu et al. 2025. Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for Video Technology (2025).","DOI":"10.1109\/TCSVT.2025.3566695"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"e_1_3_2_1_26_1","volume-title":"Tarsier: Recipes for training and evaluating large video description models","author":"Wang Jiawei","year":"2024","unstructured":"Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. 2024. Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634 (2024)."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00468"},{"key":"e_1_3_2_1_28_1","volume-title":"Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942","author":"Wang Yi","year":"2023","unstructured":"Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al., 2023. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942 (2023)."},{"key":"e_1_3_2_1_29_1","volume-title":"European Conference on Computer Vision. 
Springer, 396-416","author":"Wang Yi","year":"2024","unstructured":"Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al., 2024. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision. Springer, 396-416."},{"key":"e_1_3_2_1_30_1","unstructured":"Zhiyu Wu Xiaokang Chen Zizheng Pan Xingchao Liu Wen Liu Damai Dai Huazuo Gao Yiyang Ma Chengyue Wu Bingxuan Wang et al. 2024. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024)."},{"key":"e_1_3_2_1_31_1","volume-title":"International Conference on Machine Learning. PMLR, 38728-38748","author":"Xu Haiyang","year":"2023","unstructured":"Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, et al., 2023. mplug-2: A modularized multi-modal foundation model across text, image and video. In International Conference on Machine Learning. PMLR, 38728-38748."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.571"},{"key":"e_1_3_2_1_33_1","first-page":"57240","article-title":"Vript: A video is worth thousands of words","volume":"37","author":"Yang Dongjie","year":"2024","unstructured":"Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, and Hai Zhao. 2024. Vript: A video is worth thousands of words. Advances in Neural Information Processing Systems, Vol. 37 (2024), 57240-57261.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_34_1","unstructured":"Pan Zhang Xiaoyi Dong Yuhang Zang Yuhang Cao Rui Qian Lin Chen Qipeng Guo Haodong Duan Bin Wang Linke Ouyang et al. 2024. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output. 
arXiv preprint arXiv:2407.03320 (2024)."},{"key":"e_1_3_2_1_35_1","volume-title":"BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.","author":"Zhang Tianyi","year":"2020","unstructured":"Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations."}],"event":{"name":"MM '25: The 33rd ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Dublin Ireland","acronym":"MM '25"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3758224","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T05:03:23Z","timestamp":1765343003000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3758224"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":35,"alternative-id":["10.1145\/3746027.3758224","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3758224","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}