{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T19:08:19Z","timestamp":1757617699875,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":40,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,9,22]]},"DOI":"10.1145\/3705328.3759303","type":"proceedings-article","created":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T10:46:13Z","timestamp":1757155573000},"page":"1159-1163","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8466-3933","authenticated-orcid":false,"given":"Marco","family":"De Nadai","sequence":"first","affiliation":[{"name":"Spotify, Copenhagen, Denmark"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-7194-4155","authenticated-orcid":false,"given":"Andreas","family":"Damianou","sequence":"additional","affiliation":[{"name":"Spotify, Cambridge, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3531-3096","authenticated-orcid":false,"given":"Mounia","family":"Lalmas","sequence":"additional","affiliation":[{"name":"Spotify, London, United Kingdom"}]}],"member":"320","published-online":{"date-parts":[[2025,9,7]]},"reference":[{"key":"e_1_3_3_2_2_2","unstructured":"Sami Abu-El-Haija Nisarg Kothari Joonseok Lee Paul Natsev George Toderici Balakrishnan Varadarajan and Sudheendra Vijayanarasimhan. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1609.08675 (2016)."},{"key":"e_1_3_3_2_3_2","unstructured":"Kirolos Ataallah Xiaoqian Shen Eslam Abdelrahman Essam Sleiman Deyao Zhu Jian Ding and Mohamed Elhoseiny. 2024. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2404.03413 (2024)."},{"key":"e_1_3_3_2_4_2","unstructured":"Yunfei Chu Jin Xu Xiaohuan Zhou Qian Yang Shiliang Zhang Zhijie Yan Chang Zhou and Jingren Zhou. 2023. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2311.07919 (2023)."},{"key":"e_1_3_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/2959100.2959190"},{"key":"e_1_3_3_2_6_2","unstructured":"Chaoyou Fu Haojia Lin Xiong Wang Yi-Fan Zhang Yunhang Shen Xiaoyu Liu Haoyu Cao Zuwei Long Heting Gao Ke Li et\u00a0al. 2025. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2501.01957 (2025)."},{"key":"e_1_3_3_2_7_2","unstructured":"Junchen Fu Xuri Ge Xin Xin Alexandros Karatzoglou Ioannis Arapakis Kaiwen Zheng Yongxin Ni and Joemon\u00a0M Jose. 2024. Efficient and Effective Adaptation of Multimodal Foundation Models in Sequential Recommendation. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2411.02992 (2024)."},{"key":"e_1_3_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3511808.3557065"},{"key":"e_1_3_3_2_9_2","doi-asserted-by":"crossref","unstructured":"Pan Gu Haiyang Hu and Guandong Xu. 2024. Modeling multi-behavior sequence via HyperGRU contrastive network for micro-video recommendation. Knowledge-Based Systems 295 (2024) 111841.","DOI":"10.1016\/j.knosys.2024.111841"},{"key":"e_1_3_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413653"},{"key":"e_1_3_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2018.00035"},{"key":"e_1_3_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2017.121"},{"key":"e_1_3_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467189"},{"key":"e_1_3_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE60146.2024.00380"},{"key":"e_1_3_3_2_15_2","unstructured":"Haotian Liu Chunyuan Li Yuheng Li and Yong\u00a0Jae Lee. 2023. Improved Baselines with Visual Instruction Tuning."},{"key":"e_1_3_3_2_16_2","volume-title":"NeurIPS","author":"Liu Haotian","year":"2023","unstructured":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong\u00a0Jae Lee. 2023. Visual Instruction Tuning. In NeurIPS."},{"key":"e_1_3_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3383313.3418479"},{"key":"e_1_3_3_2_18_2","unstructured":"Yongxin Ni Yu Cheng Xiangyan Liu Junchen Fu Youhua Li Xiangnan He Yongfeng Zhang and Fajie Yuan. 2023. A content-driven micro-video recommendation dataset at scale. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2309.15379 (2023)."},{"key":"e_1_3_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539156"},{"key":"e_1_3_3_2_20_2","unstructured":"Peixuan Qi. 2024. Movie Visual and Speech Analysis through Multi-Modal LLM for Recommendation Systems. IEEE Access (2024)."},{"key":"e_1_3_3_2_21_2","first-page":"28492","volume-title":"International conference on machine learning","author":"Radford Alec","year":"2023","unstructured":"Alec Radford, Jong\u00a0Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning. PMLR, 28492\u201328518."},{"key":"e_1_3_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3640457.3688190"},{"key":"e_1_3_3_2_23_2","unstructured":"Zhan Tong Yibing Song Jue Wang and Limin Wang. 2022. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems 35 (2022) 10078\u201310093."},{"key":"e_1_3_3_2_24_2","doi-asserted-by":"crossref","unstructured":"Elahe Vahdani and Yingli Tian. 2022. Deep learning-based action detection in untrimmed videos: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 4 (2022) 4302\u20134320.","DOI":"10.1109\/TPAMI.2022.3193611"},{"key":"e_1_3_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3589334.3645600"},{"key":"e_1_3_3_2_26_2","unstructured":"Peng Wang Shuai Bai Sinan Tan Shijie Wang Zhihao Fan Jinze Bai Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge et\u00a0al. 2024. Qwen2-vl: Enhancing vision-language model\u2019s perception of the world at any resolution. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2409.12191 (2024)."},{"key":"e_1_3_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3351034"},{"key":"e_1_3_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01322"},{"key":"e_1_3_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10095969"},{"key":"e_1_3_3_2_30_2","unstructured":"Shitao Xiao Zheng Liu Peitian Zhang and Niklas Muennighoff. 2023. C-Pack: Packaged Resources To Advance General Chinese Embedding. arxiv:https:\/\/arXiv.org\/abs\/2309.07597\u00a0[cs.CL]"},{"key":"e_1_3_3_2_31_2","unstructured":"Jin Xu Zhifang Guo Jinzheng He Hangrui Hu Ting He Shuai Bai Keqin Chen Jialin Wang Yang Fan Kai Dang et\u00a0al. 2025. Qwen2. 5-omni technical report. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2503.20215 (2025)."},{"key":"e_1_3_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v39i12.33426"},{"key":"e_1_3_3_2_33_2","first-page":"508","volume-title":"Joint European Conference on Machine Learning and Knowledge Discovery in Databases","author":"Yu Yisong","year":"2022","unstructured":"Yisong Yu, Beihong Jin, Jiageng Song, Beibei Li, Yiyuan Zheng, and Wei Zhuo. 2022. Improving micro-video recommendation by controlling position bias. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 508\u2013523."},{"key":"e_1_3_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-26387-3_31"},{"key":"e_1_3_3_2_35_2","unstructured":"Guanghu Yuan Fajie Yuan Yudong Li Beibei Kong Shujie Li Lei Chen Min Yang Chenyun Yu Bo Hu Zang Li et\u00a0al. 2022. Tenrec: A Large-scale Multipurpose Benchmark Dataset for Recommender Systems. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2210.10629 (2022)."},{"key":"e_1_3_3_2_36_2","unstructured":"Yanzhao Zhang Mingxin Li Dingkun Long Xin Zhang Huan Lin Baosong Yang Pengjun Xie An Yang Dayiheng Liu Junyang Lin et\u00a0al. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2506.05176 (2025)."},{"key":"e_1_3_3_2_37_2","volume-title":"International Conference on Machine Learning (ICML)","author":"Zhao Long","year":"2024","unstructured":"Long Zhao, Nitesh\u00a0B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer\u00a0J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David\u00a0A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, and Boqing Gong. 2024. VideoPrism: A Foundational Visual Encoder for Video Understanding. In International Conference on Machine Learning (ICML)."},{"key":"e_1_3_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548428"},{"key":"e_1_3_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657929"},{"key":"e_1_3_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3696410.3714764"},{"key":"e_1_3_3_2_41_2","unstructured":"Tinghui Zhu Kai Zhang Muhao Chen and Yu Su. 2025. Is Extending Modality The Right Path Towards Omni-Modality? arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2506.01872 (2025)."}],"event":{"name":"RecSys '25: Nineteenth ACM Conference on Recommender Systems","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction","SIGAI ACM Special Interest Group on Artificial Intelligence","SIGIR ACM Special Interest Group on Information Retrieval","SIGKDD ACM Special Interest Group on Knowledge Discovery in Data","SIGWEB ACM Special Interest Group on Hypertext, Hypermedia, and Web"],"location":"Prague Czech Republic","acronym":"RecSys '25"},"container-title":["Proceedings of the Nineteenth ACM Conference on Recommender Systems"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3705328.3759303","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T11:41:58Z","timestamp":1757158918000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3705328.3759303"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,7]]},"references-count":40,"alternative-id":["10.1145\/3705328.3759303","10.1145\/3705328"],"URL":"https:\/\/doi.org\/10.1145\/3705328.3759303","relation":{},"subject":[],"published":{"date-parts":[[2025,9,7]]},"assertion":[{"value":"2025-09-07","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}