{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T02:24:01Z","timestamp":1777343041930,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":56,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,11,10]]},"DOI":"10.1145\/3746252.3761398","type":"proceedings-article","created":{"date-parts":[[2025,11,8]],"date-time":"2025-11-08T00:29:28Z","timestamp":1762561768000},"page":"2377-2387","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5206-3909","authenticated-orcid":false,"given":"Claudio","family":"Pomo","sequence":"first","affiliation":[{"name":"Politecnico di Bari, Bari, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-6600-1938","authenticated-orcid":false,"given":"Matteo","family":"Attimonelli","sequence":"additional","affiliation":[{"name":"Politecnico di Bari, Bari, Italy and Sapienza University of Rome, Rome, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-5203-1229","authenticated-orcid":false,"given":"Danilo","family":"Danese","sequence":"additional","affiliation":[{"name":"Politecnico di Bari, Bari, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9255-3256","authenticated-orcid":false,"given":"Fedelucio","family":"Narducci","sequence":"additional","affiliation":[{"name":"Politecnico di Bari, Bari, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0939-5462","authenticated-orcid":false,"given":"Tommaso","family":"Di Noia","sequence":"additional","affiliation":[{"name":"Politecnico di Bari, Bari, 
Italy"}]}],"member":"320","published-online":{"date-parts":[[2025,11,10]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Marah I Abdin Sam Ade Jacobs Ammar Ahmad Awan Jyoti Aneja Ahmed Awadallah Hany Awadalla Nguyen Bach Amit Bahree Arash Bakhtiari Harkirat S. Behl Alon Benhaim Misha Bilenko Johan Bjorck S\u00e9bastien Bubeck Martin Cai Caio C\u00e9sar Teodoro Mendes Weizhu Chen Vishrav Chaudhary Parul Chopra Allie Del Giorno Gustavo de Rosa Matthew Dixon Ronen Eldan Dan Iter Amit Garg Abhishek Goswami Suriya Gunasekar Emman Haider Junheng Hao Russell J. Hewett Jamie Huynh Mojan Javaheripi Xin Jin Piero Kauffmann Nikos Karampatziakis Dongwoo Kim Mahoud Khademi Lev Kurilenko James R. Lee Yin Tat Lee Yuanzhi Li Chen Liang Weishung Liu Eric Lin Zeqi Lin Piyush Madan Arindam Mitra Hardik Modi Anh Nguyen Brandon Norick Barun Patra Daniel Perez-Becker Thomas Portet Reid Pryzant Heyang Qin Marko Radmilac Corby Rosset Sambudha Roy Olatunji Ruwase Olli Saarikivi Amin Saied Adil Salim Michael Santacroce Shital Shah Ning Shang Hiteshi Sharma Xia Song Masahiro Tanaka Xin Wang Rachel Ward Guanhua Wang Philipp Witte Michael Wyatt Can Xu Jiahang Xu Sonali Yadav Fan Yang Ziyi Yang Donghan Yu Chengruidong Zhang Cyril Zhang Jianwen Zhang Li Lyna Zhang Yi Zhang Yue Zhang Yunan Zhang and Xiren Zhou. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. CoRR Vol. abs\/2404.14219 (2024)."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3463245"},{"key":"e_1_3_2_1_3_1","first-page":"2425","article-title":"VQA","author":"Antol Stanislaw","year":"2015","unstructured":"Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In ICCV. IEEE Computer Society, 2425-2433.","journal-title":"Visual Question Answering. In ICCV. 
IEEE Computer Society"},{"key":"e_1_3_2_1_4_1","volume-title":"Daniele Malitesta, Claudio Pomo, and Tommaso Di Noia.","author":"Attimonelli Matteo","year":"2024","unstructured":"Matteo Attimonelli, Danilo Danese, Angela Di Fazio, Daniele Malitesta, Claudio Pomo, and Tommaso Di Noia. 2024a. Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation. arXiv preprint arXiv:2409.15857 (2024)."},{"key":"e_1_3_2_1_5_1","first-page":"1075","article-title":"Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation. In WWW (Companion Volume)","author":"Attimonelli Matteo","year":"2024","unstructured":"Matteo Attimonelli, Danilo Danese, Daniele Malitesta, Claudio Pomo, Giuseppe Gassi, and Tommaso Di Noia. 2024b. Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation. In WWW (Companion Volume). ACM, 1075-1078.","journal-title":"ACM"},{"key":"e_1_3_2_1_6_1","unstructured":"Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NeurIPS."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2021.3059508"},{"key":"e_1_3_2_1_8_1","first-page":"765","article-title":"Personalized Fashion Recommendation with Visual Explanations based on Multimodal Attention Network","author":"Chen Xu","year":"2019","unstructured":"Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. 
Personalized Fashion Recommendation with Visual Explanations based on Multimodal Attention Network: Towards Visually Explainable Recommendation. In SIGIR. ACM, 765-774.","journal-title":"Towards Visually Explainable Recommendation. In SIGIR. ACM"},{"key":"e_1_3_2_1_9_1","volume-title":"Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi.","author":"Dai Wenliang","year":"2023","unstructured":"Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In NeurIPS."},{"key":"e_1_3_2_1_10_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1). Association for Computational Linguistics, 4171-4186."},{"key":"e_1_3_2_1_11_1","first-page":"1107","article-title":"A Survey on In-context Learning","author":"Dong Qingxiu","year":"2024","unstructured":"Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, and Zhifang Sui. 2024. A Survey on In-context Learning. In EMNLP. Association for Computational Linguistics, 1107-1128.","journal-title":"EMNLP. Association for Computational Linguistics"},{"key":"e_1_3_2_1_12_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR. 
OpenReview.net."},{"key":"e_1_3_2_1_13_1","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan Anirudh Goyal Anthony Hartshorn Aobo Yang Archi Mitra Archie Sravankumar Artem Korenev Arthur Hinsvark Arun Rao Aston Zhang Aur\u00e9lien Rodriguez Austen Gregerson Ava Spataru Baptiste Rozi\u00e8re Bethany Biron Binh Tang Bobbie Chern Charlotte Caucheteux Chaya Nayak Chloe Bi Chris Marra Chris McConnell Christian Keller Christophe Touret Chunyang Wu Corinne Wong Cristian Canton Ferrer Cyrus Nikolaidis Damien Allonsius Daniel Song Danielle Pintz Danny Livshits David Esiobu Dhruv Choudhary Dhruv Mahajan Diego Garcia-Olano Diego Perino Dieuwke Hupkes Egor Lakomkin Ehab AlBadawy Elina Lobanova Emily Dinan Eric Michael Smith Filip Radenovic Frank Zhang Gabriel Synnaeve Gabrielle Lee Georgia Lewis Anderson Graeme Nail Gr\u00e9goire Mialon Guan Pang Guillem Cucurell Hailey Nguyen Hannah Korevaar Hu Xu Hugo Touvron Iliyan Zarov Imanol Arrieta Ibarra Isabel M. Kloumann Ishan Misra Ivan Evtimov Jade Copet Jaewon Lee Jan Geffert Jana Vranes Jason Park Jay Mahadeokar Jeet Shah Jelmer van der Linde Jennifer Billock Jenny Hong Jenya Lee Jeremy Fu Jianfeng Chi Jianyu Huang Jiawen Liu Jie Wang Jiecao Yu Joanna Bitton Joe Spisak Jongsoo Park Joseph Rocca Joshua Johnstun Joshua Saxe Junteng Jia Kalyan Vasuden Alwala Kartikeya Upasani Kate Plawiak Ke Li Kenneth Heafield Kevin Stone and et al. 2024. The Llama 3 Herd of Models. CoRR Vol. abs\/2407.21783 (2024)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-024-02309-x"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISBIM.2008.190"},{"key":"e_1_3_2_1_16_1","first-page":"770","article-title":"Deep Residual Learning for Image Recognition","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. 
Deep Residual Learning for Image Recognition. In CVPR. IEEE Computer Society, 770-778.","journal-title":"CVPR. IEEE Computer Society"},{"key":"e_1_3_2_1_17_1","volume-title":"Efficient Multimodal Learning from Data-centric Perspective. CoRR","author":"He Muyang","year":"2024","unstructured":"Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. 2024. Efficient Multimodal Learning from Data-centric Perspective. CoRR, Vol. abs\/2402.11530 (2024)."},{"key":"e_1_3_2_1_18_1","volume-title":"McAuley","author":"He Ruining","year":"2016","unstructured":"Ruining He and Julian J. McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI. AAAI Press, 144-150."},{"key":"e_1_3_2_1_19_1","first-page":"639","article-title":"LightGCN","author":"He Xiangnan","year":"2020","unstructured":"Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yong-Dong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In SIGIR. ACM, 639-648.","journal-title":"Simplifying and Powering Graph Convolution Network for Recommendation. In SIGIR. ACM"},{"key":"e_1_3_2_1_20_1","volume-title":"LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model. CoRR","author":"Hinck Musashi","year":"2024","unstructured":"Musashi Hinck, Matthew L. Olson, David Cobbley, Shao-Yen Tseng, and Vasudev Lal. 2024. LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model. CoRR, Vol. abs\/2404.01331 (2024)."},{"key":"e_1_3_2_1_21_1","volume-title":"McAuley","author":"Hou Yupeng","year":"2024","unstructured":"Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian J. McAuley. 2024. Bridging Language and Items for Retrieval and Recommendation. CoRR, Vol. 
abs\/2403.03952 (2024)."},{"key":"e_1_3_2_1_22_1","first-page":"263","article-title":"Collaborative Filtering for Implicit Feedback Datasets","author":"Hu Yifan","year":"2008","unstructured":"Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM. IEEE Computer Society, 263-272.","journal-title":"ICDM. IEEE Computer Society"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2009.263"},{"key":"e_1_3_2_1_24_1","volume-title":"What matters when building vision-language models? CoRR","author":"Lauren\u00e7on Hugo","year":"2024","unstructured":"Hugo Lauren\u00e7on, L\u00e9o Tronchon, Matthieu Cord, and Victor Sanh. 2024. What matters when building vision-language models? CoRR, Vol. abs\/2405.02246 (2024)."},{"key":"e_1_3_2_1_25_1","unstructured":"Chankyu Lee Rajarshi Roy Mengyao Xu Jonathan Raiman Mohammad Shoeybi Bryan Catanzaro and Wei Ping. 2025. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. In ICLR. OpenReview.net."},{"key":"e_1_3_2_1_26_1","unstructured":"Chaofan Li Minghao Qin Shitao Xiao Jianlyu Chen Kun Luo Defu Lian Yingxia Shao and Zheng Liu. 2025. Making Text Embedders Few-Shot Learners. In ICLR. OpenReview.net."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3217449"},{"key":"e_1_3_2_1_28_1","unstructured":"Haotian Liu Chunyuan Li Qingyang Wu and Yong Jae Lee. 2023b. Visual Instruction Tuning. In NeurIPS."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3617827"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3544106"},{"key":"e_1_3_2_1_31_1","first-page":"687","article-title":"EliMRec: Eliminating Single-modal Bias in Multimedia Recommendation","author":"Liu Xiaohao","year":"2022","unstructured":"Xiaohao Liu, Zhulin Tao, Jiahong Shao, Lifang Yang, and Xianglin Huang. 2022. EliMRec: Eliminating Single-modal Bias in Multimedia Recommendation. In ACM Multimedia. 
ACM, 687-695.","journal-title":"ACM Multimedia. ACM"},{"key":"e_1_3_2_1_32_1","first-page":"2853","article-title":"Pre-training Graph Transformer with Multimodal Side Information for Recommendation. In ACM Multimedia","author":"Liu Yong","year":"2021","unstructured":"Yong Liu, Susen Yang, Chenyi Lei, Guoxin Wang, Haihong Tang, Juyong Zhang, Aixin Sun, and Chunyan Miao. 2021. Pre-training Graph Transformer with Multimodal Side Information for Recommendation. In ACM Multimedia. ACM, 2853-2861.","journal-title":"ACM"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.3115\/1118108.1118117"},{"key":"e_1_3_2_1_34_1","article-title":"Formalizing Multimedia Recommendation through Multimodal Deep","volume":"3","author":"Malitesta Daniele","year":"2025","unstructured":"Daniele Malitesta, Giandomenico Cornacchia, Claudio Pomo, Felice Antonio Merra, Tommaso Di Noia, and Eugenio Di Sciascio. 2025. Formalizing Multimedia Recommendation through Multimodal Deep Learning. Trans. Recomm. Syst., Vol. 3, 3 (2025), 37:1-37:33.","journal-title":"Learning. Trans. Recomm. Syst."},{"key":"e_1_3_2_1_35_1","first-page":"32","article-title":"A Deep Multimodal Approach for Cold-start Music Recommendation. In DLRS@RecSys","author":"Oramas Sergio","year":"2017","unstructured":"Sergio Oramas, Oriol Nieto, Mohamed Sordo, and Xavier Serra. 2017. A Deep Multimodal Approach for Cold-start Music Recommendation. In DLRS@RecSys. ACM, 32-37.","journal-title":"ACM"},{"key":"e_1_3_2_1_36_1","first-page":"3421","article-title":"Multimodal Meta-Learning for Cold-Start Sequential Recommendation","author":"Pan Xingyu","year":"2022","unstructured":"Xingyu Pan, Yushuo Chen, Changxin Tian, Zihan Lin, Jinpeng Wang, He Hu, and Wayne Xin Zhao. 2022. Multimodal Meta-Learning for Cold-Start Sequential Recommendation. In CIKM. ACM, 3421-3430.","journal-title":"CIKM. 
ACM"},{"key":"e_1_3_2_1_37_1","volume-title":"Noah Constant, Colin Raffel, and Chris Callison-Burch.","author":"Patel Ajay","year":"2023","unstructured":"Ajay Patel, Bryan Li, Mohammad Sadegh Rasooli, Noah Constant, Colin Raffel, and Chris Callison-Burch. 2023. Bidirectional Language Models Are Also Few-shot Learners. In ICLR. OpenReview.net."},{"key":"e_1_3_2_1_38_1","volume-title":"ICML (Proceedings of Machine Learning Research","volume":"8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML (Proceedings of Machine Learning Research, Vol. 139). PMLR, 8748-8763."},{"key":"e_1_3_2_1_39_1","first-page":"127","article-title":"Getting to know you: learning new user preferences in recommender systems","author":"Rashid Al Mamunur","year":"2002","unstructured":"Al Mamunur Rashid, Istv\u00e1n Albert, Dan Cosley, Shyong K. Lam, Sean M. McNee, Joseph A. Konstan, and John Riedl. 2002. Getting to know you: learning new user preferences in recommender systems. In IUI. ACM, 127-134.","journal-title":"IUI. ACM"},{"key":"e_1_3_2_1_40_1","first-page":"3980","article-title":"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP\/IJCNLP (1)","author":"Reimers Nils","year":"2019","unstructured":"Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP\/IJCNLP (1). 
Association for Computational Linguistics, 3980-3990.","journal-title":"Association for Computational Linguistics"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/371920.372071"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/564376.564421"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3187556"},{"key":"e_1_3_2_1_44_1","volume-title":"LLaMA: Open and Efficient Foundation Language Models. CoRR","author":"Touvron Hugo","year":"2023","unstructured":"Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth\u00e9e Lacroix, Baptiste Rozi\u00e8re, Naman Goyal, Eric Hambro, Faisal Azhar, Aur\u00e9lien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR, Vol. abs\/2302.13971 (2023)."},{"key":"e_1_3_2_1_45_1","first-page":"1798","article-title":"Aligning Dual Disentangled User Representations from Ratings and Textual Content","author":"Tran Nhu-Thuat","year":"2022","unstructured":"Nhu-Thuat Tran and Hady W. Lauw. 2022. Aligning Dual Disentangled User Representations from Ratings and Textual Content. In KDD. ACM, 1798-1806.","journal-title":"KDD. ACM"},{"key":"e_1_3_2_1_46_1","first-page":"3156","article-title":"Show and tell: A neural image caption generator","author":"Vinyals Oriol","year":"2015","unstructured":"Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR. IEEE Computer Society, 3156-3164.","journal-title":"CVPR. IEEE Computer Society"},{"key":"e_1_3_2_1_47_1","article-title":"GIT: A Generative Image-to-text Transformer for Vision and","volume":"2022","author":"Wang Jianfeng","year":"2022","unstructured":"Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. 2022. GIT: A Generative Image-to-text Transformer for Vision and Language. Trans. Mach. 
Learn. Res., Vol. 2022 (2022).","journal-title":"Language. Trans. Mach. Learn. Res."},{"key":"e_1_3_2_1_48_1","volume-title":"ACL (1)","author":"Wang Liang","year":"2024","unstructured":"Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024b. Improving Text Embeddings with Large Language Models. In ACL (1). Association for Computational Linguistics, 11897-11916."},{"key":"e_1_3_2_1_49_1","volume-title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. CoRR","author":"Wang Peng","year":"2024","unstructured":"Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024a. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. CoRR, Vol. abs\/2409.12191 (2024)."},{"key":"e_1_3_2_1_50_1","first-page":"6369","article-title":"Modal-aware Bias Constrained Contrastive Learning for Multimodal Recommendation. In ACM Multimedia","author":"Yang Wei","year":"2023","unstructured":"Wei Yang, Zhengru Fang, Tianle Zhang, Shiguang Wu, and Chi Lu. 2023. Modal-aware Bias Constrained Contrastive Learning for Multimodal Recommendation. In ACM Multimedia. ACM, 6369-6378.","journal-title":"ACM"},{"key":"e_1_3_2_1_51_1","volume-title":"MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. CoRR","author":"Yue Xiang","year":"2024","unstructured":"Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. 2024. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. CoRR, Vol. 
abs\/2409.02813 (2024)."},{"key":"e_1_3_2_1_52_1","first-page":"353","article-title":"Collaborative Knowledge Base Embedding for Recommender Systems","author":"Zhang Fuzheng","year":"2016","unstructured":"Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative Knowledge Base Embedding for Recommender Systems. In KDD. ACM, 353-362.","journal-title":"KDD. ACM"},{"key":"e_1_3_2_1_53_1","first-page":"3872","article-title":"Mining Latent Structures for Multimedia Recommendation. In ACM Multimedia","author":"Zhang Jinghao","year":"2021","unstructured":"Jinghao Zhang, Yanqiao Zhu, Qiang Liu, Shu Wu, Shuhui Wang, and Liang Wang. 2021. Mining Latent Structures for Multimedia Recommendation. In ACM Multimedia. ACM, 3872-3880.","journal-title":"ACM"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2022.3221949"},{"key":"e_1_3_2_1_55_1","first-page":"935","article-title":"A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation","author":"Zhou Xin","year":"2023","unstructured":"Xin Zhou and Zhiqi Shen. 2023. A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation. In ACM Multimedia. ACM, 935-943.","journal-title":"ACM Multimedia. ACM"},{"key":"e_1_3_2_1_56_1","first-page":"845","article-title":"Bootstrap Latent Representations for Multi-modal Recommendation","author":"Zhou Xin","year":"2023","unstructured":"Xin Zhou, Hongyu Zhou, Yong Liu, Zhiwei Zeng, Chunyan Miao, Pengwei Wang, Yuan You, and Feijun Jiang. 2023. Bootstrap Latent Representations for Multi-modal Recommendation. In WWW. ACM, 845-854.","journal-title":"WWW. 
ACM"}],"event":{"name":"CIKM '25: The 34th ACM International Conference on Information and Knowledge Management","location":"Seoul Republic of Korea","acronym":"CIKM '25","sponsor":["SIGIR ACM Special Interest Group on Information Retrieval","SIGWEB ACM Special Interest Group on Hypertext, Hypermedia, and Web"]},"container-title":["Proceedings of the 34th ACM International Conference on Information and Knowledge Management"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746252.3761398","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,12]],"date-time":"2025-12-12T01:52:17Z","timestamp":1765504337000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746252.3761398"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,10]]},"references-count":56,"alternative-id":["10.1145\/3746252.3761398","10.1145\/3746252"],"URL":"https:\/\/doi.org\/10.1145\/3746252.3761398","relation":{},"subject":[],"published":{"date-parts":[[2025,11,10]]},"assertion":[{"value":"2025-11-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}