{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,20]],"date-time":"2026-01-20T07:47:15Z","timestamp":1768895235216,"version":"3.49.0"},"reference-count":49,"publisher":"Association for Computing Machinery (ACM)","issue":"9","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61866013 and 62401204"],"award-info":[{"award-number":["61866013 and 62401204"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University","award":["VRLAB2025C06"],"award-info":[{"award-number":["VRLAB2025C06"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>Multimodal recommendation systems improve the accuracy of recommendations by integrating information from different modalities to obtain potential representations of users and items. However, existing multimodal recommendation methods often use single user embedding to model users\u2019 interests in different modalities, neglecting multimodal information. Furthermore, the semantics expressed by the same items in different modalities may be inconsistent, leading to suboptimal recommendation performance. To alleviate the impact of these issues, we propose a new multimodal recommendation framework called Learnable Prompts and ID-guided Contrastive Learning (LPIC). Specifically, we introduce a continuously learnable prompt embedding method, incorporating multimodal features of items to model users\u2019 interests in specific modalities. 
Then, we propose an ID-guided contrastive learning component to enhance historical interaction features in textual, visual, and fused modalities, while aligning text, image, and fused modality to enhance semantic consistency between modalities. Finally, we conduct extensive experiments on three publicly available Amazon datasets to demonstrate the effectiveness of the LPIC framework.<\/jats:p>","DOI":"10.1145\/3735561","type":"journal-article","created":{"date-parts":[[2025,5,23]],"date-time":"2025-05-23T11:57:31Z","timestamp":1748001451000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["LPIC: Learnable Prompts and ID-guided Contrastive Learning for Multimodal Recommendation"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-9101-3699","authenticated-orcid":false,"given":"Xin","family":"Liu","sequence":"first","affiliation":[{"name":"Hunan Normal University, Changsha, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3115-0672","authenticated-orcid":false,"given":"Qiya","family":"Song","sequence":"additional","affiliation":[{"name":"Hunan Normal University, Changsha, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3172-3490","authenticated-orcid":false,"given":"Lin","family":"Xiao","sequence":"additional","affiliation":[{"name":"Hunan Normal University, Changsha, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-5969-223X","authenticated-orcid":false,"given":"Chun","family":"Wang","sequence":"additional","affiliation":[{"name":"Hunan Normal University, Changsha, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7511-9418","authenticated-orcid":false,"given":"Xieping","family":"Gao","sequence":"additional","affiliation":[{"name":"Hunan Normal University, Changsha, China"}]}],"member":"320","published-online":{"date-parts":[[2025,9,11]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Tom B. Brown. 2020. 
Language models are few-shot learners. arXiv:2005.14165. Retrieved from https:\/\/arxiv.org\/abs\/2005.14165"},{"key":"e_1_3_2_3_2","first-page":"8283","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Chen Junyang","year":"2024","unstructured":"Junyang Chen, Guoxuan Zou, Pan Zhou, Wu Yirui, Zhenghan Chen, Houcheng Su, Huan Wang, and Zhiguo Gong. 2024. Sparse enhanced network: An adversarial generation method for robust augmentation in sequential recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, 8283\u20138291."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3539597.3570484"},{"key":"e_1_3_2_5_2","first-page":"1","volume-title":"ACM Transactions on Multimedia Computing, Communications and Applications","author":"De Divitiis Lavinia","year":"2023","unstructured":"Lavinia De Divitiis, Federico Becattini, Claudio Baecchi, and Alberto Del Bimbo. 2023. Disentangling features for fashion recommendation. ACM Transactions on Multimedia Computing, Communications and Applications 19, 1s (2023), 1\u201321."},{"key":"e_1_3_2_6_2","unstructured":"Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_2_7_2","first-page":"1","volume-title":"ACM Transactions on Multimedia Computing, Communications and Applications","author":"Djenouri Youcef","year":"2023","unstructured":"Youcef Djenouri, Asma Belhadi, Gautam Srivastava, and Jerry Chun-Wei Lin. 2023. An efficient and accurate GPU-based deep learning model for multimedia recommendation. 
ACM Transactions on Multimedia Computing, Communications and Applications 20, 2 (2023), 1\u201318."},{"key":"e_1_3_2_8_2","doi-asserted-by":"crossref","first-page":"101989","DOI":"10.1016\/j.inffus.2023.101989","article-title":"Prompt-based and weak-modality enhanced multimodal recommendation","volume":"101","author":"Dong Xue","year":"2024","unstructured":"Xue Dong, Xuemeng Song, Minghui Tian, and Linmei Hu. 2024. Prompt-based and weak-modality enhanced multimodal recommendation. Information Fusion 101 (2024), 101989.","journal-title":"Information Fusion"},{"key":"e_1_3_2_9_2","first-page":"249","volume-title":"Proceedings of the 13th International Conference on Artificial Intelligence and Statistics","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 249\u2013256."},{"key":"e_1_3_2_10_2","unstructured":"Yuxian Gu Xu Han Zhiyuan Liu and Minlie Huang. 2021. PPT: Pre-trained prompt tuning for few-shot learning. arXiv:2109.04332. Retrieved from https:\/\/arxiv.org\/abs\/2109.04332"},{"key":"e_1_3_2_11_2","first-page":"8454","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Guo Zhiqiang","year":"2024","unstructured":"Zhiqiang Guo, Jianjun Li, Guohui Li, Chaoyang Wang, Si Shi, and Bin Ruan. 2024. LGMRec: Local and global graph learning for multimodal recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, 8454\u20138462."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_13_2","first-page":"144","volume-title":"Proceedings of the 30th AAAI Conference on Artificial Intelligence","author":"He R.","year":"2016","unstructured":"R. He and J. V. B. P. R. McAuley. 2016. 
Visual Bayesian personalized ranking from implicit feedback. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 144\u2013150."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401063"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052569"},{"key":"e_1_3_2_16_2","doi-asserted-by":"crossref","first-page":"110825","DOI":"10.1016\/j.knosys.2023.110825","article-title":"Enhanced contrastive learning with multi-aspect information for recommender systems","volume":"277","author":"Hu Linfeng","year":"2023","unstructured":"Linfeng Hu, Wei Zhou, Fengji Luo, Shuang Ni, and Junhao Wen. 2023. Enhanced contrastive learning with multi-aspect information for recommender systems. Knowledge-Based Systems 277 (2023), 110825.","journal-title":"Knowledge-Based Systems"},{"key":"e_1_3_2_17_2","unstructured":"Diederik P. Kingma. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3573010"},{"key":"e_1_3_2_19_2","doi-asserted-by":"crossref","unstructured":"Brian Lester Rami Al-Rfou and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv:2104.08691. Retrieved from https:\/\/arxiv.org\/abs\/2104.08691","DOI":"10.18653\/v1\/2021.emnlp-main.243"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3687473"},{"key":"e_1_3_2_21_2","doi-asserted-by":"crossref","first-page":"112042","DOI":"10.1016\/j.knosys.2024.112042","article-title":"An attention mechanism and residual network based knowledge graph-enhanced recommender system","volume":"299","author":"Li Weisheng","year":"2024","unstructured":"Weisheng Li, Hao Zhong, Junming Zhou, Chao Chang, Ronghua Lin, and Yong Tang. 2024. An attention mechanism and residual network based knowledge graph-enhanced recommender system. 
Knowledge-Based Systems 299 (2024), 112042.","journal-title":"Knowledge-Based Systems"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3485447.3512104"},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","first-page":"9343","DOI":"10.1109\/TMM.2023.3251108","article-title":"Multimodal graph contrastive learning for multimedia-based recommendation","volume":"25","author":"Liu Kang","year":"2023","unstructured":"Kang Liu, Feng Xue, Dan Guo, Peijie Sun, Shengsheng Qian, and Richang Hong. 2023. Multimodal graph contrastive learning for multimedia-based recommendation. IEEE Transactions on Multimedia 25 (2023), 9343\u20139355.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_24_2","first-page":"72","article-title":"Joint multi-grained popularity-aware graph convolution collaborative filtering for recommendation","author":"Liu Kang","year":"2022","unstructured":"Kang Liu, Feng Xue, Xiangnan He, Dan Guo, and Richang Hong. 2022. Joint multi-grained popularity-aware graph convolution collaborative filtering for recommendation. IEEE Transactions on Computational Social Systems 10, 1 (2022), 72\u201383.","journal-title":"IEEE Transactions on Computational Social Systems"},{"key":"e_1_3_2_25_2","first-page":"1","article-title":"Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing","author":"Liu Pengfei","year":"2023","unstructured":"Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. 
ACM Computing Surveys 55, 9 (2023), 1\u201335.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1018"},{"key":"e_1_3_2_27_2","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1\u201367.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2010.127"},{"key":"e_1_3_2_29_2","unstructured":"Steffen Rendle Christoph Freudenthaler Zeno Gantner and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback. arXiv:1205.2618. Retrieved from https:\/\/arxiv.org\/abs\/1205.2618"},{"key":"e_1_3_2_30_2","doi-asserted-by":"crossref","unstructured":"Taylor Shin Yasaman Razeghi Robert L. Logan IV Eric Wallace and Sameer Singh. 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv:2010.15980. Retrieved from https:\/\/arxiv.org\/abs\/2010.15980","DOI":"10.18653\/v1\/2020.emnlp-main.346"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2025.3567297"},{"issue":"12","key":"e_1_3_2_32_2","doi-asserted-by":"crossref","first-page":"10028","DOI":"10.1109\/TNNLS.2022.3163771","article-title":"Multimodal sparse transformer network for audio-visual speech recognition","volume":"34","author":"Song Qiya","year":"2023","unstructured":"Qiya Song, Bin Sun, and Shutao Li. 2023. Multimodal sparse transformer network for audio-visual speech recognition. 
IEEE Transactions on Neural Networks and Learning Systems 34, 12 (2023), 10028\u201310038.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_3_2_33_2","first-page":"5107","article-title":"Self-supervised learning for multimedia recommendation","author":"Tao Zhulin","year":"2022","unstructured":"Zhulin Tao, Xiaohao Liu, Yewei Xia, Xiang Wang, Lifang Yang, Xianglin Huang, and Tat-Seng Chua. 2022. Self-supervised learning for multimedia recommendation. IEEE Transactions on Multimedia 25 (2022), 5107\u20135116.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_34_2","first-page":"1074","article-title":"DualGNN: Dual graph neural network for multimedia recommendation","author":"Wang Qifan","year":"2021","unstructured":"Qifan Wang, Yinwei Wei, Jianhua Yin, Jianlong Wu, Xuemeng Song, and Liqiang Nie. 2021. DualGNN: Dual graph neural network for multimedia recommendation. IEEE Transactions on Multimedia 25 (2021), 1074\u20131084.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3077136.3080771"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3488560.3498527"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3543507.3583206"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413556"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3351034"},{"key":"e_1_3_2_40_2","first-page":"5812","article-title":"Graph contrastive learning with augmentations","author":"You Yuning","year":"2020","unstructured":"Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. 
In Proceedings of the Advances in Neural Information Processing Systems, 5812\u20135823.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_41_2","first-page":"913","article-title":"XSimGCL: Towards extremely simple graph contrastive learning for recommendation","author":"Yu Junliang","year":"2023","unstructured":"Junliang Yu, Xin Xia, Tong Chen, Lizhen Cui, Nguyen Quoc Viet Hung, and Hongzhi Yin. 2023. XSimGCL: Towards extremely simple graph contrastive learning for recommendation. IEEE Transactions on Knowledge and Data Engineering 36, 2 (2023), 913\u2013926.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3442381.3449844"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477495.3531937"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3613915"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475259"},{"key":"e_1_3_2_46_2","first-page":"9154","article-title":"Latent structure mining with contrastive modality fusion for multimedia recommendation","author":"Zhang Jinghao","year":"2022","unstructured":"Jinghao Zhang, Yanqiao Zhu, Qiang Liu, Mengqi Zhang, Shu Wu, and Liang Wang. 2022. Latent structure mining with contrastive modality fusion for multimedia recommendation. IEEE Transactions on Knowledge and Data Engineering 35, 9\u00a0(2022), 9154\u20139167.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_3_2_47_2","unstructured":"Hongyu Zhou Xin Zhou Zhiwei Zeng Lingzi Zhang and Zhiqi Shen. 2023. A comprehensive survey on multimodal recommender systems: Taxonomy evaluation and future directions. arXiv:2302.04473. 
Retrieved from https:\/\/arxiv.org\/abs\/2302.04473"},{"key":"e_1_3_2_48_2","first-page":"1247","volume-title":"Proceedings of the 2023 IEEE 39th International Conference on Data Engineering","author":"Zhou Xin","year":"2023","unstructured":"Xin Zhou, Donghui Lin, Yong Liu, and Chunyan Miao. 2023. Layer-refined graph convolutional networks for recommendation. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering. IEEE, 1247\u20131259."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611943"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3543507.3583251"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3735561","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T19:45:38Z","timestamp":1757619938000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3735561"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,11]]},"references-count":49,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3735561"],"URL":"https:\/\/doi.org\/10.1145\/3735561","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,11]]},"assertion":[{"value":"2025-01-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-11","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication 
History"}}]}}