{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,23]],"date-time":"2025-12-23T16:29:18Z","timestamp":1766507358800,"version":"3.48.0"},"reference-count":92,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"name":"BUPT Excellent Ph.D. Students Foundation","award":["CX20241015"],"award-info":[{"award-number":["CX20241015"]}]},{"DOI":"10.13039\/501100002766","name":"Beijing University of Posts and Telecommunications","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100002766","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>Multimodal data can more comprehensively portray changes in user interests, and thus, multimodal sequential recommendation (MSRS) has gained widespread attention in recent years. However, the MSRS faces two key challenges: (1)\u2009how to effectively model long-range dependencies in user interaction sequence; and (2)\u2009how to efficiently fuse multimodal features. To address these challenges, this article proposes a novel multimodal sequential recommendation architecture based on pure convolutional neural network (CNN), named PCMSRec. PCMSRec contains two key innovations: first, by using the global receptive field of large kernel convolution, it models the long-range dependencies of multimodal user interaction sequence, breaking through the limitation that existing CNN-based methods can only capture local short-distance dependencies; second, by taking advantage of the high flexibility of the CNN architecture, it models the relationships among multimodal features of items through a carefully designed convolutional layer architecture and fusion strategy. Specifically, PCMSRec consists of two blocks: sequence-feature block and modal block. The sequence-feature block models long-range dependencies in user interaction sequence through large kernel convolutional layer and extracts item features by incorporating a bottleneck architecture. The modal block models the complex relationships between multimodal features using multiple convolutional layer. Experimental results on five public datasets show that PCMSRec outperforms existing methods.<\/jats:p>","DOI":"10.1145\/3777377","type":"journal-article","created":{"date-parts":[[2025,11,21]],"date-time":"2025-11-21T14:46:43Z","timestamp":1763736403000},"page":"1-35","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Rethinking Convolutional Neural Network in Multimodal Sequential Recommendation"],"prefix":"10.1145","volume":"44","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8278-3699","authenticated-orcid":false,"given":"Zhicheng","family":"Zhou","sequence":"first","affiliation":[{"name":"School of Computer Science (National Pilot Software Engineering School) and Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2180-5986","authenticated-orcid":false,"given":"Xiangwu","family":"Meng","sequence":"additional","affiliation":[{"name":"School of Computer Science (National Pilot Software Engineering School) and Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1805-8342","authenticated-orcid":false,"given":"Yujie","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science (National Pilot Software Engineering School) and Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,12,23]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3658596"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.195"},{"key":"e_1_3_1_4_2","unstructured":"Junyoung Chung Caglar Gulcehre KyungHyun Cho and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555. Retrieved from https:\/\/arxiv.org\/abs\/1412.3555"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3637528.3671511"},{"key":"e_1_3_1_6_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_1_7_2","first-page":"11963","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ding Xiaohan","year":"2022","unstructured":"Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. 2022. Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 11963\u201311975."},{"key":"e_1_3_1_8_2","doi-asserted-by":"crossref","unstructured":"Xiaohan Ding Yiyuan Zhang Yixiao Ge Sijie Zhao Lin Song Xiangyu Yue and Ying Shan. 2023. UniRepLKNet: A universal perception large-kernel ConvNet for audio video point cloud time-series and image recognition. arXiv:2311.15599. Retrieved from https:\/\/arxiv.org\/abs\/2311.15599","DOI":"10.1109\/CVPR52733.2024.00527"},{"key":"e_1_3_1_9_2","unstructured":"Xue Dong Xuemeng Song Na Zheng Sicheng Zhao and Guiguang Ding. 2025. Modality reliability guided multimodal recommendation. arXiv:2504.16524. Retrieved from https:\/\/arxiv.org\/abs\/2504.16524"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3539618.3591689"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3426723"},{"key":"e_1_3_1_12_2","unstructured":"Junchen Fu Xuri Ge Xin Xin Alexandros Karatzoglou Ioannis Arapakis Kaiwen Zheng Yongxin Ni and Joemon M. Jose. 2024. Efficient and effective adaptation of multimodal foundation models in sequential recommendation. arXiv:2411.02992. Retrieved from https:\/\/arxiv.org\/abs\/2411.02992"},{"key":"e_1_3_1_13_2","doi-asserted-by":"crossref","unstructured":"Jingtong Gao Xiangyu Zhao Muyang Li Minghao Zhao Runze Wu Ruocheng Guo Yiding Liu and Dawei Yin. 2023. SMLP4Rec: An efficient all-MLP architecture for sequential recommendations. ACM Transactions on Information Systems 42 3 (2023) 1\u201323.","DOI":"10.1145\/3637871"},{"key":"e_1_3_1_14_2","unstructured":"Huifeng Guo Ruiming Tang Yunming Ye Zhenguo Li and Xiuqiang He. 2017. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv:1703.04247. Retrieved from https:\/\/arxiv.org\/abs\/1703.04247"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1007\/s41095-023-0364-2"},{"key":"e_1_3_1_16_2","first-page":"1","volume-title":"Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP \u201925)","author":"Guo Xu","year":"2025","unstructured":"Xu Guo, Tong Zhang, Yufei Xue, Chenxu Wang, Fuyun Wang, and Zhen Cui. 2025. M 3 rec: Selective state space models with mixture-of-modality experts for multi-modal sequential recommendation. In Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP \u201925). IEEE, 1\u20135."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i8.28688"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2023.3250463"},{"key":"e_1_3_1_19_2","unstructured":"Bal\u00e1zs Hidasi Alexandros Karatzoglou Linas Baltrunas and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv:1511.06939. Retrieved from https:\/\/arxiv.org\/abs\/1511.06939"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/2959100.2959167"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539381"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3583780.3614775"},{"key":"e_1_3_1_23_2","first-page":"14023","volume-title":"International Conference on Machine Learning","author":"Huang Tianjin","year":"2023","unstructured":"Tianjin Huang, Lu Yin, Zhenyu Zhang, Li Shen, Meng Fang, Mykola Pechenizkiy, Zhangyang Wang, and Shiwei Liu. 2023. Are large kernels better teachers than transformers for convnets? In International Conference on Machine Learning. PMLR, 14023\u201314038."},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612091"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3583780.3614773"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3681498"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3670995"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2018.00035"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-10-0557-2_112"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3616855.3635817"},{"key":"e_1_3_1_31_2","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2009.263"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2023.121352"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3573010"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3696410.3714606"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3543507.3583440"},{"key":"e_1_3_1_37_2","unstructured":"Muyang Li Xiangyu Zhao Chuan Lyu Minghao Zhao Runze Wu and Ruocheng Guo. 2022. MLP4Rec: A pure MLP architecture for sequential recommendations. arXiv:2204.11510. Retrieved from https:\/\/arxiv.org\/abs\/2204.11510"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3627673.3679647"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3543507.3583378"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.2753\/MIS0742-1222230303"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612362"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i5.16549"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611886"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/3544106"},{"key":"e_1_3_1_45_2","unstructured":"Qidong Liu Jiaxi Hu Yutian Xiao Jingtong Gao and Xiangyu Zhao. 2023. Multimodal recommender systems: A survey. arXiv:2302.03883. Retrieved from https:\/\/arxiv.org\/abs\/2302.03883"},{"issue":"2","key":"e_1_3_1_46_2","first-page":"1","article-title":"Multimodal recommender systems: A survey","volume":"57","author":"Liu Qidong","year":"2024","unstructured":"Qidong Liu, Jiaxi Hu, Yutian Xiao, Xiangyu Zhao, Jingtong Gao, Wanyu Wang, Qing Li, and Jiliang Tang. 2024. Multimodal recommender systems: A survey. ACM Computing Surveys 57, 2 (2024), 1\u201317.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3706631"},{"key":"e_1_3_1_48_2","unstructured":"Shiwei Liu Tianlong Chen Xiaohan Chen Xuxi Chen Qiao Xiao Boqian Wu Tommi K\u00e4rkk\u00e4inen Mykola Pechenizkiy Decebal Mocanu and Zhangyang Wang. 2022. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv:2207.03620. Retrieved from https:\/\/arxiv.org\/abs\/2207.03620"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3482406"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3627673.3679626"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3638562"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.dss.2015.03.008"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3589335.3651956"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1018"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3511808.3557101"},{"key":"e_1_3_1_56_2","article-title":"Pytorch: An imperative style, high-performance deep learning library","volume":"32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 32.","journal-title":"Advances in Neural Information Processing Systems,"},{"key":"e_1_3_1_57_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2010.127"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1145\/3510413"},{"key":"e_1_3_1_60_2","article-title":"Convolutional LSTM network: A machine learning approach for precipitation nowcasting","volume":"28","author":"Shi Xingjian","year":"2015","unstructured":"Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-Chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems 28 (2015), 1.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_61_2","unstructured":"Farhad Mortezapour Shiri Thinagaran Perumal Norwati Mustapha and Raihani Mohamed. 2023. A comprehensive overview and comparative analysis on deep learning models: CNN RNN LSTM GRU. arXiv:2305.17473. Retrieved from https:\/\/arxiv.org\/abs\/2305.17473"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3357895"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3159652.3159656"},{"key":"e_1_3_1_64_2","first-page":"24261","article-title":"Mlp-mixer: An all-mlp architecture for vision","volume":"34","author":"Tolstikhin Ilya O.","year":"2021","unstructured":"Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. 2021. Mlp-mixer: An all-mlp architecture for vision. In Advances in Neural Information Processing Systems 34 (2021), 24261\u201324272.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.2478\/jaiscr-2024-0010"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2025.3551402"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-025-94256-y"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/TGRS.2023.3335484"},{"key":"e_1_3_1_69_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611967"},{"key":"e_1_3_1_70_2","doi-asserted-by":"crossref","unstructured":"Shoujin Wang Liang Hu Yan Wang Longbing Cao Quan Z. Sheng and Mehmet Orgun. 2019. Sequential recommender systems: Challenges progress and prospects. arXiv:2001.04830. Retrieved from https:\/\/arxiv.org\/abs\/2001.04830","DOI":"10.24963\/ijcai.2019\/883"},{"key":"e_1_3_1_71_2","doi-asserted-by":"publisher","DOI":"10.1145\/3589335.3648308"},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1145\/3530257"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477495.3531963"},{"key":"e_1_3_1_74_2","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313408"},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v39i12.33408"},{"key":"e_1_3_1_76_2","doi-asserted-by":"publisher","DOI":"10.1145\/3627673.3680037"},{"key":"e_1_3_1_77_2","doi-asserted-by":"publisher","DOI":"10.1145\/3626772.3657839"},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3358113"},{"key":"e_1_3_1_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM59182.2024.00113"},{"key":"e_1_3_1_80_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v39i12.33422"},{"key":"e_1_3_1_81_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v39i12.33426"},{"key":"e_1_3_1_82_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289600.3290975"},{"key":"e_1_3_1_83_2","doi-asserted-by":"publisher","DOI":"10.1145\/3616855.3635760"},{"key":"e_1_3_1_84_2","first-page":"5256","article-title":"Ninerec: A benchmark dataset suite for evaluating transferable recommendation","author":"Zhang Jiaqi","year":"2024","unstructured":"Jiaqi Zhang, Yu Cheng, Yongxin Ni, Yunzhu Pan, Zheng Yuan, Junchen Fu, Youhua Li, Jie Wang, and Fajie Yuan. 2024. Ninerec: A benchmark dataset suite for evaluating transferable recommendation. IEEE Transactions on Pattern Analysis and Machine Intelligence 47 (2024), 5256\u20135267.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_85_2","doi-asserted-by":"publisher","DOI":"10.1145\/3682075"},{"key":"e_1_3_1_86_2","doi-asserted-by":"publisher","DOI":"10.1145\/3158369"},{"key":"e_1_3_1_87_2","doi-asserted-by":"publisher","DOI":"10.5555\/3367471.3367642"},{"key":"e_1_3_1_88_2","doi-asserted-by":"publisher","DOI":"10.1145\/3649447"},{"key":"e_1_3_1_89_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33015941"},{"key":"e_1_3_1_90_2","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219823"},{"key":"e_1_3_1_91_2","doi-asserted-by":"publisher","DOI":"10.3390\/app132011378"},{"key":"e_1_3_1_92_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2024.3369875"},{"key":"e_1_3_1_93_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3611943"}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3777377","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,23]],"date-time":"2025-12-23T14:07:16Z","timestamp":1766498836000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3777377"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,23]]},"references-count":92,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3777377"],"URL":"https:\/\/doi.org\/10.1145\/3777377","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"type":"print","value":"1046-8188"},{"type":"electronic","value":"1558-2868"}],"subject":[],"published":{"date-parts":[[2025,12,23]]},"assertion":[{"value":"2024-12-31","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-11","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-23","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}