{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T23:44:17Z","timestamp":1772322257239,"version":"3.50.1"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"5","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation","doi-asserted-by":"crossref","award":["62172300, 62372326, and 62202336"],"award-info":[{"award-number":["62172300, 62372326, and 62202336"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2024YFC3811100"],"award-info":[{"award-number":["2024YFC3811100"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"crossref","award":["2024-4-YB-03"],"award-info":[{"award-number":["2024-4-YB-03"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Intell. Syst. Technol."],"published-print":{"date-parts":[[2025,10,31]]},"abstract":"<jats:p>\n            With the growing diversity of data formats on social media, such as text, images, and videos, there is a growing need to analyze sentiment from multiple modalities. Multimodal sentiment detection, which aims to identify users\u2019 sentiment by jointly modeling information from different modalities, has thus attracted increasing attention. However, most existing multimodal sentiment detection methods fuse multimodal information directly after the unimodal encoding and overlook the modality consistency of multimodal vector spaces before the fusion, which may damage the accuracy of multimodal sentiment detection. 
To address this issue, we propose a contrastive learning-based multimodal sentiment detection model, termed EPMC, which maps the representations of different modalities into a unified semantic space before fusion. EPMC operates in two stages, i.e., a pre-training stage and a fine-tuning stage. At the pre-training stage, we design a cross-modal transformation module to map different modalities into a unified feature space. Meanwhile, to further capture the relationship between the cross-modal transformation vectors and the unimodal encoding vectors, we propose a multimodal consistency contrastive learning task that helps the model discern and amplify the cross-modal similarity between different modalities, thereby learning more discriminative features for sentiment detection. At the fine-tuning stage, EPMC is iteratively refined using the learned multimodal representation and guided by the cross-entropy loss. Extensive experiments conducted on three public multimodal datasets validate the effectiveness of the EPMC model. 
The official implementation of EPMC is released at\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/ADMIS-TONGJI\/EPMC\">https:\/\/github.com\/ADMIS-TONGJI\/EPMC<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3748658","type":"journal-article","created":{"date-parts":[[2025,9,15]],"date-time":"2025-09-15T13:49:05Z","timestamp":1757944145000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Ensuring Pre-Fusion Modality Consistency: A New Approach to Multimodal Sentiment Detection"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-1150-1569","authenticated-orcid":false,"given":"Yulou","family":"Shu","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Tongji University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8768-6740","authenticated-orcid":false,"given":"Wengen","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Tongji University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9800-3271","authenticated-orcid":false,"given":"Yu-Ping","family":"Ruan","sequence":"additional","affiliation":[{"name":"Zhejiang Lab, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-5376-043X","authenticated-orcid":false,"given":"Wuchao","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Tongji University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9931-4733","authenticated-orcid":false,"given":"Yichao","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Tongji University, Shanghai, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2313-7635","authenticated-orcid":false,"given":"Jihong","family":"Guan","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Tongji University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1949-2768","authenticated-orcid":false,"given":"Shuigeng","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of Computer Science, Fudan University, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,16]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.4018\/IJSSMET.2019040103"},{"key":"e_1_3_1_3_2","volume-title":"Sentiment Analysis and Opinion Mining","author":"Liu Bing","year":"2022","unstructured":"Bing Liu. 2022. Sentiment Analysis and Opinion Mining. Springer Nature."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1561\/1500000011"},{"key":"e_1_3_1_5_2","first-page":"79","volume-title":"Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP \u201902)","author":"Pang Bo","year":"2002","unstructured":"Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP \u201902), 79\u201386."},{"key":"e_1_3_1_6_2","article-title":"XLNet: Generalized autoregressive pretraining for language understanding","volume":"32","author":"Yang Zhilin","year":"2019","unstructured":"Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, Vol. 
32.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654930"},{"issue":"5","key":"e_1_3_1_8_2","first-page":"1358","article-title":"WSCNet: Weakly supervised coupled networks for visual sentiment classification and detection","volume":"22","author":"She Dongyu","year":"2019","unstructured":"Dongyu She, Jufeng Yang, Ming-Ming Cheng, Yu-Kun Lai, Paul L. Rosin, and Liang Wang. 2019. WSCNet: Weakly supervised coupled networks for visual sentiment classification and detection. IEEE Transactions on Multimedia 22, 5 (2019), 1358\u20131371.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1002\/widm.1253"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.412"},{"key":"e_1_3_1_11_2","doi-asserted-by":"crossref","unstructured":"Preslav Nakov Alan Ritter Sara Rosenthal Fabrizio Sebastiani and Veselin Stoyanov. 2019. Semeval-2016 task 4: Sentiment analysis in Twitter. arXiv:1912.01973. Retrieved from https:\/\/arxiv.org\/abs\/1912.01973","DOI":"10.18653\/v1\/S16-1001"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.3301305"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3210093"},{"key":"e_1_3_1_14_2","doi-asserted-by":"crossref","first-page":", 785","DOI":"10.1007\/978-3-030-47436-2_59","volume-title":"Proceedings of the 24th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD \u201920)","author":"Jiang Tao","year":"2020","unstructured":"Tao Jiang, Jiahai Wang, Zhiyue Liu, and Yingbiao Ling. 2020. Fusion-extraction network for multimodal sentiment analysis. In Proceedings of the 24th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD \u201920). 
Springer, 785\u2013797."},{"key":"e_1_3_1_15_2","first-page":"2282","volume-title":"Findings of the Association for Computational Linguistics (NAACL \u201922)","author":"Li Zhen","year":"2022","unstructured":"Zhen Li, Bing Xu, Conghui Zhu, and Tiejun Zhao. July. 2022. CLMLF: A contrastive learning and multi-layer fusion method for multimodal sentiment detection. In Findings of the Association for Computational Linguistics (NAACL \u201922). Association for Computational Linguistics, Seattle, 2282\u20132294."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.28"},{"key":"e_1_3_1_17_2","unstructured":"Tomas Mikolov Kai Chen Greg Corrado and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781. Retrieved from https:\/\/arxiv.org\/abs\/1301.3781"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-018-1236-4"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_1_20_2","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI Technical Report. Retrieved from https:\/\/www.openai.com\/research\/language-unsupervised"},{"key":"e_1_3_1_21_2","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Long and Short Papers, Vol. 1. 
Association for Computational Linguistics, Minneapolis, MN, 4171\u20134186."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00532"},{"key":"e_1_3_1_23_2","first-page":"3595","volume-title":"Proceedings of the International Joint Conference on Artificial Intelligence","author":"Zhu Xinge","year":"2017","unstructured":"Xinge Zhu, Liang Li, Weigang Zhang, Tianrong Rao, Min Xu, Qingming Huang, and Dong Xu. 2017. Dependency exploitation: A unified CNN-RNN approach for visual emotion recognition. In Proceedings of the International Joint Conference on Artificial Intelligence, 3595\u20133601."},{"key":"e_1_3_1_24_2","first-page":", 381","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"29","author":"You Quanzeng","year":"2015","unstructured":"Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2015. Robust image sentiment analysis using progressively trained and domain transferred deep networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29, AAAI, 381\u2013388."},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2021.107676"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2021.3062200"},{"key":"e_1_3_1_27_2","first-page":"474","volume-title":"Proceedings of the 2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA)","author":"Bhuiyan Hanif","year":"2017","unstructured":"Hanif Bhuiyan, Jinat Ara, Rajon Bardhan, and Md Rashedul Islam. 2017. Retrieving YouTube video by sentiment analysis on user comment. In Proceedings of the 2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA). IEEE, 474\u2013478."},{"key":"e_1_3_1_28_2","first-page":"1","volume-title":"Proceedings of the 3rd International Conference on Learning Representations (ICLR \u201915)","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. 
Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR \u201915), 1\u201314."},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3366423.3380000"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413678"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.3035277"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3160060"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISI.2017.8004895"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3132847.3133142"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2022.3178204"},{"key":"e_1_3_1_36_2","first-page":"5240","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics","volume":"1","author":"Wei Yiwei","year":"2023","unstructured":"Yiwei Wei, Shaozu Yuan, Ruosong Yang, Lei Shen, Zhangmeizhi Li, Longbiao Wang, and Meng Chen. 2023. Tackling modality heterogeneity with multi-view calibration network for multimodal sentiment detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Long Papers, Vol. 1, 5240\u20135252."},{"key":"e_1_3_1_37_2","first-page":"16051","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Li Dongyuan","unstructured":"Dongyuan Li, Yusong Wang, Kotaro Funakoshi, and Manabu Okumura. 2023. Joyful: Joint modality fusion and graph contrastive learning for multimodal emotion recognition. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 
Association for Computational Linguistics, Singapore, 16051\u201316069."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2023.122731"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2024.111848"},{"key":"e_1_3_1_40_2","first-page":", 2592","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing","volume":"1","author":"Li Wei","year":"2021","unstructured":"Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2021. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Long Papers, Vol. 1. Association for Computational Linguistics, 2592\u20132607."},{"key":"e_1_3_1_41_2","first-page":"5583","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning. PLMR, 5583\u20135594."},{"key":"e_1_3_1_42_2","unstructured":"Jiahui Yu Zirui Wang Vijay Vasudevan Legg Yeung Mojtaba Seyedhosseini and Yonghui Wu. 2022. Coca: Contrastive captioners are image-text foundation models. arXiv:2205.01917. 
Retrieved from https:\/\/arxiv.org\/abs\/2205.01917"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01519"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01838"},{"key":"e_1_3_1_45_2","first-page":"9694","article-title":"Align before fuse: Vision and language representation learning with momentum distillation","volume":"34","author":"Li Junnan","year":"2021","unstructured":"Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34 (2021), 9694\u20139705.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_46_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"e_1_3_1_48_2","first-page":"2070","article-title":"Weakly correlated multimodal sentiment analysis: New dataset and topic-oriented model","author":"Liu Wuchao","year":"2024","unstructured":"Wuchao Liu, Wengen Li, Yu-Ping Ruan, Yulou Shu, Juntao Chen, Yina Li, Caili Yu, Yichao Zhang, Jihong Guan, and Shuigeng Zhou. 2024. Weakly correlated multimodal sentiment analysis: New dataset and topic-oriented model. 
IEEE Transactions on Affective Computing 15, 4 (2024), 2070\u20132082.","journal-title":"IEEE Transactions on Affective Computing"},{"key":"e_1_3_1_49_2","first-page":", 15","volume-title":"Proceedings of the 22nd International Conference on MultiMedia Modeling (MMM \u201916)","author":"Niu Teng","year":"2016","unstructured":"Teng Niu, Shiai Zhu, Lei Pang, and Abdulmotaleb El Saddik. 2016. Sentiment analysis on multi-view social data. In Proceedings of the 22nd International Conference on MultiMedia Modeling (MMM \u201916). Springer, 15\u201327."},{"key":"e_1_3_1_50_2","first-page":"1746","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Kim Yoon","unstructured":"Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1746\u20131751."},{"key":"e_1_3_1_51_2","unstructured":"Zhenzhong Lan Mingda Chen Sebastian Goodman Kevin Gimpel Piyush Sharma and Radu Soricut. 2019. Albert: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942. Retrieved from https:\/\/arxiv.org\/abs\/1909.11942"},{"key":"e_1_3_1_52_2","doi-asserted-by":"crossref","unstructured":"Alexis Conneau Kartikay Khandelwal Naman Goyal Vishrav Chaudhary Guillaume Wenzek Francisco Guzm\u00e1n Edouard Grave Myle Ott Luke Zettlemoyer and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv:1911.02116. Retrieved from https:\/\/arxiv.org\/abs\/1911.02116","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_1_55_2","unstructured":"Georgios Chochlakis Tejas Srinivasan Jesse Thomason and Shrikanth Narayanan. 2022. 
Vault: Augmenting the vision-and-language transformer for sentiment classification on social media. arXiv:2208.09021. Retrieved from https:\/\/arxiv.org\/abs\/2208.09021"},{"issue":"11","key":"e_1_3_1_56_2","article-title":"Visualizing data using t-SNE","volume":"9","author":"Van der Maaten Laurens","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9 (2008), 11.","journal-title":"Journal of Machine Learning Research"}],"container-title":["ACM Transactions on Intelligent Systems and Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3748658","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T11:37:17Z","timestamp":1760614637000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3748658"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,16]]},"references-count":55,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2025,10,31]]}},"alternative-id":["10.1145\/3748658"],"URL":"https:\/\/doi.org\/10.1145\/3748658","relation":{},"ISSN":["2157-6904","2157-6912"],"issn-type":[{"value":"2157-6904","type":"print"},{"value":"2157-6912","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,16]]},"assertion":[{"value":"2024-09-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-30","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}