{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T22:57:46Z","timestamp":1781650666097,"version":"3.54.5"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"10","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62476005, 62076006"],"award-info":[{"award-number":["62476005, 62076006"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Opening Foundation of State Key Laboratory of Cognitive Intelligence, iFLYTEK","award":["COGOS-2023HE02"],"award-info":[{"award-number":["COGOS-2023HE02"]}]},{"name":"University Synergy Innovation Program of Anhui Province","award":["GXXT-2021-008"],"award-info":[{"award-number":["GXXT-2021-008"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,10,31]]},"abstract":"<jats:p>User-generated multimodal data can provide powerful sentiment clues for sentiment analysis task. Existing works have aligned common sentiment features in different modalities through various multimodal fusion methods. However, these works have certain limitations: (1) Previous research works only align common sentiment features between image and text, without fully exploring interactions among these features, leading to suboptimal analysis results. (2) Redundant noise in image and text increases the risk of feature misalignment during cross-modal alignment. To address these issues, we propose a Multimodal Semantic Fusion Network (MSFN) to deeply explore the semantic relationship between image and text for Multimodal Sentiment Analysis (MSA). Specifically, we align image region and text word features related to sentiment by using a gated attention mechanism. 
Subsequently, we employ graph convolutional networks to model the interactions among these features to obtain explicit sentiment semantics. The proposed gated attention mechanism corrects potential feature misalignment during cross-modal alignment using a gating mechanism. Moreover, considering not all image\u2013text pairs have explicit corresponding sentiment features, we integrate implicit sentiment semantics to our model for enhancing reliability in analysis. Experimental results on benchmark datasets demonstrate the effectiveness of our proposed model compared to baselines.<\/jats:p>","DOI":"10.1145\/3744648","type":"journal-article","created":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T10:43:10Z","timestamp":1750156990000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["A Multimodal Semantic Fusion Network with Cross-Modal Alignment for Multimodal Sentiment Analysis"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0540-7593","authenticated-orcid":false,"given":"Shunxiang","family":"Zhang","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, China and School of Computer, Huainan Normal University, Huainan, China and Artificial Intelligence Research Institute, Hefei Comprehensive National Science Center, Hefei, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-1219-3200","authenticated-orcid":false,"given":"Jiajia","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, China  and Artificial Intelligence Research Institute, Hefei Comprehensive National Science Center, Hefei, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-9168-8786","authenticated-orcid":false,"given":"Yixuan","family":"Jiao","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, China and Artificial Intelligence Research Institute, Hefei Comprehensive National Science Center, Hefei, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-3344-857X","authenticated-orcid":false,"given":"Yulei","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, China and Artificial Intelligence Research Institute, Hefei Comprehensive National Science Center, Hefei, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9706-1943","authenticated-orcid":false,"given":"Lei","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computer, Huainan Normal University, Huainan, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1381-4364","authenticated-orcid":false,"given":"Kuanching","family":"Li","sequence":"additional","affiliation":[{"name":"Computer Science and Information Engineering, Providence University, Taichung, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,10,14]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2022.09.025"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cosrev.2020.100336"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2021.107018"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3586075"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3388861"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2020.3035277"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3160060"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2021.108107"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.findings-naacl.175"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2022.3171091"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2023.111346"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3517139"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2023.122731"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.aacl-main.32"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3210093"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2019.01.019"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.287"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10489-021-02936-9"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.28"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10489-023-05151-w"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3593583"},{"key":"e_1_3_1_23
_2","unstructured":"Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907. Retrieved from https:\/\/arxiv.org\/abs\/1609.02907"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2011.11.107"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1007\/s12559-022-10043-1"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2022.04.004"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2012.07.059"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2023.3234427"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jretconser.2022.103011"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1181"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.asoc.2022.109251"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2020.07.022"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.26599\/BDMA.2020.9020024"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-023-00726-3"},{"key":"e_1_3_1_35_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. arXiv:1706.03762. Retrieved from https:\/\/arxiv.org\/abs\/1706.03762"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.146"},{"key":"e_1_3_1_38_2","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692. 
Retrieved from https:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/1873951.1873965"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654930"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/2502081.2502282"},{"key":"e_1_3_1_42_2","unstructured":"Tao Chen Damian Borth Trevor Darrell and Shih-Fu Chang. 2014. DeepSentibank: Visual sentiment concept classification with deep convolutional neural networks. arXiv:1410.8586. Retrieved from https:\/\/arxiv.org\/abs\/1410.8586"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3094362"},{"key":"e_1_3_1_44_2","unstructured":"Quanzeng You Jiebo Luo Hailin Jin and Jianchao Yang. 2015. Robust image sentiment analysis using progressively trained and domain-transferred deep networks. arXiv:1509.06041. Retrieved from https:\/\/arxiv.org\/abs\/1509.06041"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3326335"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3359753"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2021.10.062"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2012.10.009"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISI.2017.8004895"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3132847.3133142"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2023.110502"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2023.107335"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2022.103193"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2022.119240"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.naacl-long.197"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01267"},{"key":"e_
1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-27674-8_2"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1239"},{"key":"e_1_3_1_60_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01506"},{"key":"e_1_3_1_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01683"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2024.3367940"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3499027"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00586"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.349"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3744648","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,14]],"date-time":"2025-10-14T21:23:45Z","timestamp":1760477025000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3744648"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,14]]},"references-count":65,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2025,10,31]]}},"alternative-id":["10.1145\/3744648"],"URL":"https:\/\/doi.org\/10.1145\/3744648","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,14]]},"assertion":[{"value":"2024-06-29","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-05","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-14","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}