{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,16]],"date-time":"2026-02-16T18:01:32Z","timestamp":1771264892932,"version":"3.50.1"},"reference-count":70,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,1,11]],"date-time":"2024-01-11T00:00:00Z","timestamp":1704931200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62176084, and 62176083"],"award-info":[{"award-number":["62176084, and 62176083"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities of China","doi-asserted-by":"crossref","award":["PA2021GDSK0092 and PA2022GDSK0066"],"award-info":[{"award-number":["PA2021GDSK0092 and PA2022GDSK0066"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Provincial Natural Science Research Project","award":["KJ2020A0773"],"award-info":[{"award-number":["KJ2020A0773"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,4,30]]},"abstract":"<jats:p>Human Multimodal Sentiment Analysis (MSA) is an attractive research area that studies sentiment expressed through multiple heterogeneous modalities. While transformer-based methods have achieved great success, designing an effective \u201cco-attention\u201d model to associate the text modality with nonverbal modalities remains challenging. There are two main problems: 1) the dominant role of the text modality is underutilized, and 2) the interaction between modalities is not sufficiently explored. 
This paper proposes a deep modular Co-Attention Shifting Network (CoASN) for MSA. A Cross-modal Modulation Module based on Co-attention (CMMC) and an Advanced Modality-mixing Adaptation Gate (AMAG) are constructed. The CMMC consists of Text-guided Co-Attention (TCA) and Interior Transformer Encoder (ITE) units that capture inter-modal and intra-modal features, respectively. With the text modality as its core, the CMMC module aims to guide and promote the expression of emotion in the nonverbal modalities, while the nonverbal modalities enrich the text-based multimodal sentiment information. In addition, the AMAG module is introduced to explore the dynamic correlations among all modalities. In particular, this efficient module first captures the nonverbal shifted representations and then combines them to compute the shifted word embedding representations for the final MSA tasks. Extensive experiments on two commonly used datasets, CMU-MOSI and CMU-MOSEI, demonstrate that our proposed method outperforms the state of the art.<\/jats:p>","DOI":"10.1145\/3634706","type":"journal-article","created":{"date-parts":[[2023,11,27]],"date-time":"2023-11-27T16:02:17Z","timestamp":1701100937000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["Deep Modular Co-Attention Shifting Network for Multimodal Sentiment Analysis"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0783-5487","authenticated-orcid":false,"given":"Piao","family":"Shi","sequence":"first","affiliation":[{"name":"The Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education, Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, National Smart Eldercare International Science and Technology Cooperation Base, School of Computer Science and Information Engineering, Hefei University of Technology, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2122-0240","authenticated-orcid":false,"given":"Min","family":"Hu","sequence":"additional","affiliation":[{"name":"The Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education, Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, National Smart Eldercare International Science and Technology Cooperation Base, School of Computer Science and Information Engineering, Hefei University of Technology, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1514-9265","authenticated-orcid":false,"given":"Xuefeng","family":"Shi","sequence":"additional","affiliation":[{"name":"The Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education, Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, National Smart Eldercare International Science and Technology Cooperation Base, School of Computer Science and Information Engineering, Hefei University of Technology, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4860-9184","authenticated-orcid":false,"given":"Fuji","family":"Ren","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, University of Electronic Science and Technology of China, China"}]}],"member":"320","published-online":{"date-parts":[[2024,1,11]]},"reference":[{"key":"e_1_3_2_2_2","article-title":"TEASEL: A transformer-based speech-prefixed language model","author":"Arjmand Mehdi","year":"2021","unstructured":"Mehdi Arjmand, Mohammad Javad Dousti, and Hadi Moradi. 2021. TEASEL: A transformer-based speech-prefixed language model. 
arXiv preprint arXiv:2109.05522 (2021).","journal-title":"arXiv preprint arXiv:2109.05522"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2021.107134"},{"key":"e_1_3_2_4_2","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877\u20131901.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_5_2","first-page":"104225872211021","article-title":"Pandemic depression: COVID-19 and the mental health of the self-employed","author":"Caliendo Marco","year":"2022","unstructured":"Marco Caliendo, Daniel Graeber, Alexander S. Kritikos, and Johannes Seebauer. 2022. Pandemic depression: COVID-19 and the mental health of the self-employed. Entrepreneurship Theory and Practice (2022), 10422587221102106.","journal-title":"Entrepreneurship Theory and Practice"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.coling-main.93"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3136755.3136801"},{"key":"e_1_3_2_8_2","article-title":"Multimodal sentiment analysis based on attentional temporal convolutional network and multi-layer feature fusion","author":"Cheng Hongju","year":"2023","unstructured":"Hongju Cheng, Zizhen Yang, Xiaoqi Zhang, and Yang Yang. 2023. 
Multimodal sentiment analysis based on attentional temporal convolutional network and multi-layer feature fusion. IEEE Transactions on Affective Computing (2023).","journal-title":"IEEE Transactions on Affective Computing"},{"key":"e_1_3_2_9_2","article-title":"Electra: Pre-training text encoders as discriminators rather than generators","author":"Clark Kevin","year":"2020","unstructured":"Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020).","journal-title":"arXiv preprint arXiv:2003.10555"},{"key":"e_1_3_2_10_2","doi-asserted-by":"crossref","first-page":"960","DOI":"10.1109\/ICASSP.2014.6853739","volume-title":"2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Degottex Gilles","year":"2014","unstructured":"Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP-A collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 960\u2013964."},{"key":"e_1_3_2_11_2","first-page":"1","volume-title":"2022 International Joint Conference on Neural Networks (IJCNN)","author":"Fang Lingyong","year":"2022","unstructured":"Lingyong Fang, Gongshen Liu, and Ru Zhang. 2022. Sense-aware BERT and multi-task fine-tuning for multimodal sentiment analysis. In 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 1\u20138."},{"key":"e_1_3_2_12_2","article-title":"Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions","author":"Gandhi Ankita","year":"2022","unstructured":"Ankita Gandhi, Kinjal Adhvaryu, Soujanya Poria, Erik Cambria, and Amir Hussain. 2022. 
Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion (2022).","journal-title":"Information Fusion"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548137"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3462244.3479919"},{"key":"e_1_3_2_15_2","article-title":"TextMI: Textualize multimodal information for integrating non-verbal cues in pre-trained language models","author":"Hasan Md. Kamrul","year":"2023","unstructured":"Md. Kamrul Hasan, Md. Saiful Islam, Sangwu Lee, Wasifur Rahman, Iftekhar Naim, Mohammed Ibrahim Khan, and Ehsan Hoque. 2023. TextMI: Textualize multimodal information for integrating non-verbal cues in pre-trained language models. arXiv preprint arXiv:2303.15430 (2023).","journal-title":"arXiv preprint arXiv:2303.15430"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413678"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2021.3139856"},{"key":"e_1_3_2_18_2","article-title":"Dynamic invariant-specific representation fusion network for multimodal sentiment analysis","volume":"2022","author":"He Jing","year":"2022","unstructured":"Jing He, Haonan Yanga, Changfan Zhang, Hongrun Chen, and Yifu Xua. 2022. Dynamic invariant-specific representation fusion network for multimodal sentiment analysis. Computational Intelligence and Neuroscience 2022 (2022).","journal-title":"Computational Intelligence and Neuroscience"},{"key":"e_1_3_2_19_2","doi-asserted-by":"crossref","first-page":"110502","DOI":"10.1016\/j.knosys.2023.110502","article-title":"TeFNA: Text-centered fusion network with crossmodal attention for multimodal sentiment analysis","author":"Huang Changqin","year":"2023","unstructured":"Changqin Huang, Junling Zhang, Xuemei Wu, Yi Wang, Ming Li, and Xiaodi Huang. 2023. 
TeFNA: Text-centered fusion network with crossmodal attention for multimodal sentiment analysis. Knowledge-Based Systems (2023), 110502.","journal-title":"Knowledge-Based Systems"},{"issue":"1","key":"e_1_3_2_20_2","first-page":"876","article-title":"A survey of computational approaches and challenges in multimodal sentiment analysis","volume":"7","author":"Huddar Mahesh G.","year":"2019","unstructured":"Mahesh G. Huddar, Sanjeev S. Sannakki, and Vijay S. Rajpurohit. 2019. A survey of computational approaches and challenges in multimodal sentiment analysis. Int. J. Comput. Sci. Eng. 7, 1 (2019), 876\u2013883.","journal-title":"Int. J. Comput. Sci. Eng."},{"key":"e_1_3_2_21_2","unstructured":"iMotions. 2017. Facial expression analysis. (2017). https:\/\/imotions.com\/biosensor\/fea-facial-expression-analysis\/"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00300"},{"key":"e_1_3_2_23_2","first-page":"4171","volume-title":"Proceedings of NAACL-HLT","author":"Kenton Jacob Devlin Ming-Wei Chang","year":"2019","unstructured":"Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT. 4171\u20134186."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3529954"},{"key":"e_1_3_2_25_2","first-page":"869","volume-title":"IJCAI","author":"Liu Fei","year":"2019","unstructured":"Fei Liu, Jing Liu, Zhiwei Fang, Richang Hong, and Hanqing Lu. 2019. Densely connected attention flow for visual question answering. In IJCAI. 869\u2013875."},{"key":"e_1_3_2_26_2","article-title":"RoBERTa: A robustly optimized BERT pretraining approach","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. 
arXiv preprint arXiv:1907.11692 (2019).","journal-title":"arXiv preprint arXiv:1907.11692"},{"key":"e_1_3_2_27_2","article-title":"Efficient low-rank multimodal fusion with modality-specific factors","author":"Liu Zhun","year":"2018","unstructured":"Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064 (2018).","journal-title":"arXiv preprint arXiv:1806.00064"},{"key":"e_1_3_2_28_2","article-title":"Hierarchical question-image co-attention for visual question answering","volume":"29","author":"Lu Jiasen","year":"2016","unstructured":"Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. Advances in Neural Information Processing Systems 29 (2016).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_29_2","article-title":"Multi-scale cooperative multimodal transformers for multimodal sentiment analysis in videos","author":"Ma Lianyang","year":"2022","unstructured":"Lianyang Ma, Yu Yao, Tao Liang, and Tongliang Liu. 2022. Multi-scale cooperative multimodal transformers for multimodal sentiment analysis in videos. 
arXiv preprint arXiv:2206.07981 (2022).","journal-title":"arXiv preprint arXiv:2206.07981"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2022.3172360"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1037\/h0024648"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1037\/h0024532"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/2070481.2070509"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.232"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00637"},{"key":"e_1_3_2_36_2","article-title":"How do vision transformers work?","author":"Park Namuk","year":"2022","unstructured":"Namuk Park and Songkuk Kim. 2022. How do vision transformers work? arXiv preprint arXiv:2202.06709 (2022).","journal-title":"arXiv preprint arXiv:2202.06709"},{"key":"e_1_3_2_37_2","doi-asserted-by":"crossref","first-page":"1973","DOI":"10.21437\/Interspeech.2022-532","article-title":"Word-wise sparse attention for multimodal sentiment analysis","author":"Qian Fan","year":"2022","unstructured":"Fan Qian, Hongwei Song, and Jiqing Han. 2022. Word-wise sparse attention for multimodal sentiment analysis. Proc. Interspeech 2022 (2022), 1973\u20131977.","journal-title":"Proc. Interspeech 2022"},{"key":"e_1_3_2_38_2","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018). https:\/\/api.semanticscholar.org\/CorpusID:49313245"},{"issue":"8","key":"e_1_3_2_39_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. 
OpenAI Blog 1, 8 (2019), 9.","journal-title":"OpenAI Blog"},{"key":"e_1_3_2_40_2","first-page":"2359","volume-title":"Proceedings of the Conference. Association for Computational Linguistics. Meeting","volume":"2020","author":"Rahman Wasifur","year":"2020","unstructured":"Wasifur Rahman, Md. Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the Conference. Association for Computational Linguistics. Meeting, Vol. 2020. NIH Public Access, 2359."},{"issue":"6","key":"e_1_3_2_41_2","doi-asserted-by":"crossref","first-page":"810","DOI":"10.1109\/THMS.2016.2599495","article-title":"Automatic facial expression learning method based on humanoid robot XIN-REN","volume":"46","author":"Ren Fuji","year":"2016","unstructured":"Fuji Ren and Zhong Huang. 2016. Automatic facial expression learning method based on humanoid robot XIN-REN. IEEE Transactions on Human-Machine Systems 46, 6 (2016), 810\u2013821.","journal-title":"IEEE Transactions on Human-Machine Systems"},{"issue":"4","key":"e_1_3_2_42_2","first-page":"043056","article-title":"Uncertain and biased facial expression recognition based on depthwise separable convolutional neural network with embedded attention mechanism","volume":"31","author":"Shi Piao","year":"2022","unstructured":"Piao Shi, Min Hu, Fuji Ren, Xuefeng Shi, Hongbo Li, Zezhong Li, and Hui Lin. 2022. Uncertain and biased facial expression recognition based on depthwise separable convolutional neural network with embedded attention mechanism. Journal of Electronic Imaging 31, 4 (2022), 043056.","journal-title":"Journal of Electronic Imaging"},{"key":"e_1_3_2_43_2","article-title":"Learning modality-fused representation based on transformer for emotion analysis","author":"Shi Piao","year":"2022","unstructured":"Piao Shi, Min Hu, Fuji Ren, Xuefeng Shi, and Liangfeng Xu. 2022. 
Learning modality-fused representation based on transformer for emotion analysis. Journal of Electronic Imaging (2022).","journal-title":"Journal of Electronic Imaging"},{"issue":"8","key":"e_1_3_2_44_2","doi-asserted-by":"crossref","first-page":"1698","DOI":"10.3390\/sym14081698","article-title":"ELM-based active learning via asymmetric samplers: Constructing a multi-class text corpus for emotion classification","volume":"14","author":"Shi Xuefeng","year":"2022","unstructured":"Xuefeng Shi, Min Hu, Fuji Ren, Piao Shi, and Xiao Sun. 2022. ELM-based active learning via asymmetric samplers: Constructing a multi-class text corpus for emotion classification. Symmetry 14, 8 (2022), 1698.","journal-title":"Symmetry"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2017.08.003"},{"key":"e_1_3_2_46_2","first-page":"8992","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"34","author":"Sun Zhongkai","year":"2020","unstructured":"Zhongkai Sun, Prathusha Sarma, William Sethares, and Yingyu Liang. 2020. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8992\u20138999."},{"issue":"4","key":"e_1_3_2_47_2","doi-asserted-by":"crossref","first-page":"1966","DOI":"10.1109\/TCSVT.2022.3218018","article-title":"BAFN: Bi-direction attention based fusion network for multimodal sentiment analysis","volume":"33","author":"Tang Jiajia","year":"2022","unstructured":"Jiajia Tang, Dongjun Liu, Xuanyu Jin, Yong Peng, Qibin Zhao, Yu Ding, and Wanzeng Kong. 2022. BAFN: Bi-direction attention based fusion network for multimodal sentiment analysis. 
IEEE Transactions on Circuits and Systems for Video Technology 33, 4 (2022), 1966\u20131978.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.3390\/jimaging7080157"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1656"},{"key":"e_1_3_2_50_2","first-page":"1823","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing","volume":"2020","author":"Tsai Yao-Hung Hubert","year":"2020","unstructured":"Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Multimodal routing: Improving local and global interpretability of multimodal language analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, Vol. 2020. NIH Public Access, 1823."},{"issue":"11","key":"e_1_3_2_51_2","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"Maaten Laurens van der","year":"2008","unstructured":"Laurens van der Maaten and Geoffrey E. Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008), 2579\u20132605.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_52_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"3","key":"e_1_3_2_53_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3572915","article-title":"AMSA: Adaptive multimodal learning for sentiment analysis","volume":"19","author":"Wang Jingyao","year":"2023","unstructured":"Jingyao Wang, Luntian Mou, Lei Ma, Tiejun Huang, and Wen Gao. 2023. AMSA: Adaptive multimodal learning for sentiment analysis. ACM Transactions on Multimedia Computing, Communications and Applications 19, 3s (2023), 1\u201321.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33017216"},{"issue":"2","key":"e_1_3_2_55_2","first-page":"1","article-title":"A optimized BERT for multimodal sentiment analysis","volume":"19","author":"Wu Jun","year":"2023","unstructured":"Jun Wu, Tianliang Zhu, Jiahui Zhu, Tianyi Li, and Chunzhi Wang. 2023. A optimized BERT for multimodal sentiment analysis. ACM Transactions on Multimedia Computing, Communications and Applications 19, 2s (2023), 1\u201312.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_2_56_2","article-title":"Multi-level attention map network for multimodal sentiment analysis","author":"Xue Xiaojun","year":"2022","unstructured":"Xiaojun Xue, Chunxia Zhang, Zhendong Niu, and Xindong Wu. 2022. Multi-level attention map network for multimodal sentiment analysis. 
IEEE Transactions on Knowledge and Data Engineering (2022).","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3517139"},{"key":"e_1_3_2_58_2","article-title":"Multimodal sentiment analysis with two-phase multi-task learning","author":"Yang Bo","year":"2022","unstructured":"Bo Yang, Lijun Wu, Jinhua Zhu, Bo Shao, Xiaola Lin, and Tie-Yan Liu. 2022. Multimodal sentiment analysis with two-phase multi-task learning. IEEE\/ACM Transactions on Audio, Speech, and Language Processing (2022).","journal-title":"IEEE\/ACM Transactions on Audio, Speech, and Language Processing"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413690"},{"key":"e_1_3_2_60_2","article-title":"XLNet: Generalized autoregressive pretraining for language understanding","volume":"32","author":"Yang Zhilin","year":"2019","unstructured":"Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 32 (2019).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00644"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2018.2817340"},{"key":"e_1_3_2_63_2","article-title":"Tensor fusion network for multimodal sentiment analysis","author":"Zadeh Amir","year":"2017","unstructured":"Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. 
arXiv preprint arXiv:1707.07250 (2017).","journal-title":"arXiv preprint arXiv:1707.07250"},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12021"},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/p18-1208"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2016.94"},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/TETCI.2022.3224929"},{"key":"e_1_3_2_68_2","doi-asserted-by":"crossref","first-page":"4703","DOI":"10.1109\/ICASSP43922.2022.9746910","volume-title":"ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Zhao Jinming","year":"2022","unstructured":"Jinming Zhao, Ruichen Li, Qin Jin, Xinchao Wang, and Haizhou Li. 2022. MEmoBERT: Pre-training model with prompt-based learning for multimodal emotion recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4703\u20134707."},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2023.02.028"},{"key":"e_1_3_2_70_2","first-page":"7367","volume-title":"ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Zou Heqing","year":"2022","unstructured":"Heqing Zou, Yuke Si, Chen Chen, Deepu Rajan, and Eng Siong Chng. 2022. Speech emotion recognition with co-attention based multi-level acoustic information. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7367\u20137371."},{"key":"e_1_3_2_71_2","first-page":"1","volume-title":"2022 IEEE International Conference on Multimedia and Expo (ICME)","author":"Zou Wenwen","year":"2022","unstructured":"Wenwen Zou, Jundi Ding, and Chao Wang. 2022. Utilizing BERT intermediate layers for multimodal sentiment analysis. In 2022 IEEE International Conference on Multimedia and Expo (ICME). 
IEEE, 1\u20136."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3634706","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3634706","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:35:49Z","timestamp":1750178149000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3634706"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,11]]},"references-count":70,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,4,30]]}},"alternative-id":["10.1145\/3634706"],"URL":"https:\/\/doi.org\/10.1145\/3634706","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,1,11]]},"assertion":[{"value":"2023-04-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-21","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
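The record above is a Crossref REST API "work" message. A minimal sketch of extracting its headline fields with the standard-library `json` module, using a trimmed excerpt of the payload (all field names and values are copied from the response itself; the real message carries many more keys, such as `reference`, `funder`, and `license`):

```python
import json

# Trimmed excerpt of the Crossref "work" record shown above.
raw = '''
{"status": "ok",
 "message-type": "work",
 "message": {"DOI": "10.1145/3634706",
             "title": ["Deep Modular Co-Attention Shifting Network for Multimodal Sentiment Analysis"],
             "author": [{"given": "Piao", "family": "Shi"},
                        {"given": "Min", "family": "Hu"},
                        {"given": "Xuefeng", "family": "Shi"},
                        {"given": "Fuji", "family": "Ren"}],
             "published": {"date-parts": [[2024, 1, 11]]}}}
'''

work = json.loads(raw)["message"]

# Crossref wraps titles in a one-element array, and dates in nested
# "date-parts" lists of [year, month, day].
doi = work["DOI"]
title = work["title"][0]
authors = ["{} {}".format(a["given"], a["family"]) for a in work["author"]]
year = work["published"]["date-parts"][0][0]

print(doi)        # 10.1145/3634706
print(year)       # 2024
print(authors[0]) # Piao Shi
```

Note the array wrapping: even single-valued fields like `title` come back as lists, so a parser should index into them rather than treat them as strings.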