{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,6]],"date-time":"2026-04-06T05:33:53Z","timestamp":1775453633237,"version":"3.50.1"},"reference-count":49,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T00:00:00Z","timestamp":1775088000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Youth Fund of the National Natural Science Foundation of China","award":["12004275"],"award-info":[{"award-number":["12004275"]}]},{"DOI":"10.13039\/501100003398","name":"Shanxi Scholarship Council of China","doi-asserted-by":"crossref","award":["2024-060"],"award-info":[{"award-number":["2024-060"]}],"id":[{"id":"10.13039\/501100003398","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Natural Science Foundation of Shanxi Province, China","award":["202403021211098"],"award-info":[{"award-number":["202403021211098"]}]},{"name":"Startup Fund of Shanxi University of Electronic Science and Technology","award":["2025KJ016"],"award-info":[{"award-number":["2025KJ016"]}]},{"award":["2025KJ016"],"award-info":[{"award-number":["2025KJ016"]}],"id":[{"id":"https:\/\/ror.org\/0522dg826","id-type":"ROR","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MTI"],"abstract":"<jats:p>There are still challenges in speech emotion recognition, as the representation capability of single-modal information is limited, there are difficulties in capturing continuous emotional transitions in discrete emotion annotations, and the issues of modal structural differences and cross-sample alignment in multimodal fusion methods persist. To address these, this study undertakes work from both data and model perspectives. 
On the data side, a Chinese multimodal database, STEM-E2VA, was constructed by synchronously collecting four modalities: articulatory kinematics, acoustics, glottal signals, and video. It covers seven discrete emotion categories and employs continuous PAD annotation. By integrating discrete and continuous dimensional annotations, it better represents the distinction between strong and weak emotions under the same discrete emotion label. Concurrently, to address biases in PAD annotations, we employed the SCL-90 psychological questionnaire to analyze annotators\u2019 cognitive and emotional perceptions, thereby ensuring data reliability. On the model side, this paper proposes a multimodal supervised contrastive fusion network incorporating PAD perception. It employs a PAD-enhanced hybrid contrastive loss function to optimize intra-modal and inter-modal feature alignment. Utilizing a cross-attention mechanism combined with a GRU\u2013Transformer network for temporal feature extraction, it achieves deep fusion of multimodal information, reducing inter-modal discrepancies and cross-class confusion. Experiments demonstrate that the proposed method achieves 85.47% accuracy in discrete sentiment recognition on STEM-E2VA, with a substantial reduction in RMSE for PAD dimension prediction. 
It also exhibits excellent generalization capability on IEMOCAP, providing a novel framework for integrating discrete and continuous sentiment representations.<\/jats:p>","DOI":"10.3390\/mti10040038","type":"journal-article","created":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T14:09:54Z","timestamp":1775138994000},"page":"38","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["PAD-Guided Multimodal Hybrid Contrastive Emotion Recognition upon STEM-E2VA Dataset"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6072-8237","authenticated-orcid":false,"given":"Shufei","family":"Duan","sequence":"first","affiliation":[{"name":"College of Computer Science and Technology, Shanxi University of Electronic Science and Technology, Linfen 041000, China"},{"name":"College of Electronic Information Engineering, Taiyuan University of Technology, Jinzhong 030600, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenjie","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Electronic Information Engineering, Taiyuan University of Technology, Jinzhong 030600, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Liangqi","family":"Li","sequence":"additional","affiliation":[{"name":"College of Electronic Information Engineering, Taiyuan University of Technology, Jinzhong 030600, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3681-8659","authenticated-orcid":false,"given":"Ting","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Computing, Newcastle University, Newcastle NE1 7RU, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fangyu","family":"Zhao","sequence":"additional","affiliation":[{"name":"College of Electronic Information Engineering, Taiyuan University of Technology, Jinzhong 030600, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fujiang","family":"Li","sequence":"additional","affiliation":[{"name":"College of Electronic Information Engineering, Taiyuan University of Technology, Jinzhong 030600, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Huizhi","family":"Liang","sequence":"additional","affiliation":[{"name":"School of Computing, Newcastle University, Newcastle NE1 7RU, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2026,4,2]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1016\/j.inffus.2017.02.003","article-title":"A review of affective computing: From unimodal analysis to multimodal fusion","volume":"37","author":"Poria","year":"2017","journal-title":"Inf. Fusion"},{"key":"ref_2","first-page":"29","article-title":"Modeling artificial emotions in PAD emotional space and human-computer interaction experiments","volume":"51","author":"Wu","year":"2019","journal-title":"J. Harbin Inst. Technol."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"692","DOI":"10.1016\/j.future.2017.10.028","article-title":"Modeling public mood and emotion: Blog and news sentiment and socio-economic phenomena","volume":"96","author":"Chen","year":"2019","journal-title":"Future Gener. Comput. Syst."},{"key":"ref_4","first-page":"12136","article-title":"Deep multimodal multilinear fusion with high-order polynomial pooling","volume":"32","author":"Hou","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Tsai, Y.-H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.-P., and Salakhutdinov, R. (2019). Multimodal transformer for unaligned multimodal language sequences. 
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics.","DOI":"10.18653\/v1\/P19-1656"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Hazarika, D., Zimmermann, R., and Poria, S. (2020). MISA: Modality-invariant and modality-specific representations for multimodal sentiment analysis. Proceedings of the 28th ACM International Conference on Multimedia, Association for Computing Machinery.","DOI":"10.1145\/3394171.3413678"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Ji, H., Li, X., Li, M., Zhao, M., and Gao, C. (2025). Hybrid relational graphs with sentiment-laden semantic alignment for multimodal emotion recognition in conversation. Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI), International Joint Conferences on Artificial Intelligence Organization.","DOI":"10.24963\/ijcai.2025\/331"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"127181","DOI":"10.1016\/j.neucom.2023.127181","article-title":"Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis","volume":"572","author":"Wang","year":"2024","journal-title":"Neurocomputing"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"122","DOI":"10.1109\/TMM.2019.2925966","article-title":"Locally confined modality fusion network with a global perspective for multimodal human affective computing","volume":"22","author":"Mai","year":"2020","journal-title":"IEEE Trans. Multimed."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"108339","DOI":"10.1016\/j.engappai.2024.108339","article-title":"Using transformers for multimodal emotion recognition: Taxonomies and state of the art review","volume":"133","author":"Hazmoune","year":"2024","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Yang, K., Xu, H., and Gao, K. (2020, January 12\u201316). 
Cm-bert: Cross-modal bert for text-audio sentiment analysis. Proceedings of the 28th ACM International Conference on Multimedia (MM \u201820), Seattle, WA, USA.","DOI":"10.1145\/3394171.3413690"},{"key":"ref_12","unstructured":"Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020, January 12\u201318). A simple framework for contrastive learning of visual representations. Proceedings of the International Conference on Machine Learning (ICML), Virtual. Available online: https:\/\/proceedings.mlr.press\/v119\/chen20j.html."},{"key":"ref_13","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning (ICML), Virtual. Available online: https:\/\/proceedings.mlr.press\/v139\/radford21a.html."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Chen, S., Xie, S., and He, K. (2021, January 11\u201317). An empirical study of training self-supervised vision transformers. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), Virtual.","DOI":"10.1109\/ICCV48922.2021.00950"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"4795","DOI":"10.1109\/TMM.2025.3543029","article-title":"Facial expression recognition with heatmap neighbor contrastive learning","volume":"27","author":"Liu","year":"2025","journal-title":"IEEE Trans. Multimed."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"2506913","DOI":"10.1109\/TIM.2025.3533618","article-title":"Contrastive learning of EEG representation of brain area for emotion recognition","volume":"74","author":"Dai","year":"2025","journal-title":"IEEE Trans. Instrum. 
Meas."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1716","DOI":"10.1109\/TAFFC.2025.3535542","article-title":"Multi-scale hyperbolic contrastive learning for cross-subject EEG emotion recognition","volume":"16","author":"Chang","year":"2025","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_18","first-page":"37","article-title":"Review on Speech Emotion Recognition","volume":"25","author":"Han","year":"2014","journal-title":"J. Softw."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., and Weiss, B. (2005, January 4\u20138). A database of German emotional speech. Proceedings of the Interspeech, Lisbon, Portugal.","DOI":"10.21437\/Interspeech.2005-446"},{"key":"ref_20","unstructured":"Grimm, M., Kroschel, K., and Narayanan, S. (April, January 23). The Vera am Mittag German audio-visual emotional speech database. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Hannover, Germany."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1109\/T-AFFC.2011.20","article-title":"The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent","volume":"3","author":"McKeown","year":"2012","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_22","unstructured":"Zadeh, A., Zellers, R., Pincus, E., and Morency, L.P. (2016). MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","article-title":"IEMOCAP: Interactive emotional dyadic motion capture database","volume":"42","author":"Busso","year":"2008","journal-title":"Lang. Resour. Eval."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Ringeval, F., Sonderegger, A., Sauer, J., and Lalanne, D. (2013, January 22\u201326). 
Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG), Shanghai, China.","DOI":"10.1109\/FG.2013.6553805"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"606","DOI":"10.1109\/TAFFC.2023.3286351","article-title":"MGEED: A multimodal genuine emotion and expression detection database","volume":"15","author":"Wang","year":"2024","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Lee, S., Yildirim, S., Kazemzadeh, A., and Narayanan, S. (2005, January 4\u20138). An articulatory study of emotional speech production. Proceedings of the Interspeech, Lisbon, Portugal.","DOI":"10.21437\/Interspeech.2005-325"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1411","DOI":"10.1121\/1.4908284","article-title":"A kinematic study of critical and non-critical articulators in emotional speech production","volume":"137","author":"Kim","year":"2015","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1016\/S0167-6393(02)00082-1","article-title":"The role of voice quality in communicating emotion, mood and attitude","volume":"40","author":"Gobl","year":"2003","journal-title":"Speech Commun."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1321","DOI":"10.1121\/1.1646401","article-title":"On the use of the derivative of electroglottographic signals for characterization of nonpathological phonation","volume":"115","author":"Henrich","year":"2004","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_30","unstructured":"Mehrabian, A., and Russell, J.A. (1974). An Approach to Environmental Psychology, The MIT Press."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Meenakshi, N., Yarra, C., Yamini, B.K., and Ghosh, P.K. (2014, January 14\u201318). 
Comparison of speech quality with and without sensors in electromagnetic articulograph AG 501 recording. Proceedings of the Interspeech, Singapore.","DOI":"10.21437\/Interspeech.2014-243"},{"key":"ref_32","first-page":"380","article-title":"Design of speech database combining discrete labels and dimensional space","volume":"37","author":"Chen","year":"2018","journal-title":"Tech. Acoust."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1002\/1097-4679(198001)36:1<215::AID-JCLP2270360127>3.0.CO;2-6","article-title":"Induction of mood and mood shift","volume":"36","author":"Brewer","year":"1980","journal-title":"J. Clin. Psychol."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"261","DOI":"10.1007\/BF02686918","article-title":"Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament","volume":"14","author":"Mehrabian","year":"1996","journal-title":"Curr. Psychol."},{"key":"ref_35","unstructured":"Fant, G. (1960). Acoustic Theory of Speech Production: With Calculations Based on X-Ray Studies of Russian Articulations, Mouton."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Cai, Z., Qin, X., Cai, D., Li, M., Liu, X., and Zhong, H. (2018, January 26\u201329). The DKU-JNU-EMA Electromagnetic Articulography Database on Mandarin and Chinese Dialects with tandem feature based acoustic-to-articulatory inversion. Proceedings of the 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan.","DOI":"10.1109\/ISCSLP.2018.8706629"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"1793","DOI":"10.3923\/itj.2014.1793.1799","article-title":"The acoustics properties of the nasals and nasalization in Standard Chinese","volume":"13","author":"Li","year":"2014","journal-title":"Inf. Technol. 
J."},{"key":"ref_38","first-page":"448","article-title":"Speech emotion database oriented to emotional change detection","volume":"38","author":"Zhang","year":"2021","journal-title":"Comput. Simul."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"547","DOI":"10.1044\/1092-4388(2008\/07-0218)","article-title":"Accuracy assessment for AG500, electromagnetic articulograph","volume":"52","author":"Yunusova","year":"2009","journal-title":"J. Speech Lang. Hear. Res."},{"key":"ref_40","first-page":"51","article-title":"Establishment of an emotional speech database based on the fuzzy comprehensive evaluation method","volume":"39","author":"Song","year":"2016","journal-title":"Mod. Electron. Technol."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1016\/j.jvoice.2010.09.009","article-title":"Measures of vocal attack time for healthy young adults","volume":"26","author":"Roark","year":"2012","journal-title":"J. Voice"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1016\/0092-6566(77)90037-X","article-title":"Evidence for a three-factor theory of emotions","volume":"11","author":"Russell","year":"1977","journal-title":"J. Res. Pers."},{"key":"ref_43","first-page":"40","article-title":"Revision of the Chinese Facial Affective Picture System","volume":"25","author":"Gong","year":"2011","journal-title":"Chin. Ment. Health J."},{"key":"ref_44","first-page":"821","article-title":"Effect of neuroticism on depressive symptoms in officers and soldiers: The mediating effect of negative automatic thoughts and psychological stress response","volume":"43","author":"Ge","year":"2022","journal-title":"J. Nav. Med. Univ."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Schuller, B., Steidl, S., and Batliner, A. (2009, January 6\u201310). The Interspeech 2009 emotion challenge. 
Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), Brighton, UK.","DOI":"10.21437\/Interspeech.2009-103"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Wang, F., and Liu, H. (2021, January 20\u201325). Understanding the behaviour of contrastive loss. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00252"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Chen, L.W., and Rudnicky, A. (2023, January 4\u201310). Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10095036"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Wang, N., and Yang, D. (2025). Speech emotion recognition using fine-tuned Wav2vec2.0 and neural controlled differential equations classifier. PLoS ONE, 20.","DOI":"10.1371\/journal.pone.0318297"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Ma, Z., Zheng, Z., Ye, J., Li, J., Gao, Z., Zhang, S., and Chen, X. (2024, January 11\u201316). emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. 
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand.","DOI":"10.18653\/v1\/2024.findings-acl.931"}],"container-title":["Multimodal Technologies and Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2414-4088\/10\/4\/38\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,6]],"date-time":"2026-04-06T04:14:54Z","timestamp":1775448894000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2414-4088\/10\/4\/38"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,4,2]]},"references-count":49,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2026,4]]}},"alternative-id":["mti10040038"],"URL":"https:\/\/doi.org\/10.3390\/mti10040038","relation":{},"ISSN":["2414-4088"],"issn-type":[{"value":"2414-4088","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,4,2]]}}}