{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T15:51:37Z","timestamp":1774453897744,"version":"3.50.1"},"reference-count":151,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T00:00:00Z","timestamp":1762905600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Research University Higher School of Economics"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Bimodal emotion recognition based on audio and text is widely adopted in video-constrained real-world applications such as call centers and voice assistants. However, existing systems suffer from limited cross-domain generalization and monolingual bias. To address these limitations, a cross-lingual bimodal emotion recognition method is proposed, integrating Mamba-based temporal encoders for audio (Wav2Vec2.0) and text (Jina-v3) with a Transformer-based cross-modal fusion architecture (BiFormer). Three corpus-adaptive augmentation strategies are introduced: (1) Stacked Data Sampling, in which short utterances are concatenated to stabilize sequence length; (2) Label Smoothing Generation based on Large Language Model, where the Qwen3-4B model is prompted to detect subtle emotional cues missed by annotators, producing soft labels that reflect latent emotional co-occurrences; and (3) Text-to-Utterance Generation, in which emotionally labeled utterances are generated by ChatGPT-5 and synthesized into speech using the DIA-TTS model, enabling controlled creation of affective audio\u2013text pairs without human annotation. BiFormer is trained jointly on the English Multimodal EmotionLines Dataset and the Russian Emotional Speech Dialogs corpus, enabling cross-lingual transfer without parallel data. Experimental results show that the optimal data augmentation strategy is corpus-dependent: Stacked Data Sampling achieves the best performance on short, noisy English utterances, while Label Smoothing Generation based on Large Language Model better captures nuanced emotional expressions in longer Russian utterances. Text-to-Utterance Generation does not yield a measurable gain due to current limitations in expressive speech synthesis. When combined, the two best performing strategies produce complementary improvements, establishing new state-of-the-art performance in both monolingual and cross-lingual settings.<\/jats:p>","DOI":"10.3390\/bdcc9110285","type":"journal-article","created":{"date-parts":[[2025,11,13]],"date-time":"2025-11-13T09:59:09Z","timestamp":1763027949000},"page":"285","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Cross-Lingual Bimodal Emotion Recognition with LLM-Based Label Smoothing"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4135-6949","authenticated-orcid":false,"given":"Elena","family":"Ryumina","sequence":"first","affiliation":[{"name":"LEYA Lab for NLP, HSE University, 199106 St. Petersburg, Russia"},{"name":"Speech and Multimodal Interfaces Laboratory, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. 
Petersburg, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7479-2851","authenticated-orcid":false,"given":"Alexandr","family":"Axyonov","sequence":"additional","affiliation":[{"name":"LEYA Lab for NLP, HSE University, 199106 St. Petersburg, Russia"},{"name":"Speech and Multimodal Interfaces Laboratory, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9439-1813","authenticated-orcid":false,"given":"Timur","family":"Abdulkadirov","sequence":"additional","affiliation":[{"name":"LEYA Lab for NLP, HSE University, 199106 St. Petersburg, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-6207-8413","authenticated-orcid":false,"given":"Darya","family":"Koryakovskaya","sequence":"additional","affiliation":[{"name":"LEYA Lab for NLP, HSE University, 199106 St. Petersburg, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7935-0569","authenticated-orcid":false,"given":"Dmitry","family":"Ryumin","sequence":"additional","affiliation":[{"name":"LEYA Lab for NLP, HSE University, 199106 St. Petersburg, Russia"},{"name":"Speech and Multimodal Interfaces Laboratory, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"715","DOI":"10.1109\/TCDS.2021.3071170","article-title":"Comparing Recognition Performance and Robustness of Multimodal Deep Learning Models for Multimodal Emotion Recognition","volume":"14","author":"Liu","year":"2021","journal-title":"IEEE Trans. Cogn. Dev. Syst."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"102218","DOI":"10.1016\/j.inffus.2023.102218","article-title":"Multimodal Emotion Recognition with Deep Learning: Advancements, Challenges, and Future Directions","volume":"105","author":"Geetha","year":"2024","journal-title":"Inf. Fusion"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Wu, Y., Mi, Q., and Gao, T. (2025). A Comprehensive Review of Multimodal Emotion Recognition: Techniques, Challenges, and Future Directions. Biomimetics, 10.","DOI":"10.3390\/biomimetics10070418"},{"key":"ref_4","unstructured":"Ai, W., Zhang, F., Shou, Y., Meng, T., Chen, H., and Li, K. (March, January 25). Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum. Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"64330","DOI":"10.1109\/ACCESS.2025.3559339","article-title":"BiMER: Design and Implementation of a Bimodal Emotion Recognition System Enhanced by Data Augmentation Techniques","volume":"13","author":"Dikbiyik","year":"2025","journal-title":"IEEE Access"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Khan, M., Tran, P.N., Pham, N.T., El Saddik, A., and Othmani, A. (2025). MemoCMT: Multimodal Emotion Recognition using Cross-Modal Transformer-based Feature Fusion. Sci. Rep., 15.","DOI":"10.1038\/s41598-025-89202-x"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1016\/j.inffus.2017.02.003","article-title":"A Review of Affective Computing: From Unimodal Analysis to Multimodal Fusion","volume":"37","author":"Poria","year":"2017","journal-title":"Inf. 
Fusion"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"122579","DOI":"10.1016\/j.eswa.2023.122579","article-title":"Deep CNN with Late Fusion for Real Time Multimodal Emotion Recognition","volume":"240","author":"Dixit","year":"2024","journal-title":"Expert Syst. Appl."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1007\/s13735-025-00362-y","article-title":"PAMoE-MSA: Polarity-Aware Mixture of Experts Network for Multimodal Sentiment Analysis","volume":"14","author":"Huang","year":"2025","journal-title":"Int. J. Multimed. Inf. Retr."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Li, Q., Gkoumas, D., Sordoni, A., Nie, J.Y., and Melucci, M. (2021, January 2\u20139). Quantum-Inspired Neural Network for Conversational Emotion Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada.","DOI":"10.1609\/aaai.v35i15.17567"},{"key":"ref_11","unstructured":"Li, B., Fei, H., Liao, L., Zhao, Y., Teng, C., Chua, T.S., Ji, D., and Li, F. (November, January 29). Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition. Proceedings of the ACM International Conference on Multimedia, Ottawa, ON, Canada."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., and Fidler, S. (2016, January 27\u201330). MovieQA: Understanding Stories in Movies through Question-Answering. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.501"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Lin, Z., Madotto, A., Shin, J., Xu, P., and Fung, P. (2019, January 3\u20137). MoEL: Mixture of Empathetic Listeners. Proceedings of the Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1012"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"17787","DOI":"10.1007\/s00521-024-10262-7","article-title":"A Review of Research on Micro-Expression Recognition Algorithms based on Deep Learning","volume":"36","author":"Zhang","year":"2024","journal-title":"Neural Comput. Appl."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Gao, Y., Shi, H., Chu, C., and Kawahara, T. (2024, January 1\u20135). Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and Interaction. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-2385"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1007\/s13735-023-00315-3","article-title":"An Emotion-Driven, Transformer-based Network for Multimodal Fake News Detection","volume":"13","author":"Yadav","year":"2024","journal-title":"Int. J. Multimed. Inf. Retr."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Ryumina, E., Markitantov, M., Ryumin, D., Kaya, H., and Karpov, A. (2024, January 17\u201321). Zero-Shot Audio-Visual Compound Expression Recognition Method based on Emotion Probability Fusion. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA.","DOI":"10.1109\/CVPRW63382.2024.00478"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"192","DOI":"10.1016\/j.patrec.2025.02.024","article-title":"Multi-Corpus Emotion Recognition Method based on Cross-Modal Gated Attention Fusion","volume":"190","author":"Ryumina","year":"2025","journal-title":"Pattern Recognit. Lett."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Ekman, P., Dalgleish, T., and Power, M. (1999). Handbook of Cognition and Emotion, Wiley Online Library.","DOI":"10.1002\/0470013494"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1161","DOI":"10.1037\/h0077714","article-title":"A Circumplex Model of Affect","volume":"39","author":"Russell","year":"1980","journal-title":"J. Personal. Soc. Psychol."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"107901","DOI":"10.1016\/j.neunet.2025.107901","article-title":"DialogueLLM: Context and Emotion Knowledge-tuned Large Language Models for Emotion Recognition in Conversations","volume":"192","author":"Zhang","year":"2025","journal-title":"Neural Netw."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"e13403","DOI":"10.1111\/exsy.13403","article-title":"Evaluating Significant Features in Context-Aware Multimodal Emotion Recognition with XAI Methods","volume":"42","author":"Khalane","year":"2025","journal-title":"Expert Syst."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"103268","DOI":"10.1016\/j.inffus.2025.103268","article-title":"RMER-DT: Robust Multimodal Emotion Recognition in Conversational Contexts based on Diffusion and Transformers","volume":"123","author":"Zhu","year":"2025","journal-title":"Inf. Fusion"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Liang, Y., Wang, Z., Liu, F., Liu, M., and Yao, Y. (2025, January 11\u201312). Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA.","DOI":"10.1109\/CVPRW67362.2025.00562"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"7659","DOI":"10.1109\/TCYB.2022.3195739","article-title":"AIA-Net: Adaptive Interactive Attention Network for Text\u2013Audio Emotion Recognition","volume":"53","author":"Zhang","year":"2023","journal-title":"IEEE Trans. Cybern."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1181","DOI":"10.1109\/LSP.2025.3550007","article-title":"Adaptive Alignment and Time Aggregation Network for Speech-Visual Emotion Recognition","volume":"32","author":"Wu","year":"2025","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Mote, P., Sisman, B., and Busso, C. (2024, January 1\u20135). Unsupervised Domain Adaptation for Speech Emotion Recognition using K-Nearest Neighbors Voice Conversion. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-1248"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Lu, C., Zong, Y., Zhao, Y., Lian, H., Qi, T., Schuller, B., and Zheng, W. (2024, January 1\u20135). Hierarchical Distribution Adaptation for Unsupervised Cross-Corpus Speech Emotion Recognition. 
Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-1948"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"102711","DOI":"10.1016\/j.inffus.2024.102711","article-title":"Multiplex Graph Aggregation and Feature Refinement for Unsupervised Incomplete Multimodal Emotion Recognition","volume":"114","author":"Deng","year":"2025","journal-title":"Inf. Fusion"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"107543","DOI":"10.1016\/j.neunet.2025.107543","article-title":"Group-Wise Relation Mining for Weakly-Supervised Fine-Grained Multimodal Emotion Recognition","volume":"190","author":"Jin","year":"2025","journal-title":"Neural Netw."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Tran, M., Yin, Y., and Soleymani, M. (2025). SetPeER: Set-based Personalized Emotion Recognition with Weak Supervision. IEEE Trans. Affect. Comput., 1\u201315.","DOI":"10.1109\/TAFFC.2025.3568024"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"10745","DOI":"10.1109\/TPAMI.2023.3263585","article-title":"Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap","volume":"45","author":"Wagner","year":"2023","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_33","unstructured":"Lubenets, I., Davidchuk, N., and Amentes, A. (2025, September 28). Aniemore: A Toolkit for Animation and Emotion Recognition. GitHub Repository. Available online: https:\/\/github.com\/aniemore\/Aniemore."},{"key":"ref_34","unstructured":"Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., and Mihalcea, R. (August, January 28). MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Yi, Y., Zhou, Y., Wang, T., and Zhou, J. (2025). Advances in Video Emotion Recognition: Challenges and Trends. Sensors, 25.","DOI":"10.3390\/s25123615"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"505","DOI":"10.1109\/TAFFC.2018.2874986","article-title":"Survey on Emotional Body Gesture Recognition","volume":"12","author":"Noroozi","year":"2018","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1109\/TAFFC.2021.3053275","article-title":"A Survey of Textual Emotion Recognition and its Challenges","volume":"14","author":"Deng","year":"2021","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Garc\u00eda-Hern\u00e1ndez, R.A., Luna-Garc\u00eda, H., Celaya-Padilla, J.M., Garc\u00eda-Hern\u00e1ndez, A., Reveles-G\u00f3mez, L.C., Flores-Chaires, L.A., Delgado-Contreras, J.R., Rondon, D., and Villalba-Condori, K.O. (2024). A Systematic Literature Review of Modalities, Trends, and Limitations in Emotion Recognition, Affective Computing, and Sentiment Analysis. Appl. 
Sci., 14.","DOI":"10.3390\/app14167165"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"572","DOI":"10.1016\/j.patcog.2010.09.020","article-title":"Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases","volume":"44","author":"Kamel","year":"2011","journal-title":"Pattern Recognit."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"117327","DOI":"10.1109\/ACCESS.2019.2936124","article-title":"Speech Emotion Recognition using Deep Learning Techniques: A Review","volume":"7","author":"Khalil","year":"2019","journal-title":"IEEE Access"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"47795","DOI":"10.1109\/ACCESS.2021.3068045","article-title":"A Comprehensive Review of Speech Emotion Recognition Systems","volume":"9","author":"Wani","year":"2021","journal-title":"IEEE Access"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"106764","DOI":"10.1016\/j.neunet.2024.106764","article-title":"HiMul-LGG: A Hierarchical Decision Fusion-based Local-Global Graph Neural Network for Multimodal Emotion Recognition in Conversation","volume":"181","author":"Fu","year":"2025","journal-title":"Neural Netw."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Dutta, S., and Ganapathy, S. (2025, January 6\u201311). LLM Supervised Pre-training for Multimodal Emotion Recognition in Conversations. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India.","DOI":"10.1109\/ICASSP49660.2025.10889998"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Zhang, X., and Li, Y. (2023, January 20\u201324). A Dual Attention-based Modality-Collaborative Fusion Network for Emotion Recognition. Proceedings of the Interspeech, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-523"},{"key":"ref_45","unstructured":"Wang, Y., Li, Y., and Cui, Z. (2023, January 10\u201316). Incomplete Multimodality-Diffused Emotion Recognition. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"102663","DOI":"10.1016\/j.inffus.2024.102663","article-title":"Triple Disentangled Representation Learning for Multimodal Affective Analysis","volume":"114","author":"Zhou","year":"2025","journal-title":"Inf. 
Fusion"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"26","DOI":"10.1109\/MCS.2025.3534477","article-title":"Unsupervised Representation Learning in Deep Reinforcement Learning: A Review","volume":"45","author":"Botteghi","year":"2025","journal-title":"IEEE Control Syst."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"129261","DOI":"10.1016\/j.neucom.2024.129261","article-title":"UA-FER: Uncertainty-Aware Representation Learning for Facial Expression Recognition","volume":"621","author":"Zhou","year":"2025","journal-title":"Neurocomputing"},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"126649","DOI":"10.1016\/j.neucom.2023.126649","article-title":"A Multimodal Fusion Emotion Recognition Method based on Multitask Learning and Attention Mechanism","volume":"556","author":"Xie","year":"2023","journal-title":"Neurocomputing"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"103976","DOI":"10.1109\/ACCESS.2024.3430850","article-title":"A Systematic Review on Multimodal Emotion Recognition: Building Blocks, Current State, Applications, and Challenges","volume":"12","author":"Kalateh","year":"2024","journal-title":"IEEE Access"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Bujnowski, P., Kuzma, B., Paziewski, B., Rutkowski, J., Marhula, J., Bordzicka, Z., and Andruszkiewicz, P. (2024, January 1\u20135). SAMSEMO: New Dataset for Multilingual and Multimodal Emotion Recognition. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-212"},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","article-title":"IEMOCAP: Interactive emotional dyadic motion capture database","volume":"42","author":"Busso","year":"2008","journal-title":"Lang. Resour. Eval."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Bagher Zadeh, A., Liang, P.P., Poria, S., Cambria, E., and Morency, L.P. (2018, January 15\u201320). Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia.","DOI":"10.18653\/v1\/P18-1208"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Fan, W., Xu, X., Xing, X., Chen, W., and Huang, D. (2021, January 6\u201311). LSSED: A Large-Scale Dataset and Benchmark for Speech Emotion Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9414542"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Nojavanasghari, B., Baltru\u0161aitis, T., Hughes, C.E., and Morency, L.P. (2016, January 12\u201316). EmoReact: A Multimodal Approach and Dataset for Recognizing Emotional Responses in Children. Proceedings of the ACM International Conference on Multimodal Interaction (ICMI), Tokyo, Japan.","DOI":"10.1145\/2993148.2993168"},{"key":"ref_56","unstructured":"Sokolov, A., Minkin, F., Savushkin, N., Karpov, N., Kutuzov, O., and Kondratenko, V. (2025, September 28). Dusha Dataset. GitHub Repository. 
Available online: https:\/\/github.com\/salute-developers\/golos\/tree\/master\/dusha#dusha-dataset."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"1162","DOI":"10.1016\/j.specom.2006.04.003","article-title":"Emotional Speech Recognition: Resources, Features, and Methods","volume":"48","author":"Ververidis","year":"2006","journal-title":"Speech Commun."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Han, K., Yu, D., and Tashev, I. (2014, January 14\u201318). Speech Emotion Recognition using Deep Neural Network and Extreme Learning Machine. Proceedings of the Interspeech, Singapore.","DOI":"10.21437\/Interspeech.2014-57"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Papakostas, M., Spyrou, E., Giannakopoulos, T., Siantikos, G., Sgouropoulos, D., Mylonas, P., and Makedon, F. (2017). Deep Visual Attributes vs. Hand-Crafted Audio Features on Multidomain Speech Emotion Recognition. Computation, 5.","DOI":"10.3390\/computation5020026"},{"key":"ref_60","doi-asserted-by":"crossref","first-page":"4908","DOI":"10.1109\/TNNLS.2024.3367940","article-title":"DER-GCN: Dialog and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialog Emotion Recognition","volume":"36","author":"Ai","year":"2025","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Yaddaden, Y. (2025). Efficient Dynamic Emotion Recognition from Facial Expressions using Statistical Spatio-Temporal Geometric Features. Big Data Cogn. Comput., 9.","DOI":"10.20944\/preprints202505.1095.v1"},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Labib, F.H., Elagamy, M., and Saleh, S.N. (2025). EmoBERTa-X: Advanced Emotion Classifier with Multi-Head Attention and DES for Multilabel Emotion Classification. Big Data Cogn. Comput., 9.","DOI":"10.3390\/bdcc9020048"},{"key":"ref_63","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA."},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzm\u00e1n, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2020, January 5\u201310). Unsupervised Cross-Lingual Representation Learning at Scale. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Seattle, WA, USA.","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"ref_65","unstructured":"Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (May, January 26). ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia."},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Sturua, S., Mohr, I., Kalim Akram, M., G\u00fcnther, M., Wang, B., Krimmel, M., Wang, F., Mastrapas, G., Koukounas, A., and Wang, N. (2025, January 6\u201310). Jina Embeddings V3: Multilingual Text Encoder with Low-Rank Adaptations. Proceedings of the European Conference on Information Retrieval (ECIR), Lucca, Italy.","DOI":"10.1007\/978-3-031-88720-8_21"},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019, January 15\u201319). 
Wav2vec: Unsupervised Pre-Training for Speech Recognition. Proceedings of the Interspeech, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-1873"},{"key":"ref_68","unstructured":"Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020, January 6\u201312). Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual-only."},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Li, F., Luo, J., and Xia, W. (2025, January 8\u201310). WavFusion: Towards Wav2vec 2.0 Multimodal Speech Emotion Recognition. Proceedings of the MultiMedia Modeling, Nara, Japan.","DOI":"10.1007\/978-981-96-2071-5_24"},{"key":"ref_70","doi-asserted-by":"crossref","unstructured":"Amiriparian, S., Packa\u0144, F., Gerczuk, M., and Schuller, B.W. (2024, January 1\u20135). ExHuBERT: Enhancing HuBERT through Block Extension and Fine-Tuning on 37 Emotion Datasets. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-280"},{"key":"ref_71","unstructured":"Radford, A., Kim, J.W., Xu, T., Brockman, G., Mcleavey, C., and Sutskever, I. (2023, January 23\u201329). Robust Speech Recognition via Large-Scale Weak Supervision. Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA."},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Goron, E., Asai, L., Rut, E., and Dinov, M. (2024, January 14\u201319). Improving Domain Generalization in Speech Emotion Recognition with Whisper. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10446997"},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Fukuda, R., Kano, T., Ando, A., and Ogawa, A. (2025, January 6\u201311). Speech Emotion Recognition based on Large-Scale Automatic Speech Recognizer. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India.","DOI":"10.1109\/ICASSP49660.2025.10889314"},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Kim, K., and Cho, N. (2023, January 20\u201324). Focus-Attention-Enhanced Crossmodal Transformer with Metric Learning for Multimodal Speech Emotion Recognition. Proceedings of the Interspeech, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-555"},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Leem, S.G., Fulford, D., Onnela, J.P., Gard, D., and Busso, C. (2023, January 20\u201324). Computation and Memory Efficient Noise Adaptation of Wav2Vec2.0 for Noisy Speech Emotion Recognition with Skip Connection Adapters. Proceedings of the Interspeech, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-1034"},{"key":"ref_76","doi-asserted-by":"crossref","unstructured":"Luo, J., Phan, H., and Reiss, J. (2023, January 20\u201324). Fine-tuned RoBERTa Model with a CNN-LSTM Network for Conversational Emotion Recognition. Proceedings of the Interspeech, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-463"},{"key":"ref_77","unstructured":"Zhao, J., Wei, X., and Bo, L. (2025, September 28). R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning. GitHub Repository. Available online: https:\/\/github.com\/HumanMLLM\/R1-Omni."},{"key":"ref_78","doi-asserted-by":"crossref","unstructured":"Li, Y., Wang, Y., and Cui, Z. (2023, January 18\u201322). Decoupled Multimodal Distilling for Emotion Recognition. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00641"},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Shi, H., Liang, Z., and Yu, J. (2024, January 1\u20135). Emotional Cues Extraction and Fusion for Multi-Modal Emotion Prediction and Recognition in Conversation. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-1688"},{"key":"ref_80","doi-asserted-by":"crossref","first-page":"111340","DOI":"10.1016\/j.patcog.2024.111340","article-title":"FrameERC: Framelet Transform based Multimodal Graph Neural Networks for Emotion Recognition in Conversation","volume":"161","author":"Li","year":"2025","journal-title":"Pattern Recognit."},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Li, W., Zhou, H., Yu, J., Song, Z., and Yang, W. (2024, January 10\u201315). Coupled Mamba: Enhanced Multimodal Fusion with Coupled State Space Model. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada.","DOI":"10.52202\/079017-1910"},{"key":"ref_82","unstructured":"Wang, J., Paliotta, D., May, A., Rush, A.M., and Dao, T. (2024, January 10\u201315). The Mamba in the Llama: Distilling and Accelerating Hybrid Models. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada."},{"key":"ref_83","unstructured":"Zou, J., Liao, B., Zhang, Q., Liu, W., and Wang, X. (2025, September 28). OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models. GitHub Repository. Available online: https:\/\/github.com\/hustvl\/OmniMamba."},{"key":"ref_84","unstructured":"Yang, S., Kautz, J., and Hatamizadeh, A. (2025, January 24\u201328). Gated Delta Networks: Improving Mamba2 with Delta Rule. Proceedings of the International Conference on Learning Representations (ICLR), Singapore."},{"key":"ref_85","doi-asserted-by":"crossref","unstructured":"Yang, Z., and Hirschberg, J. (2018, January 2\u20136). Predicting Arousal and Valence from Waveforms and Spectrograms using Deep Neural Networks. Proceedings of the Interspeech, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-2397"},{"key":"ref_86","doi-asserted-by":"crossref","unstructured":"Delbrouck, J.B., Tits, N., Brousmiche, M., and Dupont, S. (2020, January 10). A Transformer-based Joint-Encoding for Emotion Recognition and Sentiment Analysis. Proceedings of the Grand-Challenge and Workshop on Multimodal Language (Challenge-HML), Seattle, WA, USA.","DOI":"10.18653\/v1\/2020.challengehml-1.1"},{"key":"ref_87","unstructured":"Padi, S., Sadjadi, S.O., Manocha, D., and Sriram, R.D. (July, January 28). Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based Models. Proceedings of the Speaker and Language Recognition Workshop (Odyssey), Beijing, China."},{"key":"ref_88","unstructured":"Lei, S., Dong, G., Wang, X., Wang, K., and Wang, S. (2025, September 28). InstructERC: Reforming Emotion Recognition in Conversation with Multi-Task Retrieval-Augmented Large Language Models. GitHub Repository. Available online: https:\/\/github.com\/LIN-SHANG\/InstructERC."},{"key":"ref_89","doi-asserted-by":"crossref","unstructured":"Wang, S., Gudnason, J., and Borth, D. (2023, January 20\u201324). Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech. 
Proceedings of the Interspeech, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-1595"},{"key":"ref_90","doi-asserted-by":"crossref","unstructured":"Lee, S.w. (2023, January 20\u201324). Diverse Feature Mapping and Fusion via Multitask Learning for Multilingual Speech Emotion Recognition. Proceedings of the Interspeech, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-1425"},{"key":"ref_91","doi-asserted-by":"crossref","unstructured":"Gong, T., Belanich, J., Somandepalli, K., Nagrani, A., Eoff, B., and Jou, B. (2023, January 20\u201324). LanSER: Language-Model Supported Speech Emotion Recognition. Proceedings of the Interspeech, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-1832"},{"key":"ref_92","doi-asserted-by":"crossref","unstructured":"Mai, J., Xing, X., Chen, W., and Xu, X. (2024, January 1\u20135). DropFormer: A Dynamic Noise-Dropping Transformer for Speech Emotion Recognition. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-651"},{"key":"ref_93","doi-asserted-by":"crossref","unstructured":"Zhao, Z., Gao, T., Wang, H., and Schuller, B. (2024, January 1\u20135). MFDR: Multiple-Stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-1735"},{"key":"ref_94","doi-asserted-by":"crossref","unstructured":"Garc\u00eda, R., Mahu, R., Gr\u00e1geda, N., Luzanto, A., Bohmer, N., Busso, C., and Becerra Yoma, N. (2024, January 1\u20135). Speech Emotion Recognition with Deep Learning Beamforming on a Distant Human-Robot Interaction Scenario. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-1273"},{"key":"ref_95","doi-asserted-by":"crossref","unstructured":"Sun, H., Zhang, F., Gao, Y., Zhang, S., Lian, Z., and Feng, J. (2024, January 1\u20135). MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-427"},{"key":"ref_96","doi-asserted-by":"crossref","unstructured":"Wu, H., Chou, H.C., Chang, K.W., Goncalves, L., Du, J., Jang, J.S.R., Lee, C.C., and Lee, H.Y. (2024, January 2\u20135). Open-Emotion: A Reproducible EMO-Superb for Speech Emotion Recognition Systems. Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Macao, China.","DOI":"10.1109\/SLT61566.2024.10832296"},{"key":"ref_97","doi-asserted-by":"crossref","unstructured":"Phukan, O.C., Kashyap, G.S., Buduru, A.B., and Sharma, R. (2024, January 1\u20135). Are Paralinguistic Representations All that is Needed for Speech Emotion Recognition?. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-2233"},{"key":"ref_98","doi-asserted-by":"crossref","unstructured":"Ritter-Gutierrez, F., Huang, K.P., Wong, J.H.M., Ng, D., Lee, H.-y., Chen, N.F., and Chng, E.S. (2024, January 1\u20135). Dataset-Distillation Generative Model for Speech Emotion Recognition. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-1430"},{"key":"ref_99","doi-asserted-by":"crossref","unstructured":"Leem, S.G., Fulford, D., Onnela, J.P., Gard, D., and Busso, C. (2024, January 1\u20135). Keep, Delete, or Substitute: Frame Selection Strategy for Noise-Robust Speech Emotion Recognition. 
Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-1218"},{"key":"ref_100","doi-asserted-by":"crossref","unstructured":"Ma, L., Shen, L., Li, R., Zhang, H., Qian, K., Hu, B., Schuller, B.W., and Yamamoto, Y. (2024, January 1\u20135). E-ODN: An Emotion Open Deep Network for Generalised and Adaptive Speech Emotion Recognition. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-685"},{"key":"ref_101","doi-asserted-by":"crossref","unstructured":"Huang, Z., Mak, M.W., and Lee, K.A. (2024, January 1\u20135). MM-NodeFormer: Node Transformer Multimodal Fusion for Emotion Recognition in Conversation. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-538"},{"key":"ref_102","doi-asserted-by":"crossref","unstructured":"Sun, H., Zhao, S., Li, S., Kong, X., Wang, X., Zhou, J., Kong, A., Chen, Y., Zeng, W., and Qin, Y. (2025, January 6\u201311). Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India.","DOI":"10.1109\/ICASSP49660.2025.10889485"},{"key":"ref_103","first-page":"1","article-title":"Multimodal Emotion-Cause pair Extraction with Holistic Interaction and Label Constraint","volume":"21","author":"Li","year":"2024","journal-title":"ACM Trans. Multimed. Comput. Commun. Appl."},{"key":"ref_104","doi-asserted-by":"crossref","unstructured":"Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Singh, K., von Platen, P., Saraf, Y., and Pino, J. (2022, January 18\u201322). XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale. Proceedings of the Interspeech, Incheon, Republic of Korea.","DOI":"10.21437\/Interspeech.2022-143"},{"key":"ref_105","doi-asserted-by":"crossref","first-page":"1505","DOI":"10.1109\/JSTSP.2022.3188113","article-title":"WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing","volume":"16","author":"Chen","year":"2022","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_106","first-page":"13","article-title":"M3GAT: A Multi-Modal, Multi-Task Interactive Graph Attention Network for Conversational Sentiment Analysis and Emotion Recognition","volume":"42","author":"Zhang","year":"2023","journal-title":"ACM Trans. Inf. Syst."},{"key":"ref_107","doi-asserted-by":"crossref","first-page":"103184","DOI":"10.1016\/j.specom.2024.103184","article-title":"AMGCN: An Adaptive Multi-Graph Convolutional Network for Speech Emotion Recognition","volume":"168","author":"Lian","year":"2025","journal-title":"Speech Commun."},{"key":"ref_108","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Zhou, Y., Yang, Y., Liu, Y., Huang, J., Zhao, S., Su, R., Wang, L., and Yan, N. (2025, January 17\u201321). Emotion-Guided Graph Attention Networks for Speech-based Depression Detection under Emotion-Inducting Tasks. Proceedings of the Interspeech, Rotterdam, The Netherlands.","DOI":"10.21437\/Interspeech.2025-1597"},{"key":"ref_109","doi-asserted-by":"crossref","first-page":"2184","DOI":"10.1109\/LSP.2025.3570245","article-title":"Multi-Stage Confidence-Guided Diffusion and Emotional Bidirectional Mamba for Robust Speech Emotion Recognition","volume":"32","author":"Liu","year":"2025","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_110","doi-asserted-by":"crossref","unstructured":"Zhang, T., Chen, Z., and Du, J. (2025, January 6\u20139). 
Multimodal Mamba Model for Emotion Recognition in Conversations. Proceedings of the International Conference on Machine Learning and Computing, Nanjing, China.","DOI":"10.1007\/978-3-031-94898-5_20"},{"key":"ref_111","doi-asserted-by":"crossref","unstructured":"Chen, G., Liao, Y., Zhang, D., Yang, W., Mai, Z., and Xu, C. (2025). Multimodal Emotion Recognition via the Fusion of Mamba and Liquid Neural Networks with Cross-Modal Alignment. Electronics, 14.","DOI":"10.3390\/electronics14183638"},{"key":"ref_112","unstructured":"Gu, A., and Dao, T. (2024, January 7\u20139). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. Proceedings of the Conference on Language Modeling (CoLM), University of Pennsylvania, Philadelphia, PA, USA."},{"key":"ref_113","unstructured":"Dao, T., and Gu, A. (2024, January 21\u201327). Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality. Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria."},{"key":"ref_114","unstructured":"Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022, January 25\u201329). LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of the International Conference on Learning Representations (ICLR), Virtual."},{"key":"ref_115","unstructured":"Mai, S., Zeng, Y., and Hu, H. (May, January 28). Learning by Comparing: Boosting Multimodal Affective Computing through Ordinal Learning. Proceedings of the ACM on Web Conference, Sydney, NSW, Australia."},{"key":"ref_116","doi-asserted-by":"crossref","unstructured":"Ma, F., Li, Y., Ni, S., Huang, S.L., and Zhang, L. (2022). Data Augmentation for Audio-Visual Emotion Recognition with an Efficient Multimodal Conditional GAN. Appl. Sci., 12.","DOI":"10.3390\/app12010527"},{"key":"ref_117","doi-asserted-by":"crossref","unstructured":"Tiwari, U., Soni, M., Chakraborty, R., Panda, A., and Kopparapu, S.K. (2020, January 4\u20138). Multi-Conditioning and Data Augmentation using Generative Noise Model for Speech Emotion Recognition in Noisy Conditions. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053581"},{"key":"ref_118","doi-asserted-by":"crossref","first-page":"2083","DOI":"10.1109\/TAI.2025.3537965","article-title":"A Multimodal-Driven Fusion Data Augmentation Framework for Emotion Recognition","volume":"6","author":"Li","year":"2025","journal-title":"IEEE Trans. Artif. Intell."},{"key":"ref_119","doi-asserted-by":"crossref","unstructured":"Bouchelligua, W., Al-Dayil, R., and Algaith, A. (2025). Effective Data Augmentation Techniques for Arabic Speech Emotion Recognition using Convolutional Neural Networks. Appl. Sci., 15.","DOI":"10.20944\/preprints202501.0126.v1"},{"key":"ref_120","doi-asserted-by":"crossref","first-page":"111647","DOI":"10.1109\/ACCESS.2025.3578143","article-title":"A Comprehensive Analysis of Data Augmentation Methods for Speech Emotion Recognition","volume":"13","author":"Avci","year":"2025","journal-title":"IEEE Access"},{"key":"ref_121","unstructured":"Zhang, H., Cisse, M., Dauphin, Y.N., and Lopez-Paz, D. (May, January 30). Mixup: Beyond Empirical Risk Minimization. Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada."},{"key":"ref_122","doi-asserted-by":"crossref","unstructured":"Malik, M.I., Latif, S., Jurdak, R., and Schuller, B.W. (2023, January 20\u201324). 
A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model. Proceedings of the Interspeech, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-1080"},{"key":"ref_123","doi-asserted-by":"crossref","unstructured":"Wang, Y., and Chen, L. (2025, January 3\u20137). Inversion Circle Interpolation: Diffusion-based Image Augmentation for Data-Scarce Classification. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Denver, Colorado.","DOI":"10.1109\/CVPR52734.2025.02380"},{"key":"ref_124","doi-asserted-by":"crossref","unstructured":"Su, X., Yang, B., Yi, X., and Cao, Y. (2025, January 17\u201321). DiffEmotionVC: A Dual-Granularity Disentangled Diffusion Framework for Any-to-Any Emotional Voice Conversion. Proceedings of the Interspeech, Rotterdam, The Netherlands.","DOI":"10.21437\/Interspeech.2025-1210"},{"key":"ref_125","doi-asserted-by":"crossref","unstructured":"Stanley, E., DeMattos, E., Klementiev, A., Ozimek, P., Clarke, G., Berger, M., and Palaz, D. (2023, January 20\u201324). Emotion Label Encoding using Word Embeddings for Speech Emotion Recognition. Proceedings of the Interspeech, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-1591"},{"key":"ref_126","doi-asserted-by":"crossref","unstructured":"Purohit, T., and Magimai-Doss, M. (2025, January 6\u201311). Emotion Information Recovery Potential of Wav2Vec2 Network Fine-tuned for Speech Recognition Task. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India.","DOI":"10.1109\/ICASSP49660.2025.10890800"},{"key":"ref_127","doi-asserted-by":"crossref","unstructured":"Wu, Y.T., Wu, J., Sethu, V., and Lee, C.C. (2024, January 1\u20135). Can Modelling Inter-Rater Ambiguity Lead to Noise-Robust Continuous Emotion Predictions?. Proceedings of the Interspeech, Kos, Greece.","DOI":"10.21437\/Interspeech.2024-482"},{"key":"ref_128","doi-asserted-by":"crossref","first-page":"125524","DOI":"10.1016\/j.eswa.2024.125524","article-title":"Leveraging Large Language Model ChatGPT for Enhanced Understanding of End-User Emotions in Social Media Feedbacks","volume":"261","author":"Khan","year":"2025","journal-title":"Expert Syst. Appl."},{"key":"ref_129","unstructured":"Muhammad, S.H., Ousidhoum, N., Abdulmumin, I., Yimam, S.M., Wahle, J.P., Lima Ruas, T., Beloucif, M., De Kock, C., Belay, T.D., and Ahmad, I.S. (August, January 31). SemEval-2025 Task 11: Bridging the Gap in Text-based Emotion Detection. Proceedings of the International Workshop on Semantic Evaluation (SemEval), Vienna, Austria."},{"key":"ref_130","doi-asserted-by":"crossref","unstructured":"Franceschini, R., Fini, E., Beyan, C., Conti, A., Arrigoni, F., and Ricci, E. (2022, January 21\u201325). Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss. Proceedings of the International Conference on Pattern Recognition (ICPR), Montr\u00e9al, QC, Canada.","DOI":"10.1109\/ICPR56361.2022.9956589"},{"key":"ref_131","doi-asserted-by":"crossref","first-page":"110261","DOI":"10.1016\/j.patcog.2024.110261","article-title":"EmoComicNet: A Multi-Task Model for Comic Emotion Recognition","volume":"150","author":"Dutta","year":"2024","journal-title":"Pattern Recognit."},{"key":"ref_132","doi-asserted-by":"crossref","unstructured":"Ryumin, D., Ivanko, D., and Ryumina, E. (2023). Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices. 
Sensors, 23.","DOI":"10.3390\/s23042284"},{"key":"ref_133","doi-asserted-by":"crossref","first-page":"2163","DOI":"10.1109\/TASLPRO.2025.3574878","article-title":"Resampling Filter Design for Multirate Neural Audio Effect Processing","volume":"33","author":"Carson","year":"2025","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_134","doi-asserted-by":"crossref","first-page":"471","DOI":"10.1109\/TAFFC.2017.2736999","article-title":"Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings","volume":"10","author":"Lotfian","year":"2019","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_135","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-Term Memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_136","first-page":"107547","article-title":"xLSTM: Extended Long Short-Term Memory","volume":"37","author":"Beck","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst. (NeurIPS)"},{"key":"ref_137","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is All You Need. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA."},{"key":"ref_138","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1162\/tacl_a_00448","article-title":"Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation","volume":"10","author":"Clark","year":"2022","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_139","unstructured":"M\u00fcller, R., Kornblith, S., and Hinton, G.E. (2019, January 8\u201314). When does Label Smoothing Help?. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada."},{"key":"ref_140","doi-asserted-by":"crossref","first-page":"5984","DOI":"10.1109\/TIP.2021.3089942","article-title":"Delving Deep Into Label Smoothing","volume":"30","author":"Zhang","year":"2021","journal-title":"IEEE Trans. Image Process."},{"key":"ref_141","doi-asserted-by":"crossref","unstructured":"Axyonov, A., Ryumin, D., Ivanko, D., Kashevnik, A., and Karpov, A. (2024, January 14\u201319). Audio-Visual Speech Recognition In-the-Wild: Multi-Angle Vehicle Cabin Corpus and Attention-based Method. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10448048"},{"key":"ref_142","doi-asserted-by":"crossref","unstructured":"Zhu, J., Zhao, S., Jiang, J., Xu, Z., Tang, W., and Yao, H. (2025, January 6\u201311). Learning Class Prototypes for Visual Emotion Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India.","DOI":"10.1109\/ICASSP49660.2025.10889974"},{"key":"ref_143","unstructured":"Ryumina, E., Markitantov, M., Axyonov, A., Ryumin, D., Dolgushin, M., and Karpov, A. (2025, January 19\u201323). Zero-Shot Multimodal Compound Expression Recognition Approach using Off-the-Shelf Large Visual-Language Models.
Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops (ICCVW), Honolulu, HI, USA."},{"key":"ref_144","doi-asserted-by":"crossref","unstructured":"Devvrit, F., Kudugunta, S., Kusupati, A., Dettmers, T., Chen, K., Dhillon, I., Tsvetkov, Y., Hajishirzi, H., Kakade, S., and Farhadi, A. (2025, January 10\u201315). MatFormer: Nested Transformer for Elastic Inference. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada.","DOI":"10.52202\/079017-4461"},{"key":"ref_145","first-page":"1","article-title":"Statistical Comparisons of Classifiers over Multiple Data Sets","volume":"7","author":"Dem\u0161ar","year":"2006","journal-title":"J. Mach. Learn. Res."},{"key":"ref_146","doi-asserted-by":"crossref","unstructured":"Yun, T., Lim, H., Lee, J., and Song, M. (2024, January 16\u201321). TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation. Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Mexico City, Mexico.","DOI":"10.18653\/v1\/2024.naacl-long.5"},{"key":"ref_147","doi-asserted-by":"crossref","first-page":"012050","DOI":"10.1088\/1742-6596\/1740\/1\/012050","article-title":"HPC Resources of the Higher School of Economics","volume":"1740","author":"Kostenetskiy","year":"2021","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_148","doi-asserted-by":"crossref","unstructured":"Wu, Y., Zhang, S., and Li, P. (2025). Multi-Modal Emotion Recognition in Conversation based on Prompt Learning with Text-Audio Fusion Features. Sci. Rep., 15.","DOI":"10.1038\/s41598-025-89758-8"},{"key":"ref_149","first-page":"7","article-title":"Feature-Enhanced Neural Collaborative Reasoning for Explainable Recommendation","volume":"43","author":"Zhang","year":"2024","journal-title":"ACM Trans. Inf. Syst."},{"key":"ref_150","unstructured":"Tang, X., Li, Z., Sun, X., Xu, X., and Zhang, M.L. (May, January 26). ZzzMate: A Self-Conscious Emotion-Aware Chatbot for Sleep Intervention. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Yokohama, Japan."},{"key":"ref_151","doi-asserted-by":"crossref","unstructured":"Schedl, M., Lex, E., and Tkalcic, M. (2025, January 13\u201318). Psychological Aspects in Retrieval and Recommendation. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Padua, Italy.","DOI":"10.1145\/3726302.3731691"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/11\/285\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,14]],"date-time":"2025-11-14T05:35:29Z","timestamp":1763098529000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/9\/11\/285"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,12]]},"references-count":151,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2025,11]]}},"alternative-id":["bdcc9110285"],"URL":"https:\/\/doi.org\/10.3390\/bdcc9110285","relation":{},"ISSN":["2504-2289"],"issn-type":[{"value":"2504-2289","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,12]]}}}
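The block above is a single Crossref REST API work record for the article (DOI 10.3390/bdcc9110285), serialized as one JSON object. A minimal sketch of how such a record can be fetched and its key fields read, assuming network access to the public api.crossref.org endpoint and the third-party requests package:

import requests

# Fetch the Crossref work record for this article by DOI.
resp = requests.get("https://api.crossref.org/works/10.3390/bdcc9110285", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]              # mirrors the "message" object above

print(work["title"][0])                    # article title (Crossref stores titles as lists)
print(work["container-title"][0])          # journal: Big Data and Cognitive Computing
print(work["volume"], work["issue"], work["page"])
print(len(work.get("reference", [])), "deposited references")
for author in work["author"]:              # given/family names plus optional ORCID
    print(author["given"], author["family"], author.get("ORCID", ""))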
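The abstract describes Stacked Data Sampling only as concatenating short utterances to stabilize sequence length. The following is a sketch of one plausible reading, grouping same-label clips until a target duration is reached; the 10-second target, the clip tuple layout, and stack_utterances itself are illustrative assumptions, not the authors' published procedure:

def stack_utterances(clips, target_sec=10.0):
    """Greedily concatenate same-label short clips until each stack spans ~target_sec."""
    by_label = {}
    for waveform, transcript, label, seconds in clips:
        by_label.setdefault(label, []).append((waveform, transcript, seconds))
    stacked = []
    for label, items in by_label.items():
        audio, texts, total = [], [], 0.0
        for waveform, transcript, seconds in items:
            audio.append(waveform)
            texts.append(transcript)
            total += seconds
            if total >= target_sec:              # emit one length-stabilized sample
                stacked.append((audio, " ".join(texts), label))
                audio, texts, total = [], [], 0.0
        if audio:                                # keep any leftover tail as its own sample
            stacked.append((audio, " ".join(texts), label))
    return stacked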
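Likewise, the Label Smoothing Generation strategy is summarized as prompting an LLM (Qwen3-4B) for subtle emotional cues missed by annotators and turning them into soft labels. The exact blending rule is not given in this record; below is only a generic soft-label construction in which a one-hot annotation is convexly mixed with LLM-suggested emotion probabilities (the seven-class label set, the weight alpha, and soft_label are hypothetical):

import numpy as np

EMOTIONS = ["neutral", "happiness", "sadness", "anger", "surprise", "disgust", "fear"]  # assumed label set

def soft_label(annotated, llm_probs, alpha=0.2):
    """Blend a one-hot annotator label with LLM-suggested emotion probabilities."""
    one_hot = np.zeros(len(EMOTIONS))
    one_hot[EMOTIONS.index(annotated)] = 1.0
    llm = np.array([llm_probs.get(e, 0.0) for e in EMOTIONS], dtype=float)
    if llm.sum() > 0:
        llm /= llm.sum()                          # normalize the LLM cues to a distribution
    else:
        llm = one_hot                             # no cues detected: fall back to the hard label
    return (1.0 - alpha) * one_hot + alpha * llm  # convex blend stays a valid distribution

# Example: annotator said "neutral", the LLM also detects mild sadness.
print(soft_label("neutral", {"neutral": 0.7, "sadness": 0.3}))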