{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,29]],"date-time":"2026-03-29T15:52:39Z","timestamp":1774799559716,"version":"3.50.1"},"reference-count":47,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2023,5,22]],"date-time":"2023-05-22T00:00:00Z","timestamp":1684713600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Neurorobot."],"abstract":"<jats:p>Speech emotion recognition is challenging due to the subjectivity and ambiguity of emotion. In recent years, multimodal methods for speech emotion recognition have achieved promising results. However, due to the heterogeneity of data from different modalities, effectively integrating different modal information remains a difficulty and breakthrough point of the research. Moreover, in view of the limitations of feature-level fusion and decision-level fusion methods, capturing fine-grained modal interactions has often been neglected in previous studies. We propose a method named multimodal transformer augmented fusion that uses a hybrid fusion strategy, combing feature-level fusion and model-level fusion methods, to perform fine-grained information interaction within and between modalities. A Model-fusion module composed of three Cross-Transformer Encoders is proposed to generate multimodal emotional representation for modal guidance and information fusion. Specifically, the multimodal features obtained by feature-level fusion and text features are used to enhance speech features. Our proposed method outperforms existing state-of-the-art approaches on the IEMOCAP and MELD dataset.<\/jats:p>","DOI":"10.3389\/fnbot.2023.1181598","type":"journal-article","created":{"date-parts":[[2023,5,22]],"date-time":"2023-05-22T04:35:32Z","timestamp":1684730132000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":45,"title":["Multimodal transformer augmented fusion for speech emotion recognition"],"prefix":"10.3389","volume":"17","author":[{"given":"Yuanyuan","family":"Wang","sequence":"first","affiliation":[]},{"given":"Yu","family":"Gu","sequence":"additional","affiliation":[]},{"given":"Yifei","family":"Yin","sequence":"additional","affiliation":[]},{"given":"Yingping","family":"Han","sequence":"additional","affiliation":[]},{"given":"He","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Shuang","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Chenyu","family":"Li","sequence":"additional","affiliation":[]},{"given":"Dou","family":"Quan","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2023,5,22]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1016\/j.specom.2022.03.002","article-title":"Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion","volume":"140","author":"Atmaja","year":"2022","journal-title":"Speech Commun."},{"key":"B2","doi-asserted-by":"publisher","first-page":"572","DOI":"10.1016\/j.patcog.2010.09.020","article-title":"Survey on speech emotion recognition: features, classification schemes, and databases","volume":"44","author":"Ayadi","year":"2011","journal-title":"Pattern 
Recogn."},{"key":"B3","doi-asserted-by":"publisher","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","article-title":"IEMOCAP: interactive emotional dyadic motion capture database","volume":"42","author":"Busso","year":"2008","journal-title":"Lang. Resour. Eval."},{"key":"B4","first-page":"374","article-title":"\u201cA multi-scale fusion framework for bimodal speech emotion recognition,\u201d","volume-title":"Interspeech","author":"Chen","year":"2020"},{"key":"B5","doi-asserted-by":"crossref","first-page":"571","DOI":"10.1145\/2964284.2967286","article-title":"\u201cMulti-modal conditional attention fusion for dimensional emotion prediction,\u201d","volume-title":"Proceedings of the 24th ACM International Conference on Multimedia","author":"Chen","year":"2016"},{"key":"B6","article-title":"Multimodal end-to-end sparse model for emotion recognition","author":"Dai","year":"2021","journal-title":"arXiv preprint arXiv:2103.09666"},{"key":"B7","first-page":"4148","article-title":"\u201cContextualized gnn based multimodal emotion recognition,\u201d","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Joshi","year":"2022"},{"key":"B8","doi-asserted-by":"publisher","first-page":"2980","DOI":"10.1109\/TMM.2018.2827782","article-title":"Building emotional machines: Recognizing image emotions through deep neural networks","volume":"20","author":"Kim","year":"2017","journal-title":"IEEE Trans. Multimedia"},{"key":"B9","article-title":"Adam: a method for stochastic optimization","author":"Kingma","year":"2014","journal-title":"Comput. Sci"},{"key":"B10","first-page":"4243","article-title":"\u201cMultimodal emotion recognition using cross-modal attention and 1d convolutional neural networks,\u201d","volume-title":"Interspeech","author":"Krishna","year":"2020"},{"key":"B11","first-page":"1748","article-title":"\u201cTowards the explainability of multimodal speech emotion recognition,\u201d","volume-title":"InterSpeech","author":"Kumar","year":"2021"},{"key":"B12","article-title":"Interpretable multimodal emotion recognition using hybrid fusion of speech and image data","author":"Kumar","year":"2022","journal-title":"arXiv preprint arXiv:2208.11868"},{"key":"B13","doi-asserted-by":"publisher","first-page":"94557","DOI":"10.1109\/ACCESS.2021.3092735","article-title":"Multimodal emotion recognition fusion analysis adapting bert with heterogeneous feature unification","volume":"9","author":"Lee","year":"2021","journal-title":"IEEE Access"},{"key":"B14","doi-asserted-by":"crossref","DOI":"10.1109\/TASLP.2021.3049898","article-title":"\u201cCTNet: conversational transformer network for emotion recognition,\u201d","volume-title":"IEEE\/ACM Transactions on Audio, Speech, and Language Processing","author":"Lian","year":"2021"},{"key":"B15","first-page":"394","article-title":"\u201cContext-dependent domain adversarial neural network for multimodal emotion recognition,\u201d","volume-title":"Interspeech","author":"Lian","year":"2020"},{"key":"B16","article-title":"\u201cEmotion in the speech of children with autism spectrum conditions: Prosody and everything else,\u201d","author":"Marchi","year":"2012","journal-title":"Proceedings 3rd Workshop on Child, Computer and Interaction (WOCCI 2012)"},{"key":"B17","doi-asserted-by":"crossref","first-page":"18","DOI":"10.25080\/Majora-7b98e3ed-003","article-title":"\u201clibrosa: Audio and music signal analysis in 
Python,\u201d","author":"McFee","year":"2015","journal-title":"Proceedings of the 14th Python in Science Conference"},{"key":"B18","volume-title":"Silent Messages","author":"Mehrabian","year":"1971"},{"key":"B19","first-page":"1359","article-title":"\u201cM3er: multiplicative multimodal emotion recognition using facial, textual, and speech cues,\u201d","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Mittal","year":"2020"},{"key":"B20","first-page":"2211","article-title":"Multiple kernel learning algorithms","volume":"12","author":"Nen","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"B21","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1016\/j.inffus.2017.02.003","article-title":"A review of affective computing: from unimodal analysis to multimodal fusion","volume":"37","author":"Poria","year":"","journal-title":"Inform. Fusion"},{"key":"B22","first-page":"873","article-title":"\u201cContext-dependent sentiment analysis in user-generated videos,\u201d","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics","author":"Poria","year":""},{"key":"B23","first-page":"1033","article-title":"\u201cMulti-level multiple attentions for contextual multimodal sentiment analysis,\u201d","volume-title":"2017 IEEE International Conference on Data Mining (ICDM)","author":"Poria","year":""},{"key":"B24","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1016\/j.neucom.2015.01.095","article-title":"Fusing audio, visual and textual clues for sentiment analysis from multimodal content","volume":"174","author":"Poria","year":"2016","journal-title":"Neurocomputing"},{"key":"B25","article-title":"\u201cMELD: a multimodal multi-party dataset for emotion recognition in conversations,\u201d","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Poria","year":"2018"},{"key":"B26","article-title":"Multimodal speech emotion recognition and ambiguity resolution","author":"Sahu","year":"2019","journal-title":"arXiv preprint arXiv:1904.06022"},{"key":"B27","doi-asserted-by":"publisher","first-page":"90","DOI":"10.1145\/3129340","article-title":"Speech emotion recognition two decades in a nutshell, benchmarks, and ongoing trends","volume":"61","author":"Schuller","year":"2018","journal-title":"Commun. ACM"},{"key":"B28","first-page":"51","article-title":"\u201cFusion techniques for utterance-level emotion recognition combining speech and transcripts,\u201d","volume-title":"Interspeech","author":"Sebastian","year":"2019"},{"key":"B29","article-title":"\u201cMultimodal approaches for emotion recognition: a survey,\u201d","volume-title":"Proceedings of SPIE - The International Society for Optical Engineering","author":"Sebe","year":"2005"},{"key":"B30","first-page":"369","article-title":"\u201cWise: word-level interaction-based multimodal fusion for speech emotion recognition,\u201d","volume-title":"Interspeech","author":"Shen","year":"2020"},{"key":"B31","doi-asserted-by":"publisher","first-page":"505","DOI":"10.1016\/S0959-4388(00)00241-5","article-title":"Sensory modalities are not separate modalities: plasticity and interactions","volume":"11","author":"Shimojo","year":"2001","journal-title":"Curr. Opin. 
Neurobiol."},{"key":"B32","doi-asserted-by":"crossref","first-page":"4275","DOI":"10.1109\/ICASSP39728.2021.9414654","article-title":"\u201cMultimodal cross-and self-attention network for speech emotion recognition,\u201d","volume-title":"ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Sun","year":"2021"},{"key":"B33","doi-asserted-by":"publisher","first-page":"267","DOI":"10.1561\/2200000013","article-title":"An introduction to conditional random fields","volume":"4","author":"Sutton","year":"2010","journal-title":"Found. Trends Mach. Learn."},{"key":"B34","doi-asserted-by":"crossref","first-page":"981","DOI":"10.1007\/11573548_125","article-title":"\u201cAffective computing: a review,\u201d","volume-title":"Affective Computing and Intelligent Interaction: First International Conference, ACII 2005","author":"Tao","year":"2005"},{"key":"B35","article-title":"Attention is all you need","author":"Vaswani","year":"2017","journal-title":"arXiv preprint arXiv:1706.03762"},{"key":"B36","doi-asserted-by":"publisher","first-page":"4897","DOI":"10.1007\/s11042-021-10553-4","article-title":"Speech emotion recognition based on multi-feature and multi-lingual fusion","volume":"81","author":"Wang","year":"2022","journal-title":"Multimedia Tools Appl."},{"key":"B37","first-page":"4518","article-title":"\u201cLearning mutual correlation in multimodal transformer for speech emotion recognition,\u201d","volume-title":"Interspeech","author":"Wang","year":"2021"},{"key":"B38","doi-asserted-by":"publisher","first-page":"47795","DOI":"10.1109\/ACCESS.2021.3068045","article-title":"A comprehensive review of speech emotion recognition systems","volume":"9","author":"Wani","year":"2021","journal-title":"IEEE Access"},{"key":"B39","doi-asserted-by":"publisher","DOI":"10.3389\/fnbot.2022.971446","article-title":"A novel silent speech recognition approach based on parallel inception convolutional neural network and mel frequency spectral coefficient","author":"Wu","year":"2022","journal-title":"Front. 
Neurorobot."},{"key":"B40","first-page":"554","article-title":"\u201cParallel-inception cnn approach for facial semg based silent speech recognition,\u201d","author":"Wu","year":"2021","journal-title":"2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)"},{"key":"B41","doi-asserted-by":"crossref","first-page":"6269","DOI":"10.1109\/ICASSP39728.2021.9414880","article-title":"\u201cEmotion recognition by fusing time synchronous and time asynchronous representations,\u201d","volume-title":"ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Wu","year":"2021"},{"key":"B42","article-title":"Learning alignment for multimodal emotion recognition from speech","author":"Xu","year":"2019","journal-title":"arXiv preprint arXiv:1909.05645."},{"key":"B43","doi-asserted-by":"crossref","first-page":"6499","DOI":"10.1109\/ICASSP40776.2020.9053039","article-title":"\u201cHGFM: a hierarchical grained and feature model for acoustic emotion recognition,\u201d","volume-title":"ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Xu","year":"2020"},{"key":"B44","article-title":"Attentive modality hopping mechanism for speech emotion recognition","author":"Yoon","year":"2019","journal-title":"arXiv preprint arXiv:1912.00846"},{"key":"B45","first-page":"2717","article-title":"\u201cMultimodal speech emotion recognition using cross attention with aligned audio and text,\u201d","volume-title":"Interspeech","author":"Yoonhyung","year":"2020"},{"key":"B46","first-page":"1103","article-title":"\u201cTensor fusion network for multimodal sentiment analysis,\u201d","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing","author":"Zadeh","year":"2017"},{"key":"B47","doi-asserted-by":"crossref","DOI":"10.1609\/aaai.v32i1.11280","article-title":"\u201cInferring emotion from conversational voice data: a semi-supervised multi-path generative neural network approach,\u201d","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Zhou","year":"2018"}],"container-title":["Frontiers in Neurorobotics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2023.1181598\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,22]],"date-time":"2023-05-22T04:35:59Z","timestamp":1684730159000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2023.1181598\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,22]]},"references-count":47,"alternative-id":["10.3389\/fnbot.2023.1181598"],"URL":"https:\/\/doi.org\/10.3389\/fnbot.2023.1181598","relation":{},"ISSN":["1662-5218"],"issn-type":[{"value":"1662-5218","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,22]]},"article-number":"1181598"}}