{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,23]],"date-time":"2026-02-23T14:11:05Z","timestamp":1771855865395,"version":"3.50.1"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,8,18]],"date-time":"2022-08-18T00:00:00Z","timestamp":1660780800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,8,18]],"date-time":"2022-08-18T00:00:00Z","timestamp":1660780800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2020YFB1313600"],"award-info":[{"award-number":["2020YFB1313600"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2023,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>With regard to human\u2013machine interaction, accurate emotion recognition is a challenging problem. In this paper, efforts were taken to explore the possibility to complete the feature abstraction and fusion by the homogeneous network component, and propose a dual-modal emotion recognition framework that is composed of a parallel convolution (Pconv) module and attention-based bidirectional long short-term memory (BLSTM) module. The Pconv module employs parallel methods to extract multidimensional social features and provides more effective representation capacity. Attention-based BLSTM module is utilized to strengthen key information extraction and maintain the relevance between information. Experiments conducted on the CH-SIMS dataset indicate that the recognition accuracy reaches 74.70% on audio data and 77.13% on text, while the accuracy of the dual-modal fusion model reaches 90.02%. Through experiments it proves the feasibility to process heterogeneous information within homogeneous network component, and demonstrates that attention-based BLSTM module would achieve best coordination with the feature fusion realized by Pconv module. This can give great flexibility for the modality expansion and architecture design.<\/jats:p>","DOI":"10.1007\/s40747-022-00841-3","type":"journal-article","created":{"date-parts":[[2022,8,18]],"date-time":"2022-08-18T04:03:33Z","timestamp":1660795413000},"page":"951-963","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":47,"title":["A novel dual-modal emotion recognition algorithm with fusing hybrid features of audio signal and speech context"],"prefix":"10.1007","volume":"9","author":[{"given":"Yurui","family":"Xu","sequence":"first","affiliation":[]},{"given":"Hang","family":"Su","sequence":"additional","affiliation":[]},{"given":"Guijin","family":"Ma","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8307-9700","authenticated-orcid":false,"given":"Xiaorui","family":"Liu","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,8,18]]},"reference":[{"key":"841_CR1","doi-asserted-by":"crossref","unstructured":"Nayak S, Nagesh B, Routray A et al (2021) A human\u2013computer interaction framework for emotion recognition through time-series thermal video sequences. 
Comput Electr Eng 93:107280","DOI":"10.1016\/j.compeleceng.2021.107280"},{"key":"841_CR2","doi-asserted-by":"publisher","first-page":"597","DOI":"10.1016\/j.procs.2020.07.086","volume":"175","author":"M Bouhlal","year":"2020","unstructured":"Bouhlal M, Aarika K, Ait Abdelouahid R et al (2020) Emotions recognition as innovative tool for improving students\u2019 performance and learning approaches. Procedia Comput Sci 175:597\u2013620","journal-title":"Procedia Comput Sci"},{"key":"841_CR3","doi-asserted-by":"crossref","unstructured":"Krause FC, Linardatos Ef, Fresco DM et al (2021) Facial emotion recognition in major depressive disorder: a meta-analytic review. J Affect Disord 293:320\u2013328","DOI":"10.1016\/j.jad.2021.06.053"},{"key":"841_CR4","doi-asserted-by":"crossref","unstructured":"Cui Y, Ma Y, Li W et al (2020) Multi-EmoNet: a novel multi-task neural network for driver emotion recognition. IFAC PapersOnLine 53:650\u2013655","DOI":"10.1016\/j.ifacol.2021.04.155"},{"issue":"2","key":"841_CR5","first-page":"308","volume":"11","author":"C Mumenthaler","year":"2020","unstructured":"Mumenthaler C, Sander D, Manstead ASR (2020) Emotion recognition in simulated social interactions. IEEE Trans Affect Comput 11(2):308\u2013312","journal-title":"IEEE Trans Affect Comput"},{"key":"841_CR6","doi-asserted-by":"crossref","unstructured":"Volpert-Esmond HI, Bartholow BD (2021) A functional coupling of brain and behavior during social categorization of faces. Personal Soc Psychol Bull 47:1580\u20131595","DOI":"10.1177\/0146167220976688"},{"issue":"30","key":"841_CR7","doi-asserted-by":"publisher","first-page":"eaay4073","DOI":"10.1126\/sciadv.aay4073","volume":"6","author":"L Liu","year":"2020","unstructured":"Liu L, Xu H, Wang J, Li J, Xu H (2020) Cell type-differential modulation of prefrontal cortical GABAergic interneurons on low gamma rhythm and social interaction. Sci Adv 6(30):eaay4073","journal-title":"Sci Adv"},{"key":"841_CR8","doi-asserted-by":"crossref","unstructured":"Baltru\u0161aitis T, Ahuja C, Morency LP (2019) Multimodal machine learning: a survey and taxonomy. IEEE Trans Pattern Anal Mach Intell 41:423\u2013443","DOI":"10.1109\/TPAMI.2018.2798607"},{"key":"841_CR9","doi-asserted-by":"crossref","unstructured":"Poria S, Hazarika D, Majumder N et al (2020) Beneath the tip of the iceberg: current challenges and new directions in sentiment analysis research. IEEE Trans Affect Comput 14:1\u201329","DOI":"10.1109\/TAFFC.2020.3038167"},{"key":"841_CR10","doi-asserted-by":"publisher","DOI":"10.1016\/j.bspc.2020.101867","volume":"58","author":"R Sharma","year":"2020","unstructured":"Sharma R, Pachori RB, Sircar P (2020) Automated emotions recognition based on higher order statistics and deep learning algorithm. Biomed Signal Process Control 58:101867","journal-title":"Biomed Signal Process Control"},{"key":"841_CR11","doi-asserted-by":"crossref","unstructured":"Singh K, Malhotra J (2022) Two-layer LSTM network based prediction of epileptic seizures using EEG spectral features. Complex Intell Syst 8:2405\u20132418","DOI":"10.1007\/s40747-021-00627-z"},{"key":"841_CR12","doi-asserted-by":"publisher","DOI":"10.1016\/j.bspc.2020.101921","volume":"59","author":"R Sharma","year":"2020","unstructured":"Sharma R, Sircar P, Pachori RB (2020) Seizures classification based on higher order statistics and deep neural network. 
Biomed Signal Process Control 59:101921","journal-title":"Biomed Signal Process Control"},{"issue":"002","key":"841_CR13","doi-asserted-by":"publisher","first-page":"209","DOI":"10.1007\/s42235-019-0018-3","volume":"16","author":"X Qi","year":"2019","unstructured":"Qi X, Wang W, Guo L et al (2019) Building a Plutchik\u2019s wheel inspired affective model for social robots. J Bionic Eng 16(002):209\u2013221","journal-title":"J Bionic Eng"},{"key":"841_CR14","doi-asserted-by":"crossref","unstructured":"Hossain MS, Muhammad G (2018) Emotion recognition using deep learning approach from audio-visual emotional big data. Inf Fusion 49","DOI":"10.1016\/j.inffus.2018.09.008"},{"key":"841_CR15","unstructured":"Srivastava N, Salakhutdinov R (2014) Multimodal learning with deep Boltzmann machines. J Mach Learn Res 15:2949\u20132980"},{"key":"841_CR16","doi-asserted-by":"crossref","unstructured":"Xu G, Li W, Liu J (2020) A social emotion classification approach using multi-model fusion. Future Gener Comput Syst 102:347\u2013356","DOI":"10.1016\/j.future.2019.07.007"},{"key":"841_CR17","doi-asserted-by":"crossref","unstructured":"Cai H, Qu Z, Li Z et al (2020) Feature-level fusion approaches based on multimodal EEG data for depression recognition. Inf Fusion 59:127\u2013138","DOI":"10.1016\/j.inffus.2020.01.008"},{"key":"841_CR18","doi-asserted-by":"crossref","unstructured":"Nguyen D, Nguyen K, Sridharan S et al (2018) Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition. Comput Vis Image Underst 174:33\u201342","DOI":"10.1016\/j.cviu.2018.06.005"},{"key":"841_CR19","doi-asserted-by":"crossref","unstructured":"Liu Y, Fu G (2021) Emotion recognition by deeply learned multi-channel textual and EEG features. Future Gener Comput Syst 119:1\u201313","DOI":"10.1016\/j.future.2021.01.010"},{"key":"841_CR20","unstructured":"Li J, Selvaraju RR, Gotmare AD et al (2021) Align before fuse: vision and language representation learning with momentum distillation. In: Paper Presented at the Proceedings of the 35th Conference on Neural Information Processing System, Sydney, pp 104\u2013121"},{"key":"841_CR21","doi-asserted-by":"crossref","unstructured":"Li W, Gao C, Niu G et al (2020) UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. In: Paper Presented at the Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Thailand, pp 2592\u20132607","DOI":"10.18653\/v1\/2021.acl-long.202"},{"issue":"JUL.","key":"841_CR22","first-page":"217","volume":"62","author":"X Wang","year":"2018","unstructured":"Wang X, Peng M, Pan L, Hu M, Jin C, Ren F (2018) Two-level attention with two-stage multi-task learning for facial emotion recognition. J Vis Commun Image Represent 62(JUL.):217\u2013225","journal-title":"J Vis Commun Image Represent"},{"issue":"3","key":"841_CR23","doi-asserted-by":"publisher","DOI":"10.1016\/j.apacoust.2021.108046","volume":"179","author":"J Ancilin","year":"2021","unstructured":"Ancilin J, Milton A (2021) Improved speech emotion recognition with mel frequency magnitude coefficient. Appl Acoust 179(3):108046","journal-title":"Appl Acoust"},{"key":"841_CR24","doi-asserted-by":"crossref","unstructured":"Farhoudi Z, Setayeshi S (2020) Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition. 
Speech Commun 127:92\u2013103","DOI":"10.1016\/j.specom.2020.12.001"},{"key":"841_CR25","unstructured":"Lu J, Batra D, Parikh D et al (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision and language tasks. In: Paper Presented at the Proceedings of 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, pp 13\u201323"},{"key":"841_CR26","unstructured":"Li LH, Yatskar M, Yin D et al (2019) VisualBERT: a simple and performant baseline for vision and language. arXiv arXiv:1908.03557"},{"key":"841_CR27","doi-asserted-by":"crossref","unstructured":"Chen YC, Li L, Yu L et al (2020) UNITER: universal image-text representation learning. In: Paper Presented at the Proceedings of European Conference on Computer Vision, Glasgow, pp 1303\u20131313","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"841_CR28","doi-asserted-by":"crossref","unstructured":"Wang Z, Zhou X, Wang W et al (2020) Emotion recognition using multimodal deep learning in multiple psychophysiological signals and video. Int J Mach Learn Cybern 11:923\u2013934","DOI":"10.1007\/s13042-019-01056-8"},{"key":"841_CR29","doi-asserted-by":"crossref","unstructured":"Xu H, Zhang H, Han K et al (2019) Learning alignment for multimodal emotion recognition from speech. In: Proceedings of InterSpeech 2019, September 15-19, Graz, Austria, pp 3569\u20133573","DOI":"10.21437\/Interspeech.2019-3247"},{"key":"841_CR30","unstructured":"Narotam S, Nittin S, Abhinav D (2017) Continuous multimodal emotion recognition approach for AVEC 2017. arXiv arXiv:1709.05861"},{"issue":"1","key":"841_CR31","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/1856\/1\/012006","volume":"1856","author":"Z Meng","year":"2021","unstructured":"Meng Z (2021) Research on timbre classification based on BP neural network and MFCC. J Phys Conf Ser 1856(1):012006","journal-title":"J Phys Conf Ser"},{"issue":"2","key":"841_CR32","first-page":"1","volume":"39","author":"O Kolesnikova","year":"2020","unstructured":"Kolesnikova O, Gelbukh A (2020) A study of lexical function detection with word2vec and supervised machine learning. J Intell Fuzzy Syst 39(2):1\u20138","journal-title":"J Intell Fuzzy Syst"},{"key":"841_CR33","doi-asserted-by":"publisher","first-page":"2485","DOI":"10.1007\/s40747-021-00436-4","volume":"7","author":"J Shobana","year":"2021","unstructured":"Shobana J, Murali M (2021) An efficient sentiment analysis methodology based on long short-term memory networks. Complex Intell Syst 7:2485\u20132501","journal-title":"Complex Intell Syst"},{"key":"841_CR34","unstructured":"Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv arXiv:1409.0473"},{"key":"841_CR35","doi-asserted-by":"crossref","unstructured":"Yu W, Xu H, Meng F et al (2020) CH-SIMS: a Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In: Proceedings of the 58th annual meeting of the association for computational linguistics, Seattle, pp 3718\u20133727","DOI":"10.18653\/v1\/2020.acl-main.343"},{"key":"841_CR36","doi-asserted-by":"crossref","unstructured":"Singh P, Srivastava R, Rana K et al (2021) A multimodal hierarchical approach to speech emotion recognition from audio and text. 
Knowl Based Syst 229:107316","DOI":"10.1016\/j.knosys.2021.107316"},{"key":"841_CR37","doi-asserted-by":"crossref","unstructured":"Vashishtha S, Susan S (2020) Inferring sentiments from supervised classification of text and speech cues using fuzzy rules. Procedia Comput Sci 167:1370\u20131379","DOI":"10.1016\/j.procs.2020.03.348"},{"key":"841_CR38","doi-asserted-by":"crossref","unstructured":"Pepino L, Riera P, Ferrer L et al (2020) Fusion approaches for emotion recognition from speech using acoustic and text-based features. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, pp 6484\u20136488","DOI":"10.1109\/ICASSP40776.2020.9054709"},{"key":"841_CR39","doi-asserted-by":"crossref","unstructured":"Priyasad D, Fernando T, Denman S et al (2020) Attention driven fusion for multi-modal emotion recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, pp 3227\u20133231","DOI":"10.1109\/ICASSP40776.2020.9054441"},{"key":"841_CR40","doi-asserted-by":"crossref","unstructured":"Makiuchi MR, Uto K, Shinoda K (2021) Multimodal emotion recognition with high-level speech and text features. In: Proceedings of the 2021 IEEE automatic speech recognition and understanding workshop, Cartagena, pp 350\u2013357","DOI":"10.1109\/ASRU51503.2021.9688036"},{"key":"841_CR41","unstructured":"Krishna D, Patil A (2020) Multimodal emotion recognition using cross-modal attention and 1D convolutional neural network. In: Interspeech, Shanghai, China: ISCA, 2020, pp 4243\u20134247"},{"key":"841_CR42","doi-asserted-by":"publisher","first-page":"985","DOI":"10.1109\/TASLP.2021.3049898","volume":"29","author":"Z Lian","year":"2021","unstructured":"Lian Z, Liu B, Tao J (2021) CTNet: conversational transformer network for emotion recognition. IEEE\/ACM Trans Audio Speech Lang Process 29:985\u20131000","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"841_CR43","doi-asserted-by":"crossref","unstructured":"Padi S, Sadjadi SO, Manocha D et al (2022) Multimodal emotion recognition using transfer learning from speaker recognition and BERT-based models. 
arXiv:2202.08974, pp 407\u2013414","DOI":"10.21437\/Odyssey.2022-57"}],"container-title":["Complex & Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-022-00841-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-022-00841-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-022-00841-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,22]],"date-time":"2023-02-22T18:06:13Z","timestamp":1677089173000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-022-00841-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,18]]},"references-count":43,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,2]]}},"alternative-id":["841"],"URL":"https:\/\/doi.org\/10.1007\/s40747-022-00841-3","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"value":"2199-4536","type":"print"},{"value":"2198-6053","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,8,18]]},"assertion":[{"value":"6 August 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 July 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 August 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}
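The record above carries only the abstract, which describes a parallel convolution (Pconv) feature extractor, an attention-based BLSTM, and dual-modal (audio plus text) fusion. The following is a minimal PyTorch sketch of that kind of architecture, not the authors' implementation: all class names (Pconv, AttnBLSTM, DualModalNet), kernel sizes, hidden sizes, and the concatenation-based fusion are assumptions made for illustration; the input dimensions merely gesture at MFCC frames and word2vec tokens mentioned in the reference list.

import torch
import torch.nn as nn
import torch.nn.functional as F


class Pconv(nn.Module):
    """Hypothetical parallel-convolution block: several 1-D convolutions with
    different receptive fields run side by side over the feature sequence, and
    their outputs are concatenated channel-wise to widen the representation."""

    def __init__(self, in_dim: int, branch_dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_dim, branch_dim, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):            # x: (batch, time, in_dim)
        x = x.transpose(1, 2)        # Conv1d expects (batch, channels, time)
        out = torch.cat([F.relu(b(x)) for b in self.branches], dim=1)
        return out.transpose(1, 2)   # (batch, time, branch_dim * n_branches)


class AttnBLSTM(nn.Module):
    """Bidirectional LSTM followed by a simple learned attention pooling:
    a linear layer scores each timestep, softmax over time produces weights,
    and the weighted sum yields a fixed-size utterance vector."""

    def __init__(self, in_dim: int, hidden: int):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                       # x: (batch, time, in_dim)
        h, _ = self.blstm(x)                    # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        return (w * h).sum(dim=1)               # (batch, 2*hidden)


class DualModalNet(nn.Module):
    """Audio and text branches share the same homogeneous component stack
    (Pconv -> AttnBLSTM); their pooled vectors are fused by concatenation
    before a linear classifier. Dimensions are illustrative guesses."""

    def __init__(self, audio_dim=39, text_dim=300, hidden=128, n_classes=3):
        super().__init__()
        self.audio = nn.Sequential(Pconv(audio_dim, 64), AttnBLSTM(64 * 3, hidden))
        self.text = nn.Sequential(Pconv(text_dim, 64), AttnBLSTM(64 * 3, hidden))
        self.head = nn.Linear(2 * 2 * hidden, n_classes)

    def forward(self, audio, text):
        fused = torch.cat([self.audio(audio), self.text(text)], dim=-1)
        return self.head(fused)


if __name__ == "__main__":
    net = DualModalNet()
    # e.g. 200 MFCC frames of 39 coefficients and 40 tokens of 300-d embeddings
    logits = net(torch.randn(2, 200, 39), torch.randn(2, 40, 300))
    print(logits.shape)  # torch.Size([2, 3])

One design point worth noting: because both modalities pass through the same Pconv/AttnBLSTM component types and end in same-shaped vectors, adding a third modality in this sketch only requires one more branch and a wider classifier, which is the flexibility for modality expansion the abstract claims for homogeneous components.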