{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T16:38:29Z","timestamp":1776184709930,"version":"3.50.1"},"reference-count":31,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2022,7,26]],"date-time":"2022-07-26T00:00:00Z","timestamp":1658793600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Henan Province Key Scientific Research Projects Plan of Colleges and Universities","award":["22A520004"],"award-info":[{"award-number":["22A520004"]}]},{"name":"Henan Province Key Scientific Research Projects Plan of Colleges and Universities","award":["22A510001"],"award-info":[{"award-number":["22A510001"]}]},{"name":"Henan Province Key Scientific Research Projects Plan of Colleges and Universities","award":["22A520004"],"award-info":[{"award-number":["22A520004"]}]},{"name":"Henan Province Key Scientific Research Projects Plan of Colleges and Universities","award":["22A510001"],"award-info":[{"award-number":["22A510001"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>The quality of feature extraction plays a significant role in the performance of speech emotion recognition. In order to extract discriminative, affect-salient features from speech signals and then improve the performance of speech emotion recognition, in this paper, a multi-stream convolution-recurrent neural network based on attention mechanism (MSCRNN-A) is proposed. Firstly, a multi-stream sub-branches full convolution network (MSFCN) based on AlexNet is presented to limit the loss of emotional information. In MSFCN, sub-branches are added behind each pooling layer to retain the features of different resolutions, different features from which are fused by adding. Secondly, the MSFCN and Bi-LSTM network are combined to form a hybrid network to extract speech emotion features for the purpose of supplying the temporal structure information of emotional features. Finally, a feature fusion model based on a multi-head attention mechanism is developed to achieve the best fusion features. The proposed method uses an attention mechanism to calculate the contribution degree of different network features, and thereafter realizes the adaptive fusion of different network features by weighting different network features. Aiming to restrain the gradient divergence of the network, different network features and fusion features are connected through shortcut connection to obtain fusion features for recognition. 
,"DOI":"10.3390\/e24081025","type":"journal-article","created":{"date-parts":[[2022,7,26]],"date-time":"2022-07-26T22:03:42Z","timestamp":1658873022000},"page":"1025","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":29,"title":["Multi-Stream Convolution-Recurrent Neural Networks Based on Attention Mechanism Fusion for Speech Emotion Recognition"],"prefix":"10.3390","volume":"24","author":[{"given":"Huawei","family":"Tao","sequence":"first","affiliation":[{"name":"College of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China"}]},{"given":"Lei","family":"Geng","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China"}]},{"given":"Shuai","family":"Shan","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China"}]},{"given":"Jingchao","family":"Mai","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China"}]},{"given":"Hongliang","family":"Fu","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,7,26]]},"reference":[
{"key":"ref_1","first-page":"2465","article-title":"Dimensional speech emotion recognition review","volume":"31","author":"Li","year":"2020","journal-title":"J. Softw."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.specom.2019.12.001","article-title":"Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers","volume":"116","year":"2020","journal-title":"Speech Commun."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"572","DOI":"10.1016\/j.patcog.2010.09.020","article-title":"Survey on speech emotion recognition: Features, classification schemes, and databases","volume":"44","author":"Kamel","year":"2011","journal-title":"Pattern Recognit."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1109\/TPAMI.2008.52","article-title":"A survey of affect recognition methods: Audio, visual, and spontaneous expressions","volume":"31","author":"Zeng","year":"2008","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1109\/TSA.2003.814368","article-title":"Prosodic and accentual information for automatic speech recognition","volume":"11","author":"Milone","year":"2003","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_6","unstructured":"Zhang, S. (2008, September 24\u201328). Emotion recognition in Chinese natural speech by combining prosody and voice quality features. Proceedings of the 5th International Symposium on Neural Networks, Beijing, China."},
{"key":"ref_7","doi-asserted-by":"crossref","first-page":"80","DOI":"10.1016\/j.bspc.2014.10.008","article-title":"Weighted spectral features based on local Hu moments for speech emotion recognition","volume":"18","author":"Sun","year":"2015","journal-title":"Biomed. Signal Process. Control"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"2203","DOI":"10.1109\/TMM.2014.2360798","article-title":"Learning salient features for speech emotion recognition using convolutional neural networks","volume":"16","author":"Mao","year":"2014","journal-title":"IEEE Trans. Multimed."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1576","DOI":"10.1109\/TMM.2017.2766843","article-title":"Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching","volume":"20","author":"Zhang","year":"2017","journal-title":"IEEE Trans. Multimed."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Han, K., Yu, D., and Tashev, I. (2014, September 14\u201318). Speech emotion recognition using deep neural network and extreme learning machine. Proceedings of the 15th Annual Conference of the International Speech Communication Association, Singapore.","DOI":"10.21437\/Interspeech.2014-57"},{"key":"ref_11","first-page":"31","article-title":"Semisupervised autoencoders for speech emotion recognition","volume":"26","author":"Deng","year":"2017","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Dissanayake, V., Zhang, H., Billinghurst, M., and Nanayakkara, S. (2020, October 25\u201329). Speech Emotion Recognition \u2018in the Wild\u2019 Using an Autoencoder. Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1356"},{"key":"ref_13","first-page":"1675","article-title":"Speech emotion classification using attention-based LSTM","volume":"27","author":"Xie","year":"2019","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Wang, J., Xue, M., Culhane, R., Diao, E., Ding, J., and Tarokh, V. (2020, May 4\u20138). Speech emotion recognition with dual-sequence LSTM architecture. Proceedings of the 45th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054629"},{"key":"ref_15","first-page":"2697","article-title":"Semi-supervised speech emotion recognition with ladder networks","volume":"28","author":"Parthasarathy","year":"2020","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"5116","DOI":"10.1002\/int.22505","article-title":"Optimal feature selection based speech emotion recognition using two-stream deep convolutional neural network","volume":"36","author":"Kwon","year":"2021","journal-title":"Int. J. Intell. Syst."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Jiang, P., Xu, X., Tao, H., Zhao, L., and Zou, C. (2021). Convolutional-Recurrent Neural Networks with Multiple Attention Mechanisms for Speech Emotion Recognition. IEEE Trans. Cogn. Dev. Syst.","DOI":"10.1109\/TCDS.2021.3123979"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1440","DOI":"10.1109\/LSP.2018.2860246","article-title":"3-D convolutional recurrent neural networks with attention model for speech emotion recognition","volume":"25","author":"Chen","year":"2018","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"622","DOI":"10.1587\/transfun.2020EAL2051","article-title":"A novel hybrid network model based on attentional multi-feature fusion for deception detection","volume":"104","author":"Fang","year":"2021","journal-title":"IEICE Trans. Fund. Electron."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"90368","DOI":"10.1109\/ACCESS.2019.2927384","article-title":"Parallelized convolutional recurrent neural network with spectral features for speech emotion recognition","volume":"7","author":"Jiang","year":"2019","journal-title":"IEEE Access"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1016\/j.specom.2020.03.005","article-title":"Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN","volume":"120","author":"Yao","year":"2020","journal-title":"Speech Commun."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1016\/j.specom.2020.12.009","article-title":"Learning deep multimodal affective features for spontaneous speech emotion recognition","volume":"127","author":"Zhang","year":"2021","journal-title":"Speech Commun."},{"key":"ref_23","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, December 4\u20139). Attention Is All You Need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"69","DOI":"10.1109\/TAFFC.2015.2392101","article-title":"Speech emotion recognition using Fourier parameters","volume":"6","author":"Wang","year":"2015","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"172","DOI":"10.1109\/TNNLS.2020.3027600","article-title":"Improving speech emotion recognition with adversarial data augmentation network","volume":"33","author":"Yi","year":"2020","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_26","unstructured":"Haq, S., Jackson, P.J.B., and Edge, J. (2009, September 10\u201313). Speaker-dependent audio-visual emotion recognition. Proceedings of the Auditory-Visual Speech Processing (AVSP) 2009, Norwich, UK."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G., and Wendemuth, A. (2009, December 13\u201317). Acoustic emotion recognition: A benchmark comparison of performances. Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, Merano, Italy.","DOI":"10.1109\/ASRU.2009.5372886"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Schuller, B., Steidl, S., and Batliner, A. (2009, September 6\u201310). The INTERSPEECH 2009 emotion challenge. Proceedings of the 10th Annual Conference of the International Speech Communication Association, Brighton, UK.","DOI":"10.21437\/Interspeech.2009-103"},{"key":"ref_29","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, December 3\u20138). ImageNet classification with deep convolutional neural networks. Proceedings of the Twenty-Sixth Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"587","DOI":"10.1049\/iet-spr.2016.0336","article-title":"Speech emotion classification using combined neurogram and INTERSPEECH 2010 paralinguistic challenge features","volume":"11","author":"Jassim","year":"2017","journal-title":"IET Signal Process."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"1945630","DOI":"10.1155\/2017\/1945630","article-title":"Random deep belief networks for recognizing emotions from speech signals","volume":"2017","author":"Wen","year":"2017","journal-title":"Comput. Intell. Neurosci."}
],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/24\/8\/1025\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:56:37Z","timestamp":1760140597000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/24\/8\/1025"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,7,26]]},"references-count":31,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2022,8]]}},"alternative-id":["e24081025"],"URL":"https:\/\/doi.org\/10.3390\/e24081025","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,7,26]]}}}
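For readers who want a concrete picture of the architecture the abstract describes, the following is a minimal PyTorch sketch of the MSCRNN-A idea: a multi-branch CNN whose pooled feature maps are fused by addition, a Bi-LSTM stream supplying temporal structure, multi-head attention weighting the two streams, and a shortcut connection around the fusion. It is an illustration only; the layer sizes, the 1x1 branch-alignment convolutions, the global pooling used to align branch resolutions, and the classifier head are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class MSFCN(nn.Module):
    # AlexNet-style convolution stack; a sub-branch after each pooling layer
    # keeps features at that resolution, and the branches are fused by adding.
    def __init__(self, out_dim=256):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block3 = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # 1x1 convolutions project each pooled map to a common width (assumed detail).
        self.branch1 = nn.Conv2d(64, out_dim, 1)
        self.branch2 = nn.Conv2d(128, out_dim, 1)
        self.branch3 = nn.Conv2d(256, out_dim, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)  # align spatial sizes before addition

    def forward(self, x):  # x: (batch, 1, n_mels, frames) log-Mel spectrogram
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        # Fuse the multi-resolution branch features by element-wise addition.
        fused = (self.pool(self.branch1(f1)) + self.pool(self.branch2(f2))
                 + self.pool(self.branch3(f3)))
        return fused.flatten(1)  # (batch, out_dim)

class MSCRNN_A(nn.Module):
    def __init__(self, n_mels=64, dim=256, n_heads=4, n_classes=6):
        super().__init__()
        self.cnn = MSFCN(dim)
        self.bilstm = nn.LSTM(n_mels, dim // 2, batch_first=True, bidirectional=True)
        # Multi-head attention computes each stream's contribution to the fused feature.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, spec):  # spec: (batch, 1, n_mels, frames)
        c = self.cnn(spec)  # convolutional stream, (batch, dim)
        r, _ = self.bilstm(spec.squeeze(1).transpose(1, 2))  # (batch, frames, dim)
        r = r.mean(dim=1)  # temporal stream summarized over time, (batch, dim)
        streams = torch.stack([c, r], dim=1)  # (batch, 2, dim)
        fused, _ = self.attn(streams, streams, streams)  # adaptive weighting of streams
        # Shortcut connection: add the original stream features back onto the
        # attention-fused features before pooling and classifying, mirroring the
        # abstract's remedy for gradient divergence.
        out = (fused + streams).mean(dim=1)
        return self.fc(out)

logits = MSCRNN_A()(torch.randn(2, 1, 64, 128))  # toy batch of two spectrograms
print(logits.shape)  # torch.Size([2, 6])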