{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,13]],"date-time":"2026-06-13T13:55:34Z","timestamp":1781358934868,"version":"3.54.1"},"reference-count":87,"publisher":"MDPI AG","issue":"13","license":[{"start":{"date-parts":[[2023,7,7]],"date-time":"2023-07-07T00:00:00Z","timestamp":1688688000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Second Century Fund (C2F), Chulalongkorn University","award":["801538"],"award-info":[{"award-number":["801538"]}]},{"name":"Second Century Fund (C2F), Chulalongkorn University","award":["RSPD2023R699"],"award-info":[{"award-number":["RSPD2023R699"]}]},{"name":"Universidad Carlos III de Madrid","award":["801538"],"award-info":[{"award-number":["801538"]}]},{"name":"Universidad Carlos III de Madrid","award":["RSPD2023R699"],"award-info":[{"award-number":["RSPD2023R699"]}]},{"name":"European Union\u2019s Horizon 2020","award":["801538"],"award-info":[{"award-number":["801538"]}]},{"name":"European Union\u2019s Horizon 2020","award":["RSPD2023R699"],"award-info":[{"award-number":["RSPD2023R699"]}]},{"name":"King Saud University","award":["801538"],"award-info":[{"award-number":["801538"]}]},{"name":"King Saud University","award":["RSPD2023R699"],"award-info":[{"award-number":["RSPD2023R699"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Speech emotion recognition (SER) is a challenging task in human\u2013computer interaction (HCI) systems. One of the key challenges in speech emotion recognition is to extract the emotional features effectively from a speech utterance. Despite the promising results of recent studies, they generally do not leverage advanced fusion algorithms for the generation of effective representations of emotional features in speech utterances. To address this problem, we describe the fusion of spatial and temporal feature representations of speech emotion by parallelizing convolutional neural networks (CNNs) and a Transformer encoder for SER. We stack two parallel CNNs for spatial feature representation in parallel to a Transformer encoder for temporal feature representation, thereby simultaneously expanding the filter depth and reducing the feature map with an expressive hierarchical feature representation at a lower computational cost. We use the RAVDESS dataset to recognize eight different speech emotions. We augment and intensify the variations in the dataset to minimize model overfitting. Additive White Gaussian Noise (AWGN) is used to augment the RAVDESS dataset. With the spatial and sequential feature representations of CNNs and the Transformer, the SER model achieves 82.31% accuracy for eight emotions on a hold-out dataset. In addition, the SER system is evaluated with the IEMOCAP dataset and achieves 79.42% recognition accuracy for five emotions. Experimental results on the RAVDESS and IEMOCAP datasets show the success of the presented SER system and demonstrate an absolute performance improvement over the state-of-the-art (SOTA) models.<\/jats:p>","DOI":"10.3390\/s23136212","type":"journal-article","created":{"date-parts":[[2023,7,7]],"date-time":"2023-07-07T02:37:29Z","timestamp":1688697449000},"page":"6212","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":57,"title":["Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer"],"prefix":"10.3390","volume":"23","author":[{"given":"Rizwan","family":"Ullah","sequence":"first","affiliation":[{"name":"Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Muhammad","family":"Asif","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, Main Campus, University of Science & Technology, Bannu 28100, Pakistan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5513-6413","authenticated-orcid":false,"given":"Wahab Ali","family":"Shah","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, Namal University, Mianwali 42250, Pakistan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fakhar","family":"Anjam","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, Main Campus, University of Science & Technology, Bannu 28100, Pakistan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5415-9872","authenticated-orcid":false,"given":"Ibrar","family":"Ullah","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, Kohat Campus, University of Engineering and Technology Peshawar, Kohat 25000, Pakistan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6113-123X","authenticated-orcid":false,"given":"Tahir","family":"Khurshaid","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering, Yeungnam University, Gyeongsan 38541, Republic of Korea"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lunchakorn","family":"Wuttisittikulkij","sequence":"additional","affiliation":[{"name":"Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4310-7054","authenticated-orcid":false,"given":"Shashi","family":"Shah","sequence":"additional","affiliation":[{"name":"Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Syed Mansoor","family":"Ali","sequence":"additional","affiliation":[{"name":"Department of Physics and Astronomy, College of Science, King Saud University, P.O. Box 2455, Riyadh 11451, Saudi Arabia"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Mohammad","family":"Alibakhshikenari","sequence":"additional","affiliation":[{"name":"Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Legan\u00e9s, 28911 Madrid, Spain"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2023,7,7]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1016\/j.neucom.2018.05.005","article-title":"Speech emotion recognition based on an improved brain emotion learning model","volume":"309","author":"Liu","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"603","DOI":"10.1016\/S0167-6393(03)00099-2","article-title":"Speech emotion recognition using hidden Markov models","volume":"41","author":"Nwe","year":"2003","journal-title":"Speech Commun."},{"key":"ref_3","first-page":"294","article-title":"Emotion recognition from speech with gaussian mixture models via boosted gmm","volume":"3","author":"Patel","year":"2017","journal-title":"Int. J. Res. Sci. Eng."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1154","DOI":"10.1016\/j.dsp.2012.05.007","article-title":"Speech emotion recognition: Features and classification models","volume":"22","author":"Chen","year":"2012","journal-title":"Digit. Signal Process."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1007\/s10772-011-9125-1","article-title":"Emotion recognition from speech: A review","volume":"15","author":"Koolagudi","year":"2012","journal-title":"Int. J. Speech Technol."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.specom.2019.12.001","article-title":"Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers","volume":"116","year":"2020","journal-title":"Speech Commun."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1634","DOI":"10.1109\/TAFFC.2021.3114365","article-title":"Survey of deep representation learning for speech emotion recognition","volume":"14","author":"Latif","year":"2021","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1016\/j.neunet.2017.02.013","article-title":"Evaluating deep learning architectures for Speech Emotion Recognition","volume":"92","author":"Fayek","year":"2017","journal-title":"Neural Netw."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"106547","DOI":"10.1016\/j.knosys.2020.106547","article-title":"Automated accurate speech emotion recognition system using twine shuffle pattern and iterative neighborhood component analysis techniques","volume":"211","author":"Tuncer","year":"2021","journal-title":"Knowl.-Based Syst."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"107316","DOI":"10.1016\/j.knosys.2021.107316","article-title":"A multimodal hierarchical approach to speech emotion recognition from audio and text","volume":"229","author":"Singh","year":"2021","journal-title":"Knowl.-Based Syst."},{"key":"ref_11","first-page":"33","article-title":"Voice analysis using PRAAT software and classification of user emotional state","volume":"5","author":"Magdin","year":"2019","journal-title":"Int. J. Interact. Multimed. Artif. Intell."},{"key":"ref_12","first-page":"112","article-title":"Attention-based Multi-modal Sentiment Analysis and Emotion Detection in Conversation using RNN","volume":"6","author":"Huddar","year":"2021","journal-title":"Int. J. Interact. Multimed. Artif. Intell."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"69","DOI":"10.1109\/TAFFC.2015.2392101","article-title":"Speech emotion recognition using Fourier parameters","volume":"6","author":"Wang","year":"2015","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"2203","DOI":"10.1109\/TMM.2014.2360798","article-title":"Learning salient features for speech emotion recognition using convolutional neural networks","volume":"16","author":"Mao","year":"2014","journal-title":"IEEE Trans. Multimed."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"61672","DOI":"10.1109\/ACCESS.2020.2984368","article-title":"Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network","volume":"8","author":"Ho","year":"2020","journal-title":"IEEE Access"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"107914","DOI":"10.1016\/j.knosys.2021.107914","article-title":"Deepresgru: Residual gated recurrent neural network-augmented kalman filtering for speech enhancement and recognition","volume":"238","author":"Saleem","year":"2022","journal-title":"Knowl.-Based Syst."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"312","DOI":"10.1016\/j.bspc.2018.08.035","article-title":"Speech emotion recognition using deep 1D & 2D CNN LSTM networks","volume":"47","author":"Zhao","year":"2019","journal-title":"Biomed. Signal Process. Control"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1675","DOI":"10.1109\/TASLP.2019.2925934","article-title":"Speech emotion classification using attention-based LSTM","volume":"27","author":"Xie","year":"2019","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Wang, J., Xue, M., Culhane, R., Diao, E., Ding, J., and Tarokh, V. (2020, January 4\u20138). Speech emotion recognition with dual-sequence LSTM architecture. Proceedings of the ICASSP 2020\u20142020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054629"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"106889","DOI":"10.1109\/ACCESS.2020.3000751","article-title":"Robust semisupervised generative adversarial networks for speech emotion recognition via distribution smoothness","volume":"8","author":"Zhao","year":"2020","journal-title":"IEEE Access"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1955","DOI":"10.1007\/s11760-022-02156-9","article-title":"Speech emotion recognition using data augmentation method by cycle-generative adversarial networks","volume":"16","author":"Shilandari","year":"2022","journal-title":"Signal Image Video Process."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"172","DOI":"10.1109\/TNNLS.2020.3027600","article-title":"Improving speech emotion recognition with adversarial data augmentation network","volume":"33","author":"Yi","year":"2020","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"749604","DOI":"10.1155\/2014\/749604","article-title":"A research of speech emotion recognition based on deep belief network and SVM","volume":"2014","author":"Huang","year":"2014","journal-title":"Math. Probl. Eng."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1787","DOI":"10.1007\/s12652-017-0644-8","article-title":"Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition","volume":"14","author":"Huang","year":"2019","journal-title":"J. Ambient. Intell. Humaniz. Comput."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1145\/3129340","article-title":"Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends","volume":"61","author":"Schuller","year":"2018","journal-title":"Commun. ACM"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"75798","DOI":"10.1109\/ACCESS.2019.2921390","article-title":"Exploration of complementary features for speech emotion recognition based on kernel extreme learning machine","volume":"7","author":"Guo","year":"2019","journal-title":"IEEE Access"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Han, K., Yu, D., and Tashev, I. (2014, January 14\u201318). Speech emotion recognition using deep neural network and extreme learning machine. Proceedings of the Interspeech, Singapore.","DOI":"10.21437\/Interspeech.2014-57"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Tiwari, U., Soni, M., Chakraborty, R., Panda, A., and Kopparapu, S.K. (2014, January 4\u20138). Multi-conditioning and data augmentation using generative noise model for speech emotion recognition in noisy conditions. Proceedings of the ICASSP 2020\u20142020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053581"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Badshah, A.M., Ahmad, J., Rahim, N., and Baik, S.W. (2017, January 13\u201315). Speech emotion recognition from spectrograms with deep convolutional neural network. Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea.","DOI":"10.1109\/PlatCon.2017.7883728"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1016\/j.neucom.2021.06.036","article-title":"Affect-salient event sequence modelling for continuous speech emotion recognition","volume":"458","author":"Dong","year":"2021","journal-title":"Neurocomputing"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"104277","DOI":"10.1016\/j.engappai.2021.104277","article-title":"A novel dual attention-based BLSTM with hybrid features in speech emotion recognition","volume":"102","author":"Chen","year":"2021","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"108260","DOI":"10.1016\/j.apacoust.2021.108260","article-title":"Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition","volume":"182","author":"Atila","year":"2021","journal-title":"Appl. Acoust."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"452","DOI":"10.1080\/02699931.2013.837378","article-title":"Gender differences in emotion recognition: Impact of sensory modality and emotional category","volume":"28","author":"Lambrecht","year":"2014","journal-title":"Cogn. Emot."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Fu, C., Liu, C., Ishi, C.T., and Ishiguro, H. (2020). Multi-modality emotion recognition model with GAT-based multi-head inter-modality attention. Sensors, 20.","DOI":"10.3390\/s20174894"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1007\/s10723-021-09564-0","article-title":"Speech expression multimodal emotion recognition based on deep belief network","volume":"19","author":"Liu","year":"2021","journal-title":"J. Grid Comput."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"52","DOI":"10.1016\/j.neunet.2021.03.013","article-title":"Combining a parallel 2d cnn with a self-attention dilated residual network for ctc-based discrete speech emotion recognition","volume":"141","author":"Zhao","year":"2021","journal-title":"Neural Netw."},{"key":"ref_37","first-page":"205","article-title":"Analysis of emotional speech\u2014A review","volume":"1","author":"Gangamohan","year":"2016","journal-title":"Towar. Robot. Soc. Believable Behaving Syst."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1016\/S0167-6393(02)00082-1","article-title":"The role of voice quality in communicating emotion, mood and attitude","volume":"40","author":"Gobl","year":"2003","journal-title":"Speech Commun."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Vlasenko, B., Philippou-H\u00fcbner, D., Prylipko, D., B\u00f6ck, R., Siegert, I., and Wendemuth, A. (2011, January 11\u201315). Vowels formants analysis allows straightforward detection of high arousal emotions. Proceedings of the 2011 IEEE International Conference on Multimedia and Expo, Barcelona, Spain.","DOI":"10.1109\/ICME.2011.6012003"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"293","DOI":"10.1109\/TSA.2004.838534","article-title":"Toward detecting emotions in spoken dialogs","volume":"13","author":"Lee","year":"2005","journal-title":"IEEE Trans. Speech Audio Process."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Schuller, B., and Rigoll, G. (2006, January 17\u201321). Timing levels in segment-based speech emotion recognition. Proceedings of the INTERSPEECH 2006, Proceedings International Conference on Spoken Language Processing ICSLP, Pittsburgh, PA, USA.","DOI":"10.21437\/Interspeech.2006-502"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Lugger, M., and Yang, B. (2007, January 15\u201320). The relevance of voice quality features in speaker independent emotion recognition. Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP\u201907, Honolulu, HI, USA.","DOI":"10.1109\/ICASSP.2007.367152"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"012028","DOI":"10.1088\/1742-6596\/1591\/1\/012028","article-title":"Feature extraction methods: A review","volume":"1591","author":"Mutlag","year":"2005","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Cavalcante, R.C., Minku, L.L., and Oliveira, A.L. (2016, January 24\u201329). Fedd: Feature extraction for explicit concept drift detection in time series. Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada.","DOI":"10.1109\/IJCNN.2016.7727274"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"247","DOI":"10.1016\/j.cmpb.2014.06.013","article-title":"Feature extraction of the first difference of EMG time series for EMG pattern recognition","volume":"177","author":"Phinyomark","year":"2014","journal-title":"Comput. Methods Programs Biomed."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"198","DOI":"10.1515\/teme-2016-0072","article-title":"Automatic feature extraction and selection for classification of cyclical time series data","volume":"84","author":"Schneider","year":"2017","journal-title":"Tech. Mess."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Salau, A.O., and Jain, S. (2019, January 7\u20139). Feature extraction: A survey of the types, techniques, applications. Proceedings of the 2019 International Conference on Signal Processing and Communication (ICSC), Noida, India.","DOI":"10.1109\/ICSC45622.2019.8938371"},{"key":"ref_48","unstructured":"Salau, A.O., Olowoyo, T.D., and Akinola, S.O. (2020). Advances in Computational Intelligence Techniques, Springer."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Zamil, A.A.A., Hasan, S., Baki, S.M.J., Adam, J.M., and Zaman, I. (2019, January 10\u201312). Emotion detection from speech signals using voting mechanism on classified frames. Proceedings of the 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh.","DOI":"10.1109\/ICREST.2019.8644168"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"104886","DOI":"10.1016\/j.knosys.2019.104886","article-title":"Bagged support vector machines for emotion recognition from speech","volume":"184","author":"Bhavan","year":"2019","journal-title":"Knowl.-Based Syst."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Huang, Z., Dong, M., Mao, Q., and Zhan, Y. (2014, January 3\u20137). Speech emotion recognition using CNN. Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA.","DOI":"10.1145\/2647868.2654984"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Latif, S., Rana, R., Younis, S., Qadir, J., and Epps, J. (2018). Transfer learning for improving speech emotion classification accuracy. arXiv.","DOI":"10.21437\/Interspeech.2018-1625"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Xie, B., Sidulova, M., and Park, C.H. (2021). Robust multimodal emotion recognition from conversation with transformer-based crossmodality fusion. Sensors, 21.","DOI":"10.3390\/s21144913"},{"key":"ref_54","unstructured":"Ahmed, M., Islam, S., Islam, A.K.M., and Shatabda, S. (2021). An Ensemble 1D-CNN-LSTM-GRU Model with Data Augmentation for Speech Emotion Recognition. arXiv."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Yu, Y., and Kim, Y.J. (2020). Attention-LSTM-attention model for speech emotion recognition and analysis of IEMOCAP database. Electronics, 9.","DOI":"10.3390\/electronics9050713"},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"106190","DOI":"10.1016\/j.knosys.2020.106190","article-title":"Autoembedder: A semi-supervised DNN embedding system for clustering","volume":"204","author":"Ohi","year":"2020","journal-title":"Knowl.-Based Syst."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"79861","DOI":"10.1109\/ACCESS.2020.2990405","article-title":"Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM","volume":"8","author":"Sajjad","year":"2020","journal-title":"IEEE Access"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Bertero, D., and Fung, P. (2017, January 5\u20139). A first look into a convolutional neural network for speech emotion detection. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953131"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Mekruksavanich, S., Jitpattanakul, A., and Hnoohom, N. (2020, January 11\u201314). Negative emotion recognition using deep learning for Thai language. Proceedings of the 2020 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT & NCON), Pattaya, Thailand.","DOI":"10.1109\/ECTIDAMTNCON48261.2020.9090768"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Anvarjon, T., and Kwon, S. (2020). Deep-net: A lightweight CNN-based speech emotion recognition system using deep frequency features. Sensors, 20.","DOI":"10.3390\/s20185212"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"1576","DOI":"10.1109\/TMM.2017.2766843","article-title":"Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching","volume":"20","author":"Zhang","year":"2017","journal-title":"IEEE Trans. Multimed."},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E., Nicolaou, M.A., Schuller, B., and Zafeiriou, S. (2016, January 20\u201325). Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472669"},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"2133","DOI":"10.3390\/math8122133","article-title":"CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network","volume":"8","author":"Kwon","year":"2020","journal-title":"Mathematics"},{"key":"ref_64","doi-asserted-by":"crossref","first-page":"4097","DOI":"10.1007\/s11063-021-10581-z","article-title":"BLSTM and CNN Stacking Architecture for Speech Emotion Recognition","volume":"53","author":"Li","year":"2021","journal-title":"Neural Process. Lett."},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Zhu, L., Chen, L., Zhao, D., Zhou, J., and Zhang, W. (2017). Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN. Sensors, 17.","DOI":"10.3390\/s17071694"},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"183","DOI":"10.3390\/s20010183","article-title":"A CNN-assisted enhanced audio signal processing for speech emotion recognition","volume":"20","author":"Kwon","year":"2019","journal-title":"Sensors"},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Lieskovsk\u00e1, E., Jakubec, M., Jarina, R., and Chmul\u00edk, M. (2021). A review on speech emotion recognition using deep learning and attention mechanism. Electronics, 10.","DOI":"10.3390\/electronics10101163"},{"key":"ref_68","doi-asserted-by":"crossref","first-page":"107101","DOI":"10.1016\/j.asoc.2021.107101","article-title":"Att-Net: Enhanced emotion recognition system using lightweight self-attention module","volume":"102","author":"Kwon","year":"2021","journal-title":"Appl. Soft Comput."},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Chen, S., Zhang, M., Yang, X., Zhao, Z., Zou, T., and Sun, X. (2021). The impact of attention mechanisms on speech emotion recognition. Sensors, 21.","DOI":"10.3390\/s21227530"},{"key":"ref_70","doi-asserted-by":"crossref","unstructured":"Li, Y., Zhao, T., and Kawahara, T. (2019, January 15\u201319). Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. Proceedings of the Interspeech 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-2594"},{"key":"ref_71","doi-asserted-by":"crossref","unstructured":"Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., and Vepa, J. (2018, January 2\u20136). Speech Emotion Recognition Using Spectrogram Phoneme Embedding. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1811"},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Sarma, M., Ghahremani, P., Povey, D., Goel, N.K., Sarma, K.K., and Dehak, N. (2018, January 2\u20136). Emotion Identification from Raw Speech Signals Using DNNs. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1353"},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Mirsamadi, S., Barsoum, E., and Zhang, C. (2017, January 5\u20139). Automatic speech emotion recognition using recurrent neural networks with local attention. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952552"},{"key":"ref_74","doi-asserted-by":"crossref","first-page":"101894","DOI":"10.1016\/j.bspc.2020.101894","article-title":"Speech emotion recognition with deep convolutional neural networks","volume":"59","author":"Issa","year":"2020","journal-title":"Biomed. Signal Process. Control"},{"key":"ref_75","doi-asserted-by":"crossref","first-page":"889","DOI":"10.1007\/s10489-020-01839-5","article-title":"A multi-layer and multi-ensemble stock trader using deep learning and deep reinforcement learning","volume":"51","author":"Carta","year":"2021","journal-title":"Appl. Intell."},{"key":"ref_76","doi-asserted-by":"crossref","first-page":"108078","DOI":"10.1016\/j.cie.2022.108078","article-title":"Multi-head attention fusion networks for multi-modal speech emotion recognition","volume":"168","author":"Zhang","year":"2022","journal-title":"Comput. Ind. Eng."},{"key":"ref_77","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1186\/s40537-022-00619-x","article-title":"Detection of fake news and hate speech for Ethiopian languages: A systematic review of the approaches","volume":"9","author":"Demilie","year":"2022","journal-title":"J. Big Data"},{"key":"ref_78","doi-asserted-by":"crossref","unstructured":"Bautista, J.L., Lee, Y.K., and Shin, H.S. (2022). Speech Emotion Recognition Based on Parallel CNN-Attention Networks with Multi-Fold Data Augmentation. Electronics, 11.","DOI":"10.3390\/electronics11233935"},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Abeje, B.T., Salau, A.O., Ebabu, H.A., and Ayalew, A.M. (2022, January 23\u201325). Comparative Analysis of Deep Learning Models for Aspect Level Amharic News Sentiment Analysis. Proceedings of the 2022 International Conference on Decision Aid Sciences and Applications (DASA), Chiangrai, Thailand.","DOI":"10.1109\/DASA54658.2022.9765172"},{"key":"ref_80","doi-asserted-by":"crossref","first-page":"125538","DOI":"10.1109\/ACCESS.2022.3225684","article-title":"Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features","volume":"10","author":"Kakuba","year":"2022","journal-title":"IEEE Access"},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Tao, H., Geng, L., Shan, S., Mai, J., and Fu, H. (2022). Multi-Stream Convolution-Recurrent Neural Networks Based on Attention Mechanism Fusion for Speech Emotion Recognition. Entropy, 24.","DOI":"10.3390\/e24081025"},{"key":"ref_82","doi-asserted-by":"crossref","first-page":"114177","DOI":"10.1016\/j.eswa.2020.114177","article-title":"MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach","volume":"167","author":"Kwon","year":"2021","journal-title":"Expert Syst. Appl."},{"key":"ref_83","first-page":"1","article-title":"Attention is all you need","volume":"17","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_84","doi-asserted-by":"crossref","unstructured":"Livingstone, S.R., and Russo, F.A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13.","DOI":"10.1371\/journal.pone.0196391"},{"key":"ref_85","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","article-title":"IEMOCAP: Interactive emotional dyadic motion capture database","volume":"42","author":"Busso","year":"2008","journal-title":"Lang. Resour. Eval."},{"key":"ref_86","doi-asserted-by":"crossref","first-page":"3705","DOI":"10.1007\/s11042-017-5539-3","article-title":"Spectrogram based multi-task audio classification","volume":"78","author":"Zeng","year":"2019","journal-title":"Multimed. Tools Appl."},{"key":"ref_87","doi-asserted-by":"crossref","first-page":"119797","DOI":"10.1016\/j.eswa.2023.119797","article-title":"E2E-DASR: End-to-end deep learning-based dysarthric automatic speech recognition","volume":"222","author":"Almadhor","year":"2023","journal-title":"Expert Syst. Appl."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/13\/6212\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:07:50Z","timestamp":1760126870000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/13\/6212"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,7]]},"references-count":87,"journal-issue":{"issue":"13","published-online":{"date-parts":[[2023,7]]}},"alternative-id":["s23136212"],"URL":"https:\/\/doi.org\/10.3390\/s23136212","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,7]]}}}