{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,23]],"date-time":"2026-03-23T17:56:51Z","timestamp":1774288611233,"version":"3.50.1"},"reference-count":48,"publisher":"MDPI AG","issue":"14","license":[{"start":{"date-parts":[[2023,7,24]],"date-time":"2023-07-24T00:00:00Z","timestamp":1690156800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Korea Agency for Technology and Standards in 2022","award":["K_G012002236201"],"award-info":[{"award-number":["K_G012002236201"]}]},{"name":"Korea Agency for Technology and Standards in 2022","award":["K_G012002234001"],"award-info":[{"award-number":["K_G012002234001"]}]},{"name":"Korea Agency for Technology and Standards in 2022","award":["G22202202102201"],"award-info":[{"award-number":["G22202202102201"]}]},{"name":"Ministry of Oceans and Fisheries","award":["K_G012002236201"],"award-info":[{"award-number":["K_G012002236201"]}]},{"name":"Ministry of Oceans and Fisheries","award":["K_G012002234001"],"award-info":[{"award-number":["K_G012002234001"]}]},{"name":"Ministry of Oceans and Fisheries","award":["G22202202102201"],"award-info":[{"award-number":["G22202202102201"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Understanding and identifying emotional cues in human speech is a crucial aspect of human\u2013computer communication. The application of computer technology in dissecting and deciphering emotions, along with the extraction of relevant emotional characteristics from speech, forms a significant part of this process. The objective of this study was to architect an innovative framework for speech emotion recognition predicated on spectrograms and semantic feature transcribers, aiming to bolster performance precision by acknowledging the conspicuous inadequacies in extant methodologies and rectifying them. To procure invaluable attributes for speech detection, this investigation leveraged two divergent strategies. Primarily, a wholly convolutional neural network model was engaged to transcribe speech spectrograms. Subsequently, a cutting-edge Mel-frequency cepstral coefficient feature abstraction approach was adopted and integrated with Speech2Vec for semantic feature encoding. These dual forms of attributes underwent individual processing before they were channeled into a long short-term memory network and a comprehensive connected layer for supplementary representation. By doing so, we aimed to bolster the sophistication and efficacy of our speech emotion detection model, thereby enhancing its potential to accurately recognize and interpret emotion from human speech. The proposed mechanism underwent a rigorous evaluation process employing two distinct databases: RAVDESS and EMO-DB. The outcome displayed a predominant performance when juxtaposed with established models, registering an impressive accuracy of 94.8% on the RAVDESS dataset and a commendable 94.0% on the EMO-DB dataset. 
This superior performance underscores the efficacy of our innovative system in the realm of speech emotion recognition, as it outperforms current frameworks in accuracy metrics.<\/jats:p>","DOI":"10.3390\/s23146640","type":"journal-article","created":{"date-parts":[[2023,7,25]],"date-time":"2023-07-25T01:32:10Z","timestamp":1690248730000},"page":"6640","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":22,"title":["Enhancing Speech Emotion Recognition Using Dual Feature Extraction Encoders"],"prefix":"10.3390","volume":"23","author":[{"given":"Ilkhomjon","family":"Pulatov","sequence":"first","affiliation":[{"name":"Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea"}]},{"given":"Rashid","family":"Oteniyazov","sequence":"additional","affiliation":[{"name":"Department of Telecommunication Engineering, Nukus Branch of Tashkent University of Information Technologies Named after Muhammad Al-Khwarizmi, Nukus 230100, Uzbekistan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3594-0137","authenticated-orcid":false,"given":"Fazliddin","family":"Makhmudov","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0184-7599","authenticated-orcid":false,"given":"Young-Im","family":"Cho","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea"}]}],"member":"1968","published-online":{"date-parts":[[2023,7,24]]},"reference":[{"key":"ref_1","first-page":"112897","article-title":"Speech Emotion Recognition Based on SVM with Local Temporal-Spectral Features","volume":"9","author":"He","year":"2021","journal-title":"IEEE Access"},{"key":"ref_2","first-page":"4453","article-title":"Comparative study of SVM and KNN classifiers on speech emotion recognition based on prosody features","volume":"11","author":"Dhouha","year":"2020","journal-title":"J. Ambient Intell. Humaniz. Comput."},{"key":"ref_3","first-page":"5625","article-title":"Multi-modal Speech Emotion Recognition using SVM Classifier with Semi-Supervised Learning","volume":"12","author":"Shalini","year":"2021","journal-title":"J. Ambient Intell. Humaniz. Comput."},{"key":"ref_4","unstructured":"Schuller, B., Rigoll, G., and Lang, M. (2005, January 4\u20138). Hidden Markov model-based speech emotion recognition. Proceedings of the 9th European Conference on Speech Communication and Technology, Lisbon, Portugal."},{"key":"ref_5","first-page":"1665","article-title":"Speech Emotion Recognition Based on HMM and Spiking Neural Network","volume":"31","author":"Liu","year":"2020","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1162","DOI":"10.1016\/j.specom.2006.04.003","article-title":"Emotional speech recognition: Resources, features, and methods","volume":"48","author":"Ververidis","year":"2006","journal-title":"Speech Commun."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1109\/TAFFC.2015.2457417","article-title":"The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing","volume":"7","author":"Eyben","year":"2015","journal-title":"IEEE Trans. Affect. 
Comput."},{"key":"ref_8","first-page":"2734","article-title":"Speech Emotion Recognition using Gaussian Mixture Model with Deep Learning Techniques","volume":"10","author":"Reddy","year":"2021","journal-title":"Int. J. Innov. Technol. Explor. Eng."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Li, J., Zhang, X., Huang, L., Li, F., Duan, S., and Sun, Y. (2022). Speech Emotion Recognition Using a Dual-Channel Complementary Spectrogram and the CNN-SSAE Neutral Network. Appl. Sci., 12.","DOI":"10.3390\/app12199518"},{"key":"ref_10","first-page":"1214","article-title":"Speech Emotion Recognition Using Convolutional Neural Networks and Spectral Features","volume":"10","author":"Kim","year":"2020","journal-title":"Appl. Sci."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Aggarwal, A., Srivastava, A., Agarwal, A., Chahal, N., Singh, D., Alnuaim, A.A., Alhadlaq, A., and Lee, H.-N. (2022). Two-Way Feature Extraction for Speech Emotion Recognition Using Deep Learning. Sensors, 22.","DOI":"10.3390\/s22062378"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Makhmudov, F., Kutlimuratov, A., Akhmedov, F., Abdallah, M.S., and Cho, Y.-I. (2022). Modeling Speech Emotion Recognition via Attention-Oriented Parallel CNN Encoders. Electronics, 11.","DOI":"10.3390\/electronics11234047"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Atmaja, B.T., and Sasou, A. (2022). Sentiment Analysis and Emotion Recognition from Speech Using Universal Speech Representations. Sensors, 22.","DOI":"10.3390\/s22176369"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"466","DOI":"10.1007\/s00034-020-01486-8","article-title":"DNN-HMM-Based Speaker-Adaptive Emotion Recognition Using MFCC and Epoch-Based Features","volume":"40","author":"Fahad","year":"2021","journal-title":"Circuits Syst. Signal Process."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Mamieva, D., Abdusalomov, A.B., Kutlimuratov, A., Muminov, B., and Whangbo, T.K. (2023). Multimodal Emotion Detection via Attention-Based Fusion of Extracted Facial and Speech Features. Sensors, 23.","DOI":"10.3390\/s23125475"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Gong, Y., Chung, Y., and Glass, J.R. (2021). AST: Audio Spectrogram Transformer. arXiv.","DOI":"10.21437\/Interspeech.2021-698"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Toyoshima, I., Okada, Y., Ishimaru, M., Uchiyama, R., and Tada, M. (2023). Multi-Input Speech Emotion Recognition Model Using Mel Spectrogram and GeMAPS. Sensors, 23.","DOI":"10.3390\/s23031743"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Jiang, W., Wang, Z., Jin, J.S., Han, X., and Li, C. (2019). Speech Emotion Recognition with Heterogeneous Feature Unification of Deep Neural Network. Sensors, 19.","DOI":"10.3390\/s19122730"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Kutlimuratov, A., Abdusalomov, A., and Whangbo, T.K. (2020). Evolving Hierarchical and Tag Information via the Deeply Enhanced Weighted Non-Negative Matrix Factorization of Rating Predictions. Symmetry, 12.","DOI":"10.3390\/sym12111930"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Guo, Y., Xiong, X., Liu, Y., Xu, L., and Li, Q. (2022). A novel speech emotion recognition method based on feature construction and ensemble learning. 
PLoS ONE, 17.","DOI":"10.1371\/journal.pone.0267132"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"29","DOI":"10.1016\/j.procs.2015.10.020","article-title":"Emotion detection using MFCC and cepstrum features","volume":"70","author":"Lalitha","year":"2015","journal-title":"Procedia Comput. Sci."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1067729","DOI":"10.3389\/fnbot.2022.1067729","article-title":"Dance emotion recognition based on linear predictive Meir frequency cepstrum coefficient and bidirectional long short-term memory from robot environment","volume":"16","author":"Shen","year":"2022","journal-title":"Front. Neurorobot."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"15563","DOI":"10.1007\/s11042-020-10329-2","article-title":"Convolution neural network based automatic speech emotion recognition using Mel-frequency Cepstrum coefficients","volume":"80","author":"Pawar","year":"2021","journal-title":"Multimed. Tools Appl."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1007\/s10772-020-09672-4","article-title":"Feature extraction algorithms to improve the speech emotion recognition rate","volume":"23","author":"Anusha","year":"2020","journal-title":"Int. J. Speech Technol."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1016\/j.apacoust.2018.08.003","article-title":"Investigation of the effect of spectrogram images and different texture analysis methods on speech emotion recognition","volume":"142","author":"Ozseven","year":"2018","journal-title":"Appl. Acoust."},{"key":"ref_26","unstructured":"Peng, S., Chen, K., Tian, T., and Chen, J. (2022). An autoencoder-based feature level fusion for speech emotion recognition. Digit. Commun. Netw."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1675","DOI":"10.1109\/TASLP.2019.2925934","article-title":"Speech Emotion Classification Using Attention-Based LSTM","volume":"27","author":"Xie","year":"2019","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Tzirakis, P., Nguyen, A., Zafeiriou, S., and Schuller, B.W. (2021). Speech Emotion Recognition using Semantic Information. arXiv.","DOI":"10.1109\/ICASSP39728.2021.9414866"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"125538","DOI":"10.1109\/ACCESS.2022.3225684","article-title":"Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features","volume":"10","author":"Kakuba","year":"2022","journal-title":"IEEE Access"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yoon, S., Byun, S., and Jung, K. (2018, January 18\u201321). Multimodal Speech Emotion Recognition Using Audio and Text. Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece.","DOI":"10.1109\/SLT.2018.8639583"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Xu, H., Zhang, H., Han, K., Wang, Y., Peng, Y., and Li, X. (2019, January 15\u201319). Learning Alignment for Multimodal Emotion Recognition from Speech. Proceedings of the INTERSPEECH 2019: 20th Annual Conference of the International Speech Communication Association, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-3247"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Huang, L., and Shen, X. (2022). Research on Speech Emotion Recognition Based on the Fractional Fourier Transform. 
Electronics, 11.","DOI":"10.3390\/electronics11203393"},{"key":"ref_33","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20136). Imagenet classification with deep convolutional neural networks. Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Satt, A., Rozenberg, S., and Hoory, R. (2017, January 20\u201324). Efficient emotion recognition from speech using deep learning on spectrograms. Proceedings of the INTERSPEECH 2017: 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-200"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Mocanu, B., Tapu, R., and Zaharia, T. (2021). Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition. Sensors, 21.","DOI":"10.3390\/s21124233"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Livingstone, S.R., and Russo, F.A. (2018). The ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in north American english. PLoS ONE, 13.","DOI":"10.1371\/journal.pone.0196391"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Burkhardt, F., Paeschke, A., Rolfes, A., Sendlmeier, W.F., and Weiss, B. (2005, January 4\u20138). A database of German emotional speech. Proceedings of the 9th European Conference on Speech Communication and Technology, Lisbon, Portugal.","DOI":"10.21437\/Interspeech.2005-446"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Dai, W., Cahyawijaya, S., Liu, Z., and Fung, P. (2021, January 6\u201311). Multimodal end-to-end sparse model for emotion recognition. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online.","DOI":"10.18653\/v1\/2021.naacl-main.417"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1109\/MIS.2018.2882362","article-title":"Multimodal sentiment analysis: Addressing key issues and setting up the baselines","volume":"33","author":"Poria","year":"2018","journal-title":"IEEE Intell. Syst."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Al-onazi, B.B., Nauman, M.A., Jahangir, R., Malik, M.M., Alkhammash, E.H., and Elshewey, A.M. (2022). Transformer-Based Multilingual Speech Emotion Recognition Using Data Augmentation and Feature Fusion. Appl. Sci., 12.","DOI":"10.3390\/app12189188"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Bhangale, K., and Kothandaraman, M. (2023). Speech Emotion Recognition Based on Multiple Acoustic Features and Deep Convolutional Neural Network. Electronics, 12.","DOI":"10.3390\/electronics12040839"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"221640","DOI":"10.1109\/ACCESS.2020.3043201","article-title":"A novel approach for classification of speech emotions based on deep and acoustic features","volume":"8","author":"Bilal","year":"2020","journal-title":"IEEE Access"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"104886","DOI":"10.1016\/j.knosys.2019.104886","article-title":"Bagged Support Vector Machines for Emotion Recognition from Speech","volume":"184","author":"Bhavan","year":"2019","journal-title":"Knowl. Based Syst."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Markl, N. (2022, January 21\u201324). 
Language variation and algorithmic bias: Understanding algorithmic bias in British English automatic speech recognition. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT \u203222), Seoul, Republic of Korea.","DOI":"10.1145\/3531146.3533117"},{"key":"ref_45","unstructured":"Meyer, J., Rauchenstein, L., Eisenberg, J.D., and Howell, N. (2020, January 11\u201316). Artie Bias Corpus: An Open Dataset for Detecting Demographic Bias in Speech Applications. Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Padilla, J.J., Kavak, H., Lynch, C.J., Gore, R.J., and Diallo, S.Y. (2018). Temporal and spatiotemporal investigation of tourist attraction visit sentiment on Twitter. PLoS ONE, 13.","DOI":"10.1371\/journal.pone.0198857"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Safarov, F., Kutlimuratov, A., Abdusalomov, A.B., Nasimov, R., and Cho, Y.-I. (2023). Deep Learning Recommendations of E-Education Based on Clustering and Sequence. Electronics, 12.","DOI":"10.3390\/electronics12040809"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Ilyosov, A., Kutlimuratov, A., and Whangbo, T.-K. (2021). Deep-Sequence\u2013Aware Candidate Generation for e-Learning System. Processes, 9.","DOI":"10.3390\/pr9081454"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/14\/6640\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:18:03Z","timestamp":1760127483000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/14\/6640"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,24]]},"references-count":48,"journal-issue":{"issue":"14","published-online":{"date-parts":[[2023,7]]}},"alternative-id":["s23146640"],"URL":"https:\/\/doi.org\/10.3390\/s23146640","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,24]]}}}
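
The abstract in this record describes a dual-encoder architecture: a fully convolutional network over spectrograms, an MFCC-plus-Speech2Vec semantic stream, and an LSTM with a fully connected head for classification. The following is a minimal PyTorch sketch of that design, not the authors' published code; all layer sizes, the 40-dimensional MFCC and 100-dimensional embedding shapes, and the names SpectrogramFCN and DualEncoderSER are assumptions chosen for illustration.

```python
# Illustrative sketch (not the authors' implementation) of the dual-encoder
# speech emotion recognition model described in the abstract above.
import torch
import torch.nn as nn

class SpectrogramFCN(nn.Module):
    """Fully convolutional encoder for (batch, 1, freq, time) spectrograms."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, out_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling keeps the encoder fully convolutional
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (batch, out_dim)

class DualEncoderSER(nn.Module):
    """Processes the two feature streams separately, then fuses them in an
    LSTM-plus-fully-connected head, as the abstract describes."""
    def __init__(self, mfcc_dim=40, sem_dim=100, hidden=128, n_emotions=8):
        super().__init__()
        self.spec_enc = SpectrogramFCN(out_dim=hidden)
        # Semantic branch: per-frame MFCCs concatenated with Speech2Vec-style
        # embeddings (both dimensions are assumptions for this sketch).
        self.lstm = nn.LSTM(mfcc_dim + sem_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, n_emotions),
        )

    def forward(self, spec, mfcc, sem):
        spec_feat = self.spec_enc(spec)           # (batch, hidden)
        seq = torch.cat([mfcc, sem], dim=-1)      # (batch, time, mfcc_dim + sem_dim)
        _, (h_n, _) = self.lstm(seq)              # final hidden state of the LSTM
        fused = torch.cat([spec_feat, h_n[-1]], dim=-1)
        return self.classifier(fused)             # emotion logits

# Smoke test with random tensors shaped like one RAVDESS-style batch
# (8 emotion classes; all shapes are illustrative).
model = DualEncoderSER()
logits = model(torch.randn(4, 1, 128, 256),   # spectrograms
               torch.randn(4, 120, 40),       # MFCC frames
               torch.randn(4, 120, 100))      # semantic embeddings
print(logits.shape)  # torch.Size([4, 8])
```

Concatenating the pooled spectrogram code with the LSTM's final hidden state is one simple fusion choice; the paper's exact fusion scheme and layer configuration may differ.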