{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T23:10:54Z","timestamp":1775862654439,"version":"3.50.1"},"reference-count":62,"publisher":"MDPI AG","issue":"18","license":[{"start":{"date-parts":[[2020,9,12]],"date-time":"2020-09-12T00:00:00Z","timestamp":1599868800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003725","name":"National Research Foundation of Korea","doi-asserted-by":"publisher","award":["NRF-2020R1F1A1060659"],"award-info":[{"award-number":["NRF-2020R1F1A1060659"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, a speech emotion recognition (SER) system evaluates the emotional state of the speaker by investigating his\/her speech signal. Emotion recognition is a challenging task for a machine, and making a system smart enough to recognize emotions efficiently with AI is equally challenging. The speech signal is hard to examine using signal processing methods because it consists of different frequencies and features that vary according to emotions, such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Although different algorithms have been developed for SER, their success rates vary considerably with the language, the emotions, and the database. In this paper, we propose a new lightweight and effective SER model with low computational complexity and high recognition accuracy. 
The suggested method uses a convolutional neural network (CNN) to learn deep frequency features by using a plain rectangular filter with a modified pooling strategy that has more discriminative power for SER. The proposed CNN model was trained on the frequency features extracted from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated on two benchmark speech datasets, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Berlin Emotional Speech Database (EMO-DB), obtaining recognition accuracies of 77.01% and 92.02%, respectively. The experimental results demonstrated that the proposed CNN-based SER system can achieve better recognition performance than state-of-the-art SER systems.<\/jats:p>","DOI":"10.3390\/s20185212","type":"journal-article","created":{"date-parts":[[2020,9,13]],"date-time":"2020-09-13T21:11:32Z","timestamp":1600031492000},"page":"5212","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":156,"title":["Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features"],"prefix":"10.3390","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7419-494X","authenticated-orcid":false,"given":"Tursunov","family":"Anvarjon","sequence":"first","affiliation":[{"name":"Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8020-3590","authenticated-orcid":false,"family":"Mustaqeem","sequence":"additional","affiliation":[{"name":"Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5451-8815","authenticated-orcid":false,"given":"Soonil","family":"Kwon","sequence":"additional","affiliation":[{"name":"Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, 
Korea"}]}],"member":"1968","published-online":{"date-parts":[[2020,9,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"385","DOI":"10.1109\/TAFFC.2015.2432810","article-title":"Recognizing emotions induced by affective sounds through heart rate variability","volume":"6","author":"Nardelli","year":"2015","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_2","first-page":"183","article-title":"A CNN-Assisted enhanced audio signal processing for speech emotion recognition","volume":"20","author":"Kwon","year":"2020","journal-title":"Sensors"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1007\/s10772-018-9491-z","article-title":"Databases, features and classifiers for speech emotion recognition: A review","volume":"21","author":"Swain","year":"2018","journal-title":"Int. J. Speech Technol."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"5571","DOI":"10.1007\/s11042-017-5292-7","article-title":"Deep features-based speech emotion recognition for smart affective services","volume":"78","author":"Badshah","year":"2019","journal-title":"Multimed. Tools Appl."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Pandey, S.K., Shekhawat, H., and Prasanna, S. (2019, January 16\u201318). Deep learning techniques for speech emotion recognition: A review. 
Proceedings of the 2019 29th International Conference Radioelektronika (RADIOELEKTRONIKA), Pardubice, Czech Republic.","DOI":"10.1109\/RADIOELEK.2019.8733432"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"117327","DOI":"10.1109\/ACCESS.2019.2936124","article-title":"Speech emotion recognition using deep learning techniques: A review","volume":"7","author":"Khalil","year":"2019","journal-title":"IEEE Access"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"572","DOI":"10.1016\/j.patcog.2010.09.020","article-title":"Survey on speech emotion recognition: Features, classification schemes, and databases","volume":"44","author":"Kamel","year":"2011","journal-title":"Pattern Recognit."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Schuller, B., Steidl, S., Batliner, A., Vinciarelli, A., Scherer, K., Ringeval, F., Chetouani, M., Weninger, F., Eyben, F., and Marchi, E. (2013, January 25\u201329). The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. Proceedings of the INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France.","DOI":"10.21437\/Interspeech.2013-56"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"320","DOI":"10.1016\/j.apacoust.2018.11.028","article-title":"A novel feature selection method for speech emotion recognition","volume":"146","year":"2019","journal-title":"Appl. Acoust."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"216","DOI":"10.1016\/j.dsp.2017.10.016","article-title":"Prominence features: Effective emotional features for speech emotion recognition","volume":"72","author":"Jing","year":"2018","journal-title":"Digit. Signal Process."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Zhu, L., Chen, L., Zhao, D., Zhou, J., and Zhang, W. (2017). Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN. 
Sensors, 17.","DOI":"10.3390\/s17071694"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1862","DOI":"10.1109\/TPAMI.2019.2899857","article-title":"Exploiting unlabeled data in cnns by self-supervised learning to rank","volume":"41","author":"Liu","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"79861","DOI":"10.1109\/ACCESS.2020.2990405","article-title":"Clustering based speech emotion recognition by incorporating learned features and deep BiLSTM","volume":"8","author":"Mustaqeem","year":"2020","journal-title":"IEEE Access"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"67718","DOI":"10.1109\/ACCESS.2019.2916828","article-title":"Insights into LSTM fully convolutional networks for time series classification","volume":"7","author":"Karim","year":"2019","journal-title":"IEEE Access"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1016\/j.patcog.2018.12.026","article-title":"Time series feature learning with labeled and unlabeled data","volume":"89","author":"Wang","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Naqvi, R.A., Arsalan, M., Rehman, A., Rehman, A.U., Loh, W.K., and Paul, A. (2020). Deep learning-based drivers emotion classification system in time series data for remote applications. Remote Sens., 12.","DOI":"10.3390\/rs12030587"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"10767","DOI":"10.1109\/ACCESS.2019.2891838","article-title":"Effective combination of DenseNet and BiLSTM for keyword spotting","volume":"7","author":"Zeng","year":"2019","journal-title":"IEEE Access"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Tao, F., and Liu, G. (2018, January 15\u201320). Advanced LSTM: A study about better time dependency modeling in emotion recognition. 
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461750"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"3864","DOI":"10.1109\/TII.2018.2885700","article-title":"Learning shapelet patterns from network-based time series","volume":"15","author":"Wang","year":"2018","journal-title":"IEEE Trans. Ind. Inform."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Huang, Z., Dong, M., Mao, Q., and Zhan, Y. (2014, January 3\u20137). Speech emotion recognition using CNN. Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA.","DOI":"10.1145\/2647868.2654984"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1576","DOI":"10.1109\/TMM.2017.2766843","article-title":"Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching","volume":"20","author":"Zhang","year":"2017","journal-title":"IEEE Trans. Multimed."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Ren, Z., Cummins, N., Pandit, V., Han, J., Qian, K., and Schuller, B. (2018, January 23\u201326). Learning image-based representations for heart sound classification. Proceedings of the 2018 International Conference on Digital Health, Lyon, France.","DOI":"10.1145\/3194658.3194671"},{"key":"ref_23","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012, January 3\u20138). Imagenet classification with deep convolutional neural networks. Proceedings of the Advances in Neural Information Processing Systems 2012, Lake Tahoe, NV, USA."},{"key":"ref_24","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Cummins, N., Amiriparian, S., Hagerer, G., Batliner, A., Steidl, S., and Schuller, B.W. (2017, January 23\u201327). 
An image-based deep spectrum feature representation for the recognition of emotional speech. Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA.","DOI":"10.1145\/3123266.3123371"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Mirsamadi, S., Barsoum, E., and Zhang, C. (2017, January 5\u20139). Automatic speech emotion recognition using recurrent neural networks with local attention. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952552"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Huang, C.-W., and Narayanan, S.S. (2016, January 8\u201312). Attention assisted discovery of sub-utterance structure in speech emotion recognition. Proceedings of the INTERSPEECH, San Francisco, CA, USA.","DOI":"10.21437\/Interspeech.2016-448"},{"key":"ref_28","unstructured":"LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., and Jackel, L.D. (1990). Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems, Morgan Kaufmann Publishers Inc."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1109\/72.554195","article-title":"Face recognition: A convolutional neural-network approach","volume":"8","author":"Lawrence","year":"1997","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_30","unstructured":"Zhang, X., Zhao, J., and LeCun, Y. (2015, January 7\u201312). Character-level convolutional networks for text classification. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., and Stolcke, A. (2018, January 15\u201320). The Microsoft 2017 conversational speech recognition system. 
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) IEEE, Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461870"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1016\/j.ins.2017.02.036","article-title":"Design of image cipher using block-based scrambling and image filtering","volume":"396","author":"Hua","year":"2017","journal-title":"Inf. Sci."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Li, T., Shi, J., Li, X., Wu, J., and Pan, F. (2019). Image encryption based on pixel-level diffusion with dynamic filtering and DNA-level permutation with 3D Latin cubes. Entropy, 21.","DOI":"10.3390\/e21030319"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Latif, S., Rana, R.K., Khalifa, S., Jurdak, R., and Epps, J. (2019). Direct modelling of speech emotion from raw speech. arXiv.","DOI":"10.21437\/Interspeech.2019-3252"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"2203","DOI":"10.1109\/TMM.2014.2360798","article-title":"Learning salient features for speech emotion recognition using convolutional neural networks","volume":"16","author":"Mao","year":"2014","journal-title":"IEEE Trans. Multimed."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Bao, F., Neumann, M., and Vu, N.T. (2019). CycleGAN-based emotion style transfer as data augmentation for speech emotion recognition. Proc. Interspeech, 35\u201337.","DOI":"10.21437\/Interspeech.2019-2293"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"312","DOI":"10.1016\/j.bspc.2018.08.035","article-title":"Speech emotion recognition using deep 1D & 2D CNN LSTM networks","volume":"47","author":"Zhao","year":"2019","journal-title":"Biomed. Signal Process. Control"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"7053","DOI":"10.1007\/s00500-016-2247-2","article-title":"SVM or deep learning? 
A comparative study on remote sensing image classification","volume":"21","author":"Liu","year":"2017","journal-title":"Soft Comput."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Wu, X., Liu, S., Cao, Y., Li, X., Yu, J., Dai, D., Ma, X., Hu, S., Wu, Z., and Liu, X. (2019, January 12\u201317). Speech emotion recognition using capsule networks. Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) IEEE, Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683163"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Liu, C.-L., Yin, F., Wang, D.-H., and Wang, Q.-F. (2011, January 18\u201321). CASIA online and offline Chinese handwriting databases. Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China.","DOI":"10.1109\/ICDAR.2011.17"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Tursunov, A., Kwon, S., and Pang, H.-S. (2019). Discriminating Emotions in the valence dimension from speech using timbre features. Appl. Sci., 9.","DOI":"10.3390\/app9122470"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"1533","DOI":"10.1109\/TASLP.2014.2339736","article-title":"Convolutional neural networks for speech recognition","volume":"22","author":"Mohamed","year":"2014","journal-title":"IEEE ACM Trans. Audio Speech Lang. Process."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Wu, D., Sharma, N., and Blumenstein, M. (2017, January 14\u201319). Recent advances in video-based human action recognition using deep learning: A review. Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, Alaska.","DOI":"10.1109\/IJCNN.2017.7966210"},{"key":"ref_44","first-page":"1929","article-title":"Dropout: A simple way to prevent neural networks from overfitting","volume":"15","author":"Srivastava","year":"2014","journal-title":"J. Mach. Learn. 
Res."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","article-title":"IEMOCAP: Interactive emotional dyadic motion capture database","volume":"42","author":"Busso","year":"2008","journal-title":"Lang. Resour. Eval."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., and Weiss, B. (2005, January 4\u20138). A database of German emotional speech. Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisboa, Portugal.","DOI":"10.21437\/Interspeech.2005-446"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"S\u00f6nmez, Y.\u00dc., and Varol, A. (2020, January 1\u20132). In-Depth analysis of speech production, auditory system, emotion theories and emotion recognition. Proceedings of the 2020 8th International Symposium on Digital Forensics and Security (ISDFS), Beirut, Lebanon.","DOI":"10.1109\/ISDFS49300.2020.9116231"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Shu, L., Xie, J., Yang, M., Li, Z., Li, Z., Liao, D., Xu, X., and Yang, X. (2018). A review of emotion recognition using physiological signals. Sensors, 18.","DOI":"10.3390\/s18072074"},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"250","DOI":"10.1016\/j.ins.2016.01.033","article-title":"An improved method to construct basic probability assignment based on the confusion matrix for classification problem","volume":"340","author":"Deng","year":"2016","journal-title":"Inf. Sci."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"772","DOI":"10.1016\/j.ins.2019.06.064","article-title":"Three-way confusion matrix for classification: A measure driven view","volume":"507","author":"Xu","year":"2020","journal-title":"Inf. Sci."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Chicco, D., and Jurman, G. (2020). 
The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom., 21.","DOI":"10.1186\/s12864-019-6413-7"},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"1440","DOI":"10.1109\/LSP.2018.2860246","article-title":"3-D convolutional recurrent neural networks with attention model for speech emotion recognition","volume":"25","author":"Chen","year":"2018","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"125868","DOI":"10.1109\/ACCESS.2019.2938007","article-title":"Speech emotion recognition from 3D log-mel spectrograms with deep learning network","volume":"7","author":"Meng","year":"2019","journal-title":"IEEE Access"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1016\/j.neunet.2017.02.013","article-title":"Evaluating deep learning architectures for Speech Emotion Recognition","volume":"92","author":"Fayek","year":"2017","journal-title":"Neural Netw."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"75798","DOI":"10.1109\/ACCESS.2019.2921390","article-title":"Exploration of complementary features for speech emotion recognition based on Kernel extreme learning machine","volume":"7","author":"Guo","year":"2019","journal-title":"IEEE Access"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Zheng, W., Yu, J., and Zou, Y. (2015, January 21\u201324). An experimental study of speech emotion recognition based on deep convolutional neural networks. Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII) IEEE, Xi\u2019an, China.","DOI":"10.1109\/ACII.2015.7344669"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Han, K., Yu, D., and Tashev, I. (2014, January 14\u201318). Speech emotion recognition using deep neural network and extreme learning machine. 
Proceedings of the Fifteenth Annual Conference of The International Speech Communication Association, Singapore.","DOI":"10.21437\/Interspeech.2014-57"},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"97515","DOI":"10.1109\/ACCESS.2019.2928625","article-title":"Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition","volume":"7","author":"Zhao","year":"2019","journal-title":"IEEE Access"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Luo, D., Zou, Y., and Huang, D. (2018, January 2\u20136). Investigation on joint representation learning for robust feature extraction in speech emotion recognition. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1832"},{"key":"ref_60","first-page":"8","article-title":"Memento: An emotion-driven lifelogging system with wearables","volume":"15","author":"Jiang","year":"2019","journal-title":"ACM Trans. Sens. Netw. (TOSN)"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"101894","DOI":"10.1016\/j.bspc.2020.101894","article-title":"Speech emotion recognition with deep convolutional neural networks","volume":"59","author":"Issa","year":"2020","journal-title":"Biomed. Signal Process. 
Control"},{"key":"ref_62","doi-asserted-by":"crossref","first-page":"90368","DOI":"10.1109\/ACCESS.2019.2927384","article-title":"Parallelized convolutional recurrent neural network with spectral features for speech emotion recognition","volume":"7","author":"Jiang","year":"2019","journal-title":"IEEE Access"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/18\/5212\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T10:09:29Z","timestamp":1760177369000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/20\/18\/5212"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,9,12]]},"references-count":62,"journal-issue":{"issue":"18","published-online":{"date-parts":[[2020,9]]}},"alternative-id":["s20185212"],"URL":"https:\/\/doi.org\/10.3390\/s20185212","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,12]]}}}