{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T18:31:21Z","timestamp":1770748281689,"version":"3.49.0"},"reference-count":44,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2019,5,8]],"date-time":"2019-05-08T00:00:00Z","timestamp":1557273600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"BAP-C project, Eastern Mediterranean University","award":["BAP-C-02-18-0001"],"award-info":[{"award-number":["BAP-C-02-18-0001"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Detecting human intentions and emotions helps improve human\u2013robot interactions. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on analysis of speech signals. Firstly, we split each speech signal into overlapping frames of the same length. Next, we extract an 88-dimensional vector of audio features including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity for each of the respective frames. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, by applying k-means clustering on the extracted features of all frames of each audio signal, we select k most discriminant frames, namely keyframes, to summarize the speech signal. Then, the sequence of the corresponding spectrograms of keyframes is encapsulated in a 3D tensor. These tensors are used to train and test a 3D Convolutional Neural network using a 10-fold cross-validation approach. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE\u201905 databases. 
The results are superior to the state-of-the-art methods reported in the literature.<\/jats:p>","DOI":"10.3390\/e21050479","type":"journal-article","created":{"date-parts":[[2019,5,13]],"date-time":"2019-05-13T11:00:57Z","timestamp":1557745257000},"page":"479","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":144,"title":["3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms"],"prefix":"10.3390","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3120-5370","authenticated-orcid":false,"given":"Noushin","family":"Hajarolasvadi","sequence":"first","affiliation":[{"name":"Department of Electrical and Electronics Engineering, Eastern Mediterranean University, 99628 Gazimagusa, North Cyprus, via Mersin 10, Turkey"}]},{"given":"Hasan","family":"Demirel","sequence":"additional","affiliation":[{"name":"Department of Electrical and Electronics Engineering, Eastern Mediterranean University, 99628 Gazimagusa, North Cyprus, via Mersin 10, Turkey"}]}],"member":"1968","published-online":{"date-parts":[[2019,5,8]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"467","DOI":"10.1007\/s10470-017-1006-3","article-title":"Real-time ensemble based face recognition system for NAO humanoids using local binary pattern","volume":"92","author":"Bolotnikova","year":"2017","journal-title":"Analog Integr. Circuits Signal Process."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"26391","DOI":"10.1109\/ACCESS.2018.2831927","article-title":"Dominant and Complementary Emotion Recognition From Still Images of Faces","volume":"6","author":"Guo","year":"2018","journal-title":"IEEE Access"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Schroff, F., Kalenichenko, D., and Philbin, J. (2015, January 7\u201312). Facenet: A unified embedding for face recognition and clustering. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1109\/T-AFFC.2011.37","article-title":"Multimodal emotion recognition in response to videos","volume":"3","author":"Soleymani","year":"2012","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1007\/s12193-009-0025-5","article-title":"Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis","volume":"3","author":"Kessous","year":"2010","journal-title":"J. Multimodal User Interfaces"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"239","DOI":"10.1007\/s10772-017-9396-2","article-title":"Vocal-based emotion recognition using random forests and decision tree","volume":"20","author":"Noroozi","year":"2017","journal-title":"Int. J. Speech Technol."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"2203","DOI":"10.1109\/TMM.2014.2360798","article-title":"Learning salient features for speech emotion recognition using convolutional neural networks","volume":"16","author":"Mao","year":"2014","journal-title":"IEEE Trans. Multimedia"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"22081","DOI":"10.1109\/ACCESS.2017.2761539","article-title":"3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition","volume":"5","author":"Torfi","year":"2017","journal-title":"IEEE Access"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Avots, E., Sapi\u0144ski, T., Bachmann, M., and Kami\u0144ska, D. (2018). Audiovisual emotion recognition in wild. Mach. Vis. Appl., 1\u201311.","DOI":"10.1007\/s00138-018-0960-9"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Kim, Y., Lee, H., and Provost, E.M. (2013, January 26\u201331). 
Deep learning for robust feature generation in audiovisual emotion recognition. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638346"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Jaitly, N., and Hinton, G. (2011, January 22\u201327). Learning a better representation of speech soundwaves using restricted boltzmann machines. Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic.","DOI":"10.1109\/ICASSP.2011.5947700"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Palaz, D., and Collobert, R. (2015, January 11\u201315). Analysis of cnn-based speech recognition system using raw speech as input. Proceedings of the INTERSPEECH 2015, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-3"},{"key":"ref_13","unstructured":"Schl\u00fcter, J., and Grill, T. (2015, January 26\u201330). Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks. Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Badshah, A.M., Ahmad, J., Rahim, N., and Baik, S.W. (2017, January 13\u201315). Speech emotion recognition from spectrograms with deep convolutional neural network. Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, South Korea.","DOI":"10.1109\/PlatCon.2017.7883728"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"1533","DOI":"10.1109\/TASLP.2014.2339736","article-title":"Convolutional neural networks for speech recognition","volume":"22","author":"Mohamed","year":"2014","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. 
Process."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"130","DOI":"10.1109\/LSP.2010.2100380","article-title":"Spectrogram image feature for sound event classification in mismatched conditions","volume":"18","author":"Dennis","year":"2011","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"936","DOI":"10.1109\/TMM.2008.927665","article-title":"Recognizing human emotional state from audiovisual signals","volume":"10","author":"Wang","year":"2008","journal-title":"IEEE Trans. Multimedia"},{"key":"ref_18","unstructured":"Jackson, P., and Haq, S. (2014). Surrey Audio-Visual Expressed Emotion (SAVEE) Database, University of Surrey."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Martin, O., Kotsia, I., Macq, B., and Pitas, I. (2006, January 3\u20137). The eNTERFACE\u201905 audio-visual emotion database. Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW\u201906), Atlanta, GA, USA.","DOI":"10.1109\/ICDEW.2006.145"},{"key":"ref_20","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"4883","DOI":"10.1007\/s11042-016-4041-7","article-title":"Determining speaker attributes from stress-affected speech in emergency situations with hybrid SVM-DNN architecture","volume":"77","author":"Ahmad","year":"2018","journal-title":"Multimedia Tools Appl."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"5571","DOI":"10.1007\/s11042-017-5292-7","article-title":"Deep features-based speech emotion recognition for smart affective services","volume":"79","author":"Badshah","year":"2019","journal-title":"Multimedia Tools Appl."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., and Vepa, J. (2018, January 2\u20136). 
Speech Emotion Recognition Using Spectrogram & Phoneme Embedding. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1811"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","article-title":"IEMOCAP: Interactive emotional dyadic motion capture database","volume":"42","author":"Busso","year":"2008","journal-title":"Lang. Resour. Eval."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Du, J., Wang, Z.R., and Zhang, J. (2018). Attention Based Fully Convolutional Network for Speech Emotion Recognition. arXiv.","DOI":"10.23919\/APSIPA.2018.8659587"},{"key":"ref_26","unstructured":"Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (NIPS 2012), Curran Associates, Inc."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., and Weiss, B. (2005, January 4\u20138). A database of German emotional speech. Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal.","DOI":"10.21437\/Interspeech.2005-446"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Satt, A., Rozenberg, S., and Hoory, R. (2017, January 20\u201324). Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms. Proceedings of the INTERSPEECH 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-200"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1109\/TAFFC.2017.2713783","article-title":"Audio-visual emotion recognition in video clips","volume":"10","author":"Noroozi","year":"2017","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Paliwal, K.K., Lyons, J.G., and W\u00f3jcicki, K.K. (2010, January 13\u201315). 
Preference for 20\u201340 ms window duration in speech analysis. Proceedings of the 2010 4th International Conference on Signal Processing and Communication Systems (ICSPCS), Gold Coast, QLD, Australia.","DOI":"10.1109\/ICSPCS.2010.5709770"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2014). \u201cLearning spatiotemporal features with 3D convolutional networks\u201d. arXiv.","DOI":"10.1109\/ICCV.2015.510"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1109\/TIT.1982.1056489","article-title":"Least squares quantization in PCM","volume":"28","author":"Lloyd","year":"1982","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1045","DOI":"10.1109\/TIT.2014.2375327","article-title":"Randomized dimensionality reduction for k-means clustering","volume":"61","author":"Boutsidis","year":"2015","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_34","unstructured":"Ioffe, S., and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Kim, J., Truong, K.P., Englebienne, G., and Evers, V. (2017, January 23\u201326). Learning spectro-temporal features with 3D CNNs for speech emotion recognition. Proceedings of the 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, TX, USA.","DOI":"10.1109\/ACII.2017.8273628"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Jiang, D., Cui, Y., Zhang, X., Fan, P., Gonzalez, I., and Sahli, H. (2011). Audio visual emotion recognition based on triple-stream dynamic bayesian network models. 
Proceedings of the International Conference on Affective Computing and Intelligent Interaction, Springer.","DOI":"10.1007\/978-3-642-24600-5_64"},{"key":"ref_37","unstructured":"Kingma, D.P., and Ba, J.L. (2015, January 7\u20139). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference for Learning Representations, San Diego, CA, USA."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Smith, L.N. (2017, January 24\u201331). Cyclical learning rates for training neural networks. Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA.","DOI":"10.1109\/WACV.2017.58"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"ImageNet Large Scale Visual Recognition Challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. Vis."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"150","DOI":"10.1121\/1.1915715","article-title":"The relation of pitch to intensity","volume":"6","author":"Stevens","year":"1935","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Giannakopoulos, T., and Pikrakis, A. (2014). Introduction to Audio Analysis: A MATLAB\u00ae Approach, Academic Press.","DOI":"10.1016\/B978-0-08-099388-1.00001-7"},{"key":"ref_42","unstructured":"Vidyamurthy, G. (2004). Pairs Trading: Quantitative Methods and Analysis, John Wiley & Sons."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1109\/TASSP.1980.1163420","article-title":"Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences","volume":"28","author":"Davis","year":"1980","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Kopparapu, S.K., and Laxminarayana, M. (2010, January 10\u201313). 
Choice of Mel filter bank in computing MFCC of a resampled speech. Proceedings of the 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010), Kuala Lumpur, Malaysia.","DOI":"10.1109\/ISSPA.2010.5605491"}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/21\/5\/479\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T12:50:09Z","timestamp":1760187009000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/21\/5\/479"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,5,8]]},"references-count":44,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2019,5]]}},"alternative-id":["e21050479"],"URL":"https:\/\/doi.org\/10.3390\/e21050479","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,5,8]]}}}