{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,20]],"date-time":"2026-05-20T00:14:37Z","timestamp":1779236077178,"version":"3.51.4"},"reference-count":45,"publisher":"MDPI AG","issue":"12","license":[{"start":{"date-parts":[[2021,6,20]],"date-time":"2021-06-20T00:00:00Z","timestamp":1624147200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100006595","name":"Unitatea Executiva pentru Finantarea Invatamantului Superior, a Cercetarii, Dezvoltarii si Inovarii","doi-asserted-by":"publisher","award":["PN-III-P1-1.1-TE-2019-0420"],"award-info":[{"award-number":["PN-III-P1-1.1-TE-2019-0420"]}],"id":[{"id":"10.13039\/501100006595","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Emotion is a form of high-level paralinguistic information that is intrinsically conveyed by human speech. Automatic speech emotion recognition is an essential challenge for various applications, including mental disease diagnosis, audio surveillance, human behavior understanding, e-learning, and human\u2013machine\/robot interaction. In this paper, we introduce a novel speech emotion recognition method, based on the Squeeze and Excitation ResNet (SE-ResNet) model and fed with spectrogram inputs. In order to overcome the limitations of the state-of-the-art techniques, which fail to provide a robust feature representation at the utterance level, the CNN architecture is extended with a trainable discriminative GhostVLAD clustering layer that aggregates the audio features into a compact, single-utterance vector representation. In addition, an end-to-end neural embedding approach is introduced, based on an emotionally constrained triplet loss function. 
The loss function integrates the relations between the various emotional patterns and thus improves the latent space data representation. The proposed methodology achieves 83.35% and 64.92% global accuracy rates on the RAVDESS and CREMA-D publicly available datasets, respectively. When compared with the results provided by human observers, the gains in global accuracy scores are superior to 24%. Finally, the objective comparative evaluation with state-of-the-art techniques demonstrates accuracy gains of more than 3%.<\/jats:p>","DOI":"10.3390\/s21124233","type":"journal-article","created":{"date-parts":[[2021,6,20]],"date-time":"2021-06-20T21:50:15Z","timestamp":1624225815000},"page":"4233","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":21,"title":["Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition"],"prefix":"10.3390","volume":"21","author":[{"given":"Bogdan","family":"Mocanu","sequence":"first","affiliation":[{"name":"Department of Telecommunications, Faculty of ETTI, University \u201cPolitehnica\u201d of Bucharest, 060042 Bucharest, Romania"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ruxandra","family":"Tapu","sequence":"additional","affiliation":[{"name":"Institut Polytechnique de Paris, T\u00e9l\u00e9com SudParis, Advanced Research and TEchniques for Multidimensional Imaging Systems Department, 9 rue Charles Fourier, 91000 \u00c9vry, France"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Titus","family":"Zaharia","sequence":"additional","affiliation":[{"name":"Institut Polytechnique de Paris, T\u00e9l\u00e9com SudParis, Advanced Research and TEchniques for Multidimensional Imaging Systems Department, 9 rue Charles Fourier, 91000 \u00c9vry, 
France"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,6,20]]},"reference":[{"key":"ref_1","unstructured":"Venkataramanan, K., and Rajamohan, H.R. (2019). Emotion Recognition from Speech. arXiv, Available online: https:\/\/arxiv.org\/abs\/1912.10458."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"572","DOI":"10.1016\/j.patcog.2010.09.020","article-title":"Survey on speech emotion recognition: Features, classification schemes, and databases","volume":"44","author":"Kamel","year":"2011","journal-title":"Pattern Recognit."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"124","DOI":"10.1037\/h0030377","article-title":"Constants across cultures in the face and emotion","volume":"17","author":"Ekman","year":"1971","journal-title":"J. Pers. Soc. Psychol."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"268","DOI":"10.1037\/0033-2909.115.2.268","article-title":"Strong evidence for universals in facial expressions: A reply to Russell\u2019s mistaken critique","volume":"115","author":"Ekman","year":"1994","journal-title":"Psychol. Bull."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Vogt, T., Andre, E., and Wagner, J. (2008). Automatic Recognition of Emotions from Speech: A Review of the Literature and Recommendations for Practical Realization. Affect and Emotion in Human-Computer Interaction, Springer Science and Business Media LLC. [1st ed.].","DOI":"10.1007\/978-3-540-85099-1_7"},{"key":"ref_6","unstructured":"Zhong, Y., Arandjelovic, R., and Zisserman, A. (2018). GhostVLAD for Set-Based Face Recognition. Lecture Notes in Computer Science Proceedings of the Asian Conference on Computer Vision, ACCV, Springer Science and Business Media LLC."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 27\u201330). Squeeze-and-excitation networks. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Huang, J., Tao, J., Liu, B., and Lian, Z. (2020, January 25\u201329). Learning Utterance-Level Representations with Label Smoothing for Speech Emotion Recognition. Proceedings of the Interspeech, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1391"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1437","DOI":"10.1109\/TPAMI.2017.2711011","article-title":"NetVLAD: CNN Architecture for Weakly Supervised Place Recognition","volume":"40","author":"Arandjelovic","year":"2018","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"626","DOI":"10.3758\/BF03192732","article-title":"Emotional category data on images from the international affective picture system","volume":"37","author":"Mikels","year":"2005","journal-title":"Behav. Res. Methods"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Jiang, W., Wang, Z., Jin, J.S., Han, X., and Li, C. (2019). Speech Emotion Recognition with Heterogeneous Feature Unification of Deep Neural Network. Sensors, 19.","DOI":"10.3390\/s19122730"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Erden, M., and Arslan, L. (2011, January 27\u201331). Automatic Detection of Anger in Human-Human Call Center Dialogs. Proceedings of the 12th Annual Conference of the International Speech Communication Association, Florence, Italy.","DOI":"10.21437\/Interspeech.2011-21"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Lugger, M., and Yang, B. (2007, January 4). The Relevance of Voice Quality Features in Speaker Independent Emotion Recognition. 
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing\u2014ICASSP 07, Honolulu, HI, USA.","DOI":"10.1109\/ICASSP.2007.367152"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"293","DOI":"10.1109\/TSA.2004.838534","article-title":"Toward detecting emotions in spoken dialogs","volume":"13","author":"Lee","year":"2005","journal-title":"IEEE Trans. Speech Audio Process."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Jeon, J.H., Xia, R., and Liu, Y. (2011, January 22\u201327). Sentence level emotion recognition based on decisions from subsentence segments. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic.","DOI":"10.1109\/ICASSP.2011.5947464"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"768","DOI":"10.1016\/j.specom.2010.08.013","article-title":"Automatic speech emotion recognition using modulation spectral features","volume":"53","author":"Wu","year":"2011","journal-title":"Speech Commun."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Schuller, B., and Rigoll, G. (2006, January 17\u201321). Timing levels in segment-based speech emotion recognition. Proceedings of the Interspeech 2006, Pittsburgh, PA, USA.","DOI":"10.21437\/Interspeech.2006-502"},{"key":"ref_18","unstructured":"Espinosa, H.P., Garc\u00eda, C.A.R., and Pineda, L.V. (2010, January 14\u201319). Features selection for primitives estimation on emotional speech. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"162","DOI":"10.1109\/T-AFFC.2011.14","article-title":"Interdependencies among Voice Source Parameters in Emotional Speech","volume":"2","author":"Sundberg","year":"2011","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Sun, R., Moore, E., and Torres, J.F. 
(2009, January 19\u201324). Investigating glottal parameters for differentiating emotional categories with similar prosodics. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan.","DOI":"10.1109\/ICASSP.2009.4960632"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Esposito, A., and Jain, L. (2016). Analysis of Emotional Speech\u2014A Review. Toward Robotic Socially Believable Behaving Systems\u2014Volume I, Springer.","DOI":"10.1007\/978-3-319-31053-4"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Albanie, S., Nagrani, A., Vedaldi, A., and Zisserman, A. (2018, January 22\u201326). Emotion Recognition in Speech using Cross-Modal Transfer in the Wild. Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Korea.","DOI":"10.1145\/3240508.3240578"},{"key":"ref_23","first-page":"19508194","article-title":"Metric Learning Based Multimodal Audio-visual Emotion Recognition","volume":"27","author":"Ghaleb","year":"2019","journal-title":"IEEE MultiMedia"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Yeh, L.Y., and Tai-Shih, C. (2010, January 26\u201330). Spectro-temporal modulations for robust speech emotion recognition. Proceedings of the Interspeech 2010, Makuhari, Chiba, Japan.","DOI":"10.21437\/Interspeech.2010-286"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"104886","DOI":"10.1016\/j.knosys.2019.104886","article-title":"Bagged support vector machines for emotion recognition from speech","volume":"184","author":"Bhavan","year":"2019","journal-title":"Knowl. Based Syst."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"101894","DOI":"10.1016\/j.bspc.2020.101894","article-title":"Speech emotion recognition with deep convolutional neural networks","volume":"59","author":"Issa","year":"2020","journal-title":"Biomed. Signal Process. 
Control."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Amer, M.R., Siddiquie, B., Richey, C., and Divakaran, A. (2014, January 14). Emotion detection in speech using deep networks. Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6854297"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"2203","DOI":"10.1109\/TMM.2014.2360798","article-title":"Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks","volume":"16","author":"Mao","year":"2014","journal-title":"IEEE Trans. Multimed."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1238","DOI":"10.21437\/Interspeech.2017-619","article-title":"Speech Emotion Recognition with Emotion-Pair Based Framework Considering Emotion Distribution Information in Dimensional Emotion Space","volume":"2017","author":"Ma","year":"2017","journal-title":"Interspeech"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Lian, Z., Li, Y., Tao, J., and Huang, J. (2018, January 26). Speech Emotion Recognition via Contrastive Loss under Siamese Networks. Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and first Multi-Modal Affective Computing of Large-Scale Multimedia Data, Seoul, Korea.","DOI":"10.1145\/3267935.3267946"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Farooq, M., Hussain, F., Baloch, N.K., Raja, F.R., Yu, H., and Zikria, Y.B. (2020). Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network. Sensors, 20.","DOI":"10.3390\/s20216008"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Kumar, P., Jain, S., Raman, B., Roy, P.P., and Iwamura, M. (2021, January 5). End-to-end Triplet Loss based Emotion Embedding System for Speech Emotion Recognition. 
Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy.","DOI":"10.1109\/ICPR48806.2021.9413144"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Mirsamadi, S., Barsoum, E., and Zhang, C. (2017, January 19). Automatic speech emotion recognition using recurrent neural networks with local attention. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952552"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Chao, L., Tao, J., Yang, M., Li, Y., and Wen, Z. (2016, January 20\u201325). Long short term memory recurrent neural network based encoding method for emotion recognition in video. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472178"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Tzinis, E., and Potamianos, A. (2017, January 23\u201326). Segment-based speech emotion recognition using recurrent neural networks. Proceedings of the 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, TX, USA.","DOI":"10.1109\/ACII.2017.8273599"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Huang, J., Li, Y., Tao, J., and Lian, Z. (2018, January 2\u20136). Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1432"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"117327","DOI":"10.1109\/ACCESS.2019.2936124","article-title":"Speech Emotion Recognition Using Deep Learning Techniques: A Review","volume":"7","author":"Khalil","year":"2019","journal-title":"IEEE Access"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). 
Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Yin, R., Bredin, H., and Barras, C. (2017, January 20\u201324). Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks. Proceedings of the Interspeech 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-65"},{"key":"ref_40","unstructured":"Lim, J.S., and Oppenheim, A.V. (1987). Short-Time Fourier Transform. Advanced Topics in Signal Processing, Prentice-Hall."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Livingstone, S.R., and Russo, F.A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13.","DOI":"10.1371\/journal.pone.0196391"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"377","DOI":"10.1109\/TAFFC.2014.2336244","article-title":"CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset","volume":"5","author":"Cao","year":"2014","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Yadav, S., and Shukla, S. (2016, January 27\u201328). Analysis of k-Fold Cross-Validation over Hold-Out Validation on Colossal Datasets for Quality Classification. Proceedings of the 2016 IEEE 6th International Conference on Advanced Computing (IACC), Bhimavaram, India.","DOI":"10.1109\/IACC.2016.25"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Chung, J.S., Nagrani, A., and Zisserman, A. (2018, January 2\u20136). VoxCeleb2: Deep Speaker Recognition. 
Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1929"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"51975","DOI":"10.1109\/ACCESS.2018.2870334","article-title":"DEEP-SEE FACE: A Mobile Face Recognition System Dedicated to Visually Impaired People","volume":"6","author":"Mocanu","year":"2018","journal-title":"IEEE Access"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/12\/4233\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:19:48Z","timestamp":1760163588000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/21\/12\/4233"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,6,20]]},"references-count":45,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2021,6]]}},"alternative-id":["s21124233"],"URL":"https:\/\/doi.org\/10.3390\/s21124233","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,6,20]]}}}