{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:39:42Z","timestamp":1760233182686,"version":"build-2065373602"},"reference-count":46,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2022,12,24]],"date-time":"2022-12-24T00:00:00Z","timestamp":1671840000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"the Bio &amp; Medical Technology Development Program of the National Research Foundation (NRF) &amp; funded by the Korean government (MSIT)","award":["NRF-2019M3E5D1A02067961","NRF-2018R1D1A3B05049058","NRF-2020R1A4A1019191"],"award-info":[{"award-number":["NRF-2019M3E5D1A02067961","NRF-2018R1D1A3B05049058","NRF-2020R1A4A1019191"]}]},{"name":"Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education","award":["NRF-2019M3E5D1A02067961","NRF-2018R1D1A3B05049058","NRF-2020R1A4A1019191"],"award-info":[{"award-number":["NRF-2019M3E5D1A02067961","NRF-2018R1D1A3B05049058","NRF-2020R1A4A1019191"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Speech emotion recognition (SER) is one of the most exciting topics many researchers have recently been involved in. Although much research has been conducted recently on this topic, emotion recognition via non-verbal speech (known as the vocal burst) is still sparse. The vocal burst is concise and has meaningless content, which is harder to deal with than verbal speech. Therefore, in this paper, we proposed a self-relation attention and temporal awareness (SRA-TA) module to tackle this problem with vocal bursts, which could capture the dependency in a long-term period and focus on the salient parts of the audio signal as well. Our proposed method contains three main stages. Firstly, the latent features are extracted using a self-supervised learning model from the raw audio signal and its Mel-spectrogram. After the SRA-TA module is utilized to capture the valuable information from latent features, all features are concatenated and fed into ten individual fully-connected layers to predict the scores of 10 emotions. 
Our proposed method achieves a mean concordance correlation coefficient (CCC) of 0.7295 on the test set, which achieves the first ranking of the high-dimensional emotion task in the 2022 ACII Affective Vocal Burst Workshop &amp; Challenge.<\/jats:p>","DOI":"10.3390\/s23010200","type":"journal-article","created":{"date-parts":[[2022,12,27]],"date-time":"2022-12-27T03:03:31Z","timestamp":1672110211000},"page":"200","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Self-Relation Attention and Temporal Awareness for Emotion Recognition via Vocal Burst"],"prefix":"10.3390","volume":"23","author":[{"given":"Dang-Linh","family":"Trinh","sequence":"first","affiliation":[{"name":"Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Republic of Korea"}]},{"given":"Minh-Cong","family":"Vo","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Republic of Korea"}]},{"given":"Soo-Hyung","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3024-5060","authenticated-orcid":false,"given":"Hyung-Jeong","family":"Yang","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8756-1382","authenticated-orcid":false,"given":"Guee-Sang","family":"Lee","sequence":"additional","affiliation":[{"name":"Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Republic of Korea"}]}],"member":"1968","published-online":{"date-parts":[[2022,12,24]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Darwin, C., and Prodger, P. (1998). The Expression of the Emotions in Man and Animals, Oxford University Press.","DOI":"10.1093\/oso\/9780195112719.002.0002"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.specom.2019.12.001","article-title":"Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers","volume":"116","year":"2020","journal-title":"Speech Commun."},{"key":"ref_3","first-page":"838","article-title":"The voice conveys specific emotions: Evidence from vocal burst displays","volume":"9","author":"Keltner","year":"2009","journal-title":"Emot. Am. Psychol. Assoc."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1016\/S0167-6393(02)00078-X","article-title":"Experimental study of affect bursts","volume":"40","year":"2003","journal-title":"Speech Commun."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1016\/S0892-1997(05)80231-0","article-title":"Expression of emotion in voice and music","volume":"9","author":"Scherer","year":"1995","journal-title":"J. Voice"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Baird, A., Tzirakis, P., Brooks, J.A., Gregory, C.B., Schuller, B., Batliner, A., and Cowen, A. (2022). The ACII 2022 Affective Vocal Bursts Workshop & Competition: Understanding a critically understudied modality of emotional expression. 
arXiv.","DOI":"10.1109\/ACIIW57231.2022.10086002"},{"key":"ref_7","unstructured":"Cowen, A., Bard, A., Tzirakis, P., Opara, M., Kim, L., Brooks, J., and Metrick, J. (2022, February 28). The Hume Vocal Burst Competition Dataset (H-VB) | Raw Data. Available online: https:\/\/zenodo.org\/record\/6308780#.Y6ParhVByUk."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Caron, M., Touvron, H., Misra, I., J\u00e9gou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021, January 11\u201317). Emerging Properties in Self-Supervised Vision Transformers. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, BC, Canada.","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Misra, I., and van der Maaten, L. (2020, January 14\u201319). Self-Supervised Learning of Pretext-Invariant Representations. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00674"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Jaiswal, A., Babu, A.R., Zadeh, M.Z., Banerjee, D., and Makedon, F. (2020). A Survey on Contrastive Self-Supervised Learning. Technologies, 9.","DOI":"10.3390\/technologies9010002"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Triantafyllopoulos, A., Liu, S., and Schuller, B.W. (2021, January 5\u20139). Deep speaker conditioning for speech emotion recognition. Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China.","DOI":"10.1109\/ICME51207.2021.9428217"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Si, C., Chen, W., Wang, W., Wang, L., and Tan, T. (2019, January 16\u201320). An attention enhanced graph convolutional lstm network for skeleton-based action recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00132"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Liu, S., Mallol-Ragolta, A., Parada-Cabeleiro, E., Qian, K., Jing, X., Kathan, A., Hu, B., and Schuller, B.W. (2022). Audio self-supervised learning: A survey. arXiv.","DOI":"10.1016\/j.patter.2022.100616"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","article-title":"Self-supervised speech representation learning by masked prediction of hidden units","volume":"29","author":"Hsu","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_15","first-page":"16","article-title":"Modeling the temporal evolution of acoustic parameters for speech emotion recognition","volume":"3","author":"Ntalampiras","year":"2011","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1007\/s10772-018-9495-8","article-title":"Choice of a classifier, based on properties of a dataset: Case study-speech emotion recognition","volume":"21","author":"Koolagudi","year":"2018","journal-title":"Int. J. Speech Technol."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1109\/TAFFC.2015.2457417","article-title":"The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing","volume":"7","author":"Eyben","year":"2015","journal-title":"IEEE Trans. Affect. 
Comput."},{"key":"ref_18","first-page":"292","article-title":"On the acoustics of emotion in audio: What speech, music, and sound have in common","volume":"4","author":"Weninger","year":"2013","journal-title":"Front. Psychol. Front. Media SA"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Han, K., Yu, D., and Tashev, I. (2014, January 14\u201318). Speech emotion recognition using deep neural network and extreme learning machine. Proceedings of the Interspeech 2014, Singapore.","DOI":"10.21437\/Interspeech.2014-57"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G., and Schuller, B. (2011, January 22\u201327). Deep neural networks for acoustic emotion recognition: Raising the benchmarks. Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic.","DOI":"10.1109\/ICASSP.2011.5947651"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Papakostas, M., Spyrou, E., Giannakopoulos, T., Siantikos, G., Sgouropoulos, D., Mylonas, P., and Makedon, F. (2017). Deep visual attributes vs. hand-crafted audio features on multidomain speech emotion recognition. Computation, 5.","DOI":"10.3390\/computation5020026"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Wang, Z.-Q., and Tashev, I. (2017, January 5\u20139). Learning utterance-level representations for speech emotion and age\/gender recognition using deep neural networks. Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7953138"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Badshah, A.M., Ahmad, J., Rahim, N., and Baik, S.W. (2017, January 13\u201315). Speech emotion recognition from spectrograms with deep convolutional neural network. Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea.","DOI":"10.1109\/PlatCon.2017.7883728"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Hajarolasvadi, N., and Demirel, H. (2019). 3D CNN-based speech emotion recognition using k-means clustering and spectrograms. Entropy, 21.","DOI":"10.3390\/e21050479"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Li, P., Song, Y., McLoughlin, I.V., Guo, W., and Dai, L.-R. (2018, January 2\u20136). An attention pooling based representation learning method for speech emotion recognition. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1242"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Hsiao, P.-W., and Chen, C.-P. (2018, January 15\u201320). Effective attention mechanism in dynamic models for speech emotion recognition. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461431"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Lee, J., and Tashev, I. (2015, January 6\u201310). High-level feature representation using recurrent neural network for speech emotion recognition. 
{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Mirsamadi, S., Barsoum, E., and Zhang, C. (2017, January 5\u20139). Automatic speech emotion recognition using recurrent neural networks with local attention. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952552"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Zheng, C., Wang, C., and Jia, N. (2019). An ensemble model for multi-level speech emotion recognition. Appl. Sci., 10.","DOI":"10.3390\/app10010205"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Li, Y., Zhao, T., and Kawahara, T. (2019, January 15\u201319). Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. Proceedings of the Interspeech 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-2594"},{"key":"ref_32","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_33","unstructured":"Prasad, L.V.S.V., Seth, A., Ghosh, S., and Umesh, S. (2022). Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition. arXiv."},{"key":"ref_34","unstructured":"Xin, D., Takamichi, S., and Saruwatari, H. (2022). Exploring the Effectiveness of Self-supervised Learning and Classifier Chains in Emotion Recognition of Nonverbal Vocalizations. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Kahn, J., Rivi\u00e8re, M., Zheng, W., Kharitonov, E., Xu, Q., Mazar\u00e9, P.-E., Karadayi, J., Liptchinsky, V., Collobert, R., and Fuegen, C. (2020, January 4\u20138). Libri-light: A benchmark for asr with limited or no supervision. Proceedings of the ICASSP 2020\u20142020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9052942"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Weyand, T., Araujo, A., Cao, B., and Sim, J. (2020, January 14\u201319). Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00265"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Meng, D., Peng, X., Wang, K., and Qiao, Y. (2019, January 22\u201325). Frame attention networks for facial expression recognition in videos. Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan.","DOI":"10.1109\/ICIP.2019.8803603"},{"key":"ref_38","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv."},{"key":"ref_39","unstructured":"(2022, November 07). ACII A-VB2022\u2014Hume AI|ML. Available online: https:\/\/www.competitions.hume.ai\/avb2022."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Lin, L.I.-K. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 255\u2013268.","DOI":"10.2307\/2532051"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"6037","DOI":"10.1007\/s10462-022-10148-x","article-title":"Attention, please! A survey of neural attention models in deep learning","volume":"55","author":"Colombini","year":"2022","journal-title":"Artif. Intell. Rev."},
{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt, M., Eyben, F., and Schuller, B.W. (2022). Dawn of the transformer era in speech emotion recognition: Closing the valence gap. arXiv.","DOI":"10.1109\/TPAMI.2023.3263585"},{"key":"ref_43","unstructured":"Atmaja, B.T., and Sasou, A. (2022). Predicting Affective Vocal Bursts with Finetuned wav2vec 2.0. arXiv."},{"key":"ref_44","unstructured":"Nguyen, D.-K., Pant, S., Ho, N.-H., Lee, G.-S., Kim, S.-H., and Yang, H.-J. (2022). Fine-tuning Wav2vec for Vocal-burst Emotion Recognition. arXiv."},{"key":"ref_45","unstructured":"Hallmen, T., Mertes, S., Schiller, D., and Andr\u00e9, E. (2022). An Efficient Multitask Learning Architecture for Affective Vocal Burst Analysis. arXiv."},{"key":"ref_46","unstructured":"Karas, V., Triantafyllopoulos, A., Song, M., and Schuller, B.W. (2022). Self-Supervised Attention Networks and Uncertainty Loss Weighting for Multi-Task Emotion Recognition on Vocal Bursts. arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/1\/200\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:50:28Z","timestamp":1760147428000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/1\/200"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,24]]},"references-count":46,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,1]]}},"alternative-id":["s23010200"],"URL":"https:\/\/doi.org\/10.3390\/s23010200","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2022,12,24]]}}}