{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,11]],"date-time":"2026-06-11T06:52:43Z","timestamp":1781160763954,"version":"3.54.1"},"reference-count":28,"publisher":"MDPI AG","issue":"17","license":[{"start":{"date-parts":[[2022,8,24]],"date-time":"2022-08-24T00:00:00Z","timestamp":1661299200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001863","name":"New Energy and Industrial Technology Development Organization (NEDO)","doi-asserted-by":"publisher","award":["JPNP20006"],"award-info":[{"award-number":["JPNP20006"]}],"id":[{"id":"10.13039\/501100001863","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>The study of understanding sentiment and emotion in speech is a challenging task in human multimodal language. However, in certain cases, such as telephone calls, only audio data can be obtained. In this study, we independently evaluated sentiment analysis and emotion recognition from speech using recent self-supervised learning models\u2014specifically, universal speech representations with speaker-aware pre-training models. Three different sizes of universal models were evaluated for three sentiment tasks and an emotion task. The evaluation revealed that the best results were obtained with two classes of sentiment analysis, based on both weighted and unweighted accuracy scores (81% and 73%). This binary classification with unimodal acoustic analysis also performed competitively compared to previous methods which used multimodal fusion. The models failed to make accurate predictions in an emotion recognition task and in sentiment analysis tasks with higher numbers of classes. 
The unbalanced property of the datasets may also have contributed to the performance degradations observed in the six-class emotion, three-class sentiment, and seven-class sentiment tasks.<\/jats:p>","DOI":"10.3390\/s22176369","type":"journal-article","created":{"date-parts":[[2022,8,24]],"date-time":"2022-08-24T23:48:58Z","timestamp":1661384938000},"page":"6369","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":41,"title":["Sentiment Analysis and Emotion Recognition from Speech Using Universal Speech Representations"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1560-2824","authenticated-orcid":false,"given":"Bagus Tris","family":"Atmaja","sequence":"first","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology, Tsukuba 305-8560, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1700-0325","authenticated-orcid":false,"given":"Akira","family":"Sasou","sequence":"additional","affiliation":[{"name":"National Institute of Advanced Industrial Science and Technology, Tsukuba 305-8560, Japan"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2022,8,24]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Fujisaki, H. (2003, January 9\u201311). Prosody, Information, and Modeling with Emphasis on Tonal Features of Speech. Proceedings of the Workshop on Spoken Language Processing, Mumbai, India.","DOI":"10.21437\/SpeechProsody.2004-1"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Ghriss, A., Yang, B., Rozgic, V., Shriberg, E., and Wang, C. (2022, January 23\u201327). Sentiment-Aware Automatic Speech Recognition Pre-Training for Enhanced Speech Emotion Recognition. 
Proceedings of the ICASSP 2022\u20132022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore.","DOI":"10.1109\/ICASSP43922.2022.9747637"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"012004","DOI":"10.1088\/1742-6596\/1896\/1\/012004","article-title":"Evaluation of error- and correlation-based loss functions for multitask learning dimensional speech emotion recognition","volume":"1896","author":"Atmaja","year":"2021","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_4","first-page":"22","article-title":"Sentiment analysis and emotion recognition: Evolving the paradigm of communication within data classification","volume":"6","author":"Gross","year":"2020","journal-title":"Appl. Mark. Anal."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"P\u00e9rez-Rosas, V., and Mihalcea, R. (2013, January 25\u201329). Sentiment analysis of online spoken reviews. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Lyon, France.","DOI":"10.21437\/Interspeech.2013-243"},{"key":"ref_6","unstructured":"Abercrombie, G., and Batista-Navarro, R. (2018, January 7\u201312). \u2018Aye\u2019 or \u2018No\u2019? Speech-level sentiment analysis of hansard UK parliamentary debate transcripts. Proceedings of the LREC 2018, Eleventh International Conference on Language Resources and Evaluation, Miyazaki, Japan."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt, M., Burkhardt, F., Eyben, F., and Schuller, B.W. (2022). Dawn of the transformer era in speech emotion recognition: Closing the valence gap. 
arXiv.","DOI":"10.1109\/TPAMI.2023.3263585"},{"key":"ref_8","first-page":"80","article-title":"Audio sentiment analysis by heterogeneous signal features learned from utterance-based parallel neural network","volume":"2328","author":"Luo","year":"2019","journal-title":"CEUR Workshop Proc."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Georgiou, E., Paraskevopoulos, G., and Potamianos, A. (September, January 30). M3: MultiModal Masking Applied to Sentiment Analysis. Proceedings of the Interspeech 2021, Brno, Czechia.","DOI":"10.21437\/Interspeech.2021-1739"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Zadeh, A., Chen, M., Poria, S., Cambria, E., and Morency, L.P. (2017). Tensor Fusion Network for Multimodal Sentiment Analysis. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics.","DOI":"10.18653\/v1\/D17-1115"},{"key":"ref_11","unstructured":"Zadeh, A., Liang, P.P., Vanbriesen, J., Poria, S., Tong, E., Cambria, E., Chen, M., and Morency, L.P. (2018, January 15\u201320). Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1016\/j.specom.2022.03.002","article-title":"Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion","volume":"140","author":"Atmaja","year":"2022","journal-title":"Speech Commun."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Chen, S., Wu, Y., Wang, C., Chen, Z., Chen, Z., Liu, S., Wu, J., Qian, Y., Wei, F., and Li, J. (2022, January 23\u201327). Unispeech-Sat: Universal Speech Representation Learning With Speaker Aware Pre-Training. 
Proceedings of the ICASSP 2022\u20132022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore.","DOI":"10.1109\/ICASSP43922.2022.9747077"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Bertero, D., Siddique, F.B., Wu, C.S., Wan, Y., Ho, R., Chan, Y., and Fung, P. (2016, January 1\u20135). Real-Time Speech Emotion and Sentiment Recognition for Interactive Dialogue Systems. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA.","DOI":"10.18653\/v1\/D16-1110"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Poria, S., Chaturvedi, I., Cambria, E., and Hussain, A. (2016, January 12\u201315). Convolutional MKL based multimodal emotion recognition and sentiment analysis. Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain.","DOI":"10.1109\/ICDM.2016.0055"},{"key":"ref_16","unstructured":"Liang, P.P., and Salakhutdinov, R. (2018, January 20). Computational Modeling of Human Multimodal Language: The MOSEI Dataset and Interpretable Dynamic Fusion. Proceedings of the First Workshop and Grand Challenge on Computational Modeling of Human Multimodal Language, Melbourne, Australia."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","article-title":"HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units","volume":"29","author":"Hsu","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Yang, S.w., Chi, P.H., Chuang, Y.S., Lai, C.I.J., Lakhotia, K., Lin, Y.Y., Liu, A.T., Shi, J., Chang, X., and Lin, G.T. (September, January 30). SUPERB: Speech Processing Universal PERformance Benchmark. 
Proceedings of the Interspeech 2021, Brno, Czechia.","DOI":"10.21437\/Interspeech.2021-1775"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"2476","DOI":"10.3389\/fpsyg.2019.02476","article-title":"Does Neutral Affect Exist? How Challenging Three Beliefs About Neutral Affect Can Advance Affective Research","volume":"10","author":"Gasper","year":"2019","journal-title":"Front. Psychol."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"260","DOI":"10.1111\/j.1745-6916.2007.00044.x","article-title":"Basic Emotions, Natural Kinds, Emotion Schemas, and a New Paradigm","volume":"2","author":"Izard","year":"2007","journal-title":"Perspect. Psychol. Sci."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Delbrouck, J.B., Tits, N., and Dupont, S. (2020, January 20). Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition. Proceedings of the First International Workshop on Natural Language Processing Beyond Text, Online.","DOI":"10.18653\/v1\/2020.nlpbt-1.1"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"6558","DOI":"10.18653\/v1\/P19-1656","article-title":"Multimodal transformer for unaligned multimodal language sequences","volume":"2019","author":"Tsai","year":"2019","journal-title":"Proc. Conf. Assoc. Comput. Linguist. Meet."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Sheikh, I., Dumpala, S.H., Chakraborty, R., and Kopparapu, S.K. (2018). Sentiment Analysis using Imperfect Views from Spoken Language and Acoustic Modalities. Proceedings of Grand Challenge and Workshop on Human Multimodal Language, Association for Computational Linguistics.","DOI":"10.18653\/v1\/W18-3305"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1853","DOI":"10.1109\/TASLP.2022.3178225","article-title":"Neonatal Bowel Sound Detection Using Convolutional Neural Network and Laplace Hidden Semi-Markov Model","volume":"30","author":"Sitaula","year":"2022","journal-title":"IEEE\/ACM Trans. 
Audio Speech Lang. Process."},{"key":"ref_25","first-page":"7216","article-title":"Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors","volume":"33","author":"Wang","year":"2019","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_26","first-page":"6892","article-title":"Found in Translation: Learning Robust Joint Representations by Cyclic Translations between Modalities","volume":"33","author":"Pham","year":"2019","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_27","first-page":"1823","article-title":"Multimodal routing: Improving local and global interpretability of multimodal language analysis","volume":"2020","author":"Tsai","year":"2020","journal-title":"Conf. Empir. Methods Nat. Lang. Process. Proc. Conf."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"72381","DOI":"10.1109\/ACCESS.2022.3189481","article-title":"Speech Emotion and Naturalness Recognitions With Multitask and Single-Task Learnings","volume":"10","author":"Atmaja","year":"2022","journal-title":"IEEE Access"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/17\/6369\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:14:37Z","timestamp":1760141677000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/17\/6369"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,24]]},"references-count":28,"journal-issue":{"issue":"17","published-online":{"date-parts":[[2022,9]]}},"alternative-id":["s22176369"],"URL":"https:\/\/doi.org\/10.3390\/s22176369","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,8,24]]}}}