{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T19:27:06Z","timestamp":1769628426661,"version":"3.49.0"},"reference-count":41,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2022,7,12]],"date-time":"2022-07-12T00:00:00Z","timestamp":1657584000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003725","name":"Ministry of Education of the Republic of Korea and the National Research Foundation of Korea","doi-asserted-by":"publisher","award":["NRF-2021S1A3A2A01087325"],"award-info":[{"award-number":["NRF-2021S1A3A2A01087325"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003725","name":"Ministry of Education of the Republic of Korea and the National Research Foundation of Korea","doi-asserted-by":"publisher","award":["2020-0-01389"],"award-info":[{"award-number":["2020-0-01389"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100010418","name":"Institute of Information &amp; Communications Technology Planning &amp; Evaluation (IITP)","doi-asserted-by":"publisher","award":["NRF-2021S1A3A2A01087325"],"award-info":[{"award-number":["NRF-2021S1A3A2A01087325"]}],"id":[{"id":"10.13039\/501100010418","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100010418","name":"Institute of Information &amp; Communications Technology Planning &amp; Evaluation (IITP)","doi-asserted-by":"publisher","award":["2020-0-01389"],"award-info":[{"award-number":["2020-0-01389"]}],"id":[{"id":"10.13039\/501100010418","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Artificial Intelligence Convergence Innovation Human Resources Development (Inha 
University)","award":["NRF-2021S1A3A2A01087325"],"award-info":[{"award-number":["NRF-2021S1A3A2A01087325"]}]},{"name":"Artificial Intelligence Convergence Innovation Human Resources Development (Inha University)","award":["2020-0-01389"],"award-info":[{"award-number":["2020-0-01389"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Along with automatic speech recognition, many researchers have been actively studying speech emotion recognition, since emotion information is as crucial as the textual information for effective interactions. Emotion can be divided into categorical emotion and dimensional emotion. Although categorical emotion is widely used, dimensional emotion, typically represented as arousal and valence, can provide more detailed information on the emotional states. Therefore, in this paper, we propose a Conformer-based model for arousal and valence recognition. Our model uses Conformer as an encoder, a fully connected layer as a decoder, and statistical pooling layers as a connector. In addition, we adopted multi-task learning and multi-feature combination, which showed a remarkable performance for speech emotion recognition and time-series analysis, respectively. 
The proposed model achieves a state-of-the-art recognition accuracy of 70.0 \u00b1 1.5% for arousal in terms of unweighted accuracy on the IEMOCAP dataset.<\/jats:p>","DOI":"10.3390\/sym14071428","type":"journal-article","created":{"date-parts":[[2022,7,12]],"date-time":"2022-07-12T23:02:01Z","timestamp":1657666921000},"page":"1428","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["Multi-Task Conformer with Multi-Feature Combination for Speech Emotion Recognition"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0998-5083","authenticated-orcid":false,"given":"Jiyoung","family":"Seo","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5417-5699","authenticated-orcid":false,"given":"Bowon","family":"Lee","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Korea"}]}],"member":"1968","published-online":{"date-parts":[[2022,7,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., and Wu, Y. (2020, January 25\u201329). Conformer: Convolution-augmented transformer for speech recognition. Proceedings of the INTERSPEECH, ISCA, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Conneau, A., Collobert, R., Synnaeve, G., and Auli, M. (2021, January 6\u201311). Self-training and pre-training are complementary for speech recognition. 
Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9414641"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Burkhardt, F., Ajmera, J., Englert, R., Stegmann, J., and Burleson, W. (2006, January 17\u201321). Detecting anger in automated voice portal dialogs. Proceedings of the INTERSPEECH, ISCA, Pittsburgh, PA, USA.","DOI":"10.21437\/Interspeech.2006-157"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Huang, Z., Epps, J., and Joachim, D. (2019, January 12\u201317). Speech landmark bigrams for depression detection from naturalistic smartphone speech. Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8682916"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/s10579-008-9076-6","article-title":"IEMOCAP: Interactive emotional dyadic motion capture database","volume":"42","author":"Busso","year":"2008","journal-title":"Lang. Resour. Eval."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Singh, P., Saha, G., and Sahidullah, M. (2021). Deep scattering network for speech emotion recognition. arXiv.","DOI":"10.23919\/EUSIPCO54536.2021.9615958"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"395","DOI":"10.1109\/TAFFC.2015.2407898","article-title":"UMEME: University of Michigan emotional McGurk effect data set","volume":"6","author":"Provost","year":"2015","journal-title":"IEEE Trans. Affect. 
Comput."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1103","DOI":"10.21437\/Interspeech.2017-1494","article-title":"Jointly Predicting Arousal, Valence and Dominance with Multi-Task Learning","volume":"Volume 2017","author":"Parthasarathy","year":"2017","journal-title":"Interspeech"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Chen, J.M., Chang, P.C., and Liang, K.W. (2019, January 9\u201311). Speech Emotion Recognition Based on Joint Self-Assessment Manikins and Emotion Labels. Proceedings of the 2019 IEEE International Symposium on Multimedia (ISM), IEEE, San Diego, CA, USA.","DOI":"10.1109\/ISM46123.2019.00073"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Atmaja, B.T., and Akagi, M. (2020, January 5\u20137). Improving Valence Prediction in Dimensional Speech Emotion Recognition Using Linguistic Information. Proceedings of the 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-Ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), IEEE, Yangon, Myanmar.","DOI":"10.1109\/O-COCOSDA50338.2020.9295032"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"184","DOI":"10.1109\/T-AFFC.2011.40","article-title":"Context-sensitive learning for enhanced audiovisual emotion classification","volume":"3","author":"Metallinou","year":"2012","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Wu, B., and Schuller, B. (2019, January 12\u201317). Attention-augmented end-to-end multi-task learning for emotion prediction from speech. 
Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8682896"},{"key":"ref_13","first-page":"992","article-title":"Multi-task semi-supervised adversarial autoencoding for speech emotion recognition","volume":"11","author":"Latif","year":"2020","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"985","DOI":"10.1109\/TASLP.2021.3049898","article-title":"CTNet: Conversational transformer network for emotion recognition","volume":"29","author":"Lian","year":"2021","journal-title":"IEEE\/ACM Trans. Audio, Speech, Lang. Process."},{"key":"ref_15","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_16","unstructured":"Zhang, Y., Qin, J., Park, D.S., Han, W., Chiu, C.C., Pang, R., Le, Q.V., and Wu, Y. (2020). Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv."},{"key":"ref_17","unstructured":"Chan, W., Park, D., Lee, C., Zhang, Y., Le, Q., and Norouzi, M. (2021). SpeechStew: Simply mix all available speech recognition data to train one large neural network. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Shor, J., Jansen, A., Han, W., Park, D., and Zhang, Y. (2021). Universal Paralinguistic Speech Representations Using Self-Supervised Conformers. arXiv.","DOI":"10.1109\/ICASSP43922.2022.9747197"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1109\/TAFFC.2015.2512598","article-title":"A multi-task learning framework for emotion recognition using 2D continuous space","volume":"8","author":"Xia","year":"2017","journal-title":"IEEE Trans. Affect. 
Comput."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Kim, J.G., and Lee, B. (2019). Appliance classification by power signal analysis based on multi-feature combination multi-layer LSTM. Energies, 12.","DOI":"10.3390\/en12142804"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Wang, X., Wang, M., Qi, W., Su, W., Wang, X., and Zhou, H. (2021, January 6\u201311). A Novel end-to-end Speech Emotion Recognition Network with Stacked Transformer Layers. Proceedings of the ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9414314"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Li, Y., Zhao, T., and Kawahara, T. (2019). Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. Interspeech, ISCA.","DOI":"10.21437\/Interspeech.2019-2594"},{"key":"ref_23","unstructured":"Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. (2015). Adversarial autoencoders. arXiv."},{"key":"ref_24","unstructured":"Rana, R., Latif, S., Khalifa, S., Jurdak, R., and Epps, J. (2019). Multi-task semisupervised adversarial autoencoding for speech emotion. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Tits, N., Haddad, K.E., and Dutoit, T. (2018). Asr-based features for emotion recognition: A transfer learning approach. arXiv.","DOI":"10.18653\/v1\/W18-3307"},{"key":"ref_26","unstructured":"Wu, J., Dang, T., Sethu, V., and Ambikairajah, E. (2021). A Novel Markovian Framework for Integrating Absolute and Relative Ordinal Emotion Information. arXiv."},{"key":"ref_27","unstructured":"Ramachandran, P., Zoph, B., and Le, Q.V. (2017). Searching for activation functions. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Kim, Y., Lee, H., and Provost, E.M. (2013, January 26\u201331). 
Deep learning for robust feature generation in audiovisual emotion recognition. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638346"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"223","DOI":"10.21437\/Interspeech.2014-57","article-title":"Speech emotion recognition using deep neural network and extreme learning machine","volume":"Volume 2014","author":"Han","year":"2014","journal-title":"Interspeech 2014"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"125868","DOI":"10.1109\/ACCESS.2019.2938007","article-title":"Speech emotion recognition from 3D log-mel spectrograms with deep learning network","volume":"7","author":"Meng","year":"2019","journal-title":"IEEE Access"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wang, J., Xue, M., Culhane, R., Diao, E., Ding, J., and Tarokh, V. (2020, January 4\u20138). Speech emotion recognition with dual-sequence LSTM architecture. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054629"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"466","DOI":"10.1007\/s00034-020-01486-8","article-title":"DNN-HMM-based speaker-adaptive emotion recognition using MFCC and epoch-based features","volume":"40","author":"Fahad","year":"2021","journal-title":"Circuits Syst. Signal Process."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1558","DOI":"10.1109\/PROC.1977.10770","article-title":"A unified approach to short-time Fourier analysis and synthesis","volume":"65","author":"Allen","year":"1977","journal-title":"Proc. IEEE"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., and Skerry-Ryan, R. (2018, January 15\u201320). Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"ref_35","unstructured":"Logan, B. (2000, January 23\u201325). Mel frequency cepstral coefficients for music modeling. Proceedings of the 1st International Symposium on Music Information Retrieval (ISMIR), Plymouth, MA, USA."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1016\/j.neucom.2022.04.028","article-title":"A systematic literature review of speech emotion recognition approaches","volume":"492","author":"Singh","year":"2022","journal-title":"Neurocomputing"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018, January 15\u201320). X-vectors: Robust DNN embeddings for speaker recognition. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461375"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Lozano-Diez, A., Plchot, O., Matejka, P., and Gonzalez-Rodriguez, J. (2018, January 15\u201320). DNN based embeddings for language recognition. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462403"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Cooper, E., Lai, C.I., Yasuda, Y., Fang, F., Wang, X., Chen, N., and Yamagishi, J. (2020, January 4\u20138). Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. 
Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9054535"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., and Nieto, O. (2015, January 6\u201312). librosa: Audio and music signal analysis in python. Proceedings of the 14th Python in Science Conference, Austin, TX, USA.","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"ref_41","first-page":"2613","article-title":"SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition","volume":"2019","author":"Park","year":"2019","journal-title":"Interspeech"}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/14\/7\/1428\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:48:36Z","timestamp":1760140116000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/14\/7\/1428"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,7,12]]},"references-count":41,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2022,7]]}},"alternative-id":["sym14071428"],"URL":"https:\/\/doi.org\/10.3390\/sym14071428","relation":{},"ISSN":["2073-8994"],"issn-type":[{"value":"2073-8994","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,7,12]]}}}