{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T14:05:04Z","timestamp":1772201104986,"version":"3.50.1"},"reference-count":29,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,3,25]],"date-time":"2025-03-25T00:00:00Z","timestamp":1742860800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003725","name":"Ministry of Education","doi-asserted-by":"publisher","award":["2022R1A6A1A03052954"],"award-info":[{"award-number":["2022R1A6A1A03052954"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003725","name":"Ministry of Education","doi-asserted-by":"publisher","award":["RS-2019-II191906"],"award-info":[{"award-number":["RS-2019-II191906"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003725","name":"Ministry of Education","doi-asserted-by":"publisher","award":["20214810100010"],"award-info":[{"award-number":["20214810100010"]}],"id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Korea government (MSIT)","award":["2022R1A6A1A03052954"],"award-info":[{"award-number":["2022R1A6A1A03052954"]}]},{"name":"Korea government (MSIT)","award":["RS-2019-II191906"],"award-info":[{"award-number":["RS-2019-II191906"]}]},{"name":"Korea government (MSIT)","award":["20214810100010"],"award-info":[{"award-number":["20214810100010"]}]},{"DOI":"10.13039\/501100007053","name":"Ministry of Trade, Industry &amp; Energy (MOTIE) of the Republic of Korea","doi-asserted-by":"publisher","award":["2022R1A6A1A03052954"],"award-info":[{"award-number":["2022R1A6A1A03052954"]}],"id":[{"id":"10.13039\/501100007053","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100007053","name":"Ministry of Trade, Industry &amp; Energy (MOTIE) of the Republic of Korea","doi-asserted-by":"publisher","award":["RS-2019-II191906"],"award-info":[{"award-number":["RS-2019-II191906"]}],"id":[{"id":"10.13039\/501100007053","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100007053","name":"Ministry of Trade, Industry &amp; Energy (MOTIE) of the Republic of Korea","doi-asserted-by":"publisher","award":["20214810100010"],"award-info":[{"award-number":["20214810100010"]}],"id":[{"id":"10.13039\/501100007053","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>In recent years, advancements in artificial intelligence, speech, and natural language processing technology have enhanced spoken dialogue systems (SDSs), enabling natural, voice-based human\u2013computer interaction. However, discrete, token-based LLMs in emotionally adaptive SDSs focus on lexical content while overlooking essential paralinguistic cues for emotion expression. Existing methods use external emotion predictors to compensate for this but introduce computational overhead and fail to fully integrate paralinguistic features with linguistic context. Moreover, the lack of high-quality emotional speech datasets limits models\u2019 ability to learn expressive emotional cues. To address these challenges, we propose EmoSDS, a unified SDS framework that integrates speech and emotion recognition by leveraging self-supervised learning (SSL) features. 
Our three-stage training pipeline enables the LLM to learn both discrete linguistic content and continuous paralinguistic features, improving emotional expressiveness and response naturalness. Additionally, we construct EmoSC, a dataset combining GPT-generated dialogues with emotional voice conversion data, ensuring greater emotional diversity and a balanced sample distribution across emotion categories. The experimental results show that EmoSDS outperforms existing models in emotional alignment and response generation, achieving a minimum 2.9% increase in text generation metrics, enhancing the LLM\u2019s ability to interpret emotional and textual cues for more expressive and contextually appropriate responses.<\/jats:p>","DOI":"10.3390\/fi17040143","type":"journal-article","created":{"date-parts":[[2025,3,25]],"date-time":"2025-03-25T05:35:03Z","timestamp":1742880903000},"page":"143","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["EmoSDS: Unified Emotionally Adaptive Spoken Dialogue System Using Self-Supervised Speech Representations"],"prefix":"10.3390","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-0873-375X","authenticated-orcid":false,"given":"Jaehwan","family":"Lee","sequence":"first","affiliation":[{"name":"Graduate School of Artificial Intelligence, Pohang University of Science and Technology (POSTECH), Pohang 37673, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7986-1708","authenticated-orcid":false,"given":"Youngjun","family":"Sim","sequence":"additional","affiliation":[{"name":"Graduate School of Artificial Intelligence, Pohang University of Science and Technology (POSTECH), Pohang 37673, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-0554-2701","authenticated-orcid":false,"given":"Jinyou","family":"Kim","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), Pohang 37673, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7208-1709","authenticated-orcid":false,"given":"Young-Joo","family":"Suh","sequence":"additional","affiliation":[{"name":"Graduate School of Artificial Intelligence, Pohang University of Science and Technology (POSTECH), Pohang 37673, Republic of Korea"}]}],"member":"1968","published-online":{"date-parts":[[2025,3,25]]},"reference":[{"key":"ref_1","unstructured":"Sim, Y., Yoon, J., and Suh, Y.J. (2024). SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Zhang, D., Li, S., Zhang, X., Zhan, J., Wang, P., Zhou, Y., and Qiu, X. (2023). Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv.","DOI":"10.18653\/v1\/2023.findings-emnlp.1055"},{"key":"ref_3","unstructured":"Rubenstein, P.K., Asawaroengchai, C., Nguyen, D.D., Bapna, A., Borsos, Z., Quitry, F.d.C., Chen, P., Badawy, D.E., Han, W., and Kharitonov, E. (2023). Audiopalm: A large language model that can speak and listen. arXiv."},{"key":"ref_4","unstructured":"Zhao, W., Zhao, Y., Lu, X., Wang, S., Tong, Y., and Qin, B. (2023). Is ChatGPT equipped with emotional dialogue capabilities?. arXiv."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Liu, R., Wei, J., Jia, C., and Vosoughi, S. (2021). Modulating language models with emotions. 
arXiv.","DOI":"10.18653\/v1\/2021.findings-acl.379"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Varshney, D., Ekbal, A., and Bhattacharyya, P. (2021, January 19\u201323). Modelling context emotions using multi-task learning for emotion controlled dialog generation. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online.","DOI":"10.18653\/v1\/2021.eacl-main.255"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Lin, G.T., Shivakumar, P.G., Gandhe, A., Yang, C.H.H., Gu, Y., Ghosh, S., Stolcke, A., Lee, H.y., and Bulyko, I. (2024, January 14\u201319). Paralinguistics-enhanced large language modeling of spoken dialogue. Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10446933"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Lin, G.T., Chiang, C.H., and Lee, H.y. (2024). Advancing large language models to capture varied speaking styles and respond properly in spoken conversations. arXiv.","DOI":"10.18653\/v1\/2024.acl-long.358"},{"key":"ref_9","unstructured":"Li, Y.A., Jiang, X., Darefsky, J., Zhu, G., and Mesgarani, N. (2024). Style-talker: Finetuning audio language model and style-based text-to-speech model for fast spoken dialogue generation. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Lee, K., Park, K., and Kim, D. (2023, January 4\u201310). Dailytalk: Spoken dialogue dataset for conversational text-to-speech. Proceedings of the ICASSP 2023\u20142023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10095751"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.specom.2021.11.006","article-title":"Emotional voice conversion: Theory, databases and ESD","volume":"137","author":"Zhou","year":"2022","journal-title":"Speech Commun."},{"key":"ref_12","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). Gpt-4 technical report. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Wang, M., Han, W., Shafran, I., Wu, Z., Chiu, C.C., Cao, Y., Chen, N., Zhang, Y., Soltau, H., and Rubenstein, P.K. (2023, January 16\u201320). Slm: Bridge the thin gap between speech and text foundation models. Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan.","DOI":"10.1109\/ASRU57964.2023.10389703"},{"key":"ref_14","unstructured":"Nachmani, E., Levkovitch, A., Hirsch, R., Salazar, J., Asawaroengchai, C., Mariooryad, S., Rivlin, E., Skerry-Ryan, R., and Ramanovich, M.T. (2023). Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM. arXiv."},{"key":"ref_15","unstructured":"Lin, G.T., Shivakumar, P.G., Gourav, A., Gu, Y., Gandhe, A., Lee, H.Y., and Bulyko, I. (2024). Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Xue, H., Liang, Y., Mu, B., Zhang, S., Chen, M., Chen, Q., and Xie, L. (2024, January 7\u201310). E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models. 
Proceedings of the 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), Beijing, China.","DOI":"10.1109\/ISCSLP63861.2024.10800447"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","article-title":"Hubert: Self-supervised speech representation learning by masked prediction of hidden units","volume":"29","author":"Hsu","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1505","DOI":"10.1109\/JSTSP.2022.3188113","article-title":"Wavlm: Large-scale self-supervised pre-training for full stack speech processing","volume":"16","author":"Chen","year":"2022","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Pasad, A., Chou, J.C., and Livescu, K. (2021, January 13\u201317). Layer-wise analysis of a self-supervised speech representation model. Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia.","DOI":"10.1109\/ASRU51503.2021.9688093"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Baas, M., van Niekerk, B., and Kamper, H. (2023). Voice conversion with just nearest neighbors. arXiv.","DOI":"10.21437\/Interspeech.2023-419"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1211","DOI":"10.1109\/JSTSP.2022.3206084","article-title":"Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge","volume":"16","author":"Dunbar","year":"2022","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Li, J., Tu, W., and Xiao, L. (2023, January 4\u201310). Freevc: Towards high-quality text-free one-shot voice conversion. Proceedings of the ICASSP 2023\u20142023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10095191"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Polyak, A., Adi, Y., Copet, J., Kharitonov, E., Lakhotia, K., Hsu, W.N., Mohamed, A., and Dupoux, E. (2021). Speech resynthesis from discrete disentangled self-supervised representations. arXiv.","DOI":"10.21437\/Interspeech.2021-475"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Li, J., Guo, Y., Chen, X., and Yu, K. (2024, January 14\u201319). SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention. Proceedings of the ICASSP 2024\u20142024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10446160"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Conneau, A., Baevski, A., Collobert, R., Mohamed, A., and Auli, M. (2020). Unsupervised cross-lingual representation learning for speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2021-329"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015, January 19\u201324). Librispeech: An asr corpus based on public domain audio books. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia.","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002, January 7\u201312). 
BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_28","unstructured":"Lin, C.Y. (2004, January 25\u201326). ROUGE: A Package for Automatic Evaluation of Summaries. Proceedings of the Text Summarization Branches Out, Barcelona, Spain."},{"key":"ref_29","unstructured":"Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., and Artzi, Y. (2020, January 26\u201330). BERTScore: Evaluating Text Generation with BERT. Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia."}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/17\/4\/143\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:59:40Z","timestamp":1760029180000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/17\/4\/143"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,25]]},"references-count":29,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,4]]}},"alternative-id":["fi17040143"],"URL":"https:\/\/doi.org\/10.3390\/fi17040143","relation":{},"ISSN":["1999-5903"],"issn-type":[{"value":"1999-5903","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,25]]}}}
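Illustrative note: the abstract in this record describes an LLM that consumes discrete linguistic units derived from self-supervised speech representations, but the record itself carries no implementation details. The sketch below shows only the generic unit-discretization step that such systems commonly use (k-means clustering over SSL feature frames, then collapsing repeated cluster ids into a token sequence); the encoder, feature dimensionality, cluster count, and the random dummy features are all assumptions for illustration, not the EmoSDS pipeline.

# Minimal sketch (assumptions throughout): turn frame-level SSL features into
# discrete unit tokens that a token-based LLM could ingest alongside text.
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for SSL features: T frames x D dims. A real system would take these
# from a pretrained encoder such as HuBERT or WavLM, not random noise.
rng = np.random.default_rng(0)
ssl_features = rng.normal(size=(500, 768)).astype(np.float32)

# Fit a small codebook. Real systems typically train a few hundred to a few
# thousand clusters on a large corpus rather than a single utterance.
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(ssl_features)

# Map each frame to its nearest cluster id -> a discrete unit sequence.
unit_ids = kmeans.predict(ssl_features)

# Collapse consecutive duplicates, a common step before feeding units to an LM.
deduped = [int(unit_ids[0])] + [int(u) for i, u in enumerate(unit_ids[1:], 1)
                                if u != unit_ids[i - 1]]
print(len(unit_ids), "frames ->", len(deduped), "units")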