{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T18:46:29Z","timestamp":1781635589764,"version":"3.54.5"},"reference-count":178,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,2,12]],"date-time":"2024-02-12T00:00:00Z","timestamp":1707696000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,12]],"date-time":"2024-02-12T00:00:00Z","timestamp":1707696000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models possess the capability to generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of models falling into each category. 
Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In Section\u00a08, we pinpoint some research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this hot research area to offer guidance to interested researchers and future endeavors in this field.<\/jats:p>","DOI":"10.1186\/s13636-024-00329-7","type":"journal-article","created":{"date-parts":[[2024,2,12]],"date-time":"2024-02-12T14:03:29Z","timestamp":1707746609000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":47,"title":["Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources"],"prefix":"10.1186","volume":"2024","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7585-4970","authenticated-orcid":false,"given":"Huda","family":"Barakat","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Oytun","family":"Turk","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Cenk","family":"Demiroglu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2024,2,12]]},"reference":[{"key":"329_CR1","unstructured":"Wikipedia. Speech Synthesis - Wikiversity \u2014 en.wikiversity.org. https:\/\/en.wikiversity.org\/wiki\/Speech_Synthesis. Accessed 09 Jun 2023"},{"key":"329_CR2","doi-asserted-by":"publisher","unstructured":"H. Zen, A. Senior, M. Schuster, in 2013 ieee international conference on acoustics, speech and signal processing. Statistical parametric speech synthesis using deep neural networks (IEEE, 2013), pp. 7962\u20137966.  
https:\/\/doi.org\/10.1109\/icassp.2013.6639215","DOI":"10.1109\/icassp.2013.6639215"},{"key":"329_CR3","doi-asserted-by":"publisher","unstructured":"Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R.J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., in Proc. Interspeech 2017. Tacotron: Towards end-to-end speech synthesis (2017), pp. 4006\u20134010. https:\/\/doi.org\/10.21437\/Interspeech.2017-1452","DOI":"10.21437\/Interspeech.2017-1452"},{"key":"329_CR4","doi-asserted-by":"publisher","unstructured":"J. Shen, R. Pang, R.J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). Natural tts synthesis by conditioning wavenet on mel spectrogram predictions (IEEE, 2018), pp. 4779\u20134783. https:\/\/doi.org\/10.1109\/icassp.2018.8461368","DOI":"10.1109\/icassp.2018.8461368"},{"key":"329_CR5","unstructured":"Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, T.Y. Liu, Fastspeech: Fast, robust and controllable text to speech. Adv. Neural Inf. Process. Syst. 32, 3171\u20133180 (2019)"},{"key":"329_CR6","unstructured":"Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, T.Y. Liu, Fastspeech 2: Fast and high-quality end-to-end text to speech. (2020).\u00a0arXiv\u00a0preprint\u00a0arXiv:2006.04558"},{"issue":"10","key":"329_CR7","doi-asserted-by":"publisher","first-page":"15171","DOI":"10.1007\/s11042-022-13943-4","volume":"82","author":"Y Kumar","year":"2023","unstructured":"Y. Kumar, A. Koul, C. Singh, A deep learning approaches in text-to-speech system: a systematic review and recent research perspective. Multimed. Tools Appl. 82(10), 15171\u201315197 (2023)","journal-title":"Multimed. Tools Appl."},{"key":"329_CR8","doi-asserted-by":"crossref","unstructured":"F. Khanam, F.A. Munmun, N.A. Ritu, A.K. Saha, M. Firoz, Text to speech synthesis: A systematic review, deep learning based architecture and future research direction. J. Adv. 
Inform. Technol. 13(5), 398\u2013412 (2022)","DOI":"10.12720\/jait.13.5.398-412"},{"key":"329_CR9","doi-asserted-by":"publisher","unstructured":"Z. Mu, X. Yang, Y. Dong, Review of end-to-end speech synthesis technology based on deep learning. (2021). https:\/\/doi.org\/10.48550\/arXiv.2104.09995","DOI":"10.48550\/arXiv.2104.09995"},{"issue":"19","key":"329_CR10","doi-asserted-by":"publisher","first-page":"4050","DOI":"10.3390\/app9194050","volume":"9","author":"Y Ning","year":"2019","unstructured":"Y. Ning, S. He, Z. Wu, C. Xing, L.J. Zhang, A review of deep learning based speech synthesis. Appl. Sci. 9(19), 4050 (2019)","journal-title":"Appl. Sci."},{"issue":"3","key":"329_CR11","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1109\/MSP.2014.2359987","volume":"32","author":"ZH Ling","year":"2015","unstructured":"Z.H. Ling, S.Y. Kang, H. Zen, A. Senior, M. Schuster, X.J. Qian, H.M. Meng, L. Deng, Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Process. Mag. 32(3), 35\u201352 (2015)","journal-title":"IEEE Signal Process. Mag."},{"key":"329_CR12","doi-asserted-by":"publisher","unstructured":"O. Nazir, A. Malik, in 2021 2nd International Conference on Secure Cyber Computing and Communications (ICSCCC). Deep learning end to end speech synthesis: a review (IEEE, 2021), pp. 66\u201371. https:\/\/doi.org\/10.1109\/icsccc51823.2021.9478125","DOI":"10.1109\/icsccc51823.2021.9478125"},{"key":"329_CR13","unstructured":"X. Tan, T. Qin, F. Soong, T.Y. Liu. A survey on neural speech synthesis (2021). arXiv preprint arXiv:2106.15561"},{"key":"329_CR14","unstructured":"N. Kaur, P. Singh, Conventional and contemporary approaches used in text to speech synthesis: A review. Artif. Intell. Rev. 2022, 1\u201344 (2022)"},{"key":"329_CR15","doi-asserted-by":"crossref","unstructured":"A. Triantafyllopoulos, B.W. Schuller, G. \u0130ymen, M. Sezgin, X. He, Z. Yang, P. Tzirakis, S. Liu, S. 
Mertes, E. Andr\u00e9 et al., An overview of affective speech synthesis and conversion in the deep learning era. Proc. IEEE (2023),  vol. 111, no. 10, pp. 1355\u20131381","DOI":"10.1109\/JPROC.2023.3250266"},{"key":"329_CR16","unstructured":"Scopus. Scopus \u2014 scopus.com. https:\/\/www.scopus.com\/. Accessed 7 Jan 2023"},{"key":"329_CR17","doi-asserted-by":"publisher","unstructured":"S. Lei, Y. Zhou, L. Chen, Z. Wu, S. Kang, H. Meng, in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Context-aware coherent speaking style prediction with hierarchical transformers for audiobook speech synthesis (IEEE, 2023), pp. 1\u20135.  https:\/\/doi.org\/10.1109\/icassp49357.2023.10095866","DOI":"10.1109\/icassp49357.2023.10095866"},{"key":"329_CR18","unstructured":"K. Shen, Z. Ju, X. Tan, Y. Liu, Y. Leng, L. He, T. Qin, S. Zhao, J. Bian, Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. (2023).\u00a0arXiv\u00a0preprint\u00a0arXiv:2304.09116"},{"key":"329_CR19","doi-asserted-by":"publisher","unstructured":"S. Jo, Y. Lee, Y. Shin, Y. Hwang, T. Kim, in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Cross-speaker emotion transfer by manipulating speech style latents (IEEE, 2023), pp. 1\u20135. https:\/\/doi.org\/10.1109\/icassp49357.2023.10095619","DOI":"10.1109\/icassp49357.2023.10095619"},{"key":"329_CR20","doi-asserted-by":"publisher","unstructured":"T.H. Teh, V. Hu, D.S.R. Mohan, Z. Hodari, C.G. Wallis, T.G. Ibarrondo, A. Torresquintero, J. Leoni, M. Gales, S. King, in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Ensemble prosody prediction for expressive speech synthesis (IEEE, 2023), pp. 1\u20135. https:\/\/doi.org\/10.1109\/icassp49357.2023.10096962","DOI":"10.1109\/icassp49357.2023.10096962"},{"key":"329_CR21","unstructured":"D. Yang, S. Liu, R. Huang, G. Lei, C. 
Weng, H. Meng, D. Yu, Instructtts: Modelling expressive tts in discrete latent space with natural language style prompt. (2023).\u00a0arXiv\u00a0preprint\u00a0arXiv:2301.13662"},{"key":"329_CR22","unstructured":"C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, et al., Neural codec language models are zero-shot text to speech synthesizers. (2023).\u00a0arXiv\u00a0preprint\u00a0arXiv:2301.02111"},{"issue":"4","key":"329_CR23","doi-asserted-by":"publisher","first-page":"2225","DOI":"10.3390\/app13042225","volume":"13","author":"W Zhao","year":"2023","unstructured":"W. Zhao, Z. Yang, An emotion speech synthesis method based on vits. Appl. Sci. 13(4), 2225 (2023)","journal-title":"Appl. Sci."},{"key":"329_CR24","unstructured":"H.S. Oh, S.H. Lee, S.W. Lee, Diffprosody: Diffusion-based latent prosody generation for expressive speech synthesis with prosody conditional adversarial training. (2023).\u00a0arXiv\u00a0preprint\u00a0arXiv:2307.16549"},{"key":"329_CR25","unstructured":"M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz, M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar, et al., Voicebox: Text-guided multilingual universal speech generation at scale. (2023).\u00a0arXiv\u00a0preprint\u00a0arXiv:2306.15687"},{"key":"329_CR26","doi-asserted-by":"publisher","unstructured":"P. Wu, Z. Ling, L. Liu, Y. Jiang, H. Wu, L. Dai, in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). End-to-end emotional speech synthesis using style tokens and semi-supervised training (IEEE, 2019), pp. 623\u2013627.  https:\/\/doi.org\/10.1109\/apsipaasc47483.2019.9023186","DOI":"10.1109\/apsipaasc47483.2019.9023186"},{"key":"329_CR27","doi-asserted-by":"publisher","unstructured":"X. Zhu, S. Yang, G. Yang, L. Xie, in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Controlling emotion strength with relative attribute for end-to-end speech synthesis (IEEE, 2019), pp. 
192\u2013199. https:\/\/doi.org\/10.1109\/asru46091.2019.9003829","DOI":"10.1109\/asru46091.2019.9003829"},{"key":"329_CR28","doi-asserted-by":"publisher","first-page":"151","DOI":"10.1016\/j.cogsys.2019.09.009","volume":"59","author":"X Zhu","year":"2020","unstructured":"X. Zhu, L. Xue, Building a controllable expressive speech synthesis system with multiple emotion strengths. Cogn. Syst. Res. 59, 151\u2013159 (2020)","journal-title":"Cogn. Syst. Res."},{"key":"329_CR29","doi-asserted-by":"publisher","unstructured":"G. Xu, W. Song, Z. Zhang, C. Zhang, X. He, B. Zhou, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Improving prosody modelling with cross-utterance bert embeddings for end-to-end speech synthesis (IEEE, 2021), pp. 6079\u20136083. https:\/\/doi.org\/10.1109\/icassp39728.2021.9414102","DOI":"10.1109\/icassp39728.2021.9414102"},{"key":"329_CR30","doi-asserted-by":"publisher","unstructured":"A. Sun, J. Wang, N. Cheng, H. Peng, Z. Zeng, L. Kong, J. Xiao, in 2021 IEEE Spoken Language Technology Workshop (SLT). Graphpb: Graphical representations of prosody boundary in speech synthesis (IEEE, 2021), pp. 438\u2013445. https:\/\/doi.org\/10.1109\/slt48900.2021.9383530","DOI":"10.1109\/slt48900.2021.9383530"},{"key":"329_CR31","doi-asserted-by":"publisher","unstructured":"Y. Lei, S. Yang, L. Xie, in 2021 IEEE Spoken Language Technology Workshop (SLT). Fine-grained emotion strength transfer, control and prediction for emotional speech synthesis (IEEE, 2021), pp. 423\u2013430. https:\/\/doi.org\/10.1109\/slt48900.2021.9383524","DOI":"10.1109\/slt48900.2021.9383524"},{"key":"329_CR32","doi-asserted-by":"publisher","unstructured":"T. Li, S. Yang, L. Xue, L. Xie, in 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). Controllable emotion transfer for end-to-end speech synthesis (IEEE, 2021), pp. 1\u20135. 
https:\/\/doi.org\/10.1109\/iscslp49672.2021.9362069","DOI":"10.1109\/iscslp49672.2021.9362069"},{"key":"329_CR33","doi-asserted-by":"publisher","first-page":"853","DOI":"10.1109\/TASLP.2022.3145293","volume":"30","author":"Y Lei","year":"2022","unstructured":"Y. Lei, S. Yang, X. Wang, L. Xie, Msemotts: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis. IEEE\/ACM Trans. Audio Speech Lang. Process. 30, 853\u2013864 (2022)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"329_CR34","doi-asserted-by":"publisher","first-page":"1448","DOI":"10.1109\/TASLP.2022.3164181","volume":"30","author":"T Li","year":"2022","unstructured":"T. Li, X. Wang, Q. Xie, Z. Wang, L. Xie, Cross-speaker emotion disentangling and transfer for end-to-end speech synthesis. IEEE\/ACM Trans. Audio Speech Lang. Process. 30, 1448\u20131460 (2022)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"329_CR35","doi-asserted-by":"publisher","unstructured":"N.Q. Wu, Z.C. Liu, Z.H. Ling, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Discourse-level prosody modeling with a variational autoencoder for non-autoregressive expressive speech synthesis (IEEE, 2022), pp. 7592\u20137596. https:\/\/doi.org\/10.1109\/icassp43922.2022.9746238","DOI":"10.1109\/icassp43922.2022.9746238"},{"key":"329_CR36","doi-asserted-by":"publisher","unstructured":"K. He, C. Sun, R. Zhu, L. Zhao, in 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP). Multi-speaker emotional speech synthesis with limited datasets: Two-stage non-parallel training strategy (IEEE, 2022), pp. 545\u2013548. https:\/\/doi.org\/10.1109\/icsp54964.2022.9778768","DOI":"10.1109\/icsp54964.2022.9778768"},{"key":"329_CR37","doi-asserted-by":"publisher","first-page":"2854","DOI":"10.1109\/TASLP.2022.3202126","volume":"30","author":"L Xue","year":"2022","unstructured":"L. Xue, F.K. Soong, S. 
Zhang, L. Xie, Paratts: Learning linguistic and prosodic cross-sentence information in paragraph-based tts. IEEE\/ACM Trans. Audio Speech Lang. Process. 30, 2854\u20132864 (2022)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"329_CR38","doi-asserted-by":"publisher","first-page":"1948","DOI":"10.1109\/LSP.2022.3203888","volume":"29","author":"Y Lei","year":"2022","unstructured":"Y. Lei, S. Yang, X. Zhu, L. Xie, D. Su, Cross-speaker emotion transfer through information perturbation in emotional speech synthesis. IEEE Signal Process. Lett. 29, 1948\u20131952 (2022)","journal-title":"IEEE Signal Process. Lett."},{"key":"329_CR39","doi-asserted-by":"crossref","unstructured":"T. Li, X. Wang, Q. Xie, Z. Wang, M. Jiang, L. Xie, Cross-speaker emotion transfer based on prosody compensation for end-to-end speech synthesis. IEEE\/ACM Trans. Audio Speech Lang. Process. 30, 1448\u20131460 (2022). arXiv preprint arXiv:2207.01198","DOI":"10.1109\/TASLP.2022.3164181"},{"key":"329_CR40","doi-asserted-by":"crossref","unstructured":"Y. Wu, X. Wang, S. Zhang, L. He, R. Song, J.Y. Nie, Self-supervised context-aware style representation for expressive speech synthesis. Proc. Annu. Conf. Int. Speech Commun. Assoc. pp. 5503\u20135507 (2022). arXiv preprint arXiv:2206.12559","DOI":"10.21437\/Interspeech.2022-686"},{"key":"329_CR41","doi-asserted-by":"crossref","unstructured":"R. Li, Z. Wu, Y. Huang, J. Jia, H. Meng, L. Cai, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Emphatic speech generation with conditioned input layer and bidirectional lstms for expressive speech synthesis (IEEE, 2018), pp. 5129\u20135133","DOI":"10.1109\/ICASSP.2018.8461748"},{"key":"329_CR42","doi-asserted-by":"publisher","unstructured":"X. Wu, L. Sun, S. Kang, S. Liu, Z. Wu, X. Liu, H. Meng, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 
Feature based adaptation for speaking style synthesis (IEEE, 2018), pp. 5304\u20135308.  https:\/\/doi.org\/10.1109\/icassp.2018.8462178","DOI":"10.1109\/icassp.2018.8462178"},{"key":"329_CR43","doi-asserted-by":"publisher","unstructured":"L. Xue, X. Zhu, X. An, L. Xie, in Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and first Multi-Modal Affective Computing of Large-Scale Multimedia Data (ASMMC-MMAC). A comparison of expressive speech synthesis approaches based on neural network (ACM, 2018), pp. 15\u201320. https:\/\/doi.org\/10.1145\/3267935.3267947","DOI":"10.1145\/3267935.3267947"},{"key":"329_CR44","doi-asserted-by":"crossref","unstructured":"Z. Zeng, J. Wang, N. Cheng, J. Xiao, in Proc. Interspeech 2020. Prosody learning mechanism for speech synthesis system without text length limit, vol. 2020 (2020), pp. 4422\u20134426. arXiv preprint arXiv:2008.05656","DOI":"10.21437\/Interspeech.2020-2053"},{"key":"329_CR45","doi-asserted-by":"crossref","unstructured":"F. Yang, S. Yang, Q. Wu, Y. Wang, L. Xie, in Proc. Interspeech 2020. Exploiting deep sentential context for expressive end-to-end speech synthesis., vol. 2020 (2020), pp. 3436\u20133440. arXiv preprint arXiv:2008.00613","DOI":"10.21437\/Interspeech.2020-2423"},{"key":"329_CR46","doi-asserted-by":"publisher","first-page":"1582","DOI":"10.1109\/TASLP.2021.3074757","volume":"29","author":"YJ Zhang","year":"2021","unstructured":"Y.J. Zhang, Z.H. Ling, Extracting and predicting word-level style variations for speech synthesis. IEEE\/ACM Trans. Audio Speech Lang. Process. 29, 1582\u20131593 (2021)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"329_CR47","doi-asserted-by":"publisher","unstructured":"C. Lu, X. Wen, R. Liu, X. Chen, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Multi-speaker emotional speech synthesis with fine-grained prosody modeling (IEEE, 2021), pp. 5729\u20135733. 
https:\/\/doi.org\/10.1109\/icassp39728.2021.9413398","DOI":"10.1109\/icassp39728.2021.9413398"},{"key":"329_CR48","doi-asserted-by":"publisher","unstructured":"C. Gong, L. Wang, Z. Ling, S. Guo, J. Zhang, J. Dang, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Improving naturalness and controllability of sequence-to-sequence speech synthesis by learning local prosody representations (IEEE, 2021), pp. 5724\u20135728. https:\/\/doi.org\/10.1109\/icassp39728.2021.9414720","DOI":"10.1109\/icassp39728.2021.9414720"},{"key":"329_CR49","doi-asserted-by":"crossref","unstructured":"X. Li, C. Song, J. Li, Z. Wu, J. Jia, H. Meng, Towards multi-scale style control for expressive speech synthesis. (2021).\u00a0arXiv\u00a0preprint\u00a0arXiv:2104.03521","DOI":"10.21437\/Interspeech.2021-947"},{"key":"329_CR50","doi-asserted-by":"publisher","unstructured":"S. Lei, Y. Zhou, L. Chen, Z. Wu, S. Kang, H. Meng, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Towards expressive speaking style modelling with hierarchical context information for mandarin speech synthesis (IEEE, 2022), pp. 7922\u20137926.  https:\/\/doi.org\/10.1109\/icassp43922.2022.9747438","DOI":"10.1109\/icassp43922.2022.9747438"},{"key":"329_CR51","doi-asserted-by":"publisher","unstructured":"F. Yang, J. Luan, Y. Wang, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Improving emotional speech synthesis by using sus-constrained vae and text encoder aggregation (IEEE, 2022), pp. 8302\u20138306. https:\/\/doi.org\/10.1109\/icassp43922.2022.9746994","DOI":"10.1109\/icassp43922.2022.9746994"},{"key":"329_CR52","doi-asserted-by":"publisher","unstructured":"R. Li, D. Pu, M. Huang, B. Huang, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 
Unet-tts: Improving unseen speaker and style transfer in one-shot voice cloning (IEEE, 2022), pp. 8327\u20138331. https:\/\/doi.org\/10.1109\/icassp43922.2022.9746049","DOI":"10.1109\/icassp43922.2022.9746049"},{"key":"329_CR53","doi-asserted-by":"publisher","unstructured":"Y. Wang, Y. Xie, K. Zhao, H. Wang, Q. Zhang, in 2022 IEEE International Conference on Multimedia and Expo (ICME). Unsupervised quantized prosody representation for controllable speech synthesis (IEEE, 2022), pp. 1\u20136. https:\/\/doi.org\/10.1109\/icme52920.2022.9859946","DOI":"10.1109\/icme52920.2022.9859946"},{"key":"329_CR54","doi-asserted-by":"crossref","unstructured":"Y. Zhou, C. Song, J. Li, Z. Wu, Y. Bian, D. Su, H. Meng, in Proc. Interspeech 2022. Enhancing word-level semantic representation via dependency structure for expressive text-to-speech synthesis, vol. 2022 (2022), pp. 5518\u20135522. arXiv preprint arXiv:2104.06835","DOI":"10.21437\/Interspeech.2022-10061"},{"key":"329_CR55","doi-asserted-by":"publisher","unstructured":"Y. Lee, T. Kim, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Robust and fine-grained prosody control of end-to-end speech synthesis (IEEE, 2019), pp. 5911\u20135915. https:\/\/doi.org\/10.1109\/icassp.2019.8683501","DOI":"10.1109\/icassp.2019.8683501"},{"key":"329_CR56","doi-asserted-by":"publisher","unstructured":"H. Choi, S. Park, J. Park, M. Hahn, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Multi-speaker emotional acoustic modeling for cnn-based speech synthesis (IEEE, 2019), pp. 6950\u20136954. https:\/\/doi.org\/10.1109\/icassp.2019.8683682","DOI":"10.1109\/icassp.2019.8683682"},{"issue":"9","key":"329_CR57","doi-asserted-by":"publisher","first-page":"1383","DOI":"10.1109\/LSP.2019.2931673","volume":"26","author":"O Kwon","year":"2019","unstructured":"O. Kwon, I. Jang, C. Ahn, H.G. 
Kang, An effective style token weight control technique for end-to-end emotional speech synthesis. IEEE Signal Process. Lett. 26(9), 1383\u20131387 (2019)","journal-title":"IEEE Signal Process. Lett."},{"key":"329_CR58","doi-asserted-by":"publisher","unstructured":"S.Y. Um, S. Oh, K. Byun, I. Jang, C. Ahn, H.G. Kang, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Emotional speech synthesis with rich and granularized control (IEEE, 2020), pp. 7254\u20137258. https:\/\/doi.org\/10.1109\/icassp40776.2020.9053732","DOI":"10.1109\/icassp40776.2020.9053732"},{"key":"329_CR59","doi-asserted-by":"crossref","unstructured":"M. Kim, S.J. Cheon, B.J. Choi, J.J. Kim, N.S. Kim, in Proc. ISCA Interspeech 2021. Expressive text-to-speech using style tag, vol. 2021 (2021), pp. 4663\u20134667. arXiv preprint arXiv:2104.00436","DOI":"10.21437\/Interspeech.2021-465"},{"key":"329_CR60","doi-asserted-by":"publisher","first-page":"25455","DOI":"10.1109\/ACCESS.2022.3156093","volume":"10","author":"S Moon","year":"2022","unstructured":"S. Moon, S. Kim, Y.H. Choi, Mist-tacotron: end-to-end emotional speech synthesis using mel-spectrogram image style transfer. IEEE Access 10, 25455\u201325463 (2022)","journal-title":"IEEE Access"},{"key":"329_CR61","doi-asserted-by":"publisher","unstructured":"C.B. Im, S.H. Lee, S.B. Kim, S.W. Lee, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Emoq-tts: Emotion intensity quantization for fine-grained controllable emotional text-to-speech (IEEE, 2022), pp. 6317\u20136321. https:\/\/doi.org\/10.1109\/icassp43922.2022.9747098","DOI":"10.1109\/icassp43922.2022.9747098"},{"key":"329_CR62","doi-asserted-by":"publisher","unstructured":"Y. Shin, Y. Lee, S. Jo, Y. Hwang, T. Kim, in Proc. Interspeech 2022. Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS (2022), pp. 2313\u20132317. 
https:\/\/doi.org\/10.21437\/Interspeech.2022-10131","DOI":"10.21437\/Interspeech.2022-10131"},{"key":"329_CR63","doi-asserted-by":"publisher","unstructured":"C. Kim, S.Y. Um, H. Yoon, H.G. Kang, in Proc. Interspeech 2022. Fluenttts: Text-dependent fine-grained style control for multi-style tts, vol. 2022 (2022), pp. 4561\u20134565. https:\/\/doi.org\/10.21437\/Interspeech.2022-988","DOI":"10.21437\/Interspeech.2022-988"},{"key":"329_CR64","doi-asserted-by":"publisher","unstructured":"H.W. Yoon, O. Kwon, H. Lee, R. Yamamoto, E. Song, J.M. Kim, M.J. Hwang, in Proc. Interspeech 2022. Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems (2022), pp. 4596\u20134600. https:\/\/doi.org\/10.21437\/Interspeech.2022-11133","DOI":"10.21437\/Interspeech.2022-11133"},{"key":"329_CR65","doi-asserted-by":"publisher","unstructured":"K. Inoue, S. Hara, M. Abe, N. Hojo, Y. Ijima, in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). An investigation to transplant emotional expressions in dnn-based tts synthesis (IEEE, 2017), pp. 1253\u20131258. https:\/\/doi.org\/10.1109\/apsipa.2017.8282231","DOI":"10.1109\/apsipa.2017.8282231"},{"key":"329_CR66","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1016\/j.specom.2018.03.002","volume":"99","author":"J Lorenzo-Trueba","year":"2018","unstructured":"J. Lorenzo-Trueba, G.E. Henter, S. Takaki, J. Yamagishi, Y. Morino, Y. Ochiai, Investigating different representations for modeling and controlling multiple emotions in dnn-based speech synthesis. Speech Commun. 99, 135\u2013143 (2018)","journal-title":"Speech Commun."},{"key":"329_CR67","doi-asserted-by":"publisher","unstructured":"T. Koriyama, T. Kobayashi, in Proc. Interspeech 2019. Semi-supervised prosody modeling using deep gaussian process latent variable model. (2019), pp. 4450\u20134454. 
https:\/\/doi.org\/10.21437\/Interspeech.2019-2497","DOI":"10.21437\/Interspeech.2019-2497"},{"key":"329_CR68","doi-asserted-by":"crossref","unstructured":"Y. Hono, K. Tsuboi, K. Sawada, K. Hashimoto, K. Oura, Y. Nankaku, K. Tokuda, in Proc. ISCA Interspeech  2020. Hierarchical multi-grained generative model for expressive speech synthesis, vol. 2020 (2020), pp. 3441\u20133445. arXiv preprint arXiv:2009.08474","DOI":"10.21437\/Interspeech.2020-2477"},{"key":"329_CR69","doi-asserted-by":"publisher","first-page":"35","DOI":"10.1016\/j.specom.2020.11.004","volume":"126","author":"K Inoue","year":"2021","unstructured":"K. Inoue, S. Hara, M. Abe, N. Hojo, Y. Ijima, Model architectures to extrapolate emotional expressions in dnn-based text-to-speech. Speech Commun. 126, 35\u201343 (2021)","journal-title":"Speech Commun."},{"key":"329_CR70","doi-asserted-by":"publisher","unstructured":"W. Nakata, T. Koriyama, S. Takamichi, Y. Saito, Y. Ijima, R. Masumura, H. Saruwatari, in Proc. Interspeech 2022. Predicting VQVAE-based Character Acting Style from Quotation-Annotated Text for Audiobook Speech Synthesis (2022), pp. 4551\u20134555. https:\/\/doi.org\/10.21437\/Interspeech.2022-638","DOI":"10.21437\/Interspeech.2022-638"},{"key":"329_CR71","doi-asserted-by":"crossref","unstructured":"D.S.R. Mohan, V. Hu, T.H. Teh, A. Torresquintero, C.G. Wallis, M. Staib, L. Foglianti, J. Gao, S. King, in Interspeech 2021. Ctrl-p: Temporal control of prosodic variation for speech synthesis, vol. 2021 (2021), pp. 3875\u20133879. arXiv preprint arXiv:2106.08352","DOI":"10.21437\/Interspeech.2021-1583"},{"key":"329_CR72","doi-asserted-by":"crossref","unstructured":"G. Pamisetty, K. Sri Rama Murty, Prosody-tts: An end-to-end speech synthesis system with prosody control. Circ. Syst. Signal Process. 42(1), 361\u2013384 (2023)","DOI":"10.1007\/s00034-022-02126-z"},{"key":"329_CR73","doi-asserted-by":"publisher","unstructured":"L. Zhao, J. Yang, Q. 
Qin, in 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence (ACAI '20). Enhancing prosodic features by adopting pre-trained language model in bahasa indonesia speech synthesis (ACM, 2020), pp. 1\u20136.  https:\/\/doi.org\/10.48550\/arXiv.2102.00184","DOI":"10.48550\/arXiv.2102.00184"},{"key":"329_CR74","unstructured":"R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. Weiss, R. Clark, R.A. Saurous, in international conference on machine learning. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron (PMLR, 2018), pp. 4693\u20134702. https:\/\/proceedings.mlr.press\/v80\/skerry-ryan18a.html"},{"key":"329_CR75","unstructured":"Y. Wang, D. Stanton, Y. Zhang, R.S. Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, R.A. Saurous, in International Conference on Machine Learning. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis (PMLR, 2018), pp. 5180\u20135189. https:\/\/proceedings.mlr.press\/v80\/wang18h.html"},{"key":"329_CR76","doi-asserted-by":"crossref","unstructured":"K. Akuzawa, Y. Iwasawa, Y. Matsuo, Expressive speech synthesis via modeling expressions with variational autoencoder. (2018).\u00a0arXiv\u00a0preprint\u00a0arXiv:1804.02135","DOI":"10.21437\/Interspeech.2018-1113"},{"key":"329_CR77","doi-asserted-by":"publisher","unstructured":"Y.J. Zhang, S. Pan, L. He, Z.H. Ling, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Learning latent representations for style control and transfer in end-to-end speech synthesis (IEEE, 2019), pp. 6945\u20136949. https:\/\/doi.org\/10.1109\/icassp.2019.8683623","DOI":"10.1109\/icassp.2019.8683623"},{"key":"329_CR78","doi-asserted-by":"publisher","unstructured":"S. Suzi\u0107, T. Nosek, M. Se\u010dujski, D. Pekar, V. Deli\u0107, in 2019 27th Telecommunications Forum (TELFOR). 
Dnn based expressive text-to-speech with limited training data (IEEE, 2019), pp. 1\u20136. https:\/\/doi.org\/10.1109\/telfor48224.2019.8971351","DOI":"10.1109\/telfor48224.2019.8971351"},{"key":"329_CR79","doi-asserted-by":"publisher","unstructured":"T. Cornille, F. Wang, J. Bekker, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Interactive multi-level prosody control for expressive speech synthesis (IEEE, 2022), pp. 8312\u20138316. https:\/\/doi.org\/10.1109\/icassp43922.2022.9746654","DOI":"10.1109\/icassp43922.2022.9746654"},{"key":"329_CR80","doi-asserted-by":"publisher","unstructured":"S. Suzic, T.V. Delic, S. Ostrogonac, S. Duric, D.J. Pekar, Style-code method for multi-style parametric text-to-speech synthesis. SPIIRAS Proc. 5(60), 216 (2018). https:\/\/doi.org\/10.15622\/sp.60.8","DOI":"10.15622\/sp.60.8"},{"key":"329_CR81","doi-asserted-by":"publisher","unstructured":"J. Parker, Y. Stylianou, R. Cipolla, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Adaptation of an expressive single speaker deep neural network speech synthesis system (IEEE, 2018), pp. 5309\u20135313. https:\/\/doi.org\/10.1109\/icassp.2018.8461888","DOI":"10.1109\/icassp.2018.8461888"},{"issue":"6","key":"329_CR82","first-page":"171","volume":"16","author":"S Suzi\u0107","year":"2019","unstructured":"S. Suzi\u0107, T. Deli\u0107, D. Pekar, V. Deli\u0107, M. Se\u010dujski, Style transplantation in neural network based speech synthesis. Acta Polytech. Hungarica 16(6), 171\u2013189 (2019)","journal-title":"Acta Polytech. Hungarica"},{"key":"329_CR83","doi-asserted-by":"crossref","unstructured":"N. Prateek, M. \u0141ajszczak, R. Barra-Chicote, T. Drugman, J. Lorenzo-Trueba, T. Merritt, S. Ronanki, T. Wood, in NAACL HLT 2019. In other news: A bi-style text-to-speech model for synthesizing newscaster voice with limited data. (2019). 
arXiv preprint arXiv:1904.02790","DOI":"10.18653\/v1\/N19-2026"},{"issue":"4","key":"329_CR84","doi-asserted-by":"publisher","first-page":"434","DOI":"10.3897\/jucs.2020.023","volume":"26","author":"M Secujski","year":"2020","unstructured":"M. Secujski, D. Pekar, S. Suzic, A. Smirnov, T.V. Nosek, Speaker\/style-dependent neural network speech synthesis based on speaker\/style embedding. J. Univers. Comput. Sci. 26(4), 434\u2013453 (2020)","journal-title":"J. Univers. Comput. Sci."},{"key":"329_CR85","doi-asserted-by":"crossref","unstructured":"Y. Gao, W. Zheng, Z. Yang, T. Kohler, C. Fuegen, Q. He, in Proc. Interspeech 2020.  Interactive text-to-speech system via joint style analysis, vol. 2020 (2020), pp. 4447\u20134451. arXiv preprint arXiv:2002.06758","DOI":"10.21437\/Interspeech.2020-3069"},{"key":"329_CR86","doi-asserted-by":"crossref","unstructured":"S. Pan, L. He, in Proc. Annu. Conf. INTERSPEECH 2021. Cross-speaker style transfer with prosody bottleneck in neural speech synthesis, vol. 2021 (2021), pp. 4678\u20134682. arXiv preprint arXiv:2107.12562","DOI":"10.21437\/Interspeech.2021-979"},{"key":"329_CR87","doi-asserted-by":"publisher","unstructured":"J. He, C. Gong, L. Wang, D. Jin, X. Wang, J. Xu, J. Dang, in Proc. Interspeech 2022. Improve emotional speech synthesis quality by learning explicit and implicit representations with semi-supervised training (2022), pp. 5538\u20135542. https:\/\/doi.org\/10.21437\/Interspeech.2022-11336","DOI":"10.21437\/Interspeech.2022-11336"},{"key":"329_CR88","first-page":"1877","volume":"33","author":"T Brown","year":"2020","unstructured":"T. Brown, B. Mann, N. Ryder, M. Subbiah, J.D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877\u20131901 (2020)","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"329_CR89","doi-asserted-by":"publisher","unstructured":"D. Parikh, K. 
Grauman, in 2011 International Conference on Computer Vision. Relative attributes (IEEE, 2011), pp. 503\u2013510. https:\/\/doi.org\/10.1109\/iccv.2011.6126281","DOI":"10.1109\/iccv.2011.6126281"},{"issue":"1","key":"329_CR90","first-page":"2096","volume":"17","author":"Y Ganin","year":"2016","unstructured":"Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, V. Lempitsky, Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17(1), 2096\u20132030 (2016)","journal-title":"J. Mach. Learn. Res."},{"key":"329_CR91","doi-asserted-by":"publisher","first-page":"1806","DOI":"10.1109\/TASLP.2021.3076369","volume":"29","author":"R Liu","year":"2021","unstructured":"R. Liu, B. Sisman, G. Gao, H. Li, Expressive tts training with frame and style reconstruction loss. IEEE\/ACM Trans. Audio Speech Lang. Process. 29, 1806\u20131818 (2021)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"329_CR92","doi-asserted-by":"crossref","unstructured":"R. Liu, B. Sisman, H. Li, in Proc. Annu. Conf. Int. Speech Commun. Assoc. 2021. Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability. (2021). pp. 4648-4652. arXiv preprint arXiv:2104.01408","DOI":"10.21437\/Interspeech.2021-1236"},{"key":"329_CR93","doi-asserted-by":"crossref","unstructured":"X. Dai, C. Gong, L. Wang, K. Zhang, Information sieve: Content leakage reduction in end-to-end prosody for expressive speech synthesis. (2021).\u00a0arXiv\u00a0preprint\u00a0arXiv:2108.01831","DOI":"10.21437\/Interspeech.2021-1011"},{"key":"329_CR94","doi-asserted-by":"publisher","unstructured":"D. Stanton, Y. Wang, R. Skerry-Ryan, in 2018 IEEE Spoken Language Technology Workshop (SLT). Predicting expressive speaking style from text in end-to-end speech synthesis (IEEE, 2018), pp. 595\u2013602. 
https:\/\/doi.org\/10.1109\/slt.2018.8639682","DOI":"10.1109\/slt.2018.8639682"},{"key":"329_CR95","doi-asserted-by":"crossref","unstructured":"C. Du, K. Yu, in Proc. ISCA Interspeech 2021. Rich prosody diversity modelling with phone-level mixture density network, vol. 2021 (2021), pp. 3136\u20133140. arXiv preprint arXiv:2102.00851","DOI":"10.21437\/Interspeech.2021-802"},{"key":"329_CR96","doi-asserted-by":"publisher","unstructured":"Z. Lyu, J. Zhu, in 2022 12th International Conference on Information Science and Technology (ICIST). Enriching style transfer in multi-scale control based personalized end-to-end speech synthesis (IEEE, 2022), pp. 114\u2013119. https:\/\/doi.org\/10.1109\/icist55546.2022.9926908","DOI":"10.1109\/icist55546.2022.9926908"},{"key":"329_CR97","doi-asserted-by":"crossref","unstructured":"K. Lee, K. Park, D. Kim, in Proc. Interspeech 2021. Styler: Style factor modeling with rapidity and robustness via speech decomposition for expressive and controllable neural text to speech, vol. 2021 (2021), pp. 4643\u20134647. arXiv preprint arXiv:2103.09474","DOI":"10.21437\/Interspeech.2021-838"},{"key":"329_CR98","doi-asserted-by":"publisher","unstructured":"S.H. Lee, H.W. Yoon, H.R. Noh, J.H. Kim, S.W. Lee, in Proceedings of the AAAI Conference on Artificial Intelligence. Multi-spectrogan: High-diversity and high-fidelity spectrogram generation with adversarial style combination for speech synthesis, AAAI, vol. 35 (2021), pp. 13198\u201313206. https:\/\/doi.org\/10.1609\/aaai.v35i14.17559","DOI":"10.1609\/aaai.v35i14.17559"},{"key":"329_CR99","unstructured":"X. Luo, S. Takamichi, T. Koriyama, Y. Saito, H. Saruwatari, in 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Emotion-controllable speech synthesis using emotion soft labels and fine-grained prosody factors (IEEE, 2021), pp. 794\u2013799"},{"key":"329_CR100","doi-asserted-by":"publisher","unstructured":"C. Gong, L. Wang, Z. Ling, J. 
Zhang, J. Dang, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Using multiple reference audios and style embedding constraints for speech synthesis (IEEE, 2022), pp. 7912\u20137916. https:\/\/doi.org\/10.1109\/icassp43922.2022.9747801","DOI":"10.1109\/icassp43922.2022.9747801"},{"key":"329_CR101","doi-asserted-by":"publisher","unstructured":"S. Liang, C. Miao, M. Chen, J. Ma, S. Wang, J. Xiao, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Unsupervised learning for multi-style speech synthesis with limited data (IEEE, 2021), pp. 6583\u20136587. https:\/\/doi.org\/10.1109\/icassp39728.2021.9414220","DOI":"10.1109\/icassp39728.2021.9414220"},{"key":"329_CR102","doi-asserted-by":"publisher","unstructured":"K. Zhang, C. Gong, W. Lu, L. Wang, J. Wei, D. Liu, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Joint and adversarial training with asr for expressive speech synthesis (IEEE, 2022), pp. 6322\u20136326. https:\/\/doi.org\/10.1109\/icassp43922.2022.9746442","DOI":"10.1109\/icassp43922.2022.9746442"},{"key":"329_CR103","doi-asserted-by":"crossref","unstructured":"T. Raitio, R. Rasipuram, D. Castellani, in  Interspeech 2020. Controllable neural text-to-speech synthesis using intuitive prosodic features, vol. 2020 (2020), pp. 4432\u20134436. arXiv preprint arXiv:2009.06775","DOI":"10.21437\/Interspeech.2020-2861"},{"key":"329_CR104","doi-asserted-by":"publisher","unstructured":"D.R. Liu, C.Y. Yang, S.L. Wu, H.Y. Lee, in 2018 IEEE Spoken Language Technology Workshop (SLT). Improving unsupervised style transfer in end-to-end speech synthesis with end-to-end speech recognition (IEEE, 2018), pp. 640\u2013647. https:\/\/doi.org\/10.1109\/slt.2018.8639672","DOI":"10.1109\/slt.2018.8639672"},{"key":"329_CR105","doi-asserted-by":"publisher","unstructured":"X. Cai, D. Dai, Z. Wu, X. Li, J. Li, H. 
Meng, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition (IEEE, 2021), pp. 5734\u20135738. https:\/\/doi.org\/10.1109\/icassp39728.2021.9413907","DOI":"10.1109\/icassp39728.2021.9413907"},{"key":"329_CR106","doi-asserted-by":"publisher","unstructured":"R. Chung, B. Mak, in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). On-the-fly data augmentation for text-to-speech style transfer (IEEE, 2021), pp. 634\u2013641. https:\/\/doi.org\/10.1109\/asru51503.2021.9688074","DOI":"10.1109\/asru51503.2021.9688074"},{"key":"329_CR107","doi-asserted-by":"publisher","first-page":"223","DOI":"10.1016\/j.neunet.2021.03.005","volume":"140","author":"L Xue","year":"2021","unstructured":"L. Xue, S. Pan, L. He, L. Xie, F.K. Soong, Cycle consistent network for end-to-end style transfer tts training. Neural Netw. 140, 223\u2013236 (2021)","journal-title":"Neural Netw."},{"issue":"15","key":"329_CR108","doi-asserted-by":"publisher","first-page":"5325","DOI":"10.3390\/app10155325","volume":"10","author":"SJ Cheon","year":"2020","unstructured":"S.J. Cheon, J.Y. Lee, B.J. Choi, H. Lee, N.S. Kim, Gated recurrent attention for multi-style speech synthesis. Appl. Sci. 10(15), 5325 (2020)","journal-title":"Appl. Sci."},{"key":"329_CR109","unstructured":"T. Kenter, V. Wan, C.A. Chan, R. Clark, J. Vit, in International Conference on Machine Learning. Chive: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network (PMLR, 2019), pp. 3331\u20133340. https:\/\/proceedings.mlr.press\/v97\/kenter19a.html"},{"key":"329_CR110","doi-asserted-by":"crossref","unstructured":"T. Kenter, M.K. Sharma, R. Clark, in Proc. Interspeech 2020. 
Improving prosody of rnn-based english text-to-speech synthesis by incorporating a bert model, vol 2020 (2020), pp. 4412\u20134416","DOI":"10.21437\/Interspeech.2020-1430"},{"key":"329_CR111","doi-asserted-by":"crossref","unstructured":"D. Tan, T. Lee, in Proc. Annu. Conf. Int. Speech Commun. Assoc. 2020. Fine-grained style modeling, transfer and prediction in text-to-speech synthesis via phone-level content-style disentanglement, vol. 2020 (2020), pp. 4683\u20134687. arXiv preprint arXiv:2011.03943","DOI":"10.21437\/Interspeech.2021-1129"},{"key":"329_CR112","doi-asserted-by":"publisher","first-page":"22","DOI":"10.1016\/j.specom.2022.11.006","volume":"146","author":"N Ellinas","year":"2023","unstructured":"N. Ellinas, M. Christidou, A. Vioni, J.S. Sung, A. Chalamandaris, P. Tsiakoulis, P. Mastorocostas, Controllable speech synthesis by learning discrete phoneme-level prosodic representations. Speech Commun. 146, 22\u201331 (2023)","journal-title":"Speech Commun."},{"key":"329_CR113","doi-asserted-by":"publisher","unstructured":"A. Vioni, M. Christidou, N. Ellinas, G. Vamvoukakis, P. Kakoulidis, T. Kim, J.S. Sung, H. Park, A. Chalamandaris, P. Tsiakoulis, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Prosodic clustering for phoneme-level prosody control in end-to-end speech synthesis (IEEE, 2021), pp. 5719\u20135723. https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9413604","DOI":"10.1109\/ICASSP39728.2021.9413604"},{"key":"329_CR114","doi-asserted-by":"publisher","unstructured":"R. Valle, J. Li, R. Prenger, B. Catanzaro, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (IEEE, 2020), pp. 6189\u20136193. 
https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9054556","DOI":"10.1109\/ICASSP40776.2020.9054556"},{"key":"329_CR115","doi-asserted-by":"publisher","unstructured":"G. Huybrechts, T. Merritt, G. Comini, B. Perz, R. Shah, J. Lorenzo-Trueba, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Low-resource expressive text-to-speech using data augmentation (IEEE, 2021), pp. 6593\u20136597. https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9413466","DOI":"10.1109\/ICASSP39728.2021.9413466"},{"key":"329_CR116","doi-asserted-by":"publisher","unstructured":"Y. Guo, C. Du, K. Yu, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Unsupervised word-level prosody tagging for controllable speech synthesis (IEEE, 2022), pp. 7597\u20137601. https:\/\/doi.org\/10.1109\/ICASSP43922.2022.9746323","DOI":"10.1109\/ICASSP43922.2022.9746323"},{"key":"329_CR117","doi-asserted-by":"publisher","unstructured":"D. Paul, S. Mukherjee, Y. Pantazis, Y. Stylianou, in Interspeech 2021. A universal multi-speaker multi-style text-to-speech via disentangled representation learning based on r\u00e9nyi divergence minimization (2021), pp. 3625\u20133629. https:\/\/doi.org\/10.21437\/Interspeech.2021-660","DOI":"10.21437\/Interspeech.2021-660"},{"key":"329_CR118","doi-asserted-by":"crossref","unstructured":"J. Za\u00efdi, H. Seut\u00e9, B. van Niekerk, M.A. Carbonneau, in Proc. Interspeech 2022. Daft-exprt: Cross-speaker prosody transfer on any text for expressive speech synthesis, vol. 2022 (2021), pp. 4591\u20134595. arXiv preprint arXiv:2108.02271","DOI":"10.21437\/Interspeech.2022-10761"},{"key":"329_CR119","doi-asserted-by":"publisher","unstructured":"V. Aggarwal, M. Cotescu, N. Prateek, J. Lorenzo-Trueba, R. Barra-Chicote, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 
Using vaes and normalizing flows for one-shot text-to-speech synthesis of expressive speech (IEEE, 2020), pp. 6179\u20136183. https:\/\/doi.org\/10.1109\/icassp40776.2020.9053678","DOI":"10.1109\/icassp40776.2020.9053678"},{"key":"329_CR120","doi-asserted-by":"publisher","unstructured":"L.W. Chen, A. Rudnicky, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Fine-grained style control in transformer-based text-to-speech synthesis (IEEE, 2022), pp. 7907\u20137911. https:\/\/doi.org\/10.1109\/icassp43922.2022.9747747","DOI":"10.1109\/icassp43922.2022.9747747"},{"key":"329_CR121","doi-asserted-by":"publisher","unstructured":"X. Wu, Y. Cao, M. Wang, S. Liu, S. Kang, Z. Wu, X. Liu, D. Su, D. Yu, H. Meng, in Interspeech 2018. Rapid style adaptation using residual error embedding for expressive speech synthesis. (2018), pp. 3072\u20133076. https:\/\/doi.org\/10.21437\/Interspeech.2018-1991","DOI":"10.21437\/Interspeech.2018-1991"},{"key":"329_CR122","doi-asserted-by":"publisher","unstructured":"G. Zhang, Y. Qin, T. Lee, in Interspeech 2020 Learning syllable-level discrete prosodic representation for expressive speech generation (2020), pp. 3426\u20133430. https:\/\/doi.org\/10.21437\/Interspeech.2020-2228","DOI":"10.21437\/Interspeech.2020-2228"},{"key":"329_CR123","doi-asserted-by":"publisher","unstructured":"G. Sun, Y. Zhang, R.J. Weiss, Y. Cao, H. Zen, Y. Wu, in ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis (IEEE, 2020), pp. 6264\u20136268. https:\/\/doi.org\/10.1109\/icassp40776.2020.9053520","DOI":"10.1109\/icassp40776.2020.9053520"},{"key":"329_CR124","doi-asserted-by":"crossref","unstructured":"A. Suni, S. Kakouros, M. Vainio, J. \u0160imko, in 10th International Conference on Speech Prosody 2020. Prosodic prominence and boundaries in sequence-to-sequence speech synthesis. 
(2020). pp. 940\u2013944. arXiv preprint arXiv:2006.15967","DOI":"10.21437\/SpeechProsody.2020-192"},{"key":"329_CR125","doi-asserted-by":"publisher","unstructured":"Y. Zou, S. Liu, X. Yin, H. Lin, C. Wang, H. Zhang, Z. Ma, in Proc. Interspeech 2021. Fine-Grained Prosody Modeling in Neural Speech Synthesis Using ToBI Representation (2021), pp. 3146\u20133150. https:\/\/doi.org\/10.21437\/Interspeech.2021-883","DOI":"10.21437\/Interspeech.2021-883"},{"key":"329_CR126","doi-asserted-by":"crossref","unstructured":"I. Vall\u00e9s-P\u00e9rez, J. Roth, G. Beringer, R. Barra-Chicote, J. Droppo, in Interspeech 2021. Improving multi-speaker tts prosody variance with a residual encoder and normalizing flows, vol. 2021 (2021), pp. 3131\u20133135. arXiv preprint arXiv:2106.05762","DOI":"10.21437\/Interspeech.2021-562"},{"key":"329_CR127","doi-asserted-by":"publisher","unstructured":"Z. Hodari, A. Moinet, S. Karlapati, J. Lorenzo-Trueba, T. Merritt, A. Joly, A. Abbas, P. Karanasou, T. Drugman, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Camp: a two-stage approach to modelling prosody in context (IEEE, 2021), pp. 6578\u20136582. https:\/\/doi.org\/10.1109\/icassp39728.2021.9414413","DOI":"10.1109\/icassp39728.2021.9414413"},{"key":"329_CR128","doi-asserted-by":"publisher","unstructured":"T. Raitio, J. Li, S. Seshadri, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Hierarchical prosody modeling and control in non-autoregressive parallel neural tts (IEEE, 2022), pp. 7587\u20137591. https:\/\/doi.org\/10.1109\/icassp43922.2022.9746253","DOI":"10.1109\/icassp43922.2022.9746253"},{"key":"329_CR129","doi-asserted-by":"publisher","unstructured":"S. Karlapati, A. Abbas, Z. Hodari, A. Moinet, A. Joly, P. Karanasou, T. Drugman, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 
Prosodic representation learning and contextual sampling for neural text-to-speech (IEEE, 2021), pp. 6573\u20136577. https:\/\/doi.org\/10.1109\/icassp39728.2021.9413696","DOI":"10.1109\/icassp39728.2021.9413696"},{"key":"329_CR130","doi-asserted-by":"crossref","unstructured":"S. Karlapati, A. Moinet, A. Joly, V. Klimkov, D. S\u00e1ez-Trigueros, T. Drugman, in Proc. Interspeech 2020. Copycat: Many-to-many fine-grained prosody transfer for neural text-to-speech, vol. 2020  (2020), pp. 4387\u20134391. arXiv preprint arXiv:2004.14617","DOI":"10.21437\/Interspeech.2020-1251"},{"key":"329_CR131","doi-asserted-by":"crossref","unstructured":"S. Tyagi, M. Nicolis, J. Rohnke, T. Drugman, J. Lorenzo-Trueba, in Proc. Annu. Conf. Int. Speech Commun. Assoc. 2019. Dynamic prosody generation for speech synthesis using linguistics-driven acoustic embedding selection. (2019). pp. 4407\u20134411. arXiv preprint arXiv:1912.00955","DOI":"10.21437\/Interspeech.2020-1411"},{"key":"329_CR132","doi-asserted-by":"crossref","unstructured":"Y. Yan, X. Tan, B. Li, G. Zhang, T. Qin, S. Zhao, Y. Shen, W.Q. Zhang, T.Y. Liu, in INTERSPEECH 2021. Adaspeech 3: Adaptive text to speech for spontaneous style, vol. 2021 (2021), pp. 1\u20135. arXiv preprint arXiv:2107.02530","DOI":"10.21437\/Interspeech.2021-584"},{"key":"329_CR133","doi-asserted-by":"publisher","unstructured":"X. An, Y. Wang, S. Yang, Z. Ma, L. Xie, in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Learning hierarchical representations for expressive speaking style in end-to-end speech synthesis (IEEE, 2019), pp. 184\u2013191. https:\/\/doi.org\/10.1109\/asru46091.2019.9003859","DOI":"10.1109\/asru46091.2019.9003859"},{"key":"329_CR134","doi-asserted-by":"publisher","unstructured":"Y. Feng, P. Duan, Y. Zi, Y. Chen, S. Xiong, in 2022 IEEE International Conference on Multimedia and Expo (ICME). Fusing acoustic and text emotional features for expressive speech synthesis (IEEE, 2022), pp. 01\u201306. 
https:\/\/doi.org\/10.1109\/icme52920.2022.9859769","DOI":"10.1109\/icme52920.2022.9859769"},{"key":"329_CR135","doi-asserted-by":"publisher","unstructured":"I. Jauk, J. Lorenzo Trueba, J. Yamagishi, A. Bonafonte C\u00e1vez, in Interspeech 2018: 2-6 September 2018, Hyderabad. Expressive speech synthesis using sentiment embeddings (International Speech Communication Association (ISCA), 2018), pp. 3062\u20133066. https:\/\/doi.org\/10.21437\/interspeech.2018-2467","DOI":"10.21437\/interspeech.2018-2467"},{"key":"329_CR136","doi-asserted-by":"publisher","unstructured":"J. Li, Y. Meng, C. Li, Z. Wu, H. Meng, C. Weng, D. Su, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Enhancing speaking styles in conversational text-to-speech synthesis with graph-based multi-modal context modeling (IEEE, 2022), pp. 7917\u20137921. https:\/\/doi.org\/10.1109\/icassp43922.2022.9747837","DOI":"10.1109\/icassp43922.2022.9747837"},{"key":"329_CR137","doi-asserted-by":"publisher","unstructured":"T.Y. Hu, A. Shrivastava, O. Tuzel, C. Dhir, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Unsupervised style and content separation by minimizing mutual information for speech synthesis (IEEE, 2020), pp. 3267\u20133271. https:\/\/doi.org\/10.1109\/icassp40776.2020.9054591","DOI":"10.1109\/icassp40776.2020.9054591"},{"key":"329_CR138","doi-asserted-by":"crossref","unstructured":"M. Morrison, Z. Jin, J. Salamon, N.J. Bryan, G.J. Mysore, in Proc. Interspeech 2020. Controllable neural prosody synthesis, vol. 2020 (2020), 4437\u20134441. arXiv preprint arXiv:2008.03388","DOI":"10.21437\/Interspeech.2020-2918"},{"key":"329_CR139","doi-asserted-by":"publisher","unstructured":"F. Eyben, F. Weninger, F. Gross, B. Schuller, in Proceedings of the 21st ACM international conference on Multimedia. Recent developments in opensmile, the munich open-source multimedia feature extractor (ACM, 2013), pp. 
835\u2013838. https:\/\/doi.org\/10.1145\/2502081.2502224","DOI":"10.1145\/2502081.2502224"},{"issue":"7","key":"329_CR140","doi-asserted-by":"publisher","first-page":"1877","DOI":"10.1587\/transinf.2015EDP7457","volume":"99","author":"M Morise","year":"2016","unstructured":"M. Morise, F. Yokomori, K. Ozawa, World: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. 99(7), 1877\u20131884 (2016)","journal-title":"IEICE Trans. Inf. Syst."},{"key":"329_CR141","doi-asserted-by":"publisher","unstructured":"E. Perez, F. Strub, H. De Vries, V. Dumoulin, A. Courville, in Proceedings of the AAAI Conference on Artificial Intelligence. Film: Visual reasoning with a general conditioning layer, AAAI, vol. 32 (2018). https:\/\/doi.org\/10.1609\/aaai.v32i1.11671","DOI":"10.1609\/aaai.v32i1.11671"},{"key":"329_CR142","unstructured":"A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process Syst. 33, 12449\u201312460 (2020)"},{"key":"329_CR143","unstructured":"J. Kim, J. Kong, J. Son, in International Conference on Machine Learning. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech (PMLR, 2021), pp. 5530\u20135540. https:\/\/proceedings.mlr.press\/v139\/kim21f.html"},{"key":"329_CR144","doi-asserted-by":"publisher","unstructured":"L.A. Gatys, A.S. Ecker, M. Bethge, in Proceedings of the IEEE conference on computer vision and pattern recognition. Image style transfer using convolutional neural networks (IEEE, 2016), pp. 2414\u20132423. https:\/\/doi.org\/10.1109\/CVPR.2016.265","DOI":"10.1109\/CVPR.2016.265"},{"key":"329_CR145","unstructured":"K. Simonyan, A. Zisserman, in ICLR 2015. Very deep convolutional networks for large-scale image recognition. (2015). arXiv preprint arXiv:1409.1556"},{"key":"329_CR146","unstructured":"D.P. Kingma, M. Welling, in ICLR 2014. Auto-encoding variational bayes. 
(2014). arXiv preprint arXiv:1312.6114"},{"key":"329_CR147","unstructured":"Y. Taigman, L. Wolf, A. Polyak, E. Nachmani, in ICLR 2018. Voiceloop: Voice fitting and synthesis via a phonological loop. (2018). arXiv preprint arXiv:1707.06588"},{"key":"329_CR148","unstructured":"J. Devlin, M.W. Chang, K. Lee, K. Toutanova, in Proceedings of NAACL 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. (2019). pp. 4171\u20134186. arXiv preprint arXiv:1810.04805"},{"key":"329_CR149","unstructured":"A. Van Den Oord, O. Vinyals et al., Neural discrete representation learning. Adv. Neural Inf. Process. Syst. 30, 6306\u20136315 (2017)"},{"key":"329_CR150","unstructured":"Z. Xiao, K. Kreis, A. Vahdat, in International Conference on Learning Representations 2022. Tackling the generative learning trilemma with denoising diffusion gans. (2022). arXiv preprint arXiv:2112.07804"},{"key":"329_CR151","unstructured":"A. D\u00e9fossez, J. Copet, G. Synnaeve, Y. Adi, High fidelity neural audio compression. (2022). arXiv preprint arXiv:2210.13438"},{"key":"329_CR152","doi-asserted-by":"publisher","unstructured":"J. Hu, L. Shen, G. Sun, in Proceedings of the IEEE conference on computer vision and pattern recognition. Squeeze-and-excitation networks (IEEE, 2018), pp. 7132\u20137141.  https:\/\/doi.org\/10.1109\/cvpr.2018.00745","DOI":"10.1109\/cvpr.2018.00745"},{"key":"329_CR153","unstructured":"K. Qian, Y. Zhang, S. Chang, X. Yang, M. Hasegawa-Johnson, in International Conference on Machine Learning. Autovc: Zero-shot voice style transfer with only autoencoder loss (PMLR, 2019), pp. 5210\u20135219. https:\/\/proceedings.mlr.press\/v97\/qian19c.html"},{"key":"329_CR154","unstructured":"A.A. Alemi, I. Fischer, J.V. Dillon, K. Murphy, in Proc. Int. Conf. Learn. Representations 2017. Deep variational information bottleneck. (2017). arXiv preprint arXiv:1612.00410"},{"key":"329_CR155","unstructured":"S. Ioffe, C. 
Szegedy, in International conference on machine learning. Batch normalization: Accelerating deep network training by reducing internal covariate shift (PMLR, 2015), pp. 448\u2013456. https:\/\/proceedings.mlr.press\/v37\/ioffe15.html"},{"key":"329_CR156","doi-asserted-by":"publisher","unstructured":"D. Ulyanov, A. Vedaldi, V. Lempitsky, in Proceedings of the IEEE conference on computer vision and pattern recognition. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis (IEEE, 2017), pp. 6924\u20136932. https:\/\/doi.org\/10.1109\/cvpr.2017.437","DOI":"10.1109\/cvpr.2017.437"},{"key":"329_CR157","unstructured":"M.I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, D. Hjelm, in International conference on machine learning. Mutual information neural estimation (PMLR, 2018), pp. 531\u2013540. https:\/\/proceedings.mlr.press\/v80\/belghazi18a.html"},{"key":"329_CR158","unstructured":"P. Cheng, W. Hao, S. Dai, J. Liu, Z. Gan, L. Carin, in International conference on machine learning. Club: A contrastive log-ratio upper bound of mutual information (PMLR, 2020), pp. 1779\u20131788. https:\/\/proceedings.mlr.press\/v119\/cheng20b.html"},{"key":"329_CR159","doi-asserted-by":"publisher","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","volume":"29","author":"WN Hsu","year":"2021","unstructured":"W.N. Hsu, B. Bolte, Y.H.H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE\/ACM Trans. Audio Speech Lang. Process. 29, 3451\u20133460 (2021)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"329_CR160","unstructured":"Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R.R. Salakhutdinov, Q.V. Le, Xlnet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 32, 5753\u20135763 (2019)"},{"key":"329_CR161","unstructured":"C.M. 
Bishop, Technical report, Aston University 1994. Mixture density networks (1994)"},{"key":"329_CR162","unstructured":"Y. Shen, Z. Lin, C.W. Huang, A. Courville, in Proceedings of ICLR 2018. Neural language modeling by jointly learning syntax and lexicon. (2018). arXiv preprint arXiv:1711.02013"},{"key":"329_CR163","doi-asserted-by":"publisher","first-page":"114135","DOI":"10.1016\/j.psychres.2021.114135","volume":"304","author":"J Sarzynska-Wawer","year":"2021","unstructured":"J. Sarzynska-Wawer, A. Wawer, A. Pawlak, J. Szymanowska, I. Stefaniak, M. Jarkiewicz, L. Okruszek, Detecting formal thought disorder by deep contextualized word representations. Psychiatry Res. 304, 114135 (2021)","journal-title":"Psychiatry Res."},{"key":"329_CR164","unstructured":"K. Clark, M.T. Luong, Q.V. Le, C.D. Manning, in ICLR 2020. Electra: Pre-training text encoders as discriminators rather than generators. (2020). arXiv preprint arXiv:2003.10555"},{"key":"329_CR165","unstructured":"C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, Z. Zhu, Deep speaker: an end-to-end neural speaker embedding system. (2017).\u00a0arXiv\u00a0preprint\u00a0arXiv:1705.02304"},{"key":"329_CR166","doi-asserted-by":"publisher","unstructured":"M. Azab, N. Kojima, J. Deng, R. Mihalcea, in Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). Representing movie characters in dialogues. Association for Computational Linguistics, Hong Kong, China. (2019), pp. 99\u2013109. https:\/\/doi.org\/10.18653\/v1\/K19-1010","DOI":"10.18653\/v1\/K19-1010"},{"key":"329_CR167","unstructured":"Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach. (2019).\u00a0arXiv\u00a0preprint\u00a0arXiv:1907.11692"},{"key":"329_CR168","doi-asserted-by":"publisher","first-page":"123","DOI":"10.1016\/j.csl.2016.11.001","volume":"45","author":"A Suni","year":"2017","unstructured":"A. 
Suni, J. \u0160imko, D. Aalto, M. Vainio, Hierarchical representation and estimation of prosody using continuous wavelet transform. Comput. Speech Lang. 45, 123\u2013136 (2017)","journal-title":"Comput. Speech Lang."},{"key":"329_CR169","unstructured":"J.M. Tomczak, M. Welling, in NIPS Workshop: Bayesian Deep Learning 2016. Improving variational auto-encoders using householder flow. (2016). arXiv preprint arXiv:1611.09630"},{"key":"329_CR170","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.specom.2021.11.006","volume":"137","author":"K Zhou","year":"2022","unstructured":"K. Zhou, B. Sisman, R. Liu, H. Li, Emotional voice conversion: Theory, databases and esd. Speech Commun. 137, 1\u201318 (2022)","journal-title":"Speech Commun."},{"key":"329_CR171","unstructured":"cstr. The blizzard challenge. https:\/\/www.cstr.ed.ac.uk\/projects\/blizzard\/. Accessed 15 Sept 2023"},{"key":"329_CR172","unstructured":"K. Ito, L. Johnson, The lj speech dataset. (2017). https:\/\/keithito.com\/LJ-Speech-Dataset\/. Accessed 15 Sept 2023"},{"key":"329_CR173","unstructured":"cstr. Voice cloning toolkit. https:\/\/datashare.ed.ac.uk\/handle\/10283\/3443. Accessed 15 Sept 2023"},{"key":"329_CR174","doi-asserted-by":"crossref","unstructured":"H. Zen, R. Clark, R.J. Weiss, V. Dang, Y. Jia, Y. Wu, Y. Zhang, Z. Chen, in Interspeech. Libritts: A corpus derived from librispeech for text-to-speech (2019). https:\/\/arxiv.org\/abs\/1904.02882. Accessed 15 Sept 2023","DOI":"10.21437\/Interspeech.2019-2441"},{"key":"329_CR175","unstructured":"Wikipedia. Emotion classification - Wikipedia \u2014 en.wikipedia.org. https:\/\/en.wikipedia.org\/wiki\/Emotion_classification. Accessed 30 May 2023"},{"issue":"2","key":"329_CR176","doi-asserted-by":"publisher","first-page":"379","DOI":"10.1037\/0278-7393.18.2.379","volume":"18","author":"MM Bradley","year":"1992","unstructured":"M.M. Bradley, M.K. Greenwald, M.C. Petry, P.J. Lang, Remembering pictures: pleasure and arousal in memory. J. 
Exp. Psychol. Learn. Mem. Cogn. 18(2), 379 (1992)","journal-title":"J. Exp. Psychol. Learn. Mem. Cogn."},{"issue":"6","key":"329_CR177","doi-asserted-by":"publisher","first-page":"1161","DOI":"10.1037\/h0077714","volume":"39","author":"JA Russell","year":"1980","unstructured":"J.A. Russell, A circumplex model of affect. J. Pers. Soc. Psychol. 39(6), 1161 (1980)","journal-title":"J. Pers. Soc. Psychol."},{"key":"329_CR178","unstructured":"P. Ekman, E. Revealed, Emotions revealed: Recognizing faces and feelings to improve communication and emotional life. (Holt Paperback, 2003), vol. 128, no. 8, pp. 140\u2013140"}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-024-00329-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-024-00329-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-024-00329-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,12]],"date-time":"2024-02-12T14:08:11Z","timestamp":1707746891000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-024-00329-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,12]]},"references-count":178,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["329"],"URL":"https:\/\/doi.org\/10.1186\/s13636-024-00329-7","relation":{},"ISSN":["1687-4722"],"issn-type":[{"value":"1687-4722","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,12]]},"assertion":[{"value":"9 June 
2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 January 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 February 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"11"}}