{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T05:39:56Z","timestamp":1774589996767,"version":"3.50.1"},"reference-count":26,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2021,11,25]],"date-time":"2021-11-25T00:00:00Z","timestamp":1637798400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100002661","name":"Fonds De La Recherche Scientifique - FNRS","doi-asserted-by":"publisher","award":["5207920F"],"award-info":[{"award-number":["5207920F"]}],"id":[{"id":"10.13039\/501100002661","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Informatics"],"abstract":"<jats:p>In this paper, we study the controllability of an Expressive TTS system trained on a dataset for a continuous control. The dataset is the Blizzard 2013 dataset based on audiobooks read by a female speaker containing a great variability in styles and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The objective assessment is based on a measure of correlation between acoustic features and the dimensions of the latent space representing expressiveness. The subjective assessment is based on a perceptual experiment in which users are shown an interface for Controllable Expressive TTS and asked to retrieve a synthetic utterance whose expressiveness subjectively corresponds to that a reference utterance.<\/jats:p>","DOI":"10.3390\/informatics8040084","type":"journal-article","created":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T05:02:36Z","timestamp":1638334956000},"page":"84","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Analysis and Assessment of Controllability of an Expressive Deep Learning-Based TTS System"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1971-4412","authenticated-orcid":false,"given":"No\u00e9","family":"Tits","sequence":"first","affiliation":[{"name":"Flowchase SRL, 1348 Ottignies-Louvain-la-Neuve, Belgium"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1465-6273","authenticated-orcid":false,"given":"Kevin","family":"El Haddad","sequence":"additional","affiliation":[{"name":"TCTS Lab, University of Mons, Place du Parc 20, 7000 Mons, Belgium"}]},{"given":"Thierry","family":"Dutoit","sequence":"additional","affiliation":[{"name":"TCTS Lab, University of Mons, Place du Parc 20, 7000 Mons, Belgium"}]}],"member":"1968","published-online":{"date-parts":[[2021,11,25]]},"reference":[{"key":"ref_1","unstructured":"Burkhardt, F., and Campbell, N. (2014). Emotional speech synthesis. The Oxford Handbook of Affective Computing, Oxford University Press."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Tits, N. (2019, January 3\u20136). A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech\u2014A Deep Learning approach. 
Proceedings of the 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), Cambridge, UK.","DOI":"10.1109\/ACIIW.2019.8925241"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"100055","DOI":"10.1016\/j.simpa.2021.100055","article-title":"ICE-Talk 2: Interface for Controllable Expressive TTS with perceptual assessment tool","volume":"8","author":"Tits","year":"2021","journal-title":"Softw. Impacts"},{"key":"ref_4","unstructured":"Ito, K. (2021, August 29). The LJ Speech Dataset. Available online: https:\/\/keithito.com\/LJ-Speech-Dataset\/."},{"key":"ref_5","unstructured":"Tits, N., El Haddad, K., and Dutoit, T. (2019). The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach. Human-Computer Interaction, IntechOpen."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Watts, O., Henter, G.E., Merritt, T., Wu, Z., and King, S. (2016, January 20\u201325). From HMMs to DNNs: Where do the improvements come from?. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472730"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1039","DOI":"10.1016\/j.specom.2009.04.004","article-title":"Statistical parametric speech synthesis","volume":"51","author":"Zen","year":"2009","journal-title":"Speech Commun."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Zen, H., Senior, A., and Schuster, M. (2013, January 26\u201331). Statistical parametric speech synthesis using deep neural networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6639215"},{"key":"ref_9","unstructured":"van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.W., and Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., and Bengio, S. (2017, January 20\u201324). Tacotron: Towards End-to-End Speech Synthesis. Proceedings of the Interspeech 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-1452"},{"key":"ref_11","unstructured":"Skerry-Ryan, R., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R., Clark, R., and Saurous, R.A. (2018, January 10\u201315). Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. Proceedings of the International Conference on Machine Learning 2018, Stockholm, Sweden."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Klimkov, V., Ronanki, S., Rohnke, J., and Drugman, T. (2019, January 15\u201319). Fine-Grained Robust Prosody Transfer for Single-Speaker Neural Text-To-Speech. Proceedings of the Interspeech 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-2571"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Karlapati, S., Moinet, A., Joly, A., Klimkov, V., S\u00e1ez-Trigueros, D., and Drugman, T. (2020, January 25\u201329). CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech. Proceedings of the Interspeech 2020, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1251"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Akuzawa, K., Iwasawa, Y., and Matsuo, Y. (2018, January 2\u20136). 
{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Akuzawa, K., Iwasawa, Y., and Matsuo, Y. (2018, January 2\u20136). Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1113"},
{"key":"ref_15","unstructured":"Taigman, Y., Wolf, L., Polyak, A., and Nachmani, E. (2017). Voiceloop: Voice fitting and synthesis via a phonological loop. arXiv."},
{"key":"ref_16","unstructured":"Hsu, W.N., Zhang, Y., Weiss, R.J., Zen, H., Wu, Y., Wang, Y., Cao, Y., Jia, Y., Chen, Z., and Shen, J. (2018). Hierarchical Generative Modeling for Controllable Speech Synthesis. arXiv."},
{"key":"ref_17","unstructured":"Henter, G.E., Lorenzo-Trueba, J., Wang, X., and Yamagishi, J. (2018). Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis. arXiv."},
{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wang, Y., Stanton, D., Zhang, Y., Ryan, R.S., Battenberg, E., Shor, J., Xiao, Y., Jia, Y., Ren, F., and Saurous, R.A. (2018, January 10\u201315). Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. Proceedings of the International Conference on Machine Learning 2018, Stockholm, Sweden.","DOI":"10.1109\/SLT.2018.8639682"},
{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Shechtman, S., and Sorin, A. (2019). Sequence to sequence neural speech synthesis with prosody modification capabilities. arXiv.","DOI":"10.21437\/SSW.2019-49"},
{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Raitio, T., Rasipuram, R., and Castellani, D. (2020). Controllable neural text-to-speech synthesis using intuitive prosodic features. arXiv.","DOI":"10.21437\/Interspeech.2020-2861"},
{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Tits, N., El Haddad, K., and Dutoit, T. (2020). Neural Speech Synthesis with Style Intensity Interpolation: A Perceptual Analysis. Companion of the 2020 ACM\/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery.","DOI":"10.1145\/3371382.3378297"},
{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Tachibana, H., Uenoyama, K., and Aihara, S. (2018, January 15\u201320). Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461829"},
{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Tits, N., Wang, F., Haddad, K.E., Pagel, V., and Dutoit, T. (2019, January 15\u201319). Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis through Audio Analysis. Proceedings of the Interspeech 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-1426"},
{"key":"ref_24","doi-asserted-by":"crossref","first-page":"190","DOI":"10.1109\/TAFFC.2015.2457417","article-title":"The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing","volume":"7","author":"Eyben","year":"2016","journal-title":"IEEE Trans. Affect. Comput."},
{"key":"ref_25","unstructured":"Kubichek, R. (1993, January 19\u201321). Mel-cepstral distance measure for objective speech quality assessment. Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, Victoria, BC, Canada."},
{"key":"ref_26","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1016\/j.specom.2007.09.003","article-title":"A method for fundamental frequency estimation and voicing decision: Application to infant utterances recorded in real acoustical environments","volume":"50","author":"Nakatani","year":"2008","journal-title":"Speech Commun."}],
"container-title":["Informatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2227-9709\/8\/4\/84\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T07:35:29Z","timestamp":1760168129000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2227-9709\/8\/4\/84"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,11,25]]},"references-count":26,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2021,12]]}},"alternative-id":["informatics8040084"],"URL":"https:\/\/doi.org\/10.3390\/informatics8040084","relation":{},"ISSN":["2227-9709"],"issn-type":[{"value":"2227-9709","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,11,25]]}}}
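Note: the objective assessment described in the abstract, correlating acoustic features with the dimensions of the latent expressiveness space, can be illustrated with a minimal Python sketch. Everything below (the function name latent_acoustic_correlations, the array shapes, the choice of Pearson correlation via np.corrcoef, and the toy data) is an illustrative assumption, not the authors' exact procedure.

import numpy as np

def latent_acoustic_correlations(latents, features):
    """Pearson correlation of every latent dimension with every acoustic feature.

    latents  -- (n_utterances, n_latent_dims) latent codes, one per utterance
    features -- (n_utterances, n_features) acoustic measurements per utterance
                (e.g., GeMAPS-style F0 or energy statistics)
    Returns an (n_latent_dims, n_features) matrix of correlation coefficients.
    """
    d = latents.shape[1]
    # np.corrcoef treats rows as variables, so the off-diagonal block of the
    # joint correlation matrix is the latent-vs-feature cross-correlation.
    joint = np.corrcoef(latents.T, features.T)
    return joint[:d, d:]

# Toy usage: random data standing in for real utterance-level measurements.
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 2))              # 2-D latent space, 100 utterances
feats = np.c_[5.0 * z[:, 0] + 120.0,       # a feature that tracks dimension 0
              rng.normal(size=100)]        # an uncorrelated feature
print(latent_acoustic_correlations(z, feats))

On the toy data, the entry for latent dimension 0 against the first feature comes out at 1.0 while the others stay near zero; a strong, interpretable latent-to-feature correlation of this kind is what the objective assessment measures.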