{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,6]],"date-time":"2026-02-06T00:34:15Z","timestamp":1770338055396,"version":"3.49.0"},"publisher-location":"California","reference-count":0,"publisher":"International Joint Conferences on Artificial Intelligence Organization","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:p>While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. 
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.<\/jats:p>","DOI":"10.24963\/ijcai.2023\/575","type":"proceedings-article","created":{"date-parts":[[2023,8,11]],"date-time":"2023-08-11T04:31:30Z","timestamp":1691728290000},"page":"5179-5187","source":"Crossref","is-referenced-by-count":14,"title":["Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining"],"prefix":"10.24963","author":[{"given":"Takaaki","family":"Saeki","sequence":"first","affiliation":[{"name":"The University of Tokyo"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Soumi","family":"Maiti","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xinjian","family":"Li","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shinji","family":"Watanabe","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shinnosuke","family":"Takamichi","sequence":"additional","affiliation":[{"name":"The University of Tokyo"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hiroshi","family":"Saruwatari","sequence":"additional","affiliation":[{"name":"The University of Tokyo"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"10584","event":{"name":"Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}","theme":"Artificial Intelligence","location":"Macau, SAR China","acronym":"IJCAI-2023","number":"32","sponsor":["International Joint Conferences on Artificial Intelligence Organization (IJCAI)"],"start":{"date-parts":[[2023,8,19]]},"end":{"date-parts":[[2023,8,25]]}},"container-title":["Proceedings of the Thirty-Second International Joint Conference on Artificial 
Intelligence"],"original-title":[],"deposited":{"date-parts":[[2023,8,11]],"date-time":"2023-08-11T04:51:51Z","timestamp":1691729511000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.ijcai.org\/proceedings\/2023\/575"}},"subtitle":[],"proceedings-subject":"Artificial Intelligence Research Articles","short-title":[],"issued":{"date-parts":[[2023,8]]},"references-count":0,"URL":"https:\/\/doi.org\/10.24963\/ijcai.2023\/575","relation":{},"subject":[],"published":{"date-parts":[[2023,8]]}}}