{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,7]],"date-time":"2025-10-07T00:30:29Z","timestamp":1759797029749,"version":"build-2065373602"},"publisher-location":"New York, NY, USA","reference-count":15,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,9,28]]},"DOI":"10.1145\/3746058.3759013","type":"proceedings-article","created":{"date-parts":[[2025,9,27]],"date-time":"2025-09-27T14:33:09Z","timestamp":1758983589000},"page":"1-4","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Text-Driven Automated Multi-Character Voice Generation for Narrative Content"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-9591-0670","authenticated-orcid":false,"given":"Qirui","family":"Sun","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-5748-2904","authenticated-orcid":false,"given":"Yunyi","family":"Ni","sequence":"additional","affiliation":[{"name":"Nanyang Technological University, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-2162-1749","authenticated-orcid":false,"given":"Yuan","family":"Chai","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0560-4228","authenticated-orcid":false,"given":"Haipeng","family":"Mi","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,9,27]]},"reference":[{"key":"e_1_3_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3544549.3585892"},{"key":"e_1_3_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3501856"},{"key":"e_1_3_3_2_4_2","unstructured":"Edresson Casanova Julian Weber Christopher Shulby Arnaldo\u00a0Candido Junior Eren G\u00f6lge and Moacir\u00a0Antonelli Ponti. 2023. YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone. arXiv:https:\/\/arXiv.org\/abs\/2112.02418\u00a0[cs.SD] https:\/\/arxiv.org\/abs\/2112.02418"},{"key":"e_1_3_3_2_5_2","unstructured":"Zhihao Du Qian Chen Shiliang Zhang Kai Hu Heng Lu Yexin Yang Hangrui Hu Siqi Zheng Yue Gu Ziyang Ma Zhifu Gao and Zhijie Yan. 2024. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens. arXiv:https:\/\/arXiv.org\/abs\/2407.05407\u00a0[cs.SD] https:\/\/arxiv.org\/abs\/2407.05407"},{"key":"e_1_3_3_2_6_2","unstructured":"Jerry\u00a0Bryan Fuller Tim Barnett Kim Hester Clint Relyea and Len Frey. 2007. An exploratory examination of voice behavior from an impression management perspective. Journal of Managerial Issues (2007) 134\u2013151."},{"key":"e_1_3_3_2_7_2","volume-title":"Audiobooks Market Size, Share & Trends Analysis Report By Genre (Fiction & Non-Fiction), By Preferred Device, By Distribution Channel, By Target Audience (Kids Mode, Adult), By Region, And Segment Forecasts, 2024 - 2030","author":"Research Grand View","year":"2024","unstructured":"Grand View Research. 2024. Audiobooks Market Size, Share & Trends Analysis Report By Genre (Fiction & Non-Fiction), By Preferred Device, By Distribution Channel, By Target Audience (Kids Mode, Adult), By Region, And Segment Forecasts, 2024 - 2030. Technical Report. Grand View Research. https:\/\/www.grandviewresearch.com\/industry-analysis\/audiobooks-market"},{"key":"e_1_3_3_2_8_2","doi-asserted-by":"publisher","unstructured":"Wei-Ning Hsu Benjamin Bolte Yao-Hung\u00a0Hubert Tsai Kushal Lakhotia Ruslan Salakhutdinov and Abdelrahman Mohamed. 2021. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE\/ACM Trans. Audio Speech and Lang. Proc. 29 (Oct. 2021) 3451\u20133460. 10.1109\/TASLP.2021.3122291","DOI":"10.1109\/TASLP.2021.3122291"},{"key":"e_1_3_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3491102.3501848"},{"key":"e_1_3_3_2_10_2","doi-asserted-by":"crossref","unstructured":"Erika Sakai Takayuki Itoh and Akinori Ito. 2017. A Study on Voice Actor Recommendation for Game Characters Based on Acoustic Feature Estimation and Document Co-occurrence. 2017 Nicograph International (NicoInt) (2017) 15\u201318. https:\/\/api.semanticscholar.org\/CorpusID:21086827","DOI":"10.1109\/NICOInt.2017.31"},{"key":"e_1_3_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-84800-306-47"},{"key":"e_1_3_3_2_12_2","doi-asserted-by":"publisher","unstructured":"Aatto Sonninen and Pertti Hurme. 1992. On the terminology of voice research. Journal of Voice 6 2 (1992) 188\u2013193. 10.1016\/S0892-1997(05)80132-8","DOI":"10.1016\/S0892-1997(05)80132-8"},{"volume-title":"With Audiobooks Launching in the U.S. Today, Spotify is the Home for All the Audio You Love","year":"2022","key":"e_1_3_3_2_13_2","unstructured":"Spotify. 2022. With Audiobooks Launching in the U.S. Today, Spotify is the Home for All the Audio You Love. https:\/\/newsroom.spotify.com\/2022-09-20\/with-audiobooks-launching-in-the-u-s-today-spotify-is-the-home-for-all-the-audio-you-love\/"},{"key":"e_1_3_3_2_14_2","unstructured":"Ilya Sutskever Oriol Vinyals and Quoc\u00a0V. Le. 2014. Sequence to Sequence Learning with Neural Networks. arXiv:https:\/\/arXiv.org\/abs\/1409.3215\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/1409.3215"},{"key":"e_1_3_3_2_15_2","unstructured":"Chengyi Wang Sanyuan Chen Yu Wu Ziqiang Zhang Long Zhou Shujie Liu Zhuo Chen Yanqing Liu Huaming Wang Jinyu Li Lei He Sheng Zhao and Furu Wei. 2023. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv:https:\/\/arXiv.org\/abs\/2301.02111\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/2301.02111"},{"key":"e_1_3_3_2_16_2","doi-asserted-by":"crossref","unstructured":"Yuxuan Wang RJ Skerry-Ryan Daisy Stanton Yonghui Wu Ron\u00a0J. Weiss Navdeep Jaitly Zongheng Yang Ying Xiao Zhifeng Chen Samy Bengio Quoc Le Yannis Agiomyrgiannakis Rob Clark and Rif\u00a0A. Saurous. 2017. Tacotron: Towards End-to-End Speech Synthesis. arXiv:https:\/\/arXiv.org\/abs\/1703.10135\u00a0[cs.CL] https:\/\/arxiv.org\/abs\/1703.10135","DOI":"10.21437\/Interspeech.2017-1452"}],"event":{"name":"UIST '25: The 38th Annual ACM Symposium on User Interface Software and Technology","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction","SIGGRAPH ACM Special Interest Group on Computer Graphics and Interactive Techniques"],"location":"Busan Republic of Korea","acronym":"UIST Adjunct '25"},"container-title":["Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746058.3759013","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,6]],"date-time":"2025-10-06T10:06:46Z","timestamp":1759745206000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746058.3759013"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,27]]},"references-count":15,"alternative-id":["10.1145\/3746058.3759013","10.1145\/3746058"],"URL":"https:\/\/doi.org\/10.1145\/3746058.3759013","relation":{},"subject":[],"published":{"date-parts":[[2025,9,27]]},"assertion":[{"value":"2025-09-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}