{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T14:24:06Z","timestamp":1762957446426,"version":"3.37.3"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,5,28]],"date-time":"2024-05-28T00:00:00Z","timestamp":1716854400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,5,28]],"date-time":"2024-05-28T00:00:00Z","timestamp":1716854400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["Grants 61871262, 61901251, and 62071284"],"award-info":[{"award-number":["Grants 61871262, 61901251, and 62071284"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003399","name":"Science and Technology Commission of Shanghai Municipality","doi-asserted-by":"publisher","award":["Grants 21ZR1422400, 20JC1416400 and 20511106603"],"award-info":[{"award-number":["Grants 21ZR1422400, 20JC1416400 and 20511106603"]}],"id":[{"id":"10.13039\/501100003399","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100015792","name":"Government of Pudong New Area","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100015792","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100015956","name":"Special Project for Research and Development in Key areas of Guangdong Province","doi-asserted-by":"publisher","award":["Grant 2020B0101130012"],"award-info":[{"award-number":["Grant 2020B0101130012"]}],"id":[{"id":"10.13039\/501100015956","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100011478","name":"Foshan Science and Technology Bureau","doi-asserted-by":"publisher","award":["Grant FS0AA-KJ919- 4402-0060"],"award-info":[{"award-number":["Grant FS0AA-KJ919- 4402-0060"]}],"id":[{"id":"10.13039\/501100011478","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In the era of advanced text-to-speech (TTS) systems capable of generating high-fidelity, human-like speech by referring a reference speech, voice cloning (VC), or zero-shot TTS (ZS-TTS), stands out as an important subtask. A primary challenge in VC is maintaining speech quality and speaker similarity with limited reference data for a specific speaker. However, existing VC systems often rely on naive combinations of embedded speaker vectors for speaker control, which compromises the capture of speaking style, voice print, and semantic accuracy. To overcome this, we introduce the Two-branch Speaker Control Module (TSCM), a novel and highly adaptable voice cloning module designed to precisely processing speaker or style control for a target speaker. Our method uses an advanced fusion of local-level features from a Gated Convolutional Network (GCN) and utterance-level features from a gated recurrent unit (GRU) to enhance speaker control. We demonstrate the effectiveness of TSCM by integrating it into advanced TTS systems like FastSpeech 2 and VITS architectures, significantly optimizing their performance. 
Experimental results show that TSCM enables accurate voice cloning for a target speaker with minimal data through either zero-shot inference or few-shot fine-tuning of pretrained TTS models. Furthermore, our TSCM-based VITS (TSCM-VITS) showcases superior performance in zero-shot scenarios compared to existing state-of-the-art VC systems, even with basic dataset configurations. Our method\u2019s superiority is validated through comprehensive subjective and objective evaluations. A demonstration of our system is available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/great-research.github.io\/tsct-tts-demo\/\">https:\/\/great-research.github.io\/tsct-tts-demo\/<\/jats:ext-link>, providing practical insights into its application and effectiveness.<\/jats:p>","DOI":"10.1186\/s13636-024-00351-9","type":"journal-article","created":{"date-parts":[[2024,5,28]],"date-time":"2024-05-28T06:02:03Z","timestamp":1716876123000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis"],"prefix":"10.1186","volume":"2024","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9629-6111","authenticated-orcid":false,"given":"Zhiyong","family":"Chen","sequence":"first","affiliation":[]},{"given":"Zhiqi","family":"Ai","sequence":"additional","affiliation":[]},{"given":"Youxuan","family":"Ma","sequence":"additional","affiliation":[]},{"given":"Xinnuo","family":"Li","sequence":"additional","affiliation":[]},{"given":"Shugong","family":"Xu","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,5,28]]},"reference":[{"key":"351_CR1","doi-asserted-by":"crossref","unstructured":"Q. Xie, X. Tian, G. Liu, K. Song, L. Xie, Z. Wu, H. Li, S. Shi, H. Li, F. Hong, H. Bu, X. Xu, in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). The multi-speaker multi-style voice cloning challenge 2021 (IEEE, 2021), p. 8613\u20138617","DOI":"10.1109\/ICASSP39728.2021.9414001"},{"key":"351_CR2","unstructured":"X. Tan, T. Qin, F. Soong, T.Y. Liu, A survey on neural speech synthesis. (2021).\u00a0arXiv\u00a0preprint\u00a0arXiv:2106.15561"},{"key":"351_CR3","unstructured":"S. Arik, J. Chen, K. Peng, W. Ping, Y. Zhou, Neural voice cloning with a few samples. Adv. Neural Inf. Process. Syst. 31 (2018)"},{"key":"351_CR4","doi-asserted-by":"publisher","first-page":"2502","DOI":"10.1109\/LSP.2022.3226655","volume":"29","author":"BJ Choi","year":"2022","unstructured":"B.J. Choi, M. Jeong, J.Y. Lee, N.S. Kim, Snac: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech. IEEE Signal Process. Lett. 29, 2502\u20132506 (2022)","journal-title":"IEEE Signal Process. Lett."},{"key":"351_CR5","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1109\/LSP.2021.3125259","volume":"29","author":"SJ Cheon","year":"2022","unstructured":"S.J. Cheon, B.J. Choi, M. Kim, H. Lee, N.S. Kim, A controllable multi-lingual multi-speaker multi-style text-to-speech synthesis with multivariate information minimization. IEEE Signal Process. Lett. 29, 55\u201359 (2022)","journal-title":"IEEE Signal Process. Lett."},{"key":"351_CR6","unstructured":"Z. Qin, W. Zhao, X. Yu, X. Sun, OpenVoice: versatile instant voice cloning. (2023).\u00a0arXiv\u00a0preprint\u00a0arXiv:2312.01479"},{"key":"351_CR7","unstructured":"W. 
Ping, K. Peng, A. Gibiansky, S.\u00d6. Arik, A. Kannan, S. Narang, J. Raiman, J. Miller, in ICLR. Deep Voice 3: scaling text-to-speech with convolutional sequence learning (ICLR, 2018)"},{"key":"351_CR8","unstructured":"Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. Lopez Moreno, Y. Wu, et al., Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Adv. Neural Inf. Process. Syst. 31 (2018)"},{"key":"351_CR9","doi-asserted-by":"crossref","unstructured":"J. Shen, R. Pang, R.J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, et al., in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions (IEEE, 2018), p. 4779\u20134783","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"351_CR10","unstructured":"Y. Wang, D. Stanton, Y. Zhang, R.S. Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, R.A. Saurous, in International Conference on Machine Learning. Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis (PMLR, 2018), p. 5180\u20135189"},{"key":"351_CR11","doi-asserted-by":"crossref","unstructured":"E. Cooper, C.I. Lai, Y. Yasuda, F. Fang, X. Wang, N. Chen, J. Yamagishi, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings (IEEE, 2020), p. 6184\u20136188","DOI":"10.1109\/ICASSP40776.2020.9054535"},{"key":"351_CR12","doi-asserted-by":"crossref","unstructured":"Y. Wu, X. Tan, B. Li, L. He, S. Zhao, R. Song, T. Qin, T.Y. Liu, in Proc. Interspeech 2022. AdaSpeech 4: adaptive text to speech in zero-shot scenarios (ISCA, 2022), p. 2568\u20132572","DOI":"10.21437\/Interspeech.2022-901"},{"key":"351_CR13","unstructured":"M. Chen, X. Tan, B. Li, Y. Liu, T. Qin, T.Y. Liu, et al., in International Conference on Learning Representations. AdaSpeech: adaptive text to speech for custom voice (ICLR, 2020)"},{"key":"351_CR14","unstructured":"D. Min, D.B. Lee, E. Yang, S.J. Hwang, in International Conference on Machine Learning. Meta-StyleSpeech: multi-speaker adaptive text-to-speech generation (PMLR, 2021), p. 7748\u20137759"},{"key":"351_CR15","unstructured":"J. Kim, J. Kong, J. Son, in International Conference on Machine Learning. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech (PMLR, 2021), p. 5530\u20135540"},{"key":"351_CR16","doi-asserted-by":"crossref","unstructured":"J. Kong, J. Park, B. Kim, J. Kim, D. Kong, S. Kim, VITS2: improving quality and efficiency of single-stage text-to-speech with adversarial learning and architecture design.\u00a0(2023).\u00a0arXiv\u00a0preprint\u00a0arXiv:2307.16430","DOI":"10.21437\/Interspeech.2023-534"},{"key":"351_CR17","unstructured":"E. Casanova, J. Weber, C.D. Shulby, A.C. Junior, E. G\u00f6lge, M.A. Ponti, in International Conference on Machine Learning. Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone (PMLR, 2022), p. 2709\u20132720"},{"key":"351_CR18","doi-asserted-by":"crossref","unstructured":"G. Liu, Y. Zhang, Y. Lei, Y. Chen, R. Wang, Z. Li, L. 
Xie, PromptStyle: controllable style transfer for text-to-speech with natural language descriptions.\u00a0(2023).\u00a0arXiv\u00a0preprint\u00a0arXiv:2305.19522","DOI":"10.21437\/Interspeech.2023-1779"},{"key":"351_CR19","doi-asserted-by":"crossref","unstructured":"Z. Guo, Y. Leng, Y. Wu, S. Zhao, X. Tan, in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). PromptTTS: controllable text-to-speech with text descriptions (IEEE, 2023), pp. 1\u20135","DOI":"10.1109\/ICASSP49357.2023.10096285"},{"issue":"4","key":"351_CR20","doi-asserted-by":"publisher","first-page":"452","DOI":"10.1109\/PROC.1976.10154","volume":"64","author":"CH Coker","year":"1976","unstructured":"C.H. Coker, A model of articulatory dynamics and control. Proc. IEEE 64(4), 452\u2013460 (1976)","journal-title":"Proc. IEEE"},{"issue":"3","key":"351_CR21","doi-asserted-by":"publisher","first-page":"971","DOI":"10.1121\/1.383940","volume":"67","author":"DH Klatt","year":"1980","unstructured":"D.H. Klatt, Software for a cascade\/parallel formant synthesizer. J. Acoust. Soc. Am. 67(3), 971\u2013995 (1980)","journal-title":"J. Acoust. Soc. Am."},{"key":"351_CR22","doi-asserted-by":"crossref","unstructured":"J. Olive, in ICASSP\u201977. IEEE International Conference on Acoustics, Speech, and Signal Processing. Rule synthesis of speech from dyadic units, vol. 2 (IEEE, 1977), p. 568\u2013570","DOI":"10.1109\/ICASSP.1977.1170350"},{"issue":"5","key":"351_CR23","doi-asserted-by":"publisher","first-page":"1234","DOI":"10.1109\/JPROC.2013.2251852","volume":"101","author":"K Tokuda","year":"2013","unstructured":"K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, K. Oura, Speech synthesis based on hidden Markov models. Proc. IEEE 101(5), 1234\u20131252 (2013)","journal-title":"Proc. IEEE"},{"key":"351_CR24","unstructured":"A. Van Den Oord, et al., Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499. (2016)"},{"key":"351_CR25","unstructured":"Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, T.Y. Liu, in International Conference on Learning Representations. FastSpeech 2: fast and high-quality end-to-end text to speech (ICLR, 2020)"},{"key":"351_CR26","unstructured":"Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, T.Y. Liu, FastSpeech: fast, robust and controllable text to speech. Adv. Neural Inf. Process. Syst. 32 (2019)"},{"key":"351_CR27","unstructured":"J. Betker, Better speech synthesis through scaling.\u00a0(2023).\u00a0arXiv\u00a0preprint\u00a0arXiv:2305.07243"},{"key":"351_CR28","doi-asserted-by":"crossref","unstructured":"J. Xue, Y. Deng, Y. Han, Y. Li, J. Sun, J. Liang, in 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP). ECAPA-TDNN for multi-speaker text-to-speech synthesis (IEEE, 2022), p. 230\u2013234","DOI":"10.1109\/ISCSLP57327.2022.10037956"},{"key":"351_CR29","unstructured":"C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, et al., Neural codec language models are zero-shot text to speech synthesizers.\u00a0(2023).\u00a0arXiv\u00a0preprint\u00a0arXiv:2301.02111"},{"key":"351_CR30","unstructured":"J. Wang, Z. Du, Q. Chen, Y. Chu, Z. Gao, Z. Li, K. Hu, X. Zhou, J. Xu, Z. Ma, et al., LauraGPT: listen, attend, understand, and regenerate audio with GPT.\u00a0(2023).\u00a0arXiv\u00a0e-prints\u00a0pp.\u00a0arXiv\u20132310"},{"key":"351_CR31","unstructured":"Y.N. Dauphin, A. Fan, M. Auli, D. Grangier, in International conference on machine learning. 
Language modeling with gated convolutional networks (PMLR, 2017), p. 933\u2013941"},{"key":"351_CR32","unstructured":"J.L. Ba, J.R. Kiros, G.E. Hinton, Layer normalization.\u00a0(2016).\u00a0arXiv\u00a0preprint\u00a0arXiv:1607.06450"},{"key":"351_CR33","doi-asserted-by":"crossref","unstructured":"H. Zen, V. Dang, R. Clark, Y. Zhang, R.J. Weiss, Y. Jia, Z. Chen, Y. Wu, in Proc. Interspeech 2019. LibriTTS: a corpus derived from LibriSpeech for text-to-speech (ISCA, 2019), p. 1526\u20131530","DOI":"10.21437\/Interspeech.2019-2441"},{"key":"351_CR34","doi-asserted-by":"crossref","unstructured":"Y. Shi, H. Bu, X. Xu, S. Zhang, M. Li, AiShell-3: a multi-speaker Mandarin TTS corpus and the baselines.\u00a0(2020).\u00a0arXiv\u00a0preprint\u00a0arXiv:2010.11567","DOI":"10.21437\/Interspeech.2021-755"},{"key":"351_CR35","first-page":"17022","volume":"33","author":"J Kong","year":"2020","unstructured":"J. Kong, J. Kim, J. Bae, HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 33, 17022\u201317033 (2020)","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"351_CR36","doi-asserted-by":"crossref","unstructured":"R. Kubichek, in Proceedings of IEEE Pacific Rim Conference on Communications Computers and Signal Processing. Mel-cepstral distance measure for objective speech quality assessment, vol. 1 (IEEE, 1993), p. 125\u2013128","DOI":"10.1109\/PACRIM.1993.407206"},{"issue":"5","key":"351_CR37","doi-asserted-by":"publisher","first-page":"561","DOI":"10.3233\/IDA-2007-11508","volume":"11","author":"S Salvador","year":"2007","unstructured":"S. Salvador, P. Chan, Toward accurate dynamic time warping in linear time and space. Intell. Data Anal. 11(5), 561\u2013580 (2007)","journal-title":"Intell. Data Anal."},{"issue":"1","key":"351_CR38","first-page":"1929","volume":"15","author":"N Srivastava","year":"2014","unstructured":"N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929\u20131958 (2014)","journal-title":"J. Mach. Learn. Res."},{"key":"351_CR39","unstructured":"Microsoft. VALLE-X online demo (2023). https:\/\/huggingface.co\/spaces\/Plachta\/VALL-E-X. Accessed 20 May 2024"},{"key":"351_CR40","unstructured":"CoquiTTS. Coqui TTS online demo (2023). https:\/\/github.com\/coqui-ai\/TTS. Accessed 20 May 2024"},{"key":"351_CR41","unstructured":"OpenVoice. OpenVoice online demo (2024). https:\/\/huggingface.co\/spaces\/myshell-ai\/OpenVoice. Accessed 20 May 2024"},{"key":"351_CR42","unstructured":"LauraTTS. LauraTTS online demo (2024). https:\/\/modelscope.cn\/models\/iic\/speech_synthesizer-laura-en-libritts-16k-codec_nq2-pytorch\/summary. Accessed 20 May 2024"},{"key":"351_CR43","unstructured":"CoquiTTS. Coqui TTS datasets (2023). https:\/\/docs.coqui.ai\/en\/dev\/tts_datasets.html. 
Accessed 20 May 2024"}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-024-00351-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-024-00351-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-024-00351-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,20]],"date-time":"2024-11-20T04:58:55Z","timestamp":1732078735000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-024-00351-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,28]]},"references-count":43,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["351"],"URL":"https:\/\/doi.org\/10.1186\/s13636-024-00351-9","relation":{},"ISSN":["1687-4722"],"issn-type":[{"type":"electronic","value":"1687-4722"}],"subject":[],"published":{"date-parts":[[2024,5,28]]},"assertion":[{"value":"17 November 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 May 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 May 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"28"}}
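
Note: the abstract in this record describes TSCM as fusing local-level features from a Gated Convolutional Network (GCN) with utterance-level features from a GRU to form the speaker control signal. The following is a minimal, hypothetical PyTorch sketch of that two-branch idea, assuming a mel-spectrogram reference input, GLU-style gated convolutions (Dauphin et al., ref. 351_CR31), mean-pooling of the local branch, and a learned sigmoid gate for fusion; the module and parameter names, dimensions, and fusion rule are illustrative assumptions, not the paper's exact architecture.

# Hypothetical sketch only: all names, dimensions, and the fusion rule are
# assumptions; this record does not specify the paper's exact architecture.
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """GLU-style gated 1-D convolution with a residual connection."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Output 2*channels so the result can be split into value/gate halves.
        self.conv = nn.Conv1d(channels, 2 * channels,
                              kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        value, gate = self.conv(x).chunk(2, dim=1)
        y = value * torch.sigmoid(gate)                # gated linear unit
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)
        return x + y                                   # residual connection

class TwoBranchSpeakerControl(nn.Module):
    """Fuse local (GCN branch) and utterance-level (GRU branch) features
    extracted from a reference mel spectrogram into one speaker vector."""
    def __init__(self, n_mels: int = 80, d_model: int = 256, n_blocks: int = 4):
        super().__init__()
        self.proj = nn.Conv1d(n_mels, d_model, kernel_size=1)
        self.gcn = nn.Sequential(*[GatedConvBlock(d_model) for _ in range(n_blocks)])
        self.gru = nn.GRU(n_mels, d_model, batch_first=True)
        self.fuse_gate = nn.Linear(2 * d_model, d_model)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, frames, n_mels) mel spectrogram of the reference speech
        local = self.gcn(self.proj(ref_mel.transpose(1, 2)))  # (B, D, T) local features
        local = local.mean(dim=2)                             # pool frames -> (B, D)
        _, h = self.gru(ref_mel)                              # final hidden state
        utt = h[-1]                                           # utterance-level (B, D)
        # Learned gate decides, per dimension, how to mix the two branches.
        gate = torch.sigmoid(self.fuse_gate(torch.cat([local, utt], dim=-1)))
        return gate * local + (1.0 - gate) * utt              # fused speaker vector

# Usage: condition a TTS backbone (e.g., a FastSpeech 2 or VITS encoder)
# on the fused embedding.
spk = TwoBranchSpeakerControl()
emb = spk(torch.randn(2, 120, 80))   # two reference clips, 120 frames each
print(emb.shape)                      # torch.Size([2, 256])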