{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,9]],"date-time":"2026-07-09T15:42:33Z","timestamp":1783611753866,"version":"3.55.0"},"reference-count":30,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2024,2,27]],"date-time":"2024-02-27T00:00:00Z","timestamp":1708992000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Neurorobot."],"abstract":"<jats:p>Deep learning has significantly advanced text-to-speech (TTS) systems. These neural network-based systems have enhanced speech synthesis quality and are increasingly vital in applications like human-computer interaction. However, conventional TTS models still face challenges, as the synthesized speeches often lack naturalness and expressiveness. Additionally, the slow inference speed, reflecting low efficiency, contributes to the reduced voice quality. This paper introduces SynthRhythm-TTS (SR-TTS), an optimized Transformer-based structure designed to enhance synthesized speech. SR-TTS not only improves phonological quality and naturalness but also accelerates the speech generation process, thereby increasing inference efficiency. SR-TTS contains an encoder, a rhythm coordinator, and a decoder. In particular, a pre-duration predictor within the cadence coordinator and a self-attention-based feature predictor work together to enhance the naturalness and articulatory accuracy of speech. In addition, the introduction of causal convolution enhances the consistency of the time series. The cross-linguistic capability of SR-TTS is validated by training it on both English and Chinese corpora. Human evaluation shows that SR-TTS outperforms existing techniques in terms of speech quality and naturalness of expression. This technology is particularly suitable for applications that require high-quality natural speech, such as intelligent assistants, speech synthesized podcasts, and human-computer interaction.<\/jats:p>","DOI":"10.3389\/fnbot.2024.1322312","type":"journal-article","created":{"date-parts":[[2024,2,27]],"date-time":"2024-02-27T04:15:10Z","timestamp":1709007310000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["SR-TTS: a rhyme-based end-to-end speech synthesis system"],"prefix":"10.3389","volume":"18","author":[{"given":"Yihao","family":"Yao","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Tao","family":"Liang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Rui","family":"Feng","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Keke","family":"Shi","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Junxiao","family":"Yu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Wei","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jianqing","family":"Li","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1965","published-online":{"date-parts":[[2024,2,27]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"58","DOI":"10.47672\/ejt.1473","article-title":"A comprehensive survey of deep learning techniques natural language processing","volume":"7","author":"Bharadiya","year":"2023","journal-title":"Eur. J. Technol"},{"key":"B2","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2006.03575","author":"Donahue","year":"2006","journal-title":"End-to-end adversarial text-to-speech. arxiv preprint arxiv:2006.03575"},{"key":"B3","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-1461","author":"Elias","year":"2021","journal-title":"Parallel tacotron 2: a non-autoregressive neural tts model with differentiable duration modeling. arxiv preprint arxiv: 2103.14574"},{"key":"B4","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2014-443","author":"Fan","year":"2014","journal-title":"Tts synthesis with bidirectional lstm based recurrent neural networks"},{"key":"B5","doi-asserted-by":"publisher","first-page":"6840","DOI":"10.48550\/arXiv.2006.11239","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B6","doi-asserted-by":"publisher","DOI":"10.1201\/9781315272702","author":"Holmes","year":"2002","journal-title":"Speech Synthesis and Recognition"},{"key":"B7","author":"Ito","year":"2017","journal-title":"The Lj speech dataset"},{"key":"B8","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-469","author":"Jeong","year":"2021","journal-title":"Diff-tts: a denoising diffusion model for text-to-speech. arxiv preprint arxiv: 2104.01409"},{"key":"B9","first-page":"3331","author":"Kenter","year":"2019","journal-title":"Chive: Varying Prosody in Speech Synthesis With a Linguistically Driven Dynamic Hierarchical Conditional Variational Network"},{"key":"B10","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2009.09761","article-title":"Diffwave: a versatile diffusion model for audio synthesis","author":"Kong","year":"2009","journal-title":"Arxiv Preprint Arxiv: 2009.09761."},{"key":"B11","doi-asserted-by":"publisher","first-page":"15171","DOI":"10.1007\/s11042-022-13943-4","article-title":"A deep learning approaches in text-to-speech system: a systematic review and recent research perspective","volume":"82","author":"Kumar","year":"2023","journal-title":"Multimed. Tools Appl"},{"key":"B12","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1016\/S0167-6393(01)00045-0","article-title":"A segmental speech coder based on a concatenative tts","volume":"38","author":"Lee","year":"2002","journal-title":"Speech Commun"},{"key":"B13","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016706","author":"Li","year":"2019","journal-title":"Neural speech synthesis with transformer network"},{"key":"B14","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2110.12612","author":"Liu","year":"2021","journal-title":"Delightfultts: the microsoft speech synthesis system for blizzard challenge 2021. arxiv preprint arxiv: 2110.12612"},{"key":"B15","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2011.05533","article-title":"Spoken language interaction with robots: research issues and recommendations, report from the nsf future directions workshop","author":"Marge","year":"2020","journal-title":"Arxiv Preprint Arxiv: 2011.05533."},{"key":"B16","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1992.226117","article-title":"A rule-based text-to-speech system for portuguese","author":"Oliviera","year":"1992","journal-title":"IEEE Comput. Soc."},{"key":"B17","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1561\/2000000001","article-title":"Introduction to digital speech processing","volume":"1","author":"Rabiner","year":"2007","journal-title":"Found Trends Signal Process"},{"key":"B18","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2006.04558","author":"Ren","year":"2006","journal-title":"Fastspeech 2: fast and high-quality end-to-end text to speech. arxiv preprint arxiv: 2006.04558."},{"key":"B19","article-title":"Fastspeech: fast, robust and controllable text to speech","author":"Ren","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B20","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2074","article-title":"Self-attention with relative position representations","author":"Shaw","year":"2018","journal-title":"Arxiv Preprint Arxiv: 1803.02155."},{"key":"B21","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-755","author":"Shi","year":"2020","journal-title":"Aishell-3: a multi-speaker mandarin tts corpus and the baselines. arxiv preprint arxiv:"},{"key":"B22","doi-asserted-by":"publisher","first-page":"346","DOI":"10.1109\/SLT.2018.8639599","article-title":"Mos naturalness and the quest for human-like speech","volume":"2018","author":"Shirali-Shahreza","year":"2018","journal-title":"IEEE."},{"key":"B23","author":"Suni","year":"2013","journal-title":"Wavelets for intonation modeling in hmm speech synthesis."},{"key":"B24","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2106.15561","author":"Tan","year":"2021","journal-title":"A survey on neural speech synthesis. arxiv preprint arxiv: 2106.15561"},{"key":"B25","volume-title":"A prosody model to tts systems","author":"Teixeira","year":"2004"},{"key":"B26","doi-asserted-by":"publisher","first-page":"37","DOI":"10.25073\/2588-1086\/vnucsce.358","article-title":"VLSP 2021-TTS challenge: vietnamese spontaneous speech synthesis","volume":"38","author":"Trang","year":"2022","journal-title":"VNU J. Sci. Comput. Sci. Commun. Eng"},{"key":"B27","doi-asserted-by":"publisher","first-page":"1","DOI":"10.48550\/arXiv.1706.03762","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst"},{"key":"B28","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-1452","author":"Wang","year":"2017","journal-title":"Tacotron: towards end-to-end speech synthesis. arxiv preprint arxiv: 1703.10135"},{"key":"B29","doi-asserted-by":"publisher","first-page":"41","DOI":"10.3390\/e25010041","article-title":"DIA-TTS: deep-inherited attention-based text-to-speech synthesizer","volume":"25","author":"Yu","year":"2022","journal-title":"Entropy"},{"key":"B30","author":"Zen","year":"2015","journal-title":"Acoustic modeling in statistical parametric speech synthesis-from hmm to lstm-rnn"}],"container-title":["Frontiers in Neurorobotics"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2024.1322312\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,27]],"date-time":"2024-02-27T04:15:20Z","timestamp":1709007320000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fnbot.2024.1322312\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,27]]},"references-count":30,"alternative-id":["10.3389\/fnbot.2024.1322312"],"URL":"https:\/\/doi.org\/10.3389\/fnbot.2024.1322312","relation":{},"ISSN":["1662-5218"],"issn-type":[{"value":"1662-5218","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,27]]},"article-number":"1322312"}}