{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,12]],"date-time":"2025-09-12T17:44:51Z","timestamp":1757699091996,"version":"3.41.0"},"reference-count":25,"publisher":"Association for Computing Machinery (ACM)","issue":"9","license":[{"start":{"date-parts":[[2024,8,16]],"date-time":"2024-08-16T00:00:00Z","timestamp":1723766400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Natural Science Foundation of Xinjiang Uygur Autonomous Region of China","award":["2022D01C59"],"award-info":[{"award-number":["2022D01C59"]}]},{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2018YFC0823402"],"award-info":[{"award-number":["2018YFC0823402"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2024,9,30]]},"abstract":"<jats:p>End-to-end speech synthesis methodologies have exhibited considerable advancements for languages with abundant corpus resources. Nevertheless, such achievements are yet to be realized for languages constrained by limited corpora. This manuscript delineates a novel strategy that leverages contextual encoding information to augment the naturalness of the speech synthesized through FastSpeech2, particularly under resource-scarce conditions. Initially, we harness the cross-linguistic model XLM-RoBERTa to extract contextual features, which serve as an auxiliary input to the mel-spectrum decoder of FastSpeech2. Subsequently, we refine the mel-spectrum prediction module to mitigate the overfitting dilemma encountered by FastSpeech2 amidst scant training datasets. To this end, Conformer blocks, rather than traditional Transformer blocks, are employed within both the encoder and decoder to concentrate intensively on varying levels and granularities of feature information. Additionally, we introduce a token-average mechanism to equalize pitch and energy attributes at the frame level. The empirical outcomes indicate that our pre-training with the LJ Speech dataset, followed by fine-tuning using a modest 10-minute paired Uyghur corpus, yields satisfactory synthesized Uyghur speech. Relative to the baseline framework, our proposed technique halves the character error rate and enhances the mean opinion score by over 0.6. Similar results were observed in Mandarin Chinese experimental evaluations.<\/jats:p>","DOI":"10.1145\/3675397","type":"journal-article","created":{"date-parts":[[2024,6,28]],"date-time":"2024-06-28T11:05:45Z","timestamp":1719572745000},"page":"1-11","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Optimizing Uyghur Speech Synthesis by Combining Pretrained Cross-Lingual Model"],"prefix":"10.1145","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-8154-4882","authenticated-orcid":false,"given":"Kexin","family":"Lu","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Xinjiang University, Urumqi, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-0973-4401","authenticated-orcid":false,"given":"Zhihua","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xinjiang University, Urumqi, China and Key Laboratory of Signal Detection and Processing in Xinjiang, Xinjiang University, Wulumuqi, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4518-6257","authenticated-orcid":false,"given":"Mingming","family":"Yin","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xinjiang University, Urumqi, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-3069-9769","authenticated-orcid":false,"given":"Ke","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xinjiang University, Urumqi, China"}]}],"member":"320","published-online":{"date-parts":[[2024,8,16]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Jose Sotelo Soroush Mehri Kundan Kumar Joao Felipe Santos Kyle Kastner Aaron Courville and Yoshua Bengio. 2017. Char2Wav: End-to-end speech synthesis. (April 2017)."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","unstructured":"Yuxuan Wang R. J. Skerry-Ryan Daisy Stanton Yonghui Wu Ron J. Weiss Navdeep Jaitly Zongheng Yang Ying Xiao Zhifeng Chen Samy Bengio Quoc Le Yannis Agiomyrgiannakis Rob Clark and Rif A. Saurous. 2017. Tacotron: Towards End-to-End Speech Synthesis. (April 2017). DOI:10.48550\/arXiv.1703.10135arxiv:1703.10135","DOI":"10.48550\/arXiv.1703.10135"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"e_1_3_2_5_2","volume-title":"Advances in Neural Information Processing Systems","author":"Ren Yi","year":"2019","unstructured":"Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc."},{"key":"e_1_3_2_6_2","volume-title":"International Conference on Learning Representations","author":"Ren Yi","year":"2022","unstructured":"Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2022. FastSpeech 2: Fast and high-quality end-to-end text to speech. In International Conference on Learning Representations."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","unstructured":"Jin Xu Xu Tan Yi Ren Tao Qin Jian Li Sheng Zhao and Tie-Yan Liu. 2020. LRSpeech: Extremely low-resource speech synthesis and recognition. (Aug. 2020). DOI:10.48550\/arXiv.2008.03687","DOI":"10.48550\/arXiv.2008.03687"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","unstructured":"Xu Tan Tao Qin Frank Soong and Tie-Yan Liu. 2021. A Survey on Neural Speech Synthesis. (July 2021). DOI:10.48550\/arXiv.2106.15561arxiv:cs eess\/2106.15561","DOI":"10.48550\/arXiv.2106.15561"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","unstructured":"Tao Tu Yuan-Jui Chen Cheng-chieh Yeh and Hung-yi Lee. 2019. End-to-End Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning. (April 2019). DOI:10.48550\/arXiv.1904.06508arxiv:cs eess\/1904.06508","DOI":"10.48550\/arXiv.1904.06508"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683862"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-1333"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2017.2762432"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-3177"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","unstructured":"Guillaume Lample and Alexis Conneau. 2019. Cross-Lingual Language Model Pretraining. (Jan. 2019). DOI:10.48550\/arXiv.1901.07291arxiv:1901.07291","DOI":"10.48550\/arXiv.1901.07291"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","unstructured":"Isabel Papadimitriou Ethan A. Chi Richard Futrell and Kyle Mahowald. 2021. Deep Subjecthood: Higher-Order Grammatical Features in Multilingual BERT. (Jan. 2021). DOI:10.48550\/arXiv.2101.11043arxiv:2101.11043","DOI":"10.48550\/arXiv.2101.11043"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","unstructured":"Alexis Conneau Kartikay Khandelwal Naman Goyal Vishrav Chaudhary Guillaume Wenzek Francisco Guzm\u00e1n Edouard Grave Myle Ott Luke Zettlemoyer and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. (April 2020). DOI:10.48550\/arXiv.1911.02116arxiv:1911.02116","DOI":"10.48550\/arXiv.1911.02116"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","unstructured":"Anmol Gulati James Qin Chung-Cheng Chiu Niki Parmar Yu Zhang Jiahui Yu Wei Han Shibo Wang Zhengdong Zhang Yonghui Wu and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. (May 2020). DOI:10.48550\/arXiv.2005.08100arxiv:cs eess\/2005.08100","DOI":"10.48550\/arXiv.2005.08100"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.21105\/joss.03958"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","unstructured":"Taku Kudo and John Richardson. 2018. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. (Aug. 2018). DOI:10.48550\/arXiv.1808.06226arxiv:1808.06226","DOI":"10.48550\/arXiv.1808.06226"},{"key":"e_1_3_2_20_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9414858"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413889"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.21437\/Blizzard.2021-2"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i11.26479"},{"key":"e_1_3_2_25_2","unstructured":"Keith Ito. 2017. The LJ Speech Dataset. (2017)."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACRIM.1993.407206"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3675397","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3675397","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:04:23Z","timestamp":1750291463000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3675397"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,16]]},"references-count":25,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2024,9,30]]}},"alternative-id":["10.1145\/3675397"],"URL":"https:\/\/doi.org\/10.1145\/3675397","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2024,8,16]]},"assertion":[{"value":"2023-10-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-06-22","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}