{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T22:18:46Z","timestamp":1757629126860,"version":"3.44.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"9","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>Mongolian speech synthesis is a technology that converts Mongolian text into Mongolian speech. To improve the emotional expressiveness of synthesized speech, this article first proposes WFST-MnG2P, a lightweight Mongolian grapheme-to-phoneme pre-training model based on a weighted finite-state transducer. Second, because Mongolian, a representative low-resource language, currently has no open-source emotional speech corpus, a Mongolian emotional speech corpus containing seven discrete emotions, totaling about 2.25 hours, was constructed. Finally, since a non-autoregressive acoustic model can reduce word skipping, word omission, and repeated pronunciation while speeding up synthesis, this article proposes a Mongolian emotional speech synthesis model based on a conditional generative adversarial network and an improved FastSpeech2. 
Experimental results show that the average MOS score of emotional speech on the self-built Mongolian emotional speech corpus is 3.69, and that the model can synthesize Mongolian emotional speech with rich multi-dimensional emotion and greater robustness.<\/jats:p>","DOI":"10.1145\/3749102","type":"journal-article","created":{"date-parts":[[2025,7,17]],"date-time":"2025-07-17T11:21:08Z","timestamp":1752751268000},"page":"1-16","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Mongolian Emotional Speech Synthesis Based on CGAN and Improved FastSpeech2"],"prefix":"10.1145","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3079-764X","authenticated-orcid":false,"given":"Ren","family":"Qingdaoerji","sequence":"first","affiliation":[{"name":"Inner Mongolia University of Technology","place":["Hohhot, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-6654-4150","authenticated-orcid":false,"given":"Yang","family":"Yang","sequence":"additional","affiliation":[{"name":"Inner Mongolia University of Technology","place":["Hohhot, China"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-2974-9898","authenticated-orcid":false,"given":"Wang","family":"Lele","sequence":"additional","affiliation":[{"name":"Inner Mongolia University of Technology","place":["Hohhot, China"]}]}],"member":"320","published-online":{"date-parts":[[2025,9,10]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Sercan Arik Gregory Diamos Andrew Gibiansky John Miller Kainan Peng Wei Ping Jonathan Raiman and Yanqi Zhou. 2017. Deep voice 2: Multi-speaker neural text-to-speech. arXiv:1705.08947. 
Retrieved from https:\/\/arxiv.org\/abs\/1705.08947"},{"key":"e_1_3_1_3_2","first-page":"195","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Ar\u0131k Sercan \u00d6","year":"2017","unstructured":"Sercan \u00d6 Ar\u0131k, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et\u00a0al. 2017. Deep voice: Real-time neural text-to-speech. In Proceedings of the International Conference on Machine Learning. PMLR, 195\u2013204."},{"issue":"6","key":"e_1_3_1_4_2","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1007\/s11222-023-10296-2","article-title":"Consistency factor for the MCD estimator at the student-t distribution","volume":"33","author":"Barabesi Lucio","year":"2023","unstructured":"Lucio Barabesi, Andrea Cerioli, Luis Angel Garc\u00eda-Escudero, and Agust\u00edn Mayo-Iscar. 2023. Consistency factor for the MCD estimator at the student-t distribution. Statistics and Computing 33, 6 (2023), 132.","journal-title":"Statistics and Computing"},{"key":"e_1_3_1_5_2","doi-asserted-by":"crossref","first-page":"399","DOI":"10.18653\/v1\/P16-1038","volume-title":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Deri Aliya","year":"2016","unstructured":"Aliya Deri and Kevin Knight. 2016. Grapheme-to-phoneme models for (almost) any language. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 399\u2013408."},{"key":"e_1_3_1_6_2","first-page":"184","volume-title":"Proceedings of the 2022 International Conference on Asian Language Processing (IALP)","author":"Hu Yifan","year":"2022","unstructured":"Yifan Hu, Pengkai Yin, Rui Liu, Feilong Bao, and Guanglai Gao. 2022. MnTTS: An open-source mongolian text-to-speech synthesis dataset and accompanied baseline. 
In Proceedings of the 2022 International Conference on Asian Language Processing (IALP). IEEE, 184\u2013189."},{"key":"e_1_3_1_7_2","article-title":"Transfer learning from speaker verification to multispeaker text-to-speech synthesis","volume":"31","author":"Jia Ye","year":"2018","unstructured":"Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et\u00a0al. 2018. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in Neural Information Processing Systems 31 (2018).","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"4","key":"e_1_3_1_8_2","first-page":"678","article-title":"End-to-end emotional speech synthesis method based on conditional variational autoencoder","volume":"39","author":"Jianming Zhang","year":"2023","unstructured":"Zhang Jianming, Peng Jintao, Jia Hongjie, and Mao Qirong. 2023. End-to-end emotional speech synthesis method based on conditional variational autoencoder. Signal Processing 39, 4 (2023), 678\u2013687.","journal-title":"Signal Processing"},{"issue":"2","key":"e_1_3_1_9_2","first-page":"177","article-title":"Comparison of MOS and PC evaluation methods in the evaluation of chinese speech synthesis systems","volume":"27","author":"Jieping Xu","year":"2006","unstructured":"Xu Jieping, Yan Li, and He Lin. 2006. Comparison of MOS and PC evaluation methods in the evaluation of chinese speech synthesis systems. Microcomputer Applications 27, 2 (2006), 177\u2013180.","journal-title":"Microcomputer Applications"},{"key":"e_1_3_1_10_2","first-page":"5530","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Kim Jaehyeon","year":"2021","unstructured":"Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the International Conference on Machine Learning. 
PMLR, 5530\u20135540."},{"key":"e_1_3_1_11_2","first-page":"13198","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Lee Sang-Hoon","year":"2021","unstructured":"Sang-Hoon Lee, Hyun-Wook Yoon, Hyeong-Rae Noh, Ji-Hoon Kim, and Seong-Whan Lee. 2021. Multi-spectrogan: High-diversity and high-fidelity spectrogram generation with adversarial style combination for speech synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence. 13198\u201313206."},{"key":"e_1_3_1_12_2","unstructured":"Younggun Lee Azam Rabiee and Soo-Young Lee. 2017. Emotional end-to-end neural speech synthesizer. arXiv:1711.05447. Retrieved from https:\/\/arxiv.org\/abs\/1711.05447"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2022.3145293"},{"key":"e_1_3_1_14_2","first-page":"483","volume-title":"Proceedings of the 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP)","author":"Li Jingdong","year":"2018","unstructured":"Jingdong Li, Hui Zhang, Rui Liu, Xueliang Zhang, and Feilong Bao. 2018. End-to-end mongolian text-to-speech system. In Proceedings of the 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 483\u2013487."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016706"},{"key":"e_1_3_1_16_2","volume-title":"Research on Mongolian Speech Synthesis Based on Deep Learning","author":"Liu Rui","year":"2020","unstructured":"Rui Liu. 2020. Research on Mongolian Speech Synthesis Based on Deep Learning. Ph. D. Dissertation. Inner Mongolia University."},{"key":"e_1_3_1_17_2","first-page":"99","volume-title":"Proceedings of the 14th National Conference on Man-Machine Speech Communication: NCMMSC 2017, Lianyungang, China, October 11\u201313, 2017, Revised Selected Papers 14","author":"Liu Rui","year":"2018","unstructured":"Rui Liu, Feilong Bao, Guanglai Gao, and Yonghe Wang. 2018. 
Mongolian text-to-speech system based on deep neural network. In Proceedings of the 14th National Conference on Man-Machine Speech Communication: NCMMSC 2017, Lianyungang, China, October 11\u201313, 2017, Revised Selected Papers 14. Springer, 99\u2013108."},{"key":"e_1_3_1_18_2","unstructured":"Zhinan Liu. 2019. Research on end-to-end Mongolian speech synthesis method. (2019)."},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7953089"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-1386"},{"key":"e_1_3_1_21_2","unstructured":"Soroush Mehri Kundan Kumar Ishaan Gulrajani Rithesh Kumar Shubham Jain Jose Sotelo Aaron Courville and Yoshua Bengio. 2016. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837. Retrieved from https:\/\/arxiv.org\/abs\/1612.07837"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1017\/S1351324915000315"},{"key":"e_1_3_1_23_2","unstructured":"Aaron van den Oord Sander Dieleman Heiga Zen Karen Simonyan Oriol Vinyals Alex Graves Nal Kalchbrenner Andrew Senior and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv:1609.03499. Retrieved from https:\/\/arxiv.org\/abs\/1609.03499"},{"key":"e_1_3_1_24_2","first-page":"1094","volume-title":"Proceedings of the ICLR","author":"Ping Wei","year":"2018","unstructured":"Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. 2018. Deep voice 3: 2000-speaker neural text-to-speech. In Proceedings of the ICLR. 1094\u20131099."},{"issue":"7","key":"e_1_3_1_25_2","first-page":"86","article-title":"MonTTS: A fully non-autoregressive real-time, high-fidelity mongolian speech synthesis model","volume":"36","author":"Rui Liu","year":"2022","unstructured":"Liu Rui, Kang Shiyin, and Gao Guanglai. 2022. MonTTS: A fully non-autoregressive real-time, high-fidelity mongolian speech synthesis model. 
Journal of Chinese Information Processing 36, 7 (2022), 86\u201397.","journal-title":"Journal of Chinese Information Processing"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"e_1_3_1_27_2","first-page":"4693","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Skerry-Ryan R. J.","year":"2018","unstructured":"R. J. Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, and Rif A. Saurous. 2018. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. In Proceedings of the International Conference on Machine Learning. PMLR, 4693\u20134702."},{"key":"e_1_3_1_28_2","unstructured":"Jose Sotelo Soroush Mehri Kundan Kumar Joao Felipe Santos Kyle Kastner Aaron Courville and Yoshua Bengio. 2017. Char2wav: End-to-end speech synthesis. (2017)."},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00530-014-0446-1"},{"key":"e_1_3_1_30_2","first-page":"129","volume-title":"Proceedings of the 1st Chinese Conference on Affective Computing and Intelligent Interaction, Beijing","author":"Tao J.","year":"2003","unstructured":"J. Tao and X. Xu. 2003. Emotion oriented speech synthesis system. In Proceedings of the 1st Chinese Conference on Affective Computing and Intelligent Interaction, Beijing. 129\u2013132."},{"key":"e_1_3_1_31_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_32_2","doi-asserted-by":"crossref","unstructured":"Yuxuan Wang R. J. Skerry-Ryan Daisy Stanton Yonghui Wu Ron J. 
Weiss Navdeep Jaitly Zongheng Yang Ying Xiao Zhifeng Chen Samy Bengio et\u00a0al. 2017. Tacotron: Towards end-to-end speech synthesis. arXiv:1703.10135. Retrieved from https:\/\/arxiv.org\/abs\/1703.10135","DOI":"10.21437\/Interspeech.2017-1452"},{"key":"e_1_3_1_33_2","first-page":"5180","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Wang Yuxuan","year":"2018","unstructured":"Yuxuan Wang, Daisy Stanton, Yu Zhang, R. J.Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A. Saurous. 2018. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In Proceedings of the International Conference on Machine Learning. PMLR, 5180\u20135189."},{"key":"e_1_3_1_34_2","unstructured":"Tan Xu Chen Jiawei Liu Haohe Cong Jian Zhang Chen Liu Yanqing Wang Xi Leng Yichong Yi Yuanhao He Lei et\u00a0al. 2022. NaturalSpeech: End-to-end text to speech synthesis with human-level quality. arXiv:2205.04421. Retrieved from https:\/\/arxiv.org\/abs\/2205.04421"},{"key":"e_1_3_1_35_2","unstructured":"Ren Yi Hu Chenxu Tan Xu Qin Tao Zhao Sheng Zhao Zhou and Liu Tie-Yan. 2020. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv:2006.04558. Retrieved from https:\/\/arxiv.org\/abs\/2006.04558"},{"key":"e_1_3_1_36_2","article-title":"Fastspeech: Fast, robust and controllable text to speech","volume":"32","author":"Yi Ren","year":"2019","unstructured":"Ren Yi, Ruan Yangjun, Tan Xu, Qin Tao, Zhao Sheng, Zhao Zhou, and Liu Tie-Yan. 2019. Fastspeech: Fast, robust and controllable text to speech. 
Advances in Neural Information Processing Systems 32 (2019).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_37_2","first-page":"6945","volume-title":"Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Zhang Ya-Jie","year":"2019","unstructured":"Ya-Jie Zhang, Shifeng Pan, Lei He, and Zhen-Hua Ling. 2019. Learning latent representations for style control and transfer in end-to-end speech synthesis. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 6945\u20136949."}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3749102","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T13:29:32Z","timestamp":1757510972000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3749102"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,10]]},"references-count":36,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3749102"],"URL":"https:\/\/doi.org\/10.1145\/3749102","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2025,9,10]]},"assertion":[{"value":"2024-07-25","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-31","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2025-09-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}