{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T16:43:01Z","timestamp":1764175381466,"version":"build-2065373602"},"reference-count":46,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2023,1,12]],"date-time":"2023-01-12T00:00:00Z","timestamp":1673481600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Natural Science Foundation of China\u2014Research on Key Technologies of Speech Recognition of Chinese and Western Asian Languages under Resource Constraints","award":["62066043","ZDI135-133"],"award-info":[{"award-number":["62066043","ZDI135-133"]}]},{"name":"National Language Commission key Project\u2014Research on Speech Keyword Search Technology of Chinese and Western Asian Languages","award":["62066043","ZDI135-133"],"award-info":[{"award-number":["62066043","ZDI135-133"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Building a good speech recognition system usually requires a lot of pairing data, which poses a big challenge for low-resource languages, such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recognition, but it is rarely used in Kazakh and other Central and West Asian languages. In this paper, wav2vec2.0 is improved by integrating a Factorized TDNN layer to better preserve the relationship between the voice and the time step before and after the quantization, which is called wav2vec-F. The unsupervised pre-training strategy was used to learn the potential speech representation from a large number of unlabeled audio data and was applied to the cross-language ASR task, which was optimized using the noise contrast binary classification task. At the same time, speech synthesis is used to promote the performance of speech recognition. The experiment shows that wav2vec-F can effectively utilize the unlabeled data from non-target languages, and the multi-language pre-training is obviously better than the single-language pre-training. The data enhancement method using speech synthesis can bring huge benefits. Compared with the baseline model, Librispeech\u2019s test-clean dataset has an average reduction of 1.9% in the word error rate. On the Kazakh KSC test set, the pre-training using only Kazakh reduced the word error rate by 3.8%. 
Multilingual pre-training combined with a small amount of TTS-synthesized Kazakh speech achieved a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, comparable to previous end-to-end models trained on 30 times more labeled data.<\/jats:p>","DOI":"10.3390\/s23020870","type":"journal-article","created":{"date-parts":[[2023,1,12]],"date-time":"2023-01-12T04:29:38Z","timestamp":1673497778000},"page":"870","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training"],"prefix":"10.3390","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9902-5061","authenticated-orcid":false,"given":"Weijing","family":"Meng","sequence":"first","affiliation":[{"name":"Xinjiang Multilingual Information Technology Laboratory, Urumqi 830017, China"},{"name":"College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China"}]},{"given":"Nurmemet","family":"Yolwas","sequence":"additional","affiliation":[{"name":"Xinjiang Multilingual Information Technology Laboratory, Urumqi 830017, China"},{"name":"College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,1,12]]},"reference":[{"key":"ref_1","unstructured":"Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., and Schwarz, P. (2011, December 11\u201315). The Kaldi speech recognition toolkit. Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Big Island, HI, USA."},{"key":"ref_2","unstructured":"Mohamed, A., Okhonko, D., and Zettlemoyer, L. (2019). Transformers with convolutional context for ASR. arXiv."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Chiu, C.C., Sainath, T.N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R.J., Rao, K., and Gonina, E. (2018, April 15\u201320). State-of-the-art speech recognition with sequence-to-sequence models. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462105"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., and Bengio, Y. (2016, March 20\u201325). End-to-end attention-based large vocabulary speech recognition. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472618"},{"key":"ref_5","unstructured":"Chan, W., Jaitly, N., Le, Q.V., and Vinyals, O. (2015). Listen, attend and spell. arXiv."},{"key":"ref_6","unstructured":"Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition. Advances in Neural Information Processing Systems, Proceedings of the 29th Annual Conference on Neural Information Processing Systems, NIPS 2015, Montreal, QC, Canada, 7\u201312 December 2015, NeurIPS."},{"key":"ref_7","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Attention is all you need.
Advances in Neural Information Processing Systems, 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4\u20139 December 2017, NeurIPS."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Karita, S., Chen, N., Hayashi, T., Hori, T., Inaguma, H., Jiang, Z., Someki, M., Soplin, N.E.Y., Yamamoto, R., and Wang, X. (2019, December 14\u201318). A comparative study on transformer vs RNN in speech applications. Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore.","DOI":"10.1109\/ASRU46091.2019.9003750"},{"key":"ref_9","unstructured":"Karita, S., Soplin, N.E.Y., Watanabe, S., Delcroix, M., Ogawa, A., and Nakatani, T. (2019, September 15\u201319). Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. Proceedings of Interspeech 2019, Graz, Austria."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Dong, L., Xu, S., and Xu, B. (2018, April 15\u201320). Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462506"},{"key":"ref_11","unstructured":"Lewis, M.P., Simons, G.F., and Fennig, C.D. (2022, December 01). Ethnologue: Languages of the World, Available online: http:\/\/www.ethnologue.com."},{"key":"ref_12","unstructured":"Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019, June 2\u20137). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA."},{"key":"ref_13","unstructured":"Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2022, December 03). Improving Language Understanding by Generative Pre-Training. Available online: https:\/\/s3-us-west-2.amazonaws.com\/openai-assets\/research-covers\/language-unsupervised\/language_understanding_paper.pdf."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Ling, S., Liu, Y., Salazar, J., and Kirchhoff, K. (2020, May 4\u20138). Deep contextualized acoustic representations for semi-supervised speech recognition. Proceedings of the ICASSP 2020\u20132020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053176"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Karita, S., Watanabe, S., Iwata, T., Delcroix, M., Ogawa, A., and Nakatani, T. (2019, May 12\u201317). Semi-supervised end-to-end speech recognition using text-to-speech and autoencoders. Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8682890"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Li, B., Sainath, T.N., Pang, R., and Wu, Z. (2019, May 12\u201317). Semi-supervised training for end-to-end models via weak distillation. Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8682172"},{"key":"ref_17","unstructured":"Caron, M., Bojanowski, P., Mairal, J., and Joulin, A. (2019, October 27\u2013November 2). Unsupervised pre-training of image features on non-curated data. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Steed, R., and Caliskan, A.
(2021, March 3\u201310). Image representations learned with unsupervised pre-training contain human-like biases. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual.","DOI":"10.1145\/3442188.3445932"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Dai, Z., Cai, B., Lin, Y., and Chen, J. (2021, June 20\u201325). UP-DETR: Unsupervised pre-training for object detection with transformers. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA.","DOI":"10.1109\/CVPR46437.2021.00165"},{"key":"ref_20","unstructured":"Oord, A., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv."},{"key":"ref_21","unstructured":"Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019, September 15\u201319). wav2vec: Unsupervised pre-training for speech recognition. Proceedings of Interspeech 2019, Graz, Austria."},{"key":"ref_22","unstructured":"Baevski, A., Schneider, S., and Auli, M. (2019). vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv."},{"key":"ref_23","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_24","unstructured":"Chung, Y.A., Hsu, W.N., Tang, H., and Glass, J. (2019, September 15\u201319). An unsupervised autoregressive model for speech representation learning. Proceedings of Interspeech 2019, Graz, Austria."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Jiang, D., Li, W., Zhang, R., Cao, M., Luo, N., Han, Y., Zou, W., Han, K., and Li, X. (2021, June 6\u201311). A further study of unsupervised pretraining for transformer based speech recognition. Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9414539"},{"key":"ref_26","unstructured":"Jiang, D., Lei, X., Li, W., Luo, N., Hu, Y., Zou, W., and Li, X. (2019). Improving transformer-based speech recognition using unsupervised pre-training. arXiv."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Bansal, S., Kamper, H., Livescu, K., Lopez, A., and Goldwater, S. (2018). Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. arXiv.","DOI":"10.21437\/Interspeech.2018-1326"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Hsu, J.Y., Chen, Y.J., and Lee, H. (2020, May 4\u20138). Meta learning for end-to-end low-resource speech recognition. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053112"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"788","DOI":"10.1109\/LSP.2021.3071668","article-title":"Efficiently fusing pretrained acoustic and linguistic encoders for low-resource speech recognition","volume":"28","author":"Yi","year":"2021","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Povey, D., Cheng, G., Wang, Y., Li, K., Xu, H., Yarmohammadi, M., and Khudanpur, S. (2018, September 2\u20136). Semi-orthogonal low-rank matrix factorization for deep neural networks.
Proceedings of Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1417"},{"key":"ref_31","unstructured":"Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with Gumbel-Softmax. arXiv."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1109\/TPAMI.2010.57","article-title":"Product quantization for nearest neighbor search","volume":"33","author":"Jegou","year":"2011","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015, April 19\u201324). Librispeech: An ASR corpus based on public domain audio books. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia.","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Khassanov, Y., Mussakhojayeva, S., Mirzakhmetov, A., Adiyev, A., Nurpeiissov, M., and Varol, H. (2020). A crowdsourced open-source Kazakh speech corpus and initial speech recognition baseline. arXiv.","DOI":"10.18653\/v1\/2021.eacl-main.58"},{"key":"ref_35","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},{"key":"ref_36","unstructured":"Zhou, S., Xu, S., and Xu, B. (2018). Multilingual end-to-end speech recognition with a single transformer on low-resource languages. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Hayashi, T., Yamamoto, R., Inoue, K., Yoshimura, T., Watanabe, S., Toda, T., Takeda, K., Zhang, Y., and Tan, X. (2020, May 4\u20138). ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053512"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., and Skerry-Ryan, R. (2018, April 15\u201320). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"ref_39","unstructured":"Ito, K., and Johnson, L. (2022, December 01). The LJ Speech Dataset. Available online: https:\/\/keithito.com\/LJ-Speech-Dataset\/."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Yamamoto, R., Song, E., and Kim, J.M. (2020, May 4\u20138). Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053795"},{"key":"ref_41","unstructured":"Heafield, K. (2011, July 30\u201331). KenLM: Faster and smaller language model queries. Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, UK."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Pratap, V., Hannun, A., Xu, Q., Cai, J., Kahn, J., Synnaeve, G., Liptchinsky, V., and Collobert, R. (2019, May 12\u201317). Wav2letter++: A fast open-source speech recognition system.
Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683535"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Jin, C., He, B., Hui, K., and Sun, L. (2018, July 15\u201320). TDNN: A two-stage deep neural network for prompt-independent automated essay scoring. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia.","DOI":"10.18653\/v1\/P18-1100"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., and Zweig, G. (2016). Achieving human parity in conversational speech recognition. arXiv.","DOI":"10.1109\/TASLP.2017.2756440"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Zhang, S., Lei, M., Yan, Z., and Dai, L. (2018, April 15\u201320). Deep-FSMN for large vocabulary continuous speech recognition. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461404"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"373","DOI":"10.1109\/LSP.2017.2723507","article-title":"Low latency acoustic modeling using temporal convolution and LSTMs","volume":"25","author":"Peddinti","year":"2017","journal-title":"IEEE Signal Process. Lett."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/2\/870\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:03:52Z","timestamp":1760119432000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/23\/2\/870"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,12]]},"references-count":46,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2023,1]]}},"alternative-id":["s23020870"],"URL":"https:\/\/doi.org\/10.3390\/s23020870","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2023,1,12]]}}}
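
Note on the pre-training objective summarized in the abstract: wav2vec 2.0-style models are optimized with a noise-contrastive task in which, at each masked time step, the context network's output must identify the true quantized latent among sampled distractors. The following is a minimal illustrative sketch of that contrastive (InfoNCE-style) loss, not the authors' implementation of wav2vec-F; the array shapes, the `temperature` and `num_distractors` values, and the uniform distractor sampling are all assumptions made for the example.

```python
# Sketch of a wav2vec 2.0-style contrastive loss over masked time steps.
# Random arrays stand in for real model outputs to keep the example self-contained.
import numpy as np

rng = np.random.default_rng(0)

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def contrastive_loss(context, quantized, masked_idx, num_distractors=10, temperature=0.1):
    """InfoNCE-style loss: context[t] must pick out quantized[t] among distractors.

    context:    (T, D) context-network outputs c_t
    quantized:  (T, D) quantized latent targets q_t
    masked_idx: indices of the masked time steps
    """
    T = context.shape[0]
    total = 0.0
    for t in masked_idx:
        # wav2vec 2.0 samples distractors from other masked steps of the same
        # utterance; for simplicity this sketch samples from all other steps.
        candidates = rng.choice([i for i in range(T) if i != t],
                                size=num_distractors, replace=False)
        sims = [cosine_sim(context[t], quantized[t])]                 # positive first
        sims += [cosine_sim(context[t], quantized[j]) for j in candidates]
        logits = np.array(sims) / temperature
        # Numerically stable log-softmax; loss is -log p(true latent).
        log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
        total += -log_probs[0]
    return total / len(masked_idx)

# Toy usage: targets correlated with the context so the loss is below chance level.
T, D = 50, 16
context = rng.standard_normal((T, D))
quantized = context + 0.1 * rng.standard_normal((T, D))
masked = rng.choice(T, size=12, replace=False)
print(f"contrastive loss: {contrastive_loss(context, quantized, masked):.3f}")
```

In the actual model the quantizer is trained jointly with the encoder via the Gumbel-Softmax relaxation and product quantization (refs 31\u201332), and the gradient flows through the context network rather than fixed feature arrays as in this sketch.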