{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,24]],"date-time":"2026-01-24T19:32:46Z","timestamp":1769283166998,"version":"3.49.0"},"reference-count":61,"publisher":"MDPI AG","issue":"10","license":[{"start":{"date-parts":[[2022,5,12]],"date-time":"2022-05-12T00:00:00Z","timestamp":1652313600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ministry of Trade, Industry &amp; Energy (MOTIE, Korea)","award":["20012260"],"award-info":[{"award-number":["20012260"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Communication has been an important aspect of human life, civilization, and globalization for thousands of years. Biometric analysis, education, security, healthcare, and smart cities are only a few examples of speech recognition applications. Most studies have mainly concentrated on English, Spanish, Japanese, or Chinese, disregarding other low-resource languages, such as Uzbek, leaving their analysis open. In this paper, we propose an End-To-End Deep Neural Network-Hidden Markov Model speech recognition model and a hybrid Connectionist Temporal Classification (CTC)-attention network for the Uzbek language and its dialects. The proposed approach reduces training time and improves speech recognition accuracy by effectively using CTC objective function in attention model training. We evaluated the linguistic and lay-native speaker performances on the Uzbek language dataset, which was collected as a part of this study. 
Experimental results show that the proposed model achieved a word error rate of 14.3% using 207 h of recordings as an Uzbek language training dataset.<\/jats:p>","DOI":"10.3390\/s22103683","type":"journal-article","created":{"date-parts":[[2022,5,12]],"date-time":"2022-05-12T23:08:36Z","timestamp":1652396916000},"page":"3683","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":64,"title":["Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language"],"prefix":"10.3390","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1438-0628","authenticated-orcid":false,"given":"Abdinabi","family":"Mukhamadiyev","sequence":"first","affiliation":[{"name":"Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0573-6303","authenticated-orcid":false,"given":"Ilyos","family":"Khujayarov","sequence":"additional","affiliation":[{"name":"Department of Information Technologies, Samarkand Branch of Tashkent University of Information Technologies Named after Muhammad al-Khwarizmi, Tashkent 140100, Uzbekistan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0478-7889","authenticated-orcid":false,"given":"Oybek","family":"Djuraev","sequence":"additional","affiliation":[{"name":"Department of Hardware and Software of Control Systems in Telecommunication, Tashkent University of Information Technologies Named after Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan"}]},{"given":"Jinsoo","family":"Cho","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Korea"}]}],"member":"1968","published-online":{"date-parts":[[2022,5,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"101055","DOI":"10.1016\/j.csl.2019.101055","article-title":"A survey on automatic speech recognition systems for Portuguese language and its 
variations","volume":"62","year":"2020","journal-title":"Comput. Speech Lang."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Chen, Y., Zhang, J., Yuan, X., Zhang, S., Chen, K., Wang, X., and Guo, S. (2021). SoK: A Modularized Approach to Study the Security of Automatic Speech Recognition Systems. arXiv.","DOI":"10.1145\/3510582"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Xia, K., Xie, X., Fan, H., and Liu, H. (2021). An Intelligent Hybrid\u2013Integrated System Using Speech Recognition and a 3D Display for Early Childhood Education. Electronics, 10.","DOI":"10.3390\/electronics10151862"},{"key":"ref_4","unstructured":"Ahmad, A., Mozelius, P., and Ahlin, K. (2021, January 20). Speech and Language Relearning for Stroke Patients-Understanding User Needs for Technology Enhancement. Proceedings of the Thirteenth International Conference on eHealth, Telemedicine, and Social Medicine (eTELEMED 2021), Nice, France."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Sodhro, A., Sennersten, C., and Ahmad, A. (2022). Towards Cognitive Authentication for Smart Healthcare Applications. Sensors, 22.","DOI":"10.3390\/s22062101"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Avazov, K., Mukhriddin, M., Fazliddin, M., and Young, I. (2021). Fire Detection Method in Smart City Environments Using a Deep-Learning-Based Approach. Electronics, 11.","DOI":"10.3390\/electronics11010073"},{"key":"ref_7","first-page":"141","article-title":"Algorithms of multidimensional signals processing based on cubic basis splines for information systems and processes","volume":"24","author":"Khamdamov","year":"2021","journal-title":"J. Appl. Sci. 
Eng."},{"key":"ref_8","first-page":"215","article-title":"Automatic recognition of Uzbek speech based on integrated neural networks","volume":"Volume 1323","author":"Musaev","year":"2021","journal-title":"World Conference Intelligent System for Industrial Automation"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"394","DOI":"10.1109\/TASLP.2022.3140552","article-title":"Optimizing Data Usage for Low-Resource Speech Recognition","volume":"30","author":"Qian","year":"2022","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Processing"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"\u015awietlicka, I., Kuniszyk-J\u00f3\u017akowiak, W., and \u015awietlicki, M. (2022). Artificial Neural Networks Combined with the Principal Component Analysis for Non-Fluent Speech Recognition. Sensors, 22.","DOI":"10.3390\/s22010321"},{"key":"ref_11","unstructured":"Templeton, G. (2021, April 21). Language Support in Voice Assistants Compared. Available online: https:\/\/summalinguae.com\/language-technology\/language-support-voice-assistants-compared\/."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.inffus.2021.10.012","article-title":"Deep learning for depression recognition with audiovisual cues: A review","volume":"80","author":"He","year":"2021","journal-title":"Inf. 
Fusion."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"163829","DOI":"10.1109\/ACCESS.2020.3020421","article-title":"Acoustic modeling based on deep learning for low-resource speech recognition: An overview","volume":"8","author":"Yu","year":"2020","journal-title":"IEEE Access"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1016\/j.specom.2022.02.005","article-title":"Unsupervised Automatic Speech Recognition: A Review","volume":"139","author":"Aldarmaki","year":"2022","journal-title":"Speech Commun."},{"key":"ref_15","first-page":"5511","article-title":"Automatic Speaker Recognition Using Mel-Frequency Cepstral Coefficients Through Machine Learning","volume":"71","author":"Ayvaz","year":"2022","journal-title":"CMC-Comput. Mater. Contin."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"2067","DOI":"10.1109\/TASLP.2021.3078883","article-title":"Audio-visual multi-channel integration and recognition of overlapped speech","volume":"29","author":"Yu","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Processing"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"572","DOI":"10.1109\/TASLP.2018.2888814","article-title":"Recurrent neural network language model adaptation for multi-genre broadcast speech recognition and alignment","volume":"27","author":"Deena","year":"2018","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Processing"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"101308","DOI":"10.1016\/j.csl.2021.101308","article-title":"Generative adversarial networks for speech processing: A review","volume":"72","author":"Wali","year":"2021","journal-title":"Comput. Speech Lang."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1385","DOI":"10.1109\/TASLP.2020.2988423","article-title":"Improving end-to-end single-channel multi-talker speech recognition","volume":"28","author":"Zhang","year":"2020","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. 
Processing"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Mukhiddinov, M. (2019, January 1\u20135). Scene Text Detection and Localization using Fully Convolutional Network. Proceedings of the 2019 International Conference on Information Science and Communications Technologies (ICISCT), Tashkent, Uzbekistan.","DOI":"10.1109\/ICISCT47635.2019.9012021"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1105","DOI":"10.1016\/j.csl.2013.02.003","article-title":"Two-stage intonation modeling using feedforward neural networks for syllable based text-to-speech synthesis","volume":"27","author":"Reddy","year":"2013","journal-title":"Comput. Speech Lang."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1549","DOI":"10.1109\/TASLP.2020.2993152","article-title":"Speech\/Music Classification Using Features from Spectral Peaks","volume":"28","author":"Bhattacharjee","year":"2020","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Processing"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1987","DOI":"10.1109\/TASLP.2021.3082307","article-title":"Receptive Field Regularization Techniques for Audio Classification and Tagging with Deep Convolutional Neural Networks","volume":"29","author":"Koutini","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"200395","DOI":"10.1109\/ACCESS.2020.3034762","article-title":"Optimizing arabic speech distinctive phonetic features and phoneme recognition using genetic algorithm","volume":"8","author":"Ibrahim","year":"2020","journal-title":"IEEE Access"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Mukhiddinov, M., Akmuradov, B., and Djuraev, O. (2019, January 1\u20135). Robust text recognition for Uzbek language in natural scene images. 
Proceedings of the 2019 International Conference on Information Science and Communications Technologies (ICISCT), Tashkent, Uzbekistan.","DOI":"10.1109\/ICISCT47635.2019.9011892"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"2986","DOI":"10.1109\/TASLP.2021.3110146","article-title":"FluentNet: End-to-End Detection of Stuttered Speech Disfluencies with Deep Learning","volume":"29","author":"Kourkounakis","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"3650","DOI":"10.1007\/s00034-016-0476-3","article-title":"Parameterization of Excitation Signal for Improving the Quality of HMM-Based Speech Synthesis System","volume":"36","author":"Narendra","year":"2017","journal-title":"Circuits Syst Signal Process."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","article-title":"Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups","volume":"29","author":"Hinton","year":"2012","journal-title":"IEEE Signal Processing Mag."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"939","DOI":"10.21437\/Interspeech.2017-233","article-title":"A Comparison of Sequence-to-Sequence Models for Speech Recognition","volume":"2017","author":"Prabhavalkar","year":"2017","journal-title":"Interspeech"},{"key":"ref_30","unstructured":"Rao, K., Sak, H., and Prabhavalkar, R. (2017, January 16\u201320). Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"He, Y., Sainath, T.N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., and Gruenstein, A. (2019, January 12\u201317). Streaming end-to-end speech recognition for mobile devices. 
Proceedings of the ICASSP 2019\u20142019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8682336"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Li, J., Zhao, R., Meng, Z., Liu, Y., Wei, W., Parthasarathy, S., and Gong, Y. (2020). Developing RNN-T models surpassing high-performance hybrid models with customization capability. arXiv.","DOI":"10.21437\/Interspeech.2020-3016"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Musaev, M., Mussakhojayeva, S., Khujayorov, I., Khassanov, Y., Ochilov, M., and Atakan Varol, H. (2021). USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition Experiments. arXiv.","DOI":"10.1007\/978-3-030-87802-3_40"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Giannakopoulos, T. (2015). Pyaudioanalysis: An open-source python library for audio signal analysis. PLoS ONE, 10.","DOI":"10.1371\/journal.pone.0144610"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Khamdamov, U., Mukhiddinov, M., Akmuradov, B., and Zarmasov, E. (2020, January 4\u20136). A Novel Algorithm of Numbers to Text Conversion for Uzbek Language TTS Synthesizer. Proceedings of the 2020 International Conference on Information Science and Communications Technologies (ICISCT), Tashkent, Uzbekistan.","DOI":"10.1109\/ICISCT50599.2020.9351434"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"2050052","DOI":"10.1142\/S0219691320500526","article-title":"Improvement of the end-to-end scene text recognition method for \u201ctext-to-speech\u201d conversion","volume":"18","author":"Makhmudov","year":"2020","journal-title":"Int. J. Wavelets Multiresolution Inf. Process."},{"key":"ref_37","unstructured":"Glorot, X., and Bengio, Y. (2010, January 13\u201315). Understanding the difficulty of training deep feedforward neural networks. 
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"342","DOI":"10.1109\/RBME.2020.3006860","article-title":"Speech technology for healthcare: Opportunities, challenges, and state of the art","volume":"14","author":"Latif","year":"2020","journal-title":"IEEE Rev. Biomed. Eng."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Latif, S., Rana, R., Khalifa, S., Jurdak, R., and Epps, J. (2019, January 15\u201319). Direct modelling of speech emotion from raw speech. Proceedings of Interspeech 2019, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-3252"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Palaz, D., Doss, M.M., and Collobert, R. (2015, January 19\u201324). Convolutional neural networks-based continuous speech recognition using raw speech signal. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Queensland, Australia.","DOI":"10.1109\/ICASSP.2015.7178781"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Muckenhirn, H., Doss, M.M., and Marcel, S. (2018, January 15\u201320). Towards directly modeling raw speech signal for speaker verification using CNNs. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462165"},{"key":"ref_42","first-page":"1261","article-title":"A hybrid of deep CNN and bidirectional LSTM for automatic speech recognition","volume":"29","author":"Passricha","year":"2020","journal-title":"J. Intell. 
Syst."},{"key":"ref_43","first-page":"3092","article-title":"Neural network acoustic models for the DARPA RATS program","volume":"2013","author":"Soltau","year":"2013","journal-title":"Interspeech"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Mamyrbayev, O., Turdalyuly, M., Mekebayev, N., Alimhan, K., Kydyrbekova, A., and Turdalykyzy, T. (2019). Automatic recognition of Kazakh speech using deep neural networks. Asian Conference on Intelligent Information and Database Systems, Yogyakarta, Indonesia, 8\u201311 April 2019, Springer.","DOI":"10.1007\/978-3-030-14802-7_40"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Mamyrbayev, O., Oralbekova, D., Kydyrbekova, A., Turdalykyzy, T., and Bekarystankyzy, A. (2021, January 25\u201327). End-to-End Model Based on RNN-T for Kazakh Speech Recognition. Proceedings of the 2021 3rd International Conference on Computer Communication and the Internet (ICCCI), Nagoya, Japan.","DOI":"10.1109\/ICCCI51764.2021.9486811"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Khassanov, Y., Mussakhojayeva, S., Mirzakhmetov, A., Adiyev, A., Nurpeiissov, M., and Varol, H.A. (2020). A crowdsourced open-source Kazakh speech corpus and initial speech recognition baseline. arXiv.","DOI":"10.18653\/v1\/2021.eacl-main.58"},{"key":"ref_47","unstructured":"Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv."},{"key":"ref_48","unstructured":"Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., and Weber, G. (2019). Common voice: A massively-multilingual speech corpus. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"101272","DOI":"10.1016\/j.csl.2021.101272","article-title":"Arabic speech recognition by end-to-end, modular systems and human","volume":"71","author":"Hussein","year":"2022","journal-title":"Comput. 
Speech Lang."},{"key":"ref_50","first-page":"2751","article-title":"Purely sequence-trained neural networks for ASR based on lattice-free MMI","volume":"2016","author":"Povey","year":"2016","journal-title":"Interspeech"},{"key":"ref_51","unstructured":"Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., and Schwarz, P. (2011, January 11\u201315). The Kaldi speech recognition toolkit. Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), no. CONF, Waikoloa, HI, USA."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Povey, D., Hadian, H., Ghahremani, P., Li, K., and Khudanpur, S. (2018, January 15\u201320). A time-restricted self-attention layer for ASR. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462497"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Ghahremani, P., BabaAli, B., Povey, D., Riedhammer, K., Trmal, J., and Khudanpur, S. (2014, January 4\u20139). A pitch extraction algorithm tuned for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6854049"},{"key":"ref_54","first-page":"1021","article-title":"Rapid Collection of Spontaneous Speech Corpora Using Telephonic Community Forums","volume":"2018","author":"Raza","year":"2018","journal-title":"Interspeech"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Xiao, Z., Ou, Z., Chu, W., and Lin, H. (2018, January 26\u201329). Hybrid CTC-Attention based end-to-end speech recognition using subword units. 
Proceedings of the 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taiwan, China.","DOI":"10.1109\/ISCSLP.2018.8706675"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Chan, W., Jaitly, N., Le, Q., and Vinyals, O. (2016, January 20\u201325). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472621"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015, January 6\u201310). Audio augmentation for speech recognition. Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany.","DOI":"10.21437\/Interspeech.2015-711"},{"key":"ref_58","first-page":"2613","article-title":"SpecAugment: A simple data augmentation method for automatic speech recognition","volume":"2019","author":"Park","year":"2019","journal-title":"Interspeech"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Pappagari, R., Zelasko, P., Villalba, J., Carmiel, Y., and Dehak, N. (2019, January 14\u201318). Hierarchical transformers for long document classification. Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore.","DOI":"10.1109\/ASRU46091.2019.9003958"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Kudo, T. (2018). Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv.","DOI":"10.18653\/v1\/P18-1007"},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Mamatov, N.S., Niyozmatova, N.A., Abdullaev, S.S., Samijonov, A.N., and Erejepov, K.K. (2021, January 3\u20135). Speech Recognition Based on Transformer Neural Networks. 
Proceedings of the 2021 International Conference on Information Science and Communications Technologies (ICISCT), Tashkent, Uzbekistan.","DOI":"10.1109\/ICISCT52966.2021.9670093"}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/10\/3683\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:09:41Z","timestamp":1760137781000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/10\/3683"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,12]]},"references-count":61,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2022,5]]}},"alternative-id":["s22103683"],"URL":"https:\/\/doi.org\/10.3390\/s22103683","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,12]]}}}