{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T05:12:37Z","timestamp":1760073157785,"version":"build-2065373602"},"reference-count":201,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,10,4]],"date-time":"2025-10-04T00:00:00Z","timestamp":1759536000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["www.mdpi.com"],"crossmark-restriction":true},"short-container-title":["Informatics"],"abstract":"<jats:p>Automatic speech recognition (ASR) has advanced rapidly, evolving from early template-matching systems to modern deep learning frameworks. This review systematically traces ASR\u2019s technological evolution across four phases: the template-based era, statistical modeling approaches, the deep learning revolution, and the emergence of large-scale models under diverse learning paradigms. We analyze core technologies such as hidden Markov models (HMMs), Gaussian mixture models (GMMs), recurrent neural networks (RNNs), and recent architectures including Transformer-based models and Wav2Vec 2.0. Beyond algorithmic development, we examine how ASR integrates into intelligent information systems, analyzing real-world applications in healthcare, education, smart homes, enterprise systems, and automotive domains with attention to deployment considerations and system design. We also address persistent challenges\u2014noise robustness, low-resource adaptation, and deployment efficiency\u2014while exploring emerging solutions such as multimodal fusion, privacy-preserving modeling, and lightweight architectures. 
Finally, we outline future research directions to guide the development of robust, scalable, and intelligent ASR systems for complex, evolving environments.<\/jats:p>","DOI":"10.3390\/informatics12040107","type":"journal-article","created":{"date-parts":[[2025,10,6]],"date-time":"2025-10-06T08:10:51Z","timestamp":1759738251000},"page":"107","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Integrating Speech Recognition into Intelligent Information Systems: From Statistical Models to Deep Learning"],"prefix":"10.3390","volume":"12","author":[{"given":"Chaoji","family":"Wu","sequence":"first","affiliation":[{"name":"College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518122, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7965-5058","authenticated-orcid":false,"given":"Yi","family":"Pan","sequence":"additional","affiliation":[{"name":"College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518122, China"}]},{"given":"Haipan","family":"Wu","sequence":"additional","affiliation":[{"name":"College of Physics and Opto-Electronic Engineering, Shenzhen University, Shenzhen 518060, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0019-9761","authenticated-orcid":false,"given":"Lei","family":"Ning","sequence":"additional","affiliation":[{"name":"College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518122, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,10,4]]},"reference":[{"key":"ref_1","first-page":"1877","article-title":"Language Models Are Few-Shot Learners","volume":"33","author":"Brown","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Li, J., Wu, Y., Gaur, Y., Wang, C., Zhao, R., and Liu, S. (2020). On the comparison of popular end-to-end models for large scale speech recognition. 
arXiv.","DOI":"10.21437\/Interspeech.2020-2846"},{"key":"ref_3","unstructured":"Tsai, Y.-H.H., Ma, M.Q., Yang, M., Zhao, H., Morency, L.-P., and Salakhutdinov, R. (2021). Self-supervised representation learning with relative predictive coding. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Guliani, D., Beaufays, F., and Motta, G. (2021, January 6\u201311). Training speech recognition models with federated learning: A quality\/cost framework. Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9413397"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Nguyen, T., Mdhaffar, S., Tomashenko, N., Bonastre, J.-F., and Est\u00e8ve, Y. (2023, January 4\u201310). Federated learning for ASR based on wav2vec 2.0. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10096426"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Xu, M., Song, C., Tian, Y., Agrawal, N., Granqvist, F., van Dalen, R., Zhang, X., Argueta, A., Han, S., and Deng, Y. (2023, January 4\u201310). Training large-vocabulary neural language models by private federated learning for resource-constrained devices. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece.","DOI":"10.1109\/ICASSP49357.2023.10096570"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Hegdepatil, P., and Davuluri, K. (2021, January 28\u201329). Business intelligence based novel marketing strategy approach using automatic speech recognition and text summarization. 
Proceedings of the 2021 2nd International Conference on Computing and Data Science (CDS), Stanford, CA, USA.","DOI":"10.1109\/CDS52072.2021.00108"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1016\/j.specom.2013.07.005","article-title":"Acoustic modelling for speech recognition in Indian languages in an agricultural commodities task domain","volume":"56","author":"Mohan","year":"2014","journal-title":"Speech Commun."},{"key":"ref_9","first-page":"1980","article-title":"Combined automatic speech recognition and machine translation in business correspondence domain for English-Croatian","volume":"8","author":"Seljan","year":"2014","journal-title":"Int. J. Ind. Syst. Eng."},{"key":"ref_10","first-page":"88","article-title":"Industrial applications of automatic speech recognition systems","volume":"6","author":"Vajpai","year":"2016","journal-title":"Int. J. Eng. Res. Appl."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"102422","DOI":"10.1016\/j.inffus.2024.102422","article-title":"Automatic speech recognition using advanced deep learning approaches: A survey","volume":"109","author":"Kheddar","year":"2024","journal-title":"Inf. Fusion"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"23367","DOI":"10.1007\/s11042-023-16438-y","article-title":"A comprehensive survey on automatic speech recognition using neural networks","volume":"83","author":"Dhanjal","year":"2024","journal-title":"Multimed. Tools Appl."},{"key":"ref_13","first-page":"101572","article-title":"Trends and developments in automatic speech recognition research","volume":"84","author":"Zahorian","year":"2023","journal-title":"Comput. Speech Lang."},{"key":"ref_14","first-page":"100","article-title":"A survey on end-to-end speech recognition systems","volume":"5","author":"Khapra","year":"2024","journal-title":"Int. J. Comput. Inf. Technol."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Kumar, A., Verma, S., and Mangla, H. 
(2018, January 12\u201313). A survey of deep learning techniques in speech recognition. Proceedings of the 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India.","DOI":"10.1109\/ICACCCN.2018.8748399"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"9411","DOI":"10.1007\/s11042-020-10073-7","article-title":"Automatic speech recognition: A survey","volume":"80","author":"Malik","year":"2021","journal-title":"Multimed. Tools Appl."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"325","DOI":"10.1109\/TASLP.2023.3328283","article-title":"End-to-end speech recognition: A survey","volume":"32","author":"Prabhavalkar","year":"2023","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"637","DOI":"10.1121\/1.1906946","article-title":"Automatic recognition of spoken digits","volume":"24","author":"Davis","year":"1952","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1480","DOI":"10.1121\/1.1907653","article-title":"Results obtained from a vowel recognition computer program","volume":"31","author":"Forgie","year":"1959","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1072","DOI":"10.1121\/1.1908561","article-title":"Phonetic typewriter","volume":"28","author":"Olson","year":"1956","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1610","DOI":"10.1121\/1.1908515","article-title":"Phonetic typewriter III","volume":"33","author":"Olson","year":"1961","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"637","DOI":"10.1121\/1.1912679","article-title":"Speech Analysis and Synthesis by Linear Prediction of the Speech Wave","volume":"50","author":"Atal","year":"1971","journal-title":"J. Acoust. Soc. 
Am."},{"key":"ref_23","first-page":"193","article-title":"Recognition of Japanese vowels","volume":"8","author":"Suzuki","year":"1961","journal-title":"J. Radio Res. Lab."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1664","DOI":"10.1121\/1.1936652","article-title":"Phonetic typewriter","volume":"33","author":"Sakai","year":"1961","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1109\/TASSP.1975.1162641","article-title":"Minimum prediction residual principle applied to speech recognition","volume":"23","author":"Itakura","year":"2003","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1109\/TASSP.1978.1163055","article-title":"Dynamic programming algorithm optimization for spoken word recognition","volume":"26","author":"Sakoe","year":"1978","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1345","DOI":"10.1121\/1.381666","article-title":"Review of the ARPA speech understanding project","volume":"62","author":"Klatt","year":"1977","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_28","unstructured":"Rabiner, L., and Juang, B.-H. (1993). Fundamentals of Speech Recognition, Prentice-Hall, Inc."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"284","DOI":"10.1109\/TASSP.1981.1163527","article-title":"A level building dynamic time warping algorithm for connected word recognition","volume":"29","author":"Myers","year":"2003","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_30","unstructured":"Myers, C., Rabiner, L., and Rosenberg, A. (1980, January 9\u201311). An investigation of the use of dynamic time warping for word spotting and connected speech recognition. 
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Denver, CO, USA."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"1649","DOI":"10.1109\/29.46547","article-title":"A frame-synchronous network search algorithm for connected word recognition","volume":"37","author":"Lee","year":"1989","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_32","unstructured":"Bridle, J.S., Brown, M., and Chamberlain, R. (1982, January 3\u20135). An Algorithm for Connected Word Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Paris, France."},{"key":"ref_33","unstructured":"Lowerre, B.T. (1976). The Harpy Speech Recognition System, Carnegie Mellon University."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1109\/MASSP.1986.1165342","article-title":"An introduction to hidden Markov models","volume":"3","author":"Rabiner","year":"1986","journal-title":"IEEE ASSP Mag."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"268","DOI":"10.1109\/PROC.1973.9030","article-title":"The viterbi algorithm","volume":"61","author":"Forney","year":"2005","journal-title":"Proc. IEEE"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"251","DOI":"10.1080\/00401706.1991.10484833","article-title":"Hidden Markov models for speech recognition","volume":"33","author":"Juang","year":"1991","journal-title":"Technometrics"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1109\/79.410439","article-title":"Implementing the Viterbi algorithm","volume":"12","author":"Lou","year":"1995","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1109\/5.18626","article-title":"A tutorial on hidden Markov models and selected applications in speech recognition","volume":"77","author":"Rabiner","year":"1989","journal-title":"Proc. 
IEEE"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"291","DOI":"10.1109\/89.279278","article-title":"Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains","volume":"2","author":"Gauvain","year":"1994","journal-title":"IEEE Trans. Speech Audio Process."},{"key":"ref_40","unstructured":"Milvus (2025, August 20). What Is the History of Speech Recognition Technology?. Available online: https:\/\/milvus.io\/ai-quick-reference\/what-is-the-history-of-speech-recognition-technology."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"30","DOI":"10.1109\/TASL.2011.2134090","article-title":"Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition","volume":"20","author":"Dahl","year":"2011","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1109\/MSP.2012.2205597","article-title":"Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups","volume":"29","author":"Hinton","year":"2012","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Graves, A., Mohamed, A.-r., and Hinton, G. (2013, January 26\u201331). Speech recognition with deep recurrent neural networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"ref_44","unstructured":"Sutskever, I., Martens, J., and Hinton, G.E. (July, January 28). Generating text with recurrent neural networks. Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA."},{"key":"ref_45","unstructured":"Bourlard, H.A., and Morgan, N. (2012). 
Connectionist Speech Recognition: A Hybrid Approach, Springer Science & Business Media."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1016\/S0925-2312(00)00308-8","article-title":"A survey of hybrid ANN\/HMM models for automatic speech recognition","volume":"37","author":"Trentin","year":"2001","journal-title":"Neurocomputing"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"1527","DOI":"10.1162\/neco.2006.18.7.1527","article-title":"A fast learning algorithm for deep belief nets","volume":"18","author":"Hinton","year":"2006","journal-title":"Neural Comput."},{"key":"ref_48","unstructured":"Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., and Coates, A. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Graves, A., Eck, D., Beringer, N., and Schmidhuber, J. (2004, January 29\u201330). Biologically plausible speech recognition with LSTM neural nets. Proceedings of the Biologically Inspired Approaches to Advanced Information Technology: First International Workshop (BioADIT), Lausanne, Switzerland.","DOI":"10.1007\/978-3-540-27835-1_10"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"19143","DOI":"10.1109\/ACCESS.2019.2896880","article-title":"Speech recognition using deep neural networks: A systematic review","volume":"7","author":"Nassif","year":"2019","journal-title":"IEEE Access"},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"2554","DOI":"10.1073\/pnas.79.8.2554","article-title":"Neural networks and physical systems with emergent collective computational abilities","volume":"79","author":"Hopfield","year":"1982","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1985). 
Learning Internal Representations by Error Propagation, Institute for Cognitive Science, University of California.","DOI":"10.21236\/ADA164453"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"24961","DOI":"10.1007\/s00521-023-08462-8","article-title":"Exploration of English speech translation recognition based on the LSTM RNN algorithm","volume":"35","author":"Yuan","year":"2023","journal-title":"Neural Comput. Appl."},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"157","DOI":"10.1109\/72.279181","article-title":"Learning Long-Term Dependencies with Gradient Descent Is Difficult","volume":"5","author":"Bengio","year":"1994","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_56","unstructured":"Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., and Chen, G. (2016, January 19\u201324). Deep Speech 2: End-to-end Speech Recognition in English and Mandarin. Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA."},{"key":"ref_57","unstructured":"Ioffe, S., and Szegedy, C. (2015, January 6\u201311). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning (ICML), Lille, France."},{"key":"ref_58","first-page":"27403","article-title":"DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1","volume":"93","author":"Garofolo","year":"1993","journal-title":"NASA STI\/Recon Tech. Rep. N"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Cho, K., Van Merri\u00ebnboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). 
Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Cho, K., Van Merri\u00ebnboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv.","DOI":"10.3115\/v1\/W14-4012"},{"key":"ref_61","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv."},{"key":"ref_62","doi-asserted-by":"crossref","first-page":"1035","DOI":"10.35378\/gujs.816499","article-title":"Turkish speech recognition techniques and applications of recurrent units (LSTM and GRU)","volume":"34","author":"Erdem","year":"2021","journal-title":"Gazi Univ. J. Sci."},{"key":"ref_63","unstructured":"Graves, A., and Jaitly, N. (2014, January 21\u201326). Towards end-to-end speech recognition with recurrent neural networks. Proceedings of the International Conference on Machine Learning (ICML), Beijing, China."},{"key":"ref_64","unstructured":"Hau, D., and Chen, K. (2011, January 7\u20139). Exploring hierarchical speech representations with a deep convolutional neural network. Proceedings of the 11th Annual Workshop on Computational Intelligence (UKCI), Manchester, UK."},{"key":"ref_65","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. IEEE"},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"1533","DOI":"10.1109\/TASLP.2014.2339736","article-title":"Convolutional Neural Networks for Speech Recognition","volume":"22","author":"Mohamed","year":"2014","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_67","unstructured":"Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K.J. (2013). 
Phoneme Recognition Using Time-Delay Neural Networks. Backpropagation, Psychology Press."},{"key":"ref_68","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1016\/j.neunet.2014.08.005","article-title":"Deep convolutional neural networks for large-scale speech tasks","volume":"64","author":"Sainath","year":"2015","journal-title":"Neural Netw."},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Shon, S., Ali, A., and Glass, J. (2018). Convolutional neural networks and language embeddings for end-to-end dialect recognition. arXiv.","DOI":"10.21437\/Odyssey.2018-14"},{"key":"ref_70","first-page":"10","article-title":"Kurdish dialect recognition using 1D CNN","volume":"9","author":"Ghafoor","year":"2021","journal-title":"ARO Sci. J. Koya Univ."},{"key":"ref_71","doi-asserted-by":"crossref","unstructured":"Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., and Wu, Y. (2020). Conformer: Convolution-augmented transformer for speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"ref_72","first-page":"1261","article-title":"A hybrid of deep CNN and bidirectional LSTM for automatic speech recognition","volume":"29","author":"Passricha","year":"2019","journal-title":"J. Intell. Syst."},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Graves, A., Fern\u00e1ndez, S., Gomez, F., and Schmidhuber, J. (2006, January 25\u201329). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, PA, USA.","DOI":"10.1145\/1143844.1143891"},{"key":"ref_74","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv."},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Chan, W., Jaitly, N., Le, Q., and Vinyals, O. (2016, January 20\u201325). 
Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472621"},{"key":"ref_76","doi-asserted-by":"crossref","unstructured":"Sak, H., Shannon, M., Rao, K., and Beaufays, F. (2017, January 20\u201324). Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping. Proceedings of the Interspeech 2017, Stockholm, Sweden.","DOI":"10.21437\/Interspeech.2017-1705"},{"key":"ref_77","first-page":"5074","article-title":"An online sequence-to-sequence model using partial conditioning","volume":"29","author":"Jaitly","year":"2016","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_78","doi-asserted-by":"crossref","unstructured":"Graves, A. (2012). Sequence transduction with recurrent neural networks. arXiv.","DOI":"10.1007\/978-3-642-24797-2"},{"key":"ref_79","unstructured":"Chiu, C.-C., and Raffel, C. (2017). Monotonic chunkwise attention. arXiv."},{"key":"ref_80","doi-asserted-by":"crossref","unstructured":"Chanchaochai, N., Cieri, C., Debrah, J., Liberman, M., Graff, D., Lee, J., Walker, K., Walter, T., and Wu, J. (2018, January 2\u20136). GlobalTIMIT: Acoustic-Phonetic Datasets for the World\u2019s Languages. Proceedings of the INTERSPEECH 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1185"},{"key":"ref_81","first-page":"1","article-title":"The digitization of the world from edge to core","volume":"16","author":"Rydning","year":"2018","journal-title":"Framingham: Int. Data Corp."},{"key":"ref_82","doi-asserted-by":"crossref","first-page":"8","DOI":"10.1109\/MIS.2009.36","article-title":"The unreasonable effectiveness of data","volume":"24","author":"Halevy","year":"2009","journal-title":"IEEE Intell. 
Syst."},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s40537-014-0007-7","article-title":"Deep learning applications and challenges in big data analytics","volume":"2","author":"Najafabadi","year":"2015","journal-title":"J. Big Data"},{"key":"ref_84","first-page":"6000","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_85","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 26\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_86","unstructured":"Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). Layer Normalization. arXiv."},{"key":"ref_87","first-page":"577","article-title":"Attention-based models for speech recognition","volume":"28","author":"Chorowski","year":"2015","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_88","doi-asserted-by":"crossref","unstructured":"Moritz, N., Hori, T., and Le, J. (2020, January 4\u20138). Streaming automatic speech recognition with the transformer model. Proceedings of the ICASSP 2020\u20142020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual.","DOI":"10.1109\/ICASSP40776.2020.9054476"},{"key":"ref_89","unstructured":"Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating long sequences with sparse transformers. arXiv."},{"key":"ref_90","unstructured":"Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020, January 13\u201318). Transformers are rnns: Fast autoregressive transformers with linear attention. 
Proceedings of the International Conference on Machine Learning (ICML), Virtual."},{"key":"ref_91","unstructured":"Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., and Kaiser, L. (2020). Rethinking attention with performers. arXiv."},{"key":"ref_92","unstructured":"Wang, S., Li, B.Z., Khabsa, M., Fang, H., and Ma, H. (2020). Linformer: Self-attention with linear complexity. arXiv."},{"key":"ref_93","doi-asserted-by":"crossref","unstructured":"Han, W., Zhang, Z., Zhang, Y., Yu, J., Chiu, C.-C., Qin, J., Gulati, A., Pang, R., and Wu, Y. (2020). Contextnet: Improving convolutional neural networks for automatic speech recognition with global context. arXiv.","DOI":"10.21437\/Interspeech.2020-2059"},{"key":"ref_94","unstructured":"Peng, Y., Dalmia, S., Lane, I., and Watanabe, S. (2022, January 17\u201323). Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding. Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA."},{"key":"ref_95","doi-asserted-by":"crossref","unstructured":"Zhang, Q., Lu, H., Sak, H., Tripathi, A., McDermott, E., Koo, S., and Kumar, S. (2020, January 4\u20138). Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053896"},{"key":"ref_96","doi-asserted-by":"crossref","unstructured":"Dong, L., Xu, S., and Xu, B. (2018, January 15\u201320). Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. 
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8462506"},{"key":"ref_97","doi-asserted-by":"crossref","unstructured":"Paul, D.B., and Baker, J. (1992, January 23\u201326). The design for the Wall Street Journal-based CSR corpus. Proceedings of the Speech and Natural Language: Proceedings of a Workshop, Harriman, New York, NY, USA.","DOI":"10.3115\/1075527.1075614"},{"key":"ref_98","doi-asserted-by":"crossref","unstructured":"Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015, January 19\u201324). Librispeech: An ASR corpus based on public domain audio books. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia.","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"ref_99","unstructured":"(2025, August 14). Hugging Face. Wav2Vec2-Conformer. Available online: https:\/\/huggingface.co\/docs\/transformers\/model_doc\/wav2vec2-conformer."},{"key":"ref_100","doi-asserted-by":"crossref","unstructured":"Chen, Z., Ramabhadran, B., Biadsy, F., Zhang, X., Chen, Y., Jiang, L., and Moreno, P.J. (September, January 30). Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech. Proceedings of the INTERSPEECH 2021, Brno, Czech Republic.","DOI":"10.21437\/Interspeech.2021-676"},{"key":"ref_101","unstructured":"Google Cloud (2025, August 20). Migrate from classic to Conformer Models. Available online: https:\/\/cloud.google.com\/speech-to-text\/docs\/conformer-migration."},{"key":"ref_102","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1214\/aoms\/1177729586","article-title":"A stochastic approximation method","volume":"22","author":"Robbins","year":"1951","journal-title":"Ann. Math. 
Stat."},{"key":"ref_103","doi-asserted-by":"crossref","unstructured":"Gu, Y., Shivakumar, P.G., Kolehmainen, J., Brusco, P., Sim, K.C., Ramabhadran, B., and Picheny, M. (2023). Scaling Laws for Discriminative Speech Recognition Rescoring Models. arXiv.","DOI":"10.21437\/Interspeech.2023-2128"},{"key":"ref_104","unstructured":"Subbaswamy, A., and Saria, S. (2018, January 6\u201310). Counterfactual Normalization: Proactively Addressing Dataset Shift Using Causal Mechanisms. Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Monterey, CA, USA."},{"key":"ref_105","unstructured":"Xu, K.-T., Xie, F.-L., Tang, X., and Hu, Y. (2025). FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration. arXiv."},{"key":"ref_106","unstructured":"Bai, Y., Chen, J., Chen, J., Chen, W., Chen, Z., Ding, C., Dong, L., Dong, Q., Du, Y., and Gao, K. (2024). Seed-asr: Understanding Diverse Speech and Contexts with LLM-Based Speech Recognition. arXiv."},{"key":"ref_107","unstructured":"Shakhadri, S.A.G., Kr, K., and Angadi, K.B. (2025). Samba-asr state-of-the-art speech recognition leveraging structured state-space models. arXiv."},{"key":"ref_108","unstructured":"Hwang, D. (2024). FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information. arXiv."},{"key":"ref_109","unstructured":"Zhang, Y., Qin, J., Park, D.S., Han, W., Chiu, C.-C., Pang, R., Le, Q.V., and Wu, Y. (2020). Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv."},{"key":"ref_110","doi-asserted-by":"crossref","unstructured":"Chung, Y.-A., Zhang, Y., Han, W., Chiu, C.-C., Qin, J., Pang, R., and Wu, Y. (2021, January 13\u201317). W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. 
Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia.","DOI":"10.1109\/ASRU51503.2021.9688253"},{"key":"ref_111","doi-asserted-by":"crossref","unstructured":"Rekesh, D., Koluguri, N.R., Kriman, S., Majumdar, S., Noroozi, V., Huang, H., Hrinchuk, O., Puvvada, K., Kumar, A., and Balam, J. (2023, January 16\u201320). Fast conformer with linearly scalable attention for efficient speech recognition. Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan.","DOI":"10.1109\/ASRU57964.2023.10389701"},{"key":"ref_112","doi-asserted-by":"crossref","unstructured":"Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Conneau, A., Collobert, R., Synnaeve, G., and Auli, M. (2021, January 6\u201311). Self-training and pre-training are complementary for speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual.","DOI":"10.1109\/ICASSP39728.2021.9414641"},{"key":"ref_113","doi-asserted-by":"crossref","unstructured":"Park, D.S., Zhang, Y., Jia, Y., Han, W., Chiu, C.-C., Li, B., Wu, Y., and Le, Q.V. (2020). Improved noisy student training for automatic speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2020-1470"},{"key":"ref_114","unstructured":"Chan, W., Park, D., Lee, C., Zhang, Y., Le, Q., and Norouzi, M. (2021). Speechstew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network. arXiv."},{"key":"ref_115","doi-asserted-by":"crossref","unstructured":"Pan, J., Shapiro, J., Wohlwend, J., Han, K.J., Lei, T., and Ma, T. (2020). ASAPP-ASR: Multistream CNN and self-attentive SRU for SOTA speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2020-2947"},{"key":"ref_116","doi-asserted-by":"crossref","unstructured":"Fathullah, Y., Wu, C., Shangguan, Y., Jia, J., Xiong, W., Mahadeokar, J., Liu, C., Shi, Y., Kalinli, O., and Seltzer, M. (2023). 
Multi-head state space model for speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2023-1036"},{"key":"ref_117","first-page":"12449","article-title":"wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations","volume":"33","author":"Baevski","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_118","doi-asserted-by":"crossref","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","article-title":"Hubert: Self-supervised speech representation learning by masked prediction of hidden units","volume":"29","author":"Hsu","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_119","doi-asserted-by":"crossref","first-page":"1505","DOI":"10.1109\/JSTSP.2022.3188113","article-title":"Wavlm: Large-scale self-supervised pre-training for full stack speech processing","volume":"16","author":"Chen","year":"2022","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_120","doi-asserted-by":"crossref","unstructured":"Kim, K., Wu, F., Peng, Y., Pan, J., Sridhar, P., Han, K.J., and Watanabe, S. (2022, January 9\u201312). E-branchformer: Branchformer with enhanced merging for speech recognition. Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar.","DOI":"10.1109\/SLT54892.2023.10022656"},{"key":"ref_121","unstructured":"Yao, Z., Kang, W., Yang, X., Kuang, F., Guo, L., Zhu, H., Jin, Z., Li, Z., Lin, L., and Povey, D. (2024). CR-CTC: Consistency regularization on CTC for improved speech recognition. arXiv."},{"key":"ref_122","unstructured":"Akmal, H.M., Chao, X., and Mehdi, R. (2021). Transformer-based ASR incorporating time-reduction layer and fine-tuning with self-knowledge distillation. arXiv."},{"key":"ref_123","doi-asserted-by":"crossref","unstructured":"Liu, C., Zhang, F., Le, D., Kim, S., Saraf, Y., and Zweig, G. (2021, January 19\u201322). Improving RNN transducer based ASR with auxiliary tasks. 
Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China.","DOI":"10.1109\/SLT48900.2021.9383548"},{"key":"ref_124","unstructured":"Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., and Auli, M. (2022, January 17\u201323). Data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language. Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA."},{"key":"ref_125","doi-asserted-by":"crossref","unstructured":"Xu, Q., Likhomanenko, T., Kahn, J., Hannun, A., Synnaeve, G., and Collobert, R. (2020). Iterative pseudo-labeling for speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2020-1800"},{"key":"ref_126","unstructured":"Synnaeve, G., Xu, Q., Kahn, J., Likhomanenko, T., Grave, E., Pratap, V., Sriram, A., Liptchinsky, V., and Collobert, R. (2019). End-to-end asr: From supervised to semi-supervised learning with modern architectures. arXiv."},{"key":"ref_127","doi-asserted-by":"crossref","unstructured":"Zhang, F., Wang, Y., Zhang, X., Liu, C., Saraf, Y., and Zweig, G. (2020). Faster, simpler and more accurate hybrid asr systems using wordpieces. arXiv.","DOI":"10.21437\/Interspeech.2020-1995"},{"key":"ref_128","doi-asserted-by":"crossref","unstructured":"Nartey, O.T., Yang, G., Asare, S.K., Wu, J., and Frempong, L.N. (2020). Robust semi-supervised traffic sign recognition via self-training and weakly-supervised learning. Sensors, 20.","DOI":"10.3390\/s20092684"},{"key":"ref_129","doi-asserted-by":"crossref","unstructured":"Souly, N., Spampinato, C., and Shah, M. (2017). Semi and weakly supervised semantic segmentation using generative adversarial network. arXiv.","DOI":"10.1109\/ICCV.2017.606"},{"key":"ref_130","doi-asserted-by":"crossref","first-page":"549","DOI":"10.1049\/cit2.12216","article-title":"Weakly supervised machine learning","volume":"8","author":"Ren","year":"2023","journal-title":"CAAI Trans. Intell. 
Technol."},{"key":"ref_131","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1093\/nsr\/nwx106","article-title":"A brief introduction to weakly supervised learning","volume":"5","author":"Zhou","year":"2018","journal-title":"Natl. Sci. Rev."},{"key":"ref_132","unstructured":"Merz, C.J., Clair, D.C.S., and Bond, W.E. (1992, January 7\u201311). Semi-supervised adaptive resonance theory (smart2). Proceedings of the International Joint Conference on Neural Networks (IJCNN), Baltimore, MD, USA."},{"key":"ref_133","doi-asserted-by":"crossref","first-page":"542","DOI":"10.1109\/TNN.2009.2015974","article-title":"Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]","volume":"20","author":"Chapelle","year":"2009","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_134","unstructured":"Lee, D.-H. (2013, January 16\u201321). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Proceedings of the Workshop on Challenges in Representation Learning, Atlanta, GA, USA."},{"key":"ref_135","doi-asserted-by":"crossref","unstructured":"Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., and Le, Q.V. (2019). Specaugment: A simple data augmentation method for automatic speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2019-2680"},{"key":"ref_136","doi-asserted-by":"crossref","unstructured":"Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2019-1873"},{"key":"ref_137","doi-asserted-by":"crossref","first-page":"1519","DOI":"10.1109\/JSTSP.2022.3182537","article-title":"Bigssl: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition","volume":"16","author":"Zhang","year":"2022","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_138","unstructured":"Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. 
(2023, January 23\u201329). Robust speech recognition via large-scale weak supervision. Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA."},{"key":"ref_139","doi-asserted-by":"crossref","unstructured":"Xie, Q., Luong, M.-T., Hovy, E., and Le, Q.V. (2020, January 13\u201319). Self-training with noisy student improves imagenet classification. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01070"},{"key":"ref_140","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019, January 2\u20137). Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA."},{"key":"ref_141","unstructured":"Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020, January 13\u201318). A simple framework for contrastive learning of visual representations. Proceedings of the International Conference on Machine Learning (ICML), Virtual."},{"key":"ref_142","doi-asserted-by":"crossref","unstructured":"He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020, January 13\u201319). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"ref_143","first-page":"5516","article-title":"Self-Supervised Learning across Domains","volume":"44","author":"Bucci","year":"2021","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_144","unstructured":"Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with gumbel-softmax. 
arXiv."},{"key":"ref_145","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1109\/TPAMI.2010.57","article-title":"Product quantization for nearest neighbor search","volume":"33","author":"Jegou","year":"2010","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_146","doi-asserted-by":"crossref","unstructured":"Xu, M., Jin, A., Wang, S., Su, M., Ng, T., Mason, H., Han, S., Lei, Z., Deng, Y., and Huang, Z. (2023). Conformer-based speech recognition on extreme edge-computing devices. arXiv.","DOI":"10.18653\/v1\/2024.naacl-industry.12"},{"key":"ref_147","unstructured":"(2025, January 15). AssemblyAI. Conformer-2: A State-of-the-Art Speech Recognition Model Trained on 1.1M hours of Data. AssemblyAI Technical Blog, 2023. Available online: https:\/\/www.assemblyai.com\/blog\/conformer-2\/."},{"key":"ref_148","unstructured":"Miao, H., Cheng, G., Zhang, P., and Yan, Y. (2023). Online Hybrid CTC\/attention End-to-End Automatic Speech Recognition Architecture. arXiv."},{"key":"ref_149","unstructured":"Bao, C., Huo, C., Chen, Q., and Gao, C. (2025). AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition. arXiv."},{"key":"ref_150","unstructured":"(2025, January 15). NVIDIA. What Is Automatic Speech Recognition? NVIDIA Technical Blog, 2023. Available online: https:\/\/developer.nvidia.com\/blog\/essential-guide-to-automatic-speech-recognition-technology\/."},{"key":"ref_151","doi-asserted-by":"crossref","unstructured":"Wang, H., Guo, P., Zhou, P., and Xie, L. (2024, January 14\u201319). Mlca-avsr: Multi-layer cross attention fusion based audio-visual speech recognition. Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10446769"},{"key":"ref_152","unstructured":"Chen, C., Li, R., Hu, Y., Siniscalchi, S.M., Chen, P.-Y., Chng, E., and Yang, C.-H.H. (2024). 
It\u2019s never too late: Fusing acoustic information into large language models for automatic speech recognition. arXiv."},{"key":"ref_153","doi-asserted-by":"crossref","unstructured":"Seo, P.H., Nagrani, A., and Schmid, C. (2023, January 18\u201322). Avformer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.02195"},{"key":"ref_154","doi-asserted-by":"crossref","unstructured":"Hu, J., Li, Z., Wang, P., Wang, J., Li, X., and Zhao, W. (2023). VHASR: A Multimodal Speech Recognition System with Vision Hotwords. arXiv.","DOI":"10.18653\/v1\/2024.emnlp-main.821"},{"key":"ref_155","doi-asserted-by":"crossref","unstructured":"Gabeur, V., Seo, P.H., Nagrani, A., Schmid, C., and Vedaldi, A. (2022). Avatar: Unconstrained Audiovisual Speech Recognition. arXiv.","DOI":"10.21437\/Interspeech.2022-776"},{"key":"ref_156","doi-asserted-by":"crossref","unstructured":"Xu, B., Lu, C., Guo, Y., and Wang, J. (2020, January 13\u201319). Discriminative multi-modality speech recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01444"},{"key":"ref_157","unstructured":"Afouras, T., Chung, J.S., and Zisserman, A. (2018). LRS3-TED: A large-scale dataset for visual speech recognition. arXiv."},{"key":"ref_158","doi-asserted-by":"crossref","unstructured":"Yang, G., Ma, Z., Yu, F., Gao, Z., Zhang, S., and Chen, X. (2024). Mala-asr: Multimedia-assisted llm-based asr. arXiv.","DOI":"10.21437\/Interspeech.2024-488"},{"key":"ref_159","doi-asserted-by":"crossref","unstructured":"Wang, H., Yu, F., Shi, X., Wang, Y., Zhang, S., and Li, M. (2024, January 14\u201319). SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus. 
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10448079"},{"key":"ref_160","unstructured":"Qin, R., Liu, D., Xu, G., Yan, Z., Xu, C., Hu, Y., Hu, X.S., Xiong, J., and Shi, Y. (2024). Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge. arXiv."},{"key":"ref_161","doi-asserted-by":"crossref","unstructured":"Manepalli, S.G., Whitenack, D., and Nemecek, J. (2021, January 14\u201331). DYN-ASR: Compact, multilingual speech recognition via spoken language and accent identification. Proceedings of the 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), New Orleans, LA, USA.","DOI":"10.1109\/WF-IoT51360.2021.9594961"},{"key":"ref_162","first-page":"9361","article-title":"Squeezeformer: An efficient transformer for automatic speech recognition","volume":"35","author":"Kim","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_163","doi-asserted-by":"crossref","unstructured":"Kriman, S., Beliaev, S., Ginsburg, B., Huang, J., Kuchaiev, O., Lavrukhin, V., Leary, R., Li, J., and Zhang, Y. (2020, January 4\u20138). Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual.","DOI":"10.1109\/ICASSP40776.2020.9053889"},{"key":"ref_164","unstructured":"Yao, Z., Guo, L., Yang, X., Kang, W., Kuang, F., Yang, Y., Jin, Z., Lin, L., and Povey, D. (2023). Zipformer: A faster and better encoder for automatic speech recognition. arXiv."},{"key":"ref_165","unstructured":"Jeffries, N., King, E., Kudlur, M., Nicholson, G., Wang, J., and Warden, P. (2024). Moonshine: Speech Recognition for Live Transcription and Voice Commands. 
arXiv."},{"key":"ref_166","doi-asserted-by":"crossref","first-page":"1659","DOI":"10.1016\/S0893-6080(97)00011-7","article-title":"Networks of spiking neurons: The third generation of neural network models","volume":"10","author":"Maass","year":"1997","journal-title":"Neural Netw."},{"key":"ref_167","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1016\/j.neunet.2018.12.002","article-title":"Deep learning in spiking neural networks","volume":"111","author":"Tavanaei","year":"2019","journal-title":"Neural Netw."},{"key":"ref_168","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1146\/annurev.neuro.31.060407.125639","article-title":"Spike Timing\u2013Dependent Plasticity: A Hebbian Learning Rule","volume":"31","author":"Caporale","year":"2008","journal-title":"Annu. Rev. Neurosci."},{"key":"ref_169","doi-asserted-by":"crossref","unstructured":"Auge, D., Hille, J., Kreutz, F., Mueller, E., and Knoll, A. (2021, January 14\u201317). End-to-End Spiking Neural Network for Speech Recognition Using Resonating Input Neurons. Proceedings of the International Conference on Artificial Neural Networks (ICANN), Bratislava, Slovakia.","DOI":"10.1007\/978-3-030-86383-8_20"},{"key":"ref_170","doi-asserted-by":"crossref","unstructured":"Wu, J., Y\u0131lmaz, E., Zhang, M., Li, H., and Tan, K.C. (2020). Deep spiking neural networks for large vocabulary automatic speech recognition. Front. Neurosci., 14.","DOI":"10.3389\/fnins.2020.00199"},{"key":"ref_171","doi-asserted-by":"crossref","unstructured":"Wang, Q., Zhang, T., Han, M., Wang, Y., Zhang, D., and Xu, B. (2023, January 7\u201314). Complex dynamic neurons improved spiking transformer network for efficient automatic speech recognition. Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA.","DOI":"10.1609\/aaai.v37i1.25081"},{"key":"ref_172","doi-asserted-by":"crossref","unstructured":"Irugalbandara, C., Naseem, A.S., Perera, S., Kiruthikan, S., and Logeeshan, V. (2023). 
A Secure and Smart Home Automation System with Speech Recognition and Power Measurement Capabilities. Sensors, 23.","DOI":"10.3390\/s23135784"},{"key":"ref_173","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1007\/s42979-023-02466-w","article-title":"A Comprehensive Analysis of Speech Recognition Systems in Healthcare: Current Research Challenges and Future Prospects","volume":"5","author":"Kumar","year":"2024","journal-title":"SN Comput. Sci."},{"key":"ref_174","unstructured":"Le-Duc, K. (2024). Vietmed: A dataset and benchmark for automatic speech recognition of vietnamese in the medical domain. arXiv."},{"key":"ref_175","unstructured":"Korfiatis, A.P., Moramarco, F., Sarac, R., Cuendet, M.A., Chary, M., Velupillai, S., Nenadic, G., and Gkotsis, G. (2022). Primock57: A Dataset of Primary Care Mock Consultations. arXiv."},{"key":"ref_176","unstructured":"Adedeji, A., Sanni, M., Ayodele, E., Joshi, S., and Olatunji, T. (2025). The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders?. arXiv."},{"key":"ref_177","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1080\/17501229.2024.2315101","article-title":"I Can Speak: Improving English Pronunciation through Automatic Speech Recognition-Based Language Learning Systems","volume":"18","author":"Bashori","year":"2024","journal-title":"Innov. Lang. Learn. Teach."},{"key":"ref_178","doi-asserted-by":"crossref","unstructured":"Sun, W. (2023). The Impact of Automatic Speech Recognition Technology on Second Language Pronunciation and Speaking Skills of EFL Learners: A Mixed Methods Investigation. Front. Psychol., 14.","DOI":"10.3389\/fpsyg.2023.1210187"},{"key":"ref_179","unstructured":"Cai, Y. (2023, January 14\u201316). The Application of Automatic Speech Recognition Technology in English as Foreign. 
Proceedings of the 2nd International Conference on Humanities, Wisdom Education and Service Management (HWESM 2023), Xi\u2019an, China."},{"key":"ref_180","unstructured":"(2025, August 23). Straits Research. Voice and Speech Recognition Market Size, Share & Trends Analysis Report by Function (Speech Recognition, Voice Recognition), by Technology (Artificial Intelligence Based, Non-Artificial Intelligence Based), by Vertical (Automotive, Enterprise, Consumer, BFSI, Government, Retail, Healthcare, Military, Legal, Education) and by Region (North America, Europe, APAC, Middle East and Africa, LATAM) Forecasts, 2025\u20132033; Report Code: SRTE2654DR. Available online: https:\/\/straitsresearch.com\/report\/voice-and-speech-recognition-market."},{"key":"ref_181","unstructured":"Paulus Schoutsen (2025, August 23). 2023: Home Assistant\u2019s Year of Voice. Available online: https:\/\/www.home-assistant.io\/blog\/2022\/12\/20\/year-of-voice\/."},{"key":"ref_182","unstructured":"Schoutsen, P. (2025, August 23). Year of the Voice-Chapter 2: Let\u2019s Talk. Home Assistant Blog, 27 April 2023. Available online: https:\/\/www.home-assistant.io\/blog\/2023\/04\/27\/year-of-the-voice-chapter-2\/."},{"key":"ref_183","unstructured":"Steadman, L., and Williams, W. (2025, August 23). Ursa 2: Elevating Speech Recognition Across 50+ Languages. Available online: https:\/\/www.speechmatics.com\/company\/articles-and-news\/ursa-2-elevating-speech-recognition-across-52-languages."},{"key":"ref_184","unstructured":"Uniphore (2025, August 23). What Is Automatic Speech Recognition (ASR)?. Available online: https:\/\/www.uniphore.com\/glossary\/automatic-speech-recognition\/."},{"key":"ref_185","unstructured":"(2025, August 23). Microsoft. Speech to Text documentation\u2013Tutorials, API Reference. Azure AI Services. Available online: https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/speech-service\/index-speech-to-text."},{"key":"ref_186","unstructured":"(2025, August 23). 
Google Cloud. Speech-to-Text documentation. Google Cloud Documentation. Available online: https:\/\/cloud.google.com\/speech-to-text\/docs."},{"key":"ref_187","unstructured":"(2025, August 23). Voicegain. Speech-to-Text APIs. Available online: https:\/\/www.voicegain.ai\/speech-to-text-apis."},{"key":"ref_188","unstructured":"Wadhwani, P. (2025, August 23). Automotive Voice Recognition Market Analysis: Market Size, Share & Forecasts 2023\u20132032. Global Market Insights. Available online: https:\/\/www.gminsights.com\/industry-analysis\/automotive-voice-recognition-market."},{"key":"ref_189","unstructured":"Behera, R. (2025, August 23). Advances in Automotive Voice Recognition Systems Redefining the In-Car Experience. Allied Market Research Blog, 20 May 2024. Available online: https:\/\/blog.alliedmarketresearch.com\/latest-technologies-in-automotive-voice-recognition-systems-1972."},{"key":"ref_190","doi-asserted-by":"crossref","unstructured":"Wang, H., Guo, P., Li, Y., Zhang, A., Sun, J., Xie, L., Chen, W., Zhou, P., Bu, H., and Xu, X. (2024, January 14\u201319). ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge. Proceedings of the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSPW62465.2024.10627712"},{"key":"ref_191","unstructured":"ResearchInChina (2025, August 23). Automotive Voice Industry Review 2023\u20132024, AutoTech News, Available online: https:\/\/autotech.news\/automotive-voice-industry-review-2023-2024\/."},{"key":"ref_192","doi-asserted-by":"crossref","unstructured":"Zhen, K., Radfar, M., Nguyen, H., Strimel, G.P., Susanj, N., and Mouchtaris, A. (2023, January 9\u201312). Sub-8-bit quantization for on-device speech recognition: A regularization-free approach. 
Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar.","DOI":"10.1109\/SLT54892.2023.10022821"},{"key":"ref_193","doi-asserted-by":"crossref","unstructured":"Ding, S., Meadowlark, P., He, Y., Lew, L., Agrawal, S., and Rybakov, O. (2022, January 18\u201322). 4-bit Conformer with Native Quantization Aware Training for Efficient Speech Recognition. Proceedings of the Interspeech 2022, Incheon, Republic of Korea.","DOI":"10.21437\/Interspeech.2022-10809"},{"key":"ref_194","doi-asserted-by":"crossref","unstructured":"Noroozi, V., Majumdar, S., Kumar, A., Balam, J., and Ginsburg, B. (2024, January 14\u201319). Stateful Conformer with Cache-Based Inference for Streaming Automatic Speech Recognition. Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea.","DOI":"10.1109\/ICASSP48485.2024.10446861"},{"key":"ref_195","unstructured":"K2-FSA Team (2025, September 22). Sherpa-ONNX: Streaming Conformer-Transducer Models for On-Device ASR. Available online: https:\/\/k2-fsa.github.io\/sherpa\/onnx\/pretrained_models\/online-transducer\/conformer-transducer-models.html."},{"key":"ref_196","doi-asserted-by":"crossref","unstructured":"Gupta, A., Parulekar, A., Chattopadhyay, S., and Jyothi, P. (2024). Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR. arXiv.","DOI":"10.18653\/v1\/2024.mrl-1.13"},{"key":"ref_197","doi-asserted-by":"crossref","unstructured":"Liu, Z., Venkateswaran, N., Le Ferrand, E., and Prud\u2019hommeaux, E. (2024, January 11\u201316). How Important is a Language Model for Low-resource ASR?. Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand.","DOI":"10.18653\/v1\/2024.findings-acl.13"},{"key":"ref_198","doi-asserted-by":"crossref","unstructured":"Mainzinger, J., and Levow, G.-A. (2024, January 11\u201316). Fine-Tuning ASR models for Very Low-Resource Languages: A Study on Mvskoke.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Bangkok, Thailand.","DOI":"10.18653\/v1\/2024.acl-srw.16"},{"key":"ref_199","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1016\/j.specom.2021.06.003","article-title":"Curriculum Learning based approaches for robust end-to-end far-field speech recognition","volume":"132","author":"Ranjan","year":"2021","journal-title":"Speech Commun."},{"key":"ref_200","unstructured":"Dai, Y., Liu, S., Bataev, V., Shi, Y., Chen, X., Wang, H., Bu, H., and Li, S. (2024). AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition. arXiv."},{"key":"ref_201","doi-asserted-by":"crossref","unstructured":"Wang, Z., Hou, F., and Wang, R. (2023, January 20\u201324). CLRL-Tuning: A Novel Continual Learning Approach for Automatic Speech Recognition. Proceedings of the INTERSPEECH 2023, Dublin, Ireland.","DOI":"10.21437\/Interspeech.2023-503"}],"container-title":["Informatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2227-9709\/12\/4\/107\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T04:42:21Z","timestamp":1760071341000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2227-9709\/12\/4\/107"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,4]]},"references-count":201,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["informatics12040107"],"URL":"https:\/\/doi.org\/10.3390\/informatics12040107","relation":{},"ISSN":["2227-9709"],"issn-type":[{"type":"electronic","value":"2227-9709"}],"subject":[],"published":{"date-parts":[[2025,10,4]]}}}