{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T16:13:05Z","timestamp":1772554385355,"version":"3.50.1"},"reference-count":29,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2024,10,29]],"date-time":"2024-10-29T00:00:00Z","timestamp":1730160000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"IBM research"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Error correction is a vital element in modern automatic speech recognition (ASR) systems. A significant portion of ASR error correction work is closely integrated within specific ASR systems, which creates challenges for adapting these solutions to different ASR frameworks. This research introduces Lexical Error Guard (LEG), which leverages the extensive pre-trained knowledge of large language models (LLMs) and employs instructional learning to create an adaptable error correction system compatible with various ASR platforms. Additionally, a parameter-efficient fine-tuning method is utilized using quantized low-rank adaptation (QLoRA) to facilitate fast training of the system. Tested on the LibriSpeech data corpus, the results indicate that LEG improves ASR results when used with various Whisper model sizes. Improvements in WER are made, with a decrease from 2.27% to 2.21% on the \u201cTest Clean\u201d dataset for Whisper Large with beam search. 
On the \u201cTest Other\u201d dataset, WER similarly decreases from 4.93% to 4.72%.<\/jats:p>","DOI":"10.3390\/make6040120","type":"journal-article","created":{"date-parts":[[2024,10,29]],"date-time":"2024-10-29T08:42:09Z","timestamp":1730191329000},"page":"2435-2446","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Lexical Error Guard: Leveraging Large Language Models for Enhanced ASR Error Correction"],"prefix":"10.3390","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8642-8806","authenticated-orcid":false,"given":"Mei","family":"Si","sequence":"first","affiliation":[{"name":"Department of Cognitive Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA"}]},{"given":"Omar","family":"Cobas","sequence":"additional","affiliation":[{"name":"Department of Cognitive Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA"},{"name":"Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA"}]},{"given":"Michael","family":"Fababeir","sequence":"additional","affiliation":[{"name":"Department of Cognitive Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA"},{"name":"Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA"}]}],"member":"1968","published-online":{"date-parts":[[2024,10,29]]},"reference":[{"key":"ref_1","unstructured":"Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., and Diamos, G. (2015). End-to-end speech recognition in English and Mandarin. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Wang, Y.-W., Lu, K.-H., and Chen, K.-Y. (2023). Hypr: A comprehensive study for ASR hypothesis revising with a reference corpus. 
arXiv.","DOI":"10.21437\/Interspeech.2024-385"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Salazar, J., Liang, D., Nguyen, T.Q., and Kirchhoff, K. (2020). Masked language model scoring. arXiv.","DOI":"10.18653\/v1\/2020.acl-main.240"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Xu, L., Gu, Y., Kolehmainen, J., Khan, H., Gandhe, A., Rastrow, A., Stolcke, A., and Bulyko, I. (2022, May 22\u201327). Rescorebert: Discriminative speech recognition rescoring with bert. Proceedings of the ICASSP 2022\u20142022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore.","DOI":"10.1109\/ICASSP43922.2022.9747118"},{"key":"ref_5","unstructured":"Fohr, D., and Illina, I. (2021, August 30\u2013September 3). Bert-based semantic model for rescoring n-best speech recognition list. Proceedings of the INTERSPEECH, Brno, Czechia."},{"key":"ref_6","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015, April 19\u201324). Librispeech: An ASR corpus based on public domain audio books. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia.","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"ref_8","unstructured":"Chen, Z., Jiang, F., Chen, J., Wang, T., Yu, F., Chen, G., Zhang, H., Liang, J., Zhang, C., and Zhang, Z. (2023). Phoenix: Democratizing chatgpt across languages. arXiv."},{"key":"ref_9","unstructured":"Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023, July 23\u201329). Robust speech recognition via large-scale weak supervision. 
Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA."},{"key":"ref_10","unstructured":"Ma, R., Qian, M., Manakul, P., Gales, M., and Knill, K. (2023). Can generative large language models perform ASR error correction?. arXiv."},{"key":"ref_11","unstructured":"Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. arXiv."},{"key":"ref_12","first-page":"12449","article-title":"wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations","volume":"33","author":"Baevski","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_13","unstructured":"Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., and Coates, A. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Wang, Z., Kamma, R., Eswaran, S., and Sadagopan, N. (2023). Patcorrect: Non-autoregressive phoneme-augmented transformer for ASR error correction. arXiv.","DOI":"10.21437\/Interspeech.2023-1135"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Tian, J., Yu, J., Weng, C., Zhang, S.-X., Su, D., Yu, D., and Zou, Y. (2022, May 22\u201327). Consistent training and decoding for end-to-end speech recognition using lattice-free MMI. Proceedings of the ICASSP 2022\u20142022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore.","DOI":"10.1109\/ICASSP43922.2022.9746579"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., and Wu, Y. (2020). Conformer: Convolution-augmented transformer for speech recognition. 
arXiv.","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"ref_17","unstructured":"Hu, Y., Chen, C., Yang, C.-H.H., Li, R., Zhang, C., Chen, P.-Y., and Chng, E. (2024). Large Language Models are Efficient Learners of Noise-Robust Speech Recognition. arXiv."},{"key":"ref_18","unstructured":"Pu, J., Nguyen, T.-S., and Stuker, S. (2024). Multi-stage Large Language Model Correction for Speech Recognition. arXiv."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Yang, C.-H.H., Gu, Y., Liu, Y.-C., Ghosh, S., Bulyko, I., and Stolcke, A. (2023). Generative Speech Recognition Error Correction With Large Language Models and Task-Activating Prompting. arXiv.","DOI":"10.1109\/ASRU57964.2023.10389673"},{"key":"ref_20","unstructured":"Chen, C., Hu, Y., Yang, C.-H.H., Siniscalchi, S.M., Chen, P.-Y., and Chng, E. (2023). HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Yu, W., Tang, C., Sun, G., Chen, X., Tan, T., Li, W., Lu, L., Ma, Z., and Zhang, C. (2023). Connecting Speech Encoder and Large Language Model for ASR. arXiv.","DOI":"10.1109\/ICASSP48485.2024.10445874"},{"key":"ref_22","unstructured":"Adedeji, A., Joshi, S., and Doohan, B. (2024). The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models. arXiv."},{"key":"ref_23","unstructured":"Higuchi, Y., Ogawa, T., and Kobayashi, T. (2023). Harnessing the zero-shot power of instruction-tuned large language model in end-to-end speech recognition. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Trusov, A., Limonova, E., Slugin, D., Nikolaev, D., and Arlazarov, V.V. (2021, January 10\u201315). Fast implementation of 4-bit convolutional neural networks for mobile devices. 
Proceedings of the 25th International Conference on Pattern Recognition (ICPR), Milan, Italy.","DOI":"10.1109\/ICPR48806.2021.9412841"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Ma, R., Gales, M.J.F., Knill, K.M., and Qian, M. (2023). N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space. arXiv.","DOI":"10.21437\/Interspeech.2023-1616"},{"key":"ref_26","unstructured":"Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. (2022). Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv."},{"key":"ref_27","unstructured":"Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S., and Bossan, B. (2023, September 15). Peft: State-of-the-Art Parameter-Efficient Fine-Tuning Methods. Available online: https:\/\/github.com\/huggingface\/peft."},{"key":"ref_28","unstructured":"Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Morris, A., Maier, V., and Green, P. (2004, October 4\u20138). From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition. 
Proceedings of the Interspeech, Jeju Island, South Korea.","DOI":"10.21437\/Interspeech.2004-668"}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/6\/4\/120\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:23:14Z","timestamp":1760113394000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/6\/4\/120"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,29]]},"references-count":29,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["make6040120"],"URL":"https:\/\/doi.org\/10.3390\/make6040120","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,10,29]]}}}