{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T18:21:42Z","timestamp":1775845302427,"version":"3.50.1"},"publisher-location":"Cham","reference-count":25,"publisher":"Springer Nature Switzerland","isbn-type":[{"value":"9783031441943","type":"print"},{"value":"9783031441950","type":"electronic"}],"license":[{"start":{"date-parts":[[2023,1,1]],"date-time":"2023-01-01T00:00:00Z","timestamp":1672531200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,9,22]],"date-time":"2023-09-22T00:00:00Z","timestamp":1695340800000},"content-version":"vor","delay-in-days":264,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In recent research, in the domain of speech processing, large End-to-End (E2E) systems for Automatic Speech Recognition (ASR) have reported state-of-the-art performance on various benchmarks. These systems intrinsically learn how to handle and remove noise conditions from speech. Previous research has shown, that it is possible to extract the denoising capabilities of these models into a preprocessor network, which can be used as a frontend for downstream ASR models. However, the proposed methods were limited to specific fully convolutional architectures. In this work, we propose a novel method to extract the denoising capabilities, that can be applied to any encoder-decoder architecture. We propose the Cleancoder preprocessor architecture that extracts hidden activations from the Conformer ASR model and feeds them to a decoder to predict denoised spectrograms. We train our preprocessor on the Noisy Speech Database (NSD) to reconstruct denoised spectrograms from noisy inputs. Then, we evaluate our model as a frontend to a pretrained Conformer ASR model as well as a frontend to train smaller Conformer ASR models from scratch. We show that the Cleancoder is able to filter noise from speech and that it improves the total Word Error Rate (WER) of the downstream model in noisy conditions for both applications.<\/jats:p>","DOI":"10.1007\/978-3-031-44195-0_31","type":"book-chapter","created":{"date-parts":[[2023,9,21]],"date-time":"2023-09-21T12:04:08Z","timestamp":1695297848000},"page":"376-388","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Bring the\u00a0Noise: Introducing Noise Robustness to\u00a0Pretrained Automatic Speech Recognition"],"prefix":"10.1007","author":[{"given":"Patrick","family":"Eickhoff","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Matthias","family":"M\u00f6ller","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Theresa Pekarek","family":"Rosin","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Johannes","family":"Twiefel","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Stefan","family":"Wermter","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2023,9,22]]},"reference":[{"key":"31_CR1","unstructured":"Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: Wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449\u201312460 (2020)"},{"key":"31_CR2","doi-asserted-by":"crossref","unstructured":"Caroselli, J., Naranayan, A., O\u2019Malley, T.: Cleanformer: a microphone array configuration-invariant, streaming, multichannel neural enhancement frontend for ASR. ArXiv abs\/2204.11933 (2022)","DOI":"10.1109\/ICASSP49357.2023.10095118"},{"issue":"6","key":"31_CR3","doi-asserted-by":"publisher","first-page":"1505","DOI":"10.1109\/JSTSP.2022.3188113","volume":"16","author":"S Chen","year":"2022","unstructured":"Chen, S., et al.: WavLM: large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 16(6), 1505\u20131518 (2022)","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"31_CR4","doi-asserted-by":"crossref","unstructured":"Chung, Y.A., et al.: w2v-BERT: combining contrastive learning and masked language modeling for self-supervised speech pre-training. In: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 244\u2013250 (2021)","DOI":"10.1109\/ASRU51503.2021.9688253"},{"key":"31_CR5","doi-asserted-by":"crossref","unstructured":"Chung, Y.A., et al.: W2v-BERT: combining contrastive learning and masked language modeling for self-supervised speech pre-training. In: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 244\u2013250. IEEE (2021)","DOI":"10.1109\/ASRU51503.2021.9688253"},{"key":"31_CR6","doi-asserted-by":"crossref","unstructured":"Fang, H., Wittmer, N., Twiefel, J., Wermter, S., Gerkmann, T.: Partially adaptive multichannel joint reduction of ego-noise and environmental noise. arXiv preprint arXiv:2303.15042 (2023)","DOI":"10.1109\/ICASSP49357.2023.10096344"},{"key":"31_CR7","doi-asserted-by":"crossref","unstructured":"Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recognition. In: Proceedings of the Interspeech, pp. 5036\u20135040, October 2020","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"31_CR8","doi-asserted-by":"crossref","unstructured":"Heymann, J., Drude, L., Haeb-Umbach, R.: Neural network based spectral mask estimation for acoustic beamforming. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),. pp. 196\u2013200 (2016)","DOI":"10.1109\/ICASSP.2016.7471664"},{"key":"31_CR9","doi-asserted-by":"publisher","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","volume":"29","author":"WN Hsu","year":"2021","unstructured":"Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: Hubert: self-supervised speech representation learning by masked prediction of hidden units. IEEE\/ACM Trans. Audio Speech Lang. Process. 29, 3451\u20133460 (2021)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"31_CR10","doi-asserted-by":"crossref","unstructured":"Huang, Y.A., Shabestary, T.Z., Gruenstein, A.: Hotword cleaner: dual-microphone adaptive noise cancellation with deferred filter coefficients for robust keyword spotting. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6346\u20136350 (2019)","DOI":"10.1109\/ICASSP.2019.8682682"},{"key":"31_CR11","doi-asserted-by":"crossref","unstructured":"Li, C., Yuan, P., Lee, H.: What does a network layer hear? Analyzing hidden representations of end-to-end ASR through speech synthesis. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6434\u20136438. IEEE Press, May 2020","DOI":"10.1109\/ICASSP40776.2020.9054675"},{"key":"31_CR12","doi-asserted-by":"crossref","unstructured":"Li, J., et al.: Jasper: an end-to-end convolutional neural acoustic model. In: Proceedings of the Interspeech, pp. 71\u201375, September 2019","DOI":"10.21437\/Interspeech.2019-1819"},{"key":"31_CR13","doi-asserted-by":"crossref","unstructured":"M\u00f6ller, M., Twiefel, J., Weber, C., Wermter, S.: Controlling the noise robustness of end-to-end automatic speech recognition systems. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN), July 2021","DOI":"10.1109\/IJCNN52387.2021.9533390"},{"key":"31_CR14","doi-asserted-by":"crossref","unstructured":"Narayanan, A., Wang, D.: Ideal ratio mask estimation using deep neural networks for robust speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7092\u20137096 (2013)","DOI":"10.1109\/ICASSP.2013.6639038"},{"key":"31_CR15","doi-asserted-by":"crossref","unstructured":"Narayanan, A., Wang, D.: Ideal ratio mask estimation using deep neural networks for robust speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7092\u20137096. IEEE Press (2013)","DOI":"10.1109\/ICASSP.2013.6639038"},{"key":"31_CR16","doi-asserted-by":"crossref","unstructured":"Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206\u20135210. IEEE Press, April 2015","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"31_CR17","unstructured":"Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022)"},{"key":"31_CR18","unstructured":"Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks. arXiv preprint arXiv:1505.00387 (2015)"},{"key":"31_CR19","unstructured":"Thiemann, J., Ito, N., Vincent, E.: DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments (2013)"},{"key":"31_CR20","unstructured":"Valentini-Botinhao, C.: Noisy Speech Database for Training Speech Enhancement Algorithms and TTS Models (2017)"},{"key":"31_CR21","doi-asserted-by":"crossref","unstructured":"Valentini-Botinhao, C., Wang, X., Takaki, S., Yamagishi, J.: Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In: Speech Synthesis Workshop (SSW), pp. 146\u2013152, September 2016","DOI":"10.21437\/SSW.2016-24"},{"issue":"8","key":"31_CR22","doi-asserted-by":"publisher","first-page":"1420","DOI":"10.1109\/TASLP.2018.2828980","volume":"26","author":"C Valentini-Botinhao","year":"2018","unstructured":"Valentini-Botinhao, C., Yamagishi, J.: Speech enhancement of noisy and reverberant speech for text-to-speech. IEEE\/ACM Trans. Audio Speech Lang. Process. 26(8), 1420\u20131433 (2018)","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"31_CR23","unstructured":"Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., December 2017"},{"key":"31_CR24","doi-asserted-by":"crossref","unstructured":"Veaux, C., Yamagishi, J., King, S.: The voice bank corpus: design, collection and data analysis of a large regional accent speech database. In: Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA\/CASLRE), International Conference. Institute of Electrical and Electronics Engineers (IEEE), United States, November 2013","DOI":"10.1109\/ICSDA.2013.6709856"},{"key":"31_CR25","doi-asserted-by":"crossref","unstructured":"Yang, S.W., et al.: SUPERB: speech processing universal performance benchmark. In: Proceedings of the Interspeech, pp. 1194\u20131198 (2021)","DOI":"10.21437\/Interspeech.2021-1775"}],"container-title":["Lecture Notes in Computer Science","Artificial Neural Networks and Machine Learning \u2013 ICANN 2023"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-44195-0_31","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,21]],"date-time":"2023-09-21T12:10:06Z","timestamp":1695298206000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-44195-0_31"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023]]},"ISBN":["9783031441943","9783031441950"],"references-count":25,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-44195-0_31","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"value":"0302-9743","type":"print"},{"value":"1611-3349","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023]]},"assertion":[{"value":"22 September 2023","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"ICANN","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"International Conference on Artificial Neural Networks","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Heraklion","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Greece","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2023","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"26 September 2023","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"29 September 2023","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"32","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"icann2023","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/e-nns.org\/icann2023\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Single-blind","order":1,"name":"type","label":"Type","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"easyacademia.org","order":2,"name":"conference_management_system","label":"Conference Management System","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"947","order":3,"name":"number_of_submissions_sent_for_review","label":"Number of Submissions Sent for Review","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"426","order":4,"name":"number_of_full_papers_accepted","label":"Number of Full Papers Accepted","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"22","order":5,"name":"number_of_short_papers_accepted","label":"Number of Short Papers Accepted","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"45% - The value is computed by the equation \"Number of Full Papers Accepted \/ Number of Submissions Sent for Review * 100\" and then rounded to a whole number.","order":6,"name":"acceptance_rate_of_full_papers","label":"Acceptance Rate of Full Papers","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"2.4","order":7,"name":"average_number_of_reviews_per_paper","label":"Average Number of Reviews per Paper","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"4","order":8,"name":"average_number_of_papers_per_reviewer","label":"Average Number of Papers per Reviewer","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"No","order":9,"name":"external_reviewers_involved","label":"External Reviewers Involved","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}},{"value":"type of other papers accepted  : 9 Abstract","order":10,"name":"additional_info_on_review_process","label":"Additional Info on Review Process","group":{"name":"ConfEventPeerReviewInformation","label":"Peer Review Information (provided by the conference organizers)"}}]}}