{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T18:11:12Z","timestamp":1771956672417,"version":"3.50.1"},"reference-count":57,"publisher":"MDPI AG","issue":"20","license":[{"start":{"date-parts":[[2022,10,13]],"date-time":"2022-10-13T00:00:00Z","timestamp":1665619200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Second Century Fund (C2F), Chulalongkorn University","award":["SOC66210008"],"award-info":[{"award-number":["SOC66210008"]}]},{"name":"Thailand Science Research and Innovation Fund Chulalongkorn University","award":["SOC66210008"],"award-info":[{"award-number":["SOC66210008"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Because of their simple design structure, end-to-end deep learning (E2E-DL) models have gained a lot of attention for speech enhancement. A number of DL models have achieved excellent results in eliminating the background noise and enhancing the quality as well as the intelligibility of noisy speech. Designing resource-efficient and compact models during real-time processing is still a key challenge. In order to enhance the accomplishment of E2E models, the sequential and local characteristics of speech signal should be efficiently taken into consideration while modeling. In this paper, we present resource-efficient and compact neural models for end-to-end noise-robust waveform-based speech enhancement. Combining the Convolutional Encode-Decoder (CED) and Recurrent Neural Networks (RNNs) in the Convolutional Recurrent Network (CRN) framework, we have aimed at different speech enhancement systems. Different noise types and speakers are used to train and test the proposed models. 
Experiments on the LibriSpeech and DEMAND datasets show that the proposed models achieve better quality and intelligibility, with fewer trainable parameters, notably lower model complexity, and shorter inference time than existing recurrent and convolutional models. Quality and intelligibility are improved by 31.61% and 17.18%, respectively, over the noisy speech. We further performed a cross-corpus analysis to demonstrate the generalization of the proposed E2E SE models across different speech datasets.<\/jats:p>","DOI":"10.3390\/s22207782","type":"journal-article","created":{"date-parts":[[2022,10,14]],"date-time":"2022-10-14T01:44:13Z","timestamp":1665711853000},"page":"7782","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["End-to-End Deep Convolutional Recurrent Models for Noise Robust Waveform Speech Enhancement"],"prefix":"10.3390","volume":"22","author":[{"given":"Rizwan","family":"Ullah","sequence":"first","affiliation":[{"name":"Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand"}]},{"given":"Lunchakorn","family":"Wuttisittikulkij","sequence":"additional","affiliation":[{"name":"Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1715-9689","authenticated-orcid":false,"given":"Sushank","family":"Chaudhary","sequence":"additional","affiliation":[{"name":"Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand"}]},{"given":"Amir","family":"Parnianifard","sequence":"additional","affiliation":[{"name":"Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, 
Thailand"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4310-7054","authenticated-orcid":false,"given":"Shashi","family":"Shah","sequence":"additional","affiliation":[{"name":"Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand"}]},{"given":"Muhammad","family":"Ibrar","sequence":"additional","affiliation":[{"name":"Department of Physics, Islamia College Peshawar, Peshawar 25000, Pakistan"}]},{"given":"Fazal-E","family":"Wahab","sequence":"additional","affiliation":[{"name":"National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,10,13]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Gnanamanickam, J., Natarajan, Y., and KR, S.P. (2021). A hybrid speech enhancement algorithm for voice assistance application. Sensors, 21.","DOI":"10.3390\/s21217025"},{"key":"ref_2","first-page":"78","article-title":"On Improvement of Speech Intelligibility and Quality: A Survey of Unsupervised Single Channel Speech Enhancement Algorithms","volume":"6","author":"Saleem","year":"2020","journal-title":"Int. J. Interact. Multimed. Artif. Intell."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"108784","DOI":"10.1016\/j.apacoust.2022.108784","article-title":"Gammatone Filter Bank-Deep Neural Network-based Monaural speech enhancement for unseen conditions","volume":"194","author":"Sivapatham","year":"2022","journal-title":"Appl. Acoust."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"883","DOI":"10.1007\/s10772-020-09674-2","article-title":"Fundamentals, present and future perspectives of speech enhancement","volume":"24","author":"Das","year":"2021","journal-title":"Int. J. 
Speech Technol."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"380","DOI":"10.1109\/TASLP.2019.2955276","article-title":"Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement","volume":"28","author":"Tan","year":"2019","journal-title":"IEEE\/Acm Trans. Audio Speech Lang. Process."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1109\/TASLP.2014.2364452","article-title":"A regression approach to speech enhancement based on deep neural networks","volume":"23","author":"Xu","year":"2014","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1109\/LSP.2013.2291240","article-title":"An experimental study on speech enhancement based on deep neural networks","volume":"21","author":"Xu","year":"2013","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1186\/s13634-020-00707-1","article-title":"Speech enhancement by LSTM-based noise suppression followed by CNN-based speech restoration","volume":"2020","author":"Strake","year":"2020","journal-title":"Eurasip J. Adv. Signal Process."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1016\/j.specom.2014.02.001","article-title":"Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification","volume":"60","author":"Xia","year":"2014","journal-title":"Speech Commun."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1179","DOI":"10.1109\/TASLP.2019.2913512","article-title":"A new framework for CNN-based speech enhancement in the time domain","volume":"27","author":"Pandey","year":"2019","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. 
Process."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1374","DOI":"10.1109\/TASLP.2022.3161143","article-title":"Self-attending RNN for speech enhancement to improve cross-corpus generalization","volume":"30","author":"Pandey","year":"2022","journal-title":"Ieee\/Acm Trans. Audio Speech Lang. Process."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"24013","DOI":"10.1007\/s11042-019-08293-7","article-title":"Text-independent speaker recognition using LSTM-RNN and speech enhancement","volume":"79","author":"Nassar","year":"2020","journal-title":"Multimed. Tools Appl."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"9037","DOI":"10.1007\/s12652-020-02598-4","article-title":"Multi-objective long-short term memory recurrent neural networks for speech enhancement","volume":"12","author":"Saleem","year":"2021","journal-title":"J. Ambient. Intell. Humaniz. Comput."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Pandey, A., and Wang, D. (2020, January 25\u201329). Learning Complex Spectral Mapping for Speech Enhancement with Improved Cross-Corpus Generalization. Proceedings of the Interspeech, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-2561"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"107347","DOI":"10.1016\/j.apacoust.2020.107347","article-title":"Speech enhancement using progressive learning-based convolutional recurrent neural network","volume":"166","author":"Li","year":"2020","journal-title":"Appl. Acoust."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"107019","DOI":"10.1016\/j.apacoust.2019.107019","article-title":"Speech enhancement based on simple recurrent unit network","volume":"157","author":"Cui","year":"2020","journal-title":"Appl. Acoust."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Lei, T., Zhang, Y., Wang, S.I., Dai, H., and Artzi, Y. (2017). Simple recurrent units for highly parallelizable recurrence. 
arXiv.","DOI":"10.18653\/v1\/D18-1477"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.specom.2021.10.002","article-title":"PACDNN: A phase-aware composite deep neural network for speech enhancement","volume":"136","author":"Hasannezhad","year":"2022","journal-title":"Speech Commun."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Lv, S., Hu, Y., Zhang, S., and Xie, L. (2021). Dccrn+: Channel-wise subband dccrn with snr estimation for speech enhancement. arXiv.","DOI":"10.21437\/Interspeech.2021-1482"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Takahashi, N., Agrawal, P., Goswami, N., and Mitsufuji, Y. (2018, January 2\u20136). PhaseNet: Discretized Phase Modeling with Deep Neural Networks for Audio Source Separation. Proceedings of the Interspeech 2018, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1773"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Fu, S.W., Tsao, Y., Lu, X., and Kawai, H. (2017, January 12\u201315). Raw waveform-based speech enhancement by fully convolutional networks. Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia.","DOI":"10.1109\/APSIPA.2017.8281993"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Li, J., Zhang, H., Zhang, X., and Li, C. (2019, January 18\u201321). Single channel speech enhancement using temporal convolutional recurrent neural networks. Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China.","DOI":"10.1109\/APSIPAASC47483.2019.9023013"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1570","DOI":"10.1109\/TASLP.2018.2821903","article-title":"End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks","volume":"26","author":"Fu","year":"2018","journal-title":"IEEE\/ACM Trans. 
Audio Speech Lang. Process."},{"key":"ref_24","unstructured":"Sainath, T., Weiss, R.J., Wilson, K., Senior, A.W., and Vinyals, O. (2022, October 01). Learning the speech front-end with raw waveform CLDNNs. Available online: https:\/\/storage.googleapis.com\/pub-tools-public-publication-data\/pdf\/43960.pdf."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Giri, R., Isik, U., and Krishnaswamy, A. (2019, January 20\u201323). Attention wave-u-net for speech enhancement. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA.","DOI":"10.1109\/WASPAA.2019.8937186"},{"key":"ref_26","unstructured":"Rethage, D., Pons, J., and Serra, X. (2014, January 4\u20139). A wavenet for speech denoising. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1016\/j.specom.2019.09.001","article-title":"Time-domain speech enhancement using generative adversarial networks","volume":"114","author":"Pascual","year":"2019","journal-title":"Speech Commun."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1016\/j.specom.2020.09.002","article-title":"A time\u2013frequency smoothing neural network for speech enhancement","volume":"124","author":"Yuan","year":"2020","journal-title":"Speech Commun."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"106666","DOI":"10.1016\/j.asoc.2020.106666","article-title":"Multi-scale decomposition based supervised single channel deep speech enhancement","volume":"95","author":"Saleem","year":"2020","journal-title":"Appl. 
Soft Comput."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"160581","DOI":"10.1109\/ACCESS.2020.3021061","article-title":"On learning spectral masking for single channel speech enhancement using feedforward and recurrent neural networks","volume":"8","author":"Saleem","year":"2020","journal-title":"IEEE Access"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Abdulbaqi, J., Gu, Y., Chen, S., and Marsic, I. (2020, January 4\u20138). Residual recurrent neural network for speech enhancement. Proceedings of the ICASSP 2020\u20132020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain.","DOI":"10.1109\/ICASSP40776.2020.9053544"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"2149","DOI":"10.1109\/LSP.2020.3040693","article-title":"Wavecrn: An efficient convolutional recurrent neural network for end-to-end speech enhancement","volume":"27","author":"Hsieh","year":"2020","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_33","unstructured":"Stoller, D., Ewert, S., and Dixon, S. (2018). Wave-u-net: A multi-scale neural network for end-to-end audio source separation. arXiv."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Pandey, A., and Wang, D. (2019, January 12\u201317). TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain. Proceedings of the ICASSP 2019\u20142019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683634"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Hu, Y., Liu, Y., Lv, S., Xing, M., Zhang, S., Fu, Y., and Xie, L. (2020). DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. 
arXiv.","DOI":"10.21437\/Interspeech.2020-2537"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"825","DOI":"10.1109\/TASLP.2020.2968738","article-title":"On loss functions for supervised monaural time-domain speech enhancement","volume":"28","author":"Tan","year":"2020","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_37","unstructured":"Xiao, F., Guan, J., Kong, Q., and Wang, W. (2021). Time-domain speech enhancement with generative adversarial learning. arXiv."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015, January 9\u201324). Librispeech: An asr corpus based on public domain audio books. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia.","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1016\/0167-6393(90)90010-7","article-title":"Speech database development at MIT: TIMIT and beyond","volume":"9","author":"Zue","year":"1990","journal-title":"Speech Commun."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Veaux, C., Yamagishi, J., and King, S. (2013, January 25\u201327). The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. Proceedings of the International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA\/CASLRE), Gurgaon, India.","DOI":"10.1109\/ICSDA.2013.6709856"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Sun, S., Yeh, C.F., Ostendorf, M., Hwang, M.Y., and Xie, L. (2018). Training augmentation with adversarial examples for robust speech recognition. 
arXiv.","DOI":"10.21437\/Interspeech.2018-1247"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"247","DOI":"10.1016\/0167-6393(93)90095-3","article-title":"Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems","volume":"12","author":"Varga","year":"1993","journal-title":"Speech Commun."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"035081","DOI":"10.1121\/1.4799597","article-title":"The diverse environments multi-channel acoustic noise database (demand): A database of multichannel environmental noise recordings","volume":"19","author":"Thiemann","year":"2013","journal-title":"Proc. Mtgs. Acoust."},{"key":"ref_44","unstructured":"Rix, A.W., Beerends, J.G., Hollier, M.P., and Hekstra, A.P. (2021, January 6\u201311). Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Taal, C.H., Hendriks, R.C., Heusdens, R., and Jensen, J. (2010, January 14\u201319). A short-time objective intelligibility measure for time-frequency weighted noisy speech. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA.","DOI":"10.1109\/ICASSP.2010.5495701"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"229","DOI":"10.1109\/TASL.2007.911054","article-title":"Evaluation of objective quality measures for speech enhancement","volume":"16","author":"Hu","year":"2007","journal-title":"IEEE Trans. Audio Speech Lang. 
Process."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"4705","DOI":"10.1121\/1.4986931","article-title":"Long short-term memory for speaker generalization in supervised speech separation","volume":"141","author":"Chen","year":"2017","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1109\/TASLP.2018.2870742","article-title":"Phase-aware speech enhancement based on deep neural networks","volume":"27","author":"Zheng","year":"2018","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Kounovsky, T., and Malek, J. (2017, January 24\u201326). Single channel speech enhancement using convolutional neural network. Proceedings of the International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM), Donostia-San Sebastian, Spain.","DOI":"10.1109\/ECMSM.2017.7945915"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Shah, N., Patil, H.A., and Soni, M.H. (2018, January 12\u201315). Time-frequency mask-based speech enhancement using convolutional generative adversarial network. Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA.","DOI":"10.23919\/APSIPA.2018.8659692"},{"key":"ref_51","unstructured":"Hasannezhad, M., Ouyang, Z., Zhu, W.P., and Champagne, B. (2020, January 7\u201310). An integrated CNN-GRU framework for complex ratio mask estimation in speech enhancement. Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Ouyang, Z., Yu, H., Zhu, W.P., and Champagne, B. (2019, January 12\u201317). A fully convolutional neural network for complex spectrogram processing in speech enhancement. 
Proceedings of the ICASSP 2019\u20142019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.","DOI":"10.1109\/ICASSP.2019.8683423"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"1862","DOI":"10.1109\/LSP.2016.2627029","article-title":"Low-rank and sparsity analysis applied to speech enhancement via online estimated dictionary","volume":"23","author":"Sun","year":"2016","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Shi, W., Zhang, X., Zou, X., Han, W., and Min, G. (2017, May 24\u201326). Auditory mask estimation by RPCA for monaural speech enhancement. Proceedings of the IEEE\/ACIS 16th International Conference on Computer and Information Science (ICIS), Wuhan, China.","DOI":"10.1109\/ICIS.2017.7959990"},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"1109","DOI":"10.1109\/TASSP.1984.1164453","article-title":"Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator","volume":"32","author":"Ephraim","year":"1984","journal-title":"IEEE Trans. Acoust. Speech Signal Process."},{"key":"ref_56","first-page":"20","article-title":"Comparing speech recognition systems (Microsoft API, Google API and CMU Sphinx)","volume":"7","author":"Bohouta","year":"2017","journal-title":"Int. J. Eng. Res. Appl."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"1372","DOI":"10.1134\/S1064226919120155","article-title":"Spectral phase estimation based on deep neural networks for single channel speech enhancement","volume":"64","author":"Saleem","year":"2019","journal-title":"J. Commun. Technol. 
Electron."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/20\/7782\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:53:32Z","timestamp":1760144012000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/20\/7782"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,13]]},"references-count":57,"journal-issue":{"issue":"20","published-online":{"date-parts":[[2022,10]]}},"alternative-id":["s22207782"],"URL":"https:\/\/doi.org\/10.3390\/s22207782","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,10,13]]}}}