{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T15:29:21Z","timestamp":1775230161745,"version":"3.50.1"},"reference-count":268,"publisher":"Springer Science and Business Media LLC","issue":"S3","license":[{"start":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T00:00:00Z","timestamp":1698192000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T00:00:00Z","timestamp":1698192000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Artif Intell Rev"],"published-print":{"date-parts":[[2023,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Deep neural network (DNN) techniques have become pervasive in domains such as natural language processing and computer vision, achieving great success in tasks such as machine translation and image generation. Owing to this success, these data-driven techniques have also been applied in the audio domain. More specifically, DNN models have been applied in speech enhancement and separation to perform speech denoising, dereverberation, speaker extraction and speaker separation. In this paper, we review the current DNN techniques employed for speech enhancement and separation. The review covers the whole pipeline of speech enhancement and separation: feature extraction, how DNN-based models capture both global and local features of speech, model training (supervised and unsupervised), and how these models address the label ambiguity problem. The review also covers the use of domain adaptation techniques and pre-trained models to boost the speech enhancement process. 
In this way, we hope to provide an all-inclusive reference to the state-of-the-art DNN-based techniques applied in the domain of speech separation and enhancement. We further discuss future research directions. This survey can be used by both academic researchers and industry practitioners working in the speech separation and enhancement domain.<\/jats:p>","DOI":"10.1007\/s10462-023-10612-2","type":"journal-article","created":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T06:01:41Z","timestamp":1698213701000},"page":"3651-3703","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":39,"title":["Deep neural network techniques for monaural speech enhancement and separation: state of the art analysis"],"prefix":"10.1007","volume":"56","author":[{"given":"Peter","family":"Ochieng","sequence":"first","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,10,25]]},"reference":[{"key":"10612_CR1","doi-asserted-by":"crossref","unstructured":"Adiga N, Pantazis Y, Tsiaras V, Stylianou Y (2019) Speech enhancement for noise-robust speech synthesis using wasserstein gan. In: INTERSPEECH, pp 1821\u20131825","DOI":"10.21437\/Interspeech.2019-2648"},{"key":"10612_CR2","doi-asserted-by":"crossref","unstructured":"Aihara R, Hanazawa T, Okato Y, Wichern G, Roux JL (2019) Teacher-student deep clustering for low-delay single channel speech separation. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2019-May, pp 690\u2013694","DOI":"10.1109\/ICASSP.2019.8682695"},{"key":"10612_CR3","doi-asserted-by":"crossref","unstructured":"Ai Y, Li H, Wang X, Yamagishi J, Ling Z (2021) Denoising-and-dereverberation hierarchical neural vocoder for robust waveform generation. 
In: 2021 IEEE spoken language technology workshop, SLT 2021\u2014proceedings, pp 477\u2013484","DOI":"10.1109\/SLT48900.2021.9383611"},{"key":"10612_CR4","doi-asserted-by":"crossref","unstructured":"Allen JB (1982) Applications of the short time Fourier transform to speech processing and spectral analysis. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 1982-May, pp 1012\u20131015","DOI":"10.1109\/ICASSP.1982.1171703"},{"issue":"11","key":"10612_CR5","doi-asserted-by":"crossref","first-page":"1558","DOI":"10.1109\/PROC.1977.10770","volume":"65","author":"JB Allen","year":"1977","unstructured":"Allen JB, Rabiner LR (1977) A unified approach to short-time fourier analysis and synthesis. Proc IEEE 65(11):1558\u20131564","journal-title":"Proc IEEE"},{"issue":"2","key":"10612_CR6","doi-asserted-by":"crossref","first-page":"996","DOI":"10.1121\/1.3609258","volume":"130","author":"I Arweiler","year":"2011","unstructured":"Arweiler I, Buchholz JM (2011) The influence of spectral characteristics of early reflections on speech intelligibility. J Acoust Soc Am 130(2):996\u20131005","journal-title":"J Acoust Soc Am"},{"issue":"3","key":"10612_CR7","doi-asserted-by":"crossref","first-page":"560","DOI":"10.4271\/2014-01-0975","volume":"7","author":"KR Avery","year":"2014","unstructured":"Avery KR, Pan J, Engler-Pinto CC, Wei Z, Yang F, Lin S, Luo L, Konson D (2014) Fatigue behavior of stainless steel sheet specimens at extremely high temperatures. SAE Int J Mater Manuf 7(3):560\u2013566","journal-title":"SAE Int J Mater Manuf"},{"key":"10612_CR8","doi-asserted-by":"crossref","unstructured":"Baby D, Virtanen T, Barker T, Van Hamme H (2014) Coupled dictionary training for exemplar-based speech enhancement. 
In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, pp 2883\u20132887","DOI":"10.1109\/ICASSP.2014.6854127"},{"key":"10612_CR9","first-page":"1","volume":"1","author":"A Baevski","year":"2020","unstructured":"Baevski A, Zhou H, Mohamed A, Auli M (2020) wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv Neural Inf Process Syst 1:1\u201319","journal-title":"Adv Neural Inf Process Syst"},{"key":"10612_CR10","doi-asserted-by":"crossref","unstructured":"Bahmaninezhad F, Wu J, Gu R, Zhang SX, Xu Y, Yu M, Yu D (2019) A comprehensive study of speech separation: spectrogram vs waveform separation. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2019-September, pp 4574\u20134578","DOI":"10.21437\/Interspeech.2019-3181"},{"key":"10612_CR11","unstructured":"Bai S, Kolter JZ, Koltun V (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, arXiv preprint arXiv:1803.01271"},{"key":"10612_CR12","doi-asserted-by":"crossref","unstructured":"Bando Y, Mimura M, Itoyama K, Yoshii K, Kawahara T (2018) Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization, ICASSP, In: IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2018-April, no. Mcmc, pp 716\u2013720","DOI":"10.1109\/ICASSP.2018.8461530"},{"key":"10612_CR13","unstructured":"Beltagy I, Peters ME, Cohan A (2020) Longformer: the long-document transformer, arXiv preprint http:\/\/arxiv.org\/abs\/2004.05150"},{"key":"10612_CR14","doi-asserted-by":"crossref","first-page":"2993","DOI":"10.1109\/TASLP.2022.3207349","volume":"30","author":"X Bie","year":"2022","unstructured":"Bie X, Leglaive S, Alameda-Pineda X, Girin L (2022) Unsupervised speech enhancement using dynamical variational autoencoders. 
IEEE\/ACM Trans Audio Speech Lang Process 30:2993\u20133007","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR15","doi-asserted-by":"crossref","unstructured":"Brungart DS, Chang PS, Simpson BD, Wang D (2006) Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation, pp 4007\u20134018","DOI":"10.1121\/1.2363929"},{"key":"10612_CR16","doi-asserted-by":"crossref","first-page":"2753","DOI":"10.1109\/TASLP.2021.3101617","volume":"29","author":"J Byun","year":"2021","unstructured":"Byun J, Shin JW (2021) Monaural speech separation using speaker embedding from preliminary separation. IEEE\/ACM Trans Audio Speech Lang Process 29:2753\u20132763","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR17","doi-asserted-by":"crossref","unstructured":"Cao R, Abdulatif S, Yang B (2022) CMGAN: conformer-based metric GAN for speech enhancement, arXiv preprint arXiv:2209.11112, pp 936\u2013940","DOI":"10.36227\/techrxiv.21187846.v2"},{"key":"10612_CR18","doi-asserted-by":"crossref","unstructured":"Chandna P, Miron M, Janer J, G\u00f3mez E (2017) Monoaural audio source separation using deep convolutional neural networks, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 10169 LNCS, pp 258\u2013266","DOI":"10.1007\/978-3-319-53547-0_25"},{"key":"10612_CR19","doi-asserted-by":"crossref","unstructured":"Chang X, Zhang W, Qian Y, Le\u00a0Roux J, Watanabe S (2020) End-to-end multi-speaker speech recognition with transformer. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). 
IEEE, 2020, pp 6134\u20136138","DOI":"10.1109\/ICASSP40776.2020.9054029"},{"key":"10612_CR25","doi-asserted-by":"crossref","unstructured":"Chen Z, Watanabe S, Erdogan H, Hershey JR (2015) Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2015-January, 2015, pp 3274\u20133278","DOI":"10.21437\/Interspeech.2015-659"},{"key":"10612_CR23","doi-asserted-by":"crossref","unstructured":"Chen Z, Luo Y, Mesgarani N (2017) Deep attractor network for single-microphone speaker separation. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 246\u2013250. IEEE","DOI":"10.1109\/ICASSP.2017.7952155"},{"key":"10612_CR24","doi-asserted-by":"crossref","unstructured":"Chen J, Mao Q, Liu D (2020) Dual-path transformer network: direct context-aware modeling for end-to-end monaural speech separation. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2020-October, pp 2642\u20132646","DOI":"10.21437\/Interspeech.2020-2205"},{"key":"10612_CR26","doi-asserted-by":"crossref","unstructured":"Chen S, Wu Y, Chen Z, Wu J, Yoshioka T, Liu S, Li J, Yu X (2021) Ultra fast speech separation model with teacher student learning. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 3, pp 2298\u20132302","DOI":"10.21437\/Interspeech.2021-142"},{"key":"10612_CR21","doi-asserted-by":"crossref","unstructured":"Chen L-W, Cheng Y-F, Lee H-S, Tsao Y, Wang H-M (2023a) A training and inference strategy using noisy and enhanced speech as target for speech enhancement without clean speech. 
In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, pp 5315\u20135319","DOI":"10.21437\/Interspeech.2023-1548"},{"issue":"1","key":"10612_CR22","doi-asserted-by":"crossref","first-page":"469","DOI":"10.3390\/app13010469","volume":"13","author":"L Chen","year":"2023","unstructured":"Chen L, Mo Z, Ren J, Cui C, Zhao Q (2023b) An electroglottograph auxiliary neural network for target speaker extraction. Appl Sci 13(1):469","journal-title":"Appl Sci"},{"key":"10612_CR28","doi-asserted-by":"crossref","unstructured":"Cho K, Van Merri\u00ebnboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP 2014-2014 conference on empirical methods in natural language processing, proceedings of the conference, pp 1724\u20131734","DOI":"10.3115\/v1\/D14-1179"},{"key":"10612_CR27","doi-asserted-by":"crossref","unstructured":"Choi H-S, Heo H, Lee JH, Lee K (2020) Phase-aware single-stage speech denoising and dereverberation with U-Net. arXiv preprint arXiv:2006.00687","DOI":"10.1109\/ICASSP39728.2021.9414852"},{"key":"10612_CR29","doi-asserted-by":"crossref","unstructured":"Chung YA, Hsu WN, Tang H, Glass J (2019) An unsupervised autoregressive model for speech representation learning. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2019-September, pp 146\u2013150","DOI":"10.21437\/Interspeech.2019-1473"},{"key":"10612_CR30","doi-asserted-by":"crossref","unstructured":"Chung YA, Tang H, Glass J (2020) Vector-quantized autoregressive predictive coding. 
In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2020-October, no.\u00a01, pp 3760\u20133764","DOI":"10.21437\/Interspeech.2020-1228"},{"key":"10612_CR31","doi-asserted-by":"crossref","unstructured":"Cord-Landwehr T, Boeddeker C, von Neumann T, Zorila C, Doddipatla R, Haeb-Umbach R (2021) Monaural source separation: from anechoic to reverberant environments. In: 2022 international workshop on acoustic signal enhancement (IWAENC), pp 1\u20135. arXiv:org\/abs\/2111.07578","DOI":"10.1109\/IWAENC53105.2022.9914794"},{"key":"10612_CR32","doi-asserted-by":"crossref","unstructured":"de\u00a0Oliveira D, Peer T, Gerkmann T (2022) Efficient transformer-based speech enhancement using long frames and STFT magnitudes, arXiv preprint arXiv:2206.11703., no.\u00a01, pp 2948\u20132952","DOI":"10.21437\/Interspeech.2022-10781"},{"key":"10612_CR33","doi-asserted-by":"crossref","unstructured":"D\u00e9fossez A, Synnaeve G, Adi Y (2020) Real time speech enhancement in the waveform domain. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2020-October, pp 3291\u20133295","DOI":"10.21437\/Interspeech.2020-2409"},{"key":"10612_CR34","doi-asserted-by":"crossref","unstructured":"Delcroix M, Zmolikova K, Kinoshita K, Ogawa A, Nakatani T (2018) Single channel target speaker extraction and recognition with speaker beam. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5554\u20135558. IEEE","DOI":"10.1109\/ICASSP.2018.8462661"},{"key":"10612_CR35","doi-asserted-by":"crossref","unstructured":"Donahue C, Li B, Prabhavalkar R (2018) Exploring speech enhancement with generative adversarial networks for robust speech recognition. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2018-April, no. 
Figure 1, pp 5024\u20135028","DOI":"10.1109\/ICASSP.2018.8462581"},{"key":"10612_CR36","doi-asserted-by":"crossref","unstructured":"Dovrat S, Nachmani E, Wolf L (2021) Many-speakers single channel speech separation with optimal permutation training. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 4, pp 2408\u20132412","DOI":"10.21437\/Interspeech.2021-493"},{"key":"10612_CR38","doi-asserted-by":"crossref","unstructured":"Du J, Huo Q (2008) A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, pp 569\u2013572","DOI":"10.21437\/Interspeech.2008-168"},{"key":"10612_CR40","doi-asserted-by":"crossref","unstructured":"Du J, Tu Y, Xu Y, Dai L, Lee CH (2014) Speech separation of a target speaker based on deep neural networks. In: International conference on signal processing proceedings, ICSP, vol 2015-January, no. October, pp 473\u2013477","DOI":"10.1109\/ICOSP.2014.7015050"},{"key":"10612_CR37","doi-asserted-by":"crossref","first-page":"1493","DOI":"10.1109\/TASLP.2020.2991537","volume":"28","author":"Z Du","year":"2020","unstructured":"Du Z, Zhang X, Han J (2020) A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement. IEEE\/ACM Trans Audio Speech Lang Process 28:1493\u20131505","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR39","doi-asserted-by":"crossref","unstructured":"Dupuis E, Novo D, O\u2019Connor I, Bosio A (2020) Sensitivity analysis and compression opportunities in DNNs using weight sharing. 
In: Proceedings\u20142020 23rd international symposium on design and diagnostics of electronic circuits and systems, DDECS 2020","DOI":"10.1109\/DDECS50862.2020.9095658"},{"issue":"6","key":"10612_CR41","doi-asserted-by":"crossref","first-page":"1109","DOI":"10.1109\/TASSP.1984.1164453","volume":"32","author":"Y Ephraim","year":"1984","unstructured":"Ephraim Y, Malah D (1984) Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 32(6):1109\u20131121","journal-title":"IEEE Trans Acoust Speech Signal Process"},{"key":"10612_CR42","doi-asserted-by":"crossref","unstructured":"Erdogan H, Hershey JR, Watanabe S, Le Roux J (2015) Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2015-August, pp 708\u2013712","DOI":"10.1109\/ICASSP.2015.7178061"},{"key":"10612_CR43","unstructured":"Erhan D, Courville A, Bengio Y, Vincent P (2010) Why does unsupervised pre-training help deep learning? In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, pp 201\u2013208"},{"key":"10612_CR44","doi-asserted-by":"crossref","first-page":"1303","DOI":"10.1109\/TASLP.2020.2982029","volume":"28","author":"C Fan","year":"2020","unstructured":"Fan C, Tao J, Liu B, Yi J, Wen Z, Liu X (2020) End-to-end post-filter for speech separation with deep attention fusion features. IEEE\/ACM Trans Audio Speech Lang Process 28:1303\u20131314","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR45","doi-asserted-by":"crossref","unstructured":"Fedorov I, Stamenovic M, Jensen C, Yang LC, Mandell A, Gan Y, Mattina M, Whatmough PN (2020) TinyLSTMs: efficient neural speech enhancement for hearing aids. 
In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2020-October, pp 4054\u20134058","DOI":"10.21437\/Interspeech.2020-1864"},{"key":"10612_CR46","unstructured":"Friedman DH (1985) Instantaneous-frequency distribution vs. time: an interpretation of the phase structure of speech. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, pp 1121\u20131124"},{"key":"10612_CR49","doi-asserted-by":"crossref","unstructured":"Fu SW, Tsao Y, Lu X (2016) SNR-aware convolutional neural network modeling for speech enhancement. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 08-12-Sept, pp 3768\u20133772","DOI":"10.21437\/Interspeech.2016-211"},{"key":"10612_CR47","doi-asserted-by":"crossref","unstructured":"Fu SW, Hu TY, Tsao Y, Lu X (2017) Complex spectrogram enhancement by convolutional neural network with multi-metrics learning. In: IEEE international workshop on machine learning for signal processing, MLSP, vol 2017-September, pp 1\u20136","DOI":"10.1109\/MLSP.2017.8168119"},{"key":"10612_CR50","unstructured":"Fu SW, Wang TW, Tsao Y, Lu X, Kawai H, Stoller D, Ewert S, Dixon S, Lu X, Tsao Y, Matsuda S, Hori C, Xu Y, Du J, Dai LR, Lee CH, Gao T, Du J, Dai LR, Lee CH, Fu SW, Tsao Y, Lu X, Weninger F, Hershey JR, Le Roux J, Schuller B, Xu Y, Du J, Dai LR, Lee CH, Llu\u00eds F, Pons J, Serra X (2018a) Speech enhancement based on deep denoising autoencoder. 
In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 08-12-Sept, no.\u00a01, pp 7\u201319"},{"issue":"9","key":"10612_CR52","doi-asserted-by":"crossref","first-page":"1570","DOI":"10.1109\/TASLP.2018.2821903","volume":"26","author":"SW Fu","year":"2018","unstructured":"Fu SW, Wang TW, Tsao Y, Lu X, Kawai H (2018b) End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks. IEEE\/ACM Trans Audio Speech Lang Process 26(9):1570\u20131584","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR48","unstructured":"Fu SW, Liao CF, Tsao Y, Lin SD (2019) MetricGAN: generative adversarial networks based black-box metric scores optimization for speech enhancement. In: 36th international conference on machine learning, ICML 2019, vol 2019-June, pp 3566\u20133576"},{"key":"10612_CR51","doi-asserted-by":"crossref","unstructured":"Fu S-W, Yu C, Hung K-H, Ravanelli M, Tsao Y (2022) Metricgan-u: unsupervised speech enhancement\/dereverberation based only on noisy\/reverberated speech. In: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 7412\u20137416. IEEE","DOI":"10.1109\/ICASSP43922.2022.9747180"},{"key":"10612_CR53","doi-asserted-by":"crossref","unstructured":"Fujimura T, Koizumi Y, Yatabe K, Miyazaki R (2021) Noisy-target training: a training strategy for DNN-based speech enhancement without clean speech. In: 2021 29th european signal processing conference (EUSIPCO), pp 436\u2013440. IEEE","DOI":"10.23919\/EUSIPCO54536.2021.9616166"},{"key":"10612_CR54","doi-asserted-by":"crossref","unstructured":"Gamper H, Tashev IJ (2018) Blind reverberation time estimation using a convolutional neural network. 
In: 16th international workshop on acoustic signal enhancement, IWAENC 2018\u2014proceedings, pp 136\u2013140","DOI":"10.1109\/IWAENC.2018.8521241"},{"issue":"1","key":"10612_CR55","first-page":"2030","volume":"17","author":"Y Ganin","year":"2016","unstructured":"Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M, Lempitsky V (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(1):2030\u20132096","journal-title":"J Mach Learn Res"},{"issue":"4","key":"10612_CR56","doi-asserted-by":"crossref","first-page":"692","DOI":"10.1109\/TASLP.2016.2647702","volume":"25","author":"S Gannot","year":"2017","unstructured":"Gannot S, Vincent E, Markovich-Golan S, Ozerov A (2017) A consolidated perspective on multimicrophone speech enhancement and source separation. IEEE\/ACM Trans Audio Speech Lang Process 25(4):692\u2013730","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR57","doi-asserted-by":"crossref","unstructured":"Gao T, Du J, Dai LR, Lee CH (2016) SNR-based progressive learning of deep neural network for speech enhancement. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 08-12-Sept, pp 3713\u20133717","DOI":"10.21437\/Interspeech.2016-224"},{"issue":"3","key":"10612_CR58","doi-asserted-by":"crossref","first-page":"508","DOI":"10.1109\/TASL.2008.916519","volume":"16","author":"G Garau","year":"2008","unstructured":"Garau G, Renals S (2008) Combining spectral representations for large-vocabulary continuous speech recognition. 
IEEE Trans Audio Speech Lang Process 16(3):508\u2013518","journal-title":"IEEE Trans Audio Speech Lang Process"},{"key":"10612_CR59","doi-asserted-by":"crossref","unstructured":"Germain FG, Chen Q, Koltun V (2018) Speech denoising with deep feature losses, arXiv preprint arXiv:1806.10522","DOI":"10.21437\/Interspeech.2019-1924"},{"key":"10612_CR60","doi-asserted-by":"crossref","unstructured":"Gholami A, Kim S, Dong Z, Yao Z, Mahoney MW, Keutzer K (2022) A survey of quantization methods for efficient neural network inference, low-power computer vision, pp 291\u2013326","DOI":"10.1201\/9781003162810-13"},{"key":"10612_CR61","unstructured":"Goodfellow I (2016) NIPS 2016 Tutorial: generative adversarial networks, arXiv preprint arXiv. arXiv:org\/abs\/1701.00160"},{"key":"10612_CR62","doi-asserted-by":"crossref","first-page":"1789","DOI":"10.1007\/s11263-021-01453-z","volume":"129","author":"J Gou","year":"2021","unstructured":"Gou J, Yu B, Maybank SJ, Tao D (2021) Knowledge distillation: a survey. Int J Comput Vis 129:1789\u20131819","journal-title":"Int J Comput Vis"},{"key":"10612_CR63","doi-asserted-by":"crossref","unstructured":"Grais EM, Plumbley MD (2018) Single channel audio source separation using convolutional denoising autoencoders. In: 2017 IEEE global conference on signal and information processing, GlobalSIP 2017\u2014proceedings, vol 2018-Janua, pp 1265\u20131269","DOI":"10.1109\/GlobalSIP.2017.8309164"},{"key":"10612_CR64","doi-asserted-by":"crossref","unstructured":"Grais EM, Sen MU, Erdogan H (2014) Deep neural networks for single channel source separation. 
In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, pp 3734\u20133738","DOI":"10.1109\/ICASSP.2014.6854299"},{"issue":"2","key":"10612_CR65","doi-asserted-by":"crossref","first-page":"236","DOI":"10.1109\/TASSP.1984.1164317","volume":"32","author":"DW Griffin","year":"1984","unstructured":"Griffin DW, Lim JS (1984) Signal estimation from modified short-time fourier transform. IEEE Trans Acoust Speech Signal Process 32(2):236\u2013243","journal-title":"IEEE Trans Acoust Speech Signal Process"},{"key":"10612_CR66","doi-asserted-by":"crossref","unstructured":"Gulati A, Qin J, Chiu CC, Parmar N, Zhang Y, Yu J, Han W, Wang S, Zhang Z, Wu Y, Pang R (2020) Conformer: convolution-augmented transformer for speech recognition. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2020-October, pp 5036\u20135040","DOI":"10.21437\/Interspeech.2020-3015"},{"key":"10612_CR67","unstructured":"Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC (2017) Improved training of wasserstein gans. Adv Neural Inf Process Syst 30"},{"issue":"5","key":"10612_CR68","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1109\/LSP.2010.2042530","volume":"17","author":"D Gunawan","year":"2010","unstructured":"Gunawan D, Sen D (2010) Iterative phase estimation for the synthesis of separated sources from single-channel mixtures. IEEE Signal Process Lett 17(5):421\u2013424","journal-title":"IEEE Signal Process Lett"},{"issue":"6","key":"10612_CR69","doi-asserted-by":"crossref","first-page":"982","DOI":"10.1109\/TASLP.2015.2416653","volume":"23","author":"K Han","year":"2015","unstructured":"Han K, Wang Y, Wang DL, Woods WS, Merks I, Zhang T (2015) Learning spectral mapping for speech dereverberation and denoising. 
IEEE Trans Audio Speech Lang Process 23(6):982\u2013992","journal-title":"IEEE Trans Audio Speech Lang Process"},{"issue":"5","key":"10612_CR70","first-page":"1","volume":"5","author":"C Han","year":"2019","unstructured":"Han C, O\u2019Sullivan J, Luo Y, Herrero J, Mehta AD, Mesgarani N (2019) Speaker-independent auditory attention decoding without access to clean speech sources. Sci Adv 5(5):1\u201312","journal-title":"Sci Adv"},{"key":"10612_CR71","doi-asserted-by":"crossref","first-page":"216","DOI":"10.1016\/j.neunet.2022.11.013","volume":"158","author":"X Hao","year":"2023","unstructured":"Hao X, Xu C, Xie L (2023) Neural speech enhancement with unsupervised pre-training and mixture training. Neural Netw 158:216\u2013227","journal-title":"Neural Netw"},{"issue":"4","key":"10612_CR72","volume":"1213","author":"Y He","year":"2019","unstructured":"He Y, Zhao J (2019) Temporal convolutional networks for anomaly detection in time series. J Phys 1213(4):042050","journal-title":"J Phys"},{"key":"10612_CR73","doi-asserted-by":"crossref","unstructured":"Heitkaemper J, Jakobeit D, Boeddeker C, Drude L, Haeb-Umbach R (2020) Demystifying TasNet: a dissecting approach. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2020-May, pp 6359\u20136363","DOI":"10.1109\/ICASSP40776.2020.9052981"},{"key":"10612_CR74","doi-asserted-by":"crossref","unstructured":"Hershey JR, Chen Z, Le Roux J, Watanabe S (2016) Deep clustering: discriminative embeddings for segmentation and separation. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 31\u201335. IEEE","DOI":"10.1109\/ICASSP.2016.7471631"},{"issue":"2","key":"10612_CR76","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1142\/S1793005715400013","volume":"11","author":"TD Hien","year":"2015","unstructured":"Hien TD, Tuan DV, At PV, Son LH (2015) Novel algorithm for non-negative matrix factorization. 
New Math Nat Comput 11(2):121\u2013133","journal-title":"New Math Nat Comput"},{"key":"10612_CR77","unstructured":"Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531 2.7, pp 1\u20139"},{"issue":"8","key":"10612_CR78","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735\u20131780","journal-title":"Neural Comput"},{"key":"10612_CR79","unstructured":"Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) MobileNets: efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861"},{"key":"10612_CR80","doi-asserted-by":"crossref","unstructured":"Hsu YT, Lin YC, Fu SW, Tsao Y, Kuo TW (2019) A study on speech enhancement using exponent-only floating point quantized neural network (EOFP-QNN). In: 2018 IEEE spoken language technology workshop, SLT 2018\u2014proceedings, pp 566\u2013573","DOI":"10.1109\/SLT.2018.8639508"},{"key":"10612_CR81","doi-asserted-by":"crossref","first-page":"3451","DOI":"10.1109\/TASLP.2021.3122291","volume":"29","author":"WN Hsu","year":"2021","unstructured":"Hsu WN, Bolte B, Tsai YHH, Lakhotia K, Salakhutdinov R, Mohamed A (2021) HuBERT: self-supervised speech representation learning by masked prediction of hidden units. IEEE\/ACM Trans Audio Speech Lang Process 29:3451\u20133460","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR83","first-page":"22509","volume":"27","author":"X Hu","year":"2021","unstructured":"Hu X, Li K, Zhang W, Luo Y, Lemercier JM, Gerkmann T (2021) Speech separation using an asynchronous fully recurrent convolutional neural network. 
Adv Neural Inf Process Syst 27:22509\u201322522","journal-title":"Adv Neural Inf Process Syst"},{"issue":"1","key":"10612_CR84","doi-asserted-by":"crossref","first-page":"33","DOI":"10.5506\/APhysPolB.42.33","volume":"42","author":"P-S Huang","year":"2011","unstructured":"Huang P-S, Kim M, Hasegawa-Johnson M, Smaragdis P (2011) Deep learning for monaural speech separation. Acta Phys Pol B 42(1):33\u201344","journal-title":"Acta Phys Pol B"},{"issue":"12","key":"10612_CR85","doi-asserted-by":"crossref","first-page":"2136","DOI":"10.1109\/TASLP.2015.2468583","volume":"23","author":"PS Huang","year":"2015","unstructured":"Huang PS, Kim M, Hasegawa-Johnson M, Smaragdis P (2015) Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE\/ACM Trans Audio Speech Lang Process 23(12):2136\u20132147","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR86","doi-asserted-by":"crossref","unstructured":"Huang Z, Watanabe S, Yang SW, Garc\u00eda P, Khudanpur S (2022) Investigating self-supervised learning for speech enhancement and separation. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2022-May, pp 6837\u20136841","DOI":"10.1109\/ICASSP43922.2022.9746303"},{"key":"10612_CR87","doi-asserted-by":"crossref","unstructured":"Hung K-H, Fu S-w, Tseng H-H, Chiang H-T, Tsao Y, Lin C-W, (2022) Boosting self-supervised embeddings for speech enhancement, arXiv preprint arXiv:2204.03339","DOI":"10.21437\/Interspeech.2022-10002"},{"key":"10612_CR88","doi-asserted-by":"crossref","unstructured":"Irvin B, Stamenovic M, Kegler M, Yang L-C (2023) Self-supervised learning for speech enhancement through synthesis. In: ICASSP 2023-2023 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 1\u20135. 
IEEE","DOI":"10.1109\/ICASSP49357.2023.10094705"},{"key":"10612_CR90","doi-asserted-by":"crossref","unstructured":"Isik Y, Le Roux J, Chen Z, Watanabe S, Hershey JR (2016) Single-channel multi-speaker separation using deep clustering. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 08-12-Sept, pp 545\u2013549","DOI":"10.21437\/Interspeech.2016-1176"},{"key":"10612_CR89","doi-asserted-by":"crossref","unstructured":"Isik U, Giri R, Phansalkar N, Valin JM, Helwani K, Krishnaswamy A (2020) PoCoNet: better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2020-October, pp 2487\u20132491","DOI":"10.21437\/Interspeech.2020-3027"},{"key":"10612_CR91","doi-asserted-by":"crossref","unstructured":"Isola P, Zhu J-Y, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1125\u20131134","DOI":"10.1109\/CVPR.2017.632"},{"key":"10612_CR92","unstructured":"Jansson A, Humphrey E, Montecchio N, Bittner R, Kumar A, Weyde T (2017) Singing voice separation with deep U-Net convolutional networks. In: Proceedings of the 18th international society for music information retrieval conference, ISMIR 2017, pp 745\u2013751"},{"key":"10612_CR96","doi-asserted-by":"crossref","unstructured":"Ji X, Yu M, Zhang C, Su D, Yu T, Liu X, Yu D (2020) Speaker-aware target speaker enhancement by jointly learning with speaker embedding extraction. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 7294\u20137298. 
IEEE","DOI":"10.1109\/ICASSP40776.2020.9054311"},{"key":"10612_CR93","doi-asserted-by":"crossref","first-page":"1859","DOI":"10.1109\/LSP.2020.3029704","volume":"27","author":"F Jiang","year":"2020","unstructured":"Jiang F, Duan Z (2020) Speaker attractor network: generalizing speech separation to unseen numbers of sources. IEEE Signal Process Lett 27:1859\u20131863","journal-title":"IEEE Signal Process Lett"},{"issue":"12","key":"10612_CR94","doi-asserted-by":"crossref","first-page":"2112","DOI":"10.1109\/TASLP.2014.2361023","volume":"22","author":"Y Jiang","year":"2014","unstructured":"Jiang Y, Wang DL, Liu RS, Feng ZM (2014) Binaural classification for reverberant speech segregation using deep neural networks. IEEE\/ACM Trans Audio Speech Lang Process 22(12):2112\u20132121","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"issue":"4","key":"10612_CR95","doi-asserted-by":"crossref","first-page":"625","DOI":"10.1109\/TASL.2008.2010633","volume":"17","author":"Z Jin","year":"2009","unstructured":"Jin Z, Wang D (2009) A supervised learning approach to monaural segregation of reverberant speech. IEEE Trans Audio Speech Lang Process 17(4):625\u2013638","journal-title":"IEEE Trans Audio Speech Lang Process"},{"key":"10612_CR97","doi-asserted-by":"crossref","unstructured":"Karamatl\u0131 E, K\u0131rb\u0131z S (2022) Mixcycle: unsupervised speech separation via cyclic mixture permutation invariant training. IEEE Signal Process Lett","DOI":"10.1109\/LSP.2022.3232276"},{"key":"10612_CR98","doi-asserted-by":"crossref","unstructured":"Kavalerov I, Wisdom S, Erdogan H, Patton B, Wilson K, Le Roux J, Hershey JR (2019) Universal sound separation. 
In: IEEE workshop on applications of signal processing to audio and acoustics, vol 2019-October, pp 175\u2013179","DOI":"10.1109\/WASPAA.2019.8937253"},{"key":"10612_CR99","doi-asserted-by":"crossref","unstructured":"Kim M, Smaragdis P (2015) Adaptive denoising autoencoders: a fine-tuning scheme to learn from test mixtures. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 9237, pp 100\u2013107","DOI":"10.1007\/978-3-319-22482-4_12"},{"key":"10612_CR100","unstructured":"Kingma DP, Welling M (2014) Auto-encoding variational bayes. In: 2nd international conference on learning representations, ICLR 2014\u2014conference track proceedings, no.\u00a0Ml, pp 1\u201314"},{"key":"10612_CR101","doi-asserted-by":"crossref","unstructured":"Kinoshita K, Drude L, Delcroix M, Nakatani T (2018) Listening to each speaker one by one with recurrent selective hearing networks. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2018-April, pp 5064\u20135068","DOI":"10.1109\/ICASSP.2018.8462646"},{"key":"10612_CR102","unstructured":"Kitaev N, Kaiser \u0141, Levskaya A (2020) Reformer: the efficient transformer. In: International conference on learning representations, pp 1\u201312 arXiv:org\/abs\/2001.04451"},{"issue":"3","key":"10612_CR103","doi-asserted-by":"crossref","first-page":"1415","DOI":"10.1121\/1.3179673","volume":"126","author":"U Kjems","year":"2009","unstructured":"Kjems U, Boldt JB, Pedersen MS, Lunner T, Wang D (2009) Role of mask pattern in intelligibility of ideal binary-masked noisy speech. J Acoust Soc Am 126(3):1415\u20131426","journal-title":"J Acoust Soc Am"},{"key":"10612_CR104","doi-asserted-by":"crossref","unstructured":"Koizumi Y, Niwa K, Hioka Y, Kobayashi K, Haneda Y (2017) DNN-based source enhancement self-optimized by reinforcement learning using sound quality measurements. 
In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, pp 81\u201385","DOI":"10.1109\/ICASSP.2017.7952122"},{"issue":"10","key":"10612_CR105","doi-asserted-by":"crossref","first-page":"1901","DOI":"10.1109\/TASLP.2017.2726762","volume":"25","author":"M Kolb\u00e6k","year":"2017","unstructured":"Kolb\u00e6k M, Yu D, Tan Z-H, Jensen J (2017a) Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE\/ACM Trans Audio Speech Lang Process 25(10):1901\u20131913","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"issue":"1","key":"10612_CR106","first-page":"149","volume":"25","author":"M Kolb\u00e6k","year":"2017","unstructured":"Kolb\u00e6k M, Tan ZH, Jensen J (2017b) Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE\/ACM Trans Audio Speech Lang Process 25(1):149\u2013163","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"issue":"2","key":"10612_CR107","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1109\/TASLP.2018.2877909","volume":"27","author":"M Kolbaek","year":"2018","unstructured":"Kolbaek M, Tan Z-H, Jensen J (2018a) On the relationship between short-time objective intelligibility and short-time spectral-amplitude mean-square error for speech enhancement. IEEE\/ACM Trans Audio Speech Lang Process 27(2):283\u2013295","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR109","doi-asserted-by":"crossref","unstructured":"Kolb\u00e6k M, Tan ZH, Jensen J (2018b) Monaural speech enhancement using deep neural networks by maximizing a short-time objective intelligibility measure.
In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2018-April, pp 5059\u20135063","DOI":"10.1109\/ICASSP.2018.8462040"},{"key":"10612_CR108","doi-asserted-by":"crossref","first-page":"825","DOI":"10.1109\/TASLP.2020.2968738","volume":"28","author":"M Kolbaek","year":"2020","unstructured":"Kolbaek M, Tan ZH, Jensen SH, Jensen J (2020) On loss functions for supervised monaural time-domain speech enhancement. IEEE\/ACM Trans Audio Speech Lang Process 28:825\u2013838","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR110","first-page":"17\u00a0022","volume":"33","author":"J Kong","year":"2020","unstructured":"Kong J, Kim J, Bae J (2020) Hifi-gan: generative adversarial networks for efficient and high fidelity speech synthesis. Adv Neural Inf Process Syst 33:17\u00a0022-17\u00a0033","journal-title":"Adv Neural Inf Process Syst"},{"key":"10612_CR111","doi-asserted-by":"crossref","unstructured":"Kong Z, Ping W, Dantrey A, Catanzaro B (2022) Speech denoising in the waveform domain with self-attention. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2022-May, pp 7867\u20137871","DOI":"10.1109\/ICASSP43922.2022.9746169"},{"key":"10612_CR112","doi-asserted-by":"crossref","first-page":"1600","DOI":"10.1109\/TASLP.2022.3155286","volume":"30","author":"V Kothapally","year":"2022","unstructured":"Kothapally V, Hansen JH (2022a) Skipconvgan: monaural speech dereverberation using generative adversarial networks via complex time-frequency masking.
IEEE\/ACM Trans Audio Speech Lang Process 30:1600\u20131613","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR113","doi-asserted-by":"crossref","unstructured":"Kothapally V, Hansen JH (2022b) Complex-valued time-frequency self-attention for speech dereverberation, arXiv preprint arXiv:2211.12632","DOI":"10.21437\/Interspeech.2022-11277"},{"key":"10612_CR114","doi-asserted-by":"crossref","unstructured":"Kumar A, Florencio D (2016) Speech enhancement in multiple-noise conditions using deep neural networks. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 08-12-September-2016, pp 3738\u20133742","DOI":"10.21437\/Interspeech.2016-88"},{"key":"10612_CR115","doi-asserted-by":"crossref","unstructured":"Lam MW, Wang J, Su D, Yu D (2021a) Effective low-cost time-domain audio separation using globally attentive locally recurrent networks. In: 2021 IEEE spoken language technology workshop, SLT 2021\u2013proceedings, pp 801\u2013808","DOI":"10.1109\/SLT48900.2021.9383464"},{"key":"10612_CR116","doi-asserted-by":"crossref","unstructured":"Lam MW, Wang J, Su D, Yuy D (2021b) Sandglasset: a light multi-granularity self-attentive network for time-domain speech separation. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2021-June, pp 5759\u20135763","DOI":"10.1109\/ICASSP39728.2021.9413837"},{"issue":"2","key":"10612_CR117","doi-asserted-by":"crossref","first-page":"370","DOI":"10.1109\/JSTSP.2019.2904183","volume":"13","author":"J Le Roux","year":"2019","unstructured":"Le Roux J, Wichern G, Watanabe S, Sarroff A, Hershey JR (2019) Phasebook and friends: leveraging discrete representations for source separation. 
IEEE J Sel Top Sign Process 13(2):370\u2013382","journal-title":"IEEE J Sel Top Sign Process"},{"key":"10612_CR118","doi-asserted-by":"crossref","unstructured":"Lea C, Flynn MD, Vidal R, Reiter A, Hager GD (2017) Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 156\u2013165","DOI":"10.1109\/CVPR.2017.113"},{"key":"10612_CR119","doi-asserted-by":"crossref","unstructured":"Lee Y-S, Wang C-Y, Wang S-F, Wang J-C, Wu C-H (2017) Fully complex deep neural network for phase-incorporating monaural source separation. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 281\u2013285. IEEE","DOI":"10.1109\/ICASSP.2017.7952162"},{"key":"10612_CR120","doi-asserted-by":"crossref","first-page":"56\u00a0031","DOI":"10.1109\/ACCESS.2022.3176003","volume":"10","author":"JH Lee","year":"2022","unstructured":"Lee JH, Chang JH, Yang JM, Moon HG (2022) NAS-TasNet: neural architecture search for time-domain speech separation. IEEE Access 10:56\u00a0031-56\u00a0043","journal-title":"IEEE Access"},{"key":"10612_CR123","doi-asserted-by":"crossref","unstructured":"Leglaive S, Girin L, Horaud R (2018) A variance modeling framework based on variational autoencoders for speech enhancement. In: IEEE international workshop on machine learning for signal processing, MLSP, vol. 2018-September","DOI":"10.1109\/MLSP.2018.8516711"},{"key":"10612_CR124","doi-asserted-by":"crossref","unstructured":"Leglaive S, Simsekli U, Liutkus A, Girin L, Horaud R (2019) Speech enhancement with variational autoencoders and alpha-stable distributions.
In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2019-May, pp 541\u2013545","DOI":"10.1109\/ICASSP.2019.8682546"},{"key":"10612_CR121","doi-asserted-by":"crossref","unstructured":"Leglaive S, Alameda-Pineda X, Girin L, Horaud R (2020) A recurrent variational autoencoder for speech enhancement. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2020-May, pp 371\u2013375","DOI":"10.1109\/ICASSP40776.2020.9053164"},{"key":"10612_CR125","unstructured":"Lehtinen J, Munkberg J, Hasselgren J, Laine S, Karras T, Aittala M, Aila T (2018) Noise2noise: learning image restoration without clean data, arXiv preprint arXiv:1803.04189"},{"key":"10612_CR126","unstructured":"Le\u00f3n D, Tobar F (2021) Late reverberation suppression using U-nets, arXiv preprint arXiv:2110.02144., no.\u00a01"},{"key":"10612_CR138","doi-asserted-by":"crossref","unstructured":"Li K, Wu B, Lee CH (2016) An iterative phase recovery framework with phase mask for spectral mapping with an application to speech enhancement. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 08-12-Sept, pp 3773\u20133777","DOI":"10.21437\/Interspeech.2016-494"},{"key":"10612_CR140","unstructured":"Li H, Zhang X, Zhang H, Gao G (2017) Integrated speech enhancement method based on weighted prediction error and DNN for dereverberation and denoising, arXiv preprint arXiv:1708.08251"},{"issue":"11","key":"10612_CR127","doi-asserted-by":"publisher","first-page":"5005","DOI":"10.1007\/s00034-018-0798-4","volume":"37","author":"ZX Li","year":"2018","unstructured":"Li ZX, Dai LR, Song Y, McLoughlin I (2018a) A conditional generative model for speech enhancement. Circ Syst Signal Process 37(11):5005\u20135022. 
https:\/\/doi.org\/10.1007\/s00034-018-0798-4","journal-title":"Circ Syst Signal Process"},{"key":"10612_CR139","doi-asserted-by":"crossref","unstructured":"Li Y, Zhang X, Chen D (2018b) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 1091\u20131100","DOI":"10.1109\/CVPR.2018.00120"},{"issue":"1","key":"10612_CR128","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1109\/TAI.2021.3119927","volume":"3","author":"Y Li","year":"2021","unstructured":"Li Y, Sun Y, Horoshenkov K, Naqvi SM (2021a) Domain adaptation and autoencoder-based unsupervised speech enhancement. IEEE Trans Artif Intell 3(1):43\u201352","journal-title":"IEEE Trans Artif Intell"},{"key":"10612_CR131","doi-asserted-by":"crossref","unstructured":"Li A, Liu W, Luo X, Yu G, Zheng C, Li X (2021b) A simultaneous denoising and dereverberation framework with target decoupling. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2, pp 796\u2013800","DOI":"10.21437\/Interspeech.2021-1137"},{"issue":"2","key":"10612_CR129","doi-asserted-by":"crossref","first-page":"832","DOI":"10.3390\/app12020832","volume":"12","author":"H Li","year":"2022","unstructured":"Li H, Chen K, Wang L, Liu J, Wan B, Zhou B (2022) Sound source separation mechanisms of different deep networks explained from the perspective of auditory perception. 
Appl Sci 12(2):832","journal-title":"Appl Sci"},{"key":"10612_CR130","doi-asserted-by":"crossref","unstructured":"Liao C-F, Tsao Y, Lee H-Y, Wang H-M (2018) Noise adaptive speech enhancement using domain adversarial training, arXiv preprint arXiv:1807.07501","DOI":"10.21437\/Interspeech.2019-1519"},{"issue":"12","key":"10612_CR132","doi-asserted-by":"crossref","first-page":"1586","DOI":"10.1109\/PROC.1979.11540","volume":"67","author":"JS Lim","year":"1979","unstructured":"Lim JS, Oppenheim AV (1979) Enhancement and bandwidth compression of noisy speech. Proc IEEE 67(12):1586\u20131604","journal-title":"Proc IEEE"},{"key":"10612_CR133","doi-asserted-by":"crossref","unstructured":"Lin YC, Hsu YT, Fu SW, Tsao Y, Kuo TW (2019) IA-Net: acceleration and compression of speech enhancement using integer-adder deep neural network. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2019-September, pp 1801\u20131805","DOI":"10.21437\/Interspeech.2019-1207"},{"issue":"12","key":"10612_CR135","doi-asserted-by":"crossref","first-page":"2092","DOI":"10.1109\/TASLP.2019.2941148","volume":"27","author":"Y Liu","year":"2019","unstructured":"Liu Y, Wang D (2019) Divide and conquer: a deep CASA approach to talker-independent monaural speaker separation. IEEE\/ACM Trans Audio Speech Lang Process 27(12):2092\u20132102","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR134","doi-asserted-by":"crossref","unstructured":"Liu AT, Yang SW, Chi PH, Hsu PC, Lee HY (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. 
In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2020-May, pp 6419\u20136423","DOI":"10.1109\/ICASSP40776.2020.9054458"},{"key":"10612_CR136","doi-asserted-by":"crossref","first-page":"2351","DOI":"10.1109\/TASLP.2021.3095662","volume":"29","author":"AT Liu","year":"2021","unstructured":"Liu AT, Li SW, Lee HY (2021) TERA: self-supervised learning of transformer encoder representation for speech. IEEE\/ACM Trans Audio Speech Lang Process 29:2351\u20132366","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR137","doi-asserted-by":"crossref","unstructured":"Liu H, Liu X, Kong Q, Tian Q, Zhao Y, Wang D, Huang C, Wang Y (2022) VoiceFixer: a unified framework for high-fidelity speech restoration, arXiv preprint arXiv:2204.05841, no. September, pp 4232\u20134236","DOI":"10.21437\/Interspeech.2022-11026"},{"key":"10612_CR141","doi-asserted-by":"crossref","unstructured":"Llu\u00eds F, Pons J, Serra X (2019) End-to-end music source separation: is it possible in the waveform domain? In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2019-September, pp 4619\u20134623","DOI":"10.21437\/Interspeech.2019-1177"},{"key":"10612_CR142","doi-asserted-by":"crossref","DOI":"10.1201\/b14529","volume-title":"Speech enhancement: theory and practice","author":"PC Loizou","year":"2013","unstructured":"Loizou PC (2013) Speech enhancement: theory and practice. CRC Press, Boca Raton"},{"issue":"1","key":"10612_CR143","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1109\/TASL.2010.2045180","volume":"19","author":"PC Loizou","year":"2011","unstructured":"Loizou PC, Kim G (2011) Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions.
IEEE Trans Audio Speech Lang Process 19(1):47\u201356","journal-title":"IEEE Trans Audio Speech Lang Process"},{"key":"10612_CR155","doi-asserted-by":"crossref","unstructured":"Lu X, Tsao Y, Matsuda S, Hori C (2013) Speech enhancement based on deep denoising autoencoder. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, no. August, pp 436\u2013440","DOI":"10.21437\/Interspeech.2013-130"},{"key":"10612_CR144","unstructured":"Lu Y-J, Tsao Y, Watanabe S (2021) A study on speech enhancement based on diffusion probabilistic model. In: 2021 Asia-pacific signal and information processing association annual summit and conference (APSIPA ASC), 2021, pp 659\u2013666. IEEE"},{"key":"10612_CR145","doi-asserted-by":"crossref","unstructured":"Lu Y-J, Wang Z-Q, Watanabe S, Richard A, Yu C, Tsao Y (2022) Conditional diffusion probabilistic model for speech enhancement. In: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 7402\u20137406. IEEE","DOI":"10.1109\/ICASSP43922.2022.9746901"},{"key":"10612_CR146","unstructured":"Luo C (2022) Understanding diffusion models: a unified perspective, arXiv preprint arXiv:2208.11970"},{"key":"10612_CR151","doi-asserted-by":"crossref","unstructured":"Luo Y, Mesgarani N (2018) TaSNet: time-domain audio separation network for real-time, single-channel speech separation. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2018-April, pp 696\u2013700","DOI":"10.1109\/ICASSP.2018.8462116"},{"issue":"8","key":"10612_CR147","doi-asserted-by":"crossref","first-page":"1256","DOI":"10.1109\/TASLP.2019.2915167","volume":"27","author":"Y Luo","year":"2019","unstructured":"Luo Y, Mesgarani N (2019) Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation. 
IEEE\/ACM Trans Audio Speech Lang Process 27(8):1256\u20131266","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR152","doi-asserted-by":"crossref","unstructured":"Luo Y, Mesgarani N (2020) Separating varying numbers of sources with auxiliary autoencoding loss. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2020-October, pp 2622\u20132626","DOI":"10.21437\/Interspeech.2020-0034"},{"key":"10612_CR149","doi-asserted-by":"crossref","unstructured":"Luo Y, Chen Z, Hershey JR, Le Roux J, Mesgarani N (2017) Deep clustering and conventional networks for music separation: stronger together. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, pp 61\u201365","DOI":"10.1109\/ICASSP.2017.7952118"},{"issue":"4","key":"10612_CR148","doi-asserted-by":"crossref","first-page":"787","DOI":"10.1109\/TASLP.2018.2795749","volume":"26","author":"Y Luo","year":"2018","unstructured":"Luo Y, Chen Z, Mesgarani N (2018) Speaker-independent speech separation with deep attractor network. IEEE\/ACM Trans Audio Speech Lang Process 26(4):787\u2013796","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR150","doi-asserted-by":"crossref","unstructured":"Luo Y, Chen Z, Yoshioka T (2020) Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol. 
2020-May, pp 46\u201350","DOI":"10.1109\/ICASSP40776.2020.9054266"},{"key":"10612_CR153","doi-asserted-by":"crossref","unstructured":"Luo J, Wang J, Cheng N, Xiao E, Zhang X, Xiao J (2022) Tiny-sepformer: a tiny time-domain transformer network for speech separation, arXiv preprint arXiv:2206.13689, no.\u00a01, pp 5313\u20135317","DOI":"10.21437\/Interspeech.2022-66"},{"key":"10612_CR154","doi-asserted-by":"crossref","unstructured":"Lutati S, Nachmani E, Wolf L (2022) SepIt: approaching a single channel speech separation bound, arXiv preprint arXiv:2205.11801, pp 5323\u20135327","DOI":"10.21437\/Interspeech.2022-149"},{"key":"10612_CR156","doi-asserted-by":"crossref","unstructured":"Mao X, Li Q, Xie H, Lau RY, Wang Z, Paul\u00a0Smolley S (2017) Least squares generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 2794\u20132802","DOI":"10.1109\/ICCV.2017.304"},{"issue":"11","key":"10612_CR157","doi-asserted-by":"crossref","first-page":"1680","DOI":"10.1109\/LSP.2018.2871419","volume":"25","author":"JM Martin-Donas","year":"2018","unstructured":"Martin-Donas JM, Gomez AM, Gonzalez JA, Peinado AM (2018) A deep learning loss function based on the perceptual evaluation of the speech quality. IEEE Signal Process Lett 25(11):1680\u20131684","journal-title":"IEEE Signal Process Lett"},{"issue":"11","key":"10612_CR158","doi-asserted-by":"crossref","first-page":"1938","DOI":"10.1109\/TASLP.2015.2457612","volume":"23","author":"Y Miao","year":"2015","unstructured":"Miao Y, Zhang H, Metze F (2015) Speaker adaptive training of deep neural network acoustic models using i-vectors. 
IEEE\/ACM Trans Audio Speech Lang Process 23(11):1938\u20131949","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"issue":"4","key":"10612_CR159","doi-asserted-by":"crossref","first-page":"1259","DOI":"10.1121\/1.398740","volume":"86","author":"AK N\u00e1b\u011blek","year":"1989","unstructured":"N\u00e1b\u011blek AK, Letowski TR, Tucker FM (1989) Reverberant overlap- and self-masking in consonant identification. J Acoust Soc Am 86(4):1259\u20131265","journal-title":"J Acoust Soc Am"},{"key":"10612_CR160","unstructured":"Nachmani E, Adi Y, Wolf L (2020) Voice separation with an unknown number of multiple speakers. In: 37th international conference on machine learning, ICML 2020, vol PartF16814, pp 7121\u20137132"},{"key":"10612_CR162","doi-asserted-by":"crossref","unstructured":"Narayanan A, Wang D (2013) Ideal ratio mask estimation using deep neural networks for robust speech recognition. In: 2013 IEEE international conference on acoustics, speech and signal processing, pp 7092\u20137096","DOI":"10.1109\/ICASSP.2013.6639038"},{"issue":"1","key":"10612_CR161","first-page":"92","volume":"23","author":"A Narayanan","year":"2015","unstructured":"Narayanan A, Wang D (2015) Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training. IEEE\/ACM Trans Audio Speech Lang Process 23(1):92\u2013101","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR163","doi-asserted-by":"crossref","unstructured":"Natsiou A, O\u2019Leary S (2021) Audio representations for deep learning in sound synthesis: a review. In: Proceedings of IEEE\/ACS international conference on computer systems and applications, AICCSA, vol 2021-December","DOI":"10.1109\/AICCSA53542.2021.9686838"},{"key":"#cr-split#-10612_CR164.1","doi-asserted-by":"crossref","unstructured":"Naylor PA, Gaubitch ND (2010) Speech dereverberation. 
In: Naylor PA, Gaubitch ND","DOI":"10.1007\/978-1-84996-056-4"},{"key":"#cr-split#-10612_CR164.2","unstructured":"(ed) vol.\u00a053, no.\u00a01. Springer, London (2010)"},{"key":"10612_CR165","doi-asserted-by":"crossref","unstructured":"Neumann TV, Kinoshita K, Delcroix M, Araki S, Nakatani T, Haeb-Umbach R (2019) All-neural online source separation, counting, and diarization for meeting analysis. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2019-May, pp 91\u201395","DOI":"10.1109\/ICASSP.2019.8682572"},{"key":"10612_CR166","doi-asserted-by":"crossref","unstructured":"Nossier SA, Wall J, Moniri M, Glackin C, Cannings N (2020a) A comparative study of time and frequency domain approaches to deep learning based speech enhancement. In: Proceedings of the international joint conference on neural networks","DOI":"10.1109\/IJCNN48605.2020.9206928"},{"key":"10612_CR168","doi-asserted-by":"crossref","unstructured":"Nossier SA, Wall J, Moniri M, Glackin C, Cannings N (2020b) Mapping and masking targets comparison using different deep learning based speech enhancement architectures. In: 2020 international joint conference on neural networks (IJCNN). IEEE, pp 1\u20138","DOI":"10.1109\/IJCNN48605.2020.9206623"},{"key":"10612_CR169","doi-asserted-by":"crossref","unstructured":"Ochiai T, Matsuda S, Lu X, Hori C, Katagiri S (2014) Speaker adaptive training using deep neural networks. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6349\u20136353. IEEE","DOI":"10.1109\/ICASSP.2014.6854826"},{"key":"10612_CR170","volume-title":"Discrete-time signal processing","author":"AV Oppenheim","year":"1999","unstructured":"Oppenheim AV (1999) Discrete-time signal processing, 2nd ed. 
Prentice-Hall, Upper Saddle River","edition":"2nd ed"},{"issue":"5","key":"10612_CR171","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1109\/PROC.1981.12022","volume":"69","author":"AV Oppenheim","year":"1981","unstructured":"Oppenheim AV, Lim JS (1981) The importance of phase in signals. Proc IEEE 69(5):529\u2013541","journal-title":"Proc IEEE"},{"issue":"4","key":"10612_CR172","doi-asserted-by":"crossref","first-page":"465","DOI":"10.1016\/j.specom.2010.12.003","volume":"53","author":"K Paliwal","year":"2011","unstructured":"Paliwal K, W\u00f3jcicki K, Shannon B (2011) The importance of phase in speech enhancement. Speech Commun 53(4):465\u2013494","journal-title":"Speech Commun"},{"issue":"2","key":"10612_CR173","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1109\/TNN.2010.2091281","volume":"22","author":"SJ Pan","year":"2010","unstructured":"Pan SJ, Tsang IW, Kwok JT, Yang Q (2010) Domain adaptation via transfer component analysis. IEEE Trans Neural Netw 22(2):199\u2013210","journal-title":"IEEE Trans Neural Netw"},{"key":"10612_CR174","unstructured":"Parveen S, Green P (2004) Speech enhancement with missing data techniques using recurrent neural networks. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol\u00a01, no. Figure 1, pp 13\u201316"},{"key":"10612_CR175","doi-asserted-by":"crossref","unstructured":"Pascual S, Bonafonte A, Serra J (2017) SEGAN: speech enhancement generative adversarial network. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2017-August, no.\u00a0D, pp 3642\u20133646","DOI":"10.21437\/Interspeech.2017-1428"},{"key":"10612_CR176","doi-asserted-by":"crossref","unstructured":"Pascual S, Park M, Serr\u00e0 J, Bonafonte A, Ahn K-H (2018) Language and noise transfer in speech enhancement generative adversarial network. 
In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5019\u20135023. IEEE","DOI":"10.1109\/ICASSP.2018.8462322"},{"key":"10612_CR177","doi-asserted-by":"crossref","unstructured":"Pascual S, Serr\u00e0 J, Bonafonte A (2019) Towards generalized speech enhancement with generative adversarial networks. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2019-September, pp 1791\u20131795","DOI":"10.21437\/Interspeech.2019-2688"},{"key":"10612_CR178","doi-asserted-by":"crossref","first-page":"1700","DOI":"10.1109\/LSP.2020.3025020","volume":"27","author":"H Phan","year":"2020","unstructured":"Phan H, McLoughlin IV, Pham L, Chen OY, Koch P, De Vos M, Mertins A (2020) Improving GANs for speech enhancement. IEEE Signal Process Lett 27:1700\u20131704","journal-title":"IEEE Signal Process Lett"},{"issue":"1","key":"10612_CR179","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1109\/TASSP.1980.1163359","volume":"28","author":"MR Portnoff","year":"1980","unstructured":"Portnoff MR (1980) Time-frequency representation of digital signals. IEEE Trans Acoust Speech Signal Process 28(1):55\u201369","journal-title":"IEEE Trans Acoust Speech Signal Process"},{"key":"10612_CR180","doi-asserted-by":"crossref","unstructured":"Qian K, Zhang Y, Chang S, Yang X, Flor\u00eancio D, Hasegawa-Johnson M (2017) Speech enhancement using bayesian wavenet. In: Interspeech, pp 2013\u20132017","DOI":"10.21437\/Interspeech.2017-1672"},{"key":"10612_CR181","first-page":"2018","volume":"1","author":"S Qin","year":"2018","unstructured":"Qin S, Jiang T (2018) Improved Wasserstein conditional generative adversarial network speech enhancement. 
EURASIP J Wirel Commun Netw 1:2018","journal-title":"EURASIP J Wirel Commun Netw"},{"key":"10612_CR182","doi-asserted-by":"crossref","first-page":"82\u00a0571","DOI":"10.1109\/ACCESS.2020.2989833","volume":"8","author":"S Qin","year":"2020","unstructured":"Qin S, Jiang T, Wu S, Wang N, Zhao X (2020) Graph convolution-based deep clustering for speech separation. IEEE Access 8:82\u00a0571-82\u00a0580","journal-title":"IEEE Access"},{"key":"10612_CR183","doi-asserted-by":"crossref","first-page":"78\u00a0754","DOI":"10.1109\/ACCESS.2022.3193245","volume":"10","author":"W Qiu","year":"2022","unstructured":"Qiu W, Hu Y (2022) Dual-path hybrid attention network for monaural speech separation. IEEE Access 10:78\u00a0754-78\u00a0763","journal-title":"IEEE Access"},{"key":"10612_CR184","doi-asserted-by":"crossref","unstructured":"Reddy CK, Gopal V, Cutler R (2021) Dnsmos: a non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In: ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6493\u20136497. IEEE","DOI":"10.1109\/ICASSP39728.2021.9414878"},{"key":"10612_CR185","doi-asserted-by":"crossref","unstructured":"Rethage D, Pons J, Serra X (2018) A wavenet for speech denoising. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2018-April, pp 5069\u20135073","DOI":"10.1109\/ICASSP.2018.8462417"},{"key":"10612_CR186","doi-asserted-by":"crossref","unstructured":"Rix AW, Beerends JG, Hollier MP, Hekstra AP (2001) Perceptual evaluation of speech quality (PESQ)\u2014a new method for speech quality assessment of telephone networks and codecs. 
In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2, pp 749\u2013752","DOI":"10.1109\/ICASSP.2001.941023"},{"key":"10612_CR187","doi-asserted-by":"crossref","unstructured":"Roux JL, Wisdom S, Erdogan H, Hershey JR (2019) SDR\u2014half-baked or well done? In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2019-May, pp 626\u2013630","DOI":"10.1109\/ICASSP.2019.8683855"},{"key":"10612_CR188","doi-asserted-by":"crossref","unstructured":"Sainath TN, Weiss RJ, Senior A, Wilson KW, Vinyals O (2015) Learning the speech front-end with raw waveform CLDNNs. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2015-January, pp 1\u20135","DOI":"10.21437\/Interspeech.2015-1"},{"key":"10612_CR189","unstructured":"Saito K, Uhlich S, Fabbro G, Mitsufuji Y (2021) Training speech enhancement systems with noisy speech datasets, arXiv preprint arXiv:2105.12315"},{"key":"10612_CR190","doi-asserted-by":"crossref","unstructured":"Schmidt MN, Olsson RK (2006) Single-channel speech separation using sparse non-negative matrix factorization. In: INTERSPEECH 2006 and 9th international conference on spoken language processing, INTERSPEECH 2006\u2014ICSLP, vol\u00a05, pp 2614\u20132617","DOI":"10.21437\/Interspeech.2006-655"},{"key":"10612_CR191","doi-asserted-by":"crossref","unstructured":"Senior A, Lopez-Moreno I (2014) Improving DNN speaker independence with i-vector inputs. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 225\u2013229. IEEE","DOI":"10.1109\/ICASSP.2014.6853591"},{"issue":"1","key":"10612_CR192","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1109\/TSA.2005.854106","volume":"14","author":"Y Shao","year":"2006","unstructured":"Shao Y, Wang D (2006) Model-based sequential organization in cochannel speech. 
IEEE Trans Audio Speech Lang Process 14(1):289\u2013298","journal-title":"IEEE Trans Audio Speech Lang Process"},{"key":"10612_CR194","doi-asserted-by":"crossref","unstructured":"Shi J, Xu J, Liu G, Xu B (2018) Listen, think and listen again: capturing top-down auditory attention for speaker-independent speech separation. In: IJCAI international joint conference on artificial intelligence, vol 2018-July, pp 4353\u20134360","DOI":"10.24963\/ijcai.2018\/605"},{"key":"10612_CR193","doi-asserted-by":"crossref","unstructured":"Shivakumar PG, Georgiou P (2016) Perception optimized deep denoising autoencoders for speech enhancement. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 08-12-September-2016, pp 3743\u20133747","DOI":"10.21437\/Interspeech.2016-1284"},{"key":"10612_CR195","unstructured":"Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. PMLR, pp 2256\u20132265"},{"key":"10612_CR196","unstructured":"Stoller D, Ewert S, Dixon S (2018) Wave-u-net: a multi-scale neural network for end-to-end audio source separation, arXiv preprint arXiv:1806.03185"},{"key":"10612_CR197","doi-asserted-by":"crossref","unstructured":"Subakan C, Ravanelli M, Cornell S, Bronzi M, Zhong J (2021) Attention is all you need in speech separation. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2021-June, pp 21\u201325","DOI":"10.1109\/ICASSP39728.2021.9413901"},{"key":"10612_CR198","unstructured":"Subakan C, Ravanelli M, Cornell S, Grondin F, Bronzi M (2022a) On using transformers for speech-separation. In: International workshop on acoustic signal enhancement, vol\u00a014, no.\u00a08, pp 1\u201310. 
arXiv:2202.02884"},{"key":"10612_CR199","unstructured":"Subakan C, Ravanelli M, Cornell S, Lepoutre F, Grondin F (2022b) Resource-efficient separation transformer, arXiv preprint arXiv:2206.09507, pp 1\u20135"},{"key":"10612_CR200","doi-asserted-by":"crossref","unstructured":"Su J, Jin Z, Finkelstein A (2020) HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2020-October, no.\u00a03, pp 4506\u20134510","DOI":"10.21437\/Interspeech.2020-2143"},{"key":"10612_CR201","doi-asserted-by":"crossref","unstructured":"Sun H, Li S (2017) An optimization method for speech enhancement based on deep neural network. In: IOP conference series: earth and environmental science, vol\u00a069, no\u00a01","DOI":"10.1088\/1755-1315\/69\/1\/012139"},{"key":"10612_CR202","doi-asserted-by":"crossref","unstructured":"Taal CH, Hendriks RC, Heusdens R, Jensen J (2010) A short-time objective intelligibility measure for time-frequency weighted noisy speech. In: IEEE international conference on acoustics, speech, and signal processing, pp 4214\u20134217","DOI":"10.1109\/ICASSP.2010.5495701"},{"issue":"7","key":"10612_CR203","doi-asserted-by":"crossref","first-page":"2125","DOI":"10.1109\/TASL.2011.2114881","volume":"19","author":"CH Taal","year":"2011","unstructured":"Taal CH, Hendriks RC, Heusdens R, Jensen J (2011) An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans Audio Speech Lang Process 19(7):2125\u20132136","journal-title":"IEEE Trans Audio Speech Lang Process"},{"key":"10612_CR204","doi-asserted-by":"crossref","unstructured":"Tachibana H (2021) Towards listening to 10 people simultaneously: an efficient permutation invariant training of audio source separation using Sinkhorn\u2019s algorithm.
In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014 proceedings, vol 2021-June, pp 491\u2013495","DOI":"10.1109\/ICASSP39728.2021.9414508"},{"key":"10612_CR205","doi-asserted-by":"crossref","unstructured":"Takahashi N, Parthasaarathy S, Goswami N, Mitsufuji Y (2019) Recursive speech separation for unknown number of speakers. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2019-September, pp 1348\u20131352","DOI":"10.21437\/Interspeech.2019-1550"},{"key":"10612_CR206","doi-asserted-by":"crossref","first-page":"1785","DOI":"10.1109\/TASLP.2021.3082282","volume":"29","author":"K Tan","year":"2021","unstructured":"Tan K, Wang D (2021) Towards model compression for deep learning based speech enhancement. IEEE\/ACM Trans Audio Speech Lang Process 29:1785\u20131794","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR207","doi-asserted-by":"crossref","unstructured":"Trinh VA, Braun S (2022) Unsupervised speech enhancement with speech recognition embedding and disentanglement losses. In: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 391\u2013395. IEEE","DOI":"10.1109\/ICASSP43922.2022.9746973"},{"key":"10612_CR208","doi-asserted-by":"crossref","unstructured":"Tu Y, Du J, Xu Y, Dai L, Lee CH (2014) Deep neural network based speech separation for robust speech recognition. In: International conference on signal processing proceedings, ICSP, vol 2015-January, no. October, pp 532\u2013536","DOI":"10.1109\/ICOSP.2014.7015061"},{"key":"10612_CR210","doi-asserted-by":"crossref","unstructured":"Tzinis E, Venkataramani S, Wang Z, Subakan C, Smaragdis P (2020a) Two-step sound source separation: training on learned latent targets. 
In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2020-May, pp 31\u201335","DOI":"10.1109\/ICASSP40776.2020.9054172"},{"key":"10612_CR211","doi-asserted-by":"crossref","unstructured":"Tzinis E, Wang Z, Smaragdis P (2020b) Sudo rm -rf: efficient networks for universal audio source separation. In: IEEE international workshop on machine learning for signal processing, MLSP, vol 2020-September","DOI":"10.1109\/MLSP49062.2020.9231900"},{"issue":"6","key":"10612_CR209","doi-asserted-by":"crossref","first-page":"1329","DOI":"10.1109\/JSTSP.2022.3200911","volume":"16","author":"E Tzinis","year":"2022","unstructured":"Tzinis E, Adi Y, Ithapu VK, Xu B, Smaragdis P, Kumar A (2022) RemixIT: continual self-training of speech enhancement models via bootstrapped remixing. IEEE J Sel Topics Signal Process 16(6):1329\u20131341","journal-title":"IEEE J Sel Topics Signal Process"},{"issue":"2","key":"10612_CR212","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1007\/s11265-015-1007-3","volume":"82","author":"Y Ueda","year":"2016","unstructured":"Ueda Y, Wang L, Kai A, Xiao X, Chng ES, Li H (2016) Single-channel dereverberation for distant-talking speech recognition by combining denoising autoencoder and temporal structure normalization. J Signal Process Syst 82(2):151\u2013161","journal-title":"J Signal Process Syst"},{"key":"10612_CR213","unstructured":"Valin J-M, Giri R, Venkataramani S, Isik U, Krishnaswamy A (2022) To dereverb or not to dereverb?
Perceptual studies on real-time dereverberation targets, arXiv:2206.07917"},{"key":"10612_CR214","unstructured":"van\u00a0den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) Wavenet: a generative model for raw audio, arXiv preprint arXiv:1609.03499, pp 1\u201315"},{"issue":"4","key":"10612_CR215","doi-asserted-by":"crossref","first-page":"387","DOI":"10.1016\/0165-1684(85)90002-7","volume":"8","author":"P Vary","year":"1985","unstructured":"Vary P, Eurasip M (1985) Noise suppression by spectral magnitude estimation-mechanism and theoretical limits. Signal Process 8(4):387\u2013400","journal-title":"Signal Process"},{"key":"10612_CR216","first-page":"5999","volume":"2017","author":"A Vaswani","year":"2017","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 2017:5999\u20136009","journal-title":"Adv Neural Inf Process Syst"},{"key":"10612_CR217","doi-asserted-by":"crossref","unstructured":"Venkataramani S, Casebeer J, Smaragdis P (2018) End-to-end source separation with adaptive front-ends. In: 2018 52nd Asilomar conference on signals, systems, and computers, no.\u00a01, pp 684\u2013688","DOI":"10.1109\/ACSSC.2018.8645535"},{"key":"10612_CR218","doi-asserted-by":"crossref","unstructured":"Vesel\u00fd K, Watanabe S, \u017dmol\u00edkov\u00e1 K, Karafi\u00e1t M, Burget L, \u010cernock\u00fd JH (2016) Sequence summarizing neural network for speaker adaptation. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5315\u20135319. IEEE","DOI":"10.1109\/ICASSP.2016.7472692"},{"key":"10612_CR219","doi-asserted-by":"crossref","unstructured":"Virtanen T (2006) Speech recognition using factorial hidden Markov models for separation in the feature space.
In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 1, pp 89\u201392","DOI":"10.21437\/Interspeech.2006-23"},{"key":"10612_CR220","doi-asserted-by":"crossref","unstructured":"Virtanen T, Cemgil AT (2009) Mixtures of gamma priors for non-negative matrix factorization based speech separation. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 5441, no 3, pp 646\u2013653","DOI":"10.1007\/978-3-642-00599-2_81"},{"key":"10612_CR221","doi-asserted-by":"crossref","unstructured":"von Neumann T, Boeddeker C, Drude L, Kinoshita K, Delcroix M, Nakatani T, Haeb-Umbach R (2020) Multi-talker ASR for an unknown number of sources: joint training of source counting, separation and ASR. In: Proceedings of the annual conference of the international speech communication association, INTERSPEECH, vol 2020-October, pp 3097\u20133101","DOI":"10.21437\/Interspeech.2020-2519"},{"key":"10612_CR222","doi-asserted-by":"crossref","unstructured":"Wang D (2008) Time-frequency masking for speech hearing aid design. Trends in amplification, vol\u00a012, pp 332\u2013353. http:\/\/www.ncbi.nlm.nih.gov\/pubmed\/18974204","DOI":"10.1177\/1084713808326455"},{"issue":"10","key":"10612_CR225","doi-asserted-by":"crossref","first-page":"1702","DOI":"10.1109\/TASLP.2018.2842159","volume":"26","author":"D Wang","year":"2018","unstructured":"Wang D, Chen J (2018) Supervised speech separation based on deep learning: an overview. IEEE\/ACM Trans Audio Speech Lang Process 26(10):1702\u20131726","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"issue":"4","key":"10612_CR226","doi-asserted-by":"crossref","first-page":"679","DOI":"10.1109\/TASSP.1982.1163920","volume":"30","author":"DL Wang","year":"1982","unstructured":"Wang DL, Lim JS (1982) The unimportance of phase in speech enhancement.
IEEE Trans Acoust Speech Signal Process 30(4):679\u2013681","journal-title":"IEEE Trans Acoust Speech Signal Process"},{"key":"10612_CR238","unstructured":"Wang Z, Sha F (2014) Discriminative non-negative matrix factorization for single-channel speech separation. In: 2014 IEEE international conference on acoustic, speech and signal processing (ICASSP), pp 3777\u20133781 https:\/\/pdfs.semanticscholar.org\/854a\/454106bd42a8bca158426d8b12b48ba0cae8.pdf"},{"issue":"7","key":"10612_CR227","doi-asserted-by":"crossref","first-page":"1381","DOI":"10.1109\/TASL.2013.2250961","volume":"21","author":"Y Wang","year":"2013","unstructured":"Wang Y, Wang DL (2013) Towards scaling up classification-based speech separation. IEEE Trans Audio Speech Lang Process 21(7):1381\u20131390","journal-title":"IEEE Trans Audio Speech Lang Process"},{"issue":"6","key":"10612_CR228","doi-asserted-by":"crossref","first-page":"3048","DOI":"10.1109\/TPAMI.2021.3055564","volume":"44","author":"L Wang","year":"2022","unstructured":"Wang L, Yoon KJ (2022) Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks. IEEE Trans Pattern Anal Mach Intell 44(6):3048\u20133068","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"2","key":"10612_CR229","doi-asserted-by":"crossref","first-page":"270","DOI":"10.1109\/TASL.2012.2221459","volume":"21","author":"Y Wang","year":"2013","unstructured":"Wang Y, Han K, Wang D (2013) Exploring monaural features for classification-based speech segregation. IEEE Trans Audio Speech Lang Process 21(2):270\u2013279","journal-title":"IEEE Trans Audio Speech Lang Process"},{"issue":"12","key":"10612_CR230","doi-asserted-by":"crossref","first-page":"1849","DOI":"10.1109\/TASLP.2014.2352935","volume":"22","author":"Y Wang","year":"2014","unstructured":"Wang Y, Narayanan A, Wang DL (2014) On training targets for supervised speech separation. 
IEEE\/ACM Trans Audio Speech Lang Process 22(12):1849\u20131858","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR233","doi-asserted-by":"crossref","unstructured":"Wang Y, Du J, Dai L-R, Lee C-H (2016) Unsupervised single-channel speech separation via deep neural network for different gender mixtures. In: 2016 Asia-pacific signal and information processing association annual summit and conference (APSIPA), pp 1\u20134. IEEE","DOI":"10.1109\/APSIPA.2016.7820736"},{"issue":"7","key":"10612_CR231","doi-asserted-by":"crossref","first-page":"1535","DOI":"10.1109\/TASLP.2017.2700540","volume":"25","author":"Y Wang","year":"2017","unstructured":"Wang Y, Du J, Dai LR, Lee CH (2017) A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks. IEEE\/ACM Trans Audio Speech Lang Process 25(7):1535\u20131546","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR223","doi-asserted-by":"crossref","unstructured":"Wang ZQ, Roux JL, Hershey JR (2018a) Alternative objective functions for deep clustering. 
In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2018-April, pp 686\u2013690","DOI":"10.1109\/ICASSP.2018.8462507"},{"key":"10612_CR232","doi-asserted-by":"crossref","unstructured":"Wang J, Chen J, Su D, Chen L, Yu M, Qian Y, Yu D (2018b) Deep extractor network for target speaker recovery from single channel speech mixtures, arXiv preprint arXiv:1807.08974","DOI":"10.21437\/Interspeech.2018-1205"},{"key":"10612_CR236","doi-asserted-by":"crossref","unstructured":"Wang Q, Muckenhirn H, Wilson K, Sridhar P, Wu Z, Hershey J, Saurous RA, Weiss RJ, Jia Y, Moreno IL (2018c) VoiceFilter: targeted voice separation by speaker-conditioned spectrogram masking, arXiv preprint arXiv:1810.04826","DOI":"10.21437\/Interspeech.2019-1101"},{"key":"10612_CR237","doi-asserted-by":"crossref","unstructured":"Wang Q, Rao W, Sun S, Xie L, Chng ES, Li H (2018d) Unsupervised domain adaptation via domain adversarial training for speaker recognition. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4889\u20134893. IEEE","DOI":"10.1109\/ICASSP.2018.8461423"},{"key":"10612_CR224","doi-asserted-by":"crossref","unstructured":"Wang ZQ, Tan K, Wang D (2019) Deep learning based phase reconstruction for speaker separation: a trigonometric perspective. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol 2019-May, pp 71\u201375","DOI":"10.1109\/ICASSP.2019.8683231"},{"key":"10612_CR235","unstructured":"Wang S, Li BZ, Khabsa M, Fang H, Ma H (2020) Linformer: self-attention with linear complexity, vol 2048, no. 2019. arXiv:2006.04768"},{"key":"10612_CR234","doi-asserted-by":"crossref","unstructured":"Wang K, He B, Zhu WP (2021) TSTNN: two-stage transformer based neural network for speech enhancement in the time domain. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, vol.
2021-June, pp 7098\u20137102","DOI":"10.1109\/ICASSP39728.2021.9413740"},{"issue":"10","key":"10612_CR239","doi-asserted-by":"crossref","first-page":"1670","DOI":"10.1109\/TASLP.2015.2444659","volume":"23","author":"C Weng","year":"2015","unstructured":"Weng C, Yu D, Seltzer ML, Droppo J (2015) Deep neural networks for single-channel multi-talker speech recognition. IEEE\/ACM Trans Audio Speech Lang Process 23(10):1670\u20131679","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR241","doi-asserted-by":"crossref","unstructured":"Weninger F, Hershey JR, Le Roux J, Schuller B (2014) Discriminatively trained recurrent neural networks for single-channel speech separation. In: 2014 IEEE global conference on signal and information processing, GlobalSIP 2014, pp 577\u2013581","DOI":"10.1109\/GlobalSIP.2014.7032183"},{"key":"10612_CR240","doi-asserted-by":"crossref","unstructured":"Weninger F, Erdogan H, Watanabe S, Vincent E, Le Roux J, Hershey JR, Schuller B (2015) Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 9237, pp 91\u201399","DOI":"10.1007\/978-3-319-22482-4_11"},{"key":"10612_CR242","doi-asserted-by":"crossref","unstructured":"Wichern G, Lukin A (2017) Low-latency approximation of bidirectional recurrent networks for speech denoising. In: IEEE workshop on applications of signal processing to audio and acoustics, vol 2017-October, pp 66\u201370","DOI":"10.1109\/WASPAA.2017.8169996"},{"key":"10612_CR243","doi-asserted-by":"crossref","unstructured":"Williamson DS, Wang D (2017a) Speech dereverberation and denoising using complex ratio masks. 
In: IEEE international conference on acoustics, speech, and signal processing (ICASSP) 2017, pp 5590\u20135594","DOI":"10.1109\/ICASSP.2017.7953226"},{"issue":"7","key":"10612_CR244","doi-asserted-by":"crossref","first-page":"1492","DOI":"10.1109\/TASLP.2017.2696307","volume":"25","author":"DS Williamson","year":"2017","unstructured":"Williamson DS, Wang D (2017b) Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE\/ACM Trans Audio Speech Lang Process 25(7):1492\u20131501","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"issue":"3","key":"10612_CR245","doi-asserted-by":"crossref","first-page":"483","DOI":"10.1109\/TASLP.2015.2512042","volume":"24","author":"DS Williamson","year":"2016","unstructured":"Williamson DS, Wang Y, Wang DL (2016) Complex ratio masking for monaural speech separation. IEEE\/ACM Trans Audio Speech Lang Process 24(3):483\u2013492","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR246","unstructured":"Wisdom S, Tzinis E, Erdogan H, Weiss RJ, Wilson K, Hershey JR (2020) Unsupervised sound separation using mixture invariant training. In: Advances in neural information processing systems, vol 2020-December, June 2020. arXiv:2006.12701"},{"issue":"12","key":"10612_CR247","doi-asserted-by":"crossref","first-page":"1887","DOI":"10.1109\/LSP.2019.2951950","volume":"26","author":"JY Wu","year":"2019","unstructured":"Wu JY, Yu C, Fu SW, Liu CT, Chien SY, Tsao Y (2019) Increasing compactness of deep learning based speech enhancement models with parameter pruning and quantization techniques.
IEEE Signal Process Lett 26(12):1887\u20131891","journal-title":"IEEE Signal Process Lett"},{"key":"10612_CR248","doi-asserted-by":"crossref","unstructured":"Xia B, Bao C (2014) Wiener filtering based speech enhancement with Weighted Denoising Auto-encoder and noise classification, pp 13\u201329","DOI":"10.1016\/j.specom.2014.02.001"},{"key":"10612_CR249","doi-asserted-by":"crossref","first-page":"1826","DOI":"10.1109\/TASLP.2020.2997118","volume":"28","author":"Y Xiang","year":"2020","unstructured":"Xiang Y, Bao C (2020) A parallel-data-free speech enhancement method using multi-objective learning cycle-consistent generative adversarial network. IEEE\/ACM Trans Audio Speech Lang Process 28:1826\u20131838","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR250","doi-asserted-by":"crossref","unstructured":"Xiao X, Chen Z, Yoshioka T, Erdogan H, Liu C, Dimitriadis D, Droppo J, Gong Y (2019) Single-channel speech extraction using speaker inventory and attention network. In: ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 86\u201390","DOI":"10.1109\/ICASSP.2019.8682245"},{"key":"10612_CR251","unstructured":"Xiao F, Guan J, Kong Q, Wang W (2021) Time-domain speech enhancement with generative adversarial learning, arXiv preprint arXiv:2103.16149"},{"issue":"1","key":"10612_CR252","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1109\/LSP.2013.2291240","volume":"21","author":"Y Xu","year":"2014","unstructured":"Xu Y, Du J, Dai LR, Lee CH (2014a) An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process Lett 21(1):65\u201368","journal-title":"IEEE Signal Process Lett"},{"key":"10612_CR254","doi-asserted-by":"crossref","unstructured":"Xu Y, Du J, Dai L-R, Lee C-H (2014b) Cross-language transfer learning for deep neural network based speech enhancement. In: The 9th international symposium on chinese spoken language processing, pp 336\u2013340. 
IEEE","DOI":"10.1109\/ISCSLP.2014.6936608"},{"key":"10612_CR255","doi-asserted-by":"crossref","unstructured":"Xu Y, Du J, Dai L-R, Lee C-H (2014c) Global variance equalization for improving deep neural network based speech enhancement. In: 2014 IEEE China summit & international conference on signal and information processing (ChinaSIP). IEEE, pp 71\u201375","DOI":"10.1109\/ChinaSIP.2014.6889204"},{"issue":"1","key":"10612_CR253","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1109\/TASLP.2014.2364452","volume":"23","author":"Y Xu","year":"2015","unstructured":"Xu Y, Du J, Dai LR, Lee CH (2015) A regression approach to speech enhancement based on deep neural networks. IEEE\/ACM Trans Audio Speech Lang Process 23(1):7\u201319","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR256","unstructured":"Yan Z, Buye X, Ritwik G, Tao Z (2018) Perceptually guided speech enhancement using deep neural networks. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, pp 5074\u20135078"},{"key":"10612_CR257","doi-asserted-by":"crossref","unstructured":"Ye F, Tsao Y, Chen F (2019) Subjective feedback-based neural network pruning for speech enhancement. In: 2019 Asia-pacific signal and information processing association annual summit and conference, APSIPA ASC 2019, pp 673\u2013677","DOI":"10.1109\/APSIPAASC47483.2019.9023330"},{"key":"10612_CR258","doi-asserted-by":"crossref","unstructured":"Yu D, Kolbaek M, Tan ZH, Jensen J (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, pp 241\u2013245","DOI":"10.1109\/ICASSP.2017.7952154"},{"key":"10612_CR259","unstructured":"Yu D, Kolbaek M, Tan Z-H, Jensen J (2017) Speaker-independent multi-talker speech separation.
In: IEEE international conference on acoustics, speech and signal processing, pp 241\u2013245"},{"issue":"4","key":"10612_CR260","doi-asserted-by":"crossref","first-page":"2840","DOI":"10.1109\/TASLP.2021.3099291","volume":"29","author":"N Zeghidour","year":"2021","unstructured":"Zeghidour N, Grangier D (2021) Wavesplit: end-to-end speech separation by speaker clustering. IEEE\/ACM Trans Audio Speech Lang Process 29(4):2840\u20132849","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"issue":"5","key":"10612_CR261","doi-asserted-by":"crossref","first-page":"967","DOI":"10.1109\/TASLP.2016.2536478","volume":"24","author":"XL Zhang","year":"2016","unstructured":"Zhang XL, Wang D (2016) A deep ensemble learning method for monaural speech separation. IEEE\/ACM Trans Audio Speech Lang Process 24(5):967\u2013977","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR265","doi-asserted-by":"crossref","unstructured":"Zhang H, Zhang X, Gao G (2018) Training supervised speech separation system to improve STOI and PESQ directly. In: ICASSP, IEEE international conference on acoustics, speech and signal processing\u2014proceedings, pp 5374\u20135378","DOI":"10.1109\/ICASSP.2018.8461965"},{"key":"10612_CR262","doi-asserted-by":"crossref","first-page":"1404","DOI":"10.1109\/TASLP.2020.2987441","volume":"28","author":"Q Zhang","year":"2020","unstructured":"Zhang Q, Nicolson A, Wang M, Paliwal KK, Wang C (2020a) DeepMMSE: a deep learning approach to mmse-based noise power spectral density estimation. 
IEEE\/ACM Trans Audio Speech Lang Process 28:1404\u20131415","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR263","doi-asserted-by":"crossref","unstructured":"Zhang L, Shi Z, Han J, Shi A, Ma D (2020b) FurcaNeXt: end-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 11961 LNCS, pp 653\u2013665","DOI":"10.1007\/978-3-030-37731-1_53"},{"key":"10612_CR264","doi-asserted-by":"crossref","unstructured":"Zhang C, Yu M, Weng C, Yu D (2021a) Towards robust speaker verification with target speaker enhancement. In: ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6693\u20136697. IEEE","DOI":"10.1109\/ICASSP39728.2021.9414017"},{"key":"10612_CR266","doi-asserted-by":"crossref","unstructured":"Zhang J, Zorila C, Doddipatla R, Barker J (2021b) Teacher-student mixit for unsupervised and semi-supervised speech separation, arXiv preprint arXiv:2106.07843","DOI":"10.21437\/Interspeech.2021-1243"},{"issue":"1","key":"10612_CR267","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1109\/TASLP.2018.2870725","volume":"27","author":"Y Zhao","year":"2019","unstructured":"Zhao Y, Wang ZQ, Wang D (2019) Two-stage deep learning for noisy-reverberant speech enhancement. IEEE\/ACM Trans Audio Speech Lang Process 27(1):53\u201362","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR268","doi-asserted-by":"crossref","first-page":"1598","DOI":"10.1109\/TASLP.2020.2995273","volume":"28","author":"Y Zhao","year":"2020","unstructured":"Zhao Y, Wang D, Xu B, Zhang T (2020) Monaural speech dereverberation using temporal convolutional networks with self attention. 
IEEE\/ACM Trans Audio Speech Lang Process 28:1598\u20131607","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"issue":"1","key":"10612_CR269","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1109\/TASLP.2018.2870742","volume":"27","author":"N Zheng","year":"2019","unstructured":"Zheng N, Zhang XL (2019) Phase-aware speech enhancement based on deep neural networks. IEEE\/ACM Trans Audio Speech Lang Process 27(1):63\u201376","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"10612_CR270","unstructured":"Zhou R, Zhu W, Li X (2022) Single-channel speech dereverberation using subband network with a reverberation time shortening target, arXiv preprint arXiv:2210.11089, arXiv:2204.08765"},{"key":"10612_CR271","doi-asserted-by":"crossref","unstructured":"Zhu J-Y, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 2223\u20132232","DOI":"10.1109\/ICCV.2017.244"},{"issue":"6","key":"10612_CR272","doi-asserted-by":"crossref","first-page":"514","DOI":"10.1016\/j.specom.2007.04.005","volume":"49","author":"A Zolnay","year":"2007","unstructured":"Zolnay A, Kocharov D, Schl\u00fcter R, Ney H (2007) Using multiple acoustic feature sets for speech recognition.
Speech Commun 49(6):514\u2013525","journal-title":"Speech Commun"}],"container-title":["Artificial Intelligence Review"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10462-023-10612-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10462-023-10612-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10462-023-10612-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,18]],"date-time":"2023-11-18T07:14:26Z","timestamp":1700291666000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10462-023-10612-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,25]]},"references-count":268,"journal-issue":{"issue":"S3","published-print":{"date-parts":[[2023,12]]}},"alternative-id":["10612"],"URL":"https:\/\/doi.org\/10.1007\/s10462-023-10612-2","relation":{},"ISSN":["0269-2821","1573-7462"],"issn-type":[{"value":"0269-2821","type":"print"},{"value":"1573-7462","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,25]]},"assertion":[{"value":"18 September 2023","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 October 2023","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}