{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T00:27:03Z","timestamp":1760228823876,"version":"build-2065373602"},"reference-count":42,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2022,5,24]],"date-time":"2022-05-24T00:00:00Z","timestamp":1653350400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Environmental Protection Engineering and Technology Center for Road Traffic Noise Control","award":["F20221018"],"award-info":[{"award-number":["F20221018"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>Most deep-learning-based multi-channel speech enhancement methods focus on designing a set of beamforming coefficients to directly filter the low signal-to-noise ratio signals received by the microphones, which limits the performance of these approaches. To address this problem, this paper designs a causal neural filter that fully exploits the spectro-temporal-spatial information in the beamspace domain. Specifically, in the first stage, multiple beams steering towards all directions are formed with a parameterized super-directive beamformer. In the second stage, a deep-learning-based filter is learned by simultaneously modeling the spectro-temporal-spatial discriminability of the speech and the interference, so as to coarsely extract the desired speech. Finally, a residual estimation module is adopted to refine the output of the second stage and further suppress the residual interference components, especially at low frequencies.
Experimental results demonstrate that the proposed approach outperforms several state-of-the-art (SOTA) multi-channel methods on a multi-channel speech dataset generated from the DNS-Challenge dataset.<\/jats:p>","DOI":"10.3390\/sym14061081","type":"journal-article","created":{"date-parts":[[2022,5,25]],"date-time":"2022-05-25T05:12:27Z","timestamp":1653455547000},"page":"1081","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["A Neural Beamspace-Domain Filter for Real-Time Multi-Channel Speech Enhancement"],"prefix":"10.3390","volume":"14","author":[{"given":"Wenzhe","family":"Liu","sequence":"first","affiliation":[{"name":"Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China"}]},{"given":"Andong","family":"Li","sequence":"additional","affiliation":[{"name":"Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China"}]},{"given":"Xiao","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Humanities and Management, Southwest Medical University, Luzhou 646099, China"},{"name":"National Environmental Protection Engineering and Technology Center for Road Traffic Noise Control, Beijing 100088, China"}]},{"given":"Minmin","family":"Yuan","sequence":"additional","affiliation":[{"name":"National Environmental Protection Engineering and Technology Center for Road Traffic Noise Control, Beijing 100088, China"},{"name":"Research Institute of Highway Ministry of Transport, Beijing 100088, China"}]},{"given":"Yi","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Humanities and Management, Southwest Medical University, Luzhou 646099, China"},{"name":"National Environmental Protection Engineering and Technology Center for Road Traffic Noise Control, Beijing 100088, 
China"}]},{"given":"Chengshi","family":"Zheng","sequence":"additional","affiliation":[{"name":"Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"National Environmental Protection Engineering and Technology Center for Road Traffic Noise Control, Beijing 100088, China"}]},{"given":"Xiaodong","family":"Li","sequence":"additional","affiliation":[{"name":"Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,5,24]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1702","DOI":"10.1109\/TASLP.2018.2842159","article-title":"Supervised speech separation based on deep learning: An overview","volume":"26","author":"Wang","year":"2018","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Proc."},{"key":"ref_2","unstructured":"Benesty, J., Makino, S., and Chen, J. (2005). Speech Enhancement, Springer Science & Business Media."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Makino, S., Lee, T.W., and Sawada, H. (2007). Blind Speech Separation, Springer Science & Business Media.","DOI":"10.1007\/978-1-4020-6479-1"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Tawara, N., Kobayashi, T., and Ogawa, T. (2019, September 15\u201319). Multi-channel speech enhancement using time-domain convolutional denoising autoencoder. Proceedings of the Interspeech, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-3197"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1888","DOI":"10.1109\/TASLP.2020.2976193","article-title":"Multichannel speech enhancement by raw waveform-mapping using fully convolutional networks","volume":"28","author":"Liu","year":"2020","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. 
Proc."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"542","DOI":"10.1109\/JSTSP.2020.2987209","article-title":"Audio-visual speech separation and dereverberation with a two-stage multimodal network","volume":"14","author":"Tan","year":"2020","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Wu, J., Chen, Z., Li, J., Yoshioka, T., Tan, Z., Lin, E., Luo, Y., and Xie, L. (2020, October 25\u201329). An end-to-end architecture of online multi-channel speech separation. Proceedings of the Interspeech, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1981"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Gu, R., Chen, L., Zhang, S., Zheng, J., Xu, Y., Yu, M., Su, D., Zou, Y., and Yu, D. (2019, September 15\u201319). Neural spatial filter: Target speaker speech separation assisted with directional information. Proceedings of the Interspeech, Graz, Austria.","DOI":"10.21437\/Interspeech.2019-2266"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Fu, Y., Wu, J., Hu, Y., Xing, M., and Xie, L. (2021, January 19\u201322). DESNet: A multi-channel network for simultaneous speech dereverberation, enhancement and separation. Proceedings of the IEEE Spoken Language Technology Workshop, Shenzhen, China.","DOI":"10.1109\/SLT48900.2021.9383604"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Xu, Y., Yu, M., Zhang, S., Chen, L., Weng, C., Liu, J., and Yu, D. (2020, October 25\u201329). Neural Spatio-Temporal Beamformer for Target Speech Separation. Proceedings of the Interspeech, Shanghai, China.","DOI":"10.21437\/Interspeech.2020-1458"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Xu, Y., Yu, M., Zhang, S., Chen, L., and Yu, D. (2021, June 6\u201311). ADL-MVDR: All deep learning MVDR beamformer for target speech separation. 
Proceedings of the ICASSP, Toronto, ON, Canada.","DOI":"10.1109\/ICASSP39728.2021.9413594"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Heymann, J., Drude, L., and Haeb-Umbach, R. (2016, March 20\u201325). Neural network based spectral mask estimation for acoustic beamforming. Proceedings of the ICASSP, Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7471664"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Zhang, X., Wang, Z., and Wang, D. (2017, March 5\u20139). A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR. Proceedings of the ICASSP, New Orleans, LA, USA.","DOI":"10.1109\/ICASSP.2017.7952161"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1370","DOI":"10.1109\/LSP.2021.3076374","article-title":"Complex neural spatial filter: Enhancing multi-channel target speech separation in complex domain","volume":"28","author":"Gu","year":"2021","journal-title":"IEEE Signal Proc. Let."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"3291","DOI":"10.1121\/10.0011396","article-title":"Low-latency monaural speech enhancement with deep filter-bank equalizer","volume":"151","author":"Zheng","year":"2022","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Luo, Y., Han, C., Mesgarani, N., Ceolini, E., and Liu, S. (2019, December 14\u201318). FaSNet: Low-latency adaptive beamforming for multi-microphone audio processing. Proceedings of the ASRU, Sentosa, Singapore.","DOI":"10.1109\/ASRU46091.2019.9003849"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Luo, Y., Chen, Z., Mesgarani, N., and Yoshioka, T. (2020, May 4\u20138). End-to-end microphone permutation and number invariant multi-channel speech separation. 
Proceedings of the ICASSP, Virtual.","DOI":"10.1109\/ICASSP40776.2020.9054177"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Xiao, X., Watanabe, S., Erdogan, H., Lu, L., Hershey, J., Seltzer, M.L., Chen, G., Zhang, Y., Mandel, M., and Yu, D. (2016, March 20\u201325). Deep beamforming networks for multi-channel speech recognition. Proceedings of the ICASSP, Shanghai, China.","DOI":"10.1109\/ICASSP.2016.7472778"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Xu, Y., Zhang, Z., Yu, M., Zhang, S., and Yu, D. (2021). Generalized spatio-temporal RNN beamformer for target speech separation. arXiv.","DOI":"10.21437\/Interspeech.2021-430"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Ren, X., Zhang, X., Chen, L., Zheng, X., Zhang, X., Guo, L., and Yu, B. (2021, August 30\u2013September 3). A causal U-net based neural beamforming network for real-time multi-channel speech enhancement. Proceedings of the Interspeech, Brno, Czechia.","DOI":"10.21437\/Interspeech.2021-1457"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Chen, J., Li, J., Xiao, X., Yoshioka, T., Wang, H., Wang, Z., and Gong, Y. (2017, December 16\u201320). Cracking the cocktail party problem by multi-beam deep attractor network. Proceedings of the ASRU, Okinawa, Japan.","DOI":"10.1109\/ASRU.2017.8268969"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Reddy, C., Dubey, H., Gopal, V., Cutler, R., Braun, S., Gamper, H., Aichner, R., and Srinivasan, S. (2021, June 6\u201311). ICASSP 2021 deep noise suppression challenge. Proceedings of the ICASSP, Toronto, ON, Canada.","DOI":"10.21437\/Interspeech.2021-1609"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Li, A., Zheng, C., Zhang, L., and Li, X. (2021). Glance and gaze: A collaborative learning framework for single-channel speech enhancement. 
arXiv.","DOI":"10.1016\/j.apacoust.2021.108499"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1121\/1.395561","article-title":"Maximum directivity proof for three-dimensional arrays","volume":"82","author":"Parsons","year":"1987","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1548","DOI":"10.1109\/TASLP.2016.2568044","article-title":"Reduced-Order Robust Superdirective Beamforming With Uniform Linear Microphone Arrays","volume":"24","author":"Pan","year":"2016","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Proc."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1829","DOI":"10.1109\/TASLP.2021.3079813","article-title":"Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement","volume":"29","author":"Li","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Proc."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Tan, K., and Wang, D. (2018, September 2\u20136). A convolutional recurrent neural network for real-time speech enhancement. Proceedings of the Interspeech, Hyderabad, India.","DOI":"10.21437\/Interspeech.2018-1405"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Long, J., Shelhamer, E., and Darrell, T. (2015, June 8\u201310). Fully convolutional networks for semantic segmentation. Proceedings of the CVPR, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014, June 22\u201327). A convolutional neural network for modelling sentences. Proceedings of the ACL, Baltimore, MD, USA.","DOI":"10.3115\/v1\/P14-1062"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Ciaburro, G., and Iannace, G. (2020). Improving smart cities safety using sound events detection based on deep neural network algorithms. 
Informatics, 7.","DOI":"10.3390\/informatics7030023"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Ciaburro, G. (2020). Sound Event Detection in Underground Parking Garage Using Convolutional Neural Network. Big Data Cogn. Comput., 4.","DOI":"10.3390\/bdcc4030020"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"380","DOI":"10.1109\/TASLP.2019.2955276","article-title":"Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement","volume":"28","author":"Tan","year":"2019","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Proc."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"107404","DOI":"10.1016\/j.patcog.2020.107404","article-title":"U2-Net: Going deeper with nested U-structure for salient object detection","volume":"106","author":"Qin","year":"2020","journal-title":"Pattern Recognit."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"103519","DOI":"10.1016\/j.dsp.2022.103519","article-title":"A separation and interaction framework for causal multi-channel speech enhancement","volume":"126","author":"Liu","year":"2022","journal-title":"Digital Signal Process."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1016\/0167-6393(90)90010-7","article-title":"Speech database development at MIT: TIMIT and beyond","volume":"9","author":"Zue","year":"1990","journal-title":"Speech Commun."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"247","DOI":"10.1016\/0167-6393(93)90095-3","article-title":"Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems","volume":"12","author":"Varga","year":"1993","journal-title":"Speech Commun."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Barker, J., Marxer, R., Vincent, E., and Watanabe, S. (2015, December 13\u201317). 
The third \u2018CHiME\u2019 speech separation and recognition challenge: Dataset, task and baselines. Proceedings of the ASRU, Scottsdale, AZ, USA.","DOI":"10.1109\/ASRU.2015.7404837"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"943","DOI":"10.1121\/1.382599","article-title":"Image method for efficiently simulating small-room acoustics","volume":"65","author":"Allen","year":"1979","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Zhang, J., Zoril\u0103, C., Doddipatla, R., and Barker, J. (2020, January 4\u20138). On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments. Proceedings of the ICASSP, Virtual.","DOI":"10.1109\/ICASSP40776.2020.9053833"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Rao, W., Fu, Y., Hu, Y., Xu, X., Jv, Y., Han, J., Shang, S., Jiang, Z., Xie, L., and Wang, Y. (2021). Interspeech 2021 conferencingspeech challenge: Towards far-field multi-channel speech enhancement for video conferencing. arXiv.","DOI":"10.1109\/ASRU51503.2021.9688126"},{"key":"ref_41","unstructured":"Rix, A., Beerends, J., Hollier, M., and Hekstra, A. (2001, January 7\u201311). Perceptual evaluation of speech quality (pesq)\u2014A new method for speech quality assessment of telephone networks and codecs. Proceedings of the ICASSP, Salt Palace Convention Center, Salt Lake City, UT, USA."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"2009","DOI":"10.1109\/TASLP.2016.2585878","article-title":"An algorithm for predicting the intelligibility of speech masked by modulated noise maskers","volume":"24","author":"Jensen","year":"2016","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. 
Proc."}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/14\/6\/1081\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:17:45Z","timestamp":1760138265000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/14\/6\/1081"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,24]]},"references-count":42,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2022,6]]}},"alternative-id":["sym14061081"],"URL":"https:\/\/doi.org\/10.3390\/sym14061081","relation":{},"ISSN":["2073-8994"],"issn-type":[{"type":"electronic","value":"2073-8994"}],"subject":[],"published":{"date-parts":[[2022,5,24]]}}}