{"status":"ok","message-type":"work","message-version":"1.0.0","message":{
"indexed":{"date-parts":[[2026,1,21]],"date-time":"2026-01-21T11:19:24Z","timestamp":1768994364389,"version":"3.49.0"},
"reference-count":42,"publisher":"MDPI AG","issue":"10",
"license":[{"start":{"date-parts":[[2024,10,4]],"date-time":"2024-10-04T00:00:00Z","timestamp":1728000000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],
"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],
"abstract":"<jats:p>An effective approach to addressing the speech separation problem is utilizing a time\u2013frequency (T-F) mask. The ideal binary mask (IBM) and ideal ratio mask (IRM) have long been widely used to separate speech signals. However, the IBM is better at improving speech intelligibility, while the IRM is better at improving speech quality. To leverage their respective strengths and overcome weaknesses, we propose an ideal threshold-based mask (ITM) to combine these two masks. By adjusting two thresholds, these two masks are combined to jointly act on speech separation. We list the impact of using different threshold combinations on speech separation performance under ideal conditions and discuss a reasonable range for fine tuning the thresholds. By using masks as a training target, to evaluate the effectiveness of the proposed method, we conducted supervised speech separation experiments applying a deep neural network (DNN) and long short-term memory (LSTM), the results of which were measured by three objective indicators: the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio improvement (SAR). Experimental results show that the proposed mask combines the strengths of the IBM and IRM and implies that the accuracy of speech separation can potentially be further improved by effectively leveraging the advantages of different masks.<\/jats:p>",
"DOI":"10.3390\/info15100608","type":"journal-article",
"created":{"date-parts":[[2024,10,4]],"date-time":"2024-10-04T07:49:42Z","timestamp":1728028182000},
"page":"608","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,
"title":["Threshold-Based Combination of Ideal Binary Mask and Ideal Ratio Mask for Single-Channel Speech Separation"],
"prefix":"10.3390","volume":"15",
"author":[{"given":"Peng","family":"Chen","sequence":"first","affiliation":[{"name":"Graduate School of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Osaka, Japan"}]},{"given":"Binh Thien","family":"Nguyen","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Osaka, Japan"}]},{"given":"Kenta","family":"Iwai","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Osaka, Japan"}]},{"given":"Takanobu","family":"Nishiura","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Osaka, Japan"}]}],
"member":"1968","published-online":{"date-parts":[[2024,10,4]]},
"reference":[
{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Makino, S., Lee, T., and Sawada, H. (2007). Blind Speech Separation, Springer.","DOI":"10.1007\/978-1-4020-6479-1"},
{"key":"ref_2","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1631\/FITEE.1700814","article-title":"Past review, current progress, and challenges ahead on the cocktail party problem","volume":"19","author":"Qian","year":"2018","journal-title":"Front. Inf. Technol. Electron. Eng."},
{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1702","DOI":"10.1109\/TASLP.2018.2842159","article-title":"Supervised speech separation based on deep learning: An overview","volume":"26","author":"Wang","year":"2018","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},
{"key":"ref_4","doi-asserted-by":"crossref","first-page":"31035","DOI":"10.1007\/s11042-023-14649-x","article-title":"A review on speech separation in cocktail party environment: Challenges and approaches","volume":"82","author":"Agrawal","year":"2023","journal-title":"Multimed. Tools Appl."},
{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Yu, D., and Deng, L. (2016). Automatic Speech Recognition, Springer.","DOI":"10.1007\/978-1-4471-5779-3"},
{"key":"ref_6","doi-asserted-by":"crossref","first-page":"9411","DOI":"10.1007\/s11042-020-10073-7","article-title":"Automatic speech recognition: A survey","volume":"80","author":"Malik","year":"2021","journal-title":"Multimed. Tools Appl."},
{"key":"ref_7","unstructured":"St\u00fcber, G.L., and Steuber, G.L. (2001). Principles of Mobile Communication, Springer."},
{"key":"ref_8","doi-asserted-by":"crossref","first-page":"7332","DOI":"10.1073\/pnas.0610245104","article-title":"Structure and tie strengths in mobile communication networks","volume":"104","author":"Onnela","year":"2007","journal-title":"Proc. Natl. Acad. Sci. USA"},
{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1437","DOI":"10.1109\/5.628714","article-title":"Speaker recognition: A tutorial","volume":"85","author":"Campbell","year":"1997","journal-title":"Proc. IEEE"},
{"key":"ref_10","doi-asserted-by":"crossref","first-page":"74","DOI":"10.1109\/MSP.2015.2462851","article-title":"Speaker recognition by machines and humans: A tutorial review","volume":"32","author":"Hansen","year":"2015","journal-title":"IEEE Signal Process. Mag."},
{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1819","DOI":"10.1016\/j.sigpro.2007.01.011","article-title":"Source separation using single channel ICA","volume":"87","author":"Davies","year":"2007","journal-title":"Signal Process."},
{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.csl.2009.02.006","article-title":"Monaural speech separation and recognition challenge","volume":"24","author":"Cooke","year":"2010","journal-title":"Comput. Speech Lang."},
{"key":"ref_13","doi-asserted-by":"crossref","first-page":"405","DOI":"10.1109\/89.242486","article-title":"Multi-channel signal separation by decorrelation","volume":"1","author":"Weinstein","year":"1993","journal-title":"IEEE Trans. Speech Audio Process."},
{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1652","DOI":"10.1109\/TASLP.2016.2580946","article-title":"Multichannel audio source separation with deep neural networks","volume":"24","author":"Nugraha","year":"2016","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},
{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Loizou, P.C. (2007). Speech Enhancement: Theory and Practice, CRC Press.","DOI":"10.1201\/9781420015836"},
{"key":"ref_16","doi-asserted-by":"crossref","first-page":"731","DOI":"10.1109\/89.952491","article-title":"Speech enhancement using a constrained iterative sinusoidal model","volume":"9","author":"Jensen","year":"2001","journal-title":"IEEE Trans. Speech Audio Process."},
{"key":"ref_17","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1109\/TASSP.1979.1163209","article-title":"Suppression of acoustic noise in speech using spectral subtraction","volume":"27","author":"Boll","year":"1979","journal-title":"IEEE Trans. Acoust. Speech, Signal Process."},
{"key":"ref_18","doi-asserted-by":"crossref","first-page":"445","DOI":"10.1109\/89.709670","article-title":"HMM-based strategies for enhancement of speech signals embedded in nonstationary noise","volume":"6","author":"Sameti","year":"1998","journal-title":"IEEE Trans. Speech Audio Process."},
{"key":"ref_19","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1006\/dspr.1999.0361","article-title":"Speaker verification using adapted Gaussian mixture models","volume":"10","author":"Reynolds","year":"2000","journal-title":"Digit. Signal Process."},
{"key":"ref_20","doi-asserted-by":"crossref","first-page":"297","DOI":"10.1006\/csla.1994.1016","article-title":"Computational auditory scene analysis","volume":"8","author":"Brown","year":"1994","journal-title":"Comput. Speech Lang."},
{"key":"ref_21","unstructured":"Wang, D., and Brown, G.J. (2006). Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, Wiley-IEEE Press."},
{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Liu, Y., and Wang, D. (2018, January 15\u201320). A CASA approach to deep learning based speaker-independent co-channel speech separation. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.","DOI":"10.1109\/ICASSP.2018.8461477"},
{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Wang, D. (2005). On ideal binary mask as the computational goal of auditory scene analysis. Speech Separation by Humans and Machines, Springer.","DOI":"10.1007\/0-387-22794-6_12"},
{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1415","DOI":"10.1121\/1.3179673","article-title":"Role of mask pattern in intelligibility of ideal binary-masked noisy speech","volume":"126","author":"Kjems","year":"2009","journal-title":"J. Acoust. Soc. Am."},
{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Narayanan, A., and Wang, D. (2013, January 26\u201331). Ideal ratio mask estimation using deep neural networks for robust speech recognition. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada.","DOI":"10.1109\/ICASSP.2013.6639038"},
{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Hummersone, C., Stokes, T., and Brookes, T. (2014). On the ideal ratio mask as the goal of computational auditory scene analysis. Blind Source Separation: Advances in Theory, Algorithms and Applications, Springer.","DOI":"10.1007\/978-3-642-55016-4_12"},
{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1673","DOI":"10.1121\/1.2832617","article-title":"Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction","volume":"123","author":"Li","year":"2008","journal-title":"J. Acoust. Soc. Am."},
{"key":"ref_28","doi-asserted-by":"crossref","first-page":"2336","DOI":"10.1121\/1.3083233","article-title":"Speech intelligibility in background noise with ideal binary time-frequency masking","volume":"125","author":"Wang","year":"2009","journal-title":"J. Acoust. Soc. Am."},
{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1849","DOI":"10.1109\/TASLP.2014.2352935","article-title":"On training targets for supervised speech separation","volume":"22","author":"Wang","year":"2014","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},
{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Minipriya, T., and Rajavel, R. (2018, January 27\u201328). Review of ideal binary and ratio mask estimation techniques for monaural speech separation. Proceedings of the 2018 Fourth International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), Chennai, India.","DOI":"10.1109\/AEEICB.2018.8480857"},
{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Chen, J., and Wang, D. (2018). DNN Based Mask Estimation for Supervised Speech Separation. Audio Source Separation, Springer.","DOI":"10.1007\/978-3-319-73031-8_9"},
{"key":"ref_32","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1109\/LSP.2013.2291240","article-title":"An experimental study on speech enhancement based on deep neural networks","volume":"21","author":"Xu","year":"2013","journal-title":"IEEE Signal Process. Lett."},
{"key":"ref_33","doi-asserted-by":"crossref","first-page":"1424","DOI":"10.1109\/TASLP.2016.2558822","article-title":"A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks","volume":"24","author":"Du","year":"2016","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},
{"key":"ref_34","doi-asserted-by":"crossref","first-page":"1085","DOI":"10.1109\/TASLP.2017.2687829","article-title":"Features for masking-based monaural speech separation in reverberant conditions","volume":"25","author":"Delfarah","year":"2017","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},
{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J.R., and Schuller, B. (2015, January 25\u201328). Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. Proceedings of the Latent Variable Analysis and Signal Separation: 12th International Conference, LVA\/ICA 2015, Liberec, Czech Republic. Proceedings 12.","DOI":"10.1007\/978-3-319-22482-4_11"},
{"key":"ref_36","doi-asserted-by":"crossref","first-page":"4705","DOI":"10.1121\/1.4986931","article-title":"Long short-term memory for speaker generalization in supervised speech separation","volume":"141","author":"Chen","year":"2017","journal-title":"J. Acoust. Soc. Am."},
{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Strake, M., Defraene, B., Fluyt, K., Tirry, W., and Fingscheidt, T. (2019, January 20\u201323). Separated noise suppression and speech restoration: LSTM-based speech enhancement in two stages. Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA.","DOI":"10.1109\/WASPAA.2019.8937222"},
{"key":"ref_38","unstructured":"Garofolo, J.S. (1993). Timit Acoustic Phonetic Continuous Speech Corpus, Linguistic Data Consortium."},
{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Huang, P.S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. (2014, January 4\u20139). Deep learning for monaural speech separation. Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6853860"},
{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Grais, E.M., Sen, M.U., and Erdogan, H. (2014, January 4\u20139). Deep neural networks for single channel source separation. Proceedings of the IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP), Florence, Italy.","DOI":"10.1109\/ICASSP.2014.6854299"},
{"key":"ref_41","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv."},
{"key":"ref_42","doi-asserted-by":"crossref","first-page":"1462","DOI":"10.1109\/TSA.2005.858005","article-title":"Performance measurement in blind audio source separation","volume":"14","author":"Vincent","year":"2006","journal-title":"IEEE Trans. Audio Speech Lang. Process."}],
"container-title":["Information"],"original-title":[],"language":"en",
"link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/15\/10\/608\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],
"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:10:38Z","timestamp":1760112638000},
"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/15\/10\/608"}},
"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,4]]},"references-count":42,
"journal-issue":{"issue":"10","published-online":{"date-parts":[[2024,10]]}},
"alternative-id":["info15100608"],"URL":"https:\/\/doi.org\/10.3390\/info15100608","relation":{},
"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],
"published":{"date-parts":[[2024,10,4]]}}}