{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T06:42:57Z","timestamp":1774939377859,"version":"3.50.1"},"reference-count":57,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2023,6,21]],"date-time":"2023-06-21T00:00:00Z","timestamp":1687305600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Speech separation is a well-known problem, especially when there is only one sound mixture available. Estimating the Ideal Binary Mask (IBM) is one solution to this problem. Recent research has focused on the supervised classification approach. The challenge of extracting features from the sources is critical for this method. Speech separation has been accomplished by using a variety of feature extraction models. The majority of them, however, are concentrated on a single feature. The complementary nature of various features has not been thoroughly investigated. In this paper, we propose a deep neural network (DNN) ensemble architecture to fully explore the complementary nature of the diverse features obtained from raw acoustic features. We examined the penultimate discriminative representations instead of employing the features acquired from the output layer. The learned representations were also fused to produce a new feature vector, which was then classified by using the Extreme Learning Machine (ELM). In addition, a genetic algorithm (GA) was created to optimize the parameters globally.
The results of the experiments showed that our proposed system completely considered various features and produced a high-quality IBM under different conditions.<\/jats:p>","DOI":"10.3390\/info14070352","type":"journal-article","created":{"date-parts":[[2023,6,22]],"date-time":"2023-06-22T02:09:17Z","timestamp":1687399757000},"page":"352","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Ensemble System of Deep Neural Networks for Single-Channel Audio Separation"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5542-9144","authenticated-orcid":false,"given":"Musab T. S.","family":"Al-Kaltakchi","sequence":"first","affiliation":[{"name":"Department of Electrical Engineering, College of Engineering, Mustansiriyah University, Baghdad 10047, Iraq"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6141-2605","authenticated-orcid":false,"given":"Ahmad Saeed","family":"Mohammad","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, College of Engineering, Mustansiriyah University, Baghdad 10047, Iraq"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8698-7605","authenticated-orcid":false,"given":"Wai Lok","family":"Woo","sequence":"additional","affiliation":[{"name":"Department of Computer and Information Sciences, Northumbria University, Newcastle upon Tyne NE1 8ST, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,6,21]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"297","DOI":"10.1006\/csla.1994.1016","article-title":"Computational auditory scene analysis","volume":"8","author":"Brown","year":"1994","journal-title":"Comput. Speech Lang."},{"key":"ref_2","unstructured":"Wang, D. (2005). 
Speech Separation by Humans and Machines, Springer."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1438","DOI":"10.1109\/TSMCB.2009.2039566","article-title":"Multiview spectral embedding","volume":"40","author":"Xia","year":"2010","journal-title":"IEEE Trans. Syst. Man, Cybern. Part B Cybern."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"2303","DOI":"10.1109\/TNNLS.2014.2308519","article-title":"Learning deep and wide: A spectral method for learning deep networks","volume":"25","author":"Shao","year":"2014","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"508","DOI":"10.1109\/TASL.2008.916519","article-title":"Combining spectral representations for large-vocabulary continuous speech recognition","volume":"16","author":"Garau","year":"2008","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1773","DOI":"10.1109\/TASLP.2017.2716443","article-title":"Two-stage single-channel audio source separation using deep neural networks","volume":"25","author":"Grais","year":"2017","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1535","DOI":"10.1109\/TASLP.2017.2700540","article-title":"A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks","volume":"25","author":"Wang","year":"2017","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Zhao, M., Yao, X., Wang, J., Yan, Y., Gao, X., and Fan, Y. (2021). Single-channel blind source separation of spatial aliasing signal based on stacked-LSTM. 
Sensors, 21.","DOI":"10.3390\/s21144844"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"2233","DOI":"10.1109\/TSP.2021.3064181","article-title":"Null space component analysis of one-shot single-channel source separation problem","volume":"69","author":"Hwang","year":"2021","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1109\/TASLP.2018.2869692","article-title":"Gaussian modeling-based multichannel audio source separation exploiting generic source spectral model","volume":"27","author":"Duong","year":"2018","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"369","DOI":"10.1109\/LSP.2021.3055463","article-title":"Ray-space-based multichannel nonnegative matrix factorization for audio source separation","volume":"28","author":"Pezzoli","year":"2021","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"100013","DOI":"10.1109\/ACCESS.2020.2997871","article-title":"Multi-head self-attention-based deep clustering for single-channel speech separation","volume":"8","author":"Jin","year":"2020","journal-title":"IEEE Access"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1016\/j.neucom.2021.01.052","article-title":"Generative adversarial networks for single channel separation of convolutive mixed speech signals","volume":"438","author":"Li","year":"2021","journal-title":"Neurocomputing"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1256","DOI":"10.1109\/TASLP.2019.2915167","article-title":"Conv-tasnet: Surpassing ideal time\u2013frequency magnitude masking for speech separation","volume":"27","author":"Luo","year":"2019","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. 
Process."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"530","DOI":"10.1109\/JSTSP.2020.2980956","article-title":"Multi-modal multi-channel target speech separation","volume":"14","author":"Gu","year":"2020","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"43444","DOI":"10.1109\/ACCESS.2021.3065775","article-title":"Singular spectrum analysis for source separation in drone-based audio recording","volume":"9","author":"Encinas","year":"2021","journal-title":"IEEE Access"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"2840","DOI":"10.1109\/TASLP.2021.3099291","article-title":"Wavesplit: End-to-end speech separation by speaker clustering","volume":"29","author":"Zeghidour","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Mika, D., Budzik, G., and Jozwik, J. (2020). Single channel source separation with ICA-based time-frequency decomposition. Sensors, 20.","DOI":"10.3390\/s20072019"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"6655125","DOI":"10.1155\/2021\/6655125","article-title":"An improved unsupervised single-channel speech separation algorithm for processing speech sensor signals","volume":"2021","author":"Jiang","year":"2021","journal-title":"Wirel. Commun. Mob. Comput."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"2083","DOI":"10.1109\/TASLP.2021.3082331","article-title":"Conditioned source separation for musical instrument performances","volume":"29","author":"Slizovskaia","year":"2021","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. 
Process."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"227399","DOI":"10.1109\/ACCESS.2020.3045791","article-title":"Majorization-minimization algorithm for discriminative non-negative matrix factorization","volume":"8","author":"Li","year":"2020","journal-title":"IEEE Access"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"788","DOI":"10.1109\/LSP.2019.2909968","article-title":"A moment-based estimation strategy for underdetermined single-sensor blind source separation","volume":"26","author":"Smith","year":"2019","journal-title":"IEEE Signal Process. Lett."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1424","DOI":"10.1109\/TASLP.2016.2558822","article-title":"A regression approach to single-channel speech separation via high-resolution deep neural networks","volume":"24","author":"Du","year":"2016","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"1652","DOI":"10.1109\/TASLP.2016.2580946","article-title":"Multichannel audio source separation with deep neural networks","volume":"24","author":"Nugraha","year":"2016","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1066","DOI":"10.1109\/TASLP.2016.2540805","article-title":"A pairwise algorithm using the deep stacking network for speech separation and pitch estimation","volume":"24","author":"Zhang","year":"2016","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1381","DOI":"10.1109\/TASL.2013.2250961","article-title":"Towards scaling up classification-based speech separation","volume":"21","author":"Wang","year":"2013","journal-title":"IEEE Trans. Audio Speech Lang. 
Process."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"2087","DOI":"10.1109\/TASLP.2014.2357677","article-title":"Informed single-channel speech separation using HMM\u2013GMM user-generated exemplar source","volume":"22","author":"Wang","year":"2014","journal-title":"IEEE\/ACM Trans. Audio Speech Lang. Process."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1722","DOI":"10.1109\/TNNLS.2013.2258680","article-title":"Single-channel blind separation using pseudo-stereo mixture and complex 2-D histogram","volume":"24","author":"Tengtrairat","year":"2013","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1355","DOI":"10.1109\/TASL.2013.2250959","article-title":"CLOSE\u2014A data-driven approach to speech separation","volume":"21","author":"Ming","year":"2013","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.specom.2010.08.005","article-title":"Mask classification for missing-feature reconstruction for robust speech recognition in unknown background noise","volume":"53","author":"Kim","year":"2011","journal-title":"Speech Commun."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"1135","DOI":"10.1109\/TNN.2004.832812","article-title":"Monaural speech segregation based on pitch tracking and amplitude modulation","volume":"15","author":"Hu","year":"2004","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"662","DOI":"10.1109\/TCSI.2012.2215735","article-title":"Unsupervised single-channel separation of nonstationary signals using Gammatone filterbank and itakura\u2013saito nonnegative matrix two-dimensional factorizations","volume":"60","author":"Gao","year":"2012","journal-title":"IEEE Trans. Circuits Syst. I Regul. 
Pap."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"489","DOI":"10.1016\/j.neucom.2005.12.126","article-title":"Extreme learning machine: Theory and applications","volume":"70","author":"Huang","year":"2006","journal-title":"Neurocomputing"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"2885","DOI":"10.1109\/TCYB.2015.2492468","article-title":"Extreme learning machine with subnetwork hidden nodes for regression and classification","volume":"46","author":"Yang","year":"2015","journal-title":"IEEE Trans. Cybern."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"809","DOI":"10.1109\/TNNLS.2015.2424995","article-title":"Extreme learning machine for multilayer perceptron","volume":"27","author":"Tang","year":"2015","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1109\/TSMCB.2011.2168604","article-title":"Extreme learning machine for regression and multiclass classification","volume":"42","author":"Huang","year":"2011","journal-title":"IEEE Trans. Syst. Man, Cybern. Part B Cybern."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"1486","DOI":"10.1121\/1.3184603","article-title":"An algorithm that improves speech intelligibility in noise for normal-hearing listeners","volume":"126","author":"Kim","year":"2009","journal-title":"J. Acoust. Soc. Am."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"270","DOI":"10.1109\/TASL.2012.2221459","article-title":"Exploring monaural features for classification-based speech segregation","volume":"21","author":"Wang","year":"2012","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"578","DOI":"10.1109\/89.326616","article-title":"RASTA processing of speech","volume":"2","author":"Hermansky","year":"1994","journal-title":"IEEE Trans. 
Speech Audio Process."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"80","DOI":"10.1186\/s13634-017-0515-7","article-title":"Evaluation of a speaker identification system with and without fusion using three databases in the presence of noise and handset effects","volume":"2017","author":"Woo","year":"2017","journal-title":"EURASIP J. Adv. Signal Process."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"851","DOI":"10.1007\/s10772-019-09630-9","article-title":"Thorough evaluation of TIMIT database speaker identification performance under noise with and without the G.712 type handset","volume":"22","author":"Abdullah","year":"2019","journal-title":"Int. J. Speech Technol."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"1236","DOI":"10.3906\/elk-1906-118","article-title":"Comparisons of extreme learning machine and backpropagation-based i-vector approach for speaker identification","volume":"28","author":"Abdullah","year":"2020","journal-title":"Turk. J. Electr. Eng. Comput. Sci."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"4903","DOI":"10.1007\/s00034-021-01697-7","article-title":"Combined i-vector and extreme learning machine approach for robust speaker identification and evaluation with SITW 2016, NIST 2008, TIMIT databases","volume":"40","author":"Abdullah","year":"2021","journal-title":"Circuits Syst. Signal Process."},{"key":"ref_44","unstructured":"Hinton, G.E. (2012). Neural Networks: Tricks of the Trade, Springer. [2nd ed.]."},{"key":"ref_45","unstructured":"Erhan, D., Courville, A., Bengio, Y., and Vincent, P. (2010, January 13\u201315). Why does unsupervised pre-training help deep learning?. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Sardinia, Italy."},{"key":"ref_46","unstructured":"Mohammad, A.S., Nguyen, D.H.H., Rattani, A., Puttagunta, R.S., Li, Z., and Derakhshani, R.R. (2021). 
Authentication Verification Using Soft Biometric Traits. (10,922,399), U.S. Patent."},{"key":"ref_47","unstructured":"Mohammad, A.S. (2018). Multi-Modal Ocular Recognition in Presence of Occlusion in Mobile Devices, University of Missouri-Kansas City."},{"key":"ref_48","first-page":"136","article-title":"Comparison of squeezed convolutional neural network models for eyeglasses detection in mobile environment","volume":"33","author":"Mohammad","year":"2018","journal-title":"J. Comput. Sci. Coll."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Mohammad, A.S., Reddy, N., James, F., and Beard, C. (2018, January 8\u201310). Demodulation of faded wireless signals using deep convolutional neural networks. Proceedings of the 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA.","DOI":"10.1109\/CCWC.2018.8301731"},{"key":"ref_50","unstructured":"Bezdek, J., and Hathaway, R. (2002). Advances in Soft Computing\u2014AFSS 2002, Springer."},{"key":"ref_51","unstructured":"Bhatia, R. (2013). Matrix Analysis, Springer Science & Business Media."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"621","DOI":"10.1016\/j.csl.2012.10.004","article-title":"The PASCAL CHiME speech separation and recognition challenge","volume":"27","author":"Barker","year":"2013","journal-title":"Comput. Speech Lang."},{"key":"ref_53","unstructured":"Goto, M., Hashiguchi, H., Nishimura, T., and Oka, R. (2023, April 23). RWC Music Database: Music Genre Database and Musical Instrument Sound Database. Available online: http:\/\/jhir.library.jhu.edu\/handle\/1774.2\/36."},{"key":"ref_54","unstructured":"Ellis, D. (2023, April 23). PLP, RASTA, and MFCC, Inversion in Matlab. 
Available online: http:\/\/www.ee.columbia.edu\/~dpwe\/resources\/matlab\/rastamat\/."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"793","DOI":"10.1162\/neco.2008.04-08-771","article-title":"Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis","volume":"21","author":"Bertin","year":"2009","journal-title":"Neural Comput."},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"2125","DOI":"10.1109\/TASL.2011.2114881","article-title":"An algorithm for intelligibility prediction of time\u2013frequency weighted noisy speech","volume":"19","author":"Taal","year":"2011","journal-title":"IEEE Trans. Audio Speech Lang. Process."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1561\/2200000006","article-title":"Learning deep architectures for AI","volume":"2","author":"Bengio","year":"2009","journal-title":"Found. Trends Mach. Learn."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/14\/7\/352\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:58:03Z","timestamp":1760126283000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/14\/7\/352"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,21]]},"references-count":57,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2023,7]]}},"alternative-id":["info14070352"],"URL":"https:\/\/doi.org\/10.3390\/info14070352","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,21]]}}}