{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T13:18:02Z","timestamp":1740143882894,"version":"3.37.3"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,12,6]],"date-time":"2023-12-06T00:00:00Z","timestamp":1701820800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,12,6]],"date-time":"2023-12-06T00:00:00Z","timestamp":1701820800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["No.62071039"],"award-info":[{"award-number":["No.62071039"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Beijing Natural Science Foundation","award":["No.L223033"],"award-info":[{"award-number":["No.L223033"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J AUDIO SPEECH MUSIC PROC."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Target speaker separation aims to separate the speech components of the target speaker from mixed speech and remove extraneous components such as noise. In recent years, deep learning-based speech separation methods have made significant breakthroughs and have gradually become mainstream. However, these existing methods generally face problems with system latency and performance upper limits due to the large model size. To solve these problems, this paper proposes improvements in the network structure and training methods to enhance the model\u2019s performance. 
A lightweight target speaker separation network based on long-short-term memory (LSTM) is proposed, which can reduce the model size and computational delay while maintaining the separation performance. Based on this, a target speaker separation method based on joint training is proposed to achieve the overall training and optimization of the target speaker separation system. Joint loss functions based on speaker registration and speaker separation are proposed for joint training of the network to further improve the system\u2019s performance. The experimental results show that the lightweight target speaker separation network proposed in this paper has better performance while being lightweight, and joint training of the target speaker separation network with our proposed loss function can further improve the separation performance of the original model.<\/jats:p>","DOI":"10.1186\/s13636-023-00317-3","type":"journal-article","created":{"date-parts":[[2023,12,6]],"date-time":"2023-12-06T14:03:35Z","timestamp":1701871415000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Lightweight target speaker separation network based on joint 
training"],"prefix":"10.1186","volume":"2023","author":[{"given":"Jing","family":"Wang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hanyue","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Liang","family":"Xu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenjing","family":"Yang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-5217-5966","authenticated-orcid":false,"given":"Weiming","family":"Yi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fang","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2023,12,6]]},"reference":[{"key":"317_CR1","doi-asserted-by":"crossref","unstructured":"E.C. Cherry, Some experiments on the recognition of speech, with one and with two ears. J. Acoust. Soc. Am. 25(5), 975\u2013979 (1953)","DOI":"10.1121\/1.1907229"},{"key":"317_CR2","doi-asserted-by":"crossref","unstructured":"D.L. Wang, J. Chen, Supervised speech separation based on deep learning: an overview. IEEE Trans. Audio Speech Lang. Process. 26(10), 1702\u20131726 (2018)","DOI":"10.1109\/TASLP.2018.2842159"},{"key":"317_CR3","unstructured":"A. Canziani, A. Paszke, E. Culurciello, An analysis of deep neural network models for practical applications (2016). http:\/\/arxiv.org\/abs\/1605.07678"},{"key":"317_CR4","doi-asserted-by":"crossref","unstructured":"W.S. Noble, What is a support vector machine? Nat. Biotechnol. 24(12), 1565\u20131567 (2006)","DOI":"10.1038\/nbt1206-1565"},{"key":"317_CR5","doi-asserted-by":"crossref","unstructured":"D.L. Wang, On ideal binary mask as the computational goal of auditory scene analysis. 
In: Speech Separation by Humans and Machines, pp. 181\u2013197. (Springer, Boston, 2005)","DOI":"10.1007\/0-387-22794-6_12"},{"key":"317_CR6","doi-asserted-by":"crossref","unstructured":"P.S. Huang, M. Kim, M. Hasegawa-Johnson, et al., Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE Trans. Audio Speech Lang. Process. 23(12), 2136\u20132147 (2015)","DOI":"10.1109\/TASLP.2015.2468583"},{"key":"317_CR7","doi-asserted-by":"crossref","unstructured":"J. Chen, D.L. Wang, Long short-term memory for speaker generalization in supervised speech separation. J. Acoust. Soc. Am. 141(6), 4705\u20134714 (2017)","DOI":"10.1121\/1.4986931"},{"key":"317_CR8","doi-asserted-by":"crossref","unstructured":"Y. Luo, N. Mesgarani, Tasnet: Time-domain audio separation network for real-time, single-channel speech separation. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). (IEEE, Calgary, 2018), p. 696\u2013700","DOI":"10.1109\/ICASSP.2018.8462116"},{"key":"317_CR9","doi-asserted-by":"crossref","unstructured":"Y. Luo, N. Mesgarani, Conv-tasnet: Surpassing ideal time frequency magnitude masking for speech separation. IEEE Trans. Audio Speech Lang. Process. 27(8), 1256\u20131266 (2019)","DOI":"10.1109\/TASLP.2019.2915167"},{"key":"317_CR10","doi-asserted-by":"crossref","unstructured":"C. Lea, M.D. Flynn, R. Vidal, et al., Temporal convolutional networks for action segmentation and detection. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., (IEEE, Honolulu, 2017),\u00a0 p. 156\u2013165","DOI":"10.1109\/CVPR.2017.113"},{"key":"317_CR11","doi-asserted-by":"crossref","unstructured":"Y. Luo, Z. Chen, T. Yoshioka, Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). (IEEE, Barcelona, 2020), p. 46\u201350","DOI":"10.1109\/ICASSP40776.2020.9054266"},{"key":"317_CR12","doi-asserted-by":"crossref","unstructured":"M.H. 
Radfar, R.M. Dansereau, A. Sayadiyan, Monaural speech segregation based on fusion of source-driven with model-driven techniques. Speech Commun. 49(6), 464\u2013476 (2007)","DOI":"10.1016\/j.specom.2007.04.007"},{"key":"317_CR13","doi-asserted-by":"crossref","unstructured":"M.N. Schmidt, R.K. Olsson, Single-channel speech separation using sparse non-negative matrix factorization. In: Interspeech. (ISCA, Pittsburgh, 2006), p. 2\u20135","DOI":"10.21437\/Interspeech.2006-655"},{"key":"317_CR14","doi-asserted-by":"crossref","unstructured":"J.V. Stone, Independent Component Analysis: a Tutorial Introduction. (MIT Press, Cambridge, 2004)","DOI":"10.7551\/mitpress\/3717.001.0001"},{"key":"317_CR15","doi-asserted-by":"publisher","unstructured":"J. Wang, H. Liu, H. Ying, C. Qiu, J. Li, M.S. Anwar, Attention-based neural network for end-to-end music separation. CAAI Trans. Intell. Technol. 8, 355\u2013363 (2023). https:\/\/doi.org\/10.1049\/cit2.12163","DOI":"10.1049\/cit2.12163"},{"key":"317_CR16","doi-asserted-by":"crossref","unstructured":"K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani, Speaker-aware neural network based beamformer for speaker extraction in speech mixtures. In: Interspeech (2017)","DOI":"10.21437\/Interspeech.2017-667"},{"key":"317_CR17","doi-asserted-by":"crossref","unstructured":"K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani, Learning speaker representation for neural network based multichannel speaker extraction. In: IEEE Autom. Speech Recognit. Understanding Workshop (ASRU). (IEEE, Okinawa, 2017), p. 8\u201315","DOI":"10.1109\/ASRU.2017.8268910"},{"key":"317_CR18","doi-asserted-by":"crossref","unstructured":"J. Wang, J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, D. Yu, Deep extractor network for target speaker recovery from single channel speech mixtures (2018). 
http:\/\/arxiv.org\/abs\/1807.08974","DOI":"10.21437\/Interspeech.2018-1205"},{"key":"317_CR19","doi-asserted-by":"publisher","unstructured":"Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. Saurous, R. Weiss, Y. Jia, I. Moreno, Voicefilter: targeted voice separation by speaker-conditioned spectrogram masking, pp. 2728\u20132732 (2019). https:\/\/doi.org\/10.21437\/Interspeech.2019-1101","DOI":"10.21437\/Interspeech.2019-1101"},{"key":"317_CR20","doi-asserted-by":"crossref","unstructured":"C. Xu, W. Rao, E.S. Chng, H. Li, Spex: Multi-scale time domain speaker extraction network. IEEE Trans. Audio Speech Lang. Process. 28, 1370\u20131384 (2020)","DOI":"10.1109\/TASLP.2020.2987429"},{"key":"317_CR21","doi-asserted-by":"publisher","unstructured":"S. He, H. Li, X. Zhang, Speakerfilter-pro: an improved target speaker extractor combines the time domain and frequency domain. In: 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 473\u2013477 (2022). https:\/\/doi.org\/10.1109\/ISCSLP57327.2022.10037794","DOI":"10.1109\/ISCSLP57327.2022.10037794"},{"key":"317_CR22","doi-asserted-by":"crossref","unstructured":"J.S. Chung, J. Huh, S. Mun, et al., In defence of metric learning for speaker recognition (2020). http:\/\/arxiv.org\/abs\/2003.11982","DOI":"10.21437\/Interspeech.2020-1064"},{"key":"317_CR23","doi-asserted-by":"crossref","unstructured":"T. Kinnunen, H. Li, An overview of text-independent speaker recognition: from features to supervectors. Speech Commun. 52(1), 12\u201340 (2010)","DOI":"10.1016\/j.specom.2009.08.009"},{"key":"317_CR24","doi-asserted-by":"crossref","unstructured":"Y. Tai, J. Yang, X. Liu, Image super-resolution via deep recursive residual network. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (IEEE, Honolulu, 2017), p. 3147\u20133155","DOI":"10.1109\/CVPR.2017.298"},{"key":"317_CR25","doi-asserted-by":"crossref","unstructured":"E. Variani, X. Lei, E. 
McDermott, et al., Deep neural networks for small footprint text-dependent speaker verification. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). (IEEE, Florence, 2014), p. 4052\u20134056","DOI":"10.1109\/ICASSP.2014.6854363"},{"key":"317_CR26","doi-asserted-by":"crossref","unstructured":"J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., (IEEE, Salt Lake City, 2018), p. 7132\u20137141","DOI":"10.1109\/CVPR.2018.00745"},{"key":"317_CR27","doi-asserted-by":"crossref","unstructured":"D. Lee, Z. Tian, L. Xue, et al., Enhancing content preservation in text style transfer using reverse attention and conditional layer normalization (2021). http:\/\/arxiv.org\/abs\/2108.00449","DOI":"10.18653\/v1\/2021.acl-long.8"},{"key":"317_CR28","unstructured":"J.L. Ba, J.R. Kiros, G.E. Hinton, Layer normalization (2016). http:\/\/arxiv.org\/abs\/1607.06450"},{"key":"317_CR29","doi-asserted-by":"crossref","unstructured":"F. Wang, J. Cheng, W. Liu, et al., Additive margin softmax for face verification. IEEE Signal Process Lett. 25(7), 926\u2013930 (2018)","DOI":"10.1109\/LSP.2018.2822810"},{"key":"317_CR30","unstructured":"W.J. Yang, et al., A target speaker separation neural network with joint-training. In: 2021 Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit Conf. (APSIPA ASC). (APSIPA, Tokyo, 2021),\u00a0p. 614\u2013618"},{"key":"317_CR31","doi-asserted-by":"crossref","unstructured":"H. Wang, Y. Wang, Z. Zhou, et al., Cosface: large margin cosine loss for deep face recognition. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., (IEEE, Salt Lake City, 2018),\u00a0p. 5265\u20135274","DOI":"10.1109\/CVPR.2018.00552"},{"key":"317_CR32","unstructured":"A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification (2017). http:\/\/arxiv.org\/abs\/1703.07737"},{"key":"317_CR33","doi-asserted-by":"crossref","unstructured":"A. Nagrani, J.S. Chung, A. 
Zisserman, Voxceleb: a large-scale speaker identification dataset (2017). http:\/\/arxiv.org\/abs\/1706.08612","DOI":"10.21437\/Interspeech.2017-950"},{"key":"317_CR34","doi-asserted-by":"crossref","unstructured":"J.S. Chung, A. Nagrani, A. Zisserman, Voxceleb2: deep speaker recognition (2018). http:\/\/arxiv.org\/abs\/1806.05622","DOI":"10.21437\/Interspeech.2018-1929"},{"key":"317_CR35","doi-asserted-by":"crossref","unstructured":"Y. Fan, J.W. Kang, L.T. Li, et al., Cn-celeb: a challenging Chinese speaker recognition dataset. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). (IEEE, Barcelona, 2020),\u00a0p. 7604\u20137608","DOI":"10.1109\/ICASSP40776.2020.9054017"},{"key":"317_CR36","doi-asserted-by":"crossref","unstructured":"V. Panayotov, G. Chen, D. Povey, et al., Librispeech: an ASR corpus based on public domain audio books. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). (IEEE, Brisbane, 2015),\u00a0p. 5206\u20135210","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"317_CR37","unstructured":"J. Du, X. Na, X. Liu, et al., Aishell-2: transforming mandarin ASR research into industrial scale (2018). http:\/\/arxiv.org\/abs\/1808.10583"},{"key":"317_CR38","unstructured":"D. Snyder, G. Chen, D. Povey, Musan: a music, speech, and noise corpus (2015). http:\/\/arxiv.org\/abs\/1510.08484"},{"key":"317_CR39","doi-asserted-by":"crossref","unstructured":"A.W. Rix, J.G. Beerends, M.P. Hollier, et al., Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). (IEEE, Salt Lake City, 2001),\u00a0p. 749\u2013752","DOI":"10.1109\/ICASSP.2001.941023"},{"key":"317_CR40","doi-asserted-by":"crossref","unstructured":"C.H. Taal, R.C. Hendriks, R. Heusdens, et al., A short-time objective intelligibility measure for time-frequency weighted noisy speech. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). 
(IEEE, Dallas, 2010),\u00a0p. 4214\u20134217","DOI":"10.1109\/ICASSP.2010.5495701"},{"key":"317_CR41","doi-asserted-by":"crossref","unstructured":"A. Gray, J. Markel, Distance measures for speech processing. IEEE Trans. Audio Speech Lang. Process. 24(5), 380\u2013391 (1976)","DOI":"10.1109\/TASSP.1976.1162849"},{"key":"317_CR42","doi-asserted-by":"crossref","unstructured":"R.C. Streijl, S. Winkler, D.S. Hands, Mean Opinion Score (MOS) revisited: methods and applications, limitations and alternatives. Multimedia Syst. 22(2), 213\u2013227 (2016)","DOI":"10.1007\/s00530-014-0446-1"},{"key":"317_CR43","doi-asserted-by":"crossref","unstructured":"L. Wan, Q. Wang, A. Papir, et al., Generalized end-to-end loss for speaker verification. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP). (IEEE, Calgary, 2018),\u00a0p. 4879\u20134883","DOI":"10.1109\/ICASSP.2018.8462665"}],"container-title":["EURASIP Journal on Audio, Speech, and Music Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-023-00317-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13636-023-00317-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13636-023-00317-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,5]],"date-time":"2024-11-05T08:45:57Z","timestamp":1730796357000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmp-eurasipjournals.springeropen.com\/articles\/10.1186\/s13636-023-00317-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,6]]},"references-count":43,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["317"],"URL":"https:\
/\/doi.org\/10.1186\/s13636-023-00317-3","relation":{},"ISSN":["1687-4722"],"issn-type":[{"type":"electronic","value":"1687-4722"}],"subject":[],"published":{"date-parts":[[2023,12,6]]},"assertion":[{"value":"14 June 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 November 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 December 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"53"}}