{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T22:09:23Z","timestamp":1740175763576,"version":"3.37.3"},"reference-count":38,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2024,2,29]],"date-time":"2024-02-29T00:00:00Z","timestamp":1709164800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,29]],"date-time":"2024-02-29T00:00:00Z","timestamp":1709164800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62071484"],"award-info":[{"award-number":["62071484"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004608","name":"Natural Science Foundation of Jiangsu Province","doi-asserted-by":"publisher","award":["BK20180080"],"award-info":[{"award-number":["BK20180080"]}],"id":[{"id":"10.13039\/501100004608","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62371469"],"award-info":[{"award-number":["62371469"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2024,6]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Background noises are usually treated as redundant or even harmful to voice conversion. Therefore, when converting noisy speech, a pretrained module of speech separation is usually deployed to estimate clean speech prior to the conversion. 
However, this can lead to speech distortion due to the mismatch between the separation module and the conversion one. In this paper, a noise-robust voice conversion model is proposed, where a user can choose to retain or to remove the background sounds freely. Firstly, a speech separation module with a dual-decoder structure is proposed, where two decoders decode the denoised speech and the background sounds, respectively. A bridge module is used to capture the interactions between the denoised speech and the background sounds in parallel layers through information exchange. Subsequently, a voice conversion module with multiple encoders is proposed to convert the estimated clean speech from the speech separation module. Finally, the speech separation and voice conversion modules are jointly trained using a loss function combining cycle loss and mutual information loss, aiming to improve the decoupling efficacy among speech contents, pitch, and speaker identity. Experimental results show that the proposed model obtains significant improvements in both subjective and objective evaluation metrics compared with the existing baselines. 
The speech naturalness and speaker similarity of the converted speech are 3.47 and 3.43, respectively.<\/jats:p>","DOI":"10.1007\/s40747-024-01375-6","type":"journal-article","created":{"date-parts":[[2024,2,29]],"date-time":"2024-02-29T10:02:33Z","timestamp":1709200953000},"page":"3981-3994","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["A noise-robust voice conversion method with controllable background sounds"],"prefix":"10.1007","volume":"10","author":[{"given":"Lele","family":"Chen","sequence":"first","affiliation":[]},{"given":"Xiongwei","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Yihao","family":"Li","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7435-3752","authenticated-orcid":false,"given":"Meng","family":"Sun","sequence":"additional","affiliation":[]},{"given":"Weiwei","family":"Chen","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,2,29]]},"reference":[{"key":"1375_CR1","doi-asserted-by":"publisher","first-page":"132","DOI":"10.1109\/TASLP.2020.3038524","volume":"29","author":"B Sisman","year":"2021","unstructured":"Sisman B, Yamagishi J, King S, Li H (2021) An overview of voice conversion and its challenges: from statistical modeling to deep learning. IEEE\/ACM Trans Audio Speech Lang Process 29:132\u2013157","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"1375_CR2","doi-asserted-by":"publisher","first-page":"2623","DOI":"10.1007\/s40747-022-00665-1","volume":"8","author":"A Singh","year":"2022","unstructured":"Singh A, Kaur N, Kukreja V (2022) Computational intelligence in processing of speech acoustics: a survey. Complex Intell Syst 8:2623\u20132661","journal-title":"Complex Intell Syst"},{"key":"1375_CR3","doi-asserted-by":"crossref","unstructured":"Mohammadi SH, Kain A (2017) An overview of voice conversion systems. 
Speech Commun Int J 88: 65\u201382","DOI":"10.1016\/j.specom.2017.01.008"},{"key":"1375_CR4","doi-asserted-by":"crossref","unstructured":"Liu F-k, Wang H, Ke Y-x, Zheng C-s (2022) One-shot voice conversion using a combination of U2-Net and vector quantization. Appl Acoustics 99: 109014","DOI":"10.1016\/j.apacoust.2022.109014"},{"key":"1375_CR5","doi-asserted-by":"publisher","DOI":"10.1016\/j.dsp.2020.102951","volume":"110","author":"M-S Fahad","year":"2021","unstructured":"Fahad M-S, Ranjan A, Yadav J, Deepak A (2021) A survey of speech emotion recognition in natural environment. Digital Signal Processing 110:102951","journal-title":"Digital Signal Processing"},{"key":"1375_CR6","doi-asserted-by":"publisher","first-page":"65","DOI":"10.1007\/s40747-022-00782-x","volume":"9","author":"X Zhang","year":"2023","unstructured":"Zhang X, Zhang X, Sun M (2023) Imperceptible black-box waveform-level adversarial attack towards automatic speaker recognition. Complex Intell Syst 9:65\u201379","journal-title":"Complex Intell Syst"},{"key":"1375_CR7","doi-asserted-by":"crossref","unstructured":"Ram SR, Kumar V M, Subramanian B, Bacanin N, Zivkovic M, Strumberger I (2020) Speech enhancement through improvised conditional generative adversarial networks. Microprocessors Microsyst 79: 103281","DOI":"10.1016\/j.micpro.2020.103281"},{"key":"1375_CR8","doi-asserted-by":"publisher","first-page":"1700","DOI":"10.1109\/LSP.2020.3025020","volume":"27","author":"H Phan","year":"2020","unstructured":"Phan H et al (2020) Improving GANs for Speech Enhancement. IEEE Signal Process Lett 27:1700\u20131704. https:\/\/doi.org\/10.1109\/LSP.2020.3025020","journal-title":"IEEE Signal Process Lett"},{"key":"1375_CR9","doi-asserted-by":"publisher","unstructured":"Wang C, Yu Y-B CycleGAN-VC-GP: Improved CycleGAN-based Non-parallel Voice Conversion. 
In: 2020 IEEE 20th International Conference on Communication Technology (ICCT), Nanning, China, 2020, 1281\u20131284, https:\/\/doi.org\/10.1109\/ICCT50939.2020.9295938.","DOI":"10.1109\/ICCT50939.2020.9295938"},{"key":"1375_CR10","doi-asserted-by":"publisher","unstructured":"Yu X, Mak B Non-parallel many-to-many voice conversion by knowledge transfer from a text-to-speech model. In: ICASSP 2021\u20142021 IEEE international conference on acoustics, speech and signal processing (ICASSP), Toronto, ON, Canada, 2021, 5924\u20135928, https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9414757.","DOI":"10.1109\/ICASSP39728.2021.9414757"},{"issue":"5","key":"1375_CR11","doi-asserted-by":"publisher","first-page":"2489","DOI":"10.1109\/JBHI.2023.3239551","volume":"27","author":"M Chu","year":"2023","unstructured":"Chu M et al (2023) E-DGAN: an encoder-decoder generative adversarial network based method for pathological to normal voice conversion. IEEE J Biomed Health Inform 27(5):2489\u20132500. https:\/\/doi.org\/10.1109\/JBHI.2023.3239551","journal-title":"IEEE J Biomed Health Inform"},{"key":"1375_CR12","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2023.110851","volume":"277","author":"H Kheddar","year":"2023","unstructured":"Kheddar H, Himeur Y, Al-Maadeed S, Amira A, Bensaali F (2023) Deep transfer learning for automatic speech recognition: towards better generalization. Knowl-Based Syst 277:110851","journal-title":"Knowl-Based Syst"},{"key":"1375_CR13","doi-asserted-by":"publisher","DOI":"10.1016\/j.dsp.2021.103110","volume":"116","author":"X Kang","year":"2021","unstructured":"Kang X, Huang H, Hu Y, Huang Z (2021) Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion. Digital Signal Processing 116:103110","journal-title":"Digital Signal Processing"},{"key":"1375_CR14","doi-asserted-by":"publisher","unstructured":"Wu D-Y, Lee H-y (2020) One-Shot Voice Conversion by Vector Quantization. 
In: ICASSP 2020\u20142020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, pp 7734\u20137738, https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9053854.","DOI":"10.1109\/ICASSP40776.2020.9053854"},{"key":"1375_CR15","doi-asserted-by":"publisher","unstructured":"Chen M, Shi Y, Hain T Towards Low-resource stargan voice conversion using weight adaptive instance normalization. In: ICASSP 2021\u20142021 IEEE international conference on acoustics, speech and signal processing (ICASSP), Toronto, ON, Canada, 2021, pp. 5949\u20135953, https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9415042.","DOI":"10.1109\/ICASSP39728.2021.9415042"},{"key":"1375_CR16","doi-asserted-by":"publisher","unstructured":"Ronssin D, Cernak M (2021) AC-VC: Non-parallel low latency phonetic posteriorgrams based voice conversion. In: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, pp. 710-716, https:\/\/doi.org\/10.1109\/ASRU51503.2021.9688277","DOI":"10.1109\/ASRU51503.2021.9688277"},{"issue":"4","key":"1375_CR17","doi-asserted-by":"publisher","first-page":"74","DOI":"10.1016\/j.neunet.2022.01.003","volume":"48","author":"H Du","year":"2022","unstructured":"Du H, Xie L, Li H (2022) Noise-robust voice conversion with domain adversarial training. Neural Netw 48(4):74\u201384","journal-title":"Neural Netw"},{"issue":"7","key":"1375_CR18","doi-asserted-by":"publisher","first-page":"1179","DOI":"10.1109\/TASLP.2019.2913512","volume":"27","author":"A Pandey","year":"2019","unstructured":"Pandey A, Wang D (2019) A New Framework for CNN-Based Speech Enhancement in the Time Domain. IEEE\/ACM Trans Audio Speech Lang Process 27(7):1179\u20131188. 
https:\/\/doi.org\/10.1109\/TASLP.2019.2913512","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"1375_CR19","doi-asserted-by":"publisher","unstructured":"Koizumi Y, Yatabe K, Delcroix M, Masuyama Y, Takeuchi D (2020) Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention, In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, pp 181\u2013185, https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9053214.","DOI":"10.1109\/ICASSP40776.2020.9053214"},{"key":"1375_CR20","unstructured":"Xie C, Wu Y-C, Tobing PL, Huang W-C, Toda T Noisy-to-Noisy Voice Conversion Framework with Denoising Model. In: 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, 2021, pp. 814-820"},{"key":"1375_CR21","doi-asserted-by":"publisher","unstructured":"Yao J et al. preserving background sound in noise-robust voice conversion via multi-task learning. In: ICASSP 2023\u20132023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1\u20135, https:\/\/doi.org\/10.1109\/ICASSP49357.2023.10095960.","DOI":"10.1109\/ICASSP49357.2023.10095960"},{"key":"1375_CR22","doi-asserted-by":"crossref","unstructured":"Hu Y, Liu Y, Lv S, Xing M, Zhang S, Fu Y, Wu J, Zhang B, Xie L DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. in INTERSPEECH 2020, 2020.","DOI":"10.21437\/Interspeech.2020-2537"},{"key":"1375_CR23","doi-asserted-by":"crossref","unstructured":"Chen B, Wang Y, Liu Z, Tang R, Guo W, Zheng H, Yao W, Zhang M, He X Enhancing explicit and implicit feature interactions via information sharing for parallel deep CTR models. 
The 30th ACM International Conference on Information and Knowledge Management, Virtual Event Queensland, Australia 2021, pp.3757\u20133766.","DOI":"10.1145\/3459637.3481915"},{"key":"1375_CR24","doi-asserted-by":"crossref","unstructured":"Wang D, Deng L, Yu TY, Chen X, Meng H (2021) VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion. In: 2021 21th Annual Conference of the International Speech Communication Association (INTERSPEECH), Brno, Czechia, March, pp 1344\u20131348.","DOI":"10.21437\/Interspeech.2021-283"},{"key":"1375_CR25","doi-asserted-by":"crossref","unstructured":"Reddy CK, Dubey H, Gopal V, Cutler R, Braun S, Gamper H, Aichner R, Srinivasan S (2021) Icassp 2021 deep noise suppression challenge. In: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, , pp. 6623\u20136627.","DOI":"10.21437\/Interspeech.2021-1609"},{"key":"1375_CR26","doi-asserted-by":"publisher","unstructured":"Wu Y-H, Lin W-H, Huang S-H Low-power hardware implementation for parametric rectified linear unit function. 2020 IEEE International Conference on Consumer Electronics\u2014Taiwan (ICCE-Taiwan), Taoyuan, Taiwan, 2020, pp. 1\u20132, https:\/\/doi.org\/10.1109\/ICCE-Taiwan49838.2020.9258135.","DOI":"10.1109\/ICCE-Taiwan49838.2020.9258135"},{"key":"1375_CR27","unstructured":"Trabelsi C, Bilaniuk O, Zhang Y, Serdyuk D, Subramanian S, Santos JF, Mehri S, Rostamzadeh N, Bengio Y, Pal CJ Deep complex networks. arXiv preprint arXiv:1705.09792, 2017."},{"key":"1375_CR28","doi-asserted-by":"publisher","unstructured":"Kaneko T, Kameoka H CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks. 
In: 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 2018, 2100-2104, https:\/\/doi.org\/10.23919\/EUSIPCO.2018.8553236","DOI":"10.23919\/EUSIPCO.2018.8553236"},{"key":"1375_CR29","unstructured":"Rafii Z, Liutkus A, St\u00f6ter F-R, Mimilakis SI, Bittner R (2017) The MUSDB18 corpus for music separation"},{"key":"1375_CR30","volume-title":"Superseded-CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit","author":"C Veaux","year":"2016","unstructured":"Veaux C, Yamagishi J, MacDonald K (2016) Superseded-CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit. University of Edinburgh, The Centre for Speech Technology Research (CSTR)"},{"key":"1375_CR31","unstructured":"Qian K, Zhang Y, Chang S, Yang X, Hasegawa-Johnson M AutoVC: Zero-shot voice style transfer with only autoencoder loss. International Conference on Machine Learning (ICML 2019), Long Beach, California, June 2019, pp. 5210\u20135219."},{"key":"1375_CR32","doi-asserted-by":"crossref","unstructured":"Chou JC, Yeh CC, Lee HY One-shot voice conversion by separating speaker and content representations with instance normalization. In: Proc. 2019 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), Graz, Austria, Sept. 2019, pp.664\u2013668","DOI":"10.21437\/Interspeech.2019-2663"},{"key":"1375_CR33","doi-asserted-by":"crossref","unstructured":"Pascual S, Bonafonte A, Serr\u00e0 J (2017) SEGAN: Speech Enhancement Generative Adversarial Network. In: Proc. 2017 18th Conference of the International Speech Communication Association (INTERSPEECH), Stockholm, Sweden, pp. 3642\u20133646.","DOI":"10.21437\/Interspeech.2017-1428"},{"key":"1375_CR34","doi-asserted-by":"publisher","unstructured":"Naderi B, M\u00f6ller S Transformation of Mean Opinion Scores to Avoid Misleading of Ranked Based Statistical Techniques. 
In: 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), Athlone, Ireland, 2020, pp. 1\u20134, https:\/\/doi.org\/10.1109\/QoMEX48832.2020.9123078.","DOI":"10.1109\/QoMEX48832.2020.9123078"},{"key":"1375_CR35","doi-asserted-by":"crossref","unstructured":"Polyak A, Wolf L (2019) Attention-based wavenet autoencoder for universal voice conversion. In: Proc. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, , pp.6800\u20136804.","DOI":"10.1109\/ICASSP.2019.8682589"},{"key":"1375_CR36","doi-asserted-by":"crossref","unstructured":"Rix A, Beerends J, Hollier M, Hekstra A Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), vol. 2. IEEE, 2001, pp. 749\u2013752.","DOI":"10.1109\/ICASSP.2001.941023"},{"key":"1375_CR37","doi-asserted-by":"crossref","unstructured":"Taal CH, Richard HRCH, Jesper J (2011) An algorithm for intelligibility prediction of timefrequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7): 2125\u20132136","DOI":"10.1109\/TASL.2011.2114881"},{"issue":"3","key":"1375_CR38","doi-asserted-by":"publisher","first-page":"247","DOI":"10.1016\/0167-6393(93)90095-3","volume":"12","author":"A Varga","year":"1993","unstructured":"Varga A, Steeneken HJM (1993) Assessment for automatic speech recognition: II. noisex-92: a database and an experiment to study the effect of additive noise on speech recognition systems. 
Speech Commun 12(3):247\u2013251","journal-title":"Speech Commun"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01375-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-024-01375-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01375-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,13]],"date-time":"2024-11-13T15:01:55Z","timestamp":1731510115000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-024-01375-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,29]]},"references-count":38,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,6]]}},"alternative-id":["1375"],"URL":"https:\/\/doi.org\/10.1007\/s40747-024-01375-6","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"type":"print","value":"2199-4536"},{"type":"electronic","value":"2198-6053"}],"subject":[],"published":{"date-parts":[[2024,2,29]]},"assertion":[{"value":"25 September 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 January 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 February 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declaration"}},{"value":"The authors declare that they have no known competing financial interests or personal relationships that could have 
appeared to influence the work reported in this paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}