{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T15:29:22Z","timestamp":1772119762442,"version":"3.50.1"},"reference-count":54,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T00:00:00Z","timestamp":1715126400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T00:00:00Z","timestamp":1715126400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Process Lett"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Voice conversion (VC) is a task for changing the speech of a source speaker to the target voice while preserving linguistic information of the source speech. The existing VC methods typically use mel-spectrogram as both input and output, so a separate vocoder is required to transform mel-spectrogram into waveform. Therefore, the VC performance varies depending on the vocoder performance, and noisy speech can be generated due to problems such as train-test mismatch. In this paper, we propose a speech and fundamental frequency consistent raw audio voice conversion method called WaveVC. Unlike other methods, WaveVC does not require a separate vocoder and can perform VC directly on raw audio waveform using 1D convolution. This eliminates the issue of performance degradation caused by the train-test mismatch of the vocoder. In the training phase, WaveVC employs speech loss and F0 loss to preserve the content of the source speech and generate F0 consistent speech using the pre-trained networks. WaveVC is capable of converting voices while maintaining consistency in speech and fundamental frequency. 
In the test phase, the F0 feature of the source speech is concatenated with a content embedding vector to ensure the converted speech follows the fundamental frequency flow of the source speech. WaveVC achieves higher performance than baseline methods in both many-to-many VC and any-to-any VC. The converted samples are available online.<\/jats:p>","DOI":"10.1007\/s11063-024-11613-0","type":"journal-article","created":{"date-parts":[[2024,5,7]],"date-time":"2024-05-07T22:01:38Z","timestamp":1715119298000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["WaveVC: Speech and Fundamental Frequency Consistent Raw Audio Voice Conversion"],"prefix":"10.1007","volume":"56","author":[{"given":"Kyungdeuk","family":"Ko","sequence":"first","affiliation":[]},{"given":"Donghyeon","family":"Kim","sequence":"additional","affiliation":[]},{"given":"Kyungseok","family":"Oh","sequence":"additional","affiliation":[]},{"given":"Hanseok","family":"Ko","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,5,8]]},"reference":[{"key":"11613_CR1","unstructured":"Li Y, Jin Y, Kwak J, Yoon D, Han D, Ko H (2021) Adaptive content feature enhancement gan for multimodal selfie to anime translation"},{"issue":"1","key":"11613_CR2","doi-asserted-by":"publisher","first-page":"22","DOI":"10.1109\/TCE.2005.1405694","volume":"51","author":"S Ahn","year":"2005","unstructured":"Ahn S, Ko H (2005) Background noise reduction via dual-channel scheme for speech recognition in vehicular environment. IEEE Trans Consum Electron 51(1):22\u201327","journal-title":"IEEE Trans Consum Electron"},{"key":"11613_CR3","doi-asserted-by":"crossref","unstructured":"Kim G, Han DK, Ko H (2021) Specmix: a mixed sample data augmentation method for training with time-frequency domain features. 
arXiv preprint arXiv:2108.03020","DOI":"10.31219\/osf.io\/ubcft"},{"key":"11613_CR4","doi-asserted-by":"crossref","unstructured":"Nachmani E, Wolf L (2019) Unsupervised singing voice conversion. arXiv preprint arXiv:1904.06590","DOI":"10.21437\/Interspeech.2019-1761"},{"issue":"1","key":"11613_CR5","doi-asserted-by":"publisher","first-page":"134","DOI":"10.1016\/j.specom.2011.07.007","volume":"54","author":"K Nakamura","year":"2012","unstructured":"Nakamura K, Toda T, Saruwatari H, Shikano K (2012) Speaking-aid systems using gmm-based voice conversion for electrolaryngeal speech. Speech Commun 54(1):134\u2013146","journal-title":"Speech Commun"},{"key":"11613_CR6","unstructured":"Qian K, Zhang Y, Chang S, Yang X, Hasegawa-Johnson M (2019) Autovc: zero-shot voice style transfer with only autoencoder loss. In: International conference on machine learning. PMLR, pp 5210\u20135219"},{"key":"11613_CR7","doi-asserted-by":"crossref","unstructured":"Chou J, Yeh C, Lee H (2019) One-shot voice conversion by separating speaker and content representations with instance normalization. arXiv preprint arXiv:1904.05742","DOI":"10.21437\/Interspeech.2019-2663"},{"key":"11613_CR8","doi-asserted-by":"crossref","unstructured":"Qian K, Jin Z, Hasegawa-Johnson M, Mysore GJ (2020) F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder. In: ICASSP 2020\u20142020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6284\u20136288","DOI":"10.1109\/ICASSP40776.2020.9054734"},{"key":"11613_CR9","doi-asserted-by":"crossref","unstructured":"Wu D-Y, Lee H (2020) One-shot voice conversion by vector quantization. In: ICASSP 2020\u20142020 IEEE international conference on acoustics, speech and signal processing (ICASSP). 
IEEE, pp 7734\u20137738","DOI":"10.1109\/ICASSP40776.2020.9053854"},{"key":"11613_CR10","doi-asserted-by":"crossref","unstructured":"Wu D-Y, Chen Y-H, Lee H-Y (2020) Vqvc+: one-shot voice conversion by vector quantization and u-net architecture. arXiv preprint arXiv:2006.04154","DOI":"10.21437\/Interspeech.2020-1443"},{"key":"11613_CR11","doi-asserted-by":"crossref","unstructured":"Wang D, Deng L, Yeung YT, Chen X, Liu X, Meng H (2021) Vqmivc: vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion. arXiv preprint arXiv:2106.10132","DOI":"10.21437\/Interspeech.2021-283"},{"issue":"11","key":"11613_CR12","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1145\/3422622","volume":"63","author":"I Goodfellow","year":"2020","unstructured":"Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2020) Generative adversarial networks. Commun ACM 63(11):139\u2013144","journal-title":"Commun ACM"},{"key":"11613_CR13","unstructured":"Mun S, Park S, Han DK, Ko H (2017) Generative adversarial network based acoustic scene training set augmentation and selection using svm hyper-plane. In: DCASE, pp 93\u2013102"},{"key":"11613_CR14","doi-asserted-by":"crossref","unstructured":"Choi Y, Choi M, Kim M, Ha J-W, Kim S, Choo J (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8789\u20138797","DOI":"10.1109\/CVPR.2018.00916"},{"key":"11613_CR15","doi-asserted-by":"crossref","unstructured":"Kameoka H, Kaneko T, Tanaka K, Hojo N (2018) Stargan-vc: non-parallel many-to-many voice conversion using star generative adversarial networks. In: 2018 IEEE spoken language technology workshop (SLT). 
IEEE, pp 266\u2013273","DOI":"10.1109\/SLT.2018.8639535"},{"key":"11613_CR16","doi-asserted-by":"crossref","unstructured":"Kaneko T, Kameoka H, Tanaka K, Hojo N (2019) Stargan-vc2: rethinking conditional methods for stargan-based voice conversion. arXiv preprint arXiv:1907.12279","DOI":"10.21437\/Interspeech.2019-2236"},{"key":"11613_CR17","doi-asserted-by":"crossref","unstructured":"Li YA, Zare A, Mesgarani N (2021) Starganv2-vc: a diverse, unsupervised, non-parallel framework for natural-sounding voice conversion. arXiv preprint arXiv:2107.10394","DOI":"10.21437\/Interspeech.2021-319"},{"key":"11613_CR18","first-page":"66","volume":"32","author":"K Kumar","year":"2019","unstructured":"Kumar K, Kumar R, Boissiere T, Gestin L, Teoh WZ, Sotelo J, Br\u00e9bisson A, Bengio Y, Courville AC (2019) Melgan: generative adversarial networks for conditional waveform synthesis. Adv Neural Inf Process Syst 32:66","journal-title":"Adv Neural Inf Process Syst"},{"key":"11613_CR19","doi-asserted-by":"crossref","unstructured":"Yamamoto R, Song E, Kim J-M (2020) Parallel wavegan: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In: ICASSP 2020\u20142020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6199\u20136203","DOI":"10.1109\/ICASSP40776.2020.9053795"},{"key":"11613_CR20","first-page":"17022","volume":"33","author":"J Kong","year":"2020","unstructured":"Kong J, Kim J, Bae J (2020) Hifi-gan: generative adversarial networks for efficient and high fidelity speech synthesis. Adv Neural Inf Process Syst 33:17022\u201317033","journal-title":"Adv Neural Inf Process Syst"},{"key":"11613_CR21","doi-asserted-by":"crossref","unstructured":"Wu Y-C, Kobayashi K, Hayashi T, Tobing PL, Toda T (2018) Collapsed speech segment detection and suppression for wavenet vocoder. 
arXiv preprint arXiv:1804.11055","DOI":"10.21437\/Interspeech.2018-1210"},{"key":"11613_CR22","unstructured":"Oord Avd, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499"},{"key":"11613_CR23","unstructured":"Ar\u0131k S\u00d6, Chrzanowski M, Coates A, Diamos G, Gibiansky A, Kang Y, Li X, Miller J, Ng A, Raiman J (2017) Deep voice: real-time neural text-to-speech. In: International conference on machine learning. PMLR, pp 195\u2013204"},{"key":"11613_CR24","doi-asserted-by":"crossref","unstructured":"Wang Y, Skerry-Ryan R, Stanton D, Wu Y, Weiss RJ, Jaitly N, Yang Z, Xiao Y, Chen Z, Bengio S et al (2017) Tacotron: towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135","DOI":"10.21437\/Interspeech.2017-1452"},{"key":"11613_CR25","first-page":"66","volume":"30","author":"A Gibiansky","year":"2017","unstructured":"Gibiansky A, Arik S, Diamos G, Miller J, Peng K, Ping W, Raiman J, Zhou Y (2017) Deep voice 2: multi-speaker neural text-to-speech. Adv Neural Inf Process Syst 30:66","journal-title":"Adv Neural Inf Process Syst"},{"key":"11613_CR26","doi-asserted-by":"crossref","unstructured":"Zhang Y-J, Pan S, He L, Ling Z-H (2019) Learning latent representations for style control and transfer in end-to-end speech synthesis. In: ICASSP 2019\u20142019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6945\u20136949","DOI":"10.1109\/ICASSP.2019.8683623"},{"key":"11613_CR27","unstructured":"Kingma DP, Welling M (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114"},{"key":"11613_CR28","unstructured":"Ping W, Peng K, Chen J (2018) Clarinet: parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281"},{"key":"11613_CR29","unstructured":"Ren Y, Hu C, Tan X, Qin T, Zhao S, Zhao Z, Liu T-Y (2020) Fastspeech 2: fast and high-quality end-to-end text to speech. 
arXiv preprint arXiv:2006.04558"},{"key":"11613_CR30","unstructured":"Donahue J, Dieleman S, Bi\u0144kowski M, Elsen E, Simonyan K (2020) End-to-end adversarial text-to-speech. arXiv preprint arXiv:2006.03575"},{"key":"11613_CR31","doi-asserted-by":"crossref","unstructured":"Huang X, Belongie S (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE international conference on computer vision, pp 1501\u20131510","DOI":"10.1109\/ICCV.2017.167"},{"key":"11613_CR32","unstructured":"Ko K, Lee B, Hong J, Han D, Ko H (2021) Deep degradation prior for real-world super-resolution. In: BMVC"},{"key":"11613_CR33","doi-asserted-by":"crossref","unstructured":"Chen Y-H, Wu D-Y, Wu T-H, Lee H (2021) Again-vc: a one-shot voice conversion using activation guidance and adaptive instance normalization. In: ICASSP 2021\u20142021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5954\u20135958","DOI":"10.1109\/ICASSP39728.2021.9414257"},{"key":"11613_CR34","doi-asserted-by":"crossref","unstructured":"Lin YY, Chien C-M, Lin J-H, Lee H, Lee L (2021) Fragmentvc: any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention. In: ICASSP 2021\u20142021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5939\u20135943","DOI":"10.1109\/ICASSP39728.2021.9413699"},{"key":"11613_CR35","first-page":"294","volume":"34","author":"S-H Lee","year":"2021","unstructured":"Lee S-H, Kim J-H, Chung H, Lee S-W (2021) Voicemixer: adversarial voice style mixup. Adv Neural Inf Process Syst 34:294\u2013308","journal-title":"Adv Neural Inf Process Syst"},{"key":"11613_CR36","doi-asserted-by":"crossref","unstructured":"Wang Q, Zhang X, Wang J, Cheng N, Xiao J (2022) Drvc: a framework of any-to-any voice conversion with self-supervised learning. 
In: ICASSP 2022\u20142022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 3184\u20133188","DOI":"10.1109\/ICASSP43922.2022.9747434"},{"key":"11613_CR37","doi-asserted-by":"crossref","unstructured":"Nguyen B, Cardinaux F (2022) Nvc-net: end-to-end adversarial voice conversion. In: ICASSP 2022\u20142022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7012\u20137016","DOI":"10.1109\/ICASSP43922.2022.9747020"},{"key":"11613_CR38","unstructured":"Hendrycks D, Gimpel K (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415"},{"issue":"7","key":"11613_CR39","doi-asserted-by":"publisher","first-page":"1324","DOI":"10.3390\/app9071324","volume":"9","author":"S Kum","year":"2019","unstructured":"Kum S, Nam J (2019) Joint detection and classification of singing voice melody using convolutional recurrent neural networks. Appl Sci 9(7):1324","journal-title":"Appl Sci"},{"key":"11613_CR40","doi-asserted-by":"crossref","unstructured":"Kim S, Hori T, Watanabe S (2017) Joint ctc-attention based end-to-end speech recognition using multi-task learning. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4835\u20134839","DOI":"10.1109\/ICASSP.2017.7953075"},{"key":"11613_CR41","doi-asserted-by":"publisher","first-page":"1943","DOI":"10.1109\/LSP.2022.3205275","volume":"29","author":"B Lee","year":"2022","unstructured":"Lee B, Ko K, Hong J, Ku B, Ko H (2022) Information bottleneck measurement for compressed sensing image reconstruction. IEEE Signal Process Lett 29:1943\u20131947","journal-title":"IEEE Signal Process Lett"},{"key":"11613_CR42","doi-asserted-by":"publisher","unstructured":"Yamagishi J, Veaux C, MacDonald K (2019) CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92). University of Edinburgh. The Centre for Speech Technology Research (CSTR). 
https:\/\/doi.org\/10.7488\/ds\/2645","DOI":"10.7488\/ds\/2645"},{"issue":"7","key":"11613_CR43","doi-asserted-by":"publisher","first-page":"1877","DOI":"10.1587\/transinf.2015EDP7457","volume":"99","author":"M Morise","year":"2016","unstructured":"Morise M, Yokomori F, Ozawa K (2016) World: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans Inf Syst 99(7):1877\u20131884","journal-title":"IEICE Trans Inf Syst"},{"key":"11613_CR44","doi-asserted-by":"crossref","unstructured":"Park HJ, Yang SW, Kim JS, Shin W, Han SW (2023) Triaan-vc: triple adaptive attention normalization for any-to-any voice conversion. In: ICASSP 2023\u20142023 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 1\u20135","DOI":"10.1109\/ICASSP49357.2023.10096642"},{"key":"11613_CR45","unstructured":"Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in pytorch"},{"key":"11613_CR46","doi-asserted-by":"crossref","unstructured":"Leng Y, Tan X, Zhao S, Soong F, Li X-Y, Qin T (2021) Mbnet: Mos prediction for synthesized speech with mean-bias network. In: ICASSP 2021\u20142021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 391\u2013395","DOI":"10.1109\/ICASSP39728.2021.9413877"},{"key":"11613_CR47","first-page":"12449","volume":"33","author":"A Baevski","year":"2020","unstructured":"Baevski A, Zhou Y, Mohamed A, Auli M (2020) wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv Neural Inf Process Syst 33:12449\u201312460","journal-title":"Adv Neural Inf Process Syst"},{"key":"11613_CR48","doi-asserted-by":"crossref","unstructured":"Baevski A, Auli M, Mohamed A (2019) Effectiveness of self-supervised pre-training for speech recognition. 
arXiv preprint arXiv:1911.03912","DOI":"10.1109\/ICASSP40776.2020.9054224"},{"key":"11613_CR49","doi-asserted-by":"crossref","unstructured":"Koluguri NR, Park T, Ginsburg B (2022) Titanet: neural model for speaker representation with 1d depth-wise separable convolutions and global context. In: ICASSP 2022\u20142022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 8102\u20138106","DOI":"10.1109\/ICASSP43922.2022.9746806"},{"key":"11613_CR50","doi-asserted-by":"crossref","unstructured":"Wan L, Wang Q, Papir A, Moreno IL (2018) Generalized end-to-end loss for speaker verification. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4879\u20134883","DOI":"10.1109\/ICASSP.2018.8462665"},{"key":"11613_CR51","doi-asserted-by":"crossref","unstructured":"Han W, Zhang Z, Zhang Y, Yu J, Chiu C-C, Qin J, Gulati A, Pang R, Wu Y (2020) Contextnet: improving convolutional neural networks for automatic speech recognition with global context. arXiv preprint arXiv:2005.03191","DOI":"10.21437\/Interspeech.2020-2059"},{"key":"11613_CR52","doi-asserted-by":"crossref","unstructured":"Deng J, Guo J, Xue N, Zafeiriou S (2019) Arcface: additive angular margin loss for deep face recognition. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 4690\u20134699","DOI":"10.1109\/CVPR.2019.00482"},{"key":"11613_CR53","doi-asserted-by":"crossref","unstructured":"Van\u00a0Niekerk B, Nortje L, Kamper H (2020) Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge. arXiv preprint arXiv:2005.09409","DOI":"10.21437\/Interspeech.2020-1693"},{"key":"11613_CR54","unstructured":"Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. 
arXiv preprint arXiv:1807.03748"}],"container-title":["Neural Processing Letters"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11613-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11063-024-11613-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11613-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,15]],"date-time":"2024-07-15T07:17:36Z","timestamp":1721027856000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11063-024-11613-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,8]]},"references-count":54,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2024,6]]}},"alternative-id":["11613"],"URL":"https:\/\/doi.org\/10.1007\/s11063-024-11613-0","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-3180016\/v1","asserted-by":"object"}]},"ISSN":["1573-773X"],"issn-type":[{"value":"1573-773X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,8]]},"assertion":[{"value":"6 April 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 May 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"166"}}