{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T22:09:56Z","timestamp":1740175796453,"version":"3.37.3"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2022,1,4]],"date-time":"2022-01-04T00:00:00Z","timestamp":1641254400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,1,4]],"date-time":"2022-01-04T00:00:00Z","timestamp":1641254400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001691","name":"Japan Society for the Promotion of Science","doi-asserted-by":"publisher","award":["JP19K12162"],"award-info":[{"award-number":["JP19K12162"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2022,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Voice adaptation is an interactive speech processing technique that allows a speaker to speak with a chosen target voice. We propose a novel method intended for dynamic scenarios, such as online video games, where the source speaker\u2019s and target speaker\u2019s data are nonaligned. This could substantially improve immersion and experience by letting players fully become a character, and it addresses privacy concerns by disguising the voice to protect against harassment. With unaligned data, traditional methods such as probabilistic models become inaccurate, while recent methods such as deep neural networks (DNNs) require substantial preparation work. Common methods also require multiple subjects to be trained in parallel, which constrains practicality in production environments. 
Our proposal trains a subject nonparallel into a voice profile that can be applied to any unknown source speaker. Prosodic data such as pitch, power and temporal structure are encoded into RGBA-colored frames, which are used in a multi-objective optimization problem to adjust interrelated features based on color likeness. Finally, frames are smoothed and adjusted before output. The method was evaluated using Mean Opinion Score, ABX, MUSHRA, Single Ease Questions and performance benchmarks with two voice profiles of varying sizes, followed by a discussion of game implementation. Results show improved adaptation quality, especially with the larger voice profile, and audiences are positive about using such technology in future games.<\/jats:p>","DOI":"10.1007\/s40747-021-00604-6","type":"journal-article","created":{"date-parts":[[2022,1,4]],"date-time":"2022-01-04T07:03:04Z","timestamp":1641279784000},"page":"1539-1550","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Voice adaptation by color-encoded frame matching as a multi-objective optimization problem for future games"],"prefix":"10.1007","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4303-1697","authenticated-orcid":false,"given":"Mads","family":"Midtlyng","sequence":"first","affiliation":[]},{"given":"Yuji","family":"Sato","sequence":"additional","affiliation":[]},{"given":"Hiroshi","family":"Hosobe","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,1,4]]},"reference":[{"key":"604_CR1","doi-asserted-by":"crossref","unstructured":"Stylianou Y (2009) Voice transformation: a survey. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, pp 3585\u20133588","DOI":"10.1109\/ICASSP.2009.4960401"},{"key":"604_CR2","doi-asserted-by":"crossref","unstructured":"Erro D, Moreno A (2007) Weighted frequency warping for voice conversion. 
In: 8th Annual Conference of the International Speech Communication Association INTERSPEECH, Antwerp, pp 1965\u20131968","DOI":"10.21437\/Interspeech.2007-550"},{"key":"604_CR3","first-page":"285","volume":"1","author":"Y Stylianou","year":"1998","unstructured":"Stylianou Y, Capp\u00e9 O, Moulines E (1998) Continuous probabilistic transform for voice conversion. IEEE Trans Speech Audio Process 1:285\u2013288","journal-title":"IEEE Trans Speech Audio Process"},{"key":"604_CR4","doi-asserted-by":"crossref","unstructured":"Toda T, Saruwatari H, Shikano K (2001) Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum. In: Proc. ICASSP, pp 841\u2013844","DOI":"10.1109\/ICASSP.2001.941046"},{"issue":"2","key":"604_CR5","doi-asserted-by":"publisher","first-page":"125","DOI":"10.1016\/0167-6393(95)90054-3","volume":"16","author":"E Moulines","year":"1995","unstructured":"Moulines E, Sagisaka Y (1995) Voice conversion: state of the art and perspectives. Speech Commun 16(2):125\u2013126 (Special Issue)","journal-title":"Speech Commun"},{"key":"604_CR6","doi-asserted-by":"crossref","unstructured":"Kain A, Macon MW (1998) Spectral voice conversion for text-to-speech synthesis. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seattle, pp 285\u2013299","DOI":"10.1109\/ICASSP.1998.674423"},{"issue":"4","key":"604_CR7","doi-asserted-by":"publisher","first-page":"1301","DOI":"10.1109\/TSA.2005.860839","volume":"14","author":"H Ye","year":"2006","unstructured":"Ye H, Young S (2006) Quality-enhanced voice morphing using maximum likelihood transformations. IEEE Trans Audio Speech Lang Process 14(4):1301\u20131312","journal-title":"IEEE Trans Audio Speech Lang Process"},{"key":"604_CR8","doi-asserted-by":"crossref","unstructured":"Chen Y, Chu M, Chang E, Liu J, Liu R (2003) Voice conversion with smoothed GMM and map adaptation. 
In: 8th European Conference on Speech Communication and Technology (Eurospeech 2003\u2014Interspeech 2003), Geneva, pp 2413\u20132416","DOI":"10.21437\/Eurospeech.2003-664"},{"issue":"10","key":"604_CR9","doi-asserted-by":"publisher","first-page":"1506","DOI":"10.1109\/TASLP.2014.2333242","volume":"22","author":"Z Wu","year":"2014","unstructured":"Wu Z, Virtanen T, Chng ES, Li H (2014) Exemplar-based sparse representation with residual compensation for voice conversion. IEEE Trans Audio Speech Lang Process 22(10):1506\u20131521","journal-title":"IEEE Trans Audio Speech Lang Process"},{"key":"604_CR10","doi-asserted-by":"crossref","unstructured":"Takashima R, Takiguchi T, Ariki Y (2012) Exemplar-based voice conversion in noisy environment. In: IEEE Spoken Language Technology Workshop (SLT), Miami, pp 313\u2013317","DOI":"10.1109\/SLT.2012.6424242"},{"key":"604_CR11","doi-asserted-by":"publisher","first-page":"1859","DOI":"10.1109\/TASLP.2014.2353991","volume":"22","author":"F Villavicencio","year":"2014","unstructured":"Villavicencio F, Bonada J (2014) Voice conversion using deep neural networks with layer-wise generative training. IEEE\/ACM Trans Audio Speech Lang Process (TASLP) J 22:1859\u20131872","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process (TASLP) J"},{"key":"604_CR12","doi-asserted-by":"publisher","first-page":"181","DOI":"10.1016\/j.asoc.2004.06.005","volume":"5","author":"Y Sato","year":"2004","unstructured":"Sato Y (2004) Voice quality conversion using interactive evolution of prosodic control. Appl Soft Comput J 5:181\u2013192","journal-title":"Appl Soft Comput J"},{"key":"604_CR13","doi-asserted-by":"crossref","unstructured":"Abe M, Nakamura S, Shikano K, Kuwabara H (1988) Voice conversion through vector quantization. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 
1, New York, pp 655\u2013658","DOI":"10.1109\/ICASSP.1988.196671"},{"key":"604_CR14","doi-asserted-by":"crossref","unstructured":"Villavicencio F, Bonada J (2010) Applying voice conversion to concatenative singing-voice synthesis. In: 11th Annual Conference of the International Speech Communication Association (INTERSPEECH), Chiba, pp 2162\u20132165","DOI":"10.21437\/Interspeech.2010-596"},{"key":"604_CR15","doi-asserted-by":"crossref","unstructured":"Fang F, Yamagishi J, Echizen I, Lorenzo-Trueba J (2018) High-quality nonparallel voice conversion based on cycle-consistent adversarial network. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, pp 5279\u20135283","DOI":"10.1109\/ICASSP.2018.8462342"},{"key":"604_CR16","doi-asserted-by":"crossref","unstructured":"Kaneko T, Kameoka H, Tanaka K, Hojo N (2019) CycleGAN-VC2: improved CycleGAN-based non-parallel voice conversion. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, pp 6820\u20136824","DOI":"10.1109\/ICASSP.2019.8682897"},{"key":"604_CR17","doi-asserted-by":"crossref","unstructured":"Hsu C-C, Hwang H-T, Wu Y-C, Tsao Y, Wang H-M (2016) Voice conversion from non-parallel corpora using variational auto-encoder. In: Asia\u2013Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Jeju, pp 1\u20136","DOI":"10.1109\/APSIPA.2016.7820786"},{"key":"604_CR18","doi-asserted-by":"crossref","unstructured":"Lorenzo-Trueba J, Yamagishi J, Toda T, Saito D, Villavicencio F, Kinnunen T et al. (2018) The voice conversion challenge 2018: promoting development of parallel and nonparallel methods, Odyssey 2018","DOI":"10.21437\/Odyssey.2018-28"},{"key":"604_CR19","doi-asserted-by":"crossref","unstructured":"Liu L, Ling Z, Jiang Y, Zhou M, Dai L (2018) WaveNet Vocoder with limited training data for voice conversion. In: Proc. 
Interspeech 2018, pp 1983\u20131987","DOI":"10.21437\/Interspeech.2018-1190"},{"key":"604_CR20","doi-asserted-by":"crossref","unstructured":"Midtlyng M, Sato Y (2016) Real-time voice adaptation with abstract normalization and sound-indexed based search. In: IEEE International Conference on Systems, Man, and Cybernetics (SMC), Budapest, pp 60\u201365","DOI":"10.1109\/SMC.2016.7844220"},{"key":"604_CR21","doi-asserted-by":"crossref","unstructured":"Midtlyng M, Sato Y (2018) Voice adaptation from mean dataset voice profile with dynamic power. In: IEEE International Conference on Systems, Man, and Cybernetics (SMC), Shizuoka, pp 2037\u20132042","DOI":"10.1109\/SMC.2018.00351"},{"key":"604_CR22","doi-asserted-by":"crossref","unstructured":"Midtlyng M, Sato Y (2020) Lightweight multi-objective voice adaptation for real-time speech interaction applied in games. In: IEEE Conference on Games (CoG), Osaka, pp 237\u2013243","DOI":"10.1109\/CoG47356.2020.9231643"},{"issue":"6","key":"604_CR23","doi-asserted-by":"publisher","first-page":"712","DOI":"10.1109\/TEVC.2007.892759","volume":"11","author":"Q Zhang","year":"2007","unstructured":"Zhang Q, Li H (2007) MOEA\/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans Evol Comput 11(6):712\u2013731","journal-title":"IEEE Trans Evol Comput"},{"key":"604_CR24","doi-asserted-by":"publisher","first-page":"314","DOI":"10.1016\/j.chb.2013.07.014","volume":"33","author":"J Fox","year":"2014","unstructured":"Fox J, Tang WY (2014) Sexism in online video games: the role of conformity to masculine norms and social dominance orientation. Comput Hum Behav 33:314\u2013320","journal-title":"Comput Hum Behav"},{"key":"604_CR25","unstructured":"Rideout V (2015) The common sense census: media use by tweens and teens. 
Analysis & Policy Observatory, Common Sense Media"},{"key":"604_CR26","doi-asserted-by":"crossref","unstructured":"Sekii Y, Orihara R, Kojima K, Sei Y, Tahara Y, Ohsuga A (2017) Fast many-to-one voice conversion using autoencoders. In: International Conference on Agents and Artificial Intelligence (ICAART), Porto, pp 164\u2013174","DOI":"10.5220\/0006193301640174"},{"key":"604_CR27","doi-asserted-by":"crossref","unstructured":"Kotani G, Saito D, Minematsu N (2017) Voice conversion based on deep neural networks for time-variant linear transformations. In: Asia\u2013Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, pp 1259\u20131262","DOI":"10.1109\/APSIPA.2017.8282216"},{"key":"604_CR28","doi-asserted-by":"crossref","unstructured":"Tamura M, Morita M, Kagoshima T, Akamine M (2011) One sentence voice adaptation using GMM-based frequency warping and shift with a sub-band basis spectrum model. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, pp 5124\u20135127","DOI":"10.1109\/ICASSP.2011.5947510"},{"key":"604_CR29","doi-asserted-by":"crossref","unstructured":"Li Y, Lee KA, Yuan Y, Li H, Yang Z (2018) Many-to-many voice conversion based on bottleneck features with variational autoencoder for non-parallel training data. In: Asia\u2013Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Hawaii, pp 829\u2013833","DOI":"10.23919\/APSIPA.2018.8659628"},{"issue":"3","key":"604_CR30","doi-asserted-by":"publisher","first-page":"556","DOI":"10.1109\/TASL.2012.2227735","volume":"21","author":"D Erro","year":"2013","unstructured":"Erro D, Navas E, Hern\u00e1ez I (2013) Parametric voice conversion based on bilinear frequency warping plus amplitude scaling. 
IEEE Trans Audio Speech Lang Process 21(3):556\u2013566","journal-title":"IEEE Trans Audio Speech Lang Process"},{"key":"604_CR31","doi-asserted-by":"publisher","first-page":"1290","DOI":"10.1109\/TASLP.2021.3066047","volume":"29","author":"M Zhang","year":"2021","unstructured":"Zhang M, Zhou Y, Zhao L, Li H (2021) Transfer learning from speech synthesis to voice conversion with non-parallel training data. IEEE\/ACM Trans Audio Speech Lang Process 29:1290\u20131302","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"604_CR32","doi-asserted-by":"publisher","first-page":"745","DOI":"10.1109\/TASLP.2021.3049336","volume":"29","author":"W-C Huang","year":"2021","unstructured":"Huang W-C, Hayashi T, Wu Y-C, Kameoka H, Toda T (2021) Pretraining techniques for sequence-to-sequence voice conversion. IEEE\/ACM Trans Audio Speech Lang Process 29:745\u2013755","journal-title":"IEEE\/ACM Trans Audio Speech Lang Process"},{"key":"604_CR33","doi-asserted-by":"crossref","unstructured":"Zhou K, Sisman B, Liu R, Li H (2021) Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset. In: ICASSP 2021\u20142021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp 920\u2013924","DOI":"10.1109\/ICASSP39728.2021.9413391"},{"key":"604_CR34","unstructured":"Microsoft .NET5 SDK (2020) [Online]. Available: https:\/\/dotnet.microsoft.com\/download\/dotnet\/current"},{"key":"604_CR35","unstructured":"Microsoft WebView2 web rendering (2020) [Online]. Available: https:\/\/docs.microsoft.com\/en-us\/microsoft-edge\/webview2\/"},{"key":"604_CR36","unstructured":"Unity3D, Unity Technologies. Accessed on: January 1, 2019 [Online]. 
Available: https:\/\/unity.com\/"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-021-00604-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-021-00604-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-021-00604-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,21]],"date-time":"2023-01-21T18:44:10Z","timestamp":1674326650000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-021-00604-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,4]]},"references-count":36,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,4]]}},"alternative-id":["604"],"URL":"https:\/\/doi.org\/10.1007\/s40747-021-00604-6","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"type":"print","value":"2199-4536"},{"type":"electronic","value":"2198-6053"}],"subject":[],"published":{"date-parts":[[2022,1,4]]},"assertion":[{"value":"17 April 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 November 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 January 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"On behalf of all authors, the corresponding author wishes to confirm that there are no known conflicts of interest associated with this publication and there 
has been no significant financial support for this work that could have influenced its outcome.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Consent was obtained from the participants who aided the study that resulted in the presented evaluation data. The corresponding author is the primary researcher and responsible for the technical implementation, while research supervision is attributed to the 2<sup>nd<\/sup> author and manuscript quality assurance to the 3<sup>rd<\/sup> author.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Informed consent"}}]}}