{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,3]],"date-time":"2025-08-03T04:32:22Z","timestamp":1754195542442},"reference-count":35,"publisher":"Institute of Electronics, Information and Communications Engineers (IEICE)","issue":"11","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IEICE Trans. Inf. &amp; Syst."],"published-print":{"date-parts":[[2020,11,1]]},"DOI":"10.1587\/transinf.2020edp7032","type":"journal-article","created":{"date-parts":[[2020,10,31]],"date-time":"2020-10-31T22:13:24Z","timestamp":1604182404000},"page":"2340-2350","source":"Crossref","is-referenced-by-count":2,"title":["Speech Chain VC: Linking Linguistic and Acoustic Levels via Latent Distinctive Features for RBM-Based Voice Conversion"],"prefix":"10.1587","volume":"E103.D","author":[{"given":"Takuya","family":"KISHIDA","sequence":"first","affiliation":[{"name":"Graduate School of Informatics and Engineering, The University of Electro-Communications"}]},{"given":"Toru","family":"NAKASHIKA","sequence":"additional","affiliation":[{"name":"Graduate School of Informatics and Engineering, The University of Electro-Communications"}]}],"member":"532","reference":[{"doi-asserted-by":"publisher","unstructured":"[1] Y. Saito, S. Takamichi, and H. Saruwatari, \u201cStatistical parametric speech synthesis incorporating generative adversarial networks,\u201d IEEE\/ACM Trans. Audio Speech Lang. Process., vol.26, no.1, pp.84-96, 2017. 10.1109\/taslp.2017.2761547","key":"1","DOI":"10.1109\/TASLP.2017.2761547"},{"doi-asserted-by":"crossref","unstructured":"[2] T. Kaneko and H. Kameoka, \u201cCycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks,\u201d Proc. EUSIPCO, pp.2100-2104, IEEE, 2018. 10.23919\/eusipco.2018.8553236","key":"2","DOI":"10.23919\/EUSIPCO.2018.8553236"},{"doi-asserted-by":"crossref","unstructured":"[3] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, \u201cCycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion,\u201d Proc. ICASSP, pp.6820-6824, IEEE, 2019. 10.1109\/icassp.2019.8682897","key":"3","DOI":"10.1109\/ICASSP.2019.8682897"},{"doi-asserted-by":"crossref","unstructured":"[4] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, \u201cStarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks,\u201d 2018 IEEE Spoken Language Technology Workshop (SLT), pp.266-273, IEEE, 2018. 10.1109\/slt.2018.8639535","key":"4","DOI":"10.1109\/SLT.2018.8639535"},{"doi-asserted-by":"crossref","unstructured":"[5] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, \u201cStarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion,\u201d Proc. Interspeech, pp.679-683, 2019. 10.21437\/interspeech.2019-2236","key":"5","DOI":"10.21437\/Interspeech.2019-2236"},{"unstructured":"[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, \u201cGenerative adversarial nets,\u201d Proc. NIPS, pp.2672-2680, 2014.","key":"6"},{"doi-asserted-by":"crossref","unstructured":"[7] P. Isola, J.-Y. Zhu, T. Zhou, and A.A. Efros, \u201cImage-to-image translation with conditional adversarial networks,\u201d Proc. CVPR, pp.5967-5976, 2017. 10.1109\/cvpr.2017.632","key":"7","DOI":"10.1109\/CVPR.2017.632"},{"doi-asserted-by":"crossref","unstructured":"[8] J.-Y. Zhu, T. Park, P. Isola, and A.A. Efros, \u201cUnpaired image-to-image translation using cycle-consistent adversarial networks,\u201d Proc. 
ICCV, pp.2223-2232, 2017. 10.1109\/iccv.2017.244","key":"8","DOI":"10.1109\/ICCV.2017.244"},{"doi-asserted-by":"publisher","unstructured":"[9] T. Toda, A.W. Black, and K. Tokuda, \u201cVoice conversion based on maximum-likelihood estimation of spectral parameter trajectory,\u201d IEEE Trans. Audio, Speech, and Lang. Process., vol.15, no.8, pp.2222-2235, 2007. 10.1109\/tasl.2007.907344","key":"9","DOI":"10.1109\/TASL.2007.907344"},{"doi-asserted-by":"crossref","unstructured":"[10] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, \u201cVoice conversion from non-parallel corpora using variational auto-encoder,\u201d Proc. APSIPA, pp.1-6, IEEE, 2016. 10.1109\/apsipa.2016.7820786","key":"10","DOI":"10.1109\/APSIPA.2016.7820786"},{"doi-asserted-by":"crossref","unstructured":"[11] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, \u201cNon-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,\u201d Proc. ICASSP, pp.5274-5278, IEEE, 2018. 10.1109\/icassp.2018.8461384","key":"11","DOI":"10.1109\/ICASSP.2018.8461384"},{"doi-asserted-by":"publisher","unstructured":"[12] T. Nakashika, T. Takiguchi, and Y. Ariki, \u201cVoice conversion using RNN pre-trained by recurrent temporal restricted Boltzmann machines,\u201d IEEE\/ACM Trans. Audio, Speech and Language Process., vol.23, no.3, pp.580-587, 2015. 10.1109\/taslp.2014.2379589","key":"12","DOI":"10.1109\/TASLP.2014.2379589"},{"doi-asserted-by":"crossref","unstructured":"[13] L. Sun, S. Kang, K. Li, and H. Meng, \u201cVoice conversion using deep bidirectional long short-term memory based recurrent neural networks,\u201d Proc. ICASSP, pp.4869-4873, IEEE, 2015. 10.1109\/icassp.2015.7178896","key":"13","DOI":"10.1109\/ICASSP.2015.7178896"},{"unstructured":"[14] F. Doshi-Velez and B. Kim, \u201cTowards a rigorous science of interpretable machine learning,\u201d arXiv preprint arXiv:1702.08608, 2017.","key":"14"},{"doi-asserted-by":"publisher","unstructured":"[15] G. Montavon, W. Samek, and K.R. M\u00fcller, \u201cMethods for interpreting and understanding deep neural networks,\u201d Digital Signal Processing, vol.73, pp.1-15, 2018. 10.1016\/j.dsp.2017.10.011","key":"15","DOI":"10.1016\/j.dsp.2017.10.011"},{"unstructured":"[16] N. Liu, M. Du, and X. Hu, \u201cAdversarial machine learning: An interpretation perspective,\u201d arXiv preprint arXiv:2004.11488, 2020.","key":"16"},{"doi-asserted-by":"publisher","unstructured":"[17] T. Nakashika, T. Takiguchi, and Y. Minami, \u201cNon-parallel training in voice conversion using an adaptive restricted Boltzmann machine,\u201d IEEE\/ACM Trans. Audio, Speech and Language Process., vol.24, no.11, pp.2032-2045, 2016. 10.1109\/taslp.2016.2593263","key":"17","DOI":"10.1109\/TASLP.2016.2593263"},{"doi-asserted-by":"publisher","unstructured":"[18] M. Pitz and H. Ney, \u201cVocal tract normalization equals linear transformation in cepstral space,\u201d IEEE Trans. Speech and Audio Process., vol.13, no.5, pp.930-944, 2005. 10.1109\/tsa.2005.848881","key":"18","DOI":"10.1109\/TSA.2005.848881"},{"unstructured":"[19] P.B. Denes and E.N. Pinson, The Speech Chain, 2 ed., W.H. Freeman and Co., New York, 1993.","key":"19"},{"doi-asserted-by":"publisher","unstructured":"[20] K. Sone and T. Nakashika, \u201cPre-training of DNN-based speech synthesis based on bidirectional conversion between text and speech,\u201d IEICE Trans. Inf.&amp; Syst., vol.E102-D, no.8, pp.1546-1553, 2019. 
10.1587\/transinf.2018edp7344","key":"20","DOI":"10.1587\/transinf.2018EDP7344"},{"doi-asserted-by":"crossref","unstructured":"[21] K. Cho, A. Ilin, and T. Raiko, \u201cImproved learning of Gaussian-Bernoulli restricted Boltzmann machines,\u201d Proc. ICANN, vol.6791, pp.10-17, Springer, 2011. 10.1007\/978-3-642-21735-7_2","key":"21","DOI":"10.1007\/978-3-642-21735-7_2"},{"doi-asserted-by":"publisher","unstructured":"[22] H. Kawahara, \u201cSTRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds,\u201d Acoust. Sci. &amp; Tech., vol.27, no.6, pp.349-353, 2006. 10.1250\/ast.27.349","key":"22","DOI":"10.1250\/ast.27.349"},{"unstructured":"[23] D.P. Kingma and J. Ba, \u201cAdam: A method for stochastic optimization,\u201d arXiv preprint arXiv:1412.6980, 2014.","key":"23"},{"doi-asserted-by":"crossref","unstructured":"[24] A.W. Rix, J.G. Beerends, M.P. Hollier, and A.P. Hekstra, \u201cPerceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,\u201d Proc. ICASSP, pp.749-752, IEEE, 2001. 10.1109\/icassp.2001.941023","key":"24","DOI":"10.1109\/ICASSP.2001.941023"},{"doi-asserted-by":"crossref","unstructured":"[25] M. Wester, Z. Wu, and J. Yamagishi, \u201cAnalysis of the Voice Conversion Challenge 2016 Evaluation Results,\u201d Proc. Interspeech, pp.1637-1641, 2016. 10.21437\/interspeech.2016-1331","key":"25","DOI":"10.21437\/Interspeech.2016-1331"},{"doi-asserted-by":"crossref","unstructured":"[26] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, \u201cThe voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,\u201d arXiv preprint arXiv:1804.04262, 2018. 10.21437\/odyssey.2018-28","key":"26","DOI":"10.21437\/Odyssey.2018-28"},{"doi-asserted-by":"publisher","unstructured":"[27] K. Pearson and D. Heron, \u201cOn theories of association,\u201d Biometrika, vol.9, no.1-2, pp.159-315, 1913. 10.1093\/biomet\/9.1-2.159","key":"27","DOI":"10.1093\/biomet\/9.1-2.159"},{"unstructured":"[28] A. Spencer, Phonology: Theory and Description, Blackwell, Oxford, 1996.","key":"28"},{"doi-asserted-by":"publisher","unstructured":"[29] N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, \u201cFront-end factor analysis for speaker verification,\u201d IEEE Trans. Audio, Speech, and Language Process., vol.19, no.4, pp.788-798, 2010. 10.1109\/tasl.2010.2064307","key":"29","DOI":"10.1109\/TASL.2010.2064307"},{"doi-asserted-by":"crossref","unstructured":"[30] J. Wu, Z. Wu, and L. Xie, \u201cOn the use of i-vectors and average voice model for voice conversion without parallel data,\u201d Proc. APSIPA, pp.1-6, IEEE, 2016. 10.1109\/apsipa.2016.7820901","key":"30","DOI":"10.1109\/APSIPA.2016.7820901"},{"doi-asserted-by":"crossref","unstructured":"[31] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, \u201cNon-parallel voice conversion using i-vector plda: Towards unifying speaker verification and transformation,\u201d Proc. ICASSP, pp.5535-5539, IEEE, 2017. 10.1109\/icassp.2017.7953215","key":"31","DOI":"10.1109\/ICASSP.2017.7953215"},{"doi-asserted-by":"crossref","unstructured":"[32] S. Liu, J. Zhong, L. Sun, X. Wu, X. Liu, and H. Meng, \u201cVoice conversion across arbitrary speakers based on a single target-speaker utterance.,\u201d Proc. Interspeech, pp.496-500, 2018. 10.21437\/interspeech.2018-1504","key":"32","DOI":"10.21437\/Interspeech.2018-1504"},{"unstructured":"[33] S. Takamichi, K. Mitsui, Y. Saito, T. 
Koriyama, N. Tanji, and H. Saruwatari, \u201cJVS corpus: free Japanese multi-speaker voice corpus,\u201d arXiv preprint arXiv:1908.06248, 2019.","key":"33"},{"doi-asserted-by":"publisher","unstructured":"[34] M. Morise, F. Yokomori, and K. Ozawa, \u201cWORLD: A vocoder-based high-quality speech synthesis system for real-time applications,\u201d IEICE Trans. Inf.&amp; Syst., vol.E99-D, no.7, pp.1877-1884, 2016. 10.1587\/transinf.2015edp7457","key":"34","DOI":"10.1587\/transinf.2015EDP7457"},{"doi-asserted-by":"crossref","unstructured":"[35] D. Garcia-Romero and C.Y. Espy-Wilson, \u201cAnalysis of i-vector length normalization in speaker recognition systems,\u201d Proc. Interspeech, pp.249-252, 2011.","key":"35","DOI":"10.21437\/Interspeech.2011-53"}],"container-title":["IEICE Transactions on Information and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E103.D\/11\/E103.D_2020EDP7032\/_pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,16]],"date-time":"2024-08-16T19:07:47Z","timestamp":1723835267000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E103.D\/11\/E103.D_2020EDP7032\/_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,11,1]]},"references-count":35,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2020]]}},"URL":"https:\/\/doi.org\/10.1587\/transinf.2020edp7032","relation":{},"ISSN":["0916-8532","1745-1361"],"issn-type":[{"type":"print","value":"0916-8532"},{"type":"electronic","value":"1745-1361"}],"subject":[],"published":{"date-parts":[[2020,11,1]]}}}
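For context, the record above is the standard envelope returned by the Crossref REST API (https://api.crossref.org/works/{DOI}): a "status"/"message-type" wrapper with the bibliographic record under "message". Below is a minimal Python sketch, using only the standard library, of how such a record can be fetched and its main fields read back out; the client name and mailto address in the User-Agent header are illustrative placeholders, not values from the record.

import json
import urllib.request

# DOI of the work shown above.
doi = "10.1587/transinf.2020edp7032"

# Crossref asks clients to identify themselves; the contact below is a placeholder.
request = urllib.request.Request(
    f"https://api.crossref.org/works/{doi}",
    headers={"User-Agent": "example-client/0.1 (mailto:you@example.org)"},
)

# The response body is the same envelope as above; the record lives under "message".
with urllib.request.urlopen(request) as response:
    work = json.load(response)["message"]

# Read back the fields that appear in the record above.
print(work["title"][0])                              # article title
print(work["container-title"][0])                    # journal name
print(work["volume"], work["issue"], work["page"])   # E103.D, 11, 2340-2350
print(", ".join(f'{a["given"]} {a["family"]}' for a in work["author"]))
print(len(work.get("reference", [])), "references")  # 35 references

Note that "title", "container-title", and similar fields are lists in the Crossref schema even when they hold a single value, hence the [0] indexing.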