{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2022,11,12]],"date-time":"2022-11-12T05:35:20Z","timestamp":1668231320854},"reference-count":53,"publisher":"Institute of Electronics, Information and Communications Engineers (IEICE)","issue":"9","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IEICE Trans. Inf. &amp; Syst."],"published-print":{"date-parts":[[2020,9,1]]},"DOI":"10.1587\/transinf.2019edp7297","type":"journal-article","created":{"date-parts":[[2020,8,31]],"date-time":"2020-08-31T22:14:45Z","timestamp":1598912085000},"page":"1978-1987","source":"Crossref","is-referenced-by-count":0,"title":["Joint Adversarial Training of Speech Recognition and Synthesis Models for Many-to-One Voice Conversion Using Phonetic Posteriorgrams"],"prefix":"10.1587","volume":"E103.D","author":[{"given":"Yuki","family":"SAITO","sequence":"first","affiliation":[{"name":"DeNA Co., Ltd."},{"name":"The University of Tokyo"}]},{"given":"Kei","family":"AKUZAWA","sequence":"additional","affiliation":[{"name":"DeNA Co., Ltd."},{"name":"The University of Tokyo"}]},{"given":"Kentaro","family":"TACHIBANA","sequence":"additional","affiliation":[{"name":"DeNA Co., Ltd."}]}],"member":"532","reference":[{"key":"1","doi-asserted-by":"crossref","unstructured":"[1] Y. Stylianou, O. Capp\u00e9, and E. Moulines, \u201cContinuous probabilistic transform for voice conversion,\u201d IEEE Trans. Speech and Audio Processing, vol.6, no.2, pp.131-142, March 1988. 10.1109\/89.661472","DOI":"10.1109\/89.661472"},{"key":"2","doi-asserted-by":"publisher","unstructured":"[2] A.B. Kain, J.-P. Hosom, X. Niu, J.P.H. van Santen, M. Fried-Oken, and J. Staehely, \u201cImproving the intelligibility of dysarthric speech,\u201d Speech Communication, vol.49, no.9, pp.743-756, 2007. 10.1016\/j.specom.2007.05.001","DOI":"10.1016\/j.specom.2007.05.001"},{"key":"3","doi-asserted-by":"crossref","unstructured":"[3] H. Doi, T. Toda, T. Nakano, M. Goto, and S. Nakamura, \u201cSinging voice conversion method based on many-to-many Eigenvoice conversion and training data generation using a singing-to-singing synthesis system,\u201d Proc. APSIPA ASC, pp.1-6, Nov. 2012.","DOI":"10.21437\/Interspeech.2013-120"},{"key":"4","doi-asserted-by":"crossref","unstructured":"[4] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio,T. Kinnunen, and Z.H. Ling, \u201cThe Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods,\u201d Odyssey Workshop, Les Sables d&apos;Olonne, France, pp.195-202, June 2018. 10.21437\/odyssey.2018-28","DOI":"10.21437\/Odyssey.2018-28"},{"key":"5","doi-asserted-by":"publisher","unstructured":"[5] T. Toda, A.W. Black, and K. Tokuda, \u201cVoice conversion based on maximum-likelihood estimation of spectral parameter trajectory,\u201d IEEE Trans. Audio, Speech, Language Process., vol.15, no.8, pp.2222-2235, Nov. 2007. 10.1109\/tasl.2007.907344","DOI":"10.1109\/TASL.2007.907344"},{"key":"6","doi-asserted-by":"crossref","unstructured":"[6] T. Toda, O. Ohtani, and K. Shikano, \u201cOne-to-many and many-to-one voice conversion based on Eigenvoices,\u201d Proc. ICASSP, Hawaii, U.S.A., pp.1249-1252, April 2007. 10.1109\/icassp.2007.367303","DOI":"10.1109\/ICASSP.2007.367303"},{"key":"7","doi-asserted-by":"crossref","unstructured":"[7] S. Desai, E.V. Raghavendra, B. Yegnanarayana, A.W. Black, and K. Prahallad, \u201cVoice conversion using artificial neural networks,\u201d Proc. 
ICASSP, Taipei, Taiwan, pp.3893-3896, April 2009. 10.1109\/icassp.2009.4960478","DOI":"10.1109\/ICASSP.2009.4960478"},{"key":"8","doi-asserted-by":"crossref","unstructured":"[8] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, \u201cSequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,\u201d Proc. INTERSPEECH, Stockholm, Sweden, pp.1283-1287, Aug. 2017. 10.21437\/interspeech.2017-970","DOI":"10.21437\/Interspeech.2017-970"},{"key":"9","doi-asserted-by":"crossref","unstructured":"[9] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, \u201cAttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,\u201d Proc. ICASSP, Brighton, U.K., pp.6805-6809, May 2019. 10.1109\/icassp.2019.8683282","DOI":"10.1109\/ICASSP.2019.8683282"},{"key":"10","doi-asserted-by":"publisher","unstructured":"[10] T. Nakashika, T. Takiguchi, and Y. Minami, \u201cNon-parallel training in voice conversion using an adaptive restricted Boltzmann machine,\u201d IEEE\/ACM Trans. Audio, Speech, Language Process., vol.24, no.11, pp.2032-2045, Nov. 2016. 10.1109\/taslp.2016.2593263","DOI":"10.1109\/TASLP.2016.2593263"},{"key":"11","doi-asserted-by":"crossref","unstructured":"[11] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, \u201cVoice conversion from non-parallel corpora using variational auto-encoder,\u201d Proc. APSIPA ASC, Jeju, South Korea, Dec. 2016. 10.1109\/apsipa.2016.7820786","DOI":"10.1109\/APSIPA.2016.7820786"},{"key":"12","doi-asserted-by":"crossref","unstructured":"[12] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, \u201cNon-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,\u201d Proc. ICASSP, Calgary, Canada, pp.5274-5278, April 2018. 10.1109\/icassp.2018.8461384","DOI":"10.1109\/ICASSP.2018.8461384"},{"key":"13","doi-asserted-by":"crossref","unstructured":"[13] H. Zen, A. Senior, and M. Schuster, \u201cStatistical parametric speech synthesis using deep neural networks,\u201d Proc. ICASSP, Vancouver, Canada, pp.7962-7966, May 2013. 10.1109\/icassp.2013.6639215","DOI":"10.1109\/ICASSP.2013.6639215"},{"key":"14","doi-asserted-by":"publisher","unstructured":"[14] G.E. Hinton and R.R. Salakhutdinov, \u201cReducing the dimensionality of data with neural networks,\u201d Science, vol.313, no.5786, pp.504-507, 2006. 10.1126\/science.1127647","DOI":"10.1126\/science.1127647"},{"key":"15","doi-asserted-by":"publisher","unstructured":"[15] Z. Wu, P.L.D. Leon, C. Demiroglu, A. Khodabakhsh, S. King, Z. Ling, D. Saito, B. Stewart, T. Toda, M. Wester, and J. Yamagishi, \u201cAnti-spoofing for text-independent speaker verification: An initial database, comparison of countermeasures, and human performance,\u201d IEEE\/ACM Trans. Audio, Speech, Language Process., vol.24, no.4, pp.768-783, 2016. 10.1109\/taslp.2016.2526653","DOI":"10.1109\/TASLP.2016.2526653"},{"key":"16","doi-asserted-by":"crossref","unstructured":"[16] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, \u201cPhonetic posteriorgrams for many-to-one voice conversion without parallel data training,\u201d Proc. ICME, Seattle, U.S.A., July 2016. 10.1109\/icme.2016.7552917","DOI":"10.1109\/ICME.2016.7552917"},{"key":"17","doi-asserted-by":"crossref","unstructured":"[17] H. Miyoshi, Y. Saito, S. Takamichi, and H. Saruwatari, \u201cVoice conversion using sequence-to-sequence learning of context posterior probabilities,\u201d Proc. INTERSPEECH, Stockholm, Sweden, pp.1268-1272, Aug. 2017. 
10.21437\/interspeech.2017-247","DOI":"10.21437\/Interspeech.2017-247"},{"key":"18","doi-asserted-by":"publisher","unstructured":"[18] H. Zen, K. Tokuda, and A. Black, \u201cStatistical parametric speech synthesis,\u201d Speech Communication, vol.51, no.11, pp.1039-1064, 2009. 10.1016\/j.specom.2009.04.004","DOI":"10.1016\/j.specom.2009.04.004"},{"key":"19","unstructured":"[19] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, \u201cDomain-adversarial training of neural networks,\u201d Journal of Machine Learning Research, vol.17, no.59, pp.1-35, April 2016."},{"key":"20","doi-asserted-by":"publisher","unstructured":"[20] Y. Saito, S. Takamichi, and H. Saruwatari, \u201cStatistical parametric speech synthesis incorporating generative adversarial networks,\u201d IEEE\/ACM Trans. Audio, Speech, Language Process., vol.26, no.1, pp.84-96, Jan. 2018. 10.1109\/taslp.2017.2761547","DOI":"10.1109\/TASLP.2017.2761547"},{"key":"21","unstructured":"[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, \u201cGenerative adversarial nets,\u201d Proc. NIPS, Montreal, Canada, pp.2672-2680, 2014."},{"key":"22","doi-asserted-by":"crossref","unstructured":"[22] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie, \u201cDomain adversarial training for accented speech recognition,\u201d Proc. ICASSP, Calgary, Canada, pp.4854-4858, April 2018. 10.1109\/icassp.2018.8462663","DOI":"10.1109\/ICASSP.2018.8462663"},{"key":"23","doi-asserted-by":"crossref","unstructured":"[23] Q. Wang, W. Rao, S. Sun, L. Xie, E.S. Chng, and H. Li, \u201cUnsupervised domain adaptation via domain adversarial training for speaker recognition,\u201d Proc. ICASSP, Calgary, Canada, pp.4889-4893, April 2018. 10.1109\/icassp.2018.8461423","DOI":"10.1109\/ICASSP.2018.8461423"},{"key":"24","doi-asserted-by":"crossref","unstructured":"[24] J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, \u201cMulti-target voice conversion without parallel data by adversarially learning disentangled audio representations,\u201d Proc. INTERSPEECH, Hyderabad, India, pp.501-505, Sept. 2018. 10.21437\/interspeech.2018-1830","DOI":"10.21437\/Interspeech.2018-1830"},{"key":"25","unstructured":"[25] M. Arjovsky, S. Chintala, and L. Bottou, \u201cWasserstein GAN,\u201d arXiv, vol.abs\/1701.07875, 2017."},{"key":"26","doi-asserted-by":"crossref","unstructured":"[26] T. Kaneko and H. Kameoka, \u201cCycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks,\u201d Proc. EUSIPCO, Rome, Italy, pp.2114-2118, Sept. 2018. 10.23919\/eusipco.2018.8553236","DOI":"10.23919\/EUSIPCO.2018.8553236"},{"key":"27","doi-asserted-by":"crossref","unstructured":"[27] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, \u201cCycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion,\u201d Proc. ICASSP, Brighton, U.K., pp.6820-6824, May 2019. 10.1109\/icassp.2019.8682897","DOI":"10.1109\/ICASSP.2019.8682897"},{"key":"28","doi-asserted-by":"crossref","unstructured":"[28] J.-Y. Zhu, T. Park, P. Isola, and A.A. Efros, \u201cUnpaired image-to-image translation using cycle-consistent adversarial networks,\u201d Proc. ICCV, Venice, Italy, pp.2223-2232, Oct. 2017. 10.1109\/iccv.2017.244","DOI":"10.1109\/ICCV.2017.244"},{"key":"29","doi-asserted-by":"crossref","unstructured":"[29] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, \u201cStarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks,\u201d Proc. 
SLT, Athens, Greece, pp.266-273, Dec. 2018. 10.1109\/slt.2018.8639535","DOI":"10.1109\/SLT.2018.8639535"},{"key":"30","doi-asserted-by":"crossref","unstructured":"[30] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, \u201cStarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion,\u201d Proc. INTERSPEECH, Graz, Austria, pp.679-683, Sept. 2019. 10.21437\/interspeech.2019-2236","DOI":"10.21437\/Interspeech.2019-2236"},{"key":"31","doi-asserted-by":"crossref","unstructured":"[31] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, \u201cStarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,\u201d Proc. CVPR, Salt Lake City, U.S.A., pp.8789-8797, June 2018. 10.1109\/cvpr.2018.00916","DOI":"10.1109\/CVPR.2018.00916"},{"key":"32","doi-asserted-by":"crossref","unstructured":"[32] A. Tjandra, S. Sakti, and S. Nakamura, \u201cListening while speaking: Speech chain by deep learning,\u201d Proc. ASRU, Okinawa, Japan, pp.301-308, Dec. 2017. 10.1109\/asru.2017.8268950","DOI":"10.1109\/ASRU.2017.8268950"},{"key":"33","unstructured":"[33] M. Mirza and S. Osindero, \u201cConditional generative adversarial nets,\u201d arXiv, vol.abs\/1411.1784, 2014."},{"key":"34","doi-asserted-by":"publisher","unstructured":"[34] N. Hojo, Y. Ijima, and H. Mizuno, \u201cDNN-based speech synthesis using speaker codes,\u201d IEICE Trans. Inf. & Syst., vol.E101-D, no.2, pp.462-472, Feb. 2018. 10.1587\/transinf.2017edp7165","DOI":"10.1587\/transinf.2017EDP7165"},{"key":"35","doi-asserted-by":"crossref","unstructured":"[35] Y. Zhou, X. Tian, H. Xu, R.K. Das, and H. Li, \u201cCross-lingual voice conversion with bilingual phonetic posteriorgram and average modeling,\u201d Proc. ICASSP, Brighton, U.K., pp.6790-6794, May 2019. 10.1109\/icassp.2019.8683746","DOI":"10.1109\/ICASSP.2019.8683746"},{"key":"36","doi-asserted-by":"crossref","unstructured":"[36] S.H. Mohammadi and T. Kim, \u201cOne-shot voice conversion with disentangled representations by leveraging phonetic posteriorgrams,\u201d Proc. INTERSPEECH, Graz, Austria, pp.704-708, Sept. 2019. 10.21437\/interspeech.2019-1798","DOI":"10.21437\/Interspeech.2019-1798"},{"key":"37","unstructured":"[37] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, \u201cWaveNet: A generative model for raw audio,\u201d arXiv, vol.abs\/1609.03499, 2016."},{"key":"38","doi-asserted-by":"crossref","unstructured":"[38] H. Lu, Z. Wu, R. Li, S. Kang, J. Jia, and H. Meng, \u201cA compact framework for voice conversion using WaveNet conditioned on phonetic posteriorgrams,\u201d Proc. ICASSP, Brighton, U.K., pp.6810-6814, May 2019. 10.1109\/icassp.2019.8682938","DOI":"10.1109\/ICASSP.2019.8682938"},{"key":"39","doi-asserted-by":"crossref","unstructured":"[39] S. Liu, Y. Cao, X. Wu, L. Sun, X. Liu, and H. Meng, \u201cJointly trained conversion model and WaveNet vocoder for non-parallel voice conversion using mel-spectrograms and phonetic posteriorgrams,\u201d Proc. INTERSPEECH, Graz, Austria, pp.704-708, Sept. 2019. 10.21437\/interspeech.2019-1316","DOI":"10.21437\/Interspeech.2019-1316"},{"key":"40","doi-asserted-by":"publisher","unstructured":"[40] K. Sugiura, Y. Shiga, H. Kawai, T. Misu, and C. Hori, \u201cA cloud robotics approach towards dialogue-oriented robot speech,\u201d Advanced Robotics, vol.29, no.7, pp.449-456, March 2015. 10.1080\/01691864.2015.1009164","DOI":"10.1080\/01691864.2015.1009164"},{"key":"41","unstructured":"[41] K. Maekawa, H. Koiso, S. Furui, and H. 
Isahara, \u201cSpontaneous speech corpus of Japanese,\u201d Proc. LREC, pp.947-952, May 2000."},{"key":"42","doi-asserted-by":"publisher","unstructured":"[42] M. Morise, F. Yokomori, and K. Ozawa, \u201cWORLD: A vocoder-based high-quality speech synthesis system for real-time applications,\u201d IEICE Trans. Inf. & Syst., vol.E99-D, no.7, pp.1877-1884, July 2016. 10.1587\/transinf.2015edp7457","DOI":"10.1587\/transinf.2015EDP7457"},{"key":"43","doi-asserted-by":"publisher","unstructured":"[43] M. Morise, \u201cD4C, a band-aperiodicity estimator for high-quality speech synthesis,\u201d Speech Communication, vol.84, pp.57-65, Nov. 2016. 10.1016\/j.specom.2016.09.001","DOI":"10.1016\/j.specom.2016.09.001"},{"key":"44","unstructured":"[44] A. Camacho, \u201cSWIPE: A sawtooth waveform inspired pitch estimator for speech and music,\u201d Ph.D. dissertation, University of Florida, 2007."},{"key":"45","unstructured":"[45] D. Talkin, \u201cA robust algorithm for pitch tracking (RAPT),\u201d Speech Coding and Synthesis, pp.495-518, 1995."},{"key":"46","doi-asserted-by":"publisher","unstructured":"[46] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, \u201cConvolutional neural networks for speech recognition,\u201d IEEE\/ACM Trans. Audio, Speech, Language Process., vol.22, no.10, pp.1533-1545, Oct. 2014. 10.1109\/taslp.2014.2339736","DOI":"10.1109\/TASLP.2014.2339736"},{"key":"47","unstructured":"[47] A.L. Maas, A.Y. Hannun, and A.Y. Ng, \u201cRectifier nonlinearities improve neural network acoustic models,\u201d Proc. ICML, 2013."},{"key":"48","unstructured":"[48] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, \u201cDropout: A simple way to prevent neural networks from overfitting,\u201d The Journal of Machine Learning Research, vol.15, no.1, pp.1929-1958, April 2014."},{"key":"49","unstructured":"[49] S. Ioffe and C. Szegedy, \u201cBatch normalization: Accelerating deep network training by reducing internal covariate shift,\u201d Proc. ICML, 2015."},{"key":"50","unstructured":"[50] J. Duchi, E. Hazan, and Y. Singer, \u201cAdaptive subgradient methods for online learning and stochastic optimization,\u201d Journal of Machine Learning Research, vol.12, pp.2121-2159, July 2011."},{"key":"51","doi-asserted-by":"publisher","unstructured":"[51] B.W. Matthews, \u201cComparison of the predicted and observed secondary structure of T4 phage lysozyme,\u201d Biochimica et Biophysica Acta (BBA)-Protein Structure, vol.405, no.2, pp.442-451, Oct. 1975. 10.1016\/0005-2795(75)90109-9","DOI":"10.1016\/0005-2795(75)90109-9"},{"key":"52","doi-asserted-by":"crossref","unstructured":"[52] A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, and K. Shikano, \u201cATR Japanese speech database as a tool of speech recognition and synthesis,\u201d Speech Communication, vol.9, no.4, pp.357-363, Aug. 1990. 10.1016\/0167-6393(90)90011-w","DOI":"10.1016\/0167-6393(90)90011-W"},{"key":"53","doi-asserted-by":"publisher","unstructured":"[53] J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, \u201cSequence-to-sequence acoustic modeling for voice conversion,\u201d IEEE\/ACM Trans. Audio, Speech, Language Process., vol.27, no.3, pp.631-644, Jan. 2019. 
10.1109\/taslp.2019.2892235","DOI":"10.1109\/TASLP.2019.2892235"}],"container-title":["IEICE Transactions on Information and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E103.D\/9\/E103.D_2019EDP7297\/_pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,11,12]],"date-time":"2022-11-12T02:48:33Z","timestamp":1668221313000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E103.D\/9\/E103.D_2019EDP7297\/_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,9,1]]},"references-count":53,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2020]]}},"URL":"https:\/\/doi.org\/10.1587\/transinf.2019edp7297","relation":{},"ISSN":["0916-8532","1745-1361"],"issn-type":[{"value":"0916-8532","type":"print"},{"value":"1745-1361","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,1]]}}}
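
The object above is a Crossref REST API "work" record. As a minimal sketch (assuming Python 3 with the third-party requests package installed), the same record can be fetched live from api.crossref.org and its fields read as shown below; the field names (message.title, message.author, message.reference) come straight from the record above, while everything else in the snippet is illustrative.

import requests

# DOI taken from the record above.
DOI = "10.1587/transinf.2019edp7297"

# The Crossref REST API wraps a work record as {"status": "ok", "message": {...}},
# which is exactly the shape of the JSON above.
resp = requests.get(f"https://api.crossref.org/works/{DOI}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]

# "title" is a list of strings; "author" entries carry "given"/"family" names.
print(work["title"][0])
print(", ".join(f"{a.get('given', '')} {a['family']}".strip()
                for a in work.get("author", [])))

# Walk the embedded reference list; a "DOI" key is present only where Crossref
# matched the unstructured citation to a registered work ("doi-asserted-by").
for ref in work.get("reference", []):
    print(ref["key"], ref.get("DOI", "(no matched DOI)"))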