{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:21:24Z","timestamp":1750220484936,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":27,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T00:00:00Z","timestamp":1634515200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,18]]},"DOI":"10.1145\/3461615.3491115","type":"proceedings-article","created":{"date-parts":[[2021,12,18]],"date-time":"2021-12-18T04:57:40Z","timestamp":1639803460000},"page":"131-136","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Noise Robust Singing Voice Synthesis Using Gaussian Mixture Variational Autoencoder"],"prefix":"10.1145","author":[{"given":"Heyang","family":"Xue","sequence":"first","affiliation":[{"name":"Northwestern Polytechnical University, China"}]},{"given":"Xiao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Northwestern Polytechnical University, China"}]},{"given":"Jie","family":"Wu","sequence":"additional","affiliation":[{"name":"Xiaomi AI Lab, China"}]},{"given":"Jian","family":"Luan","sequence":"additional","affiliation":[{"name":"Xiaomi AI Lab, China"}]},{"given":"Yujun","family":"Wang","sequence":"additional","affiliation":[{"name":"Xiaomi AI Lab, China"}]},{"given":"Lei","family":"Xie","sequence":"additional","affiliation":[{"name":"Northwestern Polytechnical University, 
China"}]}],"member":"320","published-online":{"date-parts":[[2021,12,17]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9415061"},{"key":"e_1_3_2_1_2_1","volume-title":"Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020","author":"Blaauw Merlijn","year":"2020","unstructured":"Merlijn Blaauw and Jordi Bonada. 2020. Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. 7229\u20137233. https:\/\/doi.org\/10.1109\/ICASSP40776.2020.9053944"},{"key":"e_1_3_2_1_3_1","unstructured":"Jiawei Chen, Xu Tan, Jian Luan, Tao Qin, and Tie-Yan Liu. 2020. HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis. CoRR abs\/2009.01776 (2020). arXiv:2009.01776 https:\/\/arxiv.org\/abs\/2009.01776"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Matthew\u00a0C Cieslak, Ann\u00a0M Castelfranco, Vittoria Roncalli, Petra\u00a0H Lenz, and Daniel\u00a0K Hartline. 2020. 
t-Distributed Stochastic Neighbor Embedding (t-SNE): A tool for eco-physiological transcriptomic analysis. Marine genomics 51 (2020), 100723.","DOI":"10.1016\/j.margen.2019.100723"},{"key":"e_1_3_2_1_5_1","unstructured":"Nat Dilokthanakul, Pedro A.\u00a0M. Mediano, Marta Garnelo, Matthew C.\u00a0H. Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. 2016. Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders. CoRR abs\/1611.02648 (2016). arXiv:1611.02648 http:\/\/arxiv.org\/abs\/1611.02648"},{"key":"e_1_3_2_1_6_1","volume-title":"Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010. 249\u2013256. 
http:\/\/proceedings.mlr.press\/v9\/glorot10a.html"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCSLP49672.2021.9362104"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.21105\/joss.02154"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683561"},{"key":"e_1_3_2_1_10_1","volume-title":"Hierarchical Generative Modeling for Controllable Speech Synthesis. In 7th International Conference on Learning Representations, ICLR 2019","author":"Hsu Wei-Ning","year":"2019","unstructured":"Wei-Ning Hsu, Yu Zhang, Ron\u00a0J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, and Ruoming Pang. 2019. Hierarchical Generative Modeling for Controllable Speech Synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. https:\/\/openreview.net\/forum?id=rygkk305YQ"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2017\/273"},{"key":"e_1_3_2_1_12_1","volume-title":"Semi-supervised Learning with Deep Generative Models. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014","author":"Kingma P.","year":"2014","unstructured":"Diederik\u00a0P. Kingma, Shakir Mohamed, Danilo\u00a0Jimenez Rezende, and Max Welling. 2014. Semi-supervised Learning with Deep Generative Models. 
In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. 3581\u20133589. https:\/\/proceedings.neurips.cc\/paper\/2014\/hash\/d523773c6b194f37b938d340d5d02232-Abstract.html"},{"volume-title":"Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. http:\/\/arxiv.org\/abs\/1312","author":"P.","key":"e_1_3_2_1_13_1","unstructured":"Diederik\u00a0P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. http:\/\/arxiv.org\/abs\/1312.6114"},{"key":"e_1_3_2_1_14_1","first-page":"2020","volume-title":"21st Annual Conference of the International Speech Communication Association, Virtual Event","author":"Liu Haohe","year":"2020","unstructured":"Haohe Liu, Lei Xie, Jian Wu, and Geng Yang. 2020. Channel-Wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music. 
In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020. 1241\u20131245. https:\/\/doi.org\/10.21437\/Interspeech.2020-2555"},{"key":"e_1_3_2_1_15_1","first-page":"2020","volume-title":"21st Annual Conference of the International Speech Communication Association, Virtual Event","author":"Lu Peiling","year":"2020","unstructured":"Peiling Lu, Jie Wu, Jian Luan, Xu Tan, and Li Zhou. 2020. XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020. 1306\u20131310. https:\/\/doi.org\/10.21437\/Interspeech.2020-1410"},{"key":"e_1_3_2_1_16_1","unstructured":"Chandan K.\u00a0A. 
Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun\u00a0Asokan Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, and Sriram Srinivasan. 2021. Interspeech 2021 Deep Noise Suppression Challenge. CoRR abs\/2101.01902 (2021). arXiv:2101.01902 https:\/\/arxiv.org\/abs\/2101.01902"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403249"},{"key":"e_1_3_2_1_18_1","volume-title":"Sequence-To-Sequence Singing Voice Synthesis With Perceptual Entropy Loss. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021","author":"Shi Jiatong","year":"2021","unstructured":"Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, and Qin Jin. 2021. Sequence-To-Sequence Singing Voice Synthesis With Perceptual Entropy Loss. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021. 76\u201380. https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9414348"},{"key":"e_1_3_2_1_19_1","volume-title":"Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm\u00e4ssan","author":"Skerry-Ryan J.","year":"2018","unstructured":"R.\u00a0J. Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron\u00a0J. Weiss, Rob Clark, and Rif\u00a0A. Saurous. 2018. Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm\u00e4ssan, Stockholm, Sweden, July 10-15, 2018. 4700\u20134709. 
http:\/\/proceedings.mlr.press\/v80\/skerry-ryan18a.html"},{"key":"e_1_3_2_1_20_1","volume-title":"A robust algorithm for pitch tracking (RAPT). Speech coding and synthesis 495","author":"Talkin David","year":"1995","unstructured":"David Talkin and W\u00a0Bastiaan Kleijn. 1995. A robust algorithm for pitch tracking (RAPT). Speech coding and synthesis 495 (1995), 518."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.21437\/SSW.2016-24"},{"key":"e_1_3_2_1_22_1","volume-title":"The 4th CHiME speech separation and recognition challenge. URL: http:\/\/spandh. dcs. shef. ac. uk\/chime_challenge\/(last accessed on","author":"Vincent Emmanuel","year":"2018","unstructured":"Emmanuel Vincent, Shinji Watanabe, Jon Barker, and Ricard Marxer. 2016. The 4th CHiME speech separation and recognition challenge. URL: http:\/\/spandh.dcs.shef.ac.uk\/chime_challenge\/ (last accessed on 1 August, 2018) (2016)."},{"key":"e_1_3_2_1_23_1","volume-title":"18th Annual Conference of the International Speech Communication Association","author":"Wang Yuxuan","year":"2017","unstructured":"Yuxuan Wang, R.\u00a0J. 
Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron\u00a0J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc\u00a0V. Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif\u00a0A. Saurous. 2017. Tacotron: Towards End-to-End Speech Synthesis. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017. 4006\u20134010. http:\/\/www.isca-speech.org\/archive\/Interspeech_2017\/abstracts\/1452.html"},{"key":"e_1_3_2_1_24_1","first-page":"2020","volume-title":"21st Annual Conference of the International Speech Communication Association, Virtual Event","author":"Wu Jie","year":"2020","unstructured":"Jie Wu and Jian Luan. 2020. Adversarially Trained Multi-Singer Sequence-to-Sequence Singing Synthesizer. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020. 1296\u20131300. 
https:\/\/doi.org\/10.21437\/Interspeech.2020-1109"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.3233\/XST-200650"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413934"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2007.899236"}],"event":{"name":"ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction"],"location":"Montreal QC Canada","acronym":"ICMI '21"},"container-title":["Companion Publication of the 2021 International Conference on Multimodal Interaction"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3461615.3491115","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3461615.3491115","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:49:04Z","timestamp":1750193344000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3461615.3491115"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,18]]},"references-count":27,"alternative-id":["10.1145\/3461615.3491115","10.1145\/3461615"],"URL":"https:\/\/doi.org\/10.1145\/3461615.3491115","relation":{},"subject":[],"published":{"date-parts":[[2021,10,18]]},"assertion":[{"value":"2021-12-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}