{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T07:03:17Z","timestamp":1781593397562,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":55,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Zhejiang Natural Science Foundation","award":["LR19F020006"],"award-info":[{"award-number":["LR19F020006"]}]},{"name":"National Key R&D Program of China","award":["No.61836002"],"award-info":[{"award-number":["No.61836002"]}]},{"name":"National Key R&D Program of China","award":["No. 62072397"],"award-info":[{"award-number":["No. 62072397"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547855","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:35Z","timestamp":1665416555000},"page":"2595-2605","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":129,"title":["ProDiff"],"prefix":"10.1145","author":[{"given":"Rongjie","family":"Huang","sequence":"first","affiliation":[{"name":"Zhejiang University, HangZhou, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhou","family":"Zhao","sequence":"additional","affiliation":[{"name":"Zhejiang University, HangZhou, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Huadai","family":"Liu","sequence":"additional","affiliation":[{"name":"Zhejiang University, HangZhou, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jinglin","family":"Liu","sequence":"additional","affiliation":[{"name":"Zhejiang University, HangZhou, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chenye","family":"Cui","sequence":"additional","affiliation":[{"name":"Zhejiang University, HangZhou, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yi","family":"Ren","sequence":"additional","affiliation":[{"name":"Zhejiang University, HangZhou, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Adaspeech: Adaptive text to speech for custom voice. arXiv preprint arXiv:2103.00993","author":"Chen Mingjian","year":"2021","unstructured":"Mingjian Chen , Xu Tan , Bohan Li , Yanqing Liu , Tao Qin , Sheng Zhao , and Tie-Yan Liu . 2021 . Adaspeech: Adaptive text to speech for custom voice. arXiv preprint arXiv:2103.00993 (2021). Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, and Tie-Yan Liu. 2021. Adaspeech: Adaptive text to speech for custom voice. arXiv preprint arXiv:2103.00993 (2021)."},{"key":"e_1_3_2_2_2_1","volume-title":"Proc. of ICLR.","author":"Chen Nanxin","year":"2020","unstructured":"Nanxin Chen , Yu Zhang , Heiga Zen , Ron J Weiss , Mohammad Norouzi , and William Chan . 2020 . WaveGrad: Estimating Gradients for Waveform Generation . In Proc. of ICLR. Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. 2020. WaveGrad: Estimating Gradients for Waveform Generation. In Proc. of ICLR."},{"key":"e_1_3_2_2_3_1","volume-title":"Generative adversarial networks: An overview","author":"Creswell Antonia","year":"2018","unstructured":"Antonia Creswell , Tom White , Vincent Dumoulin , Kai Arulkumaran , Biswa Sengupta , and Anil A Bharath . 2018. Generative adversarial networks: An overview . IEEE Signal Processing Magazine ( 2018 ). Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. 2018. Generative adversarial networks: An overview. 
IEEE Signal Processing Magazine (2018)."},{"key":"e_1_3_2_2_4_1","volume-title":"EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model. arXiv preprint arXiv:2106.09317","author":"Cui Chenye","year":"2021","unstructured":"Chenye Cui , Yi Ren , Jinglin Liu , Feiyang Chen , Rongjie Huang , Ming Lei , and Zhou Zhao . 2021 . EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model. arXiv preprint arXiv:2106.09317 (2021). Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, and Zhou Zhao. 2021. EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model. arXiv preprint arXiv:2106.09317 (2021)."},{"key":"e_1_3_2_2_5_1","volume-title":"Diffusion models beat gans on image synthesis. arXiv preprint arXiv:2105.05233","author":"Dhariwal Prafulla","year":"2021","unstructured":"Prafulla Dhariwal and Alex Nichol . 2021. Diffusion models beat gans on image synthesis. arXiv preprint arXiv:2105.05233 ( 2021 ). Prafulla Dhariwal and Alex Nichol. 2021. Diffusion models beat gans on image synthesis. arXiv preprint arXiv:2105.05233 (2021)."},{"key":"e_1_3_2_2_6_1","volume-title":"End-to-end adversarial text-to-speech. arXiv preprint arXiv:2006.03575","author":"Donahue Jeff","year":"2020","unstructured":"Jeff Donahue , Sander Dieleman , Mikołaj Bińkowski, Erich Elsen , and Karen Simonyan . 2020. End-to-end adversarial text-to-speech. arXiv preprint arXiv:2006.03575 ( 2020 ). Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, and Karen Simonyan. 2020. End-to-end adversarial text-to-speech. arXiv preprint arXiv:2006.03575 (2020)."},{"key":"e_1_3_2_2_7_1","volume-title":"Victor OK Li, and Richard Socher","author":"Gu Jiatao","year":"2017","unstructured":"Jiatao Gu , James Bradbury , Caiming Xiong , Victor OK Li, and Richard Socher . 2017 . Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281 (2017). 
Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281 (2017)."},{"key":"e_1_3_2_2_8_1","volume-title":"Proc. of NeurIPS.","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho , Ajay Jain , and Pieter Abbeel . 2020 . Denoising diffusion probabilistic models . Proc. of NeurIPS. Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Proc. of NeurIPS."},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475437"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"crossref","unstructured":"Rongjie Huang Chenye Cui Feiyang Chen Yi Ren Jinglin Liu Zhou Zhao Baoxing Huai and Zhefeng Wang. 2021. SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation.  Rongjie Huang Chenye Cui Feiyang Chen Yi Ren Jinglin Liu Zhou Zhao Baoxing Huai and Zhefeng Wang. 2021. SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation.","DOI":"10.1145\/3503161.3547854"},{"key":"e_1_3_2_2_11_1","volume-title":"Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao.","author":"Huang Rongjie","year":"2022","unstructured":"Rongjie Huang , Max WY Lam , Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. 2022 . FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis . arXiv preprint arXiv:2204.09934 (2022). Rongjie Huang, Max WY Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. 2022. FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis. arXiv preprint arXiv:2204.09934 (2022)."},{"key":"e_1_3_2_2_12_1","volume-title":"GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis. arXiv preprint arXiv:2205.07211","author":"Huang Rongjie","year":"2022","unstructured":"Rongjie Huang , Yi Ren , Jinglin Liu , Chenye Cui , and Zhou Zhao . 2022. 
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis. arXiv preprint arXiv:2205.07211 ( 2022 ). Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, and Zhou Zhao. 2022. GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis. arXiv preprint arXiv:2205.07211 (2022)."},{"key":"e_1_3_2_2_13_1","volume-title":"https:\/\/github.com\/ huawei-noah\/ SpeechBackbones","year":"2021","unstructured":"huawei noah. 2021. Speech-Backbones. https:\/\/github.com\/ huawei-noah\/ SpeechBackbones ( 2021 ). huawei noah. 2021. Speech-Backbones. https:\/\/github.com\/ huawei-noah\/ SpeechBackbones (2021)."},{"key":"e_1_3_2_2_14_1","volume-title":"The lj speech dataset. https:\/\/keithito.com\/ LJ-Speech-Dataset\/","author":"Ito Keith","year":"2017","unstructured":"Keith Ito . 2017. The lj speech dataset. https:\/\/keithito.com\/ LJ-Speech-Dataset\/ ( 2017 ). Keith Ito. 2017. The lj speech dataset. https:\/\/keithito.com\/ LJ-Speech-Dataset\/ (2017)."},{"key":"e_1_3_2_2_15_1","volume-title":"Byoung Jin Choi, and Nam Soo Kim.","author":"Jeong Myeonghun","year":"2021","unstructured":"Myeonghun Jeong , Hyeongju Kim , Sung Jun Cheon , Byoung Jin Choi, and Nam Soo Kim. 2021 . Diff-tts : A de noising diffusion model for text-to-speech. arXiv preprint arXiv:2104.01409 (2021). Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, and Nam Soo Kim. 2021. Diff-tts: A denoising diffusion model for text-to-speech. arXiv preprint arXiv:2104.01409 (2021)."},{"key":"e_1_3_2_2_16_1","first-page":"8067","article-title":"Glow-tts: A generative flow for text-to-speech via monotonic alignment search","volume":"33","author":"Kim Jaehyeon","year":"2020","unstructured":"Jaehyeon Kim , Sungwon Kim , Jungil Kong , and Sungroh Yoon . 2020 . Glow-tts: A generative flow for text-to-speech via monotonic alignment search . Advances in Neural Information Processing Systems 33 (2020), 8067 -- 8077 . 
Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. 2020. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. Advances in Neural Information Processing Systems 33 (2020), 8067--8077.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_17_1","volume-title":"International Conference on Machine Learning. PMLR, 5530--5540","author":"Kim Jaehyeon","year":"2021","unstructured":"Jaehyeon Kim , Jungil Kong , and Juhee Son . 2021 . Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech . In International Conference on Machine Learning. PMLR, 5530--5540 . Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning. PMLR, 5530--5540."},{"key":"e_1_3_2_2_18_1","volume-title":"Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947","author":"Kim Yoon","year":"2016","unstructured":"Yoon Kim and Alexander M Rush . 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947 ( 2016 ). Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947 (2016)."},{"key":"e_1_3_2_2_19_1","volume-title":"Proc. of NeurIPS","author":"Kong Jungil","year":"2020","unstructured":"Jungil Kong , Jaehyeon Kim , and Jaekyoung Bae . 2020 . HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis . Proc. of NeurIPS (2020). Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Proc. of NeurIPS (2020)."},{"key":"e_1_3_2_2_20_1","volume-title":"Proc. of ICLR.","author":"Kong Zhifeng","year":"2020","unstructured":"Zhifeng Kong , Wei Ping , Jiaji Huang , Kexin Zhao , and Bryan Catanzaro . 2020 . DiffWave: A Versatile Diffusion Model for Audio Synthesis . In Proc. of ICLR. 
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2020. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In Proc. of ICLR."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACRIM.1993.407206"},{"key":"e_1_3_2_2_22_1","unstructured":"Max W. Y. Lam Jun Wang Rongjie Huang Dan Su and Dong Yu. 2021. Bilateral Denoising Diffusion Models. arXiv:2108.11514 [cs.LG]  Max W. Y. Lam Jun Wang Rongjie Huang Dan Su and Dong Yu. 2021. Bilateral Denoising Diffusion Models. arXiv:2108.11514 [cs.LG]"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016706"},{"key":"e_1_3_2_2_24_1","volume-title":"Diffsinger: Singing voice synthesis via shallow diffusion mechanism. arXiv preprint arXiv:2105.02446 2","author":"Liu Jinglin","year":"2021","unstructured":"Jinglin Liu , Chengxi Li , Yi Ren , Feiyang Chen , Peng Liu , and Zhou Zhao . 2021 . Diffsinger: Singing voice synthesis via shallow diffusion mechanism. arXiv preprint arXiv:2105.02446 2 (2021). Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Peng Liu, and Zhou Zhao. 2021. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. arXiv preprint arXiv:2105.02446 2 (2021)."},{"key":"e_1_3_2_2_25_1","volume-title":"DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs. arXiv preprint arXiv:2201.11972","author":"Liu Songxiang","year":"2022","unstructured":"Songxiang Liu , Dan Su , and Dong Yu. 2022. DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs. arXiv preprint arXiv:2201.11972 ( 2022 ). Songxiang Liu, Dan Su, and Dong Yu. 2022. DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs. 
arXiv preprint arXiv:2201.11972 (2022)."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00286"},{"key":"e_1_3_2_2_27_1","volume-title":"Eunho Yang, and Sung Ju Hwang.","author":"Min Dongchan","year":"2021","unstructured":"Dongchan Min , Dong Bok Lee , Eunho Yang, and Sung Ju Hwang. 2021 . Metastylespeech : Multi-speaker adaptive text-to-speech generation. (2021), 7748-- 7759. Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang. 2021. Metastylespeech: Multi-speaker adaptive text-to-speech generation. (2021), 7748-- 7759."},{"key":"e_1_3_2_2_28_1","volume-title":"https:\/\/github.com\/MoonInTheRiver\/DiffSinger","year":"2021","unstructured":"MoonInTheRiver. 2021. DiffSinger. https:\/\/github.com\/MoonInTheRiver\/DiffSinger ( 2021 ). MoonInTheRiver. 2021. DiffSinger. https:\/\/github.com\/MoonInTheRiver\/DiffSinger (2021)."},{"key":"e_1_3_2_2_29_1","volume-title":"Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499","author":"van den Oord Aaron","year":"2016","unstructured":"Aaron van den Oord , Sander Dieleman , Heiga Zen , Karen Simonyan , Oriol Vinyals , Alex Graves , Nal Kalchbrenner , Andrew Senior , and Koray Kavukcuoglu . 2016 . Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016). Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)."},{"key":"e_1_3_2_2_30_1","volume-title":"International Conference on Machine Learning. PMLR, 8599--8608","author":"Popov Vadim","year":"2021","unstructured":"Vadim Popov , Ivan Vovk , Vladimir Gogoryan , Tasnima Sadekova , and Mikhail Kudinov . 2021 . Grad-tts: A diffusion probabilistic model for text-to-speech . In International Conference on Machine Learning. PMLR, 8599--8608 . 
Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. 2021. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning. PMLR, 8599--8608."},{"key":"e_1_3_2_2_31_1","unstructured":"Flavio Protasio Ribeiro Dinei Florencio Cha Zhang and Seltze. [n.d.]. CROWDMOS: An Approach for Crowdsourcing Mean Opinion Score Studies. ([n. d.]). Edition: ICASSP.  Flavio Protasio Ribeiro Dinei Florencio Cha Zhang and Seltze. [n.d.]. CROWDMOS: An Approach for Crowdsourcing Mean Opinion Score Studies. ([n. d.]). Edition: ICASSP."},{"key":"e_1_3_2_2_32_1","volume-title":"Searching for activation functions. arXiv preprint arXiv:1710.05941","author":"Ramachandran Prajit","year":"2017","unstructured":"Prajit Ramachandran , Barret Zoph , and Quoc V Le. 2017. Searching for activation functions. arXiv preprint arXiv:1710.05941 ( 2017 ). Prajit Ramachandran, Barret Zoph, and Quoc V Le. 2017. Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017)."},{"key":"e_1_3_2_2_33_1","volume-title":"Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558","author":"Ren Yi","year":"2020","unstructured":"Yi Ren , Chenxu Hu , Xu Tan , Tao Qin , Sheng Zhao , Zhou Zhao , and Tie-Yan Liu . 2020. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558 ( 2020 ). Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558 (2020)."},{"key":"e_1_3_2_2_34_1","volume-title":"PortaSpeech: Portable and High-Quality Generative Text-to-Speech. Advances in Neural Information Processing Systems 34","author":"Ren Yi","year":"2021","unstructured":"Yi Ren , Jinglin Liu , and Zhou Zhao . 2021. PortaSpeech: Portable and High-Quality Generative Text-to-Speech. Advances in Neural Information Processing Systems 34 ( 2021 ). 
Yi Ren, Jinglin Liu, and Zhou Zhao. 2021. PortaSpeech: Portable and High-Quality Generative Text-to-Speech. Advances in Neural Information Processing Systems 34 (2021)."},{"key":"e_1_3_2_2_35_1","volume-title":"Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32","author":"Ren Yi","year":"2019","unstructured":"Yi Ren , Yangjun Ruan , Xu Tan , Tao Qin , Sheng Zhao , Zhou Zhao , and Tie-Yan Liu . 2019 . Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32 (2019). Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32 (2019)."},{"key":"e_1_3_2_2_36_1","volume-title":"Revisiting OverSmoothness in Text to Speech. arXiv preprint arXiv:2202.13066","author":"Ren Yi","year":"2022","unstructured":"Yi Ren , Xu Tan , Tao Qin , Zhou Zhao , and Tie-Yan Liu . 2022. Revisiting OverSmoothness in Text to Speech. arXiv preprint arXiv:2202.13066 ( 2022 ). Yi Ren, Xu Tan, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2022. Revisiting OverSmoothness in Text to Speech. arXiv preprint arXiv:2202.13066 (2022)."},{"key":"e_1_3_2_2_37_1","volume-title":"Proc. of ICONIP.","author":"Richardson Eitan","year":"2018","unstructured":"Eitan Richardson and Yair Weiss . 2018 . On GANs and GMMs . In Proc. of ICONIP. Eitan Richardson and Yair Weiss. 2018. On GANs and GMMs. In Proc. of ICONIP."},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2001.941023"},{"key":"e_1_3_2_2_39_1","volume-title":"Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512","author":"Salimans Tim","year":"2022","unstructured":"Tim Salimans and Jonathan Ho. 2022. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 ( 2022 ). Tim Salimans and Jonathan Ho. 2022. 
Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022)."},{"key":"e_1_3_2_2_40_1","volume-title":"Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600","author":"San-Roman Robin","year":"2021","unstructured":"Robin San-Roman , Eliya Nachmani , and Lior Wolf . 2021. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600 ( 2021 ). Robin San-Roman, Eliya Nachmani, and Lior Wolf. 2021. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600 (2021)."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"e_1_3_2_2_42_1","volume-title":"International Conference on Machine Learning. PMLR, 2256--2265","author":"Sohl-Dickstein Jascha","year":"2015","unstructured":"Jascha Sohl-Dickstein , Eric Weiss , Niru Maheswaranathan , and Surya Ganguli . 2015 . Deep unsupervised learning using nonequilibrium thermodynamics . In International Conference on Machine Learning. PMLR, 2256--2265 . Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256--2265."},{"key":"e_1_3_2_2_43_1","volume-title":"Proc. of ICLR.","author":"Song Jiaming","year":"2020","unstructured":"Jiaming Song , Chenlin Meng , and Stefano Ermon . 2020 . Denoising Diffusion Implicit Models . In Proc. of ICLR. Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising Diffusion Implicit Models. In Proc. of ICLR."},{"key":"e_1_3_2_2_44_1","volume-title":"Improved techniques for training scorebased generative models. Advances in neural information processing systems 33","author":"Song Yang","year":"2020","unstructured":"Yang Song and Stefano Ermon . 2020. Improved techniques for training scorebased generative models. Advances in neural information processing systems 33 ( 2020 ), 12438--12448. 
Yang Song and Stefano Ermon. 2020. Improved techniques for training scorebased generative models. Advances in neural information processing systems 33 (2020), 12438--12448."},{"key":"e_1_3_2_2_45_1","volume-title":"Proc. of ICLR.","author":"Song Yang","year":"2020","unstructured":"Yang Song , Jascha Sohl-Dickstein , Diederik P Kingma , Abhishek Kumar , Stefano Ermon , and Ben Poole . 2020 . Score-Based Generative Modeling through Stochastic Differential Equations . In Proc. of ICLR. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2020. Score-Based Generative Modeling through Stochastic Differential Equations. In Proc. of ICLR."},{"key":"e_1_3_2_2_46_1","volume-title":"Token-level ensemble distillation for grapheme-to-phoneme conversion. arXiv preprint arXiv:1904.03446","author":"Sun Hao","year":"2019","unstructured":"Hao Sun , Xu Tan , Jun-Wei Gan , Hongzhi Liu , Sheng Zhao , Tao Qin , and Tie-Yan Liu . 2019. Token-level ensemble distillation for grapheme-to-phoneme conversion. arXiv preprint arXiv:1904.03446 ( 2019 ). Hao Sun, Xu Tan, Jun-Wei Gan, Hongzhi Liu, Sheng Zhao, Tao Qin, and Tie-Yan Liu. 2019. Token-level ensemble distillation for grapheme-to-phoneme conversion. arXiv preprint arXiv:1904.03446 (2019)."},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2010.5495701"},{"key":"e_1_3_2_2_48_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , Lukasz Kaiser , and Illia Polosukhin . 2017. Attention is all you need. Advances in neural information processing systems 30 ( 2017 ). Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. 
Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_2_49_1","volume-title":"Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135","author":"Wang Yuxuan","year":"2017","unstructured":"Yuxuan Wang , RJ Skerry-Ryan , Daisy Stanton , Yonghui Wu , Ron J Weiss , Navdeep Jaitly , Zongheng Yang , Ying Xiao , Zhifeng Chen , Samy Bengio , 2017 . Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135 (2017). Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. 2017. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135 (2017)."},{"key":"e_1_3_2_2_50_1","volume-title":"Image quality assessment: from error visibility to structural similarity","author":"Wang Zhou","year":"2004","unstructured":"Zhou Wang , Alan C Bovik , Hamid R Sheikh , and Eero P Simoncelli . 2004. Image quality assessment: from error visibility to structural similarity . IEEE transactions on image processing 13, 4 ( 2004 ), 600--612. Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600--612."},{"key":"e_1_3_2_2_51_1","volume-title":"Tackling the Generative Learning Trilemma with Denoising Diffusion GANs. arXiv preprint arXiv:2112.07804","author":"Xiao Zhisheng","year":"2021","unstructured":"Zhisheng Xiao , Karsten Kreis , and Arash Vahdat . 2021. Tackling the Generative Learning Trilemma with Denoising Diffusion GANs. arXiv preprint arXiv:2112.07804 ( 2021 ). Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. 2021. Tackling the Generative Learning Trilemma with Denoising Diffusion GANs. 
arXiv preprint arXiv:2112.07804 (2021)."},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053795"},{"key":"e_1_3_2_2_53_1","volume-title":"GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis. arXiv preprint arXiv:2106.15153","author":"Yang Jinhyeok","year":"2021","unstructured":"Jinhyeok Yang , Jae-Sung Bae , Taejun Bak , Youngik Kim , and Hoon-Young Cho . 2021. GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis. arXiv preprint arXiv:2106.15153 ( 2021 ). Jinhyeok Yang, Jae-Sung Bae, Taejun Bak, Youngik Kim, and Hoon-Young Cho. 2021. GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis. arXiv preprint arXiv:2106.15153 (2021)."},{"key":"e_1_3_2_2_54_1","volume-title":"GAN Vocoder: Multi-Resolution Discriminator Is All You Need. arXiv preprint arXiv:2103.05236","author":"You Jaeseong","year":"2021","unstructured":"Jaeseong You , Dalhyun Kim , Gyuhyeon Nam , Geumbyeol Hwang , and Gyeongsu Chae . 2021. GAN Vocoder: Multi-Resolution Discriminator Is All You Need. arXiv preprint arXiv:2103.05236 ( 2021 ). Jaeseong You, Dalhyun Kim, Gyuhyeon Nam, Geumbyeol Hwang, and Gyeongsu Chae. 2021. GAN Vocoder: Multi-Resolution Discriminator Is All You Need. arXiv preprint arXiv:2103.05236 (2021)."},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"crossref","unstructured":"Heiga Zen Viet Dang Rob Clark Yu Zhang Ron J Weiss Ye Jia Zhifeng Chen and Yonghui Wu. 2019. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882 (2019)  Heiga Zen Viet Dang Rob Clark Yu Zhang Ron J Weiss Ye Jia Zhifeng Chen and Yonghui Wu. 2019. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. 
arXiv preprint arXiv:1904.02882 (2019)","DOI":"10.21437\/Interspeech.2019-2441"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547855","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547855","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:35Z","timestamp":1750186955000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547855"}},"subtitle":["Progressive Fast Diffusion Model for High-Quality Text-to-Speech"],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":55,"alternative-id":["10.1145\/3503161.3547855","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547855","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}