{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T15:53:46Z","timestamp":1781538826652,"version":"3.54.5"},"publisher-location":"New York, NY, USA","reference-count":46,"publisher":"ACM","license":[{"start":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T00:00:00Z","timestamp":1781481600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/legalcode"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2026,6,16]]},"DOI":"10.1145\/3805622.3810884","type":"proceedings-article","created":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T14:42:57Z","timestamp":1781534577000},"page":"855-864","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["ExpV2S: Zero-Shot Expressive Video-to-Speech Synthesis via Latent Diffusion Model"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-6870-8657","authenticated-orcid":false,"given":"Fang","family":"Zhang","sequence":"first","affiliation":[{"name":"State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, Anhui, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-9863-7689","authenticated-orcid":false,"given":"Anqi","family":"Gou","sequence":"additional","affiliation":[{"name":"State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, Anhui, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0227-3793","authenticated-orcid":false,"given":"Linli","family":"Xu","sequence":"additional","affiliation":[{"name":"State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, Anhui, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2026,6,15]]},"reference":[{"key":"e_1_3_3_2_2_2","unstructured":"Triantafyllos Afouras Joon\u00a0Son Chung and Andrew Zisserman. 2018. LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1809.00496 (2018)."},{"key":"e_1_3_3_2_3_2","doi-asserted-by":"crossref","unstructured":"Houwei Cao David\u00a0G Cooper Michael\u00a0K Keutmann Ruben\u00a0C Gur Ani Nenkova and Ragini Verma. 2014. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing 5 4 (2014) 377\u2013390.","DOI":"10.1109\/TAFFC.2014.2336244"},{"key":"e_1_3_3_2_4_2","doi-asserted-by":"crossref","unstructured":"Edresson Casanova Christopher Shulby Eren G\u00f6lge Nicolas\u00a0Michael M\u00fcller Frederico\u00a0Santos De\u00a0Oliveira Arnaldo\u00a0Candido Junior Anderson da\u00a0Silva Soares Sandra\u00a0Maria Aluisio and Moacir\u00a0Antonelli Ponti. 2021. SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2104.05557 (2021).","DOI":"10.21437\/Interspeech.2021-1774"},{"key":"e_1_3_3_2_5_2","first-page":"1597","volume-title":"International conference on machine learning","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597\u20131607."},{"key":"e_1_3_3_2_6_2","unstructured":"Xinlei Chen Haoqi Fan Ross Girshick and Kaiming He. 2020. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2003.04297 (2020)."},{"key":"e_1_3_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00718"},{"key":"e_1_3_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-194"},{"key":"e_1_3_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-54427-4_19"},{"key":"e_1_3_3_2_10_2","doi-asserted-by":"crossref","unstructured":"Martin Cooke Jon Barker Stuart Cunningham and Xu Shao. 2006. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America 120 5 (2006) 2421\u20132424.","DOI":"10.1121\/1.2229005"},{"key":"e_1_3_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01034"},{"key":"e_1_3_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i1.19966"},{"key":"e_1_3_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_3_2_14_2","unstructured":"Jonathan Ho William Chan Chitwan Saharia Jay Whang Ruiqi Gao Alexey Gritsenko Diederik\u00a0P Kingma Ben Poole Mohammad Norouzi David\u00a0J Fleet et\u00a0al. 2022. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2210.02303 (2022)."},{"key":"e_1_3_3_2_15_2","unstructured":"Jonathan Ho Ajay Jain and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020) 6840\u20136851."},{"key":"e_1_3_3_2_16_2","first-page":"13916","volume-title":"International Conference on Machine Learning","author":"Huang Rongjie","year":"2023","unstructured":"Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. 2023. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In International Conference on Machine Learning. PMLR, 13916\u201313932."},{"key":"e_1_3_3_2_17_2","doi-asserted-by":"crossref","unstructured":"Jesper Jensen and Cees\u00a0H Taal. 2016. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE\/ACM Transactions on Audio Speech and Language Processing 24 11 (2016) 2009\u20132022.","DOI":"10.1109\/TASLP.2016.2585878"},{"key":"e_1_3_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10095582"},{"key":"e_1_3_3_2_19_2","volume-title":"International Conference on Learning Representations","author":"Kong Zhifeng","year":"2021","unstructured":"Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2021. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In International Conference on Learning Representations."},{"key":"e_1_3_3_2_20_2","first-page":"3355","volume-title":"Interspeech","author":"Le\u00a0Cornu Thomas","year":"2015","unstructured":"Thomas Le\u00a0Cornu and Ben Milner. 2015. Reconstructing intelligible audio speech from visual speech features.. In Interspeech. 3355\u20133359."},{"key":"e_1_3_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72784-9_4"},{"key":"e_1_3_3_2_22_2","unstructured":"Haohe Liu Zehua Chen Yi Yuan Xinhao Mei Xubo Liu Danilo Mandic Wenwu Wang and Mark\u00a0D Plumbley. 2023. Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2301.12503 (2023)."},{"key":"e_1_3_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Haohe Liu Yi Yuan Xubo Liu Xinhao Mei Qiuqiang Kong Qiao Tian Yuping Wang Wenwu Wang Yuxuan Wang and Mark\u00a0D Plumbley. 2024. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE\/ACM Transactions on Audio Speech and Language Processing (2024).","DOI":"10.1109\/TASLP.2024.3399607"},{"key":"e_1_3_3_2_24_2","doi-asserted-by":"crossref","unstructured":"Steven\u00a0R Livingstone and Frank\u00a0A Russo. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic multimodal set of facial and vocal expressions in North American English. PloS one 13 5 (2018) e0196391.","DOI":"10.1371\/journal.pone.0196391"},{"key":"e_1_3_3_2_25_2","unstructured":"I Loshchilov. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/1711.05101 (2017)."},{"key":"e_1_3_3_2_26_2","unstructured":"Simian Luo Chuanhao Yan Chenxu Hu and Hang Zhao. 2024. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_3_3_2_27_2","unstructured":"Zhengxiong Luo Dayou Chen Yingya Zhang Yan Huang Liang Wang Yujun Shen Deli Zhao Jingren Zhou and Tieniu Tan. 2023. Videofusion: Decomposed diffusion models for high-quality video generation. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2303.08320 (2023)."},{"key":"e_1_3_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9415063"},{"key":"e_1_3_3_2_29_2","unstructured":"Rodrigo Mira Alexandros Haliassos Stavros Petridis Bj\u00f6rn\u00a0W Schuller and Maja Pantic. 2022. SVTS: scalable video-to-speech synthesis. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2205.02058 (2022)."},{"key":"e_1_3_3_2_30_2","doi-asserted-by":"crossref","unstructured":"Rodrigo Mira Konstantinos Vougioukas Pingchuan Ma Stavros Petridis Bj\u00f6rn\u00a0W Schuller and Maja Pantic. 2022. End-to-end video-to-speech synthesis using generative adversarial networks. IEEE transactions on cybernetics 53 6 (2022) 3454\u20133466.","DOI":"10.1109\/TCYB.2022.3162495"},{"key":"e_1_3_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV61041.2025.00283"},{"key":"e_1_3_3_2_32_2","first-page":"8599","volume-title":"International Conference on Machine Learning","author":"Popov Vadim","year":"2021","unstructured":"Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. 2021. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning. PMLR, 8599\u20138608."},{"key":"e_1_3_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01381"},{"key":"e_1_3_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/FG57933.2023.10042606"},{"key":"e_1_3_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"e_1_3_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00985"},{"key":"e_1_3_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10096464"},{"key":"e_1_3_3_2_39_2","unstructured":"Bowen Shi Wei-Ning Hsu Kushal Lakhotia and Abdelrahman Mohamed. 2022. Learning audio-visual speech representation by masked multimodal cluster prediction. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2201.02184 (2022)."},{"key":"e_1_3_3_2_40_2","first-page":"2256","volume-title":"International conference on machine learning","author":"Sohl-Dickstein Jascha","year":"2015","unstructured":"Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning. PMLR, 2256\u20132265."},{"key":"e_1_3_3_2_41_2","first-page":"6447","volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition","author":"Son\u00a0Chung Joon","year":"2017","unstructured":"Joon Son\u00a0Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2017. Lip reading sentences in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6447\u20136456."},{"key":"e_1_3_3_2_42_2","unstructured":"Jiaming Song Chenlin Meng and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2010.02502 (2020)."},{"key":"e_1_3_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8462665"},{"key":"e_1_3_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9747427"},{"key":"e_1_3_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58589-1_42"},{"key":"e_1_3_3_2_46_2","volume-title":"The Twelfth International Conference on Learning Representations","author":"Yemini Yochai","year":"2024","unstructured":"Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot, and Ethan Fetaya. 2024. LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading. In The Twelfth International Conference on Learning Representations."},{"key":"e_1_3_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46487-9_40"}],"event":{"name":"ICMR '26: International Conference on Multimedia Retrieval","location":"Amsterdam The Netherlands","acronym":"ICMR '26","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2026 International Conference on Multimedia Retrieval"],"original-title":[],"deposited":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T14:56:29Z","timestamp":1781535389000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3805622.3810884"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,6,15]]},"references-count":46,"alternative-id":["10.1145\/3805622.3810884","10.1145\/3805622"],"URL":"https:\/\/doi.org\/10.1145\/3805622.3810884","relation":{},"subject":[],"published":{"date-parts":[[2026,6,15]]},"assertion":[{"value":"2026-06-15","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}