{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T04:06:54Z","timestamp":1765339614302,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":71,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["No. 2024QY1400"],"award-info":[{"award-number":["No. 2024QY1400"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["No. 62425604"],"award-info":[{"award-number":["No. 62425604"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Tsinghua University Initiative Scientific Research Program"},{"name":"Shenzhen Science and Technology Program","award":["JCYJ20220818101014030"],"award-info":[{"award-number":["JCYJ20220818101014030"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755736","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T06:55:00Z","timestamp":1761375300000},"page":"6720-6729","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["HarmoniVox: Painting Voices to Match the Avatar's Soul"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-5972-3955","authenticated-orcid":false,"given":"Songtao","family":"Zhou","sequence":"first","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9720-3220","authenticated-orcid":false,"given":"Xiaoyu","family":"Qin","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-6363-891X","authenticated-orcid":false,"given":"Yixuan","family":"Zhou","sequence":"additional","affiliation":[{"name":"Shenzhen International Graduate School, Tsinghua University, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-5832-8192","authenticated-orcid":false,"given":"Qixin","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8465-8878","authenticated-orcid":false,"given":"Zeyu","family":"Jin","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7291-6198","authenticated-orcid":false,"given":"Zixuan","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, Tsinghua University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8533-0524","authenticated-orcid":false,"given":"Zhiyong","family":"Wu","sequence":"additional","affiliation":[{"name":"Shenzhen International Graduate School, Tsinghua University, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8449-278X","authenticated-orcid":false,"given":"Jia","family":"Jia","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Technology, BNRist, Tsinghua University, Beijing, 
China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW54120.2021.00166"},{"key":"e_1_3_2_1_2_1","first-page":"12449","volume-title":"Lin (Eds.)","volume":"33","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 12449-12460."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-021-86841-8"},{"key":"e_1_3_2_1_4_1","volume-title":"Improving image generation with better captions. Computer Science. https:\/\/cdn.openai.com\/papers\/dall-e-3.pdf","author":"Betker James","year":"2023","unstructured":"James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, and others. 2023. Improving image generation with better captions. Computer Science. https:\/\/cdn.openai.com\/papers\/dall-e-3.pdf, Vol. 2 (2023), 3."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/311535.311556"},{"key":"e_1_3_2_1_6_1","volume-title":"Emotion and motivation. Handbook of psychophysiology","author":"Bradley Margaret M","year":"2000","unstructured":"Margaret M Bradley and Peter J Lang. 2000. Emotion and motivation. Handbook of psychophysiology, Vol. 2 (2000), 602-642."},{"key":"e_1_3_2_1_7_1","first-page":"1877","volume-title":"Lin (Eds.)","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877-1901."},{"key":"e_1_3_2_1_8_1","volume-title":"Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909","author":"Chen Guoguo","year":"2021","unstructured":"Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, and others. 2021. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909 (2021)."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISMAR-Adjunct60411.2023.00118"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"crossref","unstructured":"J. S. Chung A. Nagrani and A. Zisserman. 2018. VoxCeleb2: Deep Speaker Recognition. 
In INTERSPEECH.","DOI":"10.21437\/Interspeech.2018-1929"},{"volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Conneau Alexis","key":"e_1_3_2_1_11_1","unstructured":"Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm\u00e1n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 8440-8451."},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:JONB.0000023655.25550.be"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"crossref","unstructured":"Jiahao Cui Hui Li Yun Zhang Hanlin Shang Kaihui Cheng Yuqi Ma Shan Mu Hang Zhou Jingdong Wang and Siyu Zhu. 2024. Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer. _eprint: 2412.00733.","DOI":"10.1109\/CVPR52734.2025.01964"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3124365"},{"key":"e_1_3_2_1_15_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (May","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (May 2019). 101 citations (INSPIRE 2024\/4\/13) 101 citations w\/o self (INSPIRE 2024\/4\/13) arXiv:1810.04805 [cs.CL]."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3395208"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2522628.2522633"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.biopsycho.2005.09.003"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3472306.3478338"},{"key":"e_1_3_2_1_20_1","volume-title":"Generative adversarial networks. Commun. ACM","author":"Goodfellow Ian","year":"2020","unstructured":"Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM, Vol. 63, 11 (October 2020), 139-144. Place: New York, NY, USA Publisher: Association for Computing Machinery."},{"key":"e_1_3_2_1_21_1","first-page":"1321","article-title":"Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image","author":"Goto Shunsuke","year":"2020","unstructured":"Shunsuke Goto, Kotaro Onishi, Yuki Saito, Kentaro Tachibana, and Koichiro Mori. 2020. Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image.. In INTERSPEECH. 1321-1325.","journal-title":"INTERSPEECH."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","unstructured":"Shreyank N. Gowda Dheeraj Pandey and Shashank Narayana Gowda. 2023. From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications. doi:10.48550\/arXiv.2308.16041 arXiv:2308.16041 [cs].","DOI":"10.48550\/arXiv.2308.16041"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i16.29769"},{"key":"e_1_3_2_1_24_1","unstructured":"Jianzhu Guo Dingyun Zhang Xiaoqiang Liu Zhizhou Zhong Yuan Zhang Pengfei Wan and Di Zhang. 2024. 
LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control. arXiv:2407.03168 [cs]."},{"key":"e_1_3_2_1_25_1","first-page":"1","volume-title":"Prompttts: Controllable Text-To-Speech With Text Descriptions. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE","author":"Guo Zhifang","year":"2023","unstructured":"Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, and Xu Tan. 2023. Prompttts: Controllable Text-To-Speech With Text Descriptions. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Rhodes Island, Greece, 1-5."},{"key":"e_1_3_2_1_26_1","unstructured":"Pengcheng He Xiaodong Liu Jianfeng Gao and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv:2006.03654 [cs]."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240601"},{"key":"e_1_3_2_1_28_1","first-page":"10301","article-title":"TextrolSpeech: A Text Style Control Speech Corpus with Codec Language Text-to-Speech Models. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics","author":"Ji Shengpeng","year":"2024","unstructured":"Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen, Xinyu Duan, Baoxing Huai, and Zhou Zhao. 2024. TextrolSpeech: A Text Style Control Speech Corpus with Codec Language Text-to-Speech Models. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 10301-10305.","journal-title":"Speech and Signal Processing (ICASSP)."},{"key":"e_1_3_2_1_29_1","volume-title":"Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency. arXiv:2409.02634 [cs].","author":"Jiang Jianwen","year":"2024","unstructured":"Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. 2024. Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency. arXiv:2409.02634 [cs]."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3681674"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cub.2003.09.005"},{"volume-title":"The Blaue Reiter almanac (new documentary ed. \/ edited and with an introduction by klaus lankheit ed.). Thames and Hudson","author":"Kandinsky Wassily","key":"e_1_3_2_1_32_1","unstructured":"Wassily Kandinsky, Franz Marc, and Klaus Lankheit. 1974. The Blaue Reiter almanac (new documentary ed. \/ edited and with an introduction by klaus lankheit ed.). Thames and Hudson, New York, NY, USA. https:\/\/ci.nii.ac.jp\/ncid\/BA21725515"},{"key":"e_1_3_2_1_33_1","volume-title":"Kelly and Quang-Anh Ngo Tran","author":"Spencer","year":"2023","unstructured":"Spencer D. Kelly and Quang-Anh Ngo Tran. 2023. Exploring the Emotional Functions of Co-Speech Hand Gesture in Language and Communication. Topics in cognitive science (2023)."},{"key":"e_1_3_2_1_34_1","volume-title":"International Conference on Machine Learning. PMLR, 5530-5540","author":"Kim Jaehyeon","year":"2021","unstructured":"Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning. PMLR, 5530-5540."},{"key":"e_1_3_2_1_35_1","unstructured":"Shigenobu Kobayashi. 1998. Colorist : a practical handbook for personal and professional Use. Kodansha International."},{"key":"e_1_3_2_1_36_1","volume-title":"Libritts-r: A restored multi-speaker text-to-speech corpus. 
arXiv preprint arXiv:2305.18802","author":"Koizumi Yuma","year":"2023","unstructured":"Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, and Ankur Bapna. 2023. Libritts-r: A restored multi-speaker text-to-speech corpus. arXiv preprint arXiv:2305.18802 (2023)."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Jungil Kong Jihoon Park Beomjeong Kim Jeongmin Kim Dohee Kong and Sangjin Kim. 2023. VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design. arXiv:2307.16430 [cs eess].","DOI":"10.21437\/Interspeech.2023-534"},{"key":"e_1_3_2_1_38_1","volume-title":"Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).","author":"Lee Jiyoung","year":"2023","unstructured":"Jiyoung Lee, Joon Son Chung, and Soo-Whan Chung. 2023. Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)."},{"key":"e_1_3_2_1_39_1","unstructured":"Yichong Leng Zhifang Guo Kai Shen Xu Tan Zeqian Ju Yanqing Liu Yufei Liu Dongchao Yang Leying Zhang Kaitao Song Lei He Xiang-Yang Li Sheng Zhao Tao Qin and Jiang Bian. 2023. PromptTTS 2: Describing and Generating Voices with Text Prompt. arXiv:2309.02285 [cs eess]."},{"key":"e_1_3_2_1_40_1","volume-title":"Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"19742","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 19730-19742."},{"key":"e_1_3_2_1_41_1","volume-title":"Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"12900","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 12888-12900."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3130800.3130813"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2816795.2818013"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475198"},{"key":"e_1_3_2_1_45_1","unstructured":"Yifeng Ma Shiwei Zhang Jiayu Wang Xiang Wang Yingya Zhang and Zhidong Deng. 2023a. DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models. arXiv:2312.09767 [cs]."},{"key":"e_1_3_2_1_46_1","volume-title":"emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. arXiv preprint arXiv:2312.15185","author":"Ma Ziyang","year":"2023","unstructured":"Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. 2023b. 
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. arXiv preprint arXiv:2312.15185 (2023)."},{"key":"e_1_3_2_1_47_1","unstructured":"Lawrence E. Marks. 1978. The Unity of the Senses: Interrelations Among the Modalities. https:\/\/api.semanticscholar.org\/CorpusID:27335285"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1037\/a0030945"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2948066"},{"key":"e_1_3_2_1_50_1","unstructured":"Aaron van den Oord Yazhe Li and Oriol Vinyals. 2019. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748 [cs stat]."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1002\/ejsp.584"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.2004163117"},{"key":"e_1_3_2_1_53_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research","volume":"8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8748-8763."},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1016\/B978-008045089-6.50008-3"},{"key":"e_1_3_2_1_55_1","volume-title":"Aishell-3: A multi-speaker mandarin tts corpus and the baselines. arXiv preprint arXiv:2010.11567","author":"Shi Yao","year":"2020","unstructured":"Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. 2020. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. arXiv preprint arXiv:2010.11567 (2020)."},{"key":"e_1_3_2_1_56_1","volume-title":"Crossmodal Harmony: Looking for the Meaning of Harmony Beyond Hearing. i-Perception","author":"Spence Charles","year":"2022","unstructured":"Charles Spence and Nicola Di Stefano. 2022. Crossmodal Harmony: Looking for the Meaning of Harmony Beyond Hearing. i-Perception, Vol. 13 (2022). https:\/\/api.semanticscholar.org\/CorpusID:246766300"},{"key":"e_1_3_2_1_57_1","unstructured":"Xusen Sun Longhao Zhang Hao Zhu Peng Zhang Bang Zhang Xinya Ji Kangneng Zhou Daiheng Gao Liefeng Bo and Xun Cao. 2023. VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior. arXiv:2312.01841 [cs]."},{"key":"e_1_3_2_1_58_1","volume-title":"EMO: Emote Portrait Alive-Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions. arXiv preprint arXiv:2402.17485","author":"Tian Linrui","year":"2024","unstructured":"Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. 2024. EMO: Emote Portrait Alive-Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions. arXiv preprint arXiv:2402.17485 (2024)."},{"key":"e_1_3_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612065"},{"key":"e_1_3_2_1_60_1","first-page":"700","volume-title":"UK","author":"Wang Kaisiyuan","year":"2020","unstructured":"Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. 2020. MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. 
    "event": {
      "name": "MM '25: The 33rd ACM International Conference on Multimedia",
      "sponsor": ["SIGMM ACM Special Interest Group on Multimedia"],
      "location": "Dublin Ireland",
      "acronym": "MM '25"
    },
    "container-title": ["Proceedings of the 33rd ACM International Conference on Multimedia"],
    "original-title": [],
    "link": [
      {
        "URL": "https://dl.acm.org/doi/pdf/10.1145/3746027.3755736",
        "content-type": "unspecified",
        "content-version": "vor",
        "intended-application": "similarity-checking"
      }
    ],
    "deposited": {
      "date-parts": [[2025, 12, 10]],
      "date-time": "2025-12-10T04:02:42Z",
      "timestamp": 1765339362000
    },
    "score": 1,
    "resource": {"primary": {"URL": "https://dl.acm.org/doi/10.1145/3746027.3755736"}},
    "subtitle": [],
    "short-title": [],
    "issued": {"date-parts": [[2025, 10, 27]]},
    "references-count": 71,
    "alternative-id": ["10.1145/3746027.3755736", "10.1145/3746027"],
    "URL": "https://doi.org/10.1145/3746027.3755736",
    "relation": {},
    "subject": [],
    "published": {"date-parts": [[2025, 10, 27]]},
    "assertion": [
      {
        "value": "2025-10-27",
        "order": 3,
        "name": "published",
        "label": "Published",
        "group": {"name": "publication_history", "label": "Publication History"}
      }
    ]
  }
}
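The record above is the JSON that Crossref's public REST API returns for this paper's DOI. A minimal sketch of fetching and reading it, assuming the third-party `requests` package (the `mailto` address is a placeholder used for Crossref's polite-pool convention, not a value from the record):

```python
# Fetch the Crossref work record for this DOI and print a few of its fields.
import requests

DOI = "10.1145/3746027.3755736"

resp = requests.get(
    f"https://api.crossref.org/works/{DOI}",
    params={"mailto": "you@example.org"},  # placeholder contact address
    timeout=30,
)
resp.raise_for_status()

# The payload has the same shape as the record above: the work itself
# lives under the top-level "message" key.
work = resp.json()["message"]
print(work["title"][0])                    # "HarmoniVox: Painting Voices to Match the Avatar's Soul"
print(work["container-title"][0])          # proceedings name
print(", ".join(f"{a['given']} {a['family']}" for a in work["author"]))
print(work["published"]["date-parts"][0])  # [2025, 10, 27]
print(work["references-count"], "references")
```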