{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T17:49:55Z","timestamp":1773942595582,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":51,"publisher":"ACM","funder":[{"name":"Yango Charitable Foundation"},{"name":"National Natural Science Foundation","award":["62072382"],"award-info":[{"award-number":["62072382"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,30]]},"DOI":"10.1145\/3731715.3733322","type":"proceedings-article","created":{"date-parts":[[2025,6,25]],"date-time":"2025-06-25T18:29:43Z","timestamp":1750876183000},"page":"145-154","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["EmoHuman: Fine-Grained Emotion-Controlled Talking Head Generation via Audio-Text Multimodal Detangling"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-2071-8506","authenticated-orcid":false,"given":"Qifeng","family":"Dai","sequence":"first","affiliation":[{"name":"Xiamen University, Xiamen, Fujian, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-0889-0584","authenticated-orcid":false,"given":"Huidong","family":"Feng","sequence":"additional","affiliation":[{"name":"Xiamen University, Xiamen, Fujian, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-0286-4634","authenticated-orcid":false,"given":"Wendi","family":"Cui","sequence":"additional","affiliation":[{"name":"China Mobile Communications Corporation, Xiamen, Fujian, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-9607-1040","authenticated-orcid":false,"given":"Xinqi","family":"Cai","sequence":"additional","affiliation":[{"name":"Xiamen University, Xiamen, Fujian, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4671-6111","authenticated-orcid":false,"given":"Yinglin","family":"Zheng","sequence":"additional","affiliation":[{"name":"Xiamen University, Xiamen, Fujian, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5056-0706","authenticated-orcid":false,"given":"Ming","family":"Zeng","sequence":"additional","affiliation":[{"name":"Xiamen University, Xiamen, Fujian, China"}]}],"member":"320","published-online":{"date-parts":[[2025,6,30]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, Vol. 33 (2020), 12449--12460."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3596711.3596730"},{"key":"e_1_3_2_1_3_1","unstructured":"Andreas Blattmann Tim Dockhorn Sumith Kulal Daniel Mendelevitch Maciej Kilian Dominik Lorenz Yam Levi Zion English Vikram Voleti Adam Letts et al. 2023. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)."},{"key":"e_1_3_2_1_4_1","volume-title":"Crema-d: Crowd-sourced emotional multimodal actors dataset","author":"Cao Houwei","year":"2014","unstructured":"Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. 2014. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing , Vol. 5, 4 (2014), 377--390."},{"key":"e_1_3_2_1_5_1","unstructured":"Haoxin Chen Menghan Xia Yingqing He Yong Zhang Xiaodong Cun Shaoshu Yang Jinbo Xing Yaofang Liu Qifeng Chen Xintao Wang et al. 2023a. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023)."},{"key":"e_1_3_2_1_6_1","volume-title":"DWFormer: Dynamic Window Transformer for Speech Emotion Recognition. In ICASSP 2023--2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1--5.","author":"Chen Shuaiqi","year":"2023","unstructured":"Shuaiqi Chen, Xiaofen Xing, Weibin Zhang, Weidong Chen, and Xiangmin Xu. 2023b. DWFormer: Dynamic Window Transformer for Speech Emotion Recognition. In ICASSP 2023--2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1--5."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3550469.3555399"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-54427-4_19"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.02069"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"crossref","unstructured":"Zhifu Gao Zerui Li Jiaming Wang Haoneng Luo Xian Shi Mengzhe Chen Yabin Li Lingyun Zuo Zhihao Du Zhangyu Xiao and Shiliang Zhang. 2023. FunASR: A Fundamental End-to-End Speech Recognition Toolkit. In INTERSPEECH.","DOI":"10.21437\/Interspeech.2023-1428"},{"key":"e_1_3_2_1_11_1","volume-title":"ICONIP 2013, daegu, korea, november 3--7, 2013. Proceedings, Part III 20","author":"Goodfellow Ian J","year":"2013","unstructured":"Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. 2013. Challenges in representation learning: A report on three machine learning contests. In Neural information processing: 20th international conference, ICONIP 2013, daegu, korea, november 3--7, 2013. Proceedings, Part III 20. Springer, 117--124."},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00573"},{"key":"e_1_3_2_1_13_1","volume-title":"Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725","author":"Guo Yuwei","year":"2023","unstructured":"Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. 2023. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)."},{"key":"e_1_3_2_1_14_1","volume-title":"Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems","author":"Heusel Martin","year":"2017","unstructured":"Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, Vol. 30 (2017)."},{"key":"e_1_3_2_1_15_1","volume-title":"Denoising diffusion probabilistic models. Advances in neural information processing systems","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, Vol. 33 (2020), 6840--6851."},{"key":"e_1_3_2_1_16_1","first-page":"8633","article-title":"Video diffusion models","volume":"35","author":"Ho Jonathan","year":"2022","unstructured":"Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. 2022. Video diffusion models. Advances in Neural Information Processing Systems, Vol. 35 (2022), 8633--8646.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_17_1","volume-title":"Amir Mohammad Rostami, and Padideh Choobdar","author":"Jafarzadeh Pourya","year":"2024","unstructured":"Pourya Jafarzadeh, Amir Mohammad Rostami, and Padideh Choobdar. 2024. Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT. arXiv preprint arXiv:2411.02964 (2024)."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3528233.3530745"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01386"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0196391"},{"key":"e_1_3_2_1_21_1","volume-title":"Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101","author":"Loshchilov I","year":"2017","unstructured":"I Loshchilov. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)."},{"key":"e_1_3_2_1_22_1","volume-title":"Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378","author":"Luo Simian","year":"2023","unstructured":"Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. 2023a. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)."},{"key":"e_1_3_2_1_23_1","volume-title":"Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556","author":"Luo Simian","year":"2023","unstructured":"Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolin\u00e1rio Passos, Longbo Huang, Jian Li, and Hang Zhao. 2023b. Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556 (2023)."},{"key":"e_1_3_2_1_24_1","volume-title":"Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. arXiv preprint arXiv:2312.09767","author":"Ma Yifeng","year":"2023","unstructured":"Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, and Zhidong Deng. 2023a. Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. arXiv preprint arXiv:2312.09767 (2023)."},{"key":"e_1_3_2_1_25_1","volume-title":"emotion2vec: Self-supervised pre-training for speech emotion representation. arXiv preprint arXiv:2312.15185","author":"Ma Ziyang","year":"2023","unstructured":"Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. 2023b. emotion2vec: Self-supervised pre-training for speech emotion representation. arXiv preprint arXiv:2312.15185 (2023)."},{"key":"e_1_3_2_1_26_1","volume-title":"Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, Vol. 32 (2019)."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00387"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01891"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413532"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00197"},{"key":"e_1_3_2_1_32_1","volume-title":"Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792","author":"Singer Uriel","year":"2022","unstructured":"Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. 2022. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)."},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00502"},{"key":"e_1_3_2_1_34_1","volume-title":"Vividtalk: One-shot audio-driven talking head generation based on 3d hybrid prior. arXiv preprint arXiv:2312.01841","author":"Sun Xusen","year":"2023","unstructured":"Xusen Sun, Longhao Zhang, Hao Zhu, Peng Zhang, Bang Zhang, Xinya Ji, Kangneng Zhou, Daiheng Gao, Liefeng Bo, and Xun Cao. 2023. Vividtalk: One-shot audio-driven talking head generation based on 3d hybrid prior. arXiv preprint arXiv:2312.01841 (2023)."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72658-3_23"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.02024"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-73010-8_15"},{"key":"e_1_3_2_1_38_1","volume-title":"FVD: A new metric for video generation.","author":"Unterthiner Thomas","year":"2019","unstructured":"Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Rapha\u00ebl Marinier, Marcin Michalski, and Sylvain Gelly. 2019. FVD: A new metric for video generation. (2019)."},{"key":"e_1_3_2_1_39_1","volume-title":"V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation. arXiv preprint arXiv:2406.02511","author":"Wang Cong","year":"2024","unstructured":"Cong Wang, Kuan Tian, Jun Zhang, Yonghang Guan, Feng Luo, Fei Shen, Zhiwei Jiang, Qing Gu, Xiao Han, and Wei Yang. 2024b. V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation. arXiv preprint arXiv:2406.02511 (2024)."},{"key":"e_1_3_2_1_40_1","volume-title":"MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation. In ECCV.","author":"Wang Kaisiyuan","year":"2020","unstructured":"Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. 2020. MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation. In ECCV."},{"key":"e_1_3_2_1_41_1","volume-title":"Advances in Neural Information Processing Systems","volume":"36","author":"Wang Xiang","year":"2024","unstructured":"Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. 2024c. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, Vol. 36 (2024)."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP48485.2024.10447726"},{"key":"e_1_3_2_1_43_1","volume-title":"Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv preprint arXiv:2403.17694","author":"Wei Huawei","year":"2024","unstructured":"Huawei Wei, Zejun Yang, and Zhisheng Wang. 2024. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv preprint arXiv:2403.17694 (2024)."},{"key":"e_1_3_2_1_44_1","volume-title":"Vasa-1: Lifelike audio-driven talking faces generated in real time. arXiv preprint arXiv:2404.10667","author":"Xu Sicheng","year":"2024","unstructured":"Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. 2024. Vasa-1: Lifelike audio-driven talking faces generated in real time. arXiv preprint arXiv:2404.10667 (2024)."},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00836"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i3.25464"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00366"},{"key":"e_1_3_2_1_48_1","volume-title":"Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018","author":"Zhou Daquan","year":"2022","unstructured":"Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. 2022. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018 (2022)."},{"key":"e_1_3_2_1_49_1","volume-title":"Vaw-gan for disentanglement and recomposition of emotional elements in speech. In 2021 IEEE spoken language technology workshop (SLT)","author":"Zhou Kun","unstructured":"Kun Zhou, Berrak Sisman, and Haizhou Li. 2021. Vaw-gan for disentanglement and recomposition of emotional elements in speech. In 2021 IEEE spoken language technology workshop (SLT). IEEE, 415--422."},{"key":"e_1_3_2_1_50_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3414685.3417774","article-title":"Makelttalk: speaker-aware talking-head animation","volume":"39","author":"Zhou Yang","year":"2020","unstructured":"Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. 2020. Makelttalk: speaker-aware talking-head animation. ACM Transactions On Graphics (TOG) , Vol. 39, 6 (2020), 1--15.","journal-title":"ACM Transactions On Graphics (TOG)"},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"crossref","unstructured":"Hao Zhu Wayne Wu Wentao Zhu Liming Jiang Siwei Tang Li Zhang Ziwei Liu and Chen Change Loy. 2022. CelebV-HQ: A Large-Scale Video Facial Attributes Dataset. In ECCV.","DOI":"10.1007\/978-3-031-20071-7_38"}],"event":{"name":"ICMR '25: International Conference on Multimedia Retrieval","location":"Chicago IL USA","acronym":"ICMR '25","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2025 International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3731715.3733322","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T04:09:26Z","timestamp":1755749366000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3731715.3733322"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,30]]},"references-count":51,"alternative-id":["10.1145\/3731715.3733322","10.1145\/3731715"],"URL":"https:\/\/doi.org\/10.1145\/3731715.3733322","relation":{},"subject":[],"published":{"date-parts":[[2025,6,30]]},"assertion":[{"value":"2025-06-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}