{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T01:10:24Z","timestamp":1755825024516,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":48,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100006374","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61902415"],"award-info":[{"award-number":["61902415"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Open Fund of PDL","award":["WDZC20235250106"],"award-info":[{"award-number":["WDZC20235250106"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,30]]},"DOI":"10.1145\/3731715.3733276","type":"proceedings-article","created":{"date-parts":[[2025,6,25]],"date-time":"2025-06-25T18:31:04Z","timestamp":1750876264000},"page":"25-34","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["AnchorTalk: High-Fidelity Upper-Body Talking Human Generation From Speech"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4039-895X","authenticated-orcid":false,"given":"Yali","family":"Cai","sequence":"first","affiliation":[{"name":"National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6752-7892","authenticated-orcid":false,"given":"Peng","family":"Qiao","sequence":"additional","affiliation":[{"name":"National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9743-2034","authenticated-orcid":false,"given":"Dongsheng","family":"Li","sequence":"additional","affiliation":[{"name":"National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China"}]}],"member":"320","published-online":{"date-parts":[[2025,6,30]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Chaitanya Ahuja, Dong Won Lee, Ryo Ishii, and Louis-Philippe Morency. 2020a. No Gestures Left Behind: Learning Relationships between Spoken Language and Freeform Gestures. 1884--1895.","DOI":"10.18653\/v1\/2020.findings-emnlp.170"},{"key":"e_1_3_2_1_2_1","author":"Ahuja Chaitanya","year":"2020","unstructured":"Chaitanya Ahuja, Dong Won Lee, Yukiko I. Nakano, and Louis-Philippe Morency. 2020b. Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach. arxiv: 2007.12553 [cs.CV] https:\/\/arxiv.org\/abs\/2007.12553"},{"key":"e_1_3_2_1_3_1","volume-title":"wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arxiv: 2006.11477 [cs.CL] https:\/\/arxiv.org\/abs\/2006.11477"},{"key":"e_1_3_2_1_4_1","volume-title":"Speech driven video editing via an audio-conditioned diffusion model. arXiv preprint arXiv:2301.04474","author":"Bigioi Dan","year":"2023","unstructured":"Dan Bigioi, Shubhajit Basak, Hugh Jordan, Rachel McDonnell, and Peter Corcoran. 2023. Speech driven video editing via an audio-conditioned diffusion model. arXiv preprint arXiv:2301.04474 (2023)."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.24792"},{"key":"e_1_3_2_1_6_1","volume-title":"TalkinNeRF: Animatable Neural Fields for Full-Body Talking Humans. arXiv preprint arXiv:2409.16666","author":"Chatziagapi Aggelina","year":"2024","unstructured":"Aggelina Chatziagapi, Bindita Chaudhuri, Amit Kumar, Rakesh Ranjan, Dimitris Samaras, and Nikolaos Sarafianos. 2024. TalkinNeRF: Animatable Neural Fields for Full-Body Talking Humans. arXiv preprint arXiv:2409.16666 (2024)."},{"key":"e_1_3_2_1_7_1","volume-title":"Asian conference on computer vision. Springer, 251--263","author":"Chung Joon Son","year":"2016","unstructured":"Joon Son Chung and Andrew Zisserman. 2016. Out of time: automated lip sync in the wild. In Asian conference on computer vision. Springer, 251--263."},{"key":"e_1_3_2_1_8_1","volume-title":"Proceedings, Part XXX 16","author":"Das Dipanjan","year":"2020","unstructured":"Dipanjan Das, Sandika Biswas, Sanjana Sinha, and Brojeshwar Bhowmick. 2020. Speech-driven facial animation using cascaded gans for learning of motion and texture. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXX 16. Springer, 408--424."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01413"},{"key":"e_1_3_2_1_10_1","volume-title":"Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction","author":"Gafni Guy","year":"2020","unstructured":"Guy Gafni, Justus Thies, Michael Zollh\u00f6fer, and Matthias Nie\u00dfner. 2020. Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction. arxiv: 2012.03065 [cs.CV] https:\/\/arxiv.org\/abs\/2012.03065"},{"key":"e_1_3_2_1_11_1","volume-title":"DensePose: Dense Human Pose Estimation in the Wild. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 
IEEE Computer Society","author":"Guler Riza Alp","year":"2018","unstructured":"Riza Alp Guler, Natalia Neverova, and Iasonas Kokkinos. 2018. DensePose: Dense Human Pose Estimation in the Wild. In 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 7297--7306."},{"key":"e_1_3_2_1_12_1","volume-title":"Rad-NeRF: Ray-decoupled Training of Neural Radiance Field. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=nBrnfYeKf9","author":"Guo Lidong","year":"2024","unstructured":"Lidong Guo, Xuefei Ning, Yonggan Fu, Tianchen Zhao, Zhuoliang Kang, Jincheng Yu, Yingyan Celine Lin, and Yu Wang. 2024. Rad-NeRF: Ray-decoupled Training of Neural Radiance Field. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https:\/\/openreview.net\/forum?id=nBrnfYeKf9"},{"key":"e_1_3_2_1_13_1","unstructured":"Yudong Guo Keyu Chen Sen Liang Yong-Jin Liu Hujun Bao and Juyong Zhang. 2021. AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis. arxiv: 2103.11078 [cs.CV]"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00220"},{"key":"e_1_3_2_1_15_1","unstructured":"Fa-Ting Hong Longhao Zhang Li Shen and Dan Xu. 2022. Depth-Aware Generative Adversarial Network for Talking Head Video Generation. arxiv: 2203.06605 [cs.CV] https:\/\/arxiv.org\/abs\/2203.06605"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2024.110758"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00668"},{"key":"e_1_3_2_1_18_1","unstructured":"Geumbyeol Hwang Sunwon Hong Seunghyun Lee Sungwoo Park and Gyeongsu Chae. 2023. DisCoHead: Audio-and-Video-Driven Talking Head Generation by Disentangled Control of Head Pose and Facial Expressions. 
arxiv: 2303.07697 [cs.CV] https:\/\/arxiv.org\/abs\/2303.07697"},{"key":"e_1_3_2_1_19_1","volume-title":"A Style-Based Generator Architecture for Generative Adversarial Networks","author":"Karras Tero","year":"2019","unstructured":"Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. arxiv: 1812.04948 [cs.NE] https:\/\/arxiv.org\/abs\/1812.04948"},{"key":"e_1_3_2_1_20_1","author":"Kingma Diederik P.","year":"2017","unstructured":"Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arxiv: 1412.6980 [cs.LG] https:\/\/arxiv.org\/abs\/1412.6980"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CASE59546.2024.10711766"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00278"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00696"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3596711.3596800"},{"key":"e_1_3_2_1_25_1","unstructured":"Haoyu Ma Tong Zhang Shanlin Sun Xiangyi Yan Kun Han and Xiaohui Xie. 2023a. CVTHead: One-shot Controllable Head Avatar with Vertex-feature Transformer. arxiv: 2311.06443 [cs.CV] https:\/\/arxiv.org\/abs\/2311.06443"},{"key":"e_1_3_2_1_26_1","volume-title":"Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. arXiv e-prints","author":"Ma Yifeng","year":"2023","unstructured":"Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, and Zhidong Deng. 2023b. Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. arXiv e-prints (2023), arXiv--2312."},{"key":"e_1_3_2_1_27_1","author":"Ni Haomiao","year":"2023","unstructured":"Haomiao Ni, Jiachen Liu, Yuan Xue, and Sharon X. Huang. 2023. 3D-Aware Talking-Head Video Motion Transfer. 
arxiv: 2311.02549 [cs.CV] https:\/\/arxiv.org\/abs\/2311.02549"},{"key":"e_1_3_2_1_28_1","volume-title":"Computer Graphics Forum","author":"Nyatsanga Simbarashe","year":"2023","unstructured":"Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje Henter, and Michael Neff. 2023. A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. In Computer Graphics Forum, Vol. 42. Wiley Online Library, 569--596."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"crossref","unstructured":"Pascal Paysan Reinhard Knothe Brian Amberg Sami Romdhani and Thomas Vetter. 2009. A 3D Face Model for Pose and Illumination Invariant Face Recognition.","DOI":"10.1109\/AVSS.2009.58"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00070"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00197"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00502"},{"key":"e_1_3_2_1_33_1","unstructured":"Jiacheng Su Kunhong Liu Liyan Chen Junfeng Yao Qingsong Liu and Dongdong Lv. 2024. Audio-driven High-resolution Seamless Talking Head Video Editing via StyleGAN. arxiv: 2407.05577 [cs.CV] https:\/\/arxiv.org\/abs\/2407.05577"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Kim Sung-Bin Lee Chae-Yeon Gihun Son Oh Hyun-Bin Janghoon Ju Suekyeong Nam and Tae-Hyun Oh. 2024. MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset. arxiv: 2406.14272 [cs.CV] https:\/\/arxiv.org\/abs\/2406.14272","DOI":"10.21437\/Interspeech.2024-1794"},{"key":"e_1_3_2_1_35_1","unstructured":"Jiaxiang Tang Kaisiyuan Wang Hang Zhou Xiaokang Chen Dongliang He Tianshu Hu Jingtuo Liu Gang Zeng and Jingdong Wang. 2022. Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition. 
arxiv: 2211.12368 [cs.CV] https:\/\/arxiv.org\/abs\/2211.12368"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2023.119678"},{"key":"e_1_3_2_1_37_1","unstructured":"Haotian Wang Yuzhe Weng Yueyan Li Zilu Guo Jun Du Shutong Niu Jiefeng Ma Shan He Xiaoyan Wu Qiming Hu Bing Yin Cong Liu and Qingfeng Liu. 2024b. EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion. arxiv: 2411.16726 [cs.CV] https:\/\/arxiv.org\/abs\/2411.16726"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00891"},{"key":"e_1_3_2_1_39_1","volume-title":"Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation. arxiv: 2406.08801 [cs.CV] https:\/\/arxiv.org\/abs\/2406.08801","author":"Xu Mingwang","year":"2024","unstructured":"Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, and Siyu Zhu. 2024. Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation. arxiv: 2406.08801 [cs.CV] https:\/\/arxiv.org\/abs\/2406.08801"},{"key":"e_1_3_2_1_40_1","unstructured":"Haijie Yang Zhenyu Zhang Hao Tang Jianjun Qian and Jian Yang. 2024. ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance. arxiv: 2411.15436 [cs.CV] https:\/\/arxiv.org\/abs\/2411.15436"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00820"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3681361"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00053"},{"key":"e_1_3_2_1_44_1","volume-title":"3d talking face with personalized pose dynamics","author":"Zhang Chenxu","year":"2021","unstructured":"Chenxu Zhang, Saifeng Ni, Zhipeng Fan, Hongbo Li, Ming Zeng, Madhukar Budagavi, and Xiaohu Guo. 2021b. 3d talking face with personalized pose dynamics. 
IEEE Transactions on Visualization and Computer Graphics (2021)."},{"key":"e_1_3_2_1_45_1","volume-title":"The Unreasonable Effectiveness of Deep Features as a Perceptual Metric","author":"Zhang Richard","year":"2018","unstructured":"Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. arxiv: 1801.03924 [cs.CV] https:\/\/arxiv.org\/abs\/1801.03924"},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00366"},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.3390\/electronics12010218"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00938"}],"event":{"name":"ICMR '25: International Conference on Multimedia Retrieval","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Chicago IL USA","acronym":"ICMR '25"},"container-title":["Proceedings of the 2025 International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3731715.3733276","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T04:07:12Z","timestamp":1755749232000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3731715.3733276"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,30]]},"references-count":48,"alternative-id":["10.1145\/3731715.3733276","10.1145\/3731715"],"URL":"https:\/\/doi.org\/10.1145\/3731715.3733276","relation":{},"subject":[],"published":{"date-parts":[[2025,6,30]]},"assertion":[{"value":"2025-06-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}