{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,9]],"date-time":"2026-05-09T00:13:48Z","timestamp":1778285628893,"version":"3.51.4"},"reference-count":47,"publisher":"Institution of Engineering and Technology (IET)","issue":"1","license":[{"start":{"date-parts":[[2025,7,23]],"date-time":"2025-07-23T00:00:00Z","timestamp":1753228800000},"content-version":"vor","delay-in-days":203,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"},{"start":{"date-parts":[[2025,1,1]],"date-time":"2025-01-01T00:00:00Z","timestamp":1735689600000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/doi.wiley.com\/10.1002\/tdm_license_1.1"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2022YFF0902200"],"award-info":[{"award-number":["2022YFF0902200"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["ietresearch.onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["IET Image Processing"],"published-print":{"date-parts":[[2025,1]]},"abstract":"<jats:title>ABSTRACT<\/jats:title>\n                  <jats:p>High\u2010fidelity, speech\u2010driven 3D facial animation is crucial for immersive applications and virtual avatars. Nevertheless, advancement is impeded by two principal challenges: (1) a lack of high\u2010quality 3D data, and (2) inadequate modelling of the multi\u2010scale characteristics of speech signals. In this paper, we present Speech2Face3D, a novel two\u2010stage transfer\u2010learning framework that pretrains on large\u2010scale pseudo\u20103D facial data derived from 2D videos and subsequently finetunes on smaller yet high\u2010fidelity 3D datasets. This design leverages the richness of easily accessible 2D resources while mitigating reconstruction noise through a simple temporal smoothing step. Our approach further introduces a Multi\u2010Scale Hierarchical Audio Encoder to capture subtle phoneme transitions, mid\u2010range prosody, and longer\u2010range emotional cues. Extensive experiments on public 3D benchmarks demonstrate that our method achieves state\u2010of\u2010the\u2010art performance on lip synchronization, expression fidelity, and temporal coherence metrics. Qualitative user evaluations validate these quantitative improvements. Speech2Face3D is a robust and scalable framework for utilizing extensive 2D data to generate precise and realistic 3D facial animations only based on\u00a0speech.<\/jats:p>","DOI":"10.1049\/ipr2.70155","type":"journal-article","created":{"date-parts":[[2025,7,23]],"date-time":"2025-07-23T13:27:16Z","timestamp":1753277236000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Speech2Face3D: A Two\u2010Stage Transfer\u2010Learning Framework for Speech\u2010Driven 3D Facial Animation"],"prefix":"10.1049","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-4855-9132","authenticated-orcid":false,"given":"Liming","family":"Pang","sequence":"first","affiliation":[{"name":"Beijing University of Posts and Telecommunications Beijing China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhi","family":"Zeng","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications Beijing China"},{"name":"Institute of Automation Chinese Academy of Sciences Beijing China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-3063-0117","authenticated-orcid":false,"given":"Yahui","family":"Li","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications Beijing China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Guixuan","family":"Zhang","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications Beijing China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shuwu","family":"Zhang","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications Beijing China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"265","published-online":{"date-parts":[[2025,7,23]]},"reference":[{"key":"e_1_2_12_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2024.112483"},{"key":"e_1_2_12_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ins.2024.121191"},{"key":"e_1_2_12_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-31456-9_35"},{"key":"e_1_2_12_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2020.3030497"},{"key":"e_1_2_12_6_1","doi-asserted-by":"crossref","unstructured":"Y.Gong Y.Chung andJ. R.Glass \u201cAST: Audio Spectrogram Transformer \u201darXiv:2104.01778(2021).","DOI":"10.21437\/Interspeech.2021-698"},{"key":"e_1_2_12_7_1","doi-asserted-by":"crossref","unstructured":"K.Chen Q.Kong T.Iqbal et\u00a0al. \u201cHTS\u2013 AT: A Hierarchical Token\u2010Semantic Audio Transformer for Sound Classification and Detection \u201d inProceedings of the International Conference on Acoustics Speech and Signal Processing (ICASSP)(IEEE 2022) 886\u2013890.","DOI":"10.1109\/ICASSP43922.2022.9746312"},{"key":"e_1_2_12_8_1","doi-asserted-by":"crossref","unstructured":"W.Zhu V. I.Morariu Z.Zhang M.Lu andX.Yang \u201cMAST: Multiscale Audio Spectrogram Transformer for Efficient Audio Classification \u201d inProceedings of the International Conference on Acoustics Speech and Signal Processing (ICASSP)(IEEE 2023) 1\u20135.","DOI":"10.1109\/ICASSP49357.2023.10096513"},{"key":"e_1_2_12_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2022.3189730"},{"key":"e_1_2_12_10_1","unstructured":"S.Chen Y.Zhang C.Wang et\u00a0al. \u201cBEATs: Audio Pre\u2010Training with Acoustic Tokenizers \u201d inProceedings of the 30th International Conference on Machine Learning Research(ACM 2022) 3687\u20133713."},{"key":"e_1_2_12_11_1","unstructured":"J.Shi H.Inaguma X.Ma I.Kulikov andA.Sun \u201cMulti\u2010Resolution HuBERT: Speech Self\u2010Supervised Learning With Masked Unit Prediction at Multiple Resolutions \u201d inInternational Conference on Learning Representations(ICLR 2024) 1\u201334."},{"key":"e_1_2_12_12_1","unstructured":"W.Chen W.Zheng L.Zhang Y.Zhang andL.Xie \u201cEAT: Self\u2010Supervised Pre\u2010Training with Efficient Audio Transformer \u201d inProceedings of the Thirty\u2010Third International Joint Conference on Artificial Intelligence(ACM 2024) 3825\u20133833."},{"key":"e_1_2_12_13_1","doi-asserted-by":"crossref","unstructured":"Y.Fan Z.Lin J.Saito W.Wang andT.Komura \u201cFaceformer: Speech\u2010Driven 3D Facial Animation with Transformers \u201d inIEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)(IEEE 2022) 18770\u201318780.","DOI":"10.1109\/CVPR52688.2022.01821"},{"key":"e_1_2_12_14_1","doi-asserted-by":"crossref","unstructured":"K. I.HaqueandZ.Yumak \u201cFaceXHuBERT: Text\u2010Less Speech\u2010Driven E(X)pressive 3D Facial Animation Synthesis Using Self\u2010Supervised Speech Representation Learning \u201d in27th International Conference on Multimodal Interaction (ICMI '23)(ACM 2023) 282\u2013291.","DOI":"10.1145\/3577190.3614157"},{"key":"e_1_2_12_15_1","doi-asserted-by":"crossref","unstructured":"R.Dan\u011b\u010dek K.Chhatre S.Tripathi Y.Wen M.Black andT.Bolkart \u201cEmotional Speech\u2010Driven Animation with Content\u2010Emotion Disentanglement \u201d inSIGGRAPH Asia 2023 Conference Papers(ACM 2023) 1\u201313.","DOI":"10.1145\/3610548.3618183"},{"key":"e_1_2_12_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3130800.3130813"},{"key":"e_1_2_12_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3122291"},{"key":"e_1_2_12_18_1","doi-asserted-by":"crossref","unstructured":"K.Zhang M.Sun J.Sun K.Zhang Z.Sun andT.Tan \u201cOpen\u2010Vocabulary Text\u2010Driven Human Image Generation \u201d132 no.10(2024):4379\u20134397 https:\/\/doi.org\/10.1007\/s11263\u2010024\u201002079\u20107.","DOI":"10.1007\/s11263-024-02079-7"},{"key":"e_1_2_12_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3345866"},{"key":"e_1_2_12_20_1","doi-asserted-by":"crossref","unstructured":"X.Qi C.Liu M.Sun L.Li C.Fan andX.Yu \u201cDiverse 3D Hand Gesture Prediction From Body Dynamics by Bilateral Hand Disentanglement \u201d inIEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)(IEEE 2023):4616\u20134626.","DOI":"10.1109\/CVPR52729.2023.00448"},{"key":"e_1_2_12_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2023.3341246"},{"key":"e_1_2_12_22_1","doi-asserted-by":"crossref","unstructured":"Z.Zhao N.Gao Z.Zeng G.Zhang J.Liu andS.Zhang \u201cGesture Motion Graphs for Few\u2010Shot Speech\u2010Driven Gesture Reenactment \u201d inProceedings of the 25th International Conference on Multimodal Interaction(ACM 2023) 772\u2013778.","DOI":"10.1145\/3577190.3616118"},{"key":"e_1_2_12_23_1","doi-asserted-by":"crossref","unstructured":"Z.Zhao N.Gao Z.Zeng G.Zhang J.Liu andS.Zhang \u201cA Unified Editing Method for Co\u2010Speech Gesture Generation via Diffusion Inversion \u201d inProceedings of the 6th ACM International Conference on Multimedia in Asia(ACM 2024) 1\u20137.","DOI":"10.1145\/3696409.3700261"},{"key":"e_1_2_12_24_1","doi-asserted-by":"crossref","unstructured":"N.Gao Z.Zeng G.Zhang andS.Zhang \u201cHeterogeneous Avatar Synthesis Based on Disentanglement of Topology and Rendering \u201d inComputer Vision \u2013 ACCV 2022(Springer 2022) 137\u2013152.","DOI":"10.1007\/978-3-031-26316-3_9"},{"key":"e_1_2_12_25_1","doi-asserted-by":"publisher","DOI":"10.1002\/cav.1892"},{"key":"e_1_2_12_26_1","unstructured":"JALI Research. JALI Research.https:\/\/jaliresearch.com\/(2023)."},{"key":"e_1_2_12_27_1","doi-asserted-by":"crossref","unstructured":"M. V.Aylagas H. A.Leon M.Teye andK.Tollmar \u201cVoice2Face: Audio\u2010driven Facial and Tongue Rig Animations with cVAEs \u201d inComputer Graphics Forum41 no.8(2022):255\u2013265.","DOI":"10.1111\/cgf.14640"},{"key":"e_1_2_12_28_1","doi-asserted-by":"crossref","unstructured":"D.Cudeiro T.Bolkart C.Laidlaw A.Ranjan andM.Black \u201cCapture Learning and Synthesis of 3D Speaking Styles \u201d inIEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)(IEEE 2019) 10101\u201310111.","DOI":"10.1109\/CVPR.2019.01034"},{"key":"e_1_2_12_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073658"},{"key":"e_1_2_12_30_1","doi-asserted-by":"crossref","unstructured":"A.Richard M.Zoll\u00f6fer Y.Wen F.de laTorre andY.Sheikh \u201c3D Face Animation From Speech Using Cross\u2010Modality Disentanglement \u201d inProceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV)(IEEE 2021) 1173\u20131182.","DOI":"10.1109\/ICCV48922.2021.00121"},{"key":"e_1_2_12_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073699"},{"key":"e_1_2_12_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3197517.3201292"},{"key":"e_1_2_12_33_1","doi-asserted-by":"crossref","unstructured":"Z.Peng H.Wu Z.Song et\u00a0al. \u201cEmoTalk: Speech\u2010Driven Emotional Disentanglement for 3D Face Animation \u201d inIEEE\/CVF Conference on ComputerVision and Pattern Recognition (CVPR)(IEEE 2023) 20630\u201320640.","DOI":"10.1109\/ICCV51070.2023.01891"},{"key":"e_1_2_12_34_1","doi-asserted-by":"crossref","unstructured":"B.Thambiraja I.Habibie S.Aliakbarian D.Cosker C.Theobalt andJ.Thies \u201cImitator: Personalized Speech\u2010Driven 3D Facial Animation \u201d inProceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV)(IEEE 2023) 20621\u201320631.","DOI":"10.1109\/ICCV51070.2023.01885"},{"key":"e_1_2_12_35_1","doi-asserted-by":"crossref","unstructured":"S.Wu K. I.Haque andZ.Yumak \u201cProbTalk3D: Non\u2010Deterministic Emotion Controllable Speech\u2010Driven 3D Facial Animation Synthesis Using VQ\u2010VAE \u201d inProceedings of the 16th ACM SIGGRAPH Conference on Motion Interaction and Games(ACM 2024) 1\u201354.","DOI":"10.1145\/3677388.3696320"},{"key":"e_1_2_12_36_1","doi-asserted-by":"crossref","unstructured":"J.Xing M.Xia Y.Zhang Y.Zhang andT. T.Wong \u201cCodeTalker: Speech\u2010Driven 3D Facial Animation with Discrete Motion Prior \u201d inIEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)(IEEE 2023) 20327\u201320336.","DOI":"10.1109\/CVPR52729.2023.01229"},{"key":"e_1_2_12_37_1","doi-asserted-by":"crossref","unstructured":"S.Stan K. I.Haque andZ.Yumak \u201cFaceDiffuser: Speech\u2010Driven 3D Facial Animation Synthesis Using Diffusion \u201d inProceedings of the 16th ACM SIGGRAPH Conference on Motion Interaction and Games(ACM 2023) 1\u201311.","DOI":"10.1145\/3623264.3624447"},{"key":"e_1_2_12_38_1","doi-asserted-by":"crossref","unstructured":"Z.Ma X.Zhu G.Qi C.Qian Z.Zhang andZ.Lei \u201cDiffSpeaker: Speech\u2010Driven 3D Facial Animation with Diffusion Transformer \u201darXiv:2402.05712(2024).","DOI":"10.1109\/IJCB65343.2025.11411575"},{"key":"e_1_2_12_39_1","doi-asserted-by":"crossref","unstructured":"K.Sung\u2010Bin L.Chae\u2010Yeon G.Son et\u00a0al. \u201cMultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset \u201d inInterspeech 2024(2024) 1380\u20131384.","DOI":"10.21437\/Interspeech.2024-1794"},{"key":"e_1_2_12_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2020.3004555"},{"key":"e_1_2_12_41_1","unstructured":"H.Lu Y.Huo G.Yang et\u00a0al. \u201cUniAdapter: Unified Parameter\u2010Efficient Transfer Learning for Cross\u2010modal Modeling \u201darXiv:2302.06605(2024)."},{"key":"e_1_2_12_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3450626.3459936"},{"key":"e_1_2_12_43_1","doi-asserted-by":"crossref","unstructured":"S.Qian T.Kirschstein L.Schoneveld D.Davoli S.Giebenhain andM.Nie\u00dfner \u201cGaussianavatars: Photorealistic Head Avatars With Rigged 3D Gaussians \u201d inIEEE\/CVF Conference on ComputerVision and Pattern Recognition (CVPR)(IEEE 2024) 20299\u201320309.","DOI":"10.1109\/CVPR52733.2024.01919"},{"key":"e_1_2_12_44_1","doi-asserted-by":"crossref","unstructured":"K.Wang Q.Wu L.Song et\u00a0al. \u201cMEAD: A Large\u2010scale Audio\u2010visual Dataset for Emotional Talking\u2010face Generation \u201d inComputer Vision \u2013 ECCV(Springer 2020) 700\u2013717.","DOI":"10.1007\/978-3-030-58589-1_42"},{"key":"e_1_2_12_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2010.2052239"},{"key":"e_1_2_12_46_1","doi-asserted-by":"crossref","unstructured":"C.Szegedy W.Liu Y.Jia et\u00a0al. \u201cGoing Deeper With Convolutions \u201d inIEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)(IEEE 2015) 1\u20139.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_12_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2019.2938758"},{"key":"e_1_2_12_48_1","doi-asserted-by":"crossref","unstructured":"B.Zhu C.Wang F.Liu et\u00a0al. \u201cLearning Environmental Sounds With Multi\u2010Scale Convolutional Neural Network \u201d in2018 International Joint Conference on Neural Networks (IJCNN)(IEEE 2018) 1\u20138.","DOI":"10.1109\/IJCNN.2018.8489641"}],"container-title":["IET Image Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/ipr2.70155","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/full-xml\/10.1049\/ipr2.70155","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/ipr2.70155","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,8]],"date-time":"2026-05-08T23:40:33Z","timestamp":1778283633000},"score":1,"resource":{"primary":{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/10.1049\/ipr2.70155"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1]]},"references-count":47,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1]]}},"alternative-id":["10.1049\/ipr2.70155"],"URL":"https:\/\/doi.org\/10.1049\/ipr2.70155","archive":["Portico"],"relation":{},"ISSN":["1751-9659","1751-9667"],"issn-type":[{"value":"1751-9659","type":"print"},{"value":"1751-9667","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1]]},"assertion":[{"value":"2025-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-17","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-23","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"e70155"}}