{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T15:29:13Z","timestamp":1759332553859,"version":"3.37.3"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"5","license":[{"start":{"date-parts":[[2024,5,23]],"date-time":"2024-05-23T00:00:00Z","timestamp":1716422400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,5,23]],"date-time":"2024-05-23T00:00:00Z","timestamp":1716422400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"The National Key Research and Development Program of China","award":["2020YFB1313602"],"award-info":[{"award-number":["2020YFB1313602"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Complex Intell. Syst."],"published-print":{"date-parts":[[2024,10]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In recent years, the applications of digital humans have become increasingly widespread. One of the most challenging core technologies is the generation of highly realistic, automated 3D facial animation that combines facial movements and speech. Single-modal 3D facial animation driven by speech alone typically ignores the weak correlation between speech and both upper facial movements and head posture. In contrast, the video-driven approach resolves the posture problem while capturing natural expressions. However, mapping 2D facial information to 3D facial information may cause information loss, which makes the lip synchronization produced by video-driven methods inferior to that of speech-driven methods trained on 4D facial data. Therefore, this paper proposes a dual-modal generation method that uses both speech and video information to generate more natural and vivid 3D facial animation. 
Specifically, the lip movements related to speech are generated from combined speech-video information, while speech-uncorrelated postures and expressions are generated solely from video information. The speech-driven module extracts speech features, and its output lip animation then serves as the foundation for the facial animation. The expression and pose module extracts temporal visual features for regressing expression and head posture parameters. We fuse speech and video features to obtain chin posture parameters related to lip movements, and use these parameters to fine-tune the lip animation generated from the speech-driven module. This paper introduces multiple consistency losses to enhance the network\u2019s capability to generate expressions and postures. Experiments conducted on the LRS3, TCD-TIMIT and MEAD datasets show that the proposed method outperforms the current state-of-the-art methods on evaluation metrics such as CER, WER, VER and VWER. In addition, a perceptual user study shows that participants judged this paper\u2019s method more realistic than the comparison algorithms EMOCA and SPECTRE in over 77% and 70% of cases, respectively. In terms of lip synchronization, it received support in over 79% and 66% of cases, respectively. 
Both evaluation methods demonstrate the effectiveness of the proposed method.<\/jats:p>","DOI":"10.1007\/s40747-024-01481-5","type":"journal-article","created":{"date-parts":[[2024,5,23]],"date-time":"2024-05-23T18:02:50Z","timestamp":1716487370000},"page":"5951-5964","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["3D facial animation driven by speech-video dual-modal signals"],"prefix":"10.1007","volume":"10","author":[{"given":"Xuejie","family":"Ji","sequence":"first","affiliation":[]},{"given":"Zhouzhou","family":"Liao","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0267-9905","authenticated-orcid":false,"given":"Lanfang","family":"Dong","sequence":"additional","affiliation":[]},{"given":"Yingchao","family":"Tang","sequence":"additional","affiliation":[]},{"given":"Guoming","family":"Li","sequence":"additional","affiliation":[]},{"given":"Meng","family":"Mao","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,5,23]]},"reference":[{"issue":"5","key":"1481_CR1","first-page":"863","volume":"23","author":"AH Abdelaziz","year":"2015","unstructured":"Abdelaziz AH, Zeiler S, Kolossa D (2015) Learning dynamic stream weights for coupled-hmm-based audio-visual speech recognition. IEEE\/ACM Trans Audio, Speech, Lang Process 23(5):863\u2013876","journal-title":"IEEE\/ACM Trans Audio, Speech, Lang Process"},{"key":"1481_CR2","unstructured":"Afouras T, Chung JS, Zisserman A (2018) Lrs3-ted: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496"},{"key":"1481_CR3","doi-asserted-by":"crossref","unstructured":"Barros JMD, Golyanik V, Varanasi K, Stricker D (2019) Face it!: a pipeline for real-time performance-driven facial animation. In: 2019 IEEE International Conference on Image Processing (ICIP). IEEE. 
pp 2209\u20132213","DOI":"10.1109\/ICIP.2019.8803330"},{"key":"1481_CR4","doi-asserted-by":"crossref","unstructured":"Brand M (1999) Voice puppetry. In: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp 21\u201328","DOI":"10.1145\/311535.311537"},{"key":"1481_CR5","doi-asserted-by":"crossref","unstructured":"Bulat A, Tzimiropoulos G (2017) How far are we from solving the 2D & 3D face alignment problem?(and a dataset of 230,000 3d facial landmarks). In: Proceedings of the IEEE international conference on computer vision, pp 1021\u20131030","DOI":"10.1109\/ICCV.2017.116"},{"issue":"4","key":"1481_CR6","first-page":"1","volume":"33","author":"C Cao","year":"2014","unstructured":"Cao C, Hou Q, Zhou K (2014) Displaced dynamic expression regression for real-time facial tracking and animation. ACM Trans Graph (TOG) 33(4):1\u201310","journal-title":"ACM Trans Graph (TOG)"},{"key":"1481_CR7","doi-asserted-by":"crossref","unstructured":"Cao C, Weng Y, Lin S, Zhou K (2013) 3d shape regression for real-time facial animation. ACM Trans Graph (TOG) 32(4):1\u201310","DOI":"10.1145\/2461912.2462012"},{"key":"1481_CR8","doi-asserted-by":"crossref","unstructured":"Chen X, Cao C, Xue Z, Chu W (2018) Joint audio-video driven facial animation. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. pp 3046\u20133050","DOI":"10.1109\/ICASSP.2018.8461502"},{"key":"1481_CR9","doi-asserted-by":"publisher","first-page":"51","DOI":"10.1023\/A:1011171430700","volume":"29","author":"K Choi","year":"2001","unstructured":"Choi K, Luo Y, Hwang JN (2001) Hidden Markov model inversion for audio-to-visual conversion in an mpeg-4 facial animation system. 
J VLSI Signal Process Syst Signal, Image Video Technol 29:51\u201361","journal-title":"J VLSI Signal Process Syst Signal, Image Video Technol"},{"key":"1481_CR10","doi-asserted-by":"crossref","unstructured":"Cudeiro D, Bolkart T, Laidlaw C, Ranjan A, Black MJ (2019) Capture, learning, and synthesis of 3D speaking styles. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 10101\u201310111","DOI":"10.1109\/CVPR.2019.01034"},{"key":"1481_CR11","doi-asserted-by":"crossref","unstructured":"Dan\u011b\u010dek R, Black MJ, Bolkart T (2022) Emoca: Emotion driven monocular face capture and animation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 20311\u201320322","DOI":"10.1109\/CVPR52688.2022.01967"},{"key":"1481_CR12","doi-asserted-by":"crossref","unstructured":"Fan Y, Lin Z, Saito J, Wang W, Komura T (2022) Faceformer: Speech-driven 3D facial animation with transformers. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 18770\u201318780","DOI":"10.1109\/CVPR52688.2022.01821"},{"issue":"4","key":"1481_CR13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3450626.3459936","volume":"40","author":"Y Feng","year":"2021","unstructured":"Feng Y, Feng H, Black MJ, Bolkart T (2021) Learning an animatable detailed 3D face model from in-the-wild images. ACM Trans Graph (ToG) 40(4):1\u201313","journal-title":"ACM Trans Graph (ToG)"},{"key":"1481_CR14","doi-asserted-by":"crossref","unstructured":"Filntisis PP, Retsinas G, Paraperas-Papantoniou F, Katsamanis A, Roussos A, Maragos P (2023) Spectre: Visual speech-informed perceptual 3d facial expression reconstruction from videos. 
In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 5744\u20135754","DOI":"10.1109\/CVPRW59228.2023.00609"},{"issue":"2","key":"1481_CR15","doi-asserted-by":"publisher","first-page":"243","DOI":"10.1109\/TMM.2005.843341","volume":"7","author":"S Fu","year":"2005","unstructured":"Fu S, Gutierrez-Osuna R, Esposito A, Kakumanu PK, Garcia ON (2005) Audio\/visual mapping with cross-modal hidden Markov models. IEEE Trans Multimed 7(2):243\u2013252","journal-title":"IEEE Trans Multimed"},{"key":"1481_CR16","doi-asserted-by":"crossref","unstructured":"Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL (1993) DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM (TIMIT)","DOI":"10.6028\/NIST.IR.4930"},{"key":"1481_CR17","unstructured":"Guo J, Zhu X, Lei Z (2018) 3DDFA. URL https:\/\/github.com\/cleardusk\/3DDFA"},{"key":"1481_CR18","unstructured":"Hannun A, Case C, Casper J, Catanzaro B, Diamos G, Elsen E, Prenger R, Satheesh S, Sengupta S, Coates A et al (2014) Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567"},{"issue":"5","key":"1481_CR19","doi-asserted-by":"publisher","first-page":"603","DOI":"10.1109\/TMM.2015.2407694","volume":"17","author":"N Harte","year":"2015","unstructured":"Harte N, Gillen E (2015) Tcd-timit: an audio-visual corpus of continuous speech. IEEE Trans Multimed 17(5):603\u2013615","journal-title":"IEEE Trans Multimed"},{"key":"1481_CR20","doi-asserted-by":"crossref","unstructured":"Hempel T, Abdelrahman AA, Al-Hamadi A (2022) 6D rotation representation for unconstrained head pose estimation. In: 2022 IEEE International Conference on Image Processing (ICIP). IEEE. 
pp 2496\u20132500","DOI":"10.1109\/ICIP46576.2022.9897219"},{"key":"1481_CR21","doi-asserted-by":"crossref","unstructured":"Hussen\u00a0Abdelaziz A, Theobald BJ, Dixon P, Knothe R, Apostoloff N, Kajareker S (2020) Modality dropout for improved performance-driven talking faces. In: Proceedings of the 2020 International Conference on Multimodal Interaction, pp 378\u2013386","DOI":"10.1145\/3382507.3418840"},{"issue":"4","key":"1481_CR22","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3072959.3073658","volume":"36","author":"T Karras","year":"2017","unstructured":"Karras T, Aila T, Laine S, Herva A, Lehtinen J (2017) Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans Graph (TOG) 36(4):1\u201312","journal-title":"ACM Trans Graph (TOG)"},{"key":"1481_CR23","doi-asserted-by":"crossref","unstructured":"Laine S, Karras T, Aila T, Herva A, Saito S, Yu R, Li H, Lehtinen J (2017) Production-level facial performance capture using deep convolutional neural networks. In: Proceedings of the ACM SIGGRAPH\/Eurographics symposium on computer animation, pp 1\u201310","DOI":"10.1145\/3099564.3099581"},{"issue":"6","key":"1481_CR24","first-page":"1","volume":"36","author":"T Li","year":"2017","unstructured":"Li T, Bolkart T, Black MJ, Li H, Romero J (2017) Learning a model of facial shape and expression from 4D scans. ACM Trans Graph 36(6):1\u2013194","journal-title":"ACM Trans Graph"},{"issue":"12","key":"1481_CR25","doi-asserted-by":"publisher","first-page":"4873","DOI":"10.1109\/TVCG.2021.3107669","volume":"28","author":"J Liu","year":"2021","unstructured":"Liu J, Hui B, Li K, Liu Y, Lai YK, Zhang Y, Liu Y, Yang J (2021) Geometry-guided dense perspective network for speech-driven facial animation. 
IEEE Trans Vis Comput Graph 28(12):4873\u20134886","journal-title":"IEEE Trans Vis Comput Graph"},{"key":"1481_CR26","doi-asserted-by":"publisher","DOI":"10.1109\/TASE.2024.3375024","author":"X Liu","year":"2024","unstructured":"Liu X, Li Z, Zong W, Su H, Liu P, Ge SS (2024) Graph representation learning and optimization for spherical emission source microscopy system. IEEE Trans Autom Sci Eng. https:\/\/doi.org\/10.1109\/TASE.2024.3375024","journal-title":"IEEE Trans Autom Sci Eng"},{"issue":"6","key":"1481_CR27","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2816795.2818130","volume":"34","author":"Y Liu","year":"2015","unstructured":"Liu Y, Xu F, Chai J, Tong X, Wang L, Huo Q (2015) Video-audio driven real-time facial animation. ACM Trans Graph (TOG) 34(6):1\u201310","journal-title":"ACM Trans Graph (TOG)"},{"issue":"11","key":"1481_CR28","doi-asserted-by":"publisher","first-page":"930","DOI":"10.1038\/s42256-022-00550-z","volume":"4","author":"P Ma","year":"2022","unstructured":"Ma P, Petridis S, Pantic M (2022) Visual speech recognition for multiple languages in the wild. Nat Mach Intell 4(11):930\u20139","journal-title":"Nat Mach Intell"},{"key":"1481_CR29","doi-asserted-by":"crossref","unstructured":"Martyniuk T, Kupyn O, Kurlyak Y, Krashenyi I, Matas J, Sharmanska V (2022) Dad-3dheads: a large-scale dense, accurate and diverse dataset for 3d head alignment from a single image. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 20942\u201320952","DOI":"10.1109\/CVPR52688.2022.02027"},{"issue":"3","key":"1481_CR30","doi-asserted-by":"publisher","first-page":"393","DOI":"10.1108\/AA-11-2020-0178","volume":"41","author":"W Qi","year":"2021","unstructured":"Qi W, Liu X, Zhang L, Wu L, Zang W, Su H (2021) Adaptive sensor fusion labeling framework for hand pose recognition in robot teleoperation. 
Assem Autom 41(3):393\u2013400","journal-title":"Assem Autom"},{"key":"1481_CR31","doi-asserted-by":"crossref","unstructured":"Richard A, Lea C, Ma S, Gall J, De\u00a0la Torre F, Sheikh Y (2021) Audio-and gaze-driven facial animation of codec avatars. In: Proceedings of the IEEE\/CVF winter conference on applications of computer vision, pp 41\u201350","DOI":"10.1109\/WACV48630.2021.00009"},{"key":"1481_CR32","doi-asserted-by":"crossref","unstructured":"Richard A, Zollh\u00f6fer M, Wen Y, De\u00a0la Torre F, Sheikh Y (2021) Meshtalk: 3D face animation from speech using cross-modality disentanglement. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp 1173\u20131182","DOI":"10.1109\/ICCV48922.2021.00121"},{"key":"1481_CR33","unstructured":"Shi B, Hsu WN, Lakhotia K, Mohamed A (2022) Learning audio-visual speech representation by masked multimodal cluster prediction. arXiv preprint arXiv:2201.02184"},{"key":"1481_CR34","doi-asserted-by":"crossref","unstructured":"Shi B, Hsu WN, Mohamed A (2022) Robust self-supervised audio-visual speech recognition. arXiv preprint arXiv:2201.01763","DOI":"10.21437\/Interspeech.2022-99"},{"issue":"4","key":"1481_CR35","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3072959.3073699","volume":"36","author":"S Taylor","year":"2017","unstructured":"Taylor S, Kim T, Yue Y, Mahler M, Krahe J, Rodriguez AG, Hodgins J, Matthews I (2017) A deep learning approach for generalized speech animation. ACM Trans Graph (TOG) 36(4):1\u201311","journal-title":"ACM Trans Graph (TOG)"},{"key":"1481_CR36","doi-asserted-by":"crossref","unstructured":"Tian G, Yuan Y, Liu Y (2019) Audio2face: generating speech\/face animation from single audio with attention-based bidirectional lstm networks. In: 2019 IEEE international conference on Multimedia & Expo Workshops (ICMEW). IEEE. 
pp 366\u2013371","DOI":"10.1109\/ICMEW.2019.00069"},{"key":"1481_CR37","doi-asserted-by":"crossref","unstructured":"Wang K, Wu Q, Song L, Yang Z, Wu W, Qian C, He R, Qiao Y, Loy CC (2020) Mead: a large-scale audio-visual dataset for emotional talking-face generation. In: Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part XXI. Springer. pp 700\u2013717","DOI":"10.1007\/978-3-030-58589-1_42"},{"key":"1481_CR38","unstructured":"Wang Q, Fan Z, Xia S (2021) 3D-talkemo: Learning to synthesize 3d emotional talking head. arXiv preprint arXiv:2104.12051"},{"issue":"8","key":"1481_CR39","doi-asserted-by":"publisher","first-page":"2325","DOI":"10.1016\/j.patcog.2006.12.001","volume":"40","author":"L Xie","year":"2007","unstructured":"Xie L, Liu ZQ (2007) A coupled hmm approach to video-realistic speech animation. Pattern Recognit 40(8):2325\u20132340","journal-title":"Pattern Recognit"},{"issue":"3","key":"1481_CR40","doi-asserted-by":"publisher","first-page":"500","DOI":"10.1109\/TMM.2006.888009","volume":"9","author":"L Xie","year":"2007","unstructured":"Xie L, Liu ZQ (2007) Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Trans Multimed 9(3):500\u2013510","journal-title":"IEEE Trans Multimed"},{"issue":"1","key":"1481_CR41","doi-asserted-by":"publisher","first-page":"951","DOI":"10.1007\/s40747-022-00841-3","volume":"9","author":"Y Xu","year":"2023","unstructured":"Xu Y, Su H, Ma G, Liu X (2023) A novel dual-modal emotion recognition algorithm with fusing hybrid features of audio signal and speech context. Complex Intell Syst 9(1):951\u2013963","journal-title":"Complex Intell Syst"},{"issue":"4","key":"1481_CR42","first-page":"1","volume":"42","author":"L Zhang","year":"2023","unstructured":"Zhang L, Qiu Q, Lin H, Zhang Q, Shi C, Yang W, Shi Y, Yang S, Xu L, Yu J (2023) Dreamface: progressive generation of animatable 3d faces under text guidance. 
ACM Trans Graph 42(4):1\u201316","journal-title":"ACM Trans Graph"},{"issue":"4","key":"1481_CR43","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3197517.3201292","volume":"37","author":"Y Zhou","year":"2018","unstructured":"Zhou Y, Xu Z, Landreth C, Kalogerakis E, Maji S, Singh K (2018) Visemenet: audio-driven animator-centric speech animation. ACM Trans Graph (TOG) 37(4):1\u201310","journal-title":"ACM Trans Graph (TOG)"}],"container-title":["Complex &amp; Intelligent Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01481-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s40747-024-01481-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s40747-024-01481-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,14]],"date-time":"2024-09-14T15:05:47Z","timestamp":1726326347000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s40747-024-01481-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,23]]},"references-count":43,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2024,10]]}},"alternative-id":["1481"],"URL":"https:\/\/doi.org\/10.1007\/s40747-024-01481-5","relation":{},"ISSN":["2199-4536","2198-6053"],"issn-type":[{"type":"print","value":"2199-4536"},{"type":"electronic","value":"2198-6053"}],"subject":[],"published":{"date-parts":[[2024,5,23]]},"assertion":[{"value":"8 May 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 April 
2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 May 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"On behalf of all authors, the corresponding author states that there is no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}