{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T07:43:55Z","timestamp":1767858235706,"version":"3.49.0"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2022,5,4]],"date-time":"2022-05-04T00:00:00Z","timestamp":1651622400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001863","name":"New Energy and Industrial Technology Development Organization","doi-asserted-by":"publisher","award":["JPNP21004"],"award-info":[{"award-number":["JPNP21004"]}],"id":[{"id":"10.13039\/501100001863","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Comput. Graph. Interact. Tech."],"published-print":{"date-parts":[[2022,5,4]]},"abstract":"<jats:p>Speech-driven 3D facial animation with accurate lip synchronization has been widely studied. However, synthesizing realistic motions for the entire face during speech has rarely been explored. In this work, we present a joint audio-text model to capture the contextual information for expressive speech-driven 3D facial animation. The existing datasets are collected to cover as many different phonemes as possible instead of sentences, thus limiting the capability of the audio-based model to learn more diverse contexts. To address this, we propose to leverage the contextual text embeddings extracted from the powerful pre-trained language model that has learned rich contextual representations from large-scale text data. Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio. In contrast to prior approaches which learn phoneme-level features from the text, we investigate the high-level contextual text features for speech-driven 3D facial animation. We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization. We conduct the quantitative and qualitative evaluations as well as the perceptual user study. The results demonstrate the superior performance of our model against existing state-of-the-art approaches.<\/jats:p>","DOI":"10.1145\/3522615","type":"journal-article","created":{"date-parts":[[2022,5,4]],"date-time":"2022-05-04T17:31:08Z","timestamp":1651685468000},"page":"1-15","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation"],"prefix":"10.1145","volume":"5","author":[{"given":"Yingruo","family":"Fan","sequence":"first","affiliation":[{"name":"The University of Hong Kong, Hong Kong, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhaojiang","family":"Lin","sequence":"additional","affiliation":[{"name":"The Hong Kong University of Science and Technology, Hong Kong, Hong Kong and HKUST"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jun","family":"Saito","sequence":"additional","affiliation":[{"name":"Adobe Research, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenping","family":"Wang","sequence":"additional","affiliation":[{"name":"Texas A&amp;M University, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Taku","family":"Komura","sequence":"additional","affiliation":[{"name":"The University of Hong Kong, Hong Kong, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,5,4]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV.2016.7477553"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1095878.1095881"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58545-7_3"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_32"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00802"},{"key":"e_1_2_2_6_1","volume-title":"You said that? arXiv preprint arXiv:1705.02966","author":"Chung Joon Son","year":"2017","unstructured":"Joon Son Chung , Amir Jamaludin , and Andrew Zisserman . 2017. You said that? arXiv preprint arXiv:1705.02966 ( 2017 ). Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. 2017. You said that? arXiv preprint arXiv:1705.02966 (2017)."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01034"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_25"},{"key":"e_1_2_2_9_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_2_2_10_1","volume-title":"Facial action coding system: A technique for the measurement of facial movement","author":"Eckman P","year":"1978","unstructured":"P Eckman and W Friesen . 1978. Facial action coding system: A technique for the measurement of facial movement . Consulting Psychologists Press ( 1978 ). P Eckman and W Friesen. 1978. Facial action coding system: A technique for the measurement of facial movement. Consulting Psychologists Press (1978)."},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925984"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2010.2052239"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1044\/jshr.1104.796"},{"key":"e_1_2_2_14_1","unstructured":"Wallace V Friesen Paul Ekman etal 1983. EMFACS-7: Emotional facial action coding system. Unpublished manuscript University of California at San Francisco 2 36 (1983) 1.  Wallace V Friesen Paul Ekman et al. 1983. EMFACS-7: Emotional facial action coding system. Unpublished manuscript University of California at San Francisco 2 36 (1983) 1."},{"key":"e_1_2_2_15_1","unstructured":"Awni Hannun Carl Case Jared Casper Bryan Catanzaro Greg Diamos Erich Elsen Ryan Prenger Sanjeev Satheesh Shubho Sengupta Adam Coates etal 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014).  Awni Hannun Carl Case Jared Casper Bryan Catanzaro Greg Diamos Erich Elsen Ryan Prenger Sanjeev Satheesh Shubho Sengupta Adam Coates et al. 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014)."},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01386"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073658"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3382507.3418815"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511843891.014"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093527"},{"key":"e_1_2_2_22_1","volume-title":"Gentle forced aligner. github.com\/lowerquality\/gentle","author":"Ochshorn RM","year":"2017","unstructured":"RM Ochshorn and Max Hawkins . 2017. Gentle forced aligner. github.com\/lowerquality\/gentle ( 2017 ). RM Ochshorn and Max Hawkins. 2017. Gentle forced aligner. github.com\/lowerquality\/gentle (2017)."},{"key":"e_1_2_2_23_1","volume-title":"Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499","author":"van den Oord Aaron","year":"2016","unstructured":"Aaron van den Oord , Sander Dieleman , Heiga Zen , Karen Simonyan , Oriol Vinyals , Alex Graves , Nal Kalchbrenner , Andrew Senior , and Koray Kavukcuoglu . 2016 . Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016). Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)."},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2017.287"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2016.0055"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413532"},{"key":"e_1_2_2_27_1","unstructured":"Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei Ilya Sutskever etal 2019. Language models are unsupervised multitask learners. OpenAI blog 1 8 (2019) 9.  Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei Ilya Sutskever et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1 8 (2019) 9."},{"key":"e_1_2_2_28_1","volume-title":"Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092","author":"Ramesh Aditya","year":"2021","unstructured":"Aditya Ramesh , Mikhail Pavlov , Gabriel Goh , Scott Gray , Chelsea Voss , Alec Radford , Mark Chen , and Ilya Sutskever . 2021. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092 ( 2021 ). Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092 (2021)."},{"key":"e_1_2_2_29_1","volume-title":"Fernando De la Torre, and Yaser Sheikh","author":"Richard Alexander","year":"2021","unstructured":"Alexander Richard , Michael Zollhoefer , Yandong Wen , Fernando De la Torre, and Yaser Sheikh . 2021 . MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement . arXiv preprint arXiv:2104.08223 (2021). Alexander Richard, Michael Zollhoefer, Yandong Wen, Fernando De la Torre, and Yaser Sheikh. 2021. MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement. arXiv preprint arXiv:2104.08223 (2021)."},{"key":"e_1_2_2_30_1","volume-title":"Speaker-normalized sound representations in the human auditory cortex. Nature communications 10, 1","author":"Sjerps Matthias J","year":"2019","unstructured":"Matthias J Sjerps , Neal P Fox , Keith Johnson , and Edward F Chang . 2019. Speaker-normalized sound representations in the human auditory cortex. Nature communications 10, 1 ( 2019 ), 1--9. Matthias J Sjerps, Neal P Fox, Keith Johnson, and Edward F Chang. 2019. Speaker-normalized sound representations in the human auditory cortex. Nature communications 10, 1 (2019), 1--9."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073640"},{"key":"e_1_2_2_32_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3072959.3073699","article-title":"A deep learning approach for generalized speech animation","volume":"36","author":"Taylor Sarah","year":"2017","unstructured":"Sarah Taylor , Taehwan Kim , Yisong Yue , Moshe Mahler , James Krahe , Anastasio Garcia Rodriguez , Jessica Hodgins , and Iain Matthews . 2017 . A deep learning approach for generalized speech animation . ACM Transactions on Graphics (TOG) 36 , 4 (2017), 1 -- 11 . Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1--11.","journal-title":"ACM Transactions on Graphics (TOG)"},{"key":"e_1_2_2_33_1","volume-title":"Proceedings of the 11th ACM SIGGRAPH\/Eurographics conference on Computer Animation. 275--284","author":"Taylor Sarah L","year":"2012","unstructured":"Sarah L Taylor , Moshe Mahler , Barry-John Theobald , and Iain Matthews . 2012 . Dynamic units of visual speech . In Proceedings of the 11th ACM SIGGRAPH\/Eurographics conference on Computer Animation. 275--284 . Sarah L Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. 2012. Dynamic units of visual speech. In Proceedings of the 11th ACM SIGGRAPH\/Eurographics conference on Computer Animation. 275--284."},{"key":"e_1_2_2_34_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008.  Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008."},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-019-01251-8"},{"key":"e_1_2_2_36_1","volume-title":"3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head. arXiv preprint arXiv:2104.12051","author":"Wang Qianyun","year":"2021","unstructured":"Qianyun Wang , Zhenfeng Fan , and Shihong Xia . 2021. 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head. arXiv preprint arXiv:2104.12051 ( 2021 ). Qianyun Wang, Zhenfeng Fan, and Shihong Xia. 2021. 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head. arXiv preprint arXiv:2104.12051 (2021)."},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01261-8_41"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2522628.2522904"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417838"},{"key":"e_1_2_2_40_1","volume-title":"Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122","author":"Yu Fisher","year":"2015","unstructured":"Fisher Yu and Vladlen Koltun . 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 ( 2015 ). Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)."},{"key":"e_1_2_2_41_1","volume-title":"Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250","author":"Zadeh Amir","year":"2017","unstructured":"Amir Zadeh , Minghai Chen , Soujanya Poria , Erik Cambria , and Louis-Philippe Morency . 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250 ( 2017 ). Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250 (2017)."},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413844"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33019299"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00416"},{"key":"e_1_2_2_45_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3197517.3201292","article-title":"Visemenet: Audio-driven animator-centric speech animation","volume":"37","author":"Zhou Yang","year":"2018","unstructured":"Yang Zhou , Zhan Xu , Chris Landreth , Evangelos Kalogerakis , Subhransu Maji , and Karan Singh . 2018 . Visemenet: Audio-driven animator-centric speech animation . ACM Transactions on Graphics (TOG) 37 , 4 (2018), 1 -- 10 . Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. 2018. Visemenet: Audio-driven animator-centric speech animation. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1--10.","journal-title":"ACM Transactions on Graphics (TOG)"}],"container-title":["Proceedings of the ACM on Computer Graphics and Interactive Techniques"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3522615","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3522615","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:34Z","timestamp":1750183774000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3522615"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,4]]},"references-count":45,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,5,4]]}},"alternative-id":["10.1145\/3522615"],"URL":"https:\/\/doi.org\/10.1145\/3522615","relation":{},"ISSN":["2577-6193"],"issn-type":[{"value":"2577-6193","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,5,4]]},"assertion":[{"value":"2022-05-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}