{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,10]],"date-time":"2026-06-10T10:10:49Z","timestamp":1781086249431,"version":"3.54.1"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2017,7,20]],"date-time":"2017-07-20T00:00:00Z","timestamp":1500508800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2017,8,31]]},"abstract":"<jats:p>We present a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. Our deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone. During inference, the latent code can be used as an intuitive control for the emotional state of the face puppet.<\/jats:p>\n          <jats:p>We train our network with 3--5 minutes of high-quality animation data obtained using traditional, vision-based performance capture methods. Even though our primary goal is to model the speaking style of a single actor, our model yields reasonable results even when driven with audio from other speakers with different gender, accent, or language, as we demonstrate with a user study. 
The results are applicable to in-game dialogue, low-cost localization, virtual reality avatars, and telepresence.<\/jats:p>","DOI":"10.1145\/3072959.3073658","type":"journal-article","created":{"date-parts":[[2017,7,21]],"date-time":"2017-07-21T12:24:07Z","timestamp":1500639847000},"page":"1-12","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":361,"title":["Audio-driven facial animation by joint end-to-end learning of pose and emotion"],"prefix":"10.1145","volume":"36","author":[{"given":"Tero","family":"Karras","sequence":"first","affiliation":[{"name":"NVIDIA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Timo","family":"Aila","sequence":"additional","affiliation":[{"name":"NVIDIA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Samuli","family":"Laine","sequence":"additional","affiliation":[{"name":"NVIDIA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Antti","family":"Herva","sequence":"additional","affiliation":[{"name":"Remedy Entertainment"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jaakko","family":"Lehtinen","sequence":"additional","affiliation":[{"name":"NVIDIA and Aalto University"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2017,7,20]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.434"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2007.02.006"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/311535.311537"},{"key":"e_1_2_2_4_1","volume-title":"Proc. SCA. 225--231","author":"Cao Yong","year":"2003"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1095878.1095881"},{"key":"e_1_2_2_6_1","volume-title":"cuDNN: Efficient Primitives for Deep Learning. 
arXiv:1410.0759","author":"Chetlur Sharan","year":"2014"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/PCCGA.2002.1167840"},{"key":"e_1_2_2_8_1","volume-title":"Massaro","author":"Cohen Michael M.","year":"1993"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-10331-5_9"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2013.2279659"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1026776.1026784"},{"key":"e_1_2_2_12_1","first-page":"1523","article-title":"Expressive Facial Animation Synthesis by Learning Speech Coarticulation and Expression Spaces","volume":"12","author":"Deng Zhigang","year":"2006","journal-title":"IEEE TVCG"},{"key":"e_1_2_2_13_1","volume-title":"S\u00f8ren Kaae S\u00f8nderby, and others","author":"Dieleman Sander","year":"2015"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925984"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2004.1315070"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/566654.566594"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-015-2944-3"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1044\/jshr.1104.796"},{"key":"e_1_2_2_19_1","volume-title":"Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852","author":"He Kaiming","year":"2015"},{"key":"e_1_2_2_20_1","volume-title":"Proc. Interspeech. 454--457","author":"Hofer Gregor","year":"2010"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2002.1021892"},{"key":"e_1_2_2_22_1","volume-title":"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 
arXiv:1502.03167","author":"Ioffe Sergey","year":"2015"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-013-1604-8"},{"key":"e_1_2_2_24_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2014"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2000.871547"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1002\/vis.4340020404"},{"key":"e_1_2_2_27_1","unstructured":"J. P. Lewis Ken Anjyo Taehyun Rhee Mengjie Zhang Fred Pighin and Zhigang Deng. 2014. Practice and Theory of Blendshape Facial Models. In Eurographics (State of the Art Reports)."},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/29933.30874"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2011.6011835"},{"key":"e_1_2_2_30_1","first-page":"61","article-title":"Text-driven avatars based on artificial neural networks and fuzzy logic","volume":"4","author":"Malcangi M.","year":"2010","journal-title":"Int. J. Comput."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485895.2485900"},{"key":"e_1_2_2_32_1","volume-title":"Proc. AVSP. #23","author":"Massaro D. W."},{"key":"e_1_2_2_33_1","doi-asserted-by":"crossref","unstructured":"D. W. Massaro M. M. Cohen R. Clark and M. Tabain. 2012. Animated speech: Research progress and applications. In Audiovisual Speech Processing. 309--345.","DOI":"10.1017\/CBO9780511843891.014"},{"key":"e_1_2_2_34_1","volume-title":"Audiovisual speech synthesis: An overview of the state-of-the-art. 
Speech Communication 66 (2","author":"Mattheyses Wesley","year":"2015"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2008.2010213"},{"key":"e_1_2_2_36_1","first-page":"33","article-title":"Bukimi no tani (The uncanny valley)","volume":"7","author":"Mori M.","year":"1970","journal-title":"Energy"},{"key":"e_1_2_2_37_1","first-page":"45","article-title":"Using HMMs and ANNs for mapping acoustic to visual speech","volume":"40","author":"\u00d6hman T.","year":"1999","journal-title":"IEEE Journal of Selected Topics in Signal Processing"},{"key":"e_1_2_2_38_1","volume-title":"Proc. AAAI Fall Symp. 141--145","author":"Petrushin Valery A.","year":"1998"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2013.2281036"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1003743"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.5555\/2627435.2670313"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/1015706.1015736"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-483"},{"key":"e_1_2_2_44_1","volume-title":"Proc. SCA. 275--284","author":"Taylor Sarah L.","year":"2012"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1162\/089976600300015349"},{"key":"e_1_2_2_46_1","volume-title":"Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688","author":"Team Theano Development","year":"2016"},{"key":"e_1_2_2_47_1","volume-title":"WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499","author":"van den Oord A\u00e4ron","year":"2016"},{"key":"e_1_2_2_48_1","first-page":"93","article-title":"Multilinear Subspace Analysis of Image Ensembles","volume":"2","author":"Vasilescu M. Alex O.","year":"2003","journal-title":"Proc. CVPR"},{"key":"e_1_2_2_49_1","volume-title":"Proc. SCA. 
53--62","author":"Wampler Kevin","year":"2007"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-014-2118-8"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3072959.3073658","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3072959.3073658","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:30:23Z","timestamp":1750217423000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3072959.3073658"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,7,20]]},"references-count":50,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2017,8,31]]}},"alternative-id":["10.1145\/3072959.3073658"],"URL":"https:\/\/doi.org\/10.1145\/3072959.3073658","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,7,20]]},"assertion":[{"value":"2017-07-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}