{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T18:46:46Z","timestamp":1775069206181,"version":"3.50.1"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2017,7,20]],"date-time":"2017-07-20T00:00:00Z","timestamp":1500508800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Samsung"},{"DOI":"10.13039\/100006785","name":"Google","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100006785","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Intel"},{"name":"University of Washington Animation Research Labs"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2017,8,31]]},"abstract":"<jats:p>Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. 
Our approach produces photorealistic results.<\/jats:p>","DOI":"10.1145\/3072959.3073640","type":"journal-article","created":{"date-parts":[[2017,7,21]],"date-time":"2017-07-21T12:24:07Z","timestamp":1500639847000},"page":"1-13","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":844,"title":["Synthesizing Obama"],"prefix":"10.1145","volume":"36","author":[{"given":"Supasorn","family":"Suwajanakorn","sequence":"first","affiliation":[{"name":"University of Washington"}]},{"given":"Steven M.","family":"Seitz","sequence":"additional","affiliation":[{"name":"University of Washington"}]},{"given":"Ira","family":"Kemelmacher-Shlizerman","sequence":"additional","affiliation":[{"name":"University of Washington"}]}],"member":"320","published-online":{"date-parts":[[2017,7,20]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467","author":"Abadi Martin","year":"2016","unstructured":"Martin Abadi , Ashish Agarwal , Paul Barham , Eugene Brevdo , Zhifeng Chen , Craig Citro , Greg S Corrado , Andy Davis , Jeffrey Dean , Matthieu Devin , and others. 2016 . Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016). Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503385.2503473"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.434"},{"key":"e_1_2_2_4_1","volume-title":"Availabel from: http:\/\/ffm.peg.org","author":"Bellard Fabrice","year":"2012","unstructured":"Fabrice Bellard , M Niedermayer , and others. 2012. FFmpeg. 
Available from: http:\/\/ffm.peg.org ( 2012 ). Fabrice Bellard, M Niedermayer, and others. 2012. FFmpeg. Available from: http:\/\/ffm.peg.org (2012)."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2185520.2185563"},{"key":"e_1_2_2_6_1","volume-title":"Dobb's Journal of Software Tools","author":"Bradski G.","year":"2000","unstructured":"G. Bradski . 2000. Dr. Dobb's Journal of Software Tools ( 2000 ). G. Bradski. 2000. Dr. Dobb's Journal of Software Tools (2000)."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/311535.311537"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/258734.258880"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/245.247"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925873"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1095878.1095881"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.927467"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2070781.2024164"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/566570.566594"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178899"},{"key":"e_1_2_2_16_1","volume-title":"A deep bidirectional LSTM approach for video-realistic talking head. Multimedia Tools and Applications","author":"Fan Bo","year":"2015","unstructured":"Bo Fan , Lei Xie , Shan Yang , Lijuan Wang , and Frank K Soong . 2015b. A deep bidirectional LSTM approach for video-realistic talking head. Multimedia Tools and Applications ( 2015 ), 1--23. Bo Fan, Lei Xie, Shan Yang, Lijuan Wang, and Frank K Soong. 2015b. A deep bidirectional LSTM approach for video-realistic talking head. 
Multimedia Tools and Applications (2015), 1--23."},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2005.843341"},{"key":"e_1_2_2_18_1","volume-title":"A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287","author":"Gal Yarin","year":"2015","unstructured":"Yarin Gal . 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287 ( 2015 ). Yarin Gal. 2015. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287 (2015)."},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.537"},{"key":"e_1_2_2_20_1","volume-title":"Vdub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. In Computer Graphics Forum","author":"Garrido Pablo","year":"2015","unstructured":"Pablo Garrido , Levi Valgaerts , Hamid Sarmadi , Ingmar Steiner , Kiran Varanasi , Patrick Perez , and Christian Theobalt . 2015 . Vdub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. In Computer Graphics Forum , Vol. 34 . Wiley Online Library , 193--204. Pablo Garrido, Levi Valgaerts, Hamid Sarmadi, Ingmar Steiner, Kiran Varanasi, Patrick Perez, and Christian Theobalt. 2015. Vdub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. In Computer Graphics Forum, Vol. 34. Wiley Online Library, 193--204."},{"key":"e_1_2_2_21_1","volume-title":"Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850","author":"Graves Alex","year":"2013","unstructured":"Alex Graves . 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 ( 2013 ). Alex Graves. 2013. Generating sequences with recurrent neural networks. 
arXiv preprint arXiv:1308.0850 (2013)."},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2013.6707742"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2005.06.042"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.2197\/ipsjjip.22.401"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2012.6247876"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2783356"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.5555\/1577069.1755843"},{"key":"e_1_2_2_29_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik","year":"2014","unstructured":"Diederik Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_2_2_30_1","volume-title":"Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 57--64","author":"Li Kai","year":"2012","unstructured":"Kai Li , Feng Xu , Jue Wang , Qionghai Dai , and Yebin Liu . 2012 . A data-driven approach for facial expression synthesis in video . In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 57--64 . Kai Li, Feng Xu, Jue Wang, Qionghai Dai, and Yebin Liu. 2012. A data-driven approach for facial expression synthesis in video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. 
IEEE, 57--64."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46475-6_23"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2008.4587845"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2013.02.005"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2014.11.001"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1141911.1141919"},{"key":"e_1_2_2_36_1","unstructured":"Wener Robitza. 2016. ffmpeg-normalize. https:\/\/github.com\/slhck\/ffmpeg-normalize. (2016).  Wener Robitza. 2016. ffmpeg-normalize. https:\/\/github.com\/slhck\/ffmpeg-normalize. (2016)."},{"key":"e_1_2_2_37_1","volume-title":"Real-Time Facial Segmentation and Performance Capture from RGB Input. arXiv preprint arXiv:1604.02647","author":"Saito Shunsuke","year":"2016","unstructured":"Shunsuke Saito , Tianye Li , and Hao Li. 2016. Real-Time Facial Segmentation and Performance Capture from RGB Input. arXiv preprint arXiv:1604.02647 ( 2016 ). Shunsuke Saito, Tianye Li, and Hao Li. 2016. Real-Time Facial Segmentation and Performance Capture from RGB Input. arXiv preprint arXiv:1604.02647 (2016)."},{"key":"e_1_2_2_38_1","doi-asserted-by":"crossref","unstructured":"Shinji Sako Keiichi Tokuda Takashi Masuko Takao Kobayashi and Tadashi Kitamura. 2000. HMM-based text-to-audio-visual speech synthesis.. In INTERSPEECH. 25--28.  Shinji Sako Keiichi Tokuda Takashi Masuko Takao Kobayashi and Tadashi Kitamura. 2000. HMM-based text-to-audio-visual speech synthesis.. In INTERSPEECH. 25--28.","DOI":"10.21437\/ICSLP.2000-469"},{"key":"e_1_2_2_39_1","doi-asserted-by":"crossref","unstructured":"YiChang Shih Sylvain Paris Connelly Barnes William T Freeman and Fr\u00e9do Durand. 2014. Style transfer for headshot portraits. (2014).  YiChang Shih Sylvain Paris Connelly Barnes William T Freeman and Fr\u00e9do Durand. 2014. Style transfer for headshot portraits. 
(2014).","DOI":"10.1145\/2601097.2601137"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/SII.2015.7404961"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10593-2_52"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.450"},{"key":"e_1_2_2_43_1","doi-asserted-by":"crossref","unstructured":"Sarah Taylor Akihiro Kato Ben Milner and Iain Matthews. 2016. Audio-to-Visual Speech Conversion using Deep Neural Networks. (2016).  Sarah Taylor Akihiro Kato Ben Milner and Iain Matthews. 2016. Audio-to-Visual Speech Conversion using Deep Neural Networks. (2016).","DOI":"10.21437\/Interspeech.2016-483"},{"key":"e_1_2_2_44_1","volume-title":"Proceedings of the 11th ACM SIGGRAPH\/Eurographics conference on Computer Animation. Eurographics Association, 275--284","author":"Taylor Sarah L","year":"2012","unstructured":"Sarah L Taylor , Moshe Mahler , Barry-John Theobald , and Iain Matthews . 2012 . Dynamic units of visual speech . In Proceedings of the 11th ACM SIGGRAPH\/Eurographics conference on Computer Animation. Eurographics Association, 275--284 . Sarah L Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. 2012. Dynamic units of visual speech. In Proceedings of the 11th ACM SIGGRAPH\/Eurographics conference on Computer Animation. Eurographics Association, 275--284."},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1080\/10867651.2004.10487596"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2816795.2818056"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.262"},{"key":"e_1_2_2_48_1","volume-title":"Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499","author":"van den Oord A\u00e4ron","year":"2016","unstructured":"A\u00e4ron van den Oord , Sander Dieleman , Heiga Zen , Karen Simonyan , Oriol Vinyals , Alex Graves , Nal Kalchbrenner , Andrew Senior , and Koray Kavukcuoglu . 2016 . 
Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016). A\u00e4ron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)."},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/1186822.1073209"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2012.6288925"},{"key":"e_1_2_2_51_1","doi-asserted-by":"crossref","first-page":"446","DOI":"10.21437\/Interspeech.2010-194","article-title":"Synthesizing photo-real talking head via trajectory-guided sample selection","volume":"10","author":"Wang Lijuan","year":"2010","unstructured":"Lijuan Wang , Xiaojun Qian , Wei Han , and Frank K Soong . 2010 . Synthesizing photo-real talking head via trajectory-guided sample selection .. In INTERSPEECH , Vol. 10. 446 -- 449 . Lijuan Wang, Xiaojun Qian, Wei Han, and Frank K Soong. 2010. Synthesizing photo-real talking head via trajectory-guided sample selection.. In INTERSPEECH, Vol. 10. 446--449.","journal-title":"INTERSPEECH"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2006.12.001"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2006.888009"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.75"},{"key":"e_1_2_2_55_1","volume-title":"Recurrent neural network regularization. arXiv preprint arXiv:1409.2329","author":"Zaremba Wojciech","year":"2014","unstructured":"Wojciech Zaremba , Ilya Sutskever , and Oriol Vinyals . 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 ( 2014 ). Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. 
arXiv preprint arXiv:1409.2329 (2014)."},{"key":"e_1_2_2_56_1","doi-asserted-by":"crossref","unstructured":"Xinjian Zhang Lijuan Wang Gang Li Frank Seide and Frank K Soong. 2013. A new language independent photo-realistic talking head driven by voice only.. In INTERSPEECH. 2743--2747.  Xinjian Zhang Lijuan Wang Gang Li Frank Seide and Frank K Soong. 2013. A new language independent photo-realistic talking head driven by voice only.. In INTERSPEECH. 2743--2747.","DOI":"10.21437\/Interspeech.2013-629"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3072959.3073640","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3072959.3073640","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:30:23Z","timestamp":1750217423000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3072959.3073640"}},"subtitle":["learning lip sync from audio"],"short-title":[],"issued":{"date-parts":[[2017,7,20]]},"references-count":56,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2017,8,31]]}},"alternative-id":["10.1145\/3072959.3073640"],"URL":"https:\/\/doi.org\/10.1145\/3072959.3073640","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,7,20]]},"assertion":[{"value":"2017-07-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}