{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T20:37:55Z","timestamp":1775939875774,"version":"3.50.1"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"5","license":[{"start":{"date-parts":[[2019,10,13]],"date-time":"2019-10-13T00:00:00Z","timestamp":1570924800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2019,10,13]],"date-time":"2019-10-13T00:00:00Z","timestamp":1570924800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000761","name":"Imperial College London","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100000761","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2020,5]]},"abstract":"<jats:title>Abstract<\/jats:title>\n<jats:p>Speech-driven facial animation is the process that automatically synthesizes talking characters based on speech signals. The majority of work in this domain creates a mapping from audio features to visual features. This approach often requires post-processing using computer graphics techniques to produce realistic albeit subject dependent results. We present an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features. Our method generates videos which have (a) lip movements that are in sync with the audio and (b) natural facial expressions such as blinks and eyebrow movements. Our temporal GAN uses 3 discriminators focused on achieving detailed frames, audio-visual synchronization, and realistic expressions. We quantify the contribution of each component in our model using an ablation study and we provide insights into the latent representation of the model. The generated videos are evaluated based on sharpness, reconstruction quality, lip-reading accuracy, synchronization as well as their ability to generate natural blinks.<\/jats:p>","DOI":"10.1007\/s11263-019-01251-8","type":"journal-article","created":{"date-parts":[[2019,10,13]],"date-time":"2019-10-13T16:05:54Z","timestamp":1570982754000},"page":"1398-1413","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":228,"title":["Realistic Speech-Driven Facial Animation with GANs"],"prefix":"10.1007","volume":"128","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8552-5559","authenticated-orcid":false,"given":"Konstantinos","family":"Vougioukas","sequence":"first","affiliation":[]},{"given":"Stavros","family":"Petridis","sequence":"additional","affiliation":[]},{"given":"Maja","family":"Pantic","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2019,10,13]]},"reference":[{"key":"1251_CR1","unstructured":"Amos, B., Ludwiczuk, B., & Satyanarayanan, M. (2016). OpenFace: A general-purpose face recognition library with mobile applications. Technical Report, 118."},{"key":"1251_CR2","unstructured":"Arjovsky, M., & Bottou, L. (2017). Towards principled methods for training generative adversarial networks. In ICLR."},{"key":"1251_CR3","unstructured":"Assael, Y. M., Shillingford, B., Whiteson, S., & de\u00a0Freitas, N. (2016). 
LipNet: End-to-end sentence-level Lipreading. arXiv preprint \narXiv:1611.01599\n\n."},{"issue":"6","key":"1251_CR4","doi-asserted-by":"publisher","first-page":"1028","DOI":"10.1002\/mds.870120629","volume":"12","author":"AR Bentivoglio","year":"1997","unstructured":"Bentivoglio, A. R., Bressman, S. B., Cassetta, E., Carretta, D., Tonali, P., & Albanese, A. (1997). Analysis of blink rate patterns in normal subjects. Movement Disorders, 12(6), 1028\u20131034.","journal-title":"Movement Disorders"},{"key":"1251_CR5","doi-asserted-by":"crossref","unstructured":"Bregler, C., Covell, M., & Slaney, M. (1997). Video rewrite. In Proceedings of the 24th annual conference on computer graphics and interactive techniques (pp. 353\u2013360).","DOI":"10.1145\/258734.258880"},{"issue":"4","key":"1251_CR6","doi-asserted-by":"publisher","first-page":"377","DOI":"10.1109\/TAFFC.2014.2336244","volume":"5","author":"H Cao","year":"2014","unstructured":"Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C., Nenkova, A., & Verma, R. (2014). CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4), 377\u2013390.","journal-title":"IEEE Transactions on Affective Computing"},{"issue":"4","key":"1251_CR7","doi-asserted-by":"publisher","first-page":"1283","DOI":"10.1145\/1095878.1095881","volume":"24","author":"Y Cao","year":"2005","unstructured":"Cao, Y., Tien, W. C., Faloutsos, P., & Pighin, F. (2005). Expressive speech-driven facial animation. ACM TOG, 24(4), 1283\u20131302.","journal-title":"ACM TOG"},{"key":"1251_CR8","doi-asserted-by":"crossref","unstructured":"Chen, L., Li, Z., Maddox, R. K., Duan, Z., & Xu, C. (2018). Lip movements generation at a glance. In ECCV (pp. 1\u201315).","DOI":"10.1007\/978-3-030-01234-2_32"},{"key":"1251_CR9","doi-asserted-by":"crossref","unstructured":"Chen, L., Maddox, R. K., Duan, Z., & Xu, C. (2019). Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In CVPR.","DOI":"10.1109\/CVPR.2019.00802"},{"key":"1251_CR10","doi-asserted-by":"crossref","unstructured":"Chen, L., Srivastava, S., Duan, Z., & Xu, C. (2017). Deep cross-modal audio-visual generation. In Thematic workshops of ACM multimedia (pp. 349\u2013357).","DOI":"10.1145\/3126686.3126723"},{"key":"1251_CR11","unstructured":"Chung, J. S., Jamaludin, A., & Zisserman, A. (2017) You said that? In BMVC."},{"key":"1251_CR12","unstructured":"Chung, J. S., & Zisserman, A. (2016a). Lip reading in the wild. In ACCV."},{"key":"1251_CR13","unstructured":"Chung, J. S., & Zisserman, A. (2016b). Out of time: Automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV."},{"issue":"5","key":"1251_CR14","doi-asserted-by":"publisher","first-page":"2421","DOI":"10.1121\/1.2229005","volume":"120","author":"M Cooke","year":"2006","unstructured":"Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5), 2421\u20132424.","journal-title":"The Journal of the Acoustical Society of America"},{"key":"1251_CR15","doi-asserted-by":"crossref","unstructured":"Dai, W., Dai, C., Qu, S., Li, J., & Das, S. (2017) Very deep convolutional neural networks for raw waveforms. In ICASSP (pp. 421\u2013425).","DOI":"10.1109\/ICASSP.2017.7952190"},{"key":"1251_CR16","doi-asserted-by":"crossref","unstructured":"Fan, B., Wang, L., Soong, F., & Xie, L. (2015). Photo-real talking head with deep bidirectional LSTM. In ICASSP (pp. 
{"key":"1251_CR17","unstructured":"Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial networks. In NIPS (pp. 2672\u20132680)."},{"issue":"5","key":"1251_CR18","doi-asserted-by":"publisher","first-page":"603","DOI":"10.1109\/TMM.2015.2407694","volume":"17","author":"N Harte","year":"2015","unstructured":"Harte, N., & Gillen, E. (2015). TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 17(5), 603\u2013615.","journal-title":"IEEE Transactions on Multimedia"},{"key":"1251_CR19","unstructured":"Guo, J., Zhu, X., & Lei, Z. (2018). 3DDFA. https:\/\/github.com\/cleardusk\/3DDFA. Accessed 17 Feb 2019."},{"issue":"4","key":"1251_CR20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3072959.3073658","volume":"36","author":"T Karras","year":"2017","unstructured":"Karras, T., Aila, T., Laine, S., Herva, A., & Lehtinen, J. (2017). Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM TOG, 36(4), 1\u201312.","journal-title":"ACM TOG"},{"key":"1251_CR21","unstructured":"Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980."},{"key":"1251_CR22","doi-asserted-by":"crossref","unstructured":"Li, Y., Chang, M., & Lyu, S. (2018). In Ictu Oculi: Exposing AI created fake videos by detecting eye blinking. In WIFS.","DOI":"10.1109\/WIFS.2018.8630787"},{"key":"1251_CR23","unstructured":"Li, Y., Min, M. R., Shen, D., Carlson, D., & Carin, L. (2017). Video generation from text. arXiv preprint arXiv:1710.00421."},{"key":"1251_CR24","unstructured":"Mathieu, M., Couprie, C., & LeCun, Y. (2015). Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440."},{"issue":"9","key":"1251_CR25","first-page":"87","volume":"20","author":"ND Narvekar","year":"2009","unstructured":"Narvekar, N. D., & Karam, L. J. (2009). A no-reference perceptual image sharpness metric based on a cumulative probability of blur detection. International Workshop on Quality of Multimedia Experience (QoMEx), 20(9), 87\u201391.","journal-title":"International Workshop on Quality of Multimedia Experience (QoMEx)"},{"key":"1251_CR26","doi-asserted-by":"crossref","unstructured":"Pham, H. X., Cheung, S., & Pavlovic, V. (2017). Speech-driven 3D facial animation with implicit emotional awareness: A deep learning approach. In CVPR-Workshop (pp. 2328\u20132336).","DOI":"10.1109\/CVPRW.2017.287"},{"key":"1251_CR27","unstructured":"Pham, H. X., Wang, Y., & Pavlovic, V. (2018). Generative adversarial talking head: Bringing portraits to life with a weakly supervised neural network (pp. 1\u201318)."},{"key":"1251_CR28","doi-asserted-by":"crossref","unstructured":"Pumarola, A., Agudo, A., Martinez, A., Sanfeliu, A., & Moreno-Noguer, F. (2018). GANimation: Anatomically-aware facial animation from a single image. In ECCV.","DOI":"10.1007\/978-3-030-01249-6_50"},{"key":"1251_CR29","unstructured":"Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434."},{"key":"1251_CR30","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International conference on medical image computing and computer-assisted intervention (pp. 234\u2013241).","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"1251_CR31","doi-asserted-by":"crossref","unstructured":"Saito, M., Matsumoto, E., & Saito, S. (2017). Temporal generative adversarial nets with singular value clipping. In ICCV (pp. 2830\u20132839).","DOI":"10.1109\/ICCV.2017.308"},{"issue":"January","key":"1251_CR32","first-page":"475","volume":"12","author":"AD Simons","year":"1990","unstructured":"Simons, A. D., & Cox, S. J. (1990). Generation of mouthshapes for a synthetic talking head. Proceedings of the Institute of Acoustics, Autumn Meeting, 12(January), 475\u2013482.","journal-title":"Proceedings of the Institute of Acoustics, Autumn Meeting"},{"key":"1251_CR33","unstructured":"Soukupova, T., & Cech, J. (2016). Real-time eye blink detection using facial landmarks. In Computer vision winter workshop."},{"issue":"4","key":"1251_CR34","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3072959.3073640","volume":"36","author":"S Suwajanakorn","year":"2017","unstructured":"Suwajanakorn, S., Seitz, S., & Kemelmacher-Shlizerman, I. (2017). Synthesizing Obama: Learning lip sync from audio. ACM TOG, 36(4), 1\u201313.","journal-title":"ACM TOG"},{"issue":"4","key":"1251_CR35","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3072959.3073699","volume":"36","author":"S Taylor","year":"2017","unstructured":"Taylor, S., Kim, T., Yue, Y., Mahler, M., Krahe, J., Rodriguez, A. G., et al. (2017). A deep learning approach for generalized speech animation. ACM TOG, 36(4), 1\u201313.","journal-title":"ACM TOG"},{"key":"1251_CR36","doi-asserted-by":"crossref","unstructured":"Tulyakov, S., Liu, M., Yang, X., & Kautz, J. (2018). MoCoGAN: Decomposing motion and content for video generation. In CVPR (pp. 1526\u20131535).","DOI":"10.1109\/CVPR.2018.00165"},{"key":"1251_CR37","first-page":"2579","volume":"9","author":"LJP Van Der Maaten","year":"2008","unstructured":"Van Der Maaten, L. J. P., & Hinton, G. E. (2008). Visualizing high-dimensional data using t-SNE. JMLR, 9, 2579\u20132605.","journal-title":"JMLR"},{"key":"1251_CR38","unstructured":"Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Generating videos with scene dynamics. In NIPS (pp. 613\u2013621)."},{"key":"1251_CR39","unstructured":"Vougioukas, K., Petridis, S., & Pantic, M. (2018). End-to-end speech-driven facial animation with temporal GANs. In BMVC."},{"issue":"8","key":"1251_CR40","doi-asserted-by":"publisher","first-page":"2325","DOI":"10.1016\/j.patcog.2006.12.001","volume":"40","author":"L Xie","year":"2007","unstructured":"Xie, L., & Liu, Z. Q. (2007). A coupled HMM approach to video-realistic speech animation. Pattern Recognition, 40(8), 2325\u20132340.","journal-title":"Pattern Recognition"},{"issue":"1\u20132","key":"1251_CR41","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1016\/S0167-6393(98)00054-5","volume":"26","author":"E Yamamoto","year":"1998","unstructured":"Yamamoto, E., Nakamura, S., & Shikano, K. (1998). Lip movement synthesis from speech based on hidden Markov Models. Speech Communication, 26(1\u20132), 105\u2013115.","journal-title":"Speech Communication"},{"issue":"1\u20132","key":"1251_CR42","doi-asserted-by":"publisher","first-page":"23","DOI":"10.1016\/S0167-6393(98)00048-X","volume":"26","author":"H Yehia","year":"1998","unstructured":"Yehia, H., Rubin, P., & Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication, 26(1\u20132), 23\u201343.","journal-title":"Speech Communication"},
{"issue":"3","key":"1251_CR43","doi-asserted-by":"publisher","first-page":"555","DOI":"10.1006\/jpho.2002.0165","volume":"30","author":"HC Yehia","year":"2002","unstructured":"Yehia, H. C., Kuratate, T., & Vatikiotis-Bateson, E. (2002). Linking facial animation, head motion and speech acoustics. Journal of Phonetics, 30(3), 555\u2013568.","journal-title":"Journal of Phonetics"},{"key":"1251_CR44","doi-asserted-by":"crossref","unstructured":"Zhou, H., Liu, Y., Liu, Z., Luo, P., & Wang, X. (2019). Talking face generation by adversarially disentangled audio-visual representation. In AAAI.","DOI":"10.1609\/aaai.v33i01.33019299"},{"issue":"4","key":"1251_CR45","first-page":"161:1","volume":"37","author":"Y Zhou","year":"2018","unstructured":"Zhou, Y., Xu, Z., Landreth, C., Kalogerakis, E., Maji, S., & Singh, K. (2018). VisemeNet: Audio-driven animator-centric speech animation. ACM TOG, 37(4), 161:1\u2013161:10.","journal-title":"ACM TOG"},{"key":"1251_CR46","unstructured":"Zhu, X., Lei, Z., Li, S. Z., et\u00a0al. (2017). Face alignment in full pose range: A 3D total solution. IEEE TPAMI."}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-019-01251-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1007\/s11263-019-01251-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-019-01251-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,10,11]],"date-time":"2020-10-11T23:15:58Z","timestamp":1602458158000},"score":1,"resource":{"primary":{"URL":"http:\/\/link.springer.com\/10.1007\/s11263-019-01251-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,13]]},"references-count":46,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2020,5]]}},"alternative-id":["1251"],"URL":"https:\/\/doi.org\/10.1007\/s11263-019-01251-8","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,10,13]]},"assertion":[{"value":"31 October 2018","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 October 2019","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 October 2019","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}