{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:27:09Z","timestamp":1750220829052,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":39,"publisher":"ACM","license":[{"start":{"date-parts":[[2019,10,14]],"date-time":"2019-10-14T00:00:00Z","timestamp":1571011200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2019,10,14]]},"DOI":"10.1145\/3340555.3353745","type":"proceedings-article","created":{"date-parts":[[2019,10,17]],"date-time":"2019-10-17T12:49:48Z","timestamp":1571316588000},"page":"220-225","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models"],"prefix":"10.1145","author":[{"given":"Ahmed","family":"Hussen Abdelaziz","sequence":"first","affiliation":[{"name":"Apple Inc., Cupertino, CA"}]},{"given":"Barry-John","family":"Theobald","sequence":"additional","affiliation":[{"name":"Apple Inc., Cupertino, CA"}]},{"given":"Justin","family":"Binder","sequence":"additional","affiliation":[{"name":"Apple Inc., Cupertino, CA"}]},{"given":"Gabriele","family":"Fanelli","sequence":"additional","affiliation":[{"name":"Apple Inc., Zurich, Switzerland"}]},{"given":"Paul","family":"Dixon","sequence":"additional","affiliation":[{"name":"Apple Inc., Zurich, Switzerland"}]},{"given":"Nick","family":"Apostoloff","sequence":"additional","affiliation":[{"name":"Apple Inc., Cupertino, CA"}]},{"given":"Thibaut","family":"Weise","sequence":"additional","affiliation":[{"name":"Apple Inc., Cupertino, CA"}]},{"given":"Sachin","family":"Kajareker","sequence":"additional","affiliation":[{"name":"Apple 
Inc., Cupertino, CA"}]}],"member":"320","published-online":{"date-parts":[[2019,10,14]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2015.2409785"},{"volume-title":"Proceedings of the IEEE conference on computer vision and pattern recognition. 3382\u20133389","author":"Anderson R.","key":"e_1_3_2_1_2_1","unstructured":"R. Anderson, B. Stenger, V. Wan, and R. Cipolla. 2013. Expressive visual text-to-speech using active appearance models. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3382\u20133389."},{"volume-title":"International Conference on Auditory-Visual Speech Processing. 175\u2013180","author":"Arslan L.","key":"e_1_3_2_1_3_1","unstructured":"L. Arslan and D. Talkin. 1998. 3D face point trajectory synthesis using an automatically derived visual phoneme similarity matrix. In International Conference on Auditory-Visual Speech Processing. 175\u2013180."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/311535.311537"},{"volume-title":"Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH \u201997)","author":"Bregler C.","key":"e_1_3_2_1_5_1","unstructured":"C. Bregler, M. Covell, and M. Slaney. 1997. Video Rewrite: driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH \u201997), Vol.\u00a097. 353\u2013360."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1011171430700"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/6046.865480"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"crossref","unstructured":"P. Ekman and W. Friesen. 1978. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto.","DOI":"10.1037\/t27734-000"},{"volume-title":"Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH \u201902)","author":"Ezzat T.","key":"e_1_3_2_1_9_1","unstructured":"T. Ezzat, G. Geiger, and T. Poggio. 2002. Trainable Videorealistic Speech Animation. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH \u201902), Vol.\u00a097. 388\u2013398."},{"volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4884\u20134888","author":"Fan B.","key":"e_1_3_2_1_10_1","unstructured":"B. Fan, L. Wang, F. Soong, and L. Xie. 2015. Photo-real talking head with deep bidirectional LSTM. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4884\u20134888."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2005.843341"},{"key":"e_1_3_2_1_12_1","volume-title":"Maximum likelihood linear transformations for HMM-based speech recognition. Computer speech & language 12, 2","author":"Gales JF","year":"1998","unstructured":"Mark\u00a0JF Gales. 1998. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language 12, 2 (1998), 75\u201398."},{"key":"e_1_3_2_1_13_1","volume-title":"TDA: A new trainable trajectory formation system for facial animation. In Interspeech. 2474\u20132477.","author":"Govokhina O.","year":"2006","unstructured":"O. Govokhina, G. Bailly, G. Breton, and P. Bagshaw. 2006. TDA: A new trainable trajectory formation system for facial animation. In Interspeech. 2474\u20132477."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2004.840611"},{"volume-title":"IEEE International Conference on Automatic Face & Gesture Recognition. 200\u2013207","key":"e_1_3_2_1_15_1","unstructured":"Zhenliang H., Meina K., Jie J., Xilin C., and Shiguang S. 2017. A Fully End-to-End Cascaded CNN for Facial Landmark Detection. In IEEE International Conference on Automatic Face & Gesture Recognition. 200\u2013207."},{"key":"e_1_3_2_1_16_1","unstructured":"S. Jalalifar, H. Hasani, and H. Aghajan. 2018. Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks. arXiv preprint arXiv:1803.07461 (2018)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073658"},{"volume-title":"Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 577\u2013586","author":"Kim T.","key":"e_1_3_2_1_18_1","unstructured":"T. Kim, Y. Yue, S. Taylor, and I. Matthews. 2015. A decision tree framework for spatiotemporal sequence prediction. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 577\u2013586."},{"key":"e_1_3_2_1_19_1","unstructured":"G\u00fcnter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. Self-Normalizing Neural Networks. CoRR abs\/1706.02515 (2017). arXiv:1706.02515 http:\/\/arxiv.org\/abs\/1706.02515"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2013.2278556"},{"key":"e_1_3_2_1_21_1","first-page":"1","article-title":"Example-based Facial Rigging. In ACM SIGGRAPH. ACM","volume":"32","author":"Li H.","year":"2010","unstructured":"H. Li, T. Weise, and M. Pauly. 2010. Example-based Facial Rigging. In ACM SIGGRAPH. ACM, Article 32, 32:1\u201332:6\u00a0pages.","journal-title":"Article"},{"key":"e_1_3_2_1_22_1","volume-title":"On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Statist. 18, 1 (03","author":"Mann B.","year":"1947","unstructured":"H.\u00a0B. Mann and D.\u00a0R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Statist. 18, 1 (03 1947), 50\u201360. https:\/\/doi.org\/10.1214\/aoms\/1177730491"},{"key":"e_1_3_2_1_23_1","first-page":"7","article-title":"Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis","volume":"55","author":"Mattheyses W.","year":"2013","unstructured":"W. Mattheyses, L. Latacz, and W. Verhelst. 2013. Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis. Speech Communication 55, 7-8 (2013), 857\u2013876.","journal-title":"Speech Communication"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"crossref","unstructured":"H. Pham, Y. Wang, and V. Pavlovic. 2017. End-to-end learning for 3d facial animation from raw waveforms of speech. arXiv preprint arXiv:1710.00920 (2017).","DOI":"10.1145\/3242969.3243017"},{"key":"e_1_3_2_1_25_1","unstructured":"D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. 2011. The Kaldi speech recognition toolkit. Technical Report. IEEE Signal Processing Society."},{"volume-title":"2015 IEEE\/SICE International Symposium on System Integration (SII). IEEE, 100\u2013105","author":"Shimba T.","key":"e_1_3_2_1_26_1","unstructured":"T. Shimba, R. Sakurai, H. Yamazoe, and J. Lee. 2015. Talking heads synthesis from audio with deep neural networks. In 2015 IEEE\/SICE International Symposium on System Integration (SII). IEEE, 100\u2013105."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Y. Song, J. Zhu, X. Wang, and H. Qi. 2018. Talking Face Generation by Conditional Recurrent Adversarial Network. arXiv preprint arXiv:1804.04786 (2018).","DOI":"10.24963\/ijcai.2019\/129"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073640"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"crossref","unstructured":"S. Taylor, A. Kato, B. Milner, and I. Matthews. 2016. Audio-to-visual speech conversion using deep neural networks. In Interspeech. International Speech Communication Association.","DOI":"10.21437\/Interspeech.2016-483"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073699"},{"volume-title":"Proceedings of the ACM SIGGRAPH\/Eurographics Symposium on Computer Animation. Eurographics Association, 275\u2013284","author":"Taylor S.","key":"e_1_3_2_1_31_1","unstructured":"S. Taylor, M. Mahler, B. Theobald, and I. Matthews. 2012. Dynamic units of visual speech. In Proceedings of the ACM SIGGRAPH\/Eurographics Symposium on Computer Animation. Eurographics Association, 275\u2013284."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"crossref","unstructured":"K. Vesel\u00fd, A. Ghoshal, L. Burget, and D. Povey. 2013. Sequence-discriminative training of deep neural networks. In Interspeech. 2345\u20132349.","DOI":"10.21437\/Interspeech.2013-548"},{"key":"e_1_3_2_1_33_1","volume-title":"S. Petridis, and M. Pantic.","author":"K.","year":"2018","unstructured":"K. Vougioukas, S. Petridis, and M. Pantic. 2018. End-to-end speech-driven facial animation with temporal GANs. arXiv preprint arXiv:1805.09313 (2018)."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-014-2118-8"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"crossref","unstructured":"T. Weise, S. Bouaziz, H. Li, and M. Pauly. 2011. Realtime performance-based facial animation. In ACM Transactions on Graphics (TOG), Vol.\u00a030. ACM, 77.","DOI":"10.1145\/1964921.1964972"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2006.12.001"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2006.888009"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-6393(98)00048-X"},{"volume-title":"Proceedings of the workshop on Human Language Technology. Association for Computational Linguistics.","author":"Young S.","key":"e_1_3_2_1_39_1","unstructured":"S. Young, J. Odell, and P. Woodland. 1994. Tree-based state tying for high accuracy acoustic modelling. In Proceedings of the workshop on Human Language Technology. Association for Computational Linguistics."}],"event":{"name":"ICMI '19: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION","acronym":"ICMI '19","location":"Suzhou China"},"container-title":["2019 International Conference on Multimodal 
Interaction"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3340555.3353745","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3340555.3353745","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:13:28Z","timestamp":1750202008000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3340555.3353745"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,14]]},"references-count":39,"alternative-id":["10.1145\/3340555.3353745","10.1145\/3340555"],"URL":"https:\/\/doi.org\/10.1145\/3340555.3353745","relation":{},"subject":[],"published":{"date-parts":[[2019,10,14]]},"assertion":[{"value":"2019-10-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}