{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,18]],"date-time":"2026-02-18T03:04:36Z","timestamp":1771383876206,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":39,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T00:00:00Z","timestamp":1634515200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,18]]},"DOI":"10.1145\/3462244.3479883","type":"proceedings-article","created":{"date-parts":[[2021,10,15]],"date-time":"2021-10-15T15:01:58Z","timestamp":1634310118000},"page":"503-511","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Audiovisual Speech Synthesis using Tacotron2"],"prefix":"10.1145","author":[{"given":"Ahmed","family":"Hussen Abdelaziz","sequence":"first","affiliation":[{"name":"Apple, USA"}]},{"given":"Anushree Prasanna","family":"Kumar","sequence":"additional","affiliation":[{"name":"Apple, USA"}]},{"given":"Chloe","family":"Seivwright","sequence":"additional","affiliation":[{"name":"Apple, United Kingdom"}]},{"given":"Gabriele","family":"Fanelli","sequence":"additional","affiliation":[{"name":"Apple, Switzerland"}]},{"given":"Justin","family":"Binder","sequence":"additional","affiliation":[{"name":"Apple, USA"}]},{"given":"Yannis","family":"Stylianou","sequence":"additional","affiliation":[{"name":"Apple, United Kingdom"}]},{"given":"Sachin","family":"Kajareker","sequence":"additional","affiliation":[{"name":"Apple, USA"}]}],"member":"320","published-online":{"date-parts":[[2021,10,18]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"crossref","unstructured":"Ahmed\u00a0Hussen Abdelaziz Barry-John Theobald Paul Dixon Reinhard Knothe Nicholas Apostoloff and Sachin Kajareker. 2020. Modality Dropout for Improved Performance-driven Talking Faces. arxiv:2005.13616\u00a0[eess.AS]","DOI":"10.1145\/3382507.3418840"},{"key":"e_1_3_2_2_2_1","unstructured":"Zakaria Aldeneh Anushree\u00a0Prasanna Kumar Barry-John Theobald Erik Marchi Sachin Kajarekar Devang Naik and Ahmed\u00a0Hussen Abdelaziz. 2020. Self-supervised Learning of Visual Speech Features with Audiovisual Speech Enhancement. arXiv preprint arXiv:2004.12031(2020)."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503385.2503473"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1964921.1964970"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/311535.311537"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2766943"},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925873"},{"key":"e_1_3_2_2_8_1","unstructured":"Jan\u00a0K Chorowski Dzmitry Bahdanau Dmitriy Serdyuk Kyunghyun Cho and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in neural information processing systems. 577\u2013585."},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.927467"},{"key":"e_1_3_2_2_10_1","volume-title":"Normal visual hearing. Science","author":"Cotton Jack\u00a0Chilton","year":"1935","unstructured":"Jack\u00a0Chilton Cotton. 1935. Normal visual hearing. Science (1935)."},{"key":"e_1_3_2_2_11_1","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4884\u20134888","author":"Fan B.","unstructured":"B. Fan, L. Wang, F. Soong, and L. Xie. 2015. Photo-real talking head with deep bidirectional LSTM. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4884\u20134888."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2005.843341"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2504459.2504501"},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45016-5_24"},{"key":"e_1_3_2_2_15_1","volume-title":"Sander Dieleman, and Koray Kavukcuoglu.","author":"Kalchbrenner Nal","year":"2018","unstructured":"Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van\u00a0den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435(2018)."},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073658"},{"key":"e_1_3_2_2_17_1","volume-title":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 577\u2013586","author":"Kim T.","unstructured":"T. Kim, Y. Yue, S. Taylor, and I. Matthews. 2015. A decision tree framework for spatiotemporal sequence prediction. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 577\u2013586."},{"key":"e_1_3_2_2_18_1","volume-title":"Obamanet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442(2017).","author":"Kumar Rithesh","year":"2017","unstructured":"Rithesh Kumar, Jose Sotelo, Kundan Kumar, Alexandre de Br\u00e9bisson, and Yoshua Bengio. 2017. Obamanet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442(2017)."},{"key":"e_1_3_2_2_19_1","volume-title":"Hearing lips and seeing voices. Nature 264, 5588","author":"McGurk Harry","year":"1976","unstructured":"Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature 264, 5588 (1976), 746\u2013748."},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7953092"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.21437\/ICSLP.2000-469"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"e_1_3_2_2_23_1","volume-title":"2015 IEEE\/SICE International Symposium on System Integration (SII). IEEE, 100\u2013105","author":"Shimba T.","unstructured":"T. Shimba, R. Sakurai, H. Yamazoe, and J. Lee. 2015. Talking heads synthesis from audio with deep neural networks. In 2015 IEEE\/SICE International Symposium on System Integration (SII). IEEE, 100\u2013105."},{"key":"e_1_3_2_2_24_1","unstructured":"RJ SkerryRyan Eric Battenberg Ying Xiao Yuxuan Wang Daisy Stanton Joel Shor Ron\u00a0J Weiss Rob Clark and Rif\u00a0A Saurous. 2018. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. arXiv preprint arXiv:1803.09047(2018)."},{"key":"e_1_3_2_2_25_1","volume-title":"Everybody\u2019s Talkin\u2019: Let Me Talk as You Want. arXiv preprint arXiv:(2020).","author":"Song Linsen","year":"2020","unstructured":"Linsen Song, Wayne Wu, Chen Qian, Chen Qian, and Chen\u00a0Change Loy. 2020. Everybody\u2019s Talkin\u2019: Let Me Talk as You Want. arXiv preprint arXiv:(2020)."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"crossref","first-page":"2578","DOI":"10.21437\/Interspeech.2019-2822","article-title":"Self-attention for Speech Emotion Recognition","volume":"2019","author":"Tarantino Lorenzo","year":"2019","unstructured":"Lorenzo Tarantino, Philip\u00a0N Garner, and Alexandros Lazaridis. 2019. Self-attention for Speech Emotion Recognition. Proc. Interspeech 2019 (2019), 2578\u20132582.","journal-title":"Proc. Interspeech"},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"crossref","unstructured":"S. Taylor A. Kato B. Milner and I. Matthews. 2016. Audio-to-visual speech conversion using deep neural networks. In Interspeech. International Speech Communication Association.","DOI":"10.21437\/Interspeech.2016-483"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073699"},{"key":"e_1_3_2_2_29_1","volume-title":"Proceedings of the ACM SIGGRAPH\/Eurographics Symposium on Computer Animation. Eurographics Association, 275\u2013284","author":"Taylor S.","unstructured":"S. Taylor, M. Mahler, B. Theobald, and I. Matthews. 2012. Dynamic units of visual speech. In Proceedings of the ACM SIGGRAPH\/Eurographics Symposium on Computer Animation. Eurographics Association, 275\u2013284."},{"key":"e_1_3_2_2_30_1","volume-title":"Neural Voice Puppetry: Audio-driven Facial Reenactment. arXiv 2019","author":"Thies Justus","year":"2019","unstructured":"Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nie\u00dfner. 2019. Neural Voice Puppetry: Audio-driven Facial Reenactment. arXiv 2019 (2019)."},{"key":"e_1_3_2_2_31_1","volume-title":"Visualizing data using t-SNE. journal of Machine Learning Research 9. Nov (2008)","author":"van\u00a0der Maaten Laurens","year":"2008","unstructured":"Laurens van\u00a0der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008)."},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2008-596"},{"key":"e_1_3_2_2_33_1","volume-title":"Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135(2017).","author":"Wang Yuxuan","year":"2017","unstructured":"Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron\u00a0J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, 2017. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135(2017)."},{"key":"e_1_3_2_2_34_1","unstructured":"Yuxuan Wang Daisy Stanton Yu Zhang RJ Skerry-Ryan Eric Battenberg Joel Shor Ying Xiao Fei Ren Ye Jia and Rif\u00a0A Saurous. 2018. Style tokens: Unsupervised style modeling control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017(2018)."},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"crossref","unstructured":"T. Weise S. Bouaziz H. Li and M. Pauly. 2011. Realtime performance-based facial animation. In ACM transactions on graphics (TOG) Vol.\u00a030. ACM 77.","DOI":"10.1145\/1964921.1964972"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.gmod.2013.10.002"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2006.888009"},{"key":"e_1_3_2_2_38_1","volume-title":"Data-Driven 3D Facial Animation","author":"Zhang Li","unstructured":"Li Zhang, Noah Snavely, Brian Curless, and Steven\u00a0M Seitz. 2008. Spacetime faces: High-resolution capture for modeling and animation. In Data-Driven 3D Facial Animation. Springer, 248\u2013276."},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3197517.3201292","article-title":"Visemenet: Audio-driven animator-centric speech animation","volume":"37","author":"Zhou Yang","year":"2018","unstructured":"Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. 2018. Visemenet: Audio-driven animator-centric speech animation. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1\u201310.","journal-title":"ACM Transactions on Graphics (TOG)"}],"event":{"name":"ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION","location":"Montr\u00e9al QC Canada","acronym":"ICMI '21","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction"]},"container-title":["Proceedings of the 2021 International Conference on Multimodal Interaction"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3462244.3479883","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3462244.3479883","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:54Z","timestamp":1750193334000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3462244.3479883"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,18]]},"references-count":39,"alternative-id":["10.1145\/3462244.3479883","10.1145\/3462244"],"URL":"https:\/\/doi.org\/10.1145\/3462244.3479883","relation":{},"subject":[],"published":{"date-parts":[[2021,10,18]]},"assertion":[{"value":"2021-10-18","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}