{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T04:27:12Z","timestamp":1765254432822,"version":"3.41.0"},"reference-count":49,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,10,1]],"date-time":"2024-10-01T00:00:00Z","timestamp":1727740800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Culture, Sports and Tourism R&D Program"},{"DOI":"10.13039\/501100006465","name":"Korea Creative Content Agency","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100006465","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Ministry of Culture, Sports and Tourism in 2023"},{"name":"Development of Universal Fashion Creation Platform Technology for Avatar Personality Expression","award":["RS-2023-00228331"],"award-info":[{"award-number":["RS-2023-00228331"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2025,2,28]]},"abstract":"<jats:p>We present a novel method that can generate realistic speech animations of a 3D face from audio using multiple adaptive windows. In contrast to previous studies that use a fixed size audio window, our method accepts an adaptive audio window as input, reflecting the audio speaking rate to use consistent phonemic information. Our system consists of three parts. First, the speaking rate is estimated from the input audio using a neural network trained in a self-supervised manner. Second, the appropriate window size that encloses the audio features is predicted adaptively based on the estimated speaking rate. Another key element lies in the use of multiple audio windows of different sizes as input to the animation generator: a small window to concentrate on detailed information and a large window to consider broad phonemic information near the center frame. Finally, the speech animation is generated from the multiple adaptive audio windows. Our method can generate realistic speech animations from in-the-wild audios at any speaking rate, i.e., fast raps, slow songs, as well as normal speech. We demonstrate via extensive quantitative and qualitative evaluations including a user study that our method outperforms state-of-the-art approaches.<\/jats:p>","DOI":"10.1145\/3691341","type":"journal-article","created":{"date-parts":[[2024,8,31]],"date-time":"2024-08-31T09:13:16Z","timestamp":1725095596000},"page":"1-14","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Speed-Aware Audio-Driven Speech Animation using Adaptive Windows"],"prefix":"10.1145","volume":"44","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6427-6258","authenticated-orcid":false,"given":"Sunjin","family":"Jung","sequence":"first","affiliation":[{"name":"Visual Media Lab, KAIST, Daejeon, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7327-2950","authenticated-orcid":false,"given":"Yeongho","family":"Seol","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0570-4915","authenticated-orcid":false,"given":"Kwanggyoon","family":"Seo","sequence":"additional","affiliation":[{"name":"Visual Media Lab, KAIST, Daejeon, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6818-1595","authenticated-orcid":false,"given":"Hyeonho","family":"Na","sequence":"additional","affiliation":[{"name":"Visual Media Lab, KAIST, Daejeon, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8027-8261","authenticated-orcid":false,"given":"Seonghyeon","family":"Kim","sequence":"additional","affiliation":[{"name":"Visual Media Lab, KAIST, Daejeon, Republic of Korea and Anigma Technologies, Daejeon, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-8174-6909","authenticated-orcid":false,"given":"Vanessa","family":"Tan","sequence":"additional","affiliation":[{"name":"Visual Media Lab, KAIST, Daejeon, Republic of Korea"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1925-3326","authenticated-orcid":false,"given":"Junyong","family":"Noh","sequence":"additional","affiliation":[{"name":"Visual Media Lab, KAIST, Daejeon, Republic of Korea"}]}],"member":"320","published-online":{"date-parts":[[2024,10]]},"reference":[{"key":"e_1_3_4_2_1","unstructured":"Autodesk. 2023. Maya. Retrieved from https:\/\/www.autodesk.com\/maya"},{"key":"e_1_3_4_3_1","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33 (2020), 12449\u201312460.","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_4_4_1","first-page":"187","volume-title":"Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques","author":"Blanz Volker","year":"1999","unstructured":"Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. 187\u2013194."},{"key":"e_1_3_4_5_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Chen Lele","year":"2019","unstructured":"Lele Chen, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)."},{"key":"e_1_3_4_6_1","volume-title":"Proceedings of the International Society for Music Information Retrieval Conference (ISMIR\u201920)","author":"Choi Soonbeom","year":"2020","unstructured":"Soonbeom Choi, Wonil Kim, Saebyul Park, Sangeon Yong, and Juhan Nam. 2020. Children\u2019s song dataset for singing voice research. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR\u201920)."},{"key":"e_1_3_4_7_1","article-title":"You said that?","author":"Chung Joon Son","year":"2017","unstructured":"Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. 2017. You said that? arXiv preprint arXiv:1705.02966 (2017).","journal-title":"arXiv preprint arXiv:1705.02966"},{"key":"e_1_3_4_8_1","first-page":"10101","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919)","author":"Cudeiro Daniel","year":"2019","unstructured":"Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J. Black. 2019. Capture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201919). 10101\u201310111."},{"issue":"4","key":"e_1_3_4_9_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2897824.2925984","article-title":"JALI: An animator-centric viseme model for expressive lip synchronization","volume":"35","author":"Edwards Pif","year":"2016","unstructured":"Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. 2016. JALI: An animator-centric viseme model for expressive lip synchronization. ACM Trans. Graph. 35, 4 (2016), 1\u201311.","journal-title":"ACM Trans. Graph."},{"key":"e_1_3_4_10_1","unstructured":"Faceware. 2022. Faceware Studio. Retrieved from https:\/\/facewaretech.com\/software\/studio"},{"key":"e_1_3_4_11_1","first-page":"1355","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing","author":"Faltlhauser Robert","year":"2000","unstructured":"Robert Faltlhauser, Thilo Pfau, and G\u00fcnther Ruske. 2000. On-line speaking rate estimation using Gaussian mixture models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1355\u20131358."},{"key":"e_1_3_4_12_1","first-page":"18770","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201922)","author":"Fan Yingruo","year":"2022","unstructured":"Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, and Taku Komura. 2022. FaceFormer: Speech-driven 3D facial animation with transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201922). 18770\u201318780."},{"key":"e_1_3_4_13_1","unstructured":"Awni Hannun Carl Case Jared Casper Bryan Catanzaro Greg Diamos Erich Elsen Ryan Prenger Sanjeev Satheesh Shubho Sengupta Adam Coates and Andrew Y. Ng. 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014)."},{"key":"e_1_3_4_14_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917)","author":"Isola Phillip","year":"2017","unstructured":"Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201917)."},{"key":"e_1_3_4_15_1","first-page":"1","volume-title":"Proceedings of the SIGGRAPH Asia Technical Communications Conference","author":"Iwase Shohei","year":"2020","unstructured":"Shohei Iwase, Takuya Kato, Shugo Yamaguchi, Tsuchiya Yukitaka, and Shigeo Morishima. 2020. Song2Face: Synthesizing singing facial animation from audio. In Proceedings of the SIGGRAPH Asia Technical Communications Conference. 1\u20134."},{"key":"e_1_3_4_16_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201921)","author":"Ji Xinya","year":"2021","unstructured":"Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, and Feng Xu. 2021. Audio-driven emotional video portraits. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201921)."},{"key":"e_1_3_4_17_1","first-page":"5245","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201916)","author":"Jiao Yishan","year":"2016","unstructured":"Yishan Jiao, Ming Tu, Visar Berisha, and Julie Liss. 2016. Online speaking rate estimation using recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201916). IEEE, 5245\u20135249."},{"issue":"4","key":"e_1_3_4_18_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3072959.3073658","article-title":"Audio-driven facial animation by joint end-to-end learning of pose and emotion","volume":"36","author":"Karras Tero","year":"2017","unstructured":"Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graph. 36, 4 (2017), 1\u201312.","journal-title":"ACM Trans. Graph."},{"key":"e_1_3_4_19_1","article-title":"Adam: A method for stochastic optimization","author":"Kingma Diederik P.","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).","journal-title":"arXiv preprint arXiv:1412.6980"},{"key":"e_1_3_4_20_1","article-title":"Investigation into target speaking rate adaptation for voice conversion","author":"Kuhlmann Michael","year":"2022","unstructured":"Michael Kuhlmann, Fritz Seebauer, Janek Ebbers, Petra Wagner, and Reinhold Haeb-Umbach. 2022. Investigation into target speaking rate adaptation for voice conversion. arXiv preprint arXiv:2209.01978 (2022).","journal-title":"arXiv preprint arXiv:2209.01978"},{"issue":"6","key":"e_1_3_4_21_1","first-page":"1","article-title":"Live speech portraits: Real-time photorealistic talking-head animation","volume":"40","author":"Lu Yuanxun","year":"2021","unstructured":"Yuanxun Lu, Jinxiang Chai, and Xun Cao. 2021. Live speech portraits: Real-time photorealistic talking-head animation. ACM Trans. Graph. 40, 6 (2021), 1\u201317.","journal-title":"ACM Trans. Graph."},{"key":"e_1_3_4_22_1","first-page":"1","volume-title":"Proceedings of the International Conference on Signal Processing and Communications (SPCOM\u201920)","author":"Mannem Renuka","year":"2020","unstructured":"Renuka Mannem, Hima Jyothi, Aravind Illa, and Prasanta Kumar Ghosh. 2020. Speech rate estimation using representations learned from speech with convolutional neural network. In Proceedings of the International Conference on Signal Processing and Communications (SPCOM\u201920). IEEE, 1\u20135."},{"key":"e_1_3_4_23_1","first-page":"498","volume-title":"Proceedings of the Annual Conference of the International Speech Communication Association","author":"McAuliffe Michael","year":"2017","unstructured":"Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal forced aligner: Trainable text-speech alignment using Kaldi. In Proceedings of the Annual Conference of the International Speech Communication Association(INTERSPEECH\u201917). 498\u2013502."},{"key":"e_1_3_4_24_1","first-page":"20406","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201922)","author":"Medina Salvador","year":"2022","unstructured":"Salvador Medina, Denis Tome, Carsten Stoll, Mark Tiede, Kevin Munhall, Alexander G. Hauptmann, and Iain Matthews. 2022. Speech driven tongue animation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201922). 20406\u201320416."},{"key":"e_1_3_4_25_1","first-page":"729","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201998)","author":"Morgan Nelson","year":"1998","unstructured":"Nelson Morgan and Eric Fosler-Lussier. 1998. Combining multiple estimators of speaking rate. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201998). IEEE, 729\u2013732."},{"key":"e_1_3_4_26_1","doi-asserted-by":"crossref","first-page":"2079","DOI":"10.21437\/Eurospeech.1997-550","volume-title":"Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH\u201997)","author":"Morgan Nelson","year":"1997","unstructured":"Nelson Morgan, Eric Fosler-Lussier, and Nikki Mirghafori. 1997. Speech recognition using on-line estimation of speaking rate. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH\u201997). Citeseer, 2079\u20132082."},{"key":"e_1_3_4_27_1","first-page":"1","volume-title":"Proceedings of the SIGGRAPH Asia Conference","author":"Pan Yifang","year":"2022","unstructured":"Yifang Pan, Chris Landreth, Eugene Fiume, and Karan Singh. 2022. VOCAL: Vowel and consonant layering for expressive animator-centric singing animation. In Proceedings of the SIGGRAPH Asia Conference. 1\u20139."},{"key":"e_1_3_4_28_1","first-page":"5206","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201915)","author":"Panayotov Vassil","year":"2015","unstructured":"Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201915). IEEE, 5206\u20135210."},{"key":"e_1_3_4_29_1","first-page":"945","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201998)","author":"Pfau Thilo","year":"1998","unstructured":"Thilo Pfau and G\u00fcnther Ruske. 1998. Estimating the speaking rate by vowel detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201998). IEEE, 945\u2013948."},{"key":"e_1_3_4_30_1","first-page":"296","volume-title":"Proceedings of the IEEE International Conference on Multimedia and Expo Workshops","author":"Philippou-H\u00fcbner David","year":"2012","unstructured":"David Philippou-H\u00fcbner, Bogdan Vlasenko, Ronald B\u00f6ck, and Andreas Wendemuth. 2012. The performance of the speaking rate parameter in emotion recognition from speech. In Proceedings of the IEEE International Conference on Multimedia and Expo Workshops. IEEE, 296\u2013301."},{"key":"e_1_3_4_31_1","article-title":"Accelerating 3D deep learning with PyTorch3D","author":"Ravi Nikhila","year":"2020","unstructured":"Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. 2020. Accelerating 3D deep learning with PyTorch3D. arXiv:2007.08501 (2020).","journal-title":"arXiv:2007.08501"},{"key":"e_1_3_4_32_1","first-page":"1173","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201921)","author":"Richard Alexander","year":"2021","unstructured":"Alexander Richard, Michael Zollh\u00f6fer, Yandong Wen, Fernando De la Torre, and Yaser Sheikh. 2021. MeshTalk: 3D face animation from speech using cross-modality disentanglement. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV\u201921). 1173\u20131182."},{"key":"e_1_3_4_33_1","doi-asserted-by":"crossref","first-page":"585","DOI":"10.1109\/TIFS.2022.3146783","article-title":"Everybody\u2019s talkin\u2019: Let me talk as you want","volume":"17","author":"Song Linsen","year":"2022","unstructured":"Linsen Song, Wayne Wu, Chen Qian, Ran He, and Chen Change Loy. 2022. Everybody\u2019s talkin\u2019: Let me talk as you want. IEEE Trans. Inf. Forens. Secur. 17 (2022), 585\u2013598.","journal-title":"IEEE Trans. Inf. Forens. Secur."},{"key":"e_1_3_4_34_1","doi-asserted-by":"crossref","first-page":"6098","DOI":"10.1007\/s00034-021-01754-1","article-title":"A robust speaking rate estimator using a CNN-BLSTM network","volume":"40","author":"Srinivasan Aparna","year":"2021","unstructured":"Aparna Srinivasan, Diviya Singh, Chiranjeevi Yarra, Aravind Illa, and Prasanta Kumar Ghosh. 2021. A robust speaking rate estimator using a CNN-BLSTM network. Circ., Syst., Signal Process. 40 (2021), 6098\u20136120.","journal-title":"Circ., Syst., Signal Process."},{"issue":"4","key":"e_1_3_4_35_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3072959.3073640","article-title":"Synthesizing Obama: Learning lip sync from audio","volume":"36","author":"Suwajanakorn Supasorn","year":"2017","unstructured":"Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: Learning lip sync from audio. ACM Trans. Graph. 36, 4 (2017), 1\u201313.","journal-title":"ACM Trans. Graph."},{"issue":"4","key":"e_1_3_4_36_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3072959.3073699","article-title":"A deep learning approach for generalized speech animation","volume":"36","author":"Taylor Sarah","year":"2017","unstructured":"Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A deep learning approach for generalized speech animation. ACM Trans. Graph. 36, 4 (2017), 1\u201311.","journal-title":"ACM Trans. Graph."},{"key":"e_1_3_4_37_1","first-page":"3037","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201914)","author":"Taylor Sarah","year":"2014","unstructured":"Sarah Taylor, Barry-John Theobald, and Iain Matthews. 2014. The effect of speaking rate on audio and visual speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201914). IEEE, 3037\u20133041."},{"key":"e_1_3_4_38_1","first-page":"716","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201920)","author":"Thies Justus","year":"2020","unstructured":"Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nie\u00dfner. 2020. Neural voice puppetry: Audio-driven facial reenactment. In Proceedings of the European Conference on Computer Vision (ECCV\u201920). Springer, 716\u2013731."},{"key":"e_1_3_4_39_1","first-page":"418","volume-title":"Proceedings of the 16th International Conference on Speech and Computer (SPECOM\u201914)","author":"Tomashenko Natalia","year":"2014","unstructured":"Natalia Tomashenko and Yuri Khokhlov. 2014. Speaking rate estimation based on deep neural networks. In Proceedings of the 16th International Conference on Speech and Computer (SPECOM\u201914). Springer, 418\u2013424."},{"key":"e_1_3_4_40_1","first-page":"2258","volume-title":"Proceedings of 4th International Conference on Spoken Language Processing (ICSLP\u201996)","author":"Verhasselt Jan P.","year":"1996","unstructured":"Jan P. Verhasselt and J.-P. Martens. 1996. A fast and reliable rate of speech detector. In Proceedings of 4th International Conference on Spoken Language Processing (ICSLP\u201996). IEEE, 2258\u20132261."},{"key":"e_1_3_4_41_1","first-page":"554","volume-title":"Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing","author":"Verhelst Werner","year":"1993","unstructured":"Werner Verhelst and Marc Roelands. 1993. An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 554\u2013557."},{"key":"e_1_3_4_42_1","volume-title":"Proceedings of the British Machine Vision Conference (BMVC\u201918)","author":"Vougioukas K.","year":"2018","unstructured":"K. Vougioukas, S. Petridis, and M. Pantic. 2018. End-to-end speech-driven facial animation with temporal GANs. In Proceedings of the British Machine Vision Conference (BMVC\u201918)."},{"issue":"8","key":"e_1_3_4_43_1","doi-asserted-by":"crossref","first-page":"2190","DOI":"10.1109\/TASL.2007.905178","article-title":"Robust speech rate estimation for spontaneous speech","volume":"15","author":"Wang Dagen","year":"2007","unstructured":"Dagen Wang and Shrikanth S. Narayanan. 2007. Robust speech rate estimation for spontaneous speech. IEEE Trans. Audio, Speech Lang. Process. 15, 8 (2007), 2190\u20132201.","journal-title":"IEEE Trans. Audio, Speech Lang. Process."},{"key":"e_1_3_4_44_1","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV\u201920)","author":"Wang Kaisiyuan","year":"2020","unstructured":"Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. 2020. MEAD: A large-scale audio-visual dataset for emotional talking-face generation. In Proceedings of the European Conference on Computer Vision (ECCV\u201920)."},{"key":"e_1_3_4_45_1","volume-title":"Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS\u201918)","author":"Wang Ting-Chun","year":"2018","unstructured":"Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. Video-to-video synthesis. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS\u201918)."},{"key":"e_1_3_4_46_1","first-page":"12780","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201923)","author":"Xing Jinbo","year":"2023","unstructured":"Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, and Tien-Tsin Wong. 2023. CodeTalker: Speech-driven 3D facial animation with discrete motion prior. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR\u201923). 12780\u201312790."},{"key":"e_1_3_4_47_1","article-title":"Multi-scale context aggregation by dilated convolutions","author":"Yu Fisher","year":"2015","unstructured":"Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).","journal-title":"arXiv preprint arXiv:1511.07122"},{"issue":"5","key":"e_1_3_4_48_1","doi-asserted-by":"crossref","first-page":"3878","DOI":"10.1121\/1.2935783","article-title":"Speaker identification on the SCOTUS corpus","volume":"123","author":"Yuan Jiahong","year":"2008","unstructured":"Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. J. Acoust. Societ. Amer. 123, 5 (2008), 3878.","journal-title":"J. Acoust. Societ. Amer."},{"issue":"6","key":"e_1_3_4_49_1","article-title":"MakeItTalk: Speaker-aware talking-head animation","volume":"39","author":"Zhou Yang","year":"2020","unstructured":"Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. 2020. MakeItTalk: Speaker-aware talking-head animation. ACM Trans. Graph. 39, 6 (2020).","journal-title":"ACM Trans. Graph."},{"issue":"4","key":"e_1_3_4_50_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3197517.3201292","article-title":"VisemeNet: Audio-driven animator-centric speech animation","volume":"37","author":"Zhou Yang","year":"2018","unstructured":"Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. 2018. VisemeNet: Audio-driven animator-centric speech animation. ACM Trans. Graph. 37, 4 (2018), 1\u201310.","journal-title":"ACM Trans. Graph."}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3691341","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3691341","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T17:49:56Z","timestamp":1750268996000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3691341"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10]]},"references-count":49,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,2,28]]}},"alternative-id":["10.1145\/3691341"],"URL":"https:\/\/doi.org\/10.1145\/3691341","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"type":"print","value":"0730-0301"},{"type":"electronic","value":"1557-7368"}],"subject":[],"published":{"date-parts":[[2024,10]]},"assertion":[{"value":"2023-10-11","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-06-20","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-10-01","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}