{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T18:51:38Z","timestamp":1774637498313,"version":"3.50.1"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"4","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2025,8,1]]},"abstract":"<jats:p>The training of high-quality, robust machine learning models for speech-driven 3D facial animation requires a large, diverse dataset of high-quality audio-animation pairs. To overcome the lack of such a dataset, recent work has introduced large pre-trained speech encoders that are robust to variations in the input audio and, therefore, enable the facial animation model to generalize across speakers, audio quality, and languages. However, the resulting facial animation models are prohibitively large and lend themselves only to offline inference on a dedicated machine. In this work, we explore on-device, real-time facial animation models in the context of game development. We overcome the lack of large datasets by using hybrid knowledge distillation with pseudo-labeling. Given a large audio dataset, we employ a high-performing teacher model to train very small student models. In contrast to the pre-trained speech encoders, our student models only consist of convolutional and fully-connected layers, removing the need for attention context or recurrent updates. In our experiments, we demonstrate that we can reduce the memory footprint to up to 3.4 MB and required future audio context to up to 81 ms while maintaining high-quality animations. This paves the way for on-device inference, an important step towards realistic, model-driven digital characters.<\/jats:p>","DOI":"10.1145\/3730929","type":"journal-article","created":{"date-parts":[[2025,7,27]],"date-time":"2025-07-27T04:02:22Z","timestamp":1753588942000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Tiny is not small enough: High quality, low-resource facial animation models through hybrid knowledge distillation"],"prefix":"10.1145","volume":"44","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-0109-5615","authenticated-orcid":false,"given":"Zhen","family":"Han","sequence":"first","affiliation":[{"name":"Electronic Arts, Stockholm, Sweden"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0551-2817","authenticated-orcid":false,"given":"Mattias","family":"Teye","sequence":"additional","affiliation":[{"name":"Electronic Arts, Stockholm, Sweden"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-4036-6642","authenticated-orcid":false,"given":"Derek","family":"Yadgaroff","sequence":"additional","affiliation":[{"name":"Electronic Arts, Stockholm, Sweden"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5344-8042","authenticated-orcid":false,"given":"Judith","family":"B\u00fctepage","sequence":"additional","affiliation":[{"name":"Electronic Arts, Stockholm, Sweden"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,7,27]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02009"},{"key":"e_1_2_2_2_1","volume-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33 (2020), 12449\u201312460."},{"key":"e_1_2_2_3_1","volume-title":"Engin Erzin, Tanju Erdem, and Mehmet Ozkan.","author":"Bozkurt Elif","year":"2007","unstructured":"Elif Bozkurt, Cigdem Eroglu Erdem, Engin Erzin, Tanju Erdem, and Mehmet Ozkan. 2007. Comparison of phoneme and viseme based acoustic units for speech driven realistic lip animation. In 2007 3DTV Conference. IEEE, 1\u20134."},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/311535.311537"},{"key":"e_1_2_2_5_1","unstructured":"ITU-R Recommendation BT et al. 1998. Relative timing of sound and vision for broadcasting. Relative timing of sound and vision for broadcasting \" Nov (1998)."},{"key":"e_1_2_2_6_1","volume-title":"International Conference on Pattern Recognition Applications and Methods","volume":"2","author":"Cappelletta Luca","year":"2012","unstructured":"Luca Cappelletta and Naomi Harte. 2012. Phoneme-to-viseme mapping for visual speech recognition. In International Conference on Pattern Recognition Applications and Methods, Vol. 2. SciTePress, 322\u2013329."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01034"},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3610548.3618183"},{"key":"e_1_2_2_9_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 18770\u201318780","author":"Fan Yingruo","year":"2022","unstructured":"Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, and Taku Komura. 2022. Face-former: Speech-driven 3d facial animation with transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 18770\u201318780."},{"key":"e_1_2_2_10_1","volume-title":"Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling. arXiv preprint arXiv:2311.00430","author":"Gandhi Sanchit","year":"2023","unstructured":"Sanchit Gandhi, Patrick von Platen, and Alexander M Rush. 2023. Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling. arXiv preprint arXiv:2311.00430 (2023)."},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-021-01453-z"},{"key":"e_1_2_2_12_1","volume-title":"Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567","author":"Hannun A","year":"2014","unstructured":"A Hannun. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014)."},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3577190.3614157"},{"key":"e_1_2_2_14_1","volume-title":"Distilling the Knowledge in a Neural Network. ArXiv abs\/1503.02531","author":"Hinton Geoffrey E.","year":"2015","unstructured":"Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. ArXiv abs\/1503.02531 (2015)."},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073663"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2002.1021892"},{"key":"e_1_2_2_17_1","volume-title":"Kushal Lakhotia","author":"Hsu Wei-Ning","year":"2021","unstructured":"Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE\/ACM transactions on audio, speech, and language processing 29 (2021), 3451\u20133460."},{"key":"e_1_2_2_18_1","unstructured":"Janet Jeffers and Margaret Barley. 1971. Speechreading (lipreading). (No Title) (1971)."},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073658"},{"key":"e_1_2_2_20_1","volume-title":"MLSys Workshop on On-Device Intelligence (ODIW)","author":"Kim Bo-Kyeong","year":"2023","unstructured":"Bo-Kyeong Kim, Jaemin Kang, Daeun Seo, Hancheol Park, Shinkook Choi, Hyoung-Kyu Song, Hyungshin Kim, and Sungsu Lim. 2023. A Unified Compression Framework for Efficient Speech-Driven Talking-Face Generation. MLSys Workshop on On-Device Intelligence (ODIW) (2023)."},{"key":"e_1_2_2_21_1","volume-title":"International Conference on Learning Representations (ICLR).","author":"Kingma Diederick P","year":"2015","unstructured":"Diederick P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_22_1","volume-title":"Workshop on challenges in representation learning, ICML","volume":"3","author":"Dong-Hyun","unstructured":"Dong-Hyun Lee et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3. Atlanta, 896."},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3450623.3464665"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00696"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3478513.3480484"},{"key":"e_1_2_2_26_1","volume-title":"Vikrant Singh Tomar, and Yoshua Bengio","author":"Lugosch Loren","year":"2019","unstructured":"Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio. 2019. Speech model pre-training for end-to-end spoken language understanding. arXiv preprint arXiv:1904.03670 (2019)."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2014.6854467"},{"key":"e_1_2_2_28_1","doi-asserted-by":"crossref","unstructured":"Mathieu Marquis Bolduc and Hau Nghiep Phan. 2022. Rig Inversion by Training a Differentiable Rig Function. In SIGGRAPH Asia 2022 Technical Communications. 1\u20134.","DOI":"10.1145\/3550340.3564218"},{"key":"e_1_2_2_29_1","volume-title":"Talking us into the Metaverse: Towards Realistic Streaming Speech-to-Face Animation. Ph. D. Dissertation","author":"Medina Salvador","unstructured":"Salvador Medina. 2024. Talking us into the Metaverse: Towards Realistic Streaming Speech-to-Face Animation. Ph. D. Dissertation. Carnegie Mellon University."},{"key":"e_1_2_2_30_1","volume-title":"PhISANet: Phonetically Informed Speech Animation Network. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8225\u20138229","author":"Medina Salvador","year":"2024","unstructured":"Salvador Medina, Sarah L Taylor, Carsten Stoll, Gareth Edwards, Alex Hauptmann, Shinji Watanabe, and Iain Matthews. 2024a. PhISANet: Phonetically Informed Speech Animation Network. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8225\u20138229."},{"key":"e_1_2_2_31_1","doi-asserted-by":"crossref","unstructured":"Salvador Medina Sarah L Taylor Carsten Stoll Gareth Edwards Alex Hauptmann Shinji Watanabe and Iain Matthews. 2024b. Phonetically Informed Speech Animation Network Project Page. https:\/\/github.com\/salmedina\/PhISANet?tab=readme-ov-file","DOI":"10.1109\/ICASSP48485.2024.10448411"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i04.5963"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/IUCS.2010.5666250"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3623264.3624451"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01252-6_17"},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00241"},{"key":"e_1_2_2_38_1","volume-title":"PyTorch: An Imperative Style","author":"Paszke Adam","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 8024\u20138035. http:\/\/papers.neurips.cc\/paper\/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2020.3022017"},{"key":"e_1_2_2_40_1","volume-title":"Speaker recognition from raw waveform with sincnet. In 2018 IEEE spoken language technology workshop (SLT)","author":"Ravanelli Mirco","unstructured":"Mirco Ravanelli and Yoshua Bengio. 2018. Speaker recognition from raw waveform with sincnet. In 2018 IEEE spoken language technology workshop (SLT). IEEE, 1021\u20131028."},{"key":"e_1_2_2_41_1","doi-asserted-by":"crossref","unstructured":"Yuxi Ren Jie Wu Xuefeng Xiao and Jianchao Yang. 2021. Online Multi-Granularity Distillation for GAN Compression. (2021) 6773\u20136783.","DOI":"10.1109\/ICCV48922.2021.00672"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV48630.2021.00009"},{"key":"e_1_2_2_43_1","volume-title":"FitNets: Hints for Thin Deep Nets. In International Conference on Learning Representations (ICLR).","author":"Romero Adriana","year":"2015","unstructured":"Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2015. FitNets: Hints for Thin Deep Nets. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3623264.3624447"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3658221"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV57701.2024.00628"},{"key":"e_1_2_2_47_1","volume-title":"Real-time neural radiance talking portrait synthesis via audio-spatial decomposition. arXiv preprint arXiv:2211.12368","author":"Tang Jiaxiang","year":"2022","unstructured":"Jiaxiang Tang, Kaisiyuan Wang, Hang Zhou, Xiaokang Chen, Dongliang He, Tianshu Hu, Jingtuo Liu, Gang Zeng, and Jingdong Wang. 2022. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition. arXiv preprint arXiv:2211.12368 (2022)."},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073699"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01885"},{"key":"e_1_2_2_50_1","volume-title":"Proceedings, Part XVI 16","author":"Thies Justus","year":"2020","unstructured":"Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nie\u00dfner. 2020. Neural voice puppetry: Audio-driven facial reenactment. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part XVI 16. Springer, 716\u2013731."},{"key":"e_1_2_2_51_1","volume-title":"Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP","author":"V\u00e5squez-Correa Juan Camilo","year":"2024","unstructured":"Juan Camilo V\u00e5squez-Correa, Santiago Moreno-Acevedo, Ander Gonzalez-Docasal, Aritz Lasarguren, Jone L\u00f2pez, Egoitz Rodriguez, and Aitor \u00c1lvarez. 2024. Real-Time Speech-Driven Avatar Animation by Predicting Facial landmarks and Deformation Blendshapes. In Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024). 109\u2013118."},{"key":"e_1_2_2_52_1","volume-title":"Mattias Teye, and Konrad Tollmar.","author":"Aylagas Monica Villanueva","year":"2022","unstructured":"Monica Villanueva Aylagas, Hector Anadon Leon, Mattias Teye, and Konrad Tollmar. 2022. Voice2Face: Audio-driven Facial and Tongue Rig Animations with cVAEs. In Computer Graphics Forum, Vol. 41. Wiley Online Library, 255\u2013265."},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-2066"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1016\/S1007-0214(05)70048-1"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01229"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBC.2008.2002102"},{"key":"e_1_2_2_57_1","unstructured":"Sergey Zagoruyko and Nikos Komodakis. 2017. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. In ICLR."},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3641519.3657413"},{"key":"e_1_2_2_59_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3197517.3201292","article-title":"Visemenet: Audio-driven animator-centric speech animation","volume":"37","author":"Zhou Yang","year":"2018","unstructured":"Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. 2018. Visemenet: Audio-driven animator-centric speech animation. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1\u201310.","journal-title":"ACM Transactions on Graphics (TOG)"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3730929","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T17:51:01Z","timestamp":1774633861000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3730929"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,27]]},"references-count":59,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2025,8,1]]}},"alternative-id":["10.1145\/3730929"],"URL":"https:\/\/doi.org\/10.1145\/3730929","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,27]]},"assertion":[{"value":"2025-01-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}