{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T14:40:43Z","timestamp":1754145643881,"version":"3.41.2"},"reference-count":50,"publisher":"European Alliance for Innovation n.o.","issue":"4","license":[{"start":{"date-parts":[[2025,7,17]],"date-time":"2025-07-17T00:00:00Z","timestamp":1752710400000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-sa\/4.0"}],"funder":[{"DOI":"10.13039\/501100010857","name":"Jiangxi Provincial Department of Science and Technology","doi-asserted-by":"publisher","award":["GJJ2202704"],"award-info":[{"award-number":["GJJ2202704"]}],"id":[{"id":"10.13039\/501100010857","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["ICST Transactions on Scalable Information Systems"],"abstract":"<jats:p>INTRODUCTION: In recent years, the generation of facial animation technology has emerged as a prominent area of focus within computer vision, achieving varying degrees of progress in lip-synchronization quality and emotion control.\nOBJECTIVES: However, existing research often compromises lip movements during facial expression generation, thereby diminishing lip synchronisation accuracy. This study proposes a multimodal, emotion-controlled facial animation generation model to address this challenge.\nMETHODS: The proposed model comprises two custom deep-learning networks arranged sequentially. By inputting an expressionless target portrait image, the model generates high-quality, lip-synchronized, and emotion-controlled facial videos driven by three modalities: audio, text, and emotional portrait images.\nRESULTS: In this framework, text features serve a critical supplementary function in predicting lip movements from audio input, thereby enhancing lip-synchronization quality.\nCONCLUSION: Experimental findings indicate that the proposed model achieves a reduction in lip feature coordinate distance (L-LD) of 5.93% and 33.52% compared to established facial animation generation methods, such as MakeItTalk and the Emotion-Aware Motion Model (EAMM), and a decrease in facial feature coordinate distance (F-LD) of 7.00% and 8.79%. These results substantiate the efficacy of the proposed model in generating high-quality, lip-synchronized, and emotion-controlled facial animations.<\/jats:p>","DOI":"10.4108\/eetsis.7624","type":"journal-article","created":{"date-parts":[[2025,7,17]],"date-time":"2025-07-17T08:03:32Z","timestamp":1752739412000},"source":"Crossref","is-referenced-by-count":0,"title":["Multimodal-Driven Emotion-Controlled Facial Animation Generation Model"],"prefix":"10.4108","volume":"12","author":[{"given":"Zhenyu","family":"Qiu","sequence":"first","affiliation":[]},{"given":"Yuting","family":"Luo","sequence":"additional","affiliation":[]},{"given":"Yiren","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Teng","family":"Gao","sequence":"additional","affiliation":[]}],"member":"2587","published-online":{"date-parts":[[2025,7,17]]},"reference":[{"key":"167084","unstructured":"[1] Liu, M. Y., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. Proceedings of the 2017 Neural Information Processing Systems, 1(1), No. 30."},{"key":"167085","doi-asserted-by":"crossref","unstructured":"[2] Zhu, J. Y., Park, T., Isola, P., et al. (2017). 
Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the 2017 IEEE International Conference on Computer Vision, 1(1), 2223-2232. https:\/\/doi.org\/10.1109\/ICCV.2017.244.","DOI":"10.1109\/ICCV.2017.244"},{"key":"167086","doi-asserted-by":"crossref","unstructured":"[3] Isola, P., Zhu, J. Y., Zhou, T., et al. (2017). Image-to-image translation with conditional adversarial networks. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 1(1), 1125-1134. https:\/\/doi.org\/10.1109\/CVPR.2017.632.","DOI":"10.1109\/CVPR.2017.632"},{"key":"167087","doi-asserted-by":"crossref","unstructured":"[4] Wang, K. F., Gou, C., Duan, Y. J., et al. (2017). Generative adversarial networks: the state of the art and beyond. Acta Automatica Sinica, 43(3), 321-332. https:\/\/doi.org\/10.1016\/j.automatica.2017.07.001.","DOI":"10.1016\/j.automatica.2017.07.001"},{"key":"167088","doi-asserted-by":"crossref","unstructured":"[5] Sha, T., Zhang, W., Shen, T., et al. (2023). Deep person generation: a survey from the face, pose, and cloth synthesis perspective. ACM Computing Surveys, 55(12), 1-37. https:\/\/doi.org\/10.1145\/3575656.","DOI":"10.1145\/3575656"},{"key":"167089","unstructured":"[6] Chen, L., Cui, G., Kou, Z., et al. (2023). What comprises a good talking-head video generation?: A survey and benchmark. arXiv. [EB\/OL]. [2023-03-18]. https:\/\/arxiv.org\/pdf\/2005.03201."},{"key":"167090","doi-asserted-by":"crossref","unstructured":"[7] Zhu, H., Luo, M. D., Wang, R., et al. (2021). Deep audio-visual learning: A survey. International Journal of Automation and Computing, 18, 351-376. https:\/\/doi.org\/10.1007\/s11633-021-1293-0.","DOI":"10.1007\/s11633-021-1293-0"},{"key":"167091","unstructured":"[8] Jia, Z., Zhang, Z., Wang, L., et al. (2023). Human image generation: a comprehensive survey. arXiv. [EB\/OL]. [2023-05-20]. https:\/\/arxiv.org\/ftp\/arxiv\/papers\/2212\/2212.08896."},{"key":"167092","unstructured":"[9] Song, X. Y., Yan, Z. Y., Sun, M. Y., et al. (2023). Current status and development trend of speaker generation research. Computer Science, 50(08), 68-78."},{"key":"167093","unstructured":"[10] Liu, J., Li, Y., & Zhu, J. P. (2021). Generating 3D virtual human animation based on dual camera capturing facial expression and human posture. Journal of Computer Applications, 41(03), 839-844."},{"key":"167094","unstructured":"[11] Xia, Z. P., & Liu, G. P. (2016). Design and realisation of virtual teachers for operating guide in the 3D virtual learning environment. China Educational Technology, (5), 98-103."},{"key":"167095","unstructured":"[12] Zhou, W. B., Zhang, W. M., Yu, N. H., et al. (2021). An overview of deepfake forgery and defence techniques. Journal of Signal Processing, 37(12), 2338-2355. https:\/\/doi.org\/10.1109\/JSP.2021.9666106."},{"key":"167096","unstructured":"[13] Song, Y. F., Zhang, W., Chen, S. N., et al. (2023). A review of digital speaker video generation. Journal of Computer-Aided Design & Computer Graphics, 1(12), 1-12. [2023-11-29]. http:\/\/kns.cnki.net\/kcms\/detail\/11.2925.tp.20231109.1024.002.html."},{"key":"167097","doi-asserted-by":"crossref","unstructured":"[14] Ji, X., Zhou, H., Wang, K., et al. (2021). Audio-driven emotional video portraits. Proceedings of the 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 1(1), 14080-14089. 
https:\/\/doi.org\/10.1109\/CVPR46437.2021.01386.","DOI":"10.1109\/CVPR46437.2021.01386"},{"key":"167098","doi-asserted-by":"crossref","unstructured":"[15] Liang, B., Pan, Y., Guo, Z., et al. (2022). Expressive talking head generation with granular audio-visual control. Proceedings of the 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 1(1), 3387-3396. https:\/\/doi.org\/10.1109\/CVPR52688.2022.00338.","DOI":"10.1109\/CVPR52688.2022.00338"},{"key":"167099","doi-asserted-by":"crossref","unstructured":"[16] Song, L., Wu, W., Qian, C., et al. (2022). Everybody\u2019s talkin\u2019: Let me talk as you want. IEEE Transactions on Information Forensics and Security, 17, 585-598. https:\/\/doi.org\/10.1109\/TIFS.2022.3146783.","DOI":"10.1109\/TIFS.2022.3146783"},{"key":"167100","doi-asserted-by":"crossref","unstructured":"[17] Thies, J., Elgharib, M., Tewari, A., et al. (2020). Neural voice puppetry: Audio-driven facial reenactment. Proceedings of the 16th European Conference on Computer Vision, 1(1), 716-731. https:\/\/doi.org\/10.1007\/978-3-030-58517-4_42.","DOI":"10.1007\/978-3-030-58517-4_42"},{"key":"167101","doi-asserted-by":"crossref","unstructured":"[18] Wen, X., Wang, M., Richardt, C., et al. (2020). Photorealistic audio-driven video portraits. IEEE Transactions on Visualization and Computer Graphics, 26(12), 3457-3466. https:\/\/doi.org\/10.1109\/TVCG.2020.3023573.","DOI":"10.1109\/TVCG.2020.3023573"},{"key":"167102","doi-asserted-by":"crossref","unstructured":"[19] Chen, L., Maddox, R. K., Duan, Z., et al. (2019). Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 1(1), 7832-7841. https:\/\/doi.org\/10.1109\/CVPR.2019.00802.","DOI":"10.1109\/CVPR.2019.00802"},{"key":"167103","unstructured":"[20] Song, Y., Zhu, J., Li, D., et al. (2023). Talking face generation by conditional recurrent adversarial network. arXiv. [EB\/OL]. [2023-04-07]. https:\/\/arxiv.org\/pdf\/1804.04786."},{"key":"167104","doi-asserted-by":"crossref","unstructured":"[21] Zhou, Y., Han, X., Shechtman, E., et al. (2020). MakeItTalk: Speaker-aware talking-head animation. ACM Transactions on Graphics, 39(6), 1-15. https:\/\/doi.org\/10.1145\/3414685.3417774.","DOI":"10.1145\/3414685.3417774"},{"key":"167105","doi-asserted-by":"crossref","unstructured":"[22] Fang, Z., Liu, Z., Liu, T., et al. (2022). Facial expression GAN for voice-driven face generation. The Visual Computer, 38, 1-14. https:\/\/doi.org\/10.1007\/s00371-021-02074-w.","DOI":"10.1007\/s00371-021-02074-w"},{"key":"167106","doi-asserted-by":"crossref","unstructured":"[23] Eskimez, S. E., Zhang, Y., & Duan, Z. (2021). Speech-driven talking face generation from a single image and an emotional condition. IEEE Transactions on Multimedia, 24, 3480-3490. https:\/\/doi.org\/10.1109\/TMM.2021.3099900.","DOI":"10.1109\/TMM.2021.3099900"},{"key":"167107","doi-asserted-by":"crossref","unstructured":"[24] Ji, X., Zhou, H., Wang, K., et al. (2022). EAMM: One-shot emotional talking face via audio-based emotion-aware motion model. ACM SIGGRAPH 2022 Conference Proceedings, 1(1), 1-10. https:\/\/doi.org\/10.1145\/3528233.3530745.","DOI":"10.1145\/3528233.3530745"},{"key":"167108","doi-asserted-by":"crossref","unstructured":"[25] Zhen, R., Song, W., He, Q., et al. (2023). Human-computer interaction system: A survey of talking-head generation. Electronics, 12(1), 218. 
https:\/\/doi.org\/10.3390\/electronics12010218.","DOI":"10.3390\/electronics12010218"},{"key":"167109","unstructured":"[26] Ma, Y., Wang, S., Hu, Z., et al. (2023). StyleTalk: One-shot talking head generation with controllable speaking styles. arXiv. [EB\/OL]. [2023-07-21]. https:\/\/arxiv.org\/pdf\/2301.01081."},{"key":"167110","doi-asserted-by":"crossref","unstructured":"[27] Sun, Y., Zhou, H., Wang, K., et al. (2022). Masked lip-sync prediction by audio-visual contextual exploitation in transformers. SIGGRAPH Asia 2022 Conference Papers, 1(1), 1-9. https:\/\/doi.org\/10.1145\/3550469.3555393.","DOI":"10.1145\/3550469.3555393"},{"key":"167111","unstructured":"[28] Wang, H., & Xia, S. H. (2015). Semantic blend shape method for video-driven facial animation. Journal of Computer-Aided Design & Computer Graphics, 27(5), 873-882. https:\/\/doi.org\/10.11919\/j.ijcgg.2015.05.015."},{"key":"167112","unstructured":"[29] Yang, S., Fan, B., Xie, L., et al. (2020). Speech-driven video-realistic talking head animation using 3D AAM. Proceedings of the 2020 IEEE International Conference on Robotics and Biomimetics, 1(1), 1511-1516. https:\/\/doi.org\/10.1109\/ROBIO49542.2020.9298980."},{"key":"167113","unstructured":"[30] Blais, A., & Ghosh, S. (2020). Review of deep learning methods in image-to-image translation. Journal of Computer Science, 10(2), 150-159. https:\/\/doi.org\/10.3844\/jcssp.2020.150.159."},{"key":"167114","unstructured":"[31] Chen, H., & Zhang, Y. (2023). A survey of 3D face reconstruction from a single image. The Visual Computer, 39(3), 533-547. https:\/\/doi.org\/10.1007\/s00371-022-02492-5."},{"key":"167115","unstructured":"[32] Zhang, Z., Liu, X., & Yang, C. (2022). Talking head video generation via audio-driven full-face synthesis. ACM Transactions on Graphics, 41(1), 1-14. https:\/\/doi.org\/10.1145\/3508358."},{"key":"167116","unstructured":"[33] Xu, H., Wang, T., & Wang, C. (2023). Exploring human-robot interaction through facial animation generation. Journal of Human-Robot Interaction, 12(4), 29-42. https:\/\/doi.org\/10.1145\/3585756."},{"key":"167117","unstructured":"[34] Kim, H., Lee, J., & Park, J. (2021). A novel approach for deep learning-based audio-visual synthesis. Journal of Multimedia Processing and Technologies, 12(4), 1-12. https:\/\/doi.org\/10.13189\/jmpt.2021.120401."},{"key":"167118","unstructured":"[35] Tan, Z., Luo, M., & Sun, X. (2022). Real-time facial animation based on audio-visual synthesis. IEEE Access, 10, 7992-8001. https:\/\/doi.org\/10.1109\/ACCESS.2022.3145763."},{"key":"167119","unstructured":"[36] Liu, M., Zhang, T., & Liu, Y. (2023). Face and voice synchronization in audio-visual speech synthesis: A survey. IEEE Transactions on Affective Computing, 14(3), 993-1009. https:\/\/doi.org\/10.1109\/TAFFC.2022.3146391."},{"key":"167120","unstructured":"[37] Zhang, H., & Zhao, J. (2023). A review of facial animation technology based on audio information. Journal of Computer Graphics Techniques, 12(1), 45-65. https:\/\/doi.org\/10.22059\/JGTT.2023.344723.1006673."},{"key":"167121","doi-asserted-by":"crossref","unstructured":"[38] Yang, X., Zhang, L., & Wang, X. (2022). Lip-sync generation for audio-driven talking head video. ACM Transactions on Intelligent Systems and Technology, 14(3), 1-24. https:\/\/doi.org\/10.1145\/3485129.","DOI":"10.1145\/3485129"},{"key":"167122","unstructured":"[39] Guo, Q., Liu, Y., & He, D. (2022). Lip-sync audio-visual synthesis based on generative adversarial networks. 
IEEE Transactions on Image Processing, 31, 2178-2191. https:\/\/doi.org\/10.1109\/TIP.2022.3146802."},{"key":"167123","unstructured":"[40] Ren, J., Xu, C., & Li, Y. (2023). Advances in audio-visual speech synthesis for digital humans. Journal of Digital Human Research, 2(1), 50-70. https:\/\/doi.org\/10.1007\/s42087-023-00017-y."},{"key":"167124","unstructured":"[41] Wang, J., Yu, Y., & Huang, Z. (2023). Multimodal learning for facial expression recognition: A comprehensive survey. International Journal of Computer Vision, 131(2), 211-236. https:\/\/doi.org\/10.1007\/s11263-022-01680-1."},{"key":"167125","doi-asserted-by":"crossref","unstructured":"[42] Espino-Salinas, C. H., Luna-Garc\u00eda, H., Celaya-Padilla, J. M., Barr\u00eda-Huidobro, C., Gamboa Rosales, N. K., Rondon, D., & Villalba-Condori, K. O. (2024). Multimodal driver emotion recognition using motor activity and facial expressions. Frontiers in Artificial Intelligence, 7, 1467051. https:\/\/doi.org\/10.3389\/frai.2024.1467051","DOI":"10.3389\/frai.2024.1467051"},{"key":"167126","doi-asserted-by":"crossref","unstructured":"[43] Huang, Y., Chen, Z., & Zhang, L. (2022). Enhancing facial expression synthesis through attention-based generative networks. Computer Animation and Virtual Worlds, 33(6), e2180. https:\/\/doi.org\/10.1002\/cav.2076.","DOI":"10.1002\/cav.2076"},{"key":"167127","unstructured":"[44] Li, J., Zhao, H., & Xu, Y. (2021). Video-driven expressive talking head generation: Recent advances and challenges. ACM Transactions on Graphics, 40(4), 1-15. https:\/\/doi.org\/10.1145\/3462935."},{"key":"167128","unstructured":"[45] Wu, W., Chen, H., & Zhang, Y. (2023). A comprehensive review of multimodal emotion recognition systems. Artificial Intelligence Review, 56(3), 2473-2497. https:\/\/doi.org\/10.1007\/s10462-022-10124-2."},{"key":"167129","unstructured":"[46] Zhang, Y., Liu, F., & Zhang, H. (2022). Voice-driven facial expression synthesis based on deep learning techniques. Journal of Signal Processing, 38(7), 1234-1248. https:\/\/doi.org\/10.1109\/JSP.2022.3148221."},{"key":"167130","unstructured":"[47] Zhou, L., Wang, J., & Hu, L. (2023). Real-time facial animation from speech: A review of the state-of-the-art. IEEE Transactions on Computational Imaging, 9, 1234-1247. https:\/\/doi.org\/10.1109\/TCI.2023.3149796."},{"key":"167131","doi-asserted-by":"crossref","unstructured":"[48] Li, P., Zhao, H., Liu, Q., Tang, P., & Zhang, L. (2024). TellMeTalk: Multimodal-driven talking face video generation. Computers and Electrical Engineering, 114, 109049. https:\/\/doi.org\/10.1016\/j.compeleceng.2023.109049","DOI":"10.1016\/j.compeleceng.2023.109049"},{"key":"167132","unstructured":"[49] Wang, B., Zhu, X., Shen, F., Xu, H., & Lei, Z. (2025). PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation. arXiv preprint arXiv:2503.14295. https:\/\/doi.org\/10.48550\/arXiv.2503.14295"},{"key":"167133","doi-asserted-by":"crossref","unstructured":"[50] Song, H., & Kwon, B. (2024). Facial Animation Strategies for Improved Emotional Expression in Virtual Reality. Electronics, 13(13), 2601. 
https:\/\/doi.org\/10.3390\/electronics13132601","DOI":"10.3390\/electronics13132601"}],"container-title":["ICST Transactions on Scalable Information Systems"],"original-title":[],"link":[{"URL":"https:\/\/publications.eai.eu\/index.php\/sis\/article\/download\/7624\/3640","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/publications.eai.eu\/index.php\/sis\/article\/download\/7624\/3640","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,17]],"date-time":"2025-07-17T08:03:49Z","timestamp":1752739429000},"score":1,"resource":{"primary":{"URL":"https:\/\/publications.eai.eu\/index.php\/sis\/article\/view\/7624"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,17]]},"references-count":50,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,7,15]]}},"URL":"https:\/\/doi.org\/10.4108\/eetsis.7624","relation":{},"ISSN":["2032-9407"],"issn-type":[{"value":"2032-9407","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,17]]}}}
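
For readers who want to work with this record programmatically, a minimal Python sketch follows (an illustration, not part of the deposited record). It fetches the same work from the public Crossref REST API (https://api.crossref.org/works/{doi}) and prints each reference's Crossref-matched DOI, falling back to the raw citation string where no DOI was matched. It assumes the third-party requests package; the field names follow the structure shown above.

import requests

DOI = "10.4108/eetsis.7624"  # DOI of the work record shown above

# Crossref exposes work metadata at https://api.crossref.org/works/{doi}
resp = requests.get(f"https://api.crossref.org/works/{DOI}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]  # same object as the "message" field above

print(work["title"][0])                      # article title
print(work["references-count"], "references")
for ref in work.get("reference", []):
    # Entries with "doi-asserted-by" carry a Crossref-matched DOI;
    # the rest only have the authors' unstructured citation text.
    print(ref.get("DOI") or ref.get("unstructured", "")[:80])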