{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T09:12:53Z","timestamp":1774602773404,"version":"3.50.1"},"reference-count":104,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2026,1,24]],"date-time":"2026-01-24T00:00:00Z","timestamp":1769212800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"},{"start":{"date-parts":[[2026,1,24]],"date-time":"2026-01-24T00:00:00Z","timestamp":1769212800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2026,3]]},"DOI":"10.1007\/s11263-026-02752-z","type":"journal-article","created":{"date-parts":[[2026,1,24]],"date-time":"2026-01-24T17:03:53Z","timestamp":1769274233000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion 
Transformers"],"prefix":"10.1007","volume":"134","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0589-4424","authenticated-orcid":false,"given":"Yasheng","family":"Sun","sequence":"first","affiliation":[]},{"given":"Zhiliang","family":"Xu","sequence":"additional","affiliation":[]},{"given":"Hang","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Jiazhi","family":"Guan","sequence":"additional","affiliation":[]},{"given":"Quanwei","family":"Yang","sequence":"additional","affiliation":[]},{"given":"Kaisiyuan","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Borong","family":"Liang","sequence":"additional","affiliation":[]},{"given":"Yingying","family":"Li","sequence":"additional","affiliation":[]},{"given":"Haocheng","family":"Feng","sequence":"additional","affiliation":[]},{"given":"Jingdong","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Ziwei","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Koike","family":"Hideki","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,1,24]]},"reference":[{"issue":"2","key":"2752_CR1","doi-asserted-by":"publisher","first-page":"427","DOI":"10.1007\/s12525-020-00414-7","volume":"31","author":"M Adam","year":"2021","unstructured":"Adam, M., Wessel, M., & Benlian, A. (2021). Ai-based chatbots in customer service and their effects on user compliance. Electronic Markets,31(2), 427\u2013445.","journal-title":"Electronic Markets"},{"key":"2752_CR2","doi-asserted-by":"crossref","unstructured":"Ahuja, C., Lee D.W., Ishii, R., & Morency, L.-P. (2020a). No gestures left behind: Learning relationships between spoken language and freeform gestures. 
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp 1884\u20131895","DOI":"10.18653\/v1\/2020.findings-emnlp.170"},{"key":"2752_CR3","doi-asserted-by":"crossref","unstructured":"Ahuja, C., Lee, D.W., Nakano, Y.I., & Morency, L.-P. (2020b). Style transfer for co-speech gesture animation: A multi-speaker conditional-mixture approach. In: Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part XVIII 16, Springer, pp 248\u2013265","DOI":"10.1007\/978-3-030-58523-5_15"},{"issue":"4","key":"2752_CR4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3592097","volume":"42","author":"T Ao","year":"2023","unstructured":"Ao, T., Zhang, Z., & Liu, L. (2023). Gesturediffuclip: Gesture diffusion model with clip latents. ACM Transactions on Graphics (TOG),42(4), 1\u201318.","journal-title":"ACM Transactions on Graphics (TOG)"},{"key":"2752_CR5","first-page":"12449","volume":"33","author":"A Baevski","year":"2020","unstructured":"Baevski, A., Zhou, Y., Mohamed, A., et al. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems,33, 12449\u201312460.","journal-title":"Advances in neural information processing systems"},{"key":"2752_CR6","doi-asserted-by":"crossref","unstructured":"Balaji, Y., Min, M.R., Bai, B., Chellappa, R., & Graf, H.P. (2019). Conditional gan with discriminative filter generation for text-to-video synthesis. In: IJCAI, 2","DOI":"10.24963\/ijcai.2019\/276"},{"key":"2752_CR7","doi-asserted-by":"crossref","unstructured":"Bar-Tal, O., Chefer, H., Tov, O. et\u00a0al. (2024). Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945","DOI":"10.1145\/3680528.3687614"},{"key":"2752_CR8","doi-asserted-by":"crossref","unstructured":"Bhattacharya U, Rewkowski N, Banerjee A, Guhan, P., Bera, A., & Manocha, D. (2021). 
Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents. In: 2021 IEEE virtual reality and 3D user interfaces (VR), IEEE, pp 1\u201310","DOI":"10.1109\/VR50410.2021.00037"},{"key":"2752_CR9","unstructured":"Blattmann, A., Dockhorn, T., Kulal, S. et\u00a0al. (2023). Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127"},{"key":"2752_CR10","doi-asserted-by":"crossref","unstructured":"Cai, Z., Yin, W., Zeng, A., Wei, C., Sun, Q., Yanjun, W., Pang, H.E., Mei, H., Zhang, M., Zhang, L., Loy, C.C., Yang, L., & Liu, Z. (2023). SMPLer-X: Scaling up expressive human pose and shape estimation. In: Advances in Neural Information Processing Systems","DOI":"10.52202\/075280-0506"},{"key":"2752_CR11","doi-asserted-by":"crossref","unstructured":"Cao, Z., Simon, T., Wei, S.E., & Sheikh, Y. (2017). Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7291\u20137299","DOI":"10.1109\/CVPR.2017.143"},{"key":"2752_CR12","doi-asserted-by":"crossref","unstructured":"Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, T., Douville, B., Prevost, S., & Stone, M. (1994). Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In: Proceedings of the 21st annual conference on Computer graphics and interactive techniques, pp 413\u2013420","DOI":"10.1145\/192161.192272"},{"key":"2752_CR13","doi-asserted-by":"crossref","unstructured":"Cassell, J., Vilhj\u00e1lmsson, H.H., & Bickmore, T. (2001). Beat: the behavior expression animation toolkit. 
In: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp 477\u2013486","DOI":"10.1145\/383259.383315"},{"key":"2752_CR14","doi-asserted-by":"crossref","unstructured":"Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., & Yu, G. (2023). Executing your commands via motion diffusion in latent space. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 18000\u201318010","DOI":"10.1109\/CVPR52729.2023.01726"},{"key":"2752_CR15","doi-asserted-by":"crossref","unstructured":"Cheng, K., Cun, X., Zhang, Y., Menghan, X., Fei, Y., Mingrui, Z., Xuan, W., Jue, W., & Nannan, W. (2022). Videoretalking: Audio-based lip synchronization for talking head video editing in the wild. arXiv:2211.14758","DOI":"10.1145\/3550469.3555399"},{"key":"2752_CR16","doi-asserted-by":"crossref","unstructured":"Corona, E., Zanfir, A., Bazavan, E.G., Kolotouros, N., Alldieck, T., & Sminchisescu, C. (2024). Vlogger: Multimodal diffusion for embodied avatar synthesis. arXiv preprint arXiv:2403.08764","DOI":"10.1109\/CVPR52734.2025.01482"},{"key":"2752_CR17","doi-asserted-by":"crossref","unstructured":"Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). Arcface: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 4690\u20134699","DOI":"10.1109\/CVPR.2019.00482"},{"key":"2752_CR18","first-page":"8780","volume":"34","author":"P Dhariwal","year":"2021","unstructured":"Dhariwal, P., & Nichol, A. (2021). Diffusion models beat gans on image synthesis. Advances in neural information processing systems,34, 8780\u20138794.","journal-title":"Advances in neural information processing systems"},{"key":"2752_CR19","unstructured":"Esser, P., Kulal, S., Blattmann, A., Entezari, R., M\u00fcller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F. et\u00a0al (2024). Scaling rectified flow transformers for high-resolution image synthesis. 
In: Forty-first International Conference on Machine Learning"},{"key":"2752_CR20","doi-asserted-by":"crossref","unstructured":"Esser, P., Rombach, R., & Ommer, B. (2021) Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 12873\u201312883","DOI":"10.1109\/CVPR46437.2021.01268"},{"key":"2752_CR21","doi-asserted-by":"crossref","unstructured":"Fan, Y., Lin, Z., Saito, J., Wang, W., & Komura, T. (2022). Faceformer: Speech-driven 3d facial animation with transformers. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 18770\u201318780","DOI":"10.1109\/CVPR52688.2022.01821"},{"key":"2752_CR22","doi-asserted-by":"crossref","unstructured":"Farouk, M. (2022). Studying human robot interaction and its characteristics. International Journal of Computations, Information and Manufacturing (IJCIM) 2(1)","DOI":"10.54489\/ijcim.v2i1.73"},{"key":"2752_CR23","doi-asserted-by":"crossref","unstructured":"Ferstl, Y., & McDonnell, R. (2018). Investigating the use of recurrent motion modelling for speech gesture generation. In: Proceedings of the 18th International Conference on Intelligent Virtual Agents, pp 93\u201398","DOI":"10.1145\/3267851.3267898"},{"issue":"2","key":"2752_CR24","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3472303","volume":"3","author":"Y Fu","year":"2022","unstructured":"Fu, Y., Hu, Y., & Sundstedt, V. (2022). A systematic literature review of virtual, augmented, and mixed reality game applications in healthcare. ACM Transactions on Computing for Healthcare (HEALTH),3(2), 1\u201327.","journal-title":"ACM Transactions on Computing for Healthcare (HEALTH)"},{"key":"2752_CR25","unstructured":"Gao, P., Zhuo, L., Liu, C., et\u00a0al (2024). Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. 
arXiv preprint arXiv:2405.05945"},{"key":"2752_CR26","doi-asserted-by":"crossref","unstructured":"Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., & Malik, J. (2019). Learning individual styles of conversational gesture. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 3497\u20133506","DOI":"10.1109\/CVPR.2019.00361"},{"issue":"11","key":"2752_CR27","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1145\/3422622","volume":"63","author":"I Goodfellow","year":"2020","unstructured":"Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al. (2020). Generative adversarial networks. Communications of the ACM,63(11), 139\u2013144.","journal-title":"Communications of the ACM"},{"key":"2752_CR28","doi-asserted-by":"crossref","unstructured":"Guan, J., Zhang, Z., Zhou, H., et\u00a0al (2023). Stylesync: High-fidelity generalized and personalized lip sync in style-based generator. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 1505\u20131515","DOI":"10.1109\/CVPR52729.2023.00151"},{"key":"2752_CR29","doi-asserted-by":"crossref","unstructured":"Gu S, Chen D, Bao J, Wen, F., Zhang, B., Chen, D., Yuan, L., & Guo, B. (2022). Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 10696\u201310706","DOI":"10.1109\/CVPR52688.2022.01043"},{"key":"2752_CR30","doi-asserted-by":"crossref","unstructured":"G\u00fcler, R.A., Neverova, N., & Kokkinos, I. (2018). Densepose: Dense human pose estimation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7297\u20137306","DOI":"10.1109\/CVPR.2018.00762"},{"key":"2752_CR31","unstructured":"Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., & Dai, B. (2024). Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. 
International Conference on Learning Representations"},{"key":"2752_CR32","doi-asserted-by":"crossref","unstructured":"Habibie, I., Xu, W., Mehta, D., Liu, L., Seidel, H.-P., Pons-Moll, G., Elgharib, M., & Theobalt, C. (2021). Learning speech-driven 3d conversational gestures from video. In: Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents, pp 101\u2013108","DOI":"10.1145\/3472306.3478335"},{"key":"2752_CR33","doi-asserted-by":"crossref","unstructured":"He, X., Huang, Q., Zhang, Z., Lin, Z., Wu, Z., Yang, S., Li, M., Chen, Z., Xu, S., & Wu, X. (2024). Co-speech gesture video generation via motion-decoupled diffusion model. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 2263\u20132273","DOI":"10.1109\/CVPR52733.2024.00220"},{"key":"2752_CR34","unstructured":"He, Y., Yang, T., Zhang, Y., Shan, Y., & Chen, Q. (2022). Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221 2(3):4"},{"key":"2752_CR35","unstructured":"Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30"},{"key":"2752_CR36","doi-asserted-by":"crossref","unstructured":"Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J. et\u00a0al (2022). Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303","DOI":"10.52202\/068431-0628"},{"key":"2752_CR37","doi-asserted-by":"crossref","unstructured":"Hu, L. (2024). Animate anyone: Consistent and controllable image-to-video synthesis for character animation. 
In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 8153\u20138163","DOI":"10.1109\/CVPR52733.2024.00779"},{"key":"2752_CR38","doi-asserted-by":"crossref","unstructured":"Huang, C.M., Mutlu, B. (2012). Robot behavior toolkit: generating effective social behaviors for robots. In: Proceedings of the seventh annual ACM\/IEEE international conference on Human-Robot Interaction, pp 25\u201332","DOI":"10.1145\/2157689.2157694"},{"key":"2752_CR39","doi-asserted-by":"crossref","unstructured":"Huang, Z., Tang, F., Zhang, Y., Cun, X., Cao, J., Li, J., & Lee, T.-Y. (2024). Make-your-anchor: A diffusion-based 2d avatar generation framework. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 6997\u20137006","DOI":"10.1109\/CVPR52733.2024.00668"},{"key":"2752_CR40","first-page":"20067","volume":"36","author":"B Jiang","year":"2023","unstructured":"Jiang, B., Chen, X., Liu, W., et al. (2023). Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems,36, 20067\u201320079.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2752_CR41","doi-asserted-by":"crossref","unstructured":"Karras, J., Holynski, A., Wang, T.C., & Kemelmacher-Shlizerman, I. (2023). Dreampose: Fashion image-to-video synthesis via stable diffusion. In: 2023 IEEE\/CVF International Conference on Computer Vision (ICCV), IEEE, pp 22623\u201322633","DOI":"10.1109\/ICCV51070.2023.02073"},{"key":"2752_CR42","doi-asserted-by":"crossref","unstructured":"Khachatryan L, Movsisyan A, Tadevosyan V, Henschel, R., Wang, Z., Navasardyan, S., & Shi, H.(2023). Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp 15954\u201315964","DOI":"10.1109\/ICCV51070.2023.01462"},{"key":"2752_CR43","unstructured":"Kingma, D.P. (2014). Adam: A method for stochastic optimization. 
arXiv preprint arXiv:1412.6980"},{"issue":"1","key":"2752_CR44","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1002\/cav.6","volume":"15","author":"S Kopp","year":"2004","unstructured":"Kopp, S., & Wachsmuth, I. (2004). Synthesizing multimodal utterances for conversational agents. Computer animation and virtual worlds,15(1), 39\u201352.","journal-title":"Computer animation and virtual worlds"},{"key":"2752_CR45","doi-asserted-by":"crossref","unstructured":"Kucherenko, T., Hasegawa, D., Henter, G.E., Kaneko, N., & Kjellstr\u00f6m, H. (2019). Analyzing input and output representations for speech-driven gesture generation. In: Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, pp 97\u2013104","DOI":"10.1145\/3308532.3329472"},{"key":"2752_CR46","unstructured":"Lee, H.Y., Yang, X., Liu, M.Y., Wang, T.-C., Lu, Y.-D., Yang, M.-H., & Kautz, J. (2019). Dancing to music. In: Wallach H, Larochelle H, Beygelzimer A, et\u00a0al (eds) Advances in Neural Information Processing Systems, vol\u00a032. Curran Associates, Inc., https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2019\/file\/7ca57a9f85a19a6e4b9a248c1daca185-Paper.pdf"},{"key":"2752_CR47","doi-asserted-by":"crossref","unstructured":"Levine, S., Kr\u00e4henb\u00fchl, P., Thrun, S., & Koltun, V. (2010). Gesture controllers. In: Acm siggraph 2010 papers, pp 1\u201311","DOI":"10.1145\/1833349.1778861"},{"key":"2752_CR48","doi-asserted-by":"crossref","unstructured":"Li, J., Kang, D., Pei, W., et\u00a0al (2021a). Audio2gestures: Generating diverse gestures from speech audio with conditional variational autoencoders. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp 11293\u201311302","DOI":"10.1109\/ICCV48922.2021.01110"},{"key":"2752_CR49","doi-asserted-by":"crossref","unstructured":"Li, L., Wang, S., Zhang, Z., Ding, Y., Zheng, Y., Yu, X., & Fan, C. (2021b). Write-a-speaker: Text-based emotional and rhythmic talking-head generation. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 1911\u20131920","DOI":"10.1609\/aaai.v35i3.16286"},{"key":"2752_CR50","unstructured":"Li, R., Yang, S., Ross, D.A., & Kanazawa, A. (2021c). Learn to dance with aist++: Music conditioned 3d dance generation. arXiv preprint arXiv:2101.08779 2(3)"},{"key":"2752_CR51","unstructured":"Lin, G., Jiang, J., Liang, C., et\u00a0al (2024). Cyberhost: Taming audio-driven avatar diffusion model with region codebook attention. arXiv preprint arXiv:2409.01876"},{"key":"2752_CR52","doi-asserted-by":"crossref","unstructured":"Liu, Y., Cao, Q., Wen, Y., Jiang, H., & Ding, C. (2024b). Towards variable and coordinated holistic co-speech motion generation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 1566\u20131576","DOI":"10.1109\/CVPR52733.2024.00155"},{"key":"2752_CR53","doi-asserted-by":"crossref","unstructured":"Liu, X., Wu, Q., Zhou, H., et\u00a0al (2022c). Learning hierarchical cross-modal association for co-speech gesture generation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 10462\u201310472","DOI":"10.1109\/CVPR52688.2022.01021"},{"key":"2752_CR54","doi-asserted-by":"crossref","unstructured":"Liu, H., Zhu, Z., Becherini, G., Peng, Y., Su, M., Zhou, Y., Zhe, X., Iwamoto, N., Zheng, B., & Black, M.J. (2024a). Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 1144\u20131154","DOI":"10.1109\/CVPR52733.2024.00115"},{"key":"2752_CR55","doi-asserted-by":"crossref","unstructured":"Liu H, Iwamoto N, Zhu Z, Li, Z., Zhou, Y., Bozkurt, E., & Zheng, B. (2022a). Disco: Disentangled implicit content and rhythm learning for diverse co-speech gestures synthesis. 
In: Proceedings of the 30th ACM international conference on multimedia, pp 3764\u20133773","DOI":"10.1145\/3503161.3548400"},{"key":"2752_CR56","first-page":"21386","volume":"35","author":"X Liu","year":"2022","unstructured":"Liu, X., Wu, Q., Zhou, H., et al. (2022). Audio-driven co-speech gesture video generation. Advances in Neural Information Processing Systems,35, 21386\u201321399.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2752_CR57","doi-asserted-by":"crossref","unstructured":"Loper, M., Mahmood, N., Romero, J., et\u00a0al (2015). SMPL: A skinned multi-person linear model. ACM Transactions on Graphics, (Proc SIGGRAPH Asia) 34(6), 248:1\u2013248:16","DOI":"10.1145\/2816795.2818013"},{"key":"2752_CR58","doi-asserted-by":"crossref","unstructured":"Marsella, S., Xu, Y., Lhommet, M., et\u00a0al (2013). Virtual character performance from speech. In: Proceedings of the 12th ACM SIGGRAPH\/Eurographics symposium on computer animation, pp 25\u201335","DOI":"10.1145\/2485895.2485900"},{"key":"2752_CR59","doi-asserted-by":"crossref","unstructured":"Nyatsanga, S., Kucherenko, T., Ahuja, C., Henter, G.E., & Neff, M. (2023). A comprehensive review of data-driven co-speech gesture generation. In: Computer Graphics Forum, Wiley Online Library, pp 569\u2013596","DOI":"10.1111\/cgf.14776"},{"issue":"4","key":"2752_CR60","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3592456","volume":"42","author":"K Pang","year":"2023","unstructured":"Pang, K., Qin, D., Fan, Y., et al. (2023). Bodyformer: Semantics-guided 3d body gesture synthesis with transformer. ACM Transactions on Graphics (TOG),42(4), 1\u201312.","journal-title":"ACM Transactions on Graphics (TOG)"},{"key":"2752_CR61","doi-asserted-by":"crossref","unstructured":"Park, S.J., Kim, M., Hong, J., et\u00a0al (2022). Synctalkface: Talking face generation with precise lip-syncing via audio-lip memory. 
In: AAAI Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence","DOI":"10.1609\/aaai.v36i2.20102"},{"key":"2752_CR62","doi-asserted-by":"crossref","unstructured":"Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., & Black, M.J. (2019a). Expressive body capture: 3d hands, face, and body from a single image. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 10975\u201310985","DOI":"10.1109\/CVPR.2019.01123"},{"key":"2752_CR63","doi-asserted-by":"crossref","unstructured":"Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., & Black, M.J. (2019b). Expressive body capture: 3d hands, face, and body from a single image. In: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)","DOI":"10.1109\/CVPR.2019.01123"},{"key":"2752_CR64","doi-asserted-by":"crossref","unstructured":"Pavlakos, G., Shan, D., Radosavovic, I., Kanazawa, A., Fouhey, D., & Malik, J. (2024) Reconstructing hands in 3D with transformers. In: CVPR","DOI":"10.1109\/CVPR52733.2024.00938"},{"key":"2752_CR65","doi-asserted-by":"crossref","unstructured":"Peebles, W., & Xie, S. (2023) Scalable diffusion models with transformers. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp 4195\u20134205","DOI":"10.1109\/ICCV51070.2023.00387"},{"key":"2752_CR66","unstructured":"Peng, B., Wang, J., Zhang, Y., Li, W., Yang, M.-C., & Jia, J. (2024). Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070"},{"issue":"7","key":"2752_CR67","doi-asserted-by":"publisher","first-page":"1372","DOI":"10.1002\/mar.21813","volume":"40","author":"G Pizzi","year":"2023","unstructured":"Pizzi, G., Vannucci, V., Mazzoli, V., et al. (2023). I, chatbot! the impact of anthropomorphism and gaze direction on willingness to disclose personal information and behavioral intentions. 
Psychology & Marketing,40(7), 1372\u20131387.","journal-title":"Psychology & Marketing"},{"key":"2752_CR68","doi-asserted-by":"crossref","unstructured":"Poggi, I., Pelachaud, C., de\u00a0Rosis, F., Carofiglio, V., De Carolis, B., & GRETA, A. (2005). A believable embodied conversational agent. Multimodal intelligent information presentation Text, speech and language technology 27","DOI":"10.1007\/1-4020-3051-7_1"},{"key":"2752_CR69","doi-asserted-by":"crossref","unstructured":"Prajwal, K., Mukhopadhyay, R., Namboodiri, V.P., & Jawahar, C.V. (2020). A lip sync expert is all you need for speech to lip generation in the wild. In: Proceedings of the 28th ACM international conference on multimedia, pp 484\u2013492","DOI":"10.1145\/3394171.3413532"},{"key":"2752_CR70","doi-asserted-by":"crossref","unstructured":"Qi, X., Liu, C., Li, L., et\u00a0al (2024). Emotiongesture: Audio-driven diverse emotional co-speech 3d gesture generation. IEEE Transactions on Multimedia","DOI":"10.1109\/TMM.2024.3407692"},{"key":"2752_CR71","doi-asserted-by":"crossref","unstructured":"Qian, S., Tu, Z., Zhi, Y., Liu, W., & Gao, S.(2021). Speech drives templates: Co-speech gesture synthesis with learned templates. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp 11077\u201311086","DOI":"10.1109\/ICCV48922.2021.01089"},{"key":"2752_CR72","unstructured":"Ravi, N., Reizenstein, J., Novotny, D., et\u00a0al (2020). Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501"},{"key":"2752_CR73","doi-asserted-by":"crossref","unstructured":"Roesler, E., Manzey, D., & Onnasch, L. (2021). A meta-analysis on the effectiveness of anthropomorphism in human-robot interaction. Science Robotics 6(58):eabj5425","DOI":"10.1126\/scirobotics.abj5425"},{"key":"2752_CR74","doi-asserted-by":"crossref","unstructured":"Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. 
In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 10684\u201310695","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"2752_CR75","doi-asserted-by":"crossref","unstructured":"Romero, J., Tzionas, D., & Black, M.J. (2017). Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc SIGGRAPH Asia) 36(6)","DOI":"10.1145\/3130800.3130883"},{"issue":"4","key":"2752_CR76","doi-asserted-by":"publisher","first-page":"8","DOI":"10.17705\/1jais.00685","volume":"22","author":"AM Seeger","year":"2021","unstructured":"Seeger, A. M., Pfeiffer, J., & Heinzl, A. (2021). Texting with humanlike conversational agents: Designing for anthropomorphism. Journal of the Association for Information systems,22(4), 8.","journal-title":"Journal of the Association for Information systems"},{"key":"2752_CR77","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijhcs.2022.102848","volume":"165","author":"L Seitz","year":"2022","unstructured":"Seitz, L., Bekmeier-Feuerhahn, S., & Gohil, K. (2022). Can we trust a chatbot like a physician? a qualitative study on understanding the emergence of trust toward diagnostic chatbots. International Journal of Human-Computer Studies,165, Article 102848.","journal-title":"International Journal of Human-Computer Studies"},{"key":"2752_CR78","doi-asserted-by":"crossref","unstructured":"Thies, J., Elgharib, M., Tewari, A., Theobalt, C., & Nie\u00dfner, M. (2020). Neural voice puppetry: Audio-driven facial reenactment. In: European Conference on Computer Vision, Springer, pp 716\u2013731","DOI":"10.1007\/978-3-030-58517-4_42"},{"key":"2752_CR79","unstructured":"Unterthiner T, Van\u00a0Steenkiste S, Kurach K, Marinier, R., Michalski, M., & Gelly, S. (2018). Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717"},{"key":"2752_CR80","unstructured":"Vaswani, A. (2017). Attention is all you need. 
Advances in Neural Information Processing Systems"},{"key":"2752_CR81","unstructured":"Villegas, R., Babaeizadeh, M., Kindermans, P.J. et\u00a0al (2022). Phenaki: Variable length video generation from open domain textual descriptions. In: International Conference on Learning Representations"},{"key":"2752_CR82","doi-asserted-by":"crossref","unstructured":"Wang, T., Li, L., Lin, K., Zhai, Y., Lin, C.-C., Yang, Z., Zhang, H., Liu, Z., & Wang, L. (2024). Disco: Disentangled control for realistic human dance generation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 9326\u20139336","DOI":"10.1109\/CVPR52733.2024.00891"},{"issue":"4","key":"2752_CR83","doi-asserted-by":"publisher","first-page":"600","DOI":"10.1109\/TIP.2003.819861","volume":"13","author":"Z Wang","year":"2004","unstructured":"Wang, Z., Bovik, A. C., Sheikh, H. R., et al. (2004). Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing,13(4), 600\u2013612.","journal-title":"IEEE transactions on image processing"},{"key":"2752_CR84","doi-asserted-by":"crossref","unstructured":"Xing, J., Xia, M., Liu, Y., et\u00a0al (2024). Make-your-video: Customized video generation using textual and structural guidance. IEEE Transactions on Visualization and Computer Graphics","DOI":"10.1109\/TVCG.2024.3365804"},{"key":"2752_CR85","doi-asserted-by":"crossref","unstructured":"Xu, Z., Zhang, J., Liew, J.H., Yan, H., Liu, J.-W., Zhang, C., Feng, J., Shou, M.Z. (2024). Magicanimate: Temporally consistent human image animation using diffusion model. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 1481\u20131490","DOI":"10.1109\/CVPR52733.2024.00147"},{"key":"2752_CR86","unstructured":"Yang, Z., Teng, J., Zheng, W., et\u00a0al (2024). Cogvideox: Text-to-video diffusion models with an expert transformer. 
arXiv preprint arXiv:2408.06072"},{"key":"2752_CR87","doi-asserted-by":"crossref","unstructured":"Yang, S., Wu, Z., Li, M., Zhang, Z., Hao, L., Bao, W., Cheng, M., & Xiao, L. (2023a). Diffusestylegesture: Stylized audio-driven co-speech gesture generation with diffusion models. arXiv preprint arXiv:2305.04919","DOI":"10.24963\/ijcai.2023\/650"},{"key":"2752_CR88","doi-asserted-by":"crossref","unstructured":"Yang, Z., Zeng, A., Yuan, C, et\u00a0al (2023b). Effective whole-body pose estimation with two-stages distillation. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp 4210\u20134220","DOI":"10.1109\/ICCVW60793.2023.00455"},{"key":"2752_CR89","doi-asserted-by":"crossref","unstructured":"Yazdian, P.J., Chen, M., & Lim, A. (2022). Gesture2vec: Clustering gestures using representation learning methods for co-speech gesture generation. In: 2022 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp 3100\u20133107","DOI":"10.1109\/IROS47612.2022.9981117"},{"key":"2752_CR90","unstructured":"Ye, Z., Jiang, Z., Ren, Y., Liu, J., He, J., & Zhao, Z. (2023). Geneface: Generalized and high-fidelity audio-driven 3d talking face synthesis. ICLR"},{"key":"2752_CR91","doi-asserted-by":"crossref","unstructured":"Yi, H., Liang, H., Liu, Y., Cao, Q., Wen, Y., Bolkart, T., Tao, D., & Black, M.J. (2023). Generating holistic 3d human motion from speech. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 469\u2013480","DOI":"10.1109\/CVPR52729.2023.00053"},{"key":"2752_CR92","doi-asserted-by":"crossref","unstructured":"Yin, L., Wang, Y., He, T., Liu, J., Zhao, W., Li, B., Jin, X., & Lin, J. (2023). Emog: Synthesizing emotive co-speech 3d gesture with diffusion model. arXiv preprint arXiv:2306.11496","DOI":"10.2139\/ssrn.4818829"},{"key":"2752_CR93","doi-asserted-by":"crossref","unstructured":"Yoon, Y., Ko, W.R., Jang, M., et\u00a0al (2019). 
Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots. In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, pp 4303\u20134309","DOI":"10.1109\/ICRA.2019.8793720"},{"issue":"6","key":"2752_CR94","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3414685.3417838","volume":"39","author":"Y Yoon","year":"2020","unstructured":"Yoon, Y., Cha, B., Lee, J. H., et al. (2020). Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics (TOG),39(6), 1\u201316.","journal-title":"ACM Transactions on Graphics (TOG)"},{"key":"2752_CR95","unstructured":"Yu, L., Lezama, J., Gundavarapu, N.B., et\u00a0al (2023). Language model beats diffusion\u2013tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737"},{"key":"2752_CR96","unstructured":"Zhang, Z., & Sabuncu, M. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems 31"},{"key":"2752_CR97","unstructured":"Zhang, Y., Gu, J., Wang, L.W., et\u00a0al (2024). Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance. arXiv preprint arXiv:2406.19680"},{"key":"2752_CR98","doi-asserted-by":"crossref","unstructured":"Zhang, R., Isola, P., Efros, A.A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 586\u2013595","DOI":"10.1109\/CVPR.2018.00068"},{"key":"2752_CR99","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Barnes, C., Lu, J., et\u00a0al (2019). On the continuity of rotation representations in neural networks. 
In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 5745\u20135753","DOI":"10.1109\/CVPR.2019.00589"},{"key":"2752_CR100","unstructured":"Zhou, S., Chan, K.C., Li, C., & Loy, C.C. (2022b). Towards robust blind face restoration with codebook lookup transformer. In: NeurIPS"},{"key":"2752_CR101","doi-asserted-by":"crossref","unstructured":"Zhou, H., Sun, Y., Wu, W., et\u00a0al (2021). Pose-controllable talking face generation by implicitly modularized audio-visual representation. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 4176\u20134186","DOI":"10.1109\/CVPR46437.2021.00416"},{"key":"2752_CR102","unstructured":"Zhou, D., Wang, W., Yan, H., Lv, W., Zhu, Y., & Feng, J. (2022a). Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018"},{"key":"2752_CR103","doi-asserted-by":"crossref","unstructured":"Zhu, L., Liu, X., Liu, X., Liu, X., Qian, R., Liu, Z., & Yu, L. (2023). Taming diffusion models for audio-driven co-speech gesture generation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp 10544\u201310553","DOI":"10.1109\/CVPR52729.2023.01016"},{"key":"2752_CR104","doi-asserted-by":"crossref","unstructured":"Zhuang, W., Wang, C., Chai, J., Wang, Y., Shao, M., & Xia, S. (2022). Music2dance: Dancenet for music-driven dance generation. 
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18(2):1\u201321","DOI":"10.1145\/3485664"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-026-02752-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-026-02752-z","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-026-02752-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T08:38:50Z","timestamp":1774600730000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-026-02752-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,24]]},"references-count":104,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,3]]}},"alternative-id":["2752"],"URL":"https:\/\/doi.org\/10.1007\/s11263-026-02752-z","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,24]]},"assertion":[{"value":"11 March 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 January 2026","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 January 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The performance of our model is hindered by unstable backgrounds. 
Moreover, our model cannot handle crossed fingers, which pose difficulties for the 3D representations. Extending the capabilities of co-speech gesture video generation to diverse, real-world scenes remains a challenging open problem. Larger-scale pre-trained DiT models might be able to tackle these difficulties.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Limitation"}},{"value":"Our method has the potential to generate fabricated talks, raising concerns about potential misuse. To mitigate this risk, we are committed to strictly controlling the distribution of our models and the generated content, limiting access to research purposes only.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical Consideration"}}],"article-number":"85"}}