{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T17:58:01Z","timestamp":1768240681614,"version":"3.49.0"},"reference-count":81,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>Sign Language Production (SLP) aims to translate spoken textual languages into sign language sequences, which can significantly bridge the communication gap for deaf and hard-of-hearing individuals. Most previous SLP methods typically rely on skeleton-based data, which hinders their realism and expressive capacity. In this work, we address expressive 3D SLP tasks to generate high-quality 3D holistic sign motions driven by spoken language. However, existing 3D SLP methods struggle to accurately capture spatial relationships within intricate 3D structures and overlook the alignment of semantics at the word level. To overcome these limitations, we propose SignMask, a novel generative masked modeling framework that enhances spatial structure awareness and semantic understanding. We first design a structural holistic sign motion tokenizer that hierarchically learns discrete tokens of body and hand movements. This tokenizer adaptively aggregates 3D SMPL-X pose features corresponding to the same semantic parts and dynamically adjusts the weights between pose features of different semantic parts, enhancing spatial awareness and ensuring semantic consistency. Building on these tokenized representations, we introduce a specialized Sign-M Transformer to learn masked token prediction guided by textual input. 
Our Sign-M Transformer employs a hierarchical masking strategy, alongside spatio-temporal and cross-modal attention mechanisms, to effectively capture complex spatio-temporal relationships among sign tokens and semantic dependencies between sign and text tokens. During inference, our SignMask model iteratively fills in the missing motion tokens in parallel, starting from fully masked token sequences, thereby achieving high-fidelity and efficient 3D sign avatar generation. Extensive experiments demonstrate that our approach outperforms existing SLP methods on sign language datasets spanning multiple languages, generating high-quality and semantically consistent sign language motions.<\/jats:p>","DOI":"10.1145\/3776750","type":"journal-article","created":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T15:39:48Z","timestamp":1763480388000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["SignMask: Structure-aware Masked Modeling for Holistic 3D Sign Language Production"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-1743-817X","authenticated-orcid":false,"given":"Yibo","family":"Xia","sequence":"first","affiliation":[{"name":"Beihang University, Beijing, China and Zhongguancun Academy, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-1644-7064","authenticated-orcid":false,"given":"Qihui","family":"Zhan","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7256-4329","authenticated-orcid":false,"given":"Xiaoyan","family":"Luo","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8823-6119","authenticated-orcid":false,"given":"Xiaofeng","family":"Shi","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8001-2703","authenticated-orcid":false,"given":"Yunhong","family":"Wang","sequence":"additional","affiliation":[{"name":"Beihang University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2026,1,12]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3592458"},{"issue":"6","key":"e_1_3_3_3_2","first-page":"1","article-title":"Rhythmic gesticulator: Rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings","volume":"41","author":"Ao Tenglong","year":"2022","unstructured":"Tenglong Ao, Qingzhe Gao, Yuke Lou, Baoquan Chen, and Libin Liu. 2022. Rhythmic gesticulator: Rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings. ACM Transactions on Graphics 41, 6 (2022), 1\u201319.","journal-title":"ACM Transactions on Graphics"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3592097"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02016"},{"key":"e_1_3_3_6_2","unstructured":"Dzmitry Bahdanau. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473. Retrieved from https:\/\/arxiv.org\/abs\/1409.0473"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00194"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01004"},{"key":"e_1_3_3_9_2","unstructured":"Huiwen Chang Han Zhang Jarred Barber A. J. Maschinot Jose Lezama Lu Jiang Ming-Hsuan Yang Kevin Murphy William T. Freeman Michael Rubinstein et al. 2023. Muse: Text-to-image generation via masked generative transformers. arXiv:2301.00704. 
Retrieved from https:\/\/arxiv.org\/abs\/2301.00704"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01103"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3680847"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00702"},{"key":"e_1_3_3_13_2","first-page":"906","volume-title":"Proceedings of the Computer Vision and Pattern Recognition Conference","author":"Cheng Hongye","year":"2025","unstructured":"Hongye Cheng, Tianyu Wang, Guangsi Shi, Zexing Zhao, and Yanwei Fu. 2025. HOP: Heterogeneous topology-based multimodal entanglement for co-speech gesture generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, 906\u2013916."},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3680528.3687677"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00190"},{"key":"e_1_3_3_16_2","unstructured":"Kyunghyun Cho. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259. Retrieved from https:\/\/arxiv.org\/abs\/1409.1259"},{"key":"e_1_3_3_17_2","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171\u20134186."},{"key":"e_1_3_3_18_2","first-page":"1","volume-title":"Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)","author":"Dong Lu","year":"2024","unstructured":"Lu Dong, Lipisha Chaudhary, Fei Xu, Xiao Wang, Mason Lary, and Ifeoma Nwogu. 2024. SignAvatar: Sign language 3D motion reconstruction and generation. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG). IEEE, 1\u201310."},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00276"},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00186"},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3743138"},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472306.3478335"},{"key":"e_1_3_3_23_2","article-title":"GANs trained by a two time-scale update rule converge to a local Nash equilibrium","volume":"30","author":"Heusel Martin","year":"2017","unstructured":"Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 30.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_24_2","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho Jonathan","year":"2020","unstructured":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 
33, 6840\u20136851.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_25_2","unstructured":"Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv:2207.12598. Retrieved from https:\/\/arxiv.org\/abs\/2207.12598"},{"issue":"8","key":"e_1_3_3_26_2","first-page":"4792","article-title":"Pose-aware attention network for flexible motion retargeting by body part","volume":"30","author":"Hu Lei","year":"2023","unstructured":"Lei Hu, Zihao Zhang, Chongyang Zhong, Boyuan Jiang, and Shihong Xia. 2023. Pose-aware attention network for flexible motion retargeting by body part. IEEE Transactions on Visualization and Computer Graphics 30, 8 (2023), 4792\u20134808.","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475463"},{"key":"e_1_3_3_28_2","first-page":"3","volume-title":"Proceedings of the British Machine Vision Conference","volume":"1","author":"Hwang Eui Jun","year":"2021","unstructured":"Eui Jun Hwang, Jung-Ho Kim, and Jong C. Park. 2021. Non-autoregressive sign language production with Gaussian space. In Proceedings of the British Machine Vision Conference, Vol. 1, 3."},{"key":"e_1_3_3_29_2","unstructured":"Eui Jun Hwang Huije Lee and Jong C. Park. 2023. Autoregressive sign language production: A gloss-free approach with discrete representations. arXiv:2309.12179. Retrieved from https:\/\/arxiv.org\/abs\/2309.12179"},{"key":"e_1_3_3_30_2","first-page":"1","volume-title":"Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)","author":"Hwang Eui Jun","year":"2024","unstructured":"Eui Jun Hwang, Huije Lee, and Jong C. Park. 2024. A gloss-free sign language production with discrete representation. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG). 
IEEE, 1\u20136."},{"key":"e_1_3_3_31_2","volume-title":"Gesture Generation by Imitation: From Human Behavior to Computer Character Animation","author":"Kipp Michael","year":"2005","unstructured":"Michael Kipp. 2005. Gesture Generation by Imitation: From Human Behavior to Computer Character Animation. Universal Publishers."},{"key":"e_1_3_3_32_2","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1007\/11821830_17","volume-title":"Proceedings of the International Workshop on Intelligent Virtual Agents","author":"Kopp Stefan","year":"2006","unstructured":"Stefan Kopp, Brigitte Krenn, Stacy Marsella, Andrew N. Marshall, Catherine Pelachaud, Hannes Pirker, Kristinn R. Th\u00f3risson, and Hannes Vilhj\u00e1lmsson. 2006. Towards a common framework for multimodal generation: The behavior markup language. In Proceedings of the International Workshop on Intelligent Virtual Agents. Springer, 205\u2013217."},{"key":"e_1_3_3_33_2","first-page":"20740","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Lee Taeryung","year":"2023","unstructured":"Taeryung Lee, Yeonguk Oh, and Kyoung Mu Lee. 2023. Human part-wise 3D motion context learning for sign language recognition. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 20740\u201320750."},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.3115\/1218955.1219032"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02027"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00115"},{"key":"e_1_3_3_37_2","first-page":"13963","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Liu Lanmiao","year":"2025","unstructured":"Lanmiao Liu, Esam Ghaleb, Asli Ozyurek, and Zerrin Yumak. 2025. SemGes: Semantics-aware co-speech gesture generation using semantic coherence and relevance learning. 
In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 13963\u201313973."},{"key":"e_1_3_3_38_2","first-page":"1566","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Liu Yifei","year":"2024","unstructured":"Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. 2024. Towards variable and coordinated holistic co-speech motion generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 1566\u20131576."},{"key":"e_1_3_3_39_2","unstructured":"I. Loshchilov. 2017. Decoupled weight decay regularization. arXiv:1711.05101. Retrieved from https:\/\/arxiv.org\/abs\/1711.05101"},{"key":"e_1_3_3_40_2","unstructured":"Shunlin Lu Ling-Hao Chen Ailing Zeng Jing Lin Ruimao Zhang Lei Zhang and Heung-Yeung Shum. 2023. HumanTOMATO: Text-aligned whole-body motion generation. arXiv:2310.12978. Retrieved from https:\/\/arxiv.org\/abs\/2310.12978"},{"key":"e_1_3_3_41_2","unstructured":"Jian Ma Wenguan Wang Yi Yang and Feng Zheng. 2024. MS2SL: Multimodal spoken data-driven continuous sign language production. arXiv:2407.12842. Retrieved from https:\/\/arxiv.org\/abs\/2407.12842"},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/2485895.2485900"},{"key":"e_1_3_3_43_2","first-page":"16578","volume-title":"Proceedings of the Computer Vision and Pattern Recognition Conference","author":"Hamza Mughal M.","year":"2025","unstructured":"M. Hamza Mughal, Rishabh Dabral, Merel C. J. Scholman, Vera Demberg, and Christian Theobalt. 2025. Retrieving semantics from the deep: An RAG solution for gesture synthesis. 
In Proceedings of the Computer Vision and Pattern Recognition Conference, 16578\u201316588."},{"key":"e_1_3_3_44_2","first-page":"569","volume-title":"Proceedings of the Computer Graphics Forum","author":"Nyatsanga Simbarashe","year":"2023","unstructured":"Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje Henter, and Michael Neff. 2023. A comprehensive review of data-driven co-speech gesture generation. In Proceedings of the Computer Graphics Forum, Vol. 42. Wiley Online Library, 569\u2013596."},{"key":"e_1_3_3_45_2","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311\u2013318."},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01123"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00387"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.00153"},{"key":"e_1_3_3_49_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. 
PMLR, 8748\u20138763."},{"issue":"140","key":"e_1_3_3_50_2","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1\u201367.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_3_51_2","article-title":"Generating diverse high-fidelity images with VQ-VAE-2","volume":"32","author":"Razavi Ali","year":"2019","unstructured":"Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. 2019. Generating diverse high-fidelity images with VQ-VAE-2. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 32.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_52_2","unstructured":"V. Sanh. 2019. DistilBERT a distilled version of BERT: Smaller faster cheaper and lighter. arXiv:1910.01108. Retrieved from https:\/\/arxiv.org\/abs\/1910.01108"},{"key":"e_1_3_3_53_2","doi-asserted-by":"crossref","unstructured":"Ben Saunders Necati Cihan Camgoz and Richard Bowden. 2020. Adversarial training for multi-channel sign language production. arXiv:2008.12405. Retrieved from https:\/\/arxiv.org\/abs\/2008.12405","DOI":"10.5244\/C.34.63"},{"key":"e_1_3_3_54_2","first-page":"687","volume-title":"Proceedings of the 16th European Conference on Computer Vision (ECCV \u201920)","author":"Saunders Ben","year":"2020","unstructured":"Ben Saunders, Necati Cihan Camgoz, and Richard Bowden. 2020. Progressive transformers for end-to-end sign language production. In Proceedings of the 16th European Conference on Computer Vision (ECCV \u201920). 
Springer, 687\u2013705."},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3681641"},{"key":"e_1_3_3_56_2","unstructured":"Yang Song Jascha Sohl-Dickstein Diederik P. Kingma Abhishek Kumar Stefano Ermon and Ben Poole. 2020. Score-based generative modeling through stochastic differential equations. arXiv:2011.13456. Retrieved from https:\/\/arxiv.org\/abs\/2011.13456"},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-019-01281-2"},{"key":"e_1_3_3_58_2","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1109\/3DV57658.2022.00031","volume-title":"Proceedings of the 2022 International Conference on 3D Vision (3DV)","author":"Stoll Stephanie","year":"2022","unstructured":"Stephanie Stoll, Armin Mustafa, and Jean-Yves Guillemaut. 2022. There and back again: 3D sign language generation from text using back-translation. In Proceedings of the 2022 International Conference on 3D Vision (3DV). IEEE, 187\u2013196."},{"key":"e_1_3_3_59_2","unstructured":"I. Sutskever. 2014. Sequence to sequence learning with neural networks. arXiv:1409.3215. Retrieved from https:\/\/arxiv.org\/abs\/1409.3215"},{"key":"e_1_3_3_60_2","first-page":"3481","volume-title":"Proceedings of the Computer Vision and Pattern Recognition Conference","author":"Tang Shengeng","year":"2025","unstructured":"Shengeng Tang, Jiayi He, Lechao Cheng, Jingjing Wu, Dan Guo, and Richang Hong. 2025. Discrete to continuous: Generating smooth transition poses from sign language observations. 
In Proceedings of the Computer Vision and Pattern Recognition Conference, 3481\u20133491."},{"key":"e_1_3_3_61_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v39i7.32781"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547830"},{"key":"e_1_3_3_63_2","first-page":"1","volume-title":"ACM Transactions on Multimedia Computing, Communications and Applications","volume":"21","author":"Tang Shengeng","year":"2024","unstructured":"Shengeng Tang, Feng Xue, Jingjing Wu, Shuo Wang, and Richang Hong. 2024. Gloss-driven conditional diffusion models for sign language production. ACM Transactions on Multimedia Computing, Communications and Applications 21, 4 (2024), 1\u201317."},{"key":"e_1_3_3_64_2","first-page":"244","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Tian Linrui","year":"2024","unstructured":"Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. 2024. EMO: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions. In Proceedings of the European Conference on Computer Vision. Springer, 244\u2013260."},{"key":"e_1_3_3_65_2","article-title":"Neural discrete representation learning","volume":"30","author":"Van Den Oord Aaron","year":"2017","unstructured":"Aaron Van Den Oord and Oriol Vinyals. 2017. Neural discrete representation learning. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 30.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_66_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Vaswani A.","year":"2017","unstructured":"A. Vaswani. 2017. Attention is all you need. 
In Proceedings of the Advances in Neural Information Processing Systems."},{"issue":"1","key":"e_1_3_3_67_2","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1007\/s10032-020-00360-2","article-title":"Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training","volume":"24","author":"Wang Zelun","year":"2021","unstructured":"Zelun Wang and Jyh-Charn Liu. 2021. Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training. International Journal on Document Analysis and Recognition 24, 1 (2021), 63\u201375.","journal-title":"International Journal on Document Analysis and Recognition"},{"key":"e_1_3_3_68_2","unstructured":"Huawei Wei Zejun Yang and Zhisheng Wang. 2024. AniPortrait: Audio-driven synthesis of photorealistic portrait animation. arXiv:2403.17694. Retrieved from https:\/\/arxiv.org\/abs\/2403.17694"},{"key":"e_1_3_3_69_2","doi-asserted-by":"publisher","DOI":"10.1145\/3664647.3681392"},{"key":"e_1_3_3_70_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i6.28441"},{"key":"e_1_3_3_71_2","first-page":"660","article-title":"VASA-1: Lifelike audio-driven talking faces generated in real time","volume":"37","author":"Xu Sicheng","year":"2024","unstructured":"Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. 2024. VASA-1: Lifelike audio-driven talking faces generated in real time. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 37, 660\u2013684.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_3_72_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00053"},{"key":"e_1_3_3_73_2","unstructured":"Aoxiong Yin Haoyuan Li Kai Shen Siliang Tang and Yueting Zhuang. 2024. T2S-GPT: Dynamic vector quantization for autoregressive sign language production from text. arXiv:2406.07119. 
Retrieved from https:\/\/arxiv.org\/abs\/2406.07119"},{"key":"e_1_3_3_74_2","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417838"},{"key":"e_1_3_3_75_2","first-page":"1","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Yu Zhengdi","year":"2024","unstructured":"Zhengdi Yu, Shaoli Huang, Yongkang Cheng, and Tolga Birdal. 2024. SignAvatars: A large-scale 3D sign language holistic motion dataset and benchmark. In Proceedings of the European Conference on Computer Vision. Springer, 1\u201319."},{"key":"e_1_3_3_76_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3129994"},{"key":"e_1_3_3_77_2","first-page":"3395","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Zelinka Jan","year":"2020","unstructured":"Jan Zelinka and Jakub Kanis. 2020. Neural sign language synthesis: Words are our glosses. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, 3395\u20133403."},{"key":"e_1_3_3_78_2","first-page":"13761","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Zhang Xiangyue","year":"2025","unstructured":"Xiangyue Zhang, Jianfang Li, Jiaxu Zhang, Ziqiang Dang, Jianqiang Ren, Liefeng Bo, and Zhigang Tu. 2025. SemTalk: Holistic co-speech motion generation with frame-level semantic emphasis. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 13761\u201313771."},{"key":"e_1_3_3_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00137"},{"key":"e_1_3_3_80_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00589"},{"key":"e_1_3_3_81_2","first-page":"11040","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"39","author":"Zhu Yitao","year":"2025","unstructured":"Yitao Zhu, Sheng Wang, Mengjie Xu, Zixu Zhuang, Zhixin Wang, Kaidong Wang, Han Zhang, and Qian Wang. 2025. 
MUC: Mixture of uncalibrated cameras for robust 3D human body reconstruction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, 11040\u201311048."},{"key":"e_1_3_3_82_2","first-page":"36","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Zuo Ronglai","year":"2024","unstructured":"Ronglai Zuo, Fangyun Wei, Zenggui Chen, Brian Mak, Jiaolong Yang, and Xin Tong. 2024. A simple baseline for spoken language to sign language translation with 3D avatars. In Proceedings of the European Conference on Computer Vision. Springer, 36\u201354."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3776750","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T14:32:53Z","timestamp":1768228373000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3776750"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,12]]},"references-count":81,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3776750"],"URL":"https:\/\/doi.org\/10.1145\/3776750","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,12]]},"assertion":[{"value":"2025-03-31","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-21","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-01-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}