{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T15:25:42Z","timestamp":1778081142801,"version":"3.51.4"},"reference-count":51,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,7,26]],"date-time":"2023-07-26T00:00:00Z","timestamp":1690329600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100003452","name":"Innovation and Technology Commission","doi-asserted-by":"publisher","award":["ITS\/319\/21FP"],"award-info":[{"award-number":["ITS\/319\/21FP"]}],"id":[{"id":"10.13039\/501100003452","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Research Grant Council","award":["17210222"],"award-info":[{"award-number":["17210222"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:p>Automatic gesture synthesis from speech is a topic that has attracted researchers for applications in remote communication, video games and Metaverse. Learning the mapping between speech and 3D full-body gestures is difficult due to the stochastic nature of the problem and the lack of a rich cross-modal dataset that is needed for training. In this paper, we propose a novel transformer-based framework for automatic 3D body gesture synthesis from speech. To learn the stochastic nature of the body gesture during speech, we propose a variational transformer to effectively model a probabilistic distribution over gestures, which can produce diverse gestures during inference. Furthermore, we introduce a mode positional embedding layer to capture the different motion speeds in different speaking modes. To cope with the scarcity of data, we design an intra-modal pre-training scheme that can learn the complex mapping between the speech and the 3D gesture from a limited amount of data. Our system is trained with either the Trinity speech-gesture dataset or the Talking With Hands 16.2M dataset. The results show that our system can produce more realistic, appropriate, and diverse body gestures compared to existing state-of-the-art approaches.<\/jats:p>","DOI":"10.1145\/3592456","type":"journal-article","created":{"date-parts":[[2023,7,26]],"date-time":"2023-07-26T14:29:21Z","timestamp":1690381761000},"page":"1-12","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":19,"title":["Bodyformer: Semantics-guided 3D Body Gesture Synthesis with Transformer"],"prefix":"10.1145","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0168-3697","authenticated-orcid":false,"given":"Kunkun","family":"Pang","sequence":"first","affiliation":[{"name":"Guangdong Key Laboratory of Modern Control Technology, Institute of Intelligent Manufacturing, Guangdong Academy of Sciences, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-4992-4760","authenticated-orcid":false,"given":"Dafei","family":"Qin","sequence":"additional","affiliation":[{"name":"University of Hong Kong, Pokfulam, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8977-4958","authenticated-orcid":false,"given":"Yingruo","family":"Fan","sequence":"additional","affiliation":[{"name":"University of Hong Kong, Pokfulam, Hong Kong"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-0894-5954","authenticated-orcid":false,"given":"Julian","family":"Habekost","sequence":"additional","affiliation":[{"name":"University of Edinburgh, Edinburgh, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1012-415X","authenticated-orcid":false,"given":"Takaaki","family":"Shiratori","sequence":"additional","affiliation":[{"name":"Meta Reality Labs, Pittsburgh, United States of America"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2752-3955","authenticated-orcid":false,"given":"Junichi","family":"Yamagishi","sequence":"additional","affiliation":[{"name":"National Institute of Informatics, Tokyo, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2729-5860","authenticated-orcid":false,"given":"Taku","family":"Komura","sequence":"additional","affiliation":[{"name":"University of Hong Kong, Pokfulam, Hong Kong"},{"name":"Research Institute of Electrical Communication, Tohoku University, Sendai, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,7,26]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.170"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58523-5_15"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1111\/cgf.13946"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3550454.3555435"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475223"},{"key":"e_1_2_2_6_1","volume-title":"Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR)","author":"Bhattacharya Uttaran","unstructured":"Uttaran Bhattacharya, Nicholas Rewkowski, Abhishek Banerjee, Pooja Guhan, Aniket Bera, and Dinesh Manocha. 2021b. Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR). IEEE, 1--10."},{"key":"e_1_2_2_7_1","volume-title":"Speech-gesture mismatches: Evidence for one underlying representation of linguistic and nonlinguistic information. Pragmatics & cognition 7, 1","author":"Cassell Justine","year":"1999","unstructured":"Justine Cassell, David McNeill, and Karl-Erik McCullough. 1999. Speech-gesture mismatches: Evidence for one underlying representation of linguistic and nonlinguistic information. Pragmatics & cognition 7, 1 (1999), 1--34."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1285"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.1756-8765.2012.01183.x"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1423"},{"key":"e_1_2_2_12_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=YicbFdNTTy","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=YicbFdNTTy"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3267851.3267898"},{"key":"e_1_2_2_14_1","doi-asserted-by":"crossref","unstructured":"Ylva Ferstl Michael Neff and Rachel McDonnell. 2019. Multi-objective adversarial gesture generation. In Motion Interaction and Games. ACM 3.","DOI":"10.1145\/3359566.3360053"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N19-1021"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1049\/cp:19991218"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1111\/cgf.14734"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00361"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3528233.3530750"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3472306.3478335"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417836"},{"key":"e_1_2_2_22_1","volume-title":"Some relationships between body motion and speech. Studies in dyadic communication 7, 177","author":"Kendon Adam","year":"1972","unstructured":"Adam Kendon. 1972. Some relationships between body motion and speech. Studies in dyadic communication 7, 177 (1972), 90."},{"key":"e_1_2_2_23_1","volume-title":"Fahad Shahbaz Khan, and Mubarak Shah.","author":"Khan Salman","year":"2021","unstructured":"Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2021. Transformers in vision: A survey. ACM Computing Surveys (CSUR) (2021)."},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308532.3329472"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3382507.3418815"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3397481.3450692"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00085"},{"key":"e_1_2_2_28_1","volume-title":"Proceedings of International Conference on Machine Learning. 3744--3753","author":"Lee Juho","year":"2019","unstructured":"Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. 2019b. Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks. In Proceedings of International Conference on Machine Learning. 3744--3753."},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01110"},{"key":"e_1_2_2_30_1","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision.","author":"Li Ruilong","year":"2021","unstructured":"Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. 2021b. Learn to Dance with AIST++: Music Conditioned 3D Dance Generation. In Proceedings of IEEE\/CVF International Conference on Computer Vision."},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548400"},{"key":"e_1_2_2_32_1","volume-title":"BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis. arXiv preprint arXiv:2203.05297","author":"Liu Haiyang","year":"2022","unstructured":"Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. 2022c. BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis. arXiv preprint arXiv:2203.05297 (2022)."},{"key":"e_1_2_2_33_1","volume-title":"Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation. arXiv preprint arXiv:2203.13161","author":"Liu Xian","year":"2022","unstructured":"Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and Bolei Zhou. 2022b. Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation. arXiv preprint arXiv:2203.13161 (2022)."},{"key":"e_1_2_2_34_1","volume-title":"Decoupled Weight Decay Regularization. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Bkg6RiCqY7","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Bkg6RiCqY7"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"e_1_2_2_36_1","volume-title":"Hand and mind: What gestures reveal about thought","author":"McNeill David","unstructured":"David McNeill. 1992. Hand and mind: What gestures reveal about thought. University of Chicago press."},{"key":"e_1_2_2_37_1","volume-title":"Ekin Dogus Cubuk, and Quoc V. Le","author":"Park Daniel S.","year":"2019","unstructured":"Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin Dogus Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In INTERSPEECH."},{"key":"e_1_2_2_38_1","volume-title":"PyTorch: An Imperative Style","author":"Paszke Adam","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch\u00e9-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024--8035. http:\/\/papers.neurips.cc\/paper\/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01089"},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00790"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3478513.3480570"},{"key":"e_1_2_2_42_1","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_2_2_43_1","doi-asserted-by":"crossref","unstructured":"Petra Wagner Zofia Malisz and Stefan Kopp. 2014. Gesture and speech in interaction: An overview.","DOI":"10.1016\/j.specom.2013.09.008"},{"key":"e_1_2_2_44_1","volume-title":"Proceedings of International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=onxoVA9FxMw","author":"Wang Benyou","year":"2021","unstructured":"Benyou Wang, Lifeng Shang, Christina Lioma, Xin Jiang, Hao Yang, Qun Liu, and Jakob Grue Simonsen. 2021b. On Position Embeddings in BERT. In Proceedings of International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=onxoVA9FxMw"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3462244.3479914"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417838"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2019.8793720"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3197517.3201366"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00589"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01016"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.11"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3592456","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3592456","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:48:59Z","timestamp":1750182539000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3592456"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,26]]},"references-count":51,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,8]]}},"alternative-id":["10.1145\/3592456"],"URL":"https:\/\/doi.org\/10.1145\/3592456","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,26]]},"assertion":[{"value":"2023-07-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}