{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T18:47:09Z","timestamp":1761677229710,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":63,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,10,9]],"date-time":"2023-10-09T00:00:00Z","timestamp":1696809600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,10,9]]},"DOI":"10.1145\/3577190.3614135","type":"proceedings-article","created":{"date-parts":[[2023,10,7]],"date-time":"2023-10-07T22:30:48Z","timestamp":1696717848000},"page":"60-69","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-3646-7702","authenticated-orcid":false,"given":"Hendric","family":"Vo\u00df","sequence":"first","affiliation":[{"name":"Social Cognitive Systems - Cluster of Excellence Cognitive Interaction Technology (CITEC), Bielefeld University, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4047-9277","authenticated-orcid":false,"given":"Stefan","family":"Kopp","sequence":"additional","affiliation":[{"name":"Cluster of Excellence Cognitive Interaction Technology (CITEC), Bielefeld University, Germany"}]}],"member":"320","published-online":{"date-parts":[[2023,10,9]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"[n. d.]. TED \u2014 youtube.com. https:\/\/www.youtube.com\/c\/TED\/videos. [Accessed 16-Feb-2023]."},{"key":"e_1_3_2_1_2_1","unstructured":"[n. d.]. TEDx Talks \u2014 youtube.com. https:\/\/www.youtube.com\/channel\/UCsT0YIqwnpJCM-mx7-gSA4Q. [Accessed 16-Feb-2023]."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.170"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Chaitanya Ahuja, Dong\u00a0Won Lee, Yukiko\u00a0I. Nakano, and Louis-Philippe Morency. 2020. Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach. http:\/\/arxiv.org\/abs\/2007.12553 arXiv:2007.12553 [cs].","DOI":"10.1007\/978-3-030-58523-5_15"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"crossref","unstructured":"Chaitanya Ahuja and Louis-Philippe Morency. 2019. Language2Pose: Natural Language Grounded Pose Forecasting. http:\/\/arxiv.org\/abs\/1907.01108 arXiv:1907.01108 [cs].","DOI":"10.1109\/3DV.2019.00084"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3550454.3555435"},{"key":"#cr-split#-e_1_3_2_1_7_1.1","unstructured":"Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. https:\/\/doi.org\/10.48550\/arXiv.2006.11477 arXiv:2006.11477 [cs eess]."},{"key":"#cr-split#-e_1_3_2_1_7_1.2","unstructured":"Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. 
https:\/\/doi.org\/10.48550\/arXiv.2006.11477 arXiv:2006.11477 [cs eess]."},{"volume-title":"Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents. In 2021 IEEE virtual reality and 3D user interfaces (VR)","author":"Bhattacharya Uttaran","key":"e_1_3_2_1_8_1","unstructured":"Uttaran Bhattacharya, Nicholas Rewkowski, Abhishek Banerjee, Pooja Guhan, Aniket Bera, and Dinesh Manocha. 2021. Text2gestures: A transformer-based network for generating emotive body gestures for virtual agents. In 2021 IEEE virtual reality and 3D user interfaces (VR). IEEE, 1\u201310."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/344779.344865"},{"key":"e_1_3_2_1_10_1","volume-title":"Speech-gesture mismatches: Evidence for one underlying representation of linguistic and nonlinguistic information. Pragmatics & cognition 7, 1","author":"Cassell Justine","year":"1999","unstructured":"Justine Cassell, David McNeill, and Karl-Erik McCullough. 1999. Speech-gesture mismatches: Evidence for one underlying representation of linguistic and nonlinguistic information. Pragmatics & cognition 7, 1 (1999), 1\u201334."},{"volume-title":"Life-Like Characters: Tools","author":"Cassell Justine","key":"e_1_3_2_1_11_1","unstructured":"Justine Cassell, Hannes\u00a0H\u00f6gni Vilhj\u00e1lmsson, and Timothy Bickmore. 2004. BEAT: the Behavior Expression Animation Toolkit. 
In Life-Like Characters: Tools, Affective Functions, and Applications, Helmut Prendinger and Mitsuru Ishizuka (Eds.). Springer, Berlin, Heidelberg, 163\u2013185. https:\/\/doi.org\/10.1007\/978-3-662-08373-4_8"},{"key":"e_1_3_2_1_12_1","volume-title":"2005 IEEE\/RSJ International Conference on Intelligent Robots and Systems. IEEE, Edmonton, Alta., Canada, 2662\u20132667","author":"Liu Changchun","year":"2005","unstructured":"Changchun Liu, P. Rani, and N. Sarkar. 2005. An empirical study of machine learning techniques for affect recognition in human-robot interaction. In 2005 IEEE\/RSJ International Conference on Intelligent Robots and Systems. IEEE, Edmonton, Alta., Canada, 2662\u20132667. https:\/\/doi.org\/10.1109\/IROS.2005.1545344"},{"key":"e_1_3_2_1_13_1","volume-title":"Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach","author":"Chiu Chung-Cheng","year":"2015","unstructured":"Chung-Cheng Chiu, Louis-Philippe Morency, and Stacy Marsella. 2015. Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach. In Intelligent Virtual Agents, Willem-Paul Brinkman, Joost Broekens, and Dirk Heylen (Eds.). Vol.\u00a09238. Springer International Publishing, Cham, 152\u2013166. 
https:\/\/doi.org\/10.1007\/978-3-319-21996-7_17 Series Title: Lecture Notes in Computer Science."},{"key":"e_1_3_2_1_14_1","volume-title":"Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078","author":"Cho Kyunghyun","year":"2014","unstructured":"Kyunghyun Cho, Bart Van\u00a0Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)."},{"key":"e_1_3_2_1_15_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https:\/\/doi.org\/10.48550\/arXiv.1810.04805 arXiv:1810.04805 [cs].","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https:\/\/doi.org\/10.48550\/arXiv.1810.04805 arXiv:1810.04805 [cs]."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Patrick Esser, Robin Rombach, and Bj\u00f6rn Ommer. 2021. Taming Transformers for High-Resolution Image Synthesis. http:\/\/arxiv.org\/abs\/2012.09841 arXiv:2012.09841 [cs].","DOI":"10.1109\/CVPR46437.2021.01268"},{"key":"e_1_3_2_1_17_1","volume-title":"Addressing some limitations of transformers with feedback memory. arXiv preprint arXiv:2002.09402","author":"Fan Angela","year":"2020","unstructured":"Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, and Sainbayar Sukhbaatar. 2020. Addressing some limitations of transformers with feedback memory. arXiv preprint arXiv:2002.09402 (2020)."},{"key":"e_1_3_2_1_18_1","volume-title":"ISCA","author":"Fan Yuchen","year":"2014","unstructured":"Yuchen Fan, Yao Qian, Feng-Long Xie, and Frank\u00a0K. Soong. 2014. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Interspeech 2014. ISCA, 1964\u20131968. https:\/\/doi.org\/10.21437\/Interspeech.2014-443"},{"key":"e_1_3_2_1_19_1","volume-title":"AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time","author":"Fang Hao-Shu","year":"2022","unstructured":"Hao-Shu Fang, Jiefeng Li, Hongyang Tang, Chao Xu, Haoyi Zhu, Yuliang Xiu, Yong-Lu Li, and Cewu Lu. 2022. AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)."},{"key":"#cr-split#-e_1_3_2_1_20_1.1","doi-asserted-by":"crossref","unstructured":"Mireille Fares, Catherine Pelachaud, and Nicolas Obin. 2022. Transformer Network for Semantically-Aware and Speech-Driven Upper-Face Generation. https:\/\/doi.org\/10.48550\/arXiv.2110.04527 arXiv:2110.04527 [eess].","DOI":"10.23919\/EUSIPCO55093.2022.9909519"},{"key":"#cr-split#-e_1_3_2_1_20_1.2","doi-asserted-by":"crossref","unstructured":"Mireille Fares, Catherine Pelachaud, and Nicolas Obin. 2022. Transformer Network for Semantically-Aware and Speech-Driven Upper-Face Generation. https:\/\/doi.org\/10.48550\/arXiv.2110.04527 arXiv:2110.04527 [eess].","DOI":"10.23919\/EUSIPCO55093.2022.9909519"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3267851.3267898"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1006\/cviu.2000.0894"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3536221.3558068"},{"key":"#cr-split#-e_1_3_2_1_24_1.1","unstructured":"Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus\u00a0F. Troje, and Marc-Andr\u00e9 Carbonneau. 2022. ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech. 
https:\/\/doi.org\/10.48550\/arXiv.2209.07556 arXiv:2209.07556 [cs]."},{"key":"#cr-split#-e_1_3_2_1_24_1.2","doi-asserted-by":"crossref","unstructured":"Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus\u00a0F. Troje, and Marc-Andr\u00e9 Carbonneau. 2022. ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech. https:\/\/doi.org\/10.48550\/arXiv.2209.07556 arXiv:2209.07556 [cs].","DOI":"10.1111\/cgf.14734"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"crossref","unstructured":"Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. 2019. Learning Individual Styles of Conversational Gesture. http:\/\/arxiv.org\/abs\/1906.04160 arXiv:1906.04160 [cs eess].","DOI":"10.1109\/CVPR.2019.00361"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3422622"},{"key":"e_1_3_2_1_27_1","volume-title":"Improved training of wasserstein gans. Advances in neural information processing systems 30","author":"Gulrajani Ishaan","year":"2017","unstructured":"Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron\u00a0C Courville. 2017. Improved training of wasserstein gans. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417836"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073663"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177703732"},{"key":"e_1_3_2_1_31_1","volume-title":"Pororobot: A Deep Learning Robot that Plays Video Q&A Games.","author":"Kim Kyung-Min","year":"2015","unstructured":"Kyung-Min Kim, Chang-Jun Nan, Jung-Woo Ha, Yu-Jung Heo, and Byoung-Tak Zhang. 2015. Pororobot: A Deep Learning Robot that Plays Video Q&A Games. (2015)."},{"key":"e_1_3_2_1_32_1","volume-title":"Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114","author":"Kingma P","year":"2013","unstructured":"Diederik\u00a0P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)."},{"key":"e_1_3_2_1_33_1","volume-title":"Intelligent Virtual Agents (Lecture Notes in Computer Science)","author":"Kopp Stefan","year":"2006","unstructured":"Stefan Kopp, Brigitte Krenn, Stacy Marsella, Andrew\u00a0N. Marshall, Catherine Pelachaud, Hannes Pirker, Kristinn\u00a0R. Th\u00f3risson, and Hannes Vilhj\u00e1lmsson. 2006. Towards a Common Framework for Multimodal Generation: The Behavior Markup Language. In Intelligent Virtual Agents (Lecture Notes in Computer Science), Jonathan Gratch, Michael Young, Ruth Aylett, Daniel Ballin, and Patrick Olivier (Eds.). Springer, Berlin, Heidelberg, 205\u2013217. https:\/\/doi.org\/10.1007\/11821830_17"},{"key":"e_1_3_2_1_34_1","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision. 763\u2013772","author":"Lee Gilwoo","year":"2019","unstructured":"Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha\u00a0S. Srinivasa, and Yaser Sheikh. 2019. Talking with hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 763\u2013772."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01022"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3386569.3392422"},{"key":"#cr-split#-e_1_3_2_1_37_1.1","doi-asserted-by":"crossref","unstructured":"Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. 2022. BEAT: A Large-Scale Semantic and\u00a0Emotional Multi-modal Dataset for\u00a0Conversational Gestures Synthesis. In Computer Vision - ECCV 2022 (Lecture Notes in Computer Science), Shai Avidan, Gabriel Brostow, Moustapha Ciss\u00e9, Giovanni\u00a0Maria Farinella, and Tal Hassner (Eds.). 
Springer Nature Switzerland, Cham, 612\u2013630. https:\/\/doi.org\/10.1007\/978-3-031-20071-7_36","DOI":"10.1007\/978-3-031-20071-7_36"},{"key":"#cr-split#-e_1_3_2_1_37_1.2","doi-asserted-by":"crossref","unstructured":"Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. 2022. BEAT: A Large-Scale Semantic and\u00a0Emotional Multi-modal Dataset for\u00a0Conversational Gestures Synthesis. In Computer Vision - ECCV 2022 (Lecture Notes in Computer Science), Shai Avidan, Gabriel Brostow, Moustapha Ciss\u00e9, Giovanni\u00a0Maria Farinella, and Tal Hassner (Eds.). Springer Nature Switzerland, Cham, 612\u2013630. https:\/\/doi.org\/10.1007\/978-3-031-20071-7_36","DOI":"10.1007\/978-3-031-20071-7_36"},{"key":"e_1_3_2_1_38_1","volume-title":"An acceleration framework for high resolution image synthesis. arXiv preprint arXiv:1909.03611","author":"Liu Jinlin","year":"2019","unstructured":"Jinlin Liu, Yuan Yao, and Jianqiang Ren. 2019. An acceleration framework for high resolution image synthesis. arXiv preprint arXiv:1909.03611 (2019)."},{"key":"e_1_3_2_1_39_1","volume-title":"Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation. In 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE","author":"Liu Xian","year":"2022","unstructured":"Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and Bolei Zhou. 2022. Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation. In 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, New Orleans, LA, USA, 10452\u201310462. https:\/\/doi.org\/10.1109\/CVPR52688.
2022.01021"},{"key":"#cr-split#-e_1_3_2_1_40_1.1","doi-asserted-by":"crossref","unstructured":"Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav\u00a0Eje Henter, and Michael Neff. 2023. A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. https:\/\/doi.org\/10.1111\/cgf.14776 arXiv:2301.05339 [cs].","DOI":"10.1111\/cgf.14776"},{"key":"#cr-split#-e_1_3_2_1_40_1.2","doi-asserted-by":"crossref","unstructured":"Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav\u00a0Eje Henter, and Michael Neff. 2023. A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. https:\/\/doi.org\/10.1111\/cgf.14776 arXiv:2301.05339 [cs].","DOI":"10.1111\/cgf.14776"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/MRA.2018.2833157"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00794"},{"key":"e_1_3_2_1_43_1","volume-title":"Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems 32","author":"Razavi Ali","year":"2019","unstructured":"Ali Razavi, Aaron Van\u00a0den Oord, and Oriol Vinyals. 2019. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_3_2_1_44_1","volume-title":"Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767","author":"Redmon Joseph","year":"2018","unstructured":"Joseph Redmon and Ali Farhadi. 2018. 
Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)."},{"key":"e_1_3_2_1_45_1","volume-title":"Perceiving the person and their interactions with the others for social robotics \u2013 A review. Pattern Recognition Letters 118 (Feb","author":"Tapus Adriana","year":"2019","unstructured":"Adriana Tapus, Antonio Bandera, Ricardo Vazquez-Martin, and Luis\u00a0V. Calderita. 2019. Perceiving the person and their interactions with the others for social robotics \u2013 A review. Pattern Recognition Letters 118 (Feb. 2019), 3\u201313. https:\/\/doi.org\/10.1016\/j.patrec.2018.03.006"},{"key":"e_1_3_2_1_46_1","volume-title":"Neural discrete representation learning. Advances in neural information processing systems 30","author":"Den\u00a0Oord Aaron Van","year":"2017","unstructured":"Aaron Van Den\u00a0Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_47_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan\u00a0N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017). 
"},{"key":"e_1_3_2_1_48_1","volume-title":"Augmented Co-Speech Gesture Generation: Including Form and Meaning Features to Guide Learning-Based Gesture Synthesis. arXiv preprint arXiv:2307.09597","author":"Vo\u00df Hendric","year":"2023","unstructured":"Hendric Vo\u00df and Stefan Kopp. 2023. Augmented Co-Speech Gesture Generation: Including Form and Meaning Features to Guide Learning-Based Gesture Synthesis. arXiv preprint arXiv:2307.09597 (2023)."},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"crossref","unstructured":"Petra Wagner, Zofia Malisz, and Stefan Kopp. 2014. Gesture and speech in interaction: An overview. 209\u2013232\u00a0pages.","DOI":"10.1016\/j.specom.2013.09.008"},{"key":"e_1_3_2_1_50_1","volume-title":"OGRU: An optimized gated recurrent unit neural network. In Journal of Physics: Conference Series, Vol.\u00a01325","author":"Wang Xin","year":"2019","unstructured":"Xin Wang, Jiabing Xu, Wei Shi, and Jiarui Liu. 2019. OGRU: An optimized gated recurrent unit neural network. In Journal of Physics: Conference Series, Vol.\u00a01325. IOP Publishing, 012089."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01228-1_40"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417838"},{"key":"e_1_3_2_1_53_1","volume-title":"Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In 2019 International Conference on Robotics and Automation (ICRA)","author":"Yoon Youngwoo","year":"2019","unstructured":"Youngwoo Yoon, Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2019. Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, Montreal, QC, Canada, 4303\u20134309. https:\/\/doi.org\/10.1109\/ICRA.2019.8793720"},{"key":"e_1_3_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3536221.3558058"},{"volume-title":"Social Robotics (Lecture Notes in Computer Science), Miguel\u00a0A. Salichs, Shuzhi\u00a0Sam Ge, Emilia\u00a0Ivanova Barakova, John-John Cabibihan, Alan\u00a0R. Wagner, \u00c1lvaro Castro-Gonz\u00e1lez","author":"Yu Chuang","key":"e_1_3_2_1_55_1","unstructured":"Chuang Yu and Adriana Tapus. 2019. Interactive Robot Learning for Multimodal Emotion Recognition. In Social Robotics (Lecture Notes in Computer Science), Miguel\u00a0A. Salichs, Shuzhi\u00a0Sam Ge, Emilia\u00a0Ivanova Barakova, John-John Cabibihan, Alan\u00a0R. Wagner, \u00c1lvaro Castro-Gonz\u00e1lez, and Hongsheng He (Eds.). Springer International Publishing, Cham, 633\u2013642. 
https:\/\/doi.org\/10.1007\/978-3-030-35888-4_59"},{"key":"e_1_3_2_1_56_1","volume-title":"Mediapipe hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214","author":"Zhang Fan","year":"2020","unstructured":"Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. 2020. Mediapipe hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214 (2020)."},{"key":"e_1_3_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00363"},{"key":"e_1_3_2_1_58_1","volume-title":"arXiv preprint arXiv:2205.15573","author":"Zhuang Wenlin","year":"2022","unstructured":"Wenlin Zhuang, Jinwei Qi, Peng Zhang, Bang Zhang, and Ping Tan. 2022. Text\/Speech-Driven Full-Body Animation. arXiv preprint arXiv:2205.15573 (2022)."}],"event":{"name":"ICMI '23: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction"],"location":"Paris, France","acronym":"ICMI '23"},"container-title":["INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3577190.3614135","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3577190.3614135","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:01Z","timestamp":1750178221000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3577190.3614135"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,9]]},"references-count":63,"alternative-id":["10.1145\/3577190.3614135","10.1145\/3577190"],"URL":"https:\/\/doi.org\/10.1145\/3577190.3614135","relation":{},"subject":[],"published":{"date-parts":[[2023,10,9]]},"assertion":[{"value":"2023-10-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}