{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,15]],"date-time":"2025-10-15T18:13:51Z","timestamp":1760552031107,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":73,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,10,9]],"date-time":"2023-10-09T00:00:00Z","timestamp":1696809600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,10,9]]},"DOI":"10.1145\/3577190.3616115","type":"proceedings-article","created":{"date-parts":[[2023,10,7]],"date-time":"2023-10-07T22:30:48Z","timestamp":1696717848000},"page":"763-771","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-4853-1500","authenticated-orcid":false,"given":"Leon","family":"Harz","sequence":"first","affiliation":[{"name":"Bielefeld University, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-3646-7702","authenticated-orcid":false,"given":"Hendric","family":"Vo\u00df","sequence":"additional","affiliation":[{"name":"Social Cognitive Systems - CITEC, Bielefeld University, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4047-9277","authenticated-orcid":false,"given":"Stefan","family":"Kopp","sequence":"additional","affiliation":[{"name":"Social Cognitive Systems - CITEC, Bielefeld University, Germany"}]}],"member":"320","published-online":{"date-parts":[[2023,10,9]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"Michael Ahn Anthony Brohan Noah Brown Yevgen Chebotar Omar Cortes Byron David Chelsea Finn Chuyuan Fu Keerthana Gopalakrishnan Karol Hausman Alex Herzog Daniel Ho 
Jasmine Hsu Julian Ibarz Brian Ichter Alex Irpan Eric Jang Rosario\u00a0Jauregui Ruano Kyle Jeffrey Sally Jesmonth Nikhil Joshi Ryan Julian Dmitry Kalashnikov Yuheng Kuang Kuang-Huei Lee Sergey Levine Yao Lu Linda Luu Carolina Parada Peter Pastor Jornell Quiambao Kanishka Rao Jarek Rettinghouse Diego Reyes Pierre Sermanet Nicolas Sievers Clayton Tan Alexander Toshev Vincent Vanhoucke Fei Xia Ted Xiao Peng Xu Sichun Xu Mengyuan Yan and Andy Zeng. 2022. Do As I Can and Not As I Say: Grounding Language in Robotic Affordances. In arXiv preprint arXiv:2204.01691. Michael Ahn Anthony Brohan Noah Brown Yevgen Chebotar Omar Cortes Byron David Chelsea Finn Chuyuan Fu Keerthana Gopalakrishnan Karol Hausman Alex Herzog Daniel Ho Jasmine Hsu Julian Ibarz Brian Ichter Alex Irpan Eric Jang Rosario\u00a0Jauregui Ruano Kyle Jeffrey Sally Jesmonth Nikhil Joshi Ryan Julian Dmitry Kalashnikov Yuheng Kuang Kuang-Huei Lee Sergey Levine Yao Lu Linda Luu Carolina Parada Peter Pastor Jornell Quiambao Kanishka Rao Jarek Rettinghouse Diego Reyes Pierre Sermanet Nicolas Sievers Clayton Tan Alexander Toshev Vincent Vanhoucke Fei Xia Ted Xiao Peng Xu Sichun Xu Mengyuan Yan and Andy Zeng. 2022. Do As I Can and Not As I Say: Grounding Language in Robotic Affordances. In arXiv preprint arXiv:2204.01691."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.170"},{"key":"#cr-split#-e_1_3_2_1_3_1.1","unstructured":"Chaitanya Ahuja Dong\u00a0Won Lee Yukiko\u00a0I. Nakano and Louis-Philippe Morency. 2020. Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach. https:\/\/doi.org\/10.48550\/arXiv.2007.12553 arXiv:2007.12553 [cs]. 10.48550\/arXiv.2007.12553"},{"key":"#cr-split#-e_1_3_2_1_3_1.2","doi-asserted-by":"crossref","unstructured":"Chaitanya Ahuja Dong\u00a0Won Lee Yukiko\u00a0I. Nakano and Louis-Philippe Morency. 2020. Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach. 
https:\/\/doi.org\/10.48550\/arXiv.2007.12553 arXiv:2007.12553 [cs].","DOI":"10.1007\/978-3-030-58523-5_15"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3383652.3423874"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/1608912.1608922"},{"volume-title":"Intelligent Virtual Agents, David Hutchison, Takeo Kanade, Josef Kittler, Jon\u00a0M. Kleinberg, Friedemann Mattern, John\u00a0C. Mitchell, Moni Naor, Oscar Nierstrasz, C.\u00a0Pandu\u00a0Rangan","author":"Bergmann Kirsten","key":"e_1_3_2_1_6_1","unstructured":"Kirsten Bergmann , Sebastian Kahl , and Stefan Kopp . 2013. Modeling the Semantic Coordination of Speech and Gesture under Cognitive and Linguistic Constraints . In Intelligent Virtual Agents, David Hutchison, Takeo Kanade, Josef Kittler, Jon\u00a0M. Kleinberg, Friedemann Mattern, John\u00a0C. Mitchell, Moni Naor, Oscar Nierstrasz, C.\u00a0Pandu\u00a0Rangan , Bernhard Steffen , Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe\u00a0Y. Vardi, Gerhard Weikum, Ruth Aylett, Brigitte Krenn, Catherine Pelachaud, and Hiroshi Shimodaira (Eds.). Vol.\u00a08108. Springer Berlin Heidelberg , Berlin, Heidelberg, 203\u2013216. https:\/\/doi.org\/10.1007\/978-3-642-40415-3_18 Series Title : Lecture Notes in Computer Science. 10.1007\/978-3-642-40415-3_18 Kirsten Bergmann, Sebastian Kahl, and Stefan Kopp. 2013. Modeling the Semantic Coordination of Speech and Gesture under Cognitive and Linguistic Constraints. In Intelligent Virtual Agents, David Hutchison, Takeo Kanade, Josef Kittler, Jon\u00a0M. Kleinberg, Friedemann Mattern, John\u00a0C. Mitchell, Moni Naor, Oscar Nierstrasz, C.\u00a0Pandu\u00a0Rangan, Bernhard Steffen, Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe\u00a0Y. Vardi, Gerhard Weikum, Ruth Aylett, Brigitte Krenn, Catherine Pelachaud, and Hiroshi Shimodaira (Eds.). Vol.\u00a08108. Springer Berlin Heidelberg, Berlin, Heidelberg, 203\u2013216. 
https:\/\/doi.org\/10.1007\/978-3-642-40415-3_18 Series Title: Lecture Notes in Computer Science."},{"key":"#cr-split#-e_1_3_2_1_7_1.1","unstructured":"Piotr Bojanowski Edouard Grave Armand Joulin and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. https:\/\/doi.org\/10.48550\/arXiv.1607.04606 arXiv:1607.04606 [cs]. 10.48550\/arXiv.1607.04606"},{"key":"#cr-split#-e_1_3_2_1_7_1.2","doi-asserted-by":"crossref","unstructured":"Piotr Bojanowski Edouard Grave Armand Joulin and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. https:\/\/doi.org\/10.48550\/arXiv.1607.04606 arXiv:1607.04606 [cs].","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/344779.344865"},{"key":"e_1_3_2_1_9_1","volume-title":"Speech-Gesture Mismatches: Evidence for One Underlying Representation of Linguistic and Nonlinguistic Information. Cognition 7 (Jan","author":"Cassell Justine","year":"1994","unstructured":"Justine Cassell , David Mcneill , and Karl-Erik Mccullough . 1994. Speech-Gesture Mismatches: Evidence for One Underlying Representation of Linguistic and Nonlinguistic Information. Cognition 7 (Jan . 1994 ). https:\/\/doi.org\/10.1075\/pc.7.1.03cas 10.1075\/pc.7.1.03cas Justine Cassell, David Mcneill, and Karl-Erik Mccullough. 1994. Speech-Gesture Mismatches: Evidence for One Underlying Representation of Linguistic and Nonlinguistic Information. Cognition 7 (Jan. 1994). https:\/\/doi.org\/10.1075\/pc.7.1.03cas"},{"volume-title":"Life-Like Characters: Tools","author":"Cassell Justine","key":"e_1_3_2_1_10_1","unstructured":"Justine Cassell , Hannes\u00a0H\u00f6gni Vilhj\u00e1lmsson , and Timothy Bickmore . 2004. BEAT: the Behavior Expression Animation Toolkit . In Life-Like Characters: Tools , Affective Functions, and Applications, Helmut Prendinger and Mitsuru Ishizuka (Eds.). Springer , Berlin, Heidelberg , 163\u2013185. 
https:\/\/doi.org\/10.1007\/978-3-662-08373-4_8 10.1007\/978-3-662-08373-4_8 Justine Cassell, Hannes\u00a0H\u00f6gni Vilhj\u00e1lmsson, and Timothy Bickmore. 2004. BEAT: the Behavior Expression Animation Toolkit. In Life-Like Characters: Tools, Affective Functions, and Applications, Helmut Prendinger and Mitsuru Ishizuka (Eds.). Springer, Berlin, Heidelberg, 163\u2013185. https:\/\/doi.org\/10.1007\/978-3-662-08373-4_8"},{"key":"#cr-split#-e_1_3_2_1_11_1.1","doi-asserted-by":"crossref","unstructured":"Che-Jui Chang Sen Zhang and Mubbasir Kapadia. 2022. The IVI Lab entry to the GENEA Challenge 2022 - A Tacotron2 Based Method for Co-Speech Gesture Generation With Locality-Constraint Attention Mechanism. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 784-789. https:\/\/doi.org\/10.1145\/3536221.3558060 10.1145\/3536221.3558060","DOI":"10.1145\/3536221.3558060"},{"key":"#cr-split#-e_1_3_2_1_11_1.2","doi-asserted-by":"crossref","unstructured":"Che-Jui Chang Sen Zhang and Mubbasir Kapadia. 2022. The IVI Lab entry to the GENEA Challenge 2022 - A Tacotron2 Based Method for Co-Speech Gesture Generation With Locality-Constraint Attention Mechanism. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 784-789. https:\/\/doi.org\/10.1145\/3536221.3558060","DOI":"10.1145\/3536221.3558060"},{"key":"e_1_3_2_1_12_1","volume-title":"Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach","author":"Chiu Chung-Cheng","year":"2015","unstructured":"Chung-Cheng Chiu , Louis-Philippe Morency , and Stacy Marsella . 2015. Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach . In Intelligent Virtual Agents, Willem-Paul Brinkman, Joost Broekens, and Dirk Heylen (Eds.). Vol.\u00a09238. Springer International Publishing , Cham, 152\u2013166. https:\/\/doi.org\/10.1007\/978-3-319-21996-7_17 Series Title : Lecture Notes in Computer Science. 
10.1007\/978-3-319-21996-7_17 Chung-Cheng Chiu, Louis-Philippe Morency, and Stacy Marsella. 2015. Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach. In Intelligent Virtual Agents, Willem-Paul Brinkman, Joost Broekens, and Dirk Heylen (Eds.). Vol.\u00a09238. Springer International Publishing, Cham, 152\u2013166. https:\/\/doi.org\/10.1007\/978-3-319-21996-7_17 Series Title: Lecture Notes in Computer Science."},{"key":"#cr-split#-e_1_3_2_1_13_1.1","unstructured":"Kyunghyun Cho Bart van Merrienboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. https:\/\/doi.org\/10.48550\/arXiv.1406.1078 arXiv:1406.1078 [cs stat]. 10.48550\/arXiv.1406.1078"},{"key":"#cr-split#-e_1_3_2_1_13_1.2","unstructured":"Kyunghyun Cho Bart van Merrienboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. https:\/\/doi.org\/10.48550\/arXiv.1406.1078 arXiv:1406.1078 [cs stat]."},{"key":"e_1_3_2_1_14_1","volume-title":"The Role of Gesture in Communication and Cognition: Implications for Understanding and Treating Neurogenic Communication Disorders. Frontiers in Human Neuroscience 14","author":"Clough Sharice","year":"2020","unstructured":"Sharice Clough and Melissa\u00a0 C. Duff . 2020. The Role of Gesture in Communication and Cognition: Implications for Understanding and Treating Neurogenic Communication Disorders. Frontiers in Human Neuroscience 14 ( 2020 ). https:\/\/doi.org\/10.3389\/fnhum.2020.00323 10.3389\/fnhum.2020.00323 Sharice Clough and Melissa\u00a0C. Duff. 2020. The Role of Gesture in Communication and Cognition: Implications for Understanding and Treating Neurogenic Communication Disorders. Frontiers in Human Neuroscience 14 (2020). 
https:\/\/doi.org\/10.3389\/fnhum.2020.00323"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10919-011-0112-7"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1080\/20445911.2012.743987"},{"key":"#cr-split#-e_1_3_2_1_17_1.1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https:\/\/doi.org\/10.48550\/arXiv.2010.11929 arXiv:2010.11929 [cs]. 10.48550\/arXiv.2010.11929"},{"key":"#cr-split#-e_1_3_2_1_17_1.2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https:\/\/doi.org\/10.48550\/arXiv.2010.11929 arXiv:2010.11929 [cs]."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3359566.3360053"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1006\/cviu.2000.0894"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/IROS51168.2021.9636069"},{"key":"#cr-split#-e_1_3_2_1_21_1.1","doi-asserted-by":"crossref","unstructured":"Shiry Ginosar Amir Bar Gefen Kohavi Caroline Chan Andrew Owens and Jitendra Malik. 2019. Learning Individual Styles of Conversational Gesture. https:\/\/doi.org\/10.48550\/arXiv.1906.04160 arXiv:1906.04160 [cs eess]. 10.48550\/arXiv.1906.04160","DOI":"10.1109\/CVPR.2019.00361"},{"key":"#cr-split#-e_1_3_2_1_21_1.2","doi-asserted-by":"crossref","unstructured":"Shiry Ginosar Amir Bar Gefen Kohavi Caroline Chan Andrew Owens and Jitendra Malik. 2019. Learning Individual Styles of Conversational Gesture. 
https:\/\/doi.org\/10.48550\/arXiv.1906.04160 arXiv:1906.04160 [cs eess].","DOI":"10.1109\/CVPR.2019.00361"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1080\/10867651.1998.10487493"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3267851.3267878"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417836"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073663"},{"key":"e_1_3_2_1_27_1","volume-title":"Proceedings of the 5th Conference on Robot Learning(Proceedings of Machine Learning Research, Vol.\u00a0164)","author":"Jang Eric","year":"2022","unstructured":"Eric Jang , Alex Irpan , Mohi Khansari , Daniel Kappler , Frederik Ebert , Corey Lynch , Sergey Levine , and Chelsea Finn . 2022 . BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning . In Proceedings of the 5th Conference on Robot Learning(Proceedings of Machine Learning Research, Vol.\u00a0164) , Aleksandra Faust, David Hsu, and Gerhard Neumann (Eds.). PMLR, 991\u20131002. https:\/\/proceedings.mlr.press\/v164\/jang22a.html Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. 2022. BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning. In Proceedings of the 5th Conference on Robot Learning(Proceedings of Machine Learning Research, Vol.\u00a0164), Aleksandra Faust, David Hsu, and Gerhard Neumann (Eds.). PMLR, 991\u20131002. https:\/\/proceedings.mlr.press\/v164\/jang22a.html"},{"key":"#cr-split#-e_1_3_2_1_28_1.1","doi-asserted-by":"crossref","unstructured":"Naoshi Kaneko Yuna Mitsubayashi and Geng Mu. 2022. TransGesture: Autoregressive Gesture Generation with RNN-Transducer. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 753-757. 
https:\/\/doi.org\/10.1145\/3536221.3558061 10.1145\/3536221.3558061","DOI":"10.1145\/3536221.3558061"},{"key":"#cr-split#-e_1_3_2_1_28_1.2","doi-asserted-by":"crossref","unstructured":"Naoshi Kaneko Yuna Mitsubayashi and Geng Mu. 2022. TransGesture: Autoregressive Gesture Generation with RNN-Transducer. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 753-757. https:\/\/doi.org\/10.1145\/3536221.3558061","DOI":"10.1145\/3536221.3558061"},{"key":"e_1_3_2_1_29_1","volume-title":"Intelligent Virtual Agents(Lecture Notes in Computer Science)","author":"Kopp Stefan","year":"2006","unstructured":"Stefan Kopp , Brigitte Krenn , Stacy Marsella , Andrew\u00a0 N. Marshall , Catherine Pelachaud , Hannes Pirker , Kristinn\u00a0 R. Th\u00f3risson , and Hannes Vilhj\u00e1lmsson . 2006. Towards a Common Framework for Multimodal Generation: The Behavior Markup Language . In Intelligent Virtual Agents(Lecture Notes in Computer Science) , Jonathan Gratch, Michael Young, Ruth Aylett, Daniel Ballin, and Patrick Olivier (Eds.). Springer , Berlin, Heidelberg , 205\u2013217. https:\/\/doi.org\/10.1007\/11821830_17 10.1007\/11821830_17 Stefan Kopp, Brigitte Krenn, Stacy Marsella, Andrew\u00a0N. Marshall, Catherine Pelachaud, Hannes Pirker, Kristinn\u00a0R. Th\u00f3risson, and Hannes Vilhj\u00e1lmsson. 2006. Towards a Common Framework for Multimodal Generation: The Behavior Markup Language. In Intelligent Virtual Agents(Lecture Notes in Computer Science), Jonathan Gratch, Michael Young, Ruth Aylett, Daniel Ballin, and Patrick Olivier (Eds.). Springer, Berlin, Heidelberg, 205\u2013217. https:\/\/doi.org\/10.1007\/11821830_17"},{"key":"#cr-split#-e_1_3_2_1_30_1.1","doi-asserted-by":"crossref","unstructured":"Vladislav Korzun Anna Beloborodova and Arkady Ilin. 2022. ReCell: replicating recurrent cell for auto-regressive pose generation. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 94-97. 
https:\/\/doi.org\/10.1145\/3536220.3558801 10.1145\/3536220.3558801","DOI":"10.1145\/3536220.3558801"},{"key":"#cr-split#-e_1_3_2_1_30_1.2","doi-asserted-by":"crossref","unstructured":"Vladislav Korzun Anna Beloborodova and Arkady Ilin. 2022. ReCell: replicating recurrent cell for auto-regressive pose generation. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 94-97. https:\/\/doi.org\/10.1145\/3536220.3558801","DOI":"10.1145\/3536220.3558801"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3577190.3616120"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3461615.3485408"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00085"},{"key":"e_1_3_2_1_34_1","volume-title":"Intelligent Virtual Agents(Lecture Notes in Computer Science)","author":"Lee Jina","year":"2006","unstructured":"Jina Lee and Stacy Marsella . 2006. Nonverbal Behavior Generator for Embodied Conversational Agents . In Intelligent Virtual Agents(Lecture Notes in Computer Science) , Jonathan Gratch, Michael Young, Ruth Aylett, Daniel Ballin, and Patrick Olivier (Eds.). Springer, Berlin , Heidelberg , 243\u2013255. https:\/\/doi.org\/10.1007\/11821830_20 10.1007\/11821830_20 Jina Lee and Stacy Marsella. 2006. Nonverbal Behavior Generator for Embodied Conversational Agents. In Intelligent Virtual Agents(Lecture Notes in Computer Science), Jonathan Gratch, Michael Young, Ruth Aylett, Daniel Ballin, and Patrick Olivier (Eds.). Springer, Berlin, Heidelberg, 243\u2013255. https:\/\/doi.org\/10.1007\/11821830_20"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.3724\/SP.J.2096-5796.2018.0006"},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"crossref","unstructured":"Kevin Lin Christopher Agia Toki Migimatsu Marco Pavone and Jeannette Bohg. 2023. Text2Motion: From Natural Language Instructions to Feasible Plans. 
arxiv:2303.12153\u00a0[cs.RO] Kevin Lin Christopher Agia Toki Migimatsu Marco Pavone and Jeannette Bohg. 2023. Text2Motion: From Natural Language Instructions to Feasible Plans. arxiv:2303.12153\u00a0[cs.RO]","DOI":"10.1007\/s10514-023-10131-7"},{"key":"#cr-split#-e_1_3_2_1_37_1.1","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. https:\/\/doi.org\/10.48550\/arXiv.1711.05101 arXiv:1711.05101 [cs math]. 10.48550\/arXiv.1711.05101"},{"key":"#cr-split#-e_1_3_2_1_37_1.2","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. https:\/\/doi.org\/10.48550\/arXiv.1711.05101 arXiv:1711.05101 [cs math]."},{"key":"e_1_3_2_1_38_1","volume-title":"The DeepMotion entry to the GENEA Challenge","author":"Lu Shuhong","year":"2022","unstructured":"Shuhong Lu and Andrew Feng . 2022. The DeepMotion entry to the GENEA Challenge 2022 . In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM , Bengaluru India , 790\u2013796. https:\/\/doi.org\/10.1145\/3536221.3558059 10.1145\/3536221.3558059 Shuhong Lu and Andrew Feng. 2022. The DeepMotion entry to the GENEA Challenge 2022. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM, Bengaluru India, 790\u2013796. https:\/\/doi.org\/10.1145\/3536221.3558059"},{"key":"e_1_3_2_1_39_1","unstructured":"Andrew\u00a0L Maas Awni\u00a0Y Hannun and Andrew\u00a0Y Ng. [n. d.]. Rectifier Nonlinearities Improve Neural Network Acoustic Models. ([n. d.]). Andrew\u00a0L Maas Awni\u00a0Y Hannun and Andrew\u00a0Y Ng. [n. d.]. Rectifier Nonlinearities Improve Neural Network Acoustic Models. ([n. d.])."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485895.2485900"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1111\/cgf.14776"},{"key":"e_1_3_2_1_42_1","volume-title":"Towards A Unified Agent with Foundation Models. 
In Workshop on Reincarnating Reinforcement Learning at ICLR","author":"Palo Norman\u00a0Di","year":"2023","unstructured":"Norman\u00a0Di Palo , Arunkumar Byravan , Leonard Hasenclever , Markus Wulfmeier , Nicolas Heess , and Martin Riedmiller . 2023 . Towards A Unified Agent with Foundation Models. In Workshop on Reincarnating Reinforcement Learning at ICLR 2023. https:\/\/openreview.net\/forum?id=JK_B1tB6p- Norman\u00a0Di Palo, Arunkumar Byravan, Leonard Hasenclever, Markus Wulfmeier, Nicolas Heess, and Martin Riedmiller. 2023. Towards A Unified Agent with Foundation Models. In Workshop on Reincarnating Reinforcement Learning at ICLR 2023. https:\/\/openreview.net\/forum?id=JK_B1tB6p-"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11671"},{"key":"#cr-split#-e_1_3_2_1_44_1.1","doi-asserted-by":"crossref","unstructured":"Khaled Saleh. 2022. Hybrid Seq2Seq Architecture for 3D Co-Speech Gesture Generation. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 748-752. https:\/\/doi.org\/10.1145\/3536221.3558064 10.1145\/3536221.3558064","DOI":"10.1145\/3536221.3558064"},{"key":"#cr-split#-e_1_3_2_1_44_1.2","doi-asserted-by":"crossref","unstructured":"Khaled Saleh. 2022. Hybrid Seq2Seq Architecture for 3D Co-Speech Gesture Generation. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 748-752. https:\/\/doi.org\/10.1145\/3536221.3558064","DOI":"10.1145\/3536221.3558064"},{"key":"e_1_3_2_1_45_1","volume-title":"GLU Variants Improve Transformer. CoRR abs\/2002.05202","author":"Shazeer Noam","year":"2020","unstructured":"Noam Shazeer . 2020. GLU Variants Improve Transformer. CoRR abs\/2002.05202 ( 2020 ). arXiv:2002.05202https:\/\/arxiv.org\/abs\/2002.05202 Noam Shazeer. 2020. GLU Variants Improve Transformer. CoRR abs\/2002.05202 (2020). 
arXiv:2002.05202https:\/\/arxiv.org\/abs\/2002.05202"},{"key":"e_1_3_2_1_46_1","unstructured":"Mingyang Sun Mengchen Zhao Yaqing Hou Minglei Li Huang Xu Songcen Xu and Jianye Hao. [n. d.]. Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards. ([n. d.]). Mingyang Sun Mengchen Zhao Yaqing Hou Minglei Li Huang Xu Songcen Xu and Jianye Hao. [n. d.]. Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards. ([n. d.])."},{"volume-title":"Proceedings of the 26th Annual International Conference on Machine Learning. ACM, Montreal Quebec Canada, 1025\u20131032","author":"W.","key":"e_1_3_2_1_47_1","unstructured":"Graham\u00a0 W. Taylor and Geoffrey\u00a0E. Hinton. 2009. Factored conditional restricted Boltzmann Machines for modeling motion style . In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, Montreal Quebec Canada, 1025\u20131032 . https:\/\/doi.org\/10.1145\/1553374.1553505 10.1145\/1553374.1553505 Graham\u00a0W. Taylor and Geoffrey\u00a0E. Hinton. 2009. Factored conditional restricted Boltzmann Machines for modeling motion style. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, Montreal Quebec Canada, 1025\u20131032. https:\/\/doi.org\/10.1145\/1553374.1553505"},{"key":"#cr-split#-e_1_3_2_1_48_1.1","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar Aurelien Rodriguez Armand Joulin Edouard Grave and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. https:\/\/doi.org\/10.48550\/arXiv.2302.13971 arXiv:2302.13971 [cs]. 
10.48550\/arXiv.2302.13971"},{"key":"#cr-split#-e_1_3_2_1_48_1.2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar Aurelien Rodriguez Armand Joulin Edouard Grave and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. https:\/\/doi.org\/10.48550\/arXiv.2302.13971 arXiv:2302.13971 [cs]."},{"key":"e_1_3_2_1_49_1","volume-title":"Advances in Neural Information Processing Systems, I.\u00a0Guyon, U.\u00a0Von Luxburg, S.\u00a0Bengio, H.\u00a0Wallach, R.\u00a0Fergus, S.\u00a0Vishwanathan, and R.\u00a0Garnett (Eds.). Vol.\u00a030. Curran Associates","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan\u00a0 N Gomez , \u0141\u00a0ukasz Kaiser , and Illia Polosukhin . 2017. Attention is All you Need . In Advances in Neural Information Processing Systems, I.\u00a0Guyon, U.\u00a0Von Luxburg, S.\u00a0Bengio, H.\u00a0Wallach, R.\u00a0Fergus, S.\u00a0Vishwanathan, and R.\u00a0Garnett (Eds.). Vol.\u00a030. Curran Associates , Inc .https:\/\/proceedings.neurips.cc\/paper_files\/paper\/ 2017 \/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan\u00a0N Gomez, \u0141\u00a0ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I.\u00a0Guyon, U.\u00a0Von Luxburg, S.\u00a0Bengio, H.\u00a0Wallach, R.\u00a0Fergus, S.\u00a0Vishwanathan, and R.\u00a0Garnett (Eds.). Vol.\u00a030. Curran Associates, Inc.https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_3_2_1_50_1","volume-title":"Gesture and speech in interaction: An overview. 
Speech Communication 57 (Feb","author":"Wagner Petra","year":"2014","unstructured":"Petra Wagner , Zofia Malisz , and Stefan Kopp . 2014. Gesture and speech in interaction: An overview. Speech Communication 57 (Feb . 2014 ), 209\u2013232. https:\/\/doi.org\/10.1016\/j.specom.2013.09.008 10.1016\/j.specom.2013.09.008 Petra Wagner, Zofia Malisz, and Stefan Kopp. 2014. Gesture and speech in interaction: An overview. Speech Communication 57 (Feb. 2014), 209\u2013232. https:\/\/doi.org\/10.1016\/j.specom.2013.09.008"},{"key":"e_1_3_2_1_51_1","volume-title":"UEA Digital Humans entry to the GENEA Challenge","author":"Windle Jonathan","year":"2022","unstructured":"Jonathan Windle , David Greenwood , and Sarah Taylor . 2022. UEA Digital Humans entry to the GENEA Challenge 2022 . In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM , Bengaluru India , 771\u2013777. https:\/\/doi.org\/10.1145\/3536221.3558065 10.1145\/3536221.3558065 Jonathan Windle, David Greenwood, and Sarah Taylor. 2022. UEA Digital Humans entry to the GENEA Challenge 2022. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM, Bengaluru India, 771\u2013777. https:\/\/doi.org\/10.1145\/3536221.3558065"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3461615.3485407"},{"key":"#cr-split#-e_1_3_2_1_53_1.1","unstructured":"Jiqing Wu Zhiwu Huang Janine Thoma Dinesh Acharya and Luc Van\u00a0Gool. 2018. Wasserstein Divergence for GANs. https:\/\/doi.org\/10.48550\/arXiv.1712.01026 arXiv:1712.01026 [cs]. 10.48550\/arXiv.1712.01026"},{"key":"#cr-split#-e_1_3_2_1_53_1.2","unstructured":"Jiqing Wu Zhiwu Huang Janine Thoma Dinesh Acharya and Luc Van\u00a0Gool. 2018. Wasserstein Divergence for GANs. 
https:\/\/doi.org\/10.48550\/arXiv.1712.01026 arXiv:1712.01026 [cs]."},{"key":"e_1_3_2_1_54_1","volume-title":"The ReprGesture entry to the GENEA Challenge","author":"Yang Sicheng","year":"2022","unstructured":"Sicheng Yang , Zhiyong Wu , Minglei Li , Mengchen Zhao , Jiuxin Lin , Liyang Chen , and Weihong Bao . 2022. The ReprGesture entry to the GENEA Challenge 2022 . In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM , Bengaluru India , 758\u2013763. https:\/\/doi.org\/10.1145\/3536221.3558066 10.1145\/3536221.3558066 Sicheng Yang, Zhiyong Wu, Minglei Li, Mengchen Zhao, Jiuxin Lin, Liyang Chen, and Weihong Bao. 2022. The ReprGesture entry to the GENEA Challenge 2022. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM, Bengaluru India, 758\u2013763. https:\/\/doi.org\/10.1145\/3536221.3558066"},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417838"},{"key":"e_1_3_2_1_56_1","volume-title":"Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In 2019 International Conference on Robotics and Automation (ICRA)","author":"Yoon Youngwoo","year":"2019","unstructured":"Youngwoo Yoon , Woo-Ri Ko , Minsu Jang , Jaeyeon Lee , Jaehong Kim , and Geehyuk Lee . 2019. Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In 2019 International Conference on Robotics and Automation (ICRA) . IEEE, Montreal, QC , Canada , 4303\u20134309. https:\/\/doi.org\/10.1109\/ICRA.2019.8793720 10.1109\/ICRA.2019.8793720 Youngwoo Yoon, Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2019. Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, Montreal, QC, Canada, 4303\u20134309. 
https:\/\/doi.org\/10.1109\/ICRA.2019.8793720"},{"key":"e_1_3_2_1_57_1","volume-title":"Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In 2019 International Conference on Robotics and Automation (ICRA)","author":"Yoon Youngwoo","year":"2019","unstructured":"Youngwoo Yoon , Woo-Ri Ko , Minsu Jang , Jaeyeon Lee , Jaehong Kim , and Geehyuk Lee . 2019. Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In 2019 International Conference on Robotics and Automation (ICRA) . IEEE, Montreal, QC , Canada , 4303\u20134309. https:\/\/doi.org\/10.1109\/ICRA.2019.8793720 10.1109\/ICRA.2019.8793720 Youngwoo Yoon, Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2019. Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, Montreal, QC, Canada, 4303\u20134309. https:\/\/doi.org\/10.1109\/ICRA.2019.8793720"},{"key":"#cr-split#-e_1_3_2_1_58_1.1","doi-asserted-by":"crossref","unstructured":"Youngwoo Yoon Pieter Wolfert Taras Kucherenko Carla Viegas Teodor Nikolov Mihail Tsakov and Gustav\u00a0Eje Henter. 2022. The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 736-747. https:\/\/doi.org\/10.1145\/3536221.3558058 10.1145\/3536221.3558058","DOI":"10.1145\/3536221.3558058"},{"key":"#cr-split#-e_1_3_2_1_58_1.2","doi-asserted-by":"crossref","unstructured":"Youngwoo Yoon Pieter Wolfert Taras Kucherenko Carla Viegas Teodor Nikolov Mihail Tsakov and Gustav\u00a0Eje Henter. 2022. The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 736-747. 
https:\/\/doi.org\/10.1145\/3536221.3558058","DOI":"10.1145\/3536221.3558058"},{"key":"#cr-split#-e_1_3_2_1_59_1.1","doi-asserted-by":"crossref","unstructured":"Chi Zhou Tengyue Bian and Kang Chen. 2022. GestureMaster: Graph-based Speech-driven Gesture Generation. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 764-770. https:\/\/doi.org\/10.1145\/3536221.3558063 10.1145\/3536221.3558063","DOI":"10.1145\/3536221.3558063"},{"key":"#cr-split#-e_1_3_2_1_59_1.2","doi-asserted-by":"crossref","unstructured":"Chi Zhou Tengyue Bian and Kang Chen. 2022. GestureMaster: Graph-based Speech-driven Gesture Generation. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION. ACM Bengaluru India 764-770. https:\/\/doi.org\/10.1145\/3536221.3558063","DOI":"10.1145\/3536221.3558063"}],"event":{"name":"ICMI '23: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction"],"location":"Paris France","acronym":"ICMI '23"},"container-title":["INTERNATIONAL CONFERENCE ON MULTIMODAL 
INTERACTION"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3577190.3616115","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3577190.3616115","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:02Z","timestamp":1750178222000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3577190.3616115"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,9]]},"references-count":73,"alternative-id":["10.1145\/3577190.3616115","10.1145\/3577190"],"URL":"https:\/\/doi.org\/10.1145\/3577190.3616115","relation":{},"subject":[],"published":{"date-parts":[[2023,10,9]]},"assertion":[{"value":"2023-10-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}