{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,27]],"date-time":"2025-07-27T07:25:02Z","timestamp":1753601102505,"version":"3.41.0"},"reference-count":237,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,1,30]],"date-time":"2024-01-30T00:00:00Z","timestamp":1706572800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Hum.-Robot Interact."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>The development of data-driven behaviour generating systems has recently become the focus of considerable attention in the fields of human\u2013agent interaction and human\u2013robot interaction. Although rule-based approaches were dominant for years, these proved inflexible and expensive to develop. The difficulty of developing production rules, as well as the need for manual configuration to generate artificial behaviours, places a limit on how complex and diverse rule-based behaviours can be. In contrast, actual human\u2013human interaction data collected using tracking and recording devices makes humanlike multimodal co-speech behaviour generation possible using machine learning and specifically, in recent years, deep learning. 
This survey provides an overview of the state of the art of deep learning-based co-speech behaviour generation models and offers an outlook for future research in this area.<\/jats:p>","DOI":"10.1145\/3609235","type":"journal-article","created":{"date-parts":[[2023,8,16]],"date-time":"2023-08-16T12:15:10Z","timestamp":1692188110000},"page":"1-39","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Data-driven Communicative Behaviour Generation: A Survey"],"prefix":"10.1145","volume":"13","author":[{"given":"Nurziya","family":"Oralbayeva","sequence":"first","affiliation":[{"name":"Department of Robotics and Mechatronics, School of Engineering and Digital Sciences, Nazarbayev University, Kazakhstan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Amir","family":"Aly","sequence":"additional","affiliation":[{"name":"School of Engineering, Computing and Mathematics, University of Plymouth, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Anara","family":"Sandygulova","sequence":"additional","affiliation":[{"name":"Department of Robotics and Mechatronics, School of Engineering and Digital Sciences, Nazarbayev University, Kazakhstan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tony","family":"Belpaeme","sequence":"additional","affiliation":[{"name":"Ghent University, IDLab-imec, Belgium"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,1,30]]},"reference":[{"key":"e_1_3_3_2_2","unstructured":"Kyubyong Park. 2018. KSS Dataset: Korean Single Speaker Speech Dataset. 
https:\/\/www.kaggle.com\/bryanpark\/korean-single-speaker-speech-dataset"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.5898\/JHRI.6.1.Admoni"},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58523-5_15"},{"key":"e_1_3_3_5_2","first-page":"1","volume-title":"Proceedings of the 11th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS\u201910)","author":"Aifanti Niki","year":"2010","unstructured":"Niki Aifanti, Christos Papachristou, and Anastasios Delopoulos. 2010. The MUG facial expression database. In Proceedings of the 11th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS\u201910). IEEE, Desenzano del Garda, Italy, 1\u20134."},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1111\/cgf.13946"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1080\/00401706.1971.10488811"},{"key":"e_1_3_3_8_2","first-page":"113","volume-title":"International Conference on Cooperative Multimodal Communication","author":"Allwood Jens","year":"1998","unstructured":"Jens Allwood. 1998. Cooperation and flexibility in multimodal communication. In International Conference on Cooperative Multimodal Communication. Springer, Berlin, 113\u2013124."},{"key":"e_1_3_3_9_2","first-page":"195","volume-title":"Proceedings of the 34th International Conference on Machine Learning (ICML\u201917)","author":"Arik S.","year":"2017","unstructured":"S. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, and M. Shoeybi. 2017. Deep Voice: Real-time neural text-to-speech. In Proceedings of the 34th International Conference on Machine Learning (ICML\u201917). 
195\u2013204."},{"key":"e_1_3_3_10_2","first-page":"214","volume-title":"Proceedings of the International Conference on Machine Learning (ICML\u201917)","author":"Arjovsky Martin","year":"2017","unstructured":"Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. 2017. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning (ICML\u201917). 214\u2013223."},{"key":"e_1_3_3_11_2","volume-title":"Working Memory","author":"Baddeley A. D.","year":"1986","unstructured":"A. D. Baddeley. 1986. Working Memory. Oxford University Press, Oxford, UK."},{"key":"e_1_3_3_12_2","volume-title":"Proceedings of the 3rd International Conference on Learning Representations (ICLR\u201915)","author":"Bahdanau D.","year":"2015","unstructured":"D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR\u201915)."},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIINFS.2007.4579232"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1017\/9781108676649"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2015.03.005"},{"key":"e_1_3_3_16_2","article-title":"High fidelity speech synthesis with adversarial networks","author":"Bi\u0144kowski Miko\u0142aj","year":"2019","unstructured":"Miko\u0142aj Bi\u0144kowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, and Karen Simonyan. 2019. High fidelity speech synthesis with adversarial networks. arXiv:1909.11646. Retrieved from https:\/\/arxiv.org\/abs\/1909.11646","journal-title":"arXiv:1909.11646"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2018.10.009"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICME.2015.7177478"},{"key":"e_1_3_3_19_2","unstructured":"J. Bradbury S. Merity C. Xiong and R. Socher. 
2016. Quasi-recurrent neural networks. ArXiv abs\/1611.01576 (2016). https:\/\/api.semanticscholar.org\/CorpusID:51559"},{"key":"e_1_3_3_20_2","article-title":"Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems.","author":"BS ITUR","year":"2015","unstructured":"ITUR BS. 2015. Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems. International Telecommunication Union, Geneva, Switzerland. https:\/\/www.itu.int\/dms_pubrec\/itu-r\/rec\/bs\/R-REC-BS.1534-3-201510-I!!PDF-E.pdf","journal-title":"International Telecommunication Union, Geneva, Switzerland"},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10579-008-9076-6"},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2007.905145"},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.3390\/app11031144"},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2014.2336244"},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","unstructured":"Zhe Cao Tomas Simon Shih-En Wei and Yaser Sheikh. 2017. Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. 1302\u20131310. DOI:10.1109\/CVPR.2017.143","DOI":"10.1109\/CVPR.2017.143"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2019.8756570"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_32"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00802"},{"key":"e_1_3_3_29_2","unstructured":"Nanxin Chen Yu Zhang Heiga Zen Ron J. Weiss Mohammad Norouzi and William Chan. 2021. WaveGrad: Estimating gradients for waveform generation. arXiv:2009.00713. Retrieved from https:\/\/arxiv.org\/abs\/2009.00713"},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23974-8_14"},{"key":"e_1_3_3_31_2","unstructured":"Chung-Cheng Chiu and Stacy Marsella. 2011. A style controller for generating virtual human behaviors. 
In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS\u201911 Taipei Taiwan May 2-6 2011 Volume 1-3) Liz Sonenberg Peter Stone Kagan Tumer and Pinar Yolum (Eds.). IFAAMAS. http:\/\/portal.acm.org\/citation.cfm?id=2034415&CFID=69154334&CFTOKEN=45298625"},{"key":"e_1_3_3_32_2","first-page":"781","volume-title":"Proceedings of the International Conference on Autonomous Agents and Multi-agent Systems (AAMAS\u201914)","author":"Chiu Chung-Cheng","year":"2014","unstructured":"Chung-Cheng Chiu and Stacy Marsella. 2014. Gesture generation with low-dimensional embeddings. In Proceedings of the International Conference on Autonomous Agents and Multi-agent Systems (AAMAS\u201914). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 781\u2013788."},{"key":"e_1_3_3_33_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1179"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2009.4960497"},{"key":"e_1_3_3_35_2","volume-title":"Proceedings of the Deep Learning and Representation Learning Workshop at the 28th International Conference on Neural Information Processing Systems (NIPS\u201914)","author":"Chung J.","year":"2014","unstructured":"J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. 2014. Empirical evaluation of Gated Recurrent Neural Networks on sequence modeling. In Proceedings of the Deep Learning and Representation Learning Workshop at the 28th International Conference on Neural Information Processing Systems (NIPS\u201914)."},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1506.02216"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","unstructured":"J. S. Chung and A. Zisserman. 2017. Lip reading in the wild. In Computer Vision \u2013 (ACCV\u201916) S. H. Lai V. Lepetit K. Nishino and Y. Sato (Eds.). Lecture Notes in Computer Science Vol. 10112 Springer Cham. 
10.1007\/978-3-319-54184-6_6","DOI":"10.1007\/978-3-319-54184-6_6"},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1705.02966"},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","unstructured":"J. S. Chung and A. Zisserman. 2017. Out of time: Automated lip sync in the wild. In Computer Vision \u2013 ACCV 2016 Workshops (ACCV\u201916) C. S. Chen J. Lu and K. K. Ma (Eds.). Lecture Notes in Computer Science Vol. 10117 Springer Cham. 10.1007\/978-3-319-54427-4_19","DOI":"10.1007\/978-3-319-54427-4_19"},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1017\/cbo9780511843891.014"},{"key":"e_1_3_3_41_2","doi-asserted-by":"publisher","DOI":"10.1121\/1.2229005"},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/34.927467"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1022627411411"},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","unstructured":"Sara Dahmani Vincent Colotte Val\u00e9rian Girard and Slim Ouni. 2019. Conditional variational auto-encoder for text-driven expressive audiovisual speech synthesis. In Proceeding of the Interspeech 2019. 2598\u20132602. DOI:10.21437\/Interspeech.2019-2848","DOI":"10.21437\/Interspeech.2019-2848"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-59789-3_51"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-014-2156-2"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2015-137"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2014-186"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-40415-3_19"},{"key":"e_1_3_3_50_2","first-page":"773","volume-title":"Proceedings of the International Conference on Autonomous Agents and Multi-agent Systems","author":"Ding Yu","year":"2014","unstructured":"Yu Ding, Ken Prepin, Jing Huang, Catherine Pelachaud, and Thierry Arti\u00e8res. 2014. 
Laughter animation synthesis. In Proceedings of the International Conference on Autonomous Agents and Multi-agent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 773\u2013780."},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF01115465"},{"key":"e_1_3_3_52_2","volume-title":"Facial Action Coding System: A Technique for the Measurement of Facial Movement","author":"Ekman P.","year":"1978","unstructured":"P. Ekman and W. V. Friesen. 1978. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, CA."},{"key":"e_1_3_3_53_2","first-page":"39","volume-title":"Emotion in the Human Face","author":"Ekman P.","year":"1982","unstructured":"P. Ekman, W. V. Friesen, and P. Ellsworth. 1982. What emotion categories or dimensions can observers judge from facial behavior? In Emotion in the Human Face, P. Ekman (Ed.). Cambridge University Press, NY, 39\u201355."},{"key":"e_1_3_3_54_2","first-page":"27","article-title":"Universal facial expressions of emotion","author":"Ekman Paul","year":"1997","unstructured":"Paul Ekman and Dacher Keltner. 1997. Universal facial expressions of emotion. In Nonverbal Communication: Where Nature Meets Culture, U. Segerstrale and P. Molnar (Eds.). 27\u201346.","journal-title":"Segerstrale and P. 
Molnar (Eds.)"},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACII.2017.8273663"},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/1778765.1778828"},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/34.598232"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2015.2457417"},{"key":"e_1_3_3_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178899"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11042-015-2944-3"},{"key":"e_1_3_3_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2010.2052239"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3382507.3421155"},{"key":"e_1_3_3_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/3267851.3267898"},{"key":"e_1_3_3_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3359566.3360053"},{"key":"e_1_3_3_65_2","article-title":"Practical Optimization Methods","author":"Fletcher R.","year":"1987","unstructured":"R. Fletcher. 1987. Practical Optimization Methods. John Wiley & Sons, New York, NY.","journal-title":"John Wiley & Sons, New York, NY"},{"key":"e_1_3_3_66_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jhealeco.2006.04.002"},{"key":"e_1_3_3_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2005.843341"},{"key":"e_1_3_3_68_2","article-title":"A neural algorithm of artistic style","author":"Gatys Leon A.","year":"2015","unstructured":"Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015. A neural algorithm of artistic style. arXiv:1508.06576. Retrieved from https:\/\/arxiv.org\/abs\/1508.06576","journal-title":"arXiv:1508.06576"},{"key":"e_1_3_3_69_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007425814087"},{"key":"e_1_3_3_70_2","first-page":"2962","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS)","author":"Gibiansky A.","year":"2017","unstructured":"A. Gibiansky, S. Arik, G. Diamos, J. 
Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou. 2017. Deep Voice 2: Multi-speaker neural text-to-speech. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS). 2962\u20132970."},{"key":"e_1_3_3_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00361"},{"key":"e_1_3_3_72_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-264"},{"key":"e_1_3_3_73_2","volume-title":"Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS): Advances in Neural Information Processing Systems","author":"Goodfellow I.","year":"2014","unstructured":"I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS): Advances in Neural Information Processing Systems."},{"key":"e_1_3_3_74_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24797-2_2"},{"key":"e_1_3_3_75_2","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143891"},{"key":"e_1_3_3_76_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2005.06.042"},{"key":"e_1_3_3_77_2","volume-title":"Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC\u201904)","author":"Gravier G.","year":"2004","unstructured":"G. Gravier, J-F. Bonastre, E. Geoffrois, S. Galliano, K. McTait, and K. Choukri. 2004. The ESTER evaluation campaign for the rich transcription of French broadcast news. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC\u201904). 
European Language Resources Association."},{"key":"e_1_3_3_78_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-894"},{"key":"e_1_3_3_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASSP.1984.1164317"},{"key":"e_1_3_3_80_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASSP.1984.1164317"},{"key":"e_1_3_3_81_2","volume-title":"Advances in Neural Information Processing Systems","author":"Gulrajani Ishaan","year":"2017","unstructured":"Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. 2017. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc."},{"key":"e_1_3_3_82_2","doi-asserted-by":"publisher","DOI":"10.1145\/2813852.2813860"},{"key":"e_1_3_3_83_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-47665-0_18"},{"key":"e_1_3_3_84_2","doi-asserted-by":"publisher","DOI":"10.1162\/0899766042321814"},{"key":"e_1_3_3_85_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2015.2407694"},{"key":"e_1_3_3_86_2","doi-asserted-by":"publisher","DOI":"10.1145\/3267851.3267878"},{"key":"e_1_3_3_87_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_3_88_2","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417836"},{"key":"e_1_3_3_89_2","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295408"},{"key":"e_1_3_3_90_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.2006.18.7.1527"},{"key":"e_1_3_3_91_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_3_92_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2007-299"},{"key":"e_1_3_3_93_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01219-9_11"},{"key":"e_1_3_3_94_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00371-020-01982-7"},{"key":"e_1_3_3_95_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.
2017.280"},{"key":"e_1_3_3_96_2","first-page":"2466","volume-title":"Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI\u201913)","author":"Hussein Mohamed E.","year":"2013","unstructured":"Mohamed E. Hussein, Marwan Torki, Mohammad A. Gowayyed, and Motaz El-Saban. 2013. Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI\u201913). AAAI Press, 2466\u20132472."},{"key":"e_1_3_3_97_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2018.2856281"},{"key":"e_1_3_3_98_2","unstructured":"Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. Retrieved from https:\/\/keithito.com\/LJ-Speech-Dataset\/"},{"key":"e_1_3_3_99_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46475-6_43"},{"key":"e_1_3_3_100_2","doi-asserted-by":"publisher","DOI":"10.1145\/3383652.3423911"},{"key":"e_1_3_3_101_2","first-page":"2410","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Kalchbrenner Nal","year":"2018","unstructured":"Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient neural audio synthesis. In Proceedings of the International Conference on Machine Learning. PMLR, 2410\u20132419."},{"key":"e_1_3_3_102_2","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073658"},{"key":"e_1_3_3_103_2","first-page":"8067","article-title":"Glow-TTS: A generative flow for text-to-speech via monotonic alignment search","volume":"33","author":"Kim Jaehyeon","year":"2020","unstructured":"Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. 2020. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. Adv. Neural Inf. Process. Syst. 33 (2020), 8067\u20138077.","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"e_1_3_3_104_2","doi-asserted-by":"publisher","DOI":"10.1145\/2783258.2783356"},{"key":"e_1_3_3_105_2","doi-asserted-by":"crossref","unstructured":"S. King and Vasilis Karaiskos. 2011. The Blizzard challenge 2011. https:\/\/api.semanticscholar.org\/CorpusID:150472016","DOI":"10.21437\/Blizzard.2011-1"},{"key":"e_1_3_3_106_2","volume-title":"Advances in Neural Information Processing Systems","author":"Kingma Durk P.","year":"2018","unstructured":"Durk P. Kingma and Prafulla Dhariwal. 2018. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc."},{"key":"e_1_3_3_107_2","article-title":"Glow: Generative flow with invertible 1x1 convolutions","volume":"31","author":"Kingma Durk P.","year":"2018","unstructured":"Durk P. Kingma and Prafulla Dhariwal. 2018. Glow: Generative flow with invertible 1x1 convolutions. Adv. Neural Inf. Process. Syst. 31 (2018).","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_3_108_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-6393(01)00041-3"},{"issue":"8","key":"e_1_3_3_109_2","first-page":"922","article-title":"Online controlled experiments and A\/B testing.","volume":"7","author":"Kohavi Ron","year":"2017","unstructured":"Ron Kohavi and Roger Longbotham. 2017. Online controlled experiments and A\/B testing. Encycl. Mach. Learn. Data Min. 7, 8 (2017), 922\u2013929.","journal-title":"Encycl. Mach. Learn. 
Data Min."},{"key":"e_1_3_3_110_2","doi-asserted-by":"publisher","DOI":"10.5555\/1071195.1071199"},{"key":"e_1_3_3_111_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACRIM.1993.407206"},{"key":"e_1_3_3_112_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242969.3264970"},{"key":"e_1_3_3_113_2","doi-asserted-by":"publisher","DOI":"10.1145\/3308532.3329472"},{"key":"e_1_3_3_114_2","doi-asserted-by":"publisher","DOI":"10.1145\/3382507.3418815"},{"key":"e_1_3_3_115_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397481.3450692"},{"key":"e_1_3_3_116_2","doi-asserted-by":"publisher","DOI":"10.1214\/aoms\/1177729694"},{"key":"e_1_3_3_117_2","article-title":"Melgan: Generative adversarial networks for conditional waveform synthesis","author":"Kumar Kundan","year":"2019","unstructured":"Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Br\u00e9bisson, Yoshua Bengio, and Aaron C. Courville. 2019. Melgan: Generative adversarial networks for conditional waveform synthesis. In Proceedings of the International Conference on Neural Information Processing Systems.","journal-title":"Proceedings of the International Conference on Neural Information Processing Systems"},{"key":"e_1_3_3_118_2","doi-asserted-by":"publisher","unstructured":"Jonathan Lam Bill Kapralos K. Collins Andrew Hogue and Kamen Kanev. 2010. Amplitude panning-based sound system for a horizontal surface computer: A user-based study. In Proceedings of the IEEE International Workshop on Haptic Audio and Visual Environments and Their Applications. 10.1109\/HAVE.2010.5623999","DOI":"10.1109\/HAVE.2010.5623999"},{"key":"e_1_3_3_119_2","first-page":"763","volume-title":"Proceedings of the International Conference on Computer Vision (ICCV\u201919)","author":"Lee Gilwoo","year":"2019","unstructured":"Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh. 2019. 
Talking with hands 16.2 M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the International Conference on Computer Vision (ICCV\u201919). IEEE, 763\u2013772."},{"key":"e_1_3_3_120_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00067"},{"key":"e_1_3_3_121_2","volume-title":"Proceedings of the International Conference on Neural Information Processing Systems (NIPS\u201917)","author":"Lee Y.","year":"2017","unstructured":"Y. Lee, A. Rabiee, and S.-Y. Lee. 2017. Emotional end-to-end neural speech synthesizer. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS\u201917)."},{"key":"e_1_3_3_122_2","doi-asserted-by":"publisher","DOI":"10.1109\/SLT48900.2021.9383524"},{"key":"e_1_3_3_123_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCSLP49672.2021.9362069"},{"key":"e_1_3_3_124_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-364"},{"key":"e_1_3_3_125_2","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2018.8593433"},{"key":"e_1_3_3_126_2","doi-asserted-by":"publisher","DOI":"10.1109\/TRO.2016.2588880"},{"key":"e_1_3_3_127_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3076369"},{"key":"e_1_3_3_128_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472307.3484167"},{"key":"e_1_3_3_129_2","doi-asserted-by":"publisher","DOI":"10.1145\/2556288.2557274"},{"key":"e_1_3_3_130_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2010.5543262"},{"key":"e_1_3_3_131_2","doi-asserted-by":"publisher","DOI":"10.1145\/2998571"},{"key":"e_1_3_3_132_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICAR.2015.7251476"},{"key":"e_1_3_3_133_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2012.2201476"},{"key":"e_1_3_3_134_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDEW.2006.145"},{"key":"e_1_3_3_135_2","volume-title":"Hand and Mind: What Gestures Reveal about Thought","author":"McNeill 
David","year":"1992","unstructured":"David McNeill. 1992. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press."},{"key":"e_1_3_3_136_2","volume-title":"Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality","author":"Metallinou Angeliki","year":"2010","unstructured":"Angeliki Metallinou, Chi-Chun Lee, Carlos Busso, Sharon Carnicke, and Shrikanth Narayanan. 2010. The USC creativeit database: A multimodal database of theatrical improvisation. In Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality."},{"key":"e_1_3_3_137_2","first-page":"333","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing","author":"Moore Robert C.","year":"2004","unstructured":"Robert C. Moore. 2004. On log-likelihood-ratios and the significance of rare events. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 333\u2013340."},{"key":"e_1_3_3_138_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74048-3_4"},{"key":"e_1_3_3_139_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-950"},{"key":"e_1_3_3_140_2","doi-asserted-by":"publisher","DOI":"10.1109\/QOMEX.2009.5246972"},{"key":"e_1_3_3_141_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2011.2131660"},{"key":"e_1_3_3_142_2","doi-asserted-by":"publisher","DOI":"10.1145\/1330511.1330516"},{"key":"e_1_3_3_143_2","volume-title":"Proceedings of the Southeastern ACM Conference","author":"New Joshua R.","year":"2003","unstructured":"Joshua R. New, Erion Hasanbelliu, and Mario Aguilar. 2003. Facilitating user interaction with complex systems via hand gesture recognition. 
In Proceedings of the Southeastern ACM Conference."},{"key":"e_1_3_3_144_2","doi-asserted-by":"publisher","DOI":"10.5555\/2343576.2343588"},{"key":"e_1_3_3_145_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.3002500"},{"key":"e_1_3_3_146_2","volume-title":"MPEG-4 Facial Animation: The Standard, Implementation and Applications","author":"Pakstas Algirdas","year":"2002","unstructured":"Algirdas Pakstas, Robert Forchheimer, and Igor S. Pandzic. 2002. MPEG-4 Facial Animation: The Standard, Implementation and Applications. John Wiley & Sons, New York."},{"key":"e_1_3_3_147_2","doi-asserted-by":"publisher","DOI":"10.1098\/rstb.2009.0135"},{"key":"e_1_3_3_148_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-85729-997-0_26"},{"key":"e_1_3_3_149_2","doi-asserted-by":"publisher","DOI":"10.1109\/34.895976"},{"key":"e_1_3_3_150_2","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_3_151_2","doi-asserted-by":"publisher","DOI":"10.1145\/1201775.882269"},{"key":"e_1_3_3_152_2","volume-title":"Proceedings of the 6th International Conference on Learning Representations (ICLR\u201918)","author":"Ping W.","year":"2018","unstructured":"W. Ping, K. Peng, A. Gibiansky, S. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller. 2018. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR\u201918). International Conference on Learning Representations."},{"key":"e_1_3_3_153_2","doi-asserted-by":"publisher","DOI":"10.1089\/big.2016.0028"},{"key":"e_1_3_3_154_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.robot.2018.07.006"},{"key":"e_1_3_3_155_2","volume-title":"Blizzard Challenge Workshop","author":"Prahallad Kishore","year":"2013","unstructured":"Kishore Prahallad, Anandaswarup Vadapalli, Naresh Elluru, Gautam Mantena, Bhargav Pulugundla, Peri Bhaskararao, Hema A. Murthy, Simon King, Vasilis Karaiskos, and Alan W. Black. 2013. 
The blizzard challenge 2013\u2013Indian language task. In Blizzard Challenge Workshop, Vol. 2013."},{"key":"e_1_3_3_156_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683143"},{"key":"e_1_3_3_157_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2011.5946971"},{"key":"e_1_3_3_158_2","volume-title":"Audiovisual Database of Spoken American English","author":"Richie Carolyn","year":"2009","unstructured":"Carolyn Richie, Sarah Warburton, and Megan Carter. 2009. Audiovisual Database of Spoken American English. Linguistic Data Consortium, Philadelphia."},{"key":"e_1_3_3_159_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2011-316"},{"key":"e_1_3_3_160_2","doi-asserted-by":"publisher","DOI":"10.2307\/2685263"},{"key":"e_1_3_3_161_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2016.10.006"},{"key":"e_1_3_3_162_2","volume-title":"What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACs)","author":"Rosenberg E. L.","year":"1997","unstructured":"E. L. Rosenberg and P. Ekman. 1997. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACs). Oxford University Press, New York."},{"key":"e_1_3_3_163_2","article-title":"EM algorithms for PCA and SPCA","author":"Roweis Sam","year":"1997","unstructured":"Sam Roweis. 1997. EM algorithms for PCA and SPCA. 
In Proceedings of the Conference on Neural Information Processing Systems.","journal-title":"Proceedings of the Conference on Neural Information Processing Systems"},{"key":"e_1_3_3_164_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-67401-8_49"},{"key":"e_1_3_3_165_2","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2018.00066"},{"key":"e_1_3_3_166_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461967"},{"key":"e_1_3_3_167_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2019.04.005"},{"key":"e_1_3_3_168_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2019.2916031"},{"key":"e_1_3_3_169_2","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2015.7284885"},{"key":"e_1_3_3_170_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.308"},{"key":"e_1_3_3_171_2","doi-asserted-by":"publisher","DOI":"10.1017\/9781316676202.023"},{"key":"e_1_3_3_172_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-79948-3_183"},{"key":"e_1_3_3_173_2","doi-asserted-by":"publisher","DOI":"10.1109\/78.650093"},{"key":"e_1_3_3_174_2","article-title":"A survey of available corpora for building data-driven dialogue systems","volume":"1512","author":"Serban Iulian","year":"2018","unstructured":"Iulian Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A survey of available corpora for building data-driven dialogue systems. arXiv: 1512.05742. 
Retrieved from https:\/\/arxiv.org\/abs\/1512.05742","journal-title":"arXiv"},{"key":"e_1_3_3_175_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"e_1_3_3_176_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"e_1_3_3_177_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00790"},{"key":"e_1_3_3_178_2","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN48605.2020.9206665"},{"key":"e_1_3_3_179_2","first-page":"194","volume-title":"Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations","author":"Smolensky P.","year":"1986","unstructured":"P. Smolensky. 1986. Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations, D. E. Rumelhart and J. L. McClelland (Eds.). MIT Press, Cambridge, MA, 194\u2013281."},{"key":"e_1_3_3_180_2","volume-title":"Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS\u201915)","author":"Sohn K.","year":"2015","unstructured":"K. Sohn, X. Yan, and H. Lee. 2015. Learning structured output representation using deep conditional generative models. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS\u201915). Curran Associates, Inc., Montreal, Canada."},{"key":"e_1_3_3_181_2","volume-title":"ICLR Workshop Track","author":"Sotelo J.","year":"2017","unstructured":"J. Sotelo, S. Mehri, K. Kumar, J. Santos, K. Kastner, A. Courville, and Y. Bengio. 2017. Char2Wav: End-to-end speech synthesis. In ICLR Workshop Track."},{"key":"e_1_3_3_182_2","volume-title":"Proceedings of the Deep Learning Workshop at the 32nd International Conference on Machine Learning (ICML\u201915)","author":"Srivastava R-K.","year":"2015","unstructured":"R-K. Srivastava, K. Greff, and J. Schmidhuber. 2015. Highway networks. 
In Proceedings of the Deep Learning Workshop at the 32nd International Conference on Machine Learning (ICML\u201915)."},{"key":"e_1_3_3_183_2","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969173"},{"key":"e_1_3_3_184_2","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073640"},{"key":"e_1_3_3_185_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2017.2761740"},{"key":"e_1_3_3_186_2","volume-title":"Proceedings of the 6th International Conference on Learning Representations (ICLR\u201918)","author":"Taigman Y.","year":"2018","unstructured":"Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani. 2018. VoiceLoop: Voice fitting and synthesis via a phonological loop. In Proceedings of the 6th International Conference on Learning Representations (ICLR\u201918)."},{"key":"e_1_3_3_187_2","doi-asserted-by":"publisher","DOI":"10.1145\/3125739.3132594"},{"key":"e_1_3_3_188_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-58750-9_28"},{"key":"e_1_3_3_189_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-483"},{"key":"e_1_3_3_190_2","first-page":"275","volume-title":"Proceedings of the ACM SIGGRAPH\/Eurographics Symposium on Computer Animation (SCA\u201912)","author":"Taylor Sarah L.","year":"2012","unstructured":"Sarah L. Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. 2012. Dynamic units of visual speech. In Proceedings of the ACM SIGGRAPH\/Eurographics Symposium on Computer Animation (SCA\u201912). 
Eurographics Association, 275\u2013284."},{"key":"e_1_3_3_191_2","doi-asserted-by":"publisher","DOI":"10.1145\/3292652"},{"key":"e_1_3_3_192_2","doi-asserted-by":"publisher","DOI":"10.1145\/2929464.2929475"},{"key":"e_1_3_3_193_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472669"},{"key":"e_1_3_3_194_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00165"},{"key":"e_1_3_3_195_2","doi-asserted-by":"publisher","DOI":"10.1109\/UR49135.2020.9144985"},{"key":"e_1_3_3_196_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053732"},{"key":"e_1_3_3_197_2","doi-asserted-by":"publisher","DOI":"10.1007\/s12193-010-0053-1"},{"key":"e_1_3_3_198_2","unstructured":"A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. 2016. WaveNet: A generative model for raw audio (unpublished)."},{"key":"e_1_3_3_199_2","unstructured":"A\u00e4ron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv. https:\/\/arxiv.org\/abs\/1609.03499"},{"key":"e_1_3_3_200_2","volume-title":"Proceedings of the 33rd International Conference on Machine Learning Research (PMLR\u201916)","author":"Oord A. van den","year":"2016","unstructured":"A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. 2016. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning Research (PMLR\u201916)."},{"key":"e_1_3_3_201_2","doi-asserted-by":"publisher","DOI":"10.5555\/3157382.3157633"},{"key":"e_1_3_3_202_2","first-page":"3918","volume-title":"International Conference on Machine Learning","author":"Oord Aaron van den","year":"2018","unstructured":"Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et\u00a0al. 2018. 
Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning. PMLR, 3918\u20133926."},{"key":"e_1_3_3_203_2","doi-asserted-by":"publisher","DOI":"10.1145\/3267851.3267918"},{"key":"e_1_3_3_204_2","unstructured":"Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et\u00a0al. 2017. Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit."},{"key":"e_1_3_3_205_2","volume-title":"Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS\u201915)","author":"Vinyals O.","year":"2015","unstructured":"O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. 2015. Grammar as a foreign language. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS\u201915)."},{"key":"e_1_3_3_206_2","volume-title":"Advances in Neural Information Processing Systems","author":"Vondrick Carl","year":"2016","unstructured":"Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Curran Associates, Inc."},{"key":"e_1_3_3_207_2","first-page":"1","article-title":"Realistic speech-driven facial animation with gans","author":"Vougioukas Konstantinos","year":"2019","unstructured":"Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. 2019. Realistic speech-driven facial animation with gans. Int. J. Comput. Vis. (2019), 1\u201316.","journal-title":"Int. J. Comput. 
Vis."},{"key":"e_1_3_3_208_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijhcs.2019.03.011"},{"key":"e_1_3_3_209_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2007.1167"},{"key":"e_1_3_3_210_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2006.885727"},{"key":"e_1_3_3_211_2","doi-asserted-by":"publisher","DOI":"10.1145\/3462244.3479914"},{"key":"e_1_3_3_212_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-1452"},{"key":"e_1_3_3_213_2","series-title":"Proceedings of Machine Learning Research","first-page":"5180","volume-title":"Proceedings of the 35th International Conference on Machine Learning","volume":"80","author":"Wang Yuxuan","year":"2018","unstructured":"Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A. Saurous. 2018. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 5180\u20135189. http:\/\/proceedings.mlr.press\/v80\/wang18h.html"},{"key":"e_1_3_3_214_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2003.819861"},{"key":"e_1_3_3_215_2","doi-asserted-by":"publisher","DOI":"10.1016\/0893-6080(90)90088-3"},{"key":"e_1_3_3_216_2","first-page":"17","article-title":"Gradient-based learning algorithms for recurrent networks and their computational complexity","volume":"433","author":"Williams Ronald J.","year":"1995","unstructured":"Ronald J. Williams and David Zipser. 1995. Gradient-based learning algorithms for recurrent networks and their computational complexity. 
Backpropagation 433 (1995), 17.","journal-title":"Backpropagation"},{"key":"e_1_3_3_217_2","doi-asserted-by":"publisher","DOI":"10.1109\/34.790429"},{"key":"e_1_3_3_218_2","doi-asserted-by":"publisher","DOI":"10.1145\/3462244.3479889"},{"key":"e_1_3_3_219_2","volume-title":"Proceedings of the ICDL-EPIROB Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions","author":"Wolfert Pieter","year":"2019","unstructured":"Pieter Wolfert, Taras Kucherenko, Hedvig Kjellstr\u00f6m, and Tony Belpaeme. 2019. Should beat gestures be learned or designed?: A benchmarking user study. In Proceedings of the ICDL-EPIROB Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions. IEEE."},{"key":"e_1_3_3_220_2","doi-asserted-by":"publisher","DOI":"10.1109\/THMS.2022.3149173"},{"key":"e_1_3_3_221_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3052688"},{"key":"e_1_3_3_222_2","unstructured":"Junichi Yamagishi. 2012. English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. Retrieved from http:\/\/homepages.inf.ed.ac.uk\/jyamagis\/page3\/page58\/page58.html"},{"key":"e_1_3_3_223_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053795"},{"key":"e_1_3_3_224_2","doi-asserted-by":"publisher","DOI":"10.1109\/SLT48900.2021.9383551"},{"key":"e_1_3_3_225_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.261"},{"key":"e_1_3_3_226_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.261"},{"key":"e_1_3_3_227_2","article-title":"Human motion modeling with deep learning: A survey","author":"Ye Zijie","year":"2021","unstructured":"Zijie Ye, Haozhe Wu, and Jia Jia. 2021. Human motion modeling with deep learning: A survey. 
AI Open.","journal-title":"AI Open"},{"key":"e_1_3_3_228_2","doi-asserted-by":"publisher","DOI":"10.1145\/3414685.3417838"},{"key":"e_1_3_3_229_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2019.8793720"},{"key":"e_1_3_3_230_2","doi-asserted-by":"publisher","DOI":"10.1145\/3536221.3558058"},{"key":"e_1_3_3_231_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2009.11.011"},{"key":"e_1_3_3_232_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2016-522"},{"key":"e_1_3_3_233_2","first-page":"73","volume-title":"Proceedings of the 29th Pacific Asia conference on language, information and computation","author":"Zhang Shu","year":"2015","unstructured":"Shu Zhang, Dequan Zheng, Xinchen Hu, and Ming Yang. 2015. Bidirectional long short-term memory networks for relation classification. In Proceedings of the 29th Pacific Asia conference on language, information and computation. 73\u201378."},{"key":"e_1_3_3_234_2","article-title":"Revisiting few-sample BERT fine-tuning","author":"Zhang Tianyi","year":"2020","unstructured":"Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi. 2020. Revisiting few-sample BERT fine-tuning. arXiv:2006.05987. Retrieved from https:\/\/arxiv.org\/abs\/2006.05987","journal-title":"arXiv:2006.05987"},{"key":"e_1_3_3_235_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2011.07.002"},{"key":"e_1_3_3_236_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_24"},{"key":"e_1_3_3_237_2","first-page":"9299","article-title":"Talking face generation by adversarially disentangled audio-visual representation","author":"Zhou H.","year":"2019","unstructured":"H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang. 2019. Talking face generation by adversarially disentangled audio-visual representation. 
In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019), 31st Innovative Applications of Artificial Intelligence Conference (IAAI 2019), and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI 2019), 9299\u20139306.","journal-title":"Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019), 31st Innovative Applications of Artificial Intelligence Conference (IAAI 2019), and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI 2019)"},{"key":"e_1_3_3_238_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU46091.2019.9003829"}],"container-title":["ACM Transactions on Human-Robot Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3609235","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3609235","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:48:58Z","timestamp":1750182538000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3609235"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,30]]},"references-count":237,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3609235"],"URL":"https:\/\/doi.org\/10.1145\/3609235","relation":{},"ISSN":["2573-9522"],"issn-type":[{"type":"electronic","value":"2573-9522"}],"subject":[],"published":{"date-parts":[[2024,1,30]]},"assertion":[{"value":"2021-12-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-05-11","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2024-01-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}