{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,28]],"date-time":"2026-01-28T12:27:51Z","timestamp":1769603271628,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":42,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Plan","award":["2021QY1500"],"award-info":[{"award-number":["2021QY1500"]}]},{"name":"Shenzhen Key Laboratory of next generation interactive media innovative technology","award":["ZDSYS20210623092001004"],"award-info":[{"award-number":["ZDSYS20210623092001004"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62076144"],"award-info":[{"award-number":["62076144"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Natural Science Foundation of China-Research Grant Council of Hong Kong","award":["61531166002, N_CUHK40415"],"award-info":[{"award-number":["61531166002, N_CUHK40415"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547831","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:35Z","timestamp":1665416555000},"page":"5811-5820","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":15,"title":["Inferring Speaking Styles from Multi-modal Conversational Context by Multi-scale Relational Graph Convolutional Networks"],"prefix":"10.1145","author":[{"given":"Jingbei","family":"Li","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, 
China"}]},{"given":"Yi","family":"Meng","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Xixin","family":"Wu","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, China"}]},{"given":"Zhiyong","family":"Wu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Jia","family":"Jia","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Helen","family":"Meng","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Hong Kong, China"}]},{"given":"Qiao","family":"Tian","sequence":"additional","affiliation":[{"name":"ByteDance, Shanghai, China"}]},{"given":"Yuping","family":"Wang","sequence":"additional","affiliation":[{"name":"ByteDance, Shanghai, China"}]},{"given":"Yuxuan","family":"Wang","sequence":"additional","affiliation":[{"name":"ByteDance, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio.","author":"Cho Kyunghyun","year":"2014","unstructured":"Kyunghyun Cho , Bart van Merrienboer , \u00c7aglar G\u00fcl\u00e7ehre , Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014 . Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP. Kyunghyun Cho, Bart van Merrienboer, \u00c7aglar G\u00fcl\u00e7ehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP."},{"key":"e_1_3_2_2_2_1","volume-title":"Controllable Context-aware Conversational Speech Synthesis. arXiv preprint arXiv:2106.10828","author":"Cong Jian","year":"2021","unstructured":"Jian Cong , Shan Yang , Na Hu , Guangzhi Li , Lei Xie , and Dan Su. 2021. 
Controllable Context-aware Conversational Speech Synthesis. arXiv preprint arXiv:2106.10828 ( 2021 ). Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, and Dan Su. 2021. Controllable Context-aware Conversational Speech Synthesis. arXiv preprint arXiv:2106.10828 (2021)."},{"key":"e_1_3_2_2_3_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_4_1","volume-title":"International conference on machine learning. PMLR, 1180--1189","author":"Ganin Yaroslav","year":"2015","unstructured":"Yaroslav Ganin and Victor Lempitsky . 2015 . Unsupervised domain adaptation by backpropagation . In International conference on machine learning. PMLR, 1180--1189 . Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International conference on machine learning. PMLR, 1180--1189."},{"key":"e_1_3_2_2_5_1","volume-title":"Mario Marchand, and Victor Lempitsky.","author":"Ganin Yaroslav","year":"2016","unstructured":"Yaroslav Ganin , Evgeniya Ustinova , Hana Ajakan , Pascal Germain , Hugo Larochelle , Fran\u00e7ois Laviolette , Mario Marchand, and Victor Lempitsky. 2016 . Domain-adversarial training of neural networks. The journal of machine learning research, Vol. 17 , 1 (2016), 2096--2030. Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Fran\u00e7ois Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. 
The journal of machine learning research, Vol. 17, 1 (2016), 2096--2030."},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016383"},{"key":"e_1_3_2_2_7_1","volume-title":"EMNLP-IJCNLP 2019--2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference.","author":"Ghosal Deepanway","unstructured":"Deepanway Ghosal , Navonil Majumder , Soujanya Poria , Niyati Chhaya , and Alexander Gelbukh . 2019. DialogueGCN: A graph convolutional neural network for emotion recognition in conversation . In EMNLP-IJCNLP 2019--2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference. Deepanway Ghosal, Navonil Majumder, Soujanya Poria, Niyati Chhaya, and Alexander Gelbukh. 2019. DialogueGCN: A graph convolutional neural network for emotion recognition in conversation. In EMNLP-IJCNLP 2019--2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/SLT48900.2021.9383460"},{"key":"e_1_3_2_2_9_1","volume-title":"an introduction to voice assistants. Medical reference services quarterly","author":"Hoy Matthew B","year":"2018","unstructured":"Matthew B Hoy . 2018. Alexa, Siri, Cortana, and more : an introduction to voice assistants. Medical reference services quarterly , Vol. 37 , 1 ( 2018 ), 81--88. Matthew B Hoy. 2018. Alexa, Siri, Cortana, and more: an introduction to voice assistants. Medical reference services quarterly, Vol. 37, 1 (2018), 81--88."},{"key":"e_1_3_2_2_10_1","volume-title":"MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations. 
In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 7037--7041","author":"Hu Dou","year":"2022","unstructured":"Dou Hu , Xiaolong Hou , Lingwei Wei , Lianxin Jiang , and Yang Mo . 2022 . MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 7037--7041 . https:\/\/doi.org\/10.1109\/ICASSP43922.2022.9747397 ISSN: 2379--190X. Dou Hu, Xiaolong Hou, Lingwei Wei, Lianxin Jiang, and Yang Mo. 2022. MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 7037--7041. https:\/\/doi.org\/10.1109\/ICASSP43922.2022.9747397 ISSN: 2379--190X."},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.597"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401399"},{"key":"e_1_3_2_2_13_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning. PMLR, 5530--5540","author":"Kim Jaehyeon","year":"2021","unstructured":"Jaehyeon Kim , Jungil Kong , and Juhee Son . 2021 . Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech . In Proceedings of the 38th International Conference on Machine Learning. PMLR, 5530--5540 . https:\/\/proceedings.mlr.press\/v139\/kim21f.html ISSN: 2640--3498. Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. In Proceedings of the 38th International Conference on Machine Learning. PMLR, 5530--5540. 
https:\/\/proceedings.mlr.press\/v139\/kim21f.html ISSN: 2640--3498."},{"key":"e_1_3_2_2_14_1","volume-title":"International Conference on Learning Representations (ICLR)","author":"Kipf Thomas N","year":"2017","unstructured":"Thomas N Kipf and Max Welling . 2017 . Semi-supervised classification with graph convolutional networks . International Conference on Learning Representations (ICLR) (2017). Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR) (2017)."},{"key":"e_1_3_2_2_15_1","first-page":"17022","article-title":"Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis","volume":"33","author":"Kong Jungil","year":"2020","unstructured":"Jungil Kong , Jaehyeon Kim , and Jaekyoung Bae . 2020 . Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis . Advances in Neural Information Processing Systems , Vol. 33 (2020), 17022 -- 17033 . Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, Vol. 33 (2020), 17022--17033.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_16_1","volume-title":"An Audio-enriched BERT-based Framework for Spoken Multiple-choice Question Answering. arXiv:2005.12142 [cs] (May","author":"Kuo Chia-Chih","year":"2020","unstructured":"Chia-Chih Kuo , Shang-Bao Luo , and Kuan-Yu Chen . 2020. An Audio-enriched BERT-based Framework for Spoken Multiple-choice Question Answering. arXiv:2005.12142 [cs] (May 2020 ). http:\/\/arxiv.org\/abs\/2005.12142 arXiv: 2005.12142. Chia-Chih Kuo, Shang-Bao Luo, and Kuan-Yu Chen. 2020. An Audio-enriched BERT-based Framework for Spoken Multiple-choice Question Answering. arXiv:2005.12142 [cs] (May 2020). 
http:\/\/arxiv.org\/abs\/2005.12142 arXiv: 2005.12142."},{"key":"e_1_3_2_2_17_1","volume-title":"Unsupervised Adversarial Domain Adaptation for Cross-Lingual Speech Emotion Recognition. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII). 732--737","author":"Latif Siddique","year":"2019","unstructured":"Siddique Latif , Junaid Qadir , and Muhammad Bilal . 2019 . Unsupervised Adversarial Domain Adaptation for Cross-Lingual Speech Emotion Recognition. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII). 732--737 . https:\/\/doi.org\/10.1109\/ACII.2019.8925513 ISSN: 2156--8111. Siddique Latif, Junaid Qadir, and Muhammad Bilal. 2019. Unsupervised Adversarial Domain Adaptation for Cross-Lingual Speech Emotion Recognition. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII). 732--737. https:\/\/doi.org\/10.1109\/ACII.2019.8925513 ISSN: 2156--8111."},{"key":"e_1_3_2_2_18_1","unstructured":"Jingbei Li Yi Meng Chenyi Li Zhiyong Wu Helen Meng Chao Weng and Dan Su. 2022a. Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling. https:\/\/doi.org\/10.48550\/ARXIV.2106.06233  Jingbei Li Yi Meng Chenyi Li Zhiyong Wu Helen Meng Chao Weng and Dan Su. 2022a. Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling. https:\/\/doi.org\/10.48550\/ARXIV.2106.06233"},{"key":"e_1_3_2_2_19_1","unstructured":"Jingbei Li Yi Meng Zhiyong Wu Helen Meng Qiao Tian Yuping Wang and Yuxuan Wang. 2022b. NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism. https:\/\/doi.org\/10.48550\/ARXIV.2203.16838  Jingbei Li Yi Meng Zhiyong Wu Helen Meng Qiao Tian Yuping Wang and Yuxuan Wang. 2022b. NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism. 
https:\/\/doi.org\/10.48550\/ARXIV.2203.16838"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240575"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240575"},{"key":"e_1_3_2_2_22_1","first-page":"2347","article-title":"Conversational Emotion Recognition Using Self-Attention Mechanisms and Graph Neural Networks","volume":"2020","author":"Lian Zheng","year":"2020","unstructured":"Zheng Lian , Jianhua Tao , Bin Liu , Jian Huang , Zhanlei Yang , and Rongjun Li . 2020 . Conversational Emotion Recognition Using Self-Attention Mechanisms and Graph Neural Networks . Proc. Interspeech 2020 (2020), 2347 -- 2351 . Zheng Lian, Jianhua Tao, Bin Liu, Jian Huang, Zhanlei Yang, and Rongjun Li. 2020. Conversational Emotion Recognition Using Self-Attention Mechanisms and Graph Neural Networks. Proc. Interspeech 2020 (2020), 2347--2351.","journal-title":"Proc. Interspeech"},{"key":"e_1_3_2_2_23_1","volume-title":"Prototypical Q Networks for Automatic Conversational Diagnosis and Few-Shot New Disease Adaption. arXiv:2005.11153 [cs] (May","author":"Luo Hongyin","year":"2020","unstructured":"Hongyin Luo , Shang-Wen Li , and James Glass . 2020. Prototypical Q Networks for Automatic Conversational Diagnosis and Few-Shot New Disease Adaption. arXiv:2005.11153 [cs] (May 2020 ). http:\/\/arxiv.org\/abs\/2005.11153 arXiv: 2005.11153. Hongyin Luo, Shang-Wen Li, and James Glass. 2020. Prototypical Q Networks for Automatic Conversational Diagnosis and Few-Shot New Disease Adaption. arXiv:2005.11153 [cs] (May 2020). 
http:\/\/arxiv.org\/abs\/2005.11153 arXiv: 2005.11153."},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33016818"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054484"},{"key":"e_1_3_2_2_26_1","volume-title":"International Conference on Learning Representations.","author":"Ren Yi","year":"2020","unstructured":"Yi Ren , Chenxu Hu , Xu Tan , Tao Qin , Sheng Zhao , Zhou Zhao , and Tie-Yan Liu . 2020 . FastSpeech 2: Fast and High-Quality End-to-End Text to Speech . In International Conference on Learning Representations. Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_27_1","volume-title":"Proceedings of the 33rd International Conference on Neural Information Processing Systems. Number 285","author":"Ren Yi","year":"2019","unstructured":"Yi Ren , Yangjun Ruan , Xu Tan , Tao Qin , Sheng Zhao , Zhou Zhao , and Tie-Yan Liu . 2019 . FastSpeech: fast, robust and controllable text to speech . In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Number 285 . Curran Associates Inc., Red Hook, NY, USA, 3171--3180. Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: fast, robust and controllable text to speech. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Number 285. Curran Associates Inc., Red Hook, NY, USA, 3171--3180."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"crossref","unstructured":"Yu-Ping Ruan Shu-Kai Zheng Taihao Li Fen Wang and Guanxiong Pei. 2022. Hierarchical and Multi-View Dependency Modelling Network for Conversational Emotion Recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). 7032--7036. 
https:\/\/doi.org\/10.1109\/ICASSP43922.2022.9747123 ISSN: 2379--190X.  Yu-Ping Ruan Shu-Kai Zheng Taihao Li Fen Wang and Guanxiong Pei. 2022. Hierarchical and Multi-View Dependency Modelling Network for Conversational Emotion Recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). 7032--7036. https:\/\/doi.org\/10.1109\/ICASSP43922.2022.9747123 ISSN: 2379--190X.","DOI":"10.1109\/ICASSP43922.2022.9747123"},{"key":"e_1_3_2_2_29_1","volume-title":"European semantic web conference","author":"Schlichtkrull Michael","unstructured":"Michael Schlichtkrull , Thomas N Kipf , Peter Bloem , Rianne van den Berg , Ivan Titov , and Max Welling . 2018. Modeling relational data with graph convolutional networks . In European semantic web conference . Springer , 593--607. Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European semantic web conference. Springer, 593--607."},{"key":"e_1_3_2_2_30_1","volume-title":"Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. arXiv:1712.05884 [cs] (Dec","author":"Shen Jonathan","year":"2017","unstructured":"Jonathan Shen , Ruoming Pang , Ron J. Weiss , Mike Schuster , Navdeep Jaitly , Zongheng Yang , Zhifeng Chen , Yu Zhang , Yuxuan Wang , R. J. Skerry-Ryan , Rif A. Saurous , Yannis Agiomyrgiannakis , and Yonghui Wu. 2017. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. arXiv:1712.05884 [cs] (Dec . 2017 ). http:\/\/arxiv.org\/abs\/1712.05884 arXiv: 1712.05884. Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2017. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. arXiv:1712.05884 [cs] (Dec. 2017). 
http:\/\/arxiv.org\/abs\/1712.05884 arXiv: 1712.05884."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i15.17625"},{"key":"e_1_3_2_2_32_1","first-page":"4193","article-title":"Dimensional Emotion Prediction based on Interactive Context in Conversation","volume":"2020","author":"Shi Xiaohan","year":"2020","unstructured":"Xiaohan Shi , Sixia Li , and Jianwu Dang . 2020 . Dimensional Emotion Prediction based on Interactive Context in Conversation . Proc. Interspeech 2020 (2020), 4193 -- 4197 . Xiaohan Shi, Sixia Li, and Jianwu Dang. 2020. Dimensional Emotion Prediction based on Interactive Context in Conversation. Proc. Interspeech 2020 (2020), 4193--4197.","journal-title":"Proc. Interspeech"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1631\/FITEE.1700826"},{"key":"e_1_3_2_2_34_1","volume-title":"international conference on machine learning. PMLR, 4693--4702","author":"Skerry-Ryan RJ","year":"2018","unstructured":"RJ Skerry-Ryan , Eric Battenberg , Ying Xiao , Yuxuan Wang , Daisy Stanton , Joel Shor , Ron Weiss , Rob Clark , and Rif A Saurous . 2018 . Towards end-to-end prosody transfer for expressive speech synthesis with tacotron . In international conference on machine learning. PMLR, 4693--4702 . RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, and Rif A Saurous. 2018. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. In international conference on machine learning. PMLR, 4693--4702."},{"key":"e_1_3_2_2_35_1","volume-title":"Attention is all you need. Advances in neural information processing systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , Lukasz Kaiser , and Illia Polosukhin . 2017. Attention is all you need. Advances in neural information processing systems , Vol. 30 ( 2017 ). 
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, Vol. 30 (2017)."},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/s12559-015-9326-z"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-1452"},{"key":"e_1_3_2_2_38_1","volume-title":"International Conference on Machine Learning. PMLR, 5180--5189","author":"Wang Yuxuan","year":"2018","unstructured":"Yuxuan Wang , Daisy Stanton , Yu Zhang , RJ- Skerry Ryan , Eric Battenberg , Joel Shor , Ying Xiao , Ye Jia , Fei Ren , and Rif A Saurous . 2018 . Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis . In International Conference on Machine Learning. PMLR, 5180--5189 . Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ-Skerry Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A Saurous. 2018. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning. PMLR, 5180--5189."},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"crossref","unstructured":"Thomas Wolf Lysandre Debut Victor Sanh Julien Chaumond Clement Delangue Anthony Moi Pierric Cistac Tim Rault R\u00e9mi Louf Morgan Funtowicz etal 2019. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).  Thomas Wolf Lysandre Debut Victor Sanh Julien Chaumond Clement Delangue Anthony Moi Pierric Cistac Tim Rault R\u00e9mi Louf Morgan Funtowicz et al. 2019. Huggingface's transformers: State-of-the-art natural language processing. 
arXiv preprint arXiv:1910.03771 (2019).","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9746909"},{"key":"e_1_3_2_2_41_1","volume-title":"Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning. arXiv:1907.04448 [cs, eess] (July","author":"Zhang Yu","year":"2019","unstructured":"Yu Zhang , Ron J. Weiss , Heiga Zen , Yonghui Wu , Zhifeng Chen , R. J. Skerry-Ryan , Ye Jia , Andrew Rosenberg , and Bhuvana Ramabhadran . 2019. Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning. arXiv:1907.04448 [cs, eess] (July 2019 ). http:\/\/arxiv.org\/abs\/1907.04448. Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, R. J. Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran. 2019. Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning. arXiv:1907.04448 [cs, eess] (July 2019). 
http:\/\/arxiv.org\/abs\/1907.04448."},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1162\/coli_a_00368"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547831","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547831","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:35Z","timestamp":1750186955000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547831"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":42,"alternative-id":["10.1145\/3503161.3547831","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547831","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}