{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:08:44Z","timestamp":1750219724010,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":48,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,10,29]],"date-time":"2023-10-29T00:00:00Z","timestamp":1698537600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,11,2]]},"DOI":"10.1145\/3606039.3613113","type":"proceedings-article","created":{"date-parts":[[2023,10,20]],"date-time":"2023-10-20T10:08:16Z","timestamp":1697796496000},"page":"11-17","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Multimodal Sentiment Analysis via Efficient Multimodal Transformer and Modality-Aware Adaptive Training Strategy"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-0161-4838","authenticated-orcid":false,"given":"Chaoyue","family":"Ding","sequence":"first","affiliation":[{"name":"SenseTime Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-8109-2943","authenticated-orcid":false,"given":"Daoming","family":"Zong","sequence":"additional","affiliation":[{"name":"SenseTime Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-4490-2157","authenticated-orcid":false,"given":"Baoxiang","family":"Li","sequence":"additional","affiliation":[{"name":"SenseTime Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-5856-9969","authenticated-orcid":false,"given":"Song","family":"Zhang","sequence":"additional","affiliation":[{"name":"SenseTime Research, Beijing, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-3562-4507","authenticated-orcid":false,"given":"Xiaoxu","family":"Zhu","sequence":"additional","affiliation":[{"name":"SenseTime Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-1530-0783","authenticated-orcid":false,"given":"Guiping","family":"Zhong","sequence":"additional","affiliation":[{"name":"SenseTime Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-8519-4630","authenticated-orcid":false,"given":"Dinghao","family":"Zhou","sequence":"additional","affiliation":[{"name":"SenseTime Research, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2023,10,29]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Schuller","author":"Amiriparian Shahin","year":"2023","unstructured":"Shahin Amiriparian , Lukas Christ , Andreas K\u00f6nig , Eva-Maria Messner , Alan Cowen , Erik Cambria , and Bj\u00f6rn W . Schuller . 2023 . MuSe 2023 Challenge : Multimodal Prediction of Mimicked Emotions, Cross-Cultural Humour, and Personalised Recognition of Affects. In ACM Multimedia . Shahin Amiriparian, Lukas Christ, Andreas K\u00f6nig, Eva-Maria Messner, Alan Cowen, Erik Cambria, and Bj\u00f6rn W. Schuller. 2023. MuSe 2023 Challenge: Multimodal Prediction of Mimicked Emotions, Cross-Cultural Humour, and Personalised Recognition of Affects. In ACM Multimedia."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"crossref","unstructured":"Shahin Amiriparian Maurice Gerczuk Sandra Ottl Nicholas Cummins Michael Freitag Sergey Pugachevskiy and Bj\u00f6rn Schuller. 2017. Snore Sound Classification Using Image-based Deep Spectrum Features. In INTERSPEECH. 3512--3516.  Shahin Amiriparian Maurice Gerczuk Sandra Ottl Nicholas Cummins Michael Freitag Sergey Pugachevskiy and Bj\u00f6rn Schuller. 2017. Snore Sound Classification Using Image-based Deep Spectrum Features. In INTERSPEECH. 
3512--3516.","DOI":"10.21437\/Interspeech.2017-434"},{"key":"e_1_3_2_1_3_1","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski , Yuhao Zhou , Abdelrahman Mohamed , and Michael Auli . 2020 . wav2vec 2.0: A framework for self-supervised learning of speech representations . In NeurIPS , Vol. 33. 12449 -- 12460 . Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In NeurIPS, Vol. 33. 12449--12460.","journal-title":"NeurIPS"},{"key":"e_1_3_2_1_4_1","first-page":"423","article-title":"Multimodal machine learning: A survey and taxonomy","volume":"41","author":"Baltru\u0161aitis Tadas","year":"2018","unstructured":"Tadas Baltru\u0161aitis , Chaitanya Ahuja , and Louis-Philippe Morency . 2018 . Multimodal machine learning: A survey and taxonomy . IEEE Transactions on Pattern Analysis and Machine Intelligence , Vol. 41 , 2 (2018), 423 -- 443 . Tadas Baltru\u0161aitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, 2 (2018), 423--443.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"crossref","unstructured":"Mathilde Caron Hugo Touvron Ishan Misra Herv\u00e9 J\u00e9gou Julien Mairal Piotr Bojanowski and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In ICCV. 9650--9660.  Mathilde Caron Hugo Touvron Ishan Misra Herv\u00e9 J\u00e9gou Julien Mairal Piotr Bojanowski and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In ICCV. 
9650--9660.","DOI":"10.1109\/ICCV48922.2021.00951"},{"volume-title":"Automatic speech emotion recognition: A survey","author":"Chandrasekar Purnima","key":"e_1_3_2_1_6_1","unstructured":"Purnima Chandrasekar , Santosh Chapaneri , and Deepak Jayaswal . 2014. Automatic speech emotion recognition: A survey . In CSCITA. IEEE , 341--346. Purnima Chandrasekar, Santosh Chapaneri, and Deepak Jayaswal. 2014. Automatic speech emotion recognition: A survey. In CSCITA. IEEE, 341--346."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.3390\/asi5040080"},{"key":"e_1_3_2_1_8_1","volume-title":"Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML. PMLR, 794--803.","author":"Chen Zhao","year":"2018","unstructured":"Zhao Chen , Vijay Badrinarayanan , Chen-Yu Lee , and Andrew Rabinovich . 2018 . Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML. PMLR, 794--803. Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. 2018. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML. PMLR, 794--803."},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Lukas Christ Shahin Amiriparian Alice Baird Alexander Kathan Niklas M\u00fcller Steffen Klug Chris Gagne Panagiotis Tzirakis Eva-Maria Me\u00dfner Andreas K\u00f6nig etal 2023. The MuSe 2023 Multimodal Sentiment Analysis Challenge: Mimicked Emotions Cross-Cultural Humour and Personalisation. arXiv preprint arXiv:2305.03369 (2023).  Lukas Christ Shahin Amiriparian Alice Baird Alexander Kathan Niklas M\u00fcller Steffen Klug Chris Gagne Panagiotis Tzirakis Eva-Maria Me\u00dfner Andreas K\u00f6nig et al. 2023. The MuSe 2023 Multimodal Sentiment Analysis Challenge: Mimicked Emotions Cross-Cultural Humour and Personalisation. 
arXiv preprint arXiv:2305.03369 (2023).","DOI":"10.1145\/3606039.3613114"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3551876.3554817"},{"key":"e_1_3_2_1_11_1","volume-title":"Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555","author":"Clark Kevin","year":"2020","unstructured":"Kevin Clark , Minh-Thang Luong , Quoc V Le , and Christopher D Manning . 2020 . Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020). Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020)."},{"key":"e_1_3_2_1_12_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. Minneapolis, Minnesota , 4171--4186. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. Minneapolis, Minnesota, 4171--4186."},{"key":"e_1_3_2_1_13_1","volume-title":"Speed-Robust Keyword Spotting Via Soft Self-Attention on Multi-Scale Features. In IEEE Spoken Language Technology Workshop. 1104--1111","author":"Ding Chaoyue","year":"2023","unstructured":"Chaoyue Ding , Jiakui Li , Martin Zong , and Baoxiang Li . 2023 . Speed-Robust Keyword Spotting Via Soft Self-Attention on Multi-Scale Features. In IEEE Spoken Language Technology Workshop. 1104--1111 . Chaoyue Ding, Jiakui Li, Martin Zong, and Baoxiang Li. 2023. Speed-Robust Keyword Spotting Via Soft Self-Attention on Multi-Scale Features. In IEEE Spoken Language Technology Workshop. 
1104--1111."},{"key":"e_1_3_2_1_14_1","volume-title":"LETR: A lightweight and efficient transformer for keyword spotting","author":"Ding Kevin","year":"2022","unstructured":"Kevin Ding , Martin Zong , Jiakui Li , and Baoxiang Li . 2022 . LETR: A lightweight and efficient transformer for keyword spotting . In ICASSP. IEEE , 7987--7991. Kevin Ding, Martin Zong, Jiakui Li, and Baoxiang Li. 2022. LETR: A lightweight and efficient transformer for keyword spotting. In ICASSP. IEEE, 7987--7991."},{"key":"e_1_3_2_1_15_1","volume-title":"Facial action coding system. Environmental Psychology & Nonverbal Behavior","author":"Ekman Paul","year":"1978","unstructured":"Paul Ekman and Wallace V Friesen . 1978. Facial action coding system. Environmental Psychology & Nonverbal Behavior ( 1978 ). Paul Ekman and Wallace V Friesen. 1978. Facial action coding system. Environmental Psychology & Nonverbal Behavior (1978)."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2015.2457417"},{"volume-title":"ACM Multimedia","author":"Eyben Florian","key":"e_1_3_2_1_17_1","unstructured":"Florian Eyben , Martin W\u00f6llmer , and Bj\u00f6rn Schuller . 2010. Opensmile: the munich versatile and fast open-source audio feature extractor . In ACM Multimedia . Association for Computing Machinery , Firenze, Italy , 1459--1462. Florian Eyben, Martin W\u00f6llmer, and Bj\u00f6rn Schuller. 2010. Opensmile: the munich versatile and fast open-source audio feature extractor. In ACM Multimedia. Association for Computing Machinery, Firenze, Italy, 1459--1462."},{"key":"e_1_3_2_1_18_1","volume-title":"Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al.","author":"Goodfellow Ian J","year":"2013","unstructured":"Ian J Goodfellow , Dumitru Erhan , Pierre Luc Carrier , Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. 2013 . 
Challenges in representation learning: A report on three machine learning contests. In ICONIP. 117--124. Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. 2013. Challenges in representation learning: A report on three machine learning contests. In ICONIP. 117--124."},{"key":"e_1_3_2_1_19_1","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770--778.  Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770--778."},{"key":"e_1_3_2_1_20_1","volume-title":"Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654","author":"He Pengcheng","year":"2020","unstructured":"Pengcheng He , Xiaodong Liu , Jianfeng Gao , and Weizhu Chen . 2020 . Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654 (2020). Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654 (2020)."},{"key":"e_1_3_2_1_21_1","volume-title":"Multimodal Temporal Attention in Sentiment Analysis. In International on Multimodal Sentiment Analysis Workshop and Challenge. 61--66","author":"He Yu","year":"2022","unstructured":"Yu He , Licai Sun , Zheng Lian , Bin Liu , Jianhua Tao , Meng Wang , and Yuan Cheng . 2022 . Multimodal Temporal Attention in Sentiment Analysis. In International on Multimodal Sentiment Analysis Workshop and Challenge. 61--66 . Yu He, Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao, Meng Wang, and Yuan Cheng. 2022. Multimodal Temporal Attention in Sentiment Analysis. In International on Multimodal Sentiment Analysis Workshop and Challenge. 
61--66."},{"key":"e_1_3_2_1_22_1","volume-title":"Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, et al.","author":"Hershey Shawn","year":"2017","unstructured":"Shawn Hershey , Sourish Chaudhuri , Daniel PW Ellis , Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, et al. 2017 . CNN architectures for large-scale audio classification. In ICASSP. IEEE , 131--135. Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, et al. 2017. CNN architectures for large-scale audio classification. In ICASSP. IEEE, 131--135."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3122291"},{"key":"e_1_3_2_1_24_1","unstructured":"Yu Huang Junyang Lin Chang Zhou Hongxia Yang and Longbo Huang. 2022. Modality competition: What makes joint training of multi-modal network fail in deep learning?(provably). In ICML. PMLR 9226--9259.  Yu Huang Junyang Lin Chang Zhou Hongxia Yang and Longbo Huang. 2022. Modality competition: What makes joint training of multi-modal network fail in deep learning?(provably). In ICML. PMLR 9226--9259."},{"key":"e_1_3_2_1_25_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_1_26_1","volume-title":"Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942","author":"Lan Zhenzhong","year":"2019","unstructured":"Zhenzhong Lan , Mingda Chen , Sebastian Goodman , Kevin Gimpel , Piyush Sharma , and Radu Soricut . 2019 . Albert: A lite bert for self-supervised learning of language representations. 
arXiv preprint arXiv:1909.11942 (2019). Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3551876.3554809"},{"key":"e_1_3_2_1_28_1","unstructured":"Shan Li Weihong Deng and JunPing Du. 2017. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR. 2852--2861.  Shan Li Weihong Deng and JunPing Du. 2017. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In CVPR. 2852--2861."},{"key":"e_1_3_2_1_29_1","volume-title":"Learn to combine modalities in multimodal deep learning. arXiv preprint arXiv:1805.11730","author":"Liu Kuan","year":"2018","unstructured":"Kuan Liu , Yanen Li , Ning Xu , and Prem Natarajan . 2018. Learn to combine modalities in multimodal deep learning. arXiv preprint arXiv:1805.11730 ( 2018 ). Kuan Liu, Yanen Li, Ning Xu, and Prem Natarajan. 2018. Learn to combine modalities in multimodal deep learning. arXiv preprint arXiv:1805.11730 (2018)."},{"key":"e_1_3_2_1_30_1","volume-title":"Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu , Myle Ott , Naman Goyal , Jingfei Du , Mandar Joshi , Danqi Chen , Omer Levy , Mike Lewis , Luke Zettlemoyer , and Veselin Stoyanov . 2019 . Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019). Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. 
arXiv preprint arXiv:1907.11692 (2019)."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2017.2736999"},{"key":"e_1_3_2_1_32_1","unstructured":"Arsha Nagrani Shan Yang Anurag Arnab Aren Jansen Cordelia Schmid and Chen Sun. 2021. Attention bottlenecks for multimodal fusion. In NeurIPS. 14200--14213.  Arsha Nagrani Shan Yang Anurag Arnab Aren Jansen Cordelia Schmid and Chen Sun. 2021. Attention bottlenecks for multimodal fusion. In NeurIPS. 14200--14213."},{"key":"e_1_3_2_1_33_1","volume-title":"Glove: Global vectors for word representation. In EMNLP. 1532--1543.","author":"Pennington Jeffrey","year":"2014","unstructured":"Jeffrey Pennington , Richard Socher , and Christopher D Manning . 2014 . Glove: Global vectors for word representation. In EMNLP. 1532--1543. Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In EMNLP. 1532--1543."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11431-020-1647-3"},{"key":"e_1_3_2_1_35_1","volume-title":"wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862","author":"Schneider Steffen","year":"2019","unstructured":"Steffen Schneider , Alexei Baevski , Ronan Collobert , and Michael Auli . 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862 ( 2019 ). Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862 (2019)."},{"key":"e_1_3_2_1_36_1","volume-title":"LightFace: A Hybrid Deep Face Recognition Framework. In Innovations in Intelligent Systems and Applications Conference. 23--27","author":"Serengil Sefik Ilkin","year":"2020","unstructured":"Sefik Ilkin Serengil and Alper Ozpinar . 2020 . LightFace: A Hybrid Deep Face Recognition Framework. In Innovations in Intelligent Systems and Applications Conference. 23--27 . 
Sefik Ilkin Serengil and Alper Ozpinar. 2020. LightFace: A Hybrid Deep Face Recognition Framework. In Innovations in Intelligent Systems and Applications Conference. 23--27."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/2627435.2670313"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3475957.3484450"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3423327.3423673"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2023.3274829"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3551876.3554806"},{"key":"e_1_3_2_1_42_1","volume-title":"NeurIPS","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , \u0141ukasz Kaiser , and Illia Polosukhin . 2017 . Attention is all you need . In NeurIPS , Vol. 30 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS, Vol. 30."},{"key":"e_1_3_2_1_43_1","first-page":"1","article-title":"Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap","volume":"01","author":"Wagner J.","year":"2023","unstructured":"J. Wagner , A. Triantafyllopoulos , H. Wierstorf , M. Schmitt , F. Burkhardt , F. Eyben , and B. W. Schuller . 2023 . Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap . IEEE Transactions on Pattern Analysis & Machine Intelligence 01 (2023), 1 -- 13 . J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller. 2023. Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap. 
IEEE Transactions on Pattern Analysis & Machine Intelligence 01 (2023), 1--13.","journal-title":"IEEE Transactions on Pattern Analysis & Machine Intelligence"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"crossref","unstructured":"Weiyao Wang Du Tran and Matt Feiszli. 2020b. What makes training multi-modal classification networks hard?. In CVPR. 12695--12705.  Weiyao Wang Du Tran and Matt Feiszli. 2020b. What makes training multi-modal classification networks hard?. In CVPR. 12695--12705.","DOI":"10.1109\/CVPR42600.2020.01271"},{"key":"e_1_3_2_1_45_1","first-page":"4835","article-title":"Deep multimodal fusion by channel exchanging","volume":"33","author":"Wang Yikai","year":"2020","unstructured":"Yikai Wang , Wenbing Huang , Fuchun Sun , Tingyang Xu , Yu Rong , and Junzhou Huang . 2020 a. Deep multimodal fusion by channel exchanging . In NeurIPS , Vol. 33. 4835 -- 4845 . Yikai Wang, Wenbing Huang, Fuchun Sun, Tingyang Xu, Yu Rong, and Junzhou Huang. 2020a. Deep multimodal fusion by channel exchanging. In NeurIPS, Vol. 33. 4835--4845.","journal-title":"NeurIPS"},{"key":"e_1_3_2_1_46_1","volume-title":"NeurIPS","volume":"32","author":"Yang Zhilin","year":"2019","unstructured":"Zhilin Yang , Zihang Dai , Yiming Yang , Jaime Carbonell , Russ R Salakhutdinov , and Quoc V Le . 2019 . Xlnet: Generalized autoregressive pretraining for language understanding . In NeurIPS , Vol. 32 . Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In NeurIPS, Vol. 
32."},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2016.2603342"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2021.3093397"}],"event":{"name":"MM '23: The 31st ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Ottawa ON Canada","acronym":"MM '23"},"container-title":["Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisation"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3606039.3613113","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3606039.3613113","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:36:20Z","timestamp":1750178180000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3606039.3613113"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,29]]},"references-count":48,"alternative-id":["10.1145\/3606039.3613113","10.1145\/3606039"],"URL":"https:\/\/doi.org\/10.1145\/3606039.3613113","relation":{},"subject":[],"published":{"date-parts":[[2023,10,29]]},"assertion":[{"value":"2023-10-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}