{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T15:13:46Z","timestamp":1775229226615,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":36,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,10,21]],"date-time":"2020-10-21T00:00:00Z","timestamp":1603238400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,10,21]]},"DOI":"10.1145\/3382507.3417960","type":"proceedings-article","created":{"date-parts":[[2020,10,22]],"date-time":"2020-10-22T10:04:35Z","timestamp":1603361075000},"page":"827-834","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":23,"title":["Implicit Knowledge Injectable Cross Attention Audiovisual Model for Group Emotion Recognition"],"prefix":"10.1145","author":[{"given":"Yanan","family":"Wang","sequence":"first","affiliation":[{"name":"KDDI Research, Inc., Saitama, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jianming","family":"Wu","sequence":"additional","affiliation":[{"name":"KDDI Research, Inc., Saitama, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Panikos","family":"Heracleous","sequence":"additional","affiliation":[{"name":"KDDI Research, Inc., Saitama, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shinya","family":"Wada","sequence":"additional","affiliation":[{"name":"KDDI Research, Inc., Saitama, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rui","family":"Kimura","sequence":"additional","affiliation":[{"name":"KDDI Research, Inc., Saitama, 
Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Satoshi","family":"Kurihara","sequence":"additional","affiliation":[{"name":"Keio University, Tokyo, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,10,22]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240578"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"crossref","unstructured":"Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_2_3_1","volume-title":"Convolutional Image Captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).","author":"Aneja Jyoti","unstructured":"Jyoti Aneja, Aditya Deshpande, and Alexander G. Schwing. 2018. Convolutional Image Captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)."},{"key":"e_1_3_2_2_4_1","volume-title":"Knowledge is power: How conceptual knowledge transforms visual cognition. Psychonomic bulletin & review","author":"Collins Jessica A","year":"2014","unstructured":"Jessica A Collins and Ingrid R Olson. 2014. Knowledge is power: How conceptual knowledge transforms visual cognition. Psychonomic bulletin & review, Vol. 21, 4 (2014), 843--860."},
{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2010.2064307"},{"key":"e_1_3_2_2_6_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL.","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3340555.3355710"},{"key":"e_1_3_2_2_8_1","volume-title":"Crossmodal attention. Current opinion in neurobiology","author":"Driver Jon","year":"1998","unstructured":"Jon Driver and Charles Spence. 1998. Crossmodal attention. Current opinion in neurobiology, Vol. 8, 2 (1998), 245--253."},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1873951.1874246"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2019.8852184"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3340555.3355712"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3242969.3264990"},{"key":"e_1_3_2_2_13_1","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR."},{"key":"e_1_3_2_2_14_1","unstructured":"Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. In NIPS."},
{"key":"e_1_3_2_2_15_1","volume-title":"Laurens Van Der Maaten, and Kilian Q Weinberger","author":"Huang Gao","year":"2017","unstructured":"Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In CVPR."},{"key":"e_1_3_2_2_16_1","volume-title":"mbox","author":"Ron Kohavi","year":"1995","unstructured":"Ron Kohavi et al. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Ijcai, Vol. 14. Montreal, Canada, 1137--1145."},{"key":"e_1_3_2_2_17_1","volume-title":"Bi-Modality Fusion for Emotion Recognition in the Wild. In 2019 International Conference on Multimodal Interaction","author":"Li Sunan","year":"2019","unstructured":"Sunan Li, Wenming Zheng, Yuan Zong, Cheng Lu, Chuangao Tang, Xingxun Jiang, Jiateng Liu, and Wanchuang Xia. 2019 b. Bi-Modality Fusion for Emotion Recognition in the Wild. In 2019 International Conference on Multimodal Interaction (Suzhou, China) (ICMI '19). Association for Computing Machinery, New York, NY, USA, 589--594. https:\/\/doi.org\/10.1145\/3340555.3355719"},
{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-2594"},{"key":"e_1_3_2_2_19_1","volume-title":"Human and Machine Vision","author":"Magnani Lorenzo","unstructured":"Lorenzo Magnani, Sabino Civita, and Guido Previde Massara. 1994. Visual cognition and cognitive modeling. In Human and Machine Vision. Springer, 229--243."},{"key":"e_1_3_2_2_20_1","volume-title":"Cognitive Systems-Information Processing Meets Brain Science","author":"Morris Richard GM","unstructured":"Richard GM Morris, Lionel Tarassenko, and Michael Kenward. 2005. Cognitive Systems-Information Processing Meets Brain Science. Elsevier."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1016\/0010-0277(84)90021-0"},{"key":"e_1_3_2_2_22_1","volume-title":"The INTERSPEECH 2019 Computational Paralinguistics Challenge: Styrian Dialects, Continuous Sleepiness, Baby Sounds & Orca Activity.. In Interspeech. 2378--2382","author":"Schuller Bj\u00f6rn W","year":"2019","unstructured":"Bj\u00f6rn W Schuller, Anton Batliner, Christian Bergler, Florian B Pokorny, Jarek Krajewski, Margaret Cychosz, Ralf Vollmann, Sonja-Dana Roelen, Sebastian Schnieder, Elika Bergelson, et al. 2019. The INTERSPEECH 2019 Computational Paralinguistics Challenge: Styrian Dialects, Continuous Sleepiness, Baby Sounds & Orca Activity. In Interspeech. 2378--2382."},
{"key":"e_1_3_2_2_23_1","volume-title":"Automatic Group Level Affect and Cohesion Prediction in Videos. In 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW). IEEE, 161--167","author":"Sharma Garima","year":"2019","unstructured":"Garima Sharma, Shreya Ghosh, and Abhinav Dhall. 2019. Automatic Group Level Affect and Cohesion Prediction in Videos. In 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW). IEEE, 161--167."},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_2_2_25_1","volume-title":"End-to-End Speech Emotion Recognition Using Deep Neural Networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5089--5093","author":"Tzirakis P.","unstructured":"P. Tzirakis, J. Zhang, and B. W. Schuller. 2018. End-to-End Speech Emotion Recognition Using Deep Neural Networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5089--5093."},{"key":"e_1_3_2_2_26_1","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS."},
{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2923003"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3242969.3264991"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33017216"},{"key":"e_1_3_2_2_30_1","volume-title":"Multi-Attention Fusion Network for Video-Based Emotion Recognition. In 2019 International Conference on Multimodal Interaction","author":"Wang Yanan","year":"2019","unstructured":"Yanan Wang, Jianming Wu, and Keiichiro Hoashi. 2019 b. Multi-Attention Fusion Network for Video-Based Emotion Recognition. In 2019 International Conference on Multimodal Interaction (Suzhou, China) (ICMI '19). Association for Computing Machinery, New York, NY, USA, 595--601. https:\/\/doi.org\/10.1145\/3340555.3355720"},{"key":"e_1_3_2_2_31_1","unstructured":"Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. 2019. Detectron2. https:\/\/github.com\/facebookresearch\/detectron2"},
{"key":"e_1_3_2_2_32_1","volume-title":"Group-Level Cohesion Prediction Using Deep Learning Models with A Multi-Stream Hybrid Network. In 2019 International Conference on Multimodal Interaction","author":"Dang Tien Xuan","year":"2019","unstructured":"Tien Xuan Dang, Soo-Hyung Kim, Hyung-Jeong Yang, Guee-Sang Lee, and Thanh-Hung Vo. 2019. Group-Level Cohesion Prediction Using Deep Learning Models with A Multi-Stream Hybrid Network. In 2019 International Conference on Multimodal Interaction (Suzhou, China) (ICMI '19). Association for Computing Machinery, New York, NY, USA, 572--576. https:\/\/doi.org\/10.1145\/3340555.3355715"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00901"},{"key":"e_1_3_2_2_34_1","volume-title":"Thirty-Second AAAI Conference on Artificial Intelligence.","author":"Zadeh Amir","year":"2018","unstructured":"Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence."},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"crossref","unstructured":"Yue Zheng, Yali Li, and Shengjin Wang. 2019. Intention Oriented Image Captions With Guiding Objects. In CVPR.","DOI":"10.1109\/CVPR.2019.00859"},
{"key":"e_1_3_2_2_36_1","volume-title":"Automatic Group Cohesiveness Detection With Multi-Modal Features. In 2019 International Conference on Multimodal Interaction","author":"Zhu Bin","year":"2019","unstructured":"Bin Zhu, Xin Guo, Kenneth Barner, and Charles Boncelet. 2019. Automatic Group Cohesiveness Detection With Multi-Modal Features. In 2019 International Conference on Multimodal Interaction (Suzhou, China) (ICMI '19). Association for Computing Machinery, New York, NY, USA, 577--581. https:\/\/doi.org\/10.1145\/3340555.3355716"}],"event":{"name":"ICMI '20: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION","location":"Virtual Event Netherlands","acronym":"ICMI '20","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction"]},"container-title":["Proceedings of the 2020 International Conference on Multimodal 
Interaction"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3382507.3417960","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3382507.3417960","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:38:26Z","timestamp":1750199906000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3382507.3417960"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,21]]},"references-count":36,"alternative-id":["10.1145\/3382507.3417960","10.1145\/3382507"],"URL":"https:\/\/doi.org\/10.1145\/3382507.3417960","relation":{},"subject":[],"published":{"date-parts":[[2020,10,21]]},"assertion":[{"value":"2020-10-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}