{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T18:51:37Z","timestamp":1778179897094,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":64,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-sa\/4.0\/"}],"funder":[{"name":"Beijing Academy of Artificial Intelligence (BAAI)"},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62173195"],"award-info":[{"award-number":["62173195"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547906","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:01Z","timestamp":1665416581000},"page":"1688-1697","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":51,"title":["MIntRec: A New Dataset for Multimodal Intent Recognition"],"prefix":"10.1145","author":[{"given":"Hanlei","family":"Zhang","sequence":"first","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Hua","family":"Xu","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Xin","family":"Wang","sequence":"additional","affiliation":[{"name":"Tsinghua University; Hebei University of Science and Technology, Beijing, Shijiazhuang, China"}]},{"given":"Qianrui","family":"Zhou","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Shaojie","family":"Zhao","sequence":"additional","affiliation":[{"name":"Tsinghua University; Hebei University of Science and Technology, Beijing, China"}]},{"given":"Jiayan","family":"Teng","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58523-5_13"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00033"},{"key":"e_1_3_2_2_3_1","volume-title":"Proceedings of the 33th Advances in Neural Information Processing Systems","volume":"33","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski , Yuhao Zhou , Abdelrahman Mohamed , and Michael Auli . 2020 . wav2vec 2.0: A framework for self-supervised learning of speech representations . In Proceedings of the 33th Advances in Neural Information Processing Systems , Vol. 33 . 12449--12460. Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 33th Advances in Neural Information Processing Systems, Vol. 33. 12449--12460."},{"key":"e_1_3_2_2_4_1","unstructured":"Michael E Bratman. 1988. Intention --Plans --and--Practical--Reason. Mind 97 388 (1988).  Michael E Bratman. 1988. Intention --Plans --and--Practical--Reason. Mind 97 388 (1988)."},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W17-5522"},{"key":"e_1_3_2_2_6_1","volume-title":"IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation 42, 4","author":"Busso Carlos","year":"2008","unstructured":"Carlos Busso , Murtaza Bulut , Chi-Chun Lee , Abe Kazemzadeh , Emily Mower , Samuel Kim , Jeannette N Chang , Sungbok Lee , and Shrikanth S Narayanan . 2008 . IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation 42, 4 (2008), 335--359. Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation 42, 4 (2008), 335--359."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1239"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.nlp4convai-1.5"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1455"},{"key":"e_1_3_2_2_10_1","volume-title":"et al","author":"Chen Kai","year":"2019","unstructured":"Kai Chen , Jiaqi Wang , Jiangmiao Pang , Yuhang Cao , Yu Xiong , Xiaoxiao Li , Shuyang Sun , Wansen Feng , Ziwei Liu , Jiarui Xu , et al . 2019 . MMDetection : Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019). Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al . 2019. MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)."},{"key":"e_1_3_2_2_11_1","volume-title":"Bert for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909","author":"Chen Qian","year":"2019","unstructured":"Qian Chen , Zhu Zhuo , and Wen Wang . 2019. Bert for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909 ( 2019 ). Qian Chen, Zhu Zhuo, and Wen Wang. 2019. Bert for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909 (2019)."},{"key":"e_1_3_2_2_12_1","volume-title":"Proceedings of the Asian Conference on Computer Vision. 251--263","author":"Chung Joon Son","year":"2016","unstructured":"Joon Son Chung and Andrew Zisserman . 2016 . Out of time: automated lip sync in the wild . In Proceedings of the Asian Conference on Computer Vision. 251--263 . Joon Son Chung and Andrew Zisserman. 2016. Out of time: automated lip sync in the wild. In Proceedings of the Asian Conference on Computer Vision. 251--263."},{"key":"e_1_3_2_2_13_1","unstructured":"Alice Coucke Alaa Saade Adrien Ball Th\u00e9odore Bluche Alexandre Caulier David Leroy Cl\u00e9ment Doumouro Thibault Gisselbrecht Francesco Caltagirone Thibaut Lavril etal 2018. Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190 (2018).  Alice Coucke Alaa Saade Adrien Ball Th\u00e9odore Bluche Alexandre Caulier David Leroy Cl\u00e9ment Doumouro Thibault Gisselbrecht Francesco Caltagirone Thibaut Lavril et al. 2018. Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190 (2018)."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1044"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1992.225858"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143891"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i14.17534"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1211"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413678"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.322"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.3115\/116580.116613"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01279"},{"key":"e_1_3_2_2_24_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171--4186","author":"Ming-Wei Chang Jacob Devlin","year":"2019","unstructured":"Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171--4186 . Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171--4186."},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1469"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1131"},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1548"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6353"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_2_30_1","volume-title":"Benchmarking natural language understanding services for building conversational agents. arXiv preprint arXiv:1903.05566","author":"Liu Xingkun","year":"2019","unstructured":"Xingkun Liu , Arash Eshghi , Pawel Swietojanski , and Verena Rieser . 2019. Benchmarking natural language understanding services for building conversational agents. arXiv preprint arXiv:1903.05566 ( 2019 ). Xingkun Liu, Arash Eshghi, Pawel Swietojanski, and Verena Rieser. 2019. Benchmarking natural language understanding services for building conversational agents. arXiv preprint arXiv:1903.05566 (2019)."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1209"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.25080\/Majora-7b98e3ed-003"},{"key":"e_1_3_2_2_33_1","volume-title":"Proceedings of the 12th Language Resources and Evaluation Conference. 6149--6157","author":"Nakamura Kai","year":"2020","unstructured":"Kai Nakamura , Sharon Levy , and William Yang Wang . 2020 . Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection . In Proceedings of the 12th Language Resources and Evaluation Conference. 6149--6157 . Kai Nakamura, Sharon Levy, and William Yang Wang. 2020. Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection. In Proceedings of the 12th Language Resources and Evaluation Conference. 6149--6157."},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1050"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1214"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.214"},{"key":"e_1_3_2_2_37_1","volume-title":"Proceedings of the 28th International Conference on Neural Information Processing Systems. 91--99","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren , Kaiming He , Ross B Girshick , and Jian Sun . 2015 . Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks . In Proceedings of the 28th International Conference on Neural Information Processing Systems. 91--99 . Shaoqing Ren, Kaiming He, Ross B Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems. 91--99."},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053900"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/s12559-019-09704-5"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.402"},{"key":"e_1_3_2_2_41_1","volume-title":"emotion, and action: A neural theory based on semantic pointers. Cognitive science 38, 5","author":"Schr\u00f6der Tobias","year":"2014","unstructured":"Tobias Schr\u00f6der , Terrence C Stewart , and Paul Thagard . 2014. Intention , emotion, and action: A neural theory based on semantic pointers. Cognitive science 38, 5 ( 2014 ), 851--880. Tobias Schr\u00f6der, Terrence C Stewart, and Paul Thagard. 2014. Intention, emotion, and action: A neural theory based on semantic pointers. Cognitive science 38, 5 (2014), 851--880."},{"key":"e_1_3_2_2_42_1","volume-title":"Oxford English","author":"Simpson Esc","year":"1989","unstructured":"Esc Simpson , Ja & Weiner . 1989. Oxford english dictionary. Dictionary , Oxford English ( 1989 ). Esc Simpson, Ja & Weiner. 1989. Oxford english dictionary. Dictionary, Oxford English (1989)."},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.218"},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2019.07.003"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475587"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1656"},{"key":"e_1_3_2_2_47_1","volume-title":"Proceedings of the 30th Advances in neural information processing systems","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , Lukasz Kaiser , and Illia Polosukhin . 2017 . Attention is all you need . In Proceedings of the 30th Advances in neural information processing systems , Vol. 30 . Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 30th Advances in neural information processing systems, Vol. 30."},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413890"},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1517"},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-58855-8"},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/W15-1509"},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1166"},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.343"},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1115"},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.12021"},{"key":"e_1_3_2_2_57_1","volume-title":"Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259","author":"Zadeh Amir","year":"2016","unstructured":"Amir Zadeh , Rowan Zellers , Eli Pincus , and Louis-Philippe Morency . 2016. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259 ( 2016 ). Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259 (2016)."},{"key":"e_1_3_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1208"},{"key":"e_1_3_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1519"},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.249"},{"key":"e_1_3_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-demo.20"},{"key":"e_1_3_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i16.17690"},{"key":"e_1_3_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i16.17689"},{"key":"e_1_3_2_2_64_1","volume-title":"Proceedings of the IEEE international conference on computer vision. 192--201","author":"Zhang Shifeng","year":"2017","unstructured":"Shifeng Zhang , Xiangyu Zhu , Zhen Lei , Hailin Shi , Xiaobo Wang , and Stan Z Li . 2017 . S3fd: Single shot scale-invariant face detector . In Proceedings of the IEEE international conference on computer vision. 192--201 . Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z Li. 2017. S3fd: Single shot scale-invariant face detector. In Proceedings of the IEEE international conference on computer vision. 192--201."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547906","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547906","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:30Z","timestamp":1750186830000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547906"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":64,"alternative-id":["10.1145\/3503161.3547906","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547906","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}