{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T05:13:53Z","timestamp":1769922833174,"version":"3.49.0"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,3,15]],"date-time":"2023-03-15T00:00:00Z","timestamp":1678838400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2018AAA0100604"],"award-info":[{"award-number":["2018AAA0100604"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61720106006, 62036012, 62072455, 61721004, U1836220, and 61872424"],"award-info":[{"award-number":["61720106006, 62036012, 62072455, 61721004, U1836220, and 61872424"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Beijing Natural Science Foundation","award":["L201001"],"award-info":[{"award-number":["L201001"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,7,31]]},"abstract":"<jats:p>As a crucial part of natural language processing, the event-centered commonsense inference task has attracted increasing attention. Given an observed event, the intentions and reactions of the people involved in the event must be inferred by artificial intelligence algorithms. To solve this problem, sequence-to-sequence methods are widely studied, where the event is first encoded into a specific representation and then decoded to generate the results. 
However, all existing methods learn the event representation from textual information alone and ignore visual information, which is in fact helpful for commonsense inference. In this article, we first define a new task of multi-modal commonsense inference with both textual and visual information. A new event-centered multi-modal dataset is also provided. Then we propose a multi-source knowledge reasoning graph network to solve this task, where three kinds of relational knowledge are considered. Multi-modal correlations are learned to get the event\u2019s multi-modal representation from a global perspective. Intra-event object relations are explored to capture the fine-grained event feature with an object graph. Inter-event semantic relations are also explored through external knowledge to understand the semantic associations among events with an event graph. We conduct extensive experiments on the new dataset, and the results show the effectiveness of our method.<\/jats:p>","DOI":"10.1145\/3573201","type":"journal-article","created":{"date-parts":[[2022,12,1]],"date-time":"2022-12-01T12:41:10Z","timestamp":1669898470000},"page":"1-17","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Multi-Source Knowledge Reasoning Graph Network for Multi-Modal Commonsense Inference"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4598-9505","authenticated-orcid":false,"given":"Xuan","family":"Ma","sequence":"first","affiliation":[{"name":"National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences, China and Peng Cheng Laboratory, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5453-9755","authenticated-orcid":false,"given":"Xiaoshan","family":"Yang","sequence":"additional","affiliation":[{"name":"National Laboratory of 
Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences, China and Peng Cheng Laboratory, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7464-9115","authenticated-orcid":false,"given":"Changsheng","family":"Xu","sequence":"additional","affiliation":[{"name":"National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences, China and Peng Cheng Laboratory, Shenzhen, China"}]}],"member":"320","published-online":{"date-parts":[[2023,3,15]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/219717.219745"},{"key":"e_1_3_1_3_2","first-page":"1993","volume-title":"Advances in Neural Information Processing Systems","author":"Atwood James","year":"2016","unstructured":"James Atwood and Don Towsley. 2016. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems. 1993\u20132001."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00438"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.3115\/980845.980860"},{"key":"e_1_3_1_6_2","article-title":"Comet: Commonsense transformers for automatic knowledge graph construction","author":"Bosselut Antoine","year":"2019","unstructured":"Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. Comet: Commonsense transformers for automatic knowledge graph construction. 
arXiv preprint arXiv:1906.05317 (2019).","journal-title":"arXiv preprint arXiv:1906.05317"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v30i1.10179"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00052"},{"key":"e_1_3_1_9_2","first-page":"3844","volume-title":"Advances in Neural Information Processing Systems","author":"Defferrard Micha\u00ebl","year":"2016","unstructured":"Micha\u00ebl Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844\u20133852."},{"key":"e_1_3_1_10_2","article-title":"Modeling event background for if-then commonsense reasoning using context-aware variational autoencoder","author":"Du Li","year":"2019","unstructured":"Li Du, Xiao Ding, Ting Liu, and Zhongyang Li. 2019. Modeling event background for if-then commonsense reasoning using context-aware variational autoencoder. arXiv preprint arXiv:1909.08824 (2019).","journal-title":"arXiv preprint arXiv:1909.08824"},{"key":"e_1_3_1_11_2","article-title":"Beam search strategies for neural machine translation","author":"Freitag Markus","year":"2017","unstructured":"Markus Freitag and Yaser Al-Onaizan. 2017. Beam search strategies for neural machine translation. arXiv preprint arXiv:1702.01806 (2017).","journal-title":"arXiv preprint arXiv:1702.01806"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2010.5596796"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018303"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v30i1.10344"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_16_2","article-title":"Deep convolutional networks on graph-structured data","author":"Henaff Mikael","year":"2015","unstructured":"Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. 
Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015).","journal-title":"arXiv preprint arXiv:1506.05163"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6731"},{"issue":"99","key":"e_1_3_1_18_2","first-page":"1","article-title":"Visual-textual hybrid sequence matching for joint reasoning","author":"Huang X.","year":"2020","unstructured":"X. Huang, Y. Peng, and Z. Wen. 2020. Visual-textual hybrid sequence matching for joint reasoning. IEEE Transactions on Cybernetics PP, 99 (2020), 1\u201314.","journal-title":"IEEE Transactions on Cybernetics"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289600.3290956"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01404"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.573"},{"key":"e_1_3_1_22_2","article-title":"Semi-supervised classification with graph convolutional networks","author":"Kipf Thomas N.","year":"2016","unstructured":"Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).","journal-title":"arXiv preprint arXiv:1609.02907"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00188"},{"key":"e_1_3_1_24_2","article-title":"Diffusion convolutional recurrent neural network: Data-driven traffic forecasting","author":"Li Yaguang","year":"2017","unstructured":"Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926 (2017).","journal-title":"arXiv preprint arXiv:1707.01926"},{"key":"e_1_3_1_25_2","article-title":"Constructing narrative event evolutionary graph for script event prediction","author":"Li Zhongyang","year":"2018","unstructured":"Zhongyang Li, Xiao Ding, and Ting Liu. 2018. 
Constructing narrative event evolutionary graph for script event prediction. arXiv preprint arXiv:1805.05081 (2018).","journal-title":"arXiv preprint arXiv:1805.05081"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052675"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6364"},{"key":"e_1_3_1_28_2","first-page":"24","volume-title":"Workshop Frontiers in Corpus Annotation at HLT-NAACL 2004","author":"Meyers Adam","year":"2004","unstructured":"Adam Meyers, Ruth Reeves, Catherine Macleod, Rachel Szekely, Veronika Zielinska, Brian Young, and Ralph Grishman. 2004. The NomBank project: An interim report. In Workshop Frontiers in Corpus Annotation at HLT-NAACL 2004. 24\u201331."},{"key":"e_1_3_1_29_2","article-title":"Efficient estimation of word representations in vector space","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).","journal-title":"arXiv preprint arXiv:1301.3781"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-1098"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123443"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1162\/0891201053630264"},{"key":"e_1_3_1_33_2","first-page":"311","volume-title":"40th Annual Meeting of the Association for Computational Linguistics","author":"Papineni Kishore","year":"2002","unstructured":"Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics. 311\u2013318."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v30i1.10347"},{"key":"e_1_3_1_35_2","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. 
Improving language understanding by generative pre-training. (2018)."},{"key":"e_1_3_1_36_2","article-title":"Modeling naive psychology of characters in simple commonsense stories","author":"Rashkin Hannah","year":"2018","unstructured":"Hannah Rashkin, Antoine Bosselut, Maarten Sap, Kevin Knight, and Yejin Choi. 2018. Modeling naive psychology of characters in simple commonsense stories. arXiv preprint arXiv:1805.06533 (2018).","journal-title":"arXiv preprint arXiv:1805.06533"},{"key":"e_1_3_1_37_2","article-title":"Event2mind: Commonsense inference on events, intents, and reactions","author":"Rashkin Hannah","year":"2018","unstructured":"Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2mind: Commonsense inference on events, intents, and reactions. arXiv preprint arXiv:1805.06939 (2018).","journal-title":"arXiv preprint arXiv:1805.06939"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33013027"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2008.2005605"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-04167-0_33"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.10983"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.11164"},{"key":"e_1_3_1_43_2","article-title":"Attention is all you need","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. 
arXiv preprint arXiv:1706.03762 (2017).","journal-title":"arXiv preprint arXiv:1706.03762"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939753"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330989"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1006"},{"issue":"99","key":"e_1_3_1_47_2","first-page":"1","article-title":"Multi-level knowledge injecting for visual commonsense reasoning","author":"Wen Z.","year":"2020","unstructured":"Z. Wen and Y. Peng. 2020. Multi-level knowledge injecting for visual commonsense reasoning. IEEE Transactions on Circuits and Systems for Video Technology PP, 99 (2020), 1\u20131.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2020.2978386"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00928"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_42"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340531.3411895"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3366423.3380107"},{"key":"e_1_3_1_53_2","article-title":"Deep learning on graphs: A survey","author":"Zhang Ziwei","year":"2020","unstructured":"Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2020. Deep learning on graphs: A survey. 
IEEE Transactions on Knowledge and Data Engineering (2020).","journal-title":"IEEE Transactions on Knowledge and Data Engineering"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3573201","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3573201","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:37:25Z","timestamp":1750178245000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3573201"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,15]]},"references-count":52,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,7,31]]}},"alternative-id":["10.1145\/3573201"],"URL":"https:\/\/doi.org\/10.1145\/3573201","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,15]]},"assertion":[{"value":"2021-10-02","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-11-21","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}