{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T09:03:42Z","timestamp":1773824622576,"version":"3.50.1"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,10,23]],"date-time":"2023-10-23T00:00:00Z","timestamp":1698019200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62072212, 61906187, 61976207, and 61902394"],"award-info":[{"award-number":["62072212, 61906187, 61976207, and 61902394"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Development Project of Jilin Province of China","award":["20220508125RC, 20230201065GX, 20200401083GX, 2020C003, and 20200403172SF"],"award-info":[{"award-number":["20220508125RC, 20230201065GX, 20200401083GX, 2020C003, and 20200403172SF"]}]},{"name":"National Key R&D Program","award":["2018YFC2001302"],"award-info":[{"award-number":["2018YFC2001302"]}]},{"name":"Jilin Provincial Key Laboratory of Big Data Intelligent Cognition","award":["20210504003GH"],"award-info":[{"award-number":["20210504003GH"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>Knowledge-based visual question answering not only needs to answer the questions based on images but also incorporates external knowledge to study reasoning in the joint space of vision and language. To bridge the gap between visual content and semantic cues, it is important to capture the question-related and semantics-rich vision-language connections. Most existing solutions model simple intra-modality relation or represent cross-modality relation using a single vector, which makes it difficult to effectively model complex connections between visual features and question features. Thus, we propose a cross-modality multiple relations learning model, aiming to better enrich cross-modality representations and construct advanced multi-modality knowledge triplets. First, we design a simple yet effective method to generate multiple relations that represent the rich cross-modality relations. The various cross-modality relations link the textual question to the related visual objects. These multi-modality triplets efficiently align the visual objects and corresponding textual answers. Second, to encourage multiple relations to better align with different semantic relations, we further formulate a novel global-local loss. The global loss enables the visual objects and corresponding textual answers close to each other through cross-modality relations in the vision-language space, and the local loss better preserves semantic diversity among multiple relations. 
Experimental results on the Outside Knowledge VQA and Knowledge-Routed Visual Question Reasoning datasets demonstrate that our model outperforms the state-of-the-art methods.<\/jats:p>","DOI":"10.1145\/3618301","type":"journal-article","created":{"date-parts":[[2023,9,2]],"date-time":"2023-09-02T11:30:05Z","timestamp":1693654205000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4751-0708","authenticated-orcid":false,"given":"Yan","family":"Wang","sequence":"first","affiliation":[{"name":"School of Artificial Intelligence, and Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5321-2176","authenticated-orcid":false,"given":"Peize","family":"Li","sequence":"additional","affiliation":[{"name":"School of Artificial Intelligence, Jilin University, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8433-0215","authenticated-orcid":false,"given":"Qingyi","family":"Si","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, and School of Cyber Security, University of Chinese Academy of Sciences, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4081-8838","authenticated-orcid":false,"given":"Hanwen","family":"Zhang","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, and School of Cyber Security, University of Chinese Academy of Sciences, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6369-0681","authenticated-orcid":false,"given":"Wenyu","family":"Zang","sequence":"additional","affiliation":[{"name":"China Electronics Corporation, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8432-1658","authenticated-orcid":false,"given":"Zheng","family":"Lin","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, and School of Cyber Security, University of Chinese Academy of Sciences, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9899-8566","authenticated-orcid":false,"given":"Peng","family":"Fu","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, and School of Cyber Security, University of Chinese Academy of Sciences, China"}]}],"member":"320","published-online":{"date-parts":[[2023,10,23]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-76298-0_52"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.285"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018102"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00209"},{"key":"e_1_3_1_7_2","first-page":"2758","article-title":"Knowledge-routed visual question reasoning: Challenges for deep representation embedding","author":"Cao Qingxing","year":"2021","unstructured":"Qingxing Cao, Bailin Li, Xiaodan Liang, Keze Wang, and Liang Lin. 2021. Knowledge-routed visual question reasoning: Challenges for deep representation embedding. IEEE Trans. Neural Netw. Learn. Syst. 
33, 7 (2021), 2758\u20132767.","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00503"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3390891"},{"key":"e_1_3_1_10_2","first-page":"489","volume-title":"Findings of the Association for Computational Linguistics (EMNLP\u201920)","author":"Gard\u00e8res Fran\u00e7ois","year":"2020","unstructured":"Fran\u00e7ois Gard\u00e8res, Maryam Ziaeefard, Baptiste Abeloos, and Freddy Lecue. 2020. Conceptbert: Concept-aware representation for visual question answering. In Findings of the Association for Computational Linguistics (EMNLP\u201920). 489\u2013498."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.670"},{"key":"e_1_3_1_12_2","unstructured":"hasPartKB. 2004. hasPartKB: A New Knowledge Base of hasPart Relations. Retrieved from https:\/\/allenai.org\/data\/haspartkb"},{"key":"e_1_3_1_13_2","volume-title":"International Conference on Learning Representations","author":"Jang Eric","year":"2017","unstructured":"Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3321505"},{"key":"e_1_3_1_15_2","first-page":"1564","volume-title":"Advances in Neural Information Processing Systems","author":"Kim Jin-Hwa","year":"2018","unstructured":"Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems. 1564\u20131574."},{"key":"e_1_3_1_16_2","first-page":"32","volume-title":"Int. J. Comput. Vis","author":"Krishna Ranjay","year":"2017","unstructured":"Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et\u00a0al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123, 1 (2017), 32\u201373."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413943"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.469"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1023\/B:BTTJ.0000047600.45421.6d"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3498340"},{"key":"e_1_3_1_22_2","volume-title":"International Conference on Learning Representations","author":"Loshchilov Ilya","year":"2018","unstructured":"Ilya Loshchilov and Frank Hutter. 2018. Fixing weight decay regularization in Adam. In International Conference on Learning Representations."},{"key":"e_1_3_1_23_2","volume-title":"Advances in Neural Information Processing Systems","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
In Advances in Neural Information Processing Systems."},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2958756"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01389"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00331"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240508.3240712"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S18-2027"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3487042"},{"key":"e_1_3_1_30_2","volume-title":"Advances in Neural Information Processing Systems","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et\u00a0al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11671"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3462987"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46547-0_19"},{"key":"e_1_3_1_34_2","first-page":"91","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"28","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neur. Inf. Process. Syst. 28 (2015), 91\u201399.","journal-title":"Adv. Neur. Inf. Process. Syst."},{"key":"e_1_3_1_35_2","volume-title":"Proceedings of the 3rd Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN\u201921)","author":"Shevchenko Violetta","year":"2021","unstructured":"Violetta Shevchenko, Damien Teney, Anthony Dick, and Anton van den Hengel. 2021. Reasoning over vision and language: Exploring the benefits of supplemental knowledge. In Proceedings of the 3rd Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN\u201921)."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806216"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v28i1.8735"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/2998574"},{"key":"e_1_3_1_40_2","first-page":"6827","article-title":"What makes for good views for contrastive learning?","volume":"33","author":"Tian Yonglong","year":"2020","unstructured":"Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. 2020. What makes for good views for contrastive learning? Adv. Neur. Inf. Process. Syst. 33 (2020), 6827\u20136839.","journal-title":"Adv. Neur. Inf. Process. Syst."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3314577"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107248"},{"key":"e_1_3_1_43_2","unstructured":"Wikipedia. 2017. Wikipedia The Free Encyclopedia. 
Retrieved from https:\/\/www.wikipedia.org\/"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i3.20174"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3416493"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3316767"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2020.107563"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00644"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2018.2817340"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3320061"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00553"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2020.10.007"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467285"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2020\/153"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3618301","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3618301","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:03:46Z","timestamp":1750291426000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3618301"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,23]]},"references-count":53,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3618301"],"URL":"https:\/\/doi.org\/10.1145\/3618301","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,23]]},"assertion":[{"value":"2022-09-25","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-08-20","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-10-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
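The record above is a Crossref REST API work response (message-type "work") for the article "Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering". As a minimal sketch of retrieving and reading such a record, assuming only Crossref's public /works/{DOI} endpoint and the field names visible in the record itself:

import requests

# Fetch the work record above from the Crossref REST API.
# The /works/{DOI} route and the {"status", "message-type", "message"}
# envelope follow Crossref's documented response format; the field reads
# below simply mirror names present in this record.
resp = requests.get("https://api.crossref.org/works/10.1145/3618301", timeout=30)
resp.raise_for_status()
body = resp.json()
assert body["status"] == "ok" and body["message-type"] == "work"

work = body["message"]
print(work["title"][0])                    # article title
print(work["container-title"][0])          # journal name
print("vol.", work["volume"], "no.", work["issue"], "pp.", work["page"])
print("cited by:", work["is-referenced-by-count"])
for author in work["author"]:
    print(author["given"], author["family"])

Note that the escaped slashes (\/) in the raw record are ordinary JSON string escaping; resp.json() yields plain strings such as https://doi.org/10.1145/3618301.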
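The abstract describes a global-local loss: a global term that draws a visual object and its textual answer close to each other through a cross-modality relation, and a local term that preserves semantic diversity among the multiple relations. The record does not include the paper's actual formulation, so the sketch below is only one illustrative reading of that description: it assumes a TransE-style margin loss for the global term (object embedding plus relation embedding should land near the answer embedding) and a pairwise cosine-similarity penalty for the local term; the function name global_local_loss, the tensor shapes, and the in-batch negative sampling are all hypothetical.

import torch
import torch.nn.functional as F

def global_local_loss(obj_emb, rel_emb, ans_emb, margin=1.0, lambda_local=0.1):
    """Illustrative global-local loss (assumed form, not the paper's).

    obj_emb: (B, D) visual object embeddings
    rel_emb: (B, R, D) R cross-modality relation embeddings per example
    ans_emb: (B, D) textual answer embeddings
    """
    B, R, D = rel_emb.shape

    # Global term: for each relation r, obj + r should be near the true
    # answer and far from a negative answer (here: answers shuffled within
    # the batch, a simplification that can occasionally collide with the
    # positive answer).
    pred = obj_emb.unsqueeze(1) + rel_emb                   # (B, R, D)
    pos_dist = (pred - ans_emb.unsqueeze(1)).norm(dim=-1)   # (B, R)
    neg_ans = ans_emb[torch.randperm(B)]
    neg_dist = (pred - neg_ans.unsqueeze(1)).norm(dim=-1)   # (B, R)
    global_term = F.relu(margin + pos_dist - neg_dist).mean()

    # Local term: penalize pairwise cosine similarity among the R relation
    # vectors so they do not collapse onto a single semantic relation.
    rel_norm = F.normalize(rel_emb, dim=-1)
    sim = rel_norm @ rel_norm.transpose(1, 2)               # (B, R, R)
    off_diag = sim - torch.eye(R, device=sim.device)        # diagonal is 1, so this zeroes it
    local_term = off_diag.abs().sum(dim=(1, 2)).mean() / max(R * (R - 1), 1)

    return global_term + lambda_local * local_term

With R = 1 the local term vanishes and only the margin loss remains, which matches the intuition that diversity regularization is only meaningful when multiple relations are learned.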