{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T19:09:57Z","timestamp":1770232197676,"version":"3.49.0"},"reference-count":85,"publisher":"Springer Science and Business Media LLC","issue":"7","license":[{"start":{"date-parts":[[2025,6,14]],"date-time":"2025-06-14T00:00:00Z","timestamp":1749859200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,6,14]],"date-time":"2025-06-14T00:00:00Z","timestamp":1749859200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Data Sci Anal"],"published-print":{"date-parts":[[2025,11]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Visual question answering (VQA) requires models to comprehend both visual and textual inputs, often necessitating multi-hop reasoning and external knowledge beyond image content. Despite recent advances, current VQA models struggle with complex questions that require reasoning over both structured knowledge and unstructured visual information. To address these limitations, we propose the neurosymbolic graph routing network (NeSyGRN), which integrates a graph routing network for stepwise reasoning with an enrichment mechanism leveraging the common sense knowledge graph (CSKG). This approach enriches scene graphs and question representations with common sense knowledge, enhancing the reasoning capabilities of VQA through structured data augmentation. NeSyGRN is evaluated on benchmark datasets, demonstrating substantial improvements over state-of-the-art baselines. On the KR-VQA dataset, NeSyGRN achieves an overall accuracy of 74.52%, surpassing prior best results by 8%. 
Ablation studies validate the critical role of knowledge enrichment, with scene graph enrichment contributing most significantly to the improvements. On the GQA dataset, NeSyGRN achieves an accuracy of 83.43%, setting new state-of-the-art results in consistency, plausibility, and contextual validity. These promising results highlight the effectiveness of integrating structured common sense knowledge for VQA, reinforcing its potential for applications in data-rich environments such as health care, education, and decision support systems.<\/jats:p>","DOI":"10.1007\/s41060-025-00827-7","type":"journal-article","created":{"date-parts":[[2025,6,14]],"date-time":"2025-06-14T00:25:44Z","timestamp":1749860744000},"page":"6391-6406","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Enhancing visual question answering with common sense knowledge: a data-driven neurosymbolic graph routing approach"],"prefix":"10.1007","volume":"20","author":[{"given":"Muhammad Junaid","family":"Khan","sequence":"first","affiliation":[]},{"given":"Adil Masood","family":"Siddiqui","sequence":"additional","affiliation":[]},{"given":"Hamid Saeed","family":"Khan","sequence":"additional","affiliation":[]},{"given":"Jaleed","family":"Khan","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,6,14]]},"reference":[{"key":"827_CR1","unstructured":"Abacha, A.B., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., M\u00fcller, H.: Vqa-med: Overview of the medical visual question answering task at imageclef 2019. CLEF (working notes) 2(6), (2019)"},{"key":"827_CR2","unstructured":"Amizadeh, S., Palangi, H., Polozov, A., Huang, Y., Koishida, K.: Neuro-symbolic visual reasoning: Disentangling. In: International Conference on Machine Learning, pp. 279\u2013290. 
PMLR (2020)"},{"key":"827_CR3","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077\u20136086 (2018)","DOI":"10.1109\/CVPR.2018.00636"},{"key":"827_CR4","doi-asserted-by":"crossref","unstructured":"Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A nucleus for a web of open data. In: Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC\u201907\/ASWC\u201907), pp. 722\u2013735. Springer-Verlag, Busan, Korea (2007)","DOI":"10.1007\/978-3-540-76298-0_52"},{"issue":"7","key":"827_CR5","doi-asserted-by":"publisher","first-page":"2758","DOI":"10.1109\/TNNLS.2020.3045034","volume":"33","author":"Q Cao","year":"2021","unstructured":"Cao, Q., Li, B., Liang, X., Wang, K., Lin, L.: Knowledge-routed visual question reasoning: challenges for deep representation embedding. IEEE Trans. Neural Netw. Learn. Syst. 33(7), 2758\u20132767 (2021)","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"issue":"7","key":"827_CR6","doi-asserted-by":"publisher","first-page":"2758","DOI":"10.1109\/TNNLS.2020.3045034","volume":"33","author":"Q Cao","year":"2022","unstructured":"Cao, Q., Li, B., Liang, X., Wang, K., Lin, L.: Knowledge-routed visual question reasoning: challenges for deep representation embedding. IEEE Trans. Neural Netw. Learn. Syst. 33(7), 2758\u20132767 (2022)","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"827_CR7","doi-asserted-by":"crossref","unstructured":"Chen, T., Yu, W., Chen, R., Lin, L.: Knowledge-embedded routing network for scene graph generation. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 
6163\u20136171 (2019)","DOI":"10.1109\/CVPR.2019.00632"},{"key":"827_CR8","doi-asserted-by":"crossref","unstructured":"Chen, W., Gan, Z., Li, L., Cheng, Y., Wang, W., Liu, J.: Meta module network for compositional visual reasoning. In: Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, pp. 655\u2013664 (2021)","DOI":"10.1109\/WACV48630.2021.00070"},{"key":"827_CR9","doi-asserted-by":"crossref","unstructured":"Chen, Y., Su, L., Chen, L., Lin, Z.: Lcv2: An efficient pretraining-free framework for grounded visual question answering. arXiv preprint arXiv:2401.15842 (2024)","DOI":"10.3390\/electronics13112061"},{"key":"827_CR10","doi-asserted-by":"crossref","unstructured":"Dai, B., Zhang, Y., Lin, D.: Detecting visual relationships with deep relational networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3076\u20133086 (2017)","DOI":"10.1109\/CVPR.2017.352"},{"key":"827_CR11","unstructured":"Damodaran, V., Chakravarthy, S., Kumar, A., Umapathy, A., Mitamura, T., Nakashima, Y., Garcia, N., Chu, C.: Understanding the role of scene graphs in visual question answering. arXiv preprint arXiv:2101.05479 (2021)"},{"key":"827_CR12","doi-asserted-by":"crossref","unstructured":"Ding, Y., Yu, J., Liu, B., Hu, Y., Cui, M., Wu, Q.: MuKEA: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 5089\u20135098 (2022)","DOI":"10.1109\/CVPR52688.2022.00503"},{"key":"827_CR13","doi-asserted-by":"crossref","unstructured":"Eiter, T., Geibinger, T., Higuera, N., Oetsch, J.: A logic-based approach to contrastive explainability for neurosymbolic visual question answering. In: IJCAI, pp. 3668\u20133676 (2023)","DOI":"10.24963\/ijcai.2023\/408"},{"key":"827_CR14","doi-asserted-by":"crossref","unstructured":"Garcez, A.d., Lamb, L.C.: Neurosymbolic ai: The 3rd wave. 
Artificial Intelligence Review pp. 1\u201320 (2023)","DOI":"10.1007\/s10462-023-10448-w"},{"key":"827_CR15","doi-asserted-by":"crossref","unstructured":"Gard\u00e8res, F., Ziaeefard, M., Abeloos, B., Lecue, F.: Conceptbert: Concept-aware representation for visual question answering. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 489\u2013498 (2020)","DOI":"10.18653\/v1\/2020.findings-emnlp.44"},{"key":"827_CR16","doi-asserted-by":"crossref","unstructured":"Gu, J., Zhao, H., Lin, Z., Li, S., Cai, J., Ling, M.: Scene graph generation with external knowledge and image reconstruction. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 1969\u20131978 (2019)","DOI":"10.1109\/CVPR.2019.00207"},{"key":"827_CR17","doi-asserted-by":"crossref","unstructured":"Guo, Y., Song, J., Gao, L., Shen, H.T.: One-shot scene graph generation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 3090\u20133098 (2020)","DOI":"10.1145\/3394171.3414025"},{"key":"827_CR18","doi-asserted-by":"crossref","unstructured":"Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: Vizwiz grand challenge: Answering visual questions from blind people. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3608\u20133617 (2018)","DOI":"10.1109\/CVPR.2018.00380"},{"key":"827_CR19","doi-asserted-by":"crossref","unstructured":"He, B., Xia, M., Yu, X., Jian, P., Meng, H., Chen, Z.: An educational robot system of visual question answering for preschoolers. In: 2017 2nd International Conference on Robotics and Automation Engineering (ICRAE), pp. 441\u2013445. IEEE (2017)","DOI":"10.1109\/ICRAE.2017.8291426"},{"issue":"1","key":"827_CR20","first-page":"3","volume":"11","author":"P Hitzler","year":"2020","unstructured":"Hitzler, P., Bianchi, F., Ebrahimi, M., Sarker, M.K.: Neural-symbolic integration and the semantic web. Semant. 
Web 11(1), 3\u201311 (2020)","journal-title":"Semant. Web"},{"issue":"8","key":"827_CR21","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735\u20131780 (1997). https:\/\/doi.org\/10.1162\/neco.1997.9.8.1735","journal-title":"Neural Comput."},{"key":"827_CR22","doi-asserted-by":"publisher","unstructured":"Hong, W., Ji, K., Liu, J., Wang, J., Chen, J., Chu, W.: GilBERT: Generative vision-language pre-training for image-text retrieval. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR \u201921), pp. 1379\u20131388. Association for Computing Machinery, New York, NY, USA (2021). https:\/\/doi.org\/10.1145\/3404835.3462838","DOI":"10.1145\/3404835.3462838"},{"key":"827_CR23","doi-asserted-by":"crossref","unstructured":"Hu, R., Rohrbach, A., Darrell, T., Saenko, K.: Language-conditioned graph networks for relational reasoning. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 10294\u201310303 (2019)","DOI":"10.1109\/ICCV.2019.01039"},{"key":"827_CR24","unstructured":"Hudson, D., Manning, C.D.: Learning by abstraction: The neural state machine. Adv. Neural Inf. Process. Syst. 32, (2019)"},{"key":"827_CR25","doi-asserted-by":"crossref","unstructured":"Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 6700\u20136709 (2019)","DOI":"10.1109\/CVPR.2019.00686"},{"key":"827_CR26","doi-asserted-by":"crossref","unstructured":"Ilievski, F., Szekely, P., Zhang, B.: Cskg: The commonsense knowledge graph. In: European Semantic Web Conference, pp. 680\u2013696. 
Springer (2021)","DOI":"10.1007\/978-3-030-77385-4_41"},{"key":"827_CR27","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1016\/j.cviu.2017.06.005","volume":"163","author":"K Kafle","year":"2017","unstructured":"Kafle, K., Kanan, C.: Visual question answering: datasets, algorithms, and future challenges. Comput. Vis. Image Underst. 163, 3\u201320 (2017)","journal-title":"Comput. Vis. Image Underst."},{"key":"827_CR28","doi-asserted-by":"crossref","unstructured":"Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: Mdetr-modulated detection for end-to-end multi-modal understanding. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 1780\u20131790 (2021)","DOI":"10.1109\/ICCV48922.2021.00180"},{"key":"827_CR29","doi-asserted-by":"crossref","unstructured":"Khan, M.J., Breslin, J.G., Curry, E.: Expressive scene graph generation using commonsense knowledge infusion for visual understanding and reasoning. In: European Semantic Web Conference, pp. 93\u2013112. Springer (2022)","DOI":"10.1007\/978-3-031-06981-9_6"},{"key":"827_CR30","unstructured":"Khan, M.J., G\u00a0Breslin, J., Curry, E.: Neusyre: Neuro-symbolic visual understanding and reasoning framework based on scene graph enrichment. Semantic Web (Preprint), 1\u201325 (2023)"},{"key":"827_CR31","unstructured":"Kim, W., Son, B., Kim, I.: Vilt: Vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning, pp. 5583\u20135594 (2021)"},{"key":"827_CR32","doi-asserted-by":"crossref","unstructured":"Koner, R., Li, H., Hildebrandt, M., Das, D., Tresp, V., G\u00fcnnemann, S.: Graphhopper: Multi-hop scene graph reasoning for visual question answering. In: International Semantic Web Conference, pp. 111\u2013127. 
Springer (2021)","DOI":"10.1007\/978-3-030-88361-4_7"},{"issue":"1","key":"827_CR33","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","volume":"123","author":"R Krishna","year":"2017","unstructured":"Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123(1), 32\u201373 (2017)","journal-title":"Int. J. Comput. Vision"},{"issue":"12","key":"827_CR34","doi-asserted-by":"publisher","first-page":"2743","DOI":"10.1109\/TPDS.2019.2921956","volume":"30","author":"D Li","year":"2019","unstructured":"Li, D., Zhang, Z., Yu, K., Huang, K., Tan, T.: Isee: an intelligent scene exploration and evaluation platform for large-scale visual surveillance. IEEE Trans. Parallel Distrib. Syst. 30(12), 2743\u20132758 (2019)","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"827_CR35","doi-asserted-by":"crossref","unstructured":"Li, H., Li, X., Karimi, B., Chen, J., Sun, M.: Joint learning of object graph and relation graph for visual question answering. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 01\u201306. IEEE (2022)","DOI":"10.1109\/ICME52920.2022.9859766"},{"key":"827_CR36","doi-asserted-by":"publisher","unstructured":"Li, M., Moens, M.F.: Dynamic key-value memory enhanced multi-step graph reasoning for knowledge-based visual question answering. arXiv e-prints (2022). https:\/\/doi.org\/10.48550\/arXiv.2203.02985","DOI":"10.48550\/arXiv.2203.02985"},{"key":"827_CR37","doi-asserted-by":"crossref","unstructured":"Liang, W., Jiang, Y., Liu, Z.: Graghvqa: Language-guided graph neural networks for graph-based visual question answering. In: Proceedings of the Third Workshop on Multimodal Artificial Intelligence, pp. 
79\u201386 (2021)","DOI":"10.18653\/v1\/2021.maiworkshop-1.12"},{"key":"827_CR38","doi-asserted-by":"crossref","unstructured":"Liang, X., Lee, L., Xing, E.P.: Deep variation-structured reinforcement learning for visual relationship and attribute detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 848\u2013857 (2017)","DOI":"10.1109\/CVPR.2017.469"},{"key":"827_CR39","doi-asserted-by":"crossref","unstructured":"Liu, L., Wang, M., He, X., Qing, L., Chen, H.: Fact-based visual question answering via dual-process system. Knowledge-Based Systems p. 107650 (2021)","DOI":"10.1016\/j.knosys.2021.107650"},{"key":"827_CR40","doi-asserted-by":"crossref","unstructured":"Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: European Conference on Computer Vision, pp. 852\u2013869. Springer (2016)","DOI":"10.1007\/978-3-319-46448-0_51"},{"key":"827_CR41","doi-asserted-by":"publisher","first-page":"5705","DOI":"10.1007\/s10462-020-09832-7","volume":"53","author":"S Manmadhan","year":"2020","unstructured":"Manmadhan, S., Kovoor, B.C.: Visual question answering: a state-of-the-art review. Artif. Intell. Rev. 53, 5705\u20135745 (2020)","journal-title":"Artif. Intell. Rev."},{"key":"827_CR42","doi-asserted-by":"crossref","unstructured":"Marino, K., Chen, X., Parikh, D., Gupta, A., Rohrbach, M.: Krisp: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 14111\u201314121 (2021)","DOI":"10.1109\/CVPR46437.2021.01389"},{"key":"827_CR43","doi-asserted-by":"crossref","unstructured":"Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: Ok-VQA: A visual question answering benchmark requiring external knowledge. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 
3195\u20133204 (2019)","DOI":"10.1109\/CVPR.2019.00331"},{"key":"827_CR44","unstructured":"Narasimhan, M., Lazebnik, S., Schwing, A.G.: Out of the box: Reasoning with graph convolution nets for factual visual question answering. In: Advances in Neural Information Processing Systems, pp. 2654\u20132665 (2018)"},{"key":"827_CR45","doi-asserted-by":"crossref","unstructured":"Narasimhan, M., Schwing, A.G.: Straight to the facts: Learning knowledge base retrieval for factual visual question answering. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 451\u2013468 (2018)","DOI":"10.1007\/978-3-030-01237-3_28"},{"key":"827_CR46","doi-asserted-by":"crossref","unstructured":"Oliveira\u00a0Souza, B.C., Aasan, M., Pedrini, H., Rivera, A.R.: Selfgraphvqa: a self-supervised graph neural network for scene-based question answering. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp. 4640\u20134645 (2023)","DOI":"10.1109\/ICCVW60793.2023.00499"},{"key":"827_CR47","unstructured":"Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P.F., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds). Advances in Neural Information Processing Systems, vol.\u00a035, pp. 27730\u201327744 (2022)"},{"key":"827_CR48","unstructured":"Penzkofer, A., Shi, L., Bulling, A.: Vsa4vqa: Scaling a vector symbolic architecture to visual question answering on natural images. arXiv preprint arXiv:2405.03852 (2024)"},{"key":"827_CR49","doi-asserted-by":"crossref","unstructured":"Perez, E., Strub, F., Vries, H.D., Dumoulin, V., Courville, A.: FiLM: Visual reasoning with a general conditioning layer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.\u00a032, pp. 
3942\u20133951 (2018)","DOI":"10.1609\/aaai.v32i1.11671"},{"key":"827_CR50","doi-asserted-by":"crossref","unstructured":"Qian, T., Chen, J., Chen, S., Wu, B., Jiang, Y.G.: Scene graph refinement network for visual question answering. IEEE Trans. Multimedia (2022)","DOI":"10.1109\/TMM.2022.3169065"},{"key":"827_CR51","doi-asserted-by":"crossref","unstructured":"Senior, H., Slabaugh, G., Yuan, S., Rossi, L.: Graph neural networks in vision-language image understanding: A survey. The Visual Computer pp. 1\u201326 (2024)","DOI":"10.1007\/s00371-024-03343-0"},{"key":"827_CR52","doi-asserted-by":"crossref","unstructured":"Shah, S., Mishra, A., Yadati, N., Talukdar, P.P.: Kvqa: Knowledge-aware visual question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.\u00a033, pp. 8876\u20138884 (2019)","DOI":"10.1609\/aaai.v33i01.33018876"},{"key":"827_CR53","doi-asserted-by":"publisher","unstructured":"Shao, Z., Yu, Z., Wang, M., Yu, J.: Prompting large language models with answer heuristics for knowledge-based visual question answering. In: 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14974\u201314983 (2023). https:\/\/doi.org\/10.1109\/CVPR52729.2023.01438","DOI":"10.1109\/CVPR52729.2023.01438"},{"key":"827_CR54","doi-asserted-by":"crossref","unstructured":"Shi, J., Han, D., Chen, C., Shen, X.: Saffnet: self-attention based on fourier frequency domain filter network for visual question answering. The Visual Computer pp. 1\u201319 (2025)","DOI":"10.1007\/s00371-024-03777-6"},{"issue":"5","key":"827_CR55","doi-asserted-by":"publisher","first-page":"3017","DOI":"10.1007\/s00530-021-00867-6","volume":"29","author":"Z Song","year":"2023","unstructured":"Song, Z., Hu, Z., Hong, R.: Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning. Multimedia Syst. 
29(5), 3017\u20133026 (2023)","journal-title":"Multimedia Syst."},{"key":"827_CR56","doi-asserted-by":"crossref","unstructured":"Speer, R., Chin, J., Havasi, C.: ConceptNet 5.5: An open multilingual graph of general knowledge. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4444\u20134451 (2017)","DOI":"10.1609\/aaai.v31i1.11164"},{"key":"827_CR57","doi-asserted-by":"crossref","unstructured":"Su, Z., Zhu, C., Dong, Y., Cai, D., Chen, Y., Li, J.: Learning visual knowledge memory networks for visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7736\u20137745 (2018)","DOI":"10.1109\/CVPR.2018.00807"},{"key":"827_CR58","doi-asserted-by":"crossref","unstructured":"Subramanian, S., Narasimhan, M., Khangaonkar, K., Yang, K., Nagrani, A., Schmid, C., Zeng, A., Darrell, T., Klein, D.: Modular visual question answering via code generation. arXiv preprint arXiv:2306.05392 (2023)","DOI":"10.18653\/v1\/2023.acl-short.65"},{"issue":"4","key":"827_CR59","doi-asserted-by":"publisher","first-page":"2381","DOI":"10.1007\/s00371-023-02924-9","volume":"40","author":"B Sun","year":"2024","unstructured":"Sun, B., Hao, Z., Yu, L., He, J.: Unbiased scene graph generation using the self-distillation method. Vis. Comput. 40(4), 2381\u20132390 (2024)","journal-title":"Vis. Comput."},{"key":"827_CR60","doi-asserted-by":"crossref","unstructured":"Tan, H., Bansal, M.: Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019)","DOI":"10.18653\/v1\/D19-1514"},{"key":"827_CR61","doi-asserted-by":"crossref","unstructured":"Tandon, N., de\u00a0Melo, G., Suchanek, F., Weikum, G.: WebChild: Harvesting and organizing commonsense knowledge from the web. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM \u201914), pp. 
523\u2013532 (2014)","DOI":"10.1145\/2556195.2556245"},{"key":"827_CR62","doi-asserted-by":"crossref","unstructured":"Teney, D., Hengel, A.v.d.: Actively seeking and learning from live data. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 1940\u20131949 (2019)","DOI":"10.1109\/CVPR.2019.00204"},{"key":"827_CR63","unstructured":"Trouillon, T., Welbl, J., Riedel, S., Gaussier, \u00c9., Bouchard, G.: Complex embeddings for simple link prediction. In: International Conference on Machine Learning (ICML), pp. 2071\u20132080. PMLR (2016)"},{"key":"827_CR64","doi-asserted-by":"crossref","unstructured":"Wang, D., Hu, L., Hao, R., Shao, Y., Lv, X., Nie, L., Li, J.: Let me show you step by step: An interpretable graph routing network for knowledge-based visual question answering. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1984\u20131994 (2024)","DOI":"10.1145\/3626772.3657790"},{"key":"827_CR65","doi-asserted-by":"publisher","unstructured":"Wang, D., Hu, L., Hao, R., Shao, Y., Lv, X., Nie, L., Li, J.: Let me show you step by step: An interpretable graph routing network for knowledge-based visual question answering. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR \u201924), pp. 1984\u20131994. Association for Computing Machinery, New York, NY, USA (2024). https:\/\/doi.org\/10.1145\/3626772.3657790","DOI":"10.1145\/3626772.3657790"},{"issue":"10","key":"827_CR66","doi-asserted-by":"publisher","first-page":"2413","DOI":"10.1109\/TPAMI.2017.2754246","volume":"40","author":"P Wang","year":"2017","unstructured":"Wang, P., Wu, Q., Shen, C., Dick, A., Hengel, A.: FVQA: Fact-based visual question answering. IEEE Trans. Pattern Anal. Mach. Intell. 40(10), 2413\u20132427 (2017)","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"827_CR67","unstructured":"Wang, W., Yang, Y.: Towards data-and knowledge-driven artificial intelligence: A survey on neuro-symbolic computing. arXiv preprint arXiv:2210.15889 (2022)"},{"key":"827_CR68","doi-asserted-by":"crossref","unstructured":"Wang, X., Ye, Y., Gupta, A.: Zero-shot recognition via semantic embeddings and knowledge graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6857\u20136866 (2018)","DOI":"10.1109\/CVPR.2018.00717"},{"key":"827_CR69","doi-asserted-by":"crossref","unstructured":"Xu, N., Gao, Y., Liu, A.A., Tian, H., Zhang, Y.: Multi-modal validation and domain interaction learning for knowledge-based visual question answering. IEEE Trans. Knowl. Data Eng. (2024)","DOI":"10.1109\/TKDE.2024.3384270"},{"issue":"6","key":"827_CR70","doi-asserted-by":"publisher","first-page":"e0287557","DOI":"10.1371\/journal.pone.0287557","volume":"18","author":"Y Xu","year":"2023","unstructured":"Xu, Y., Zhang, L., Shen, X.: Multi-modal adaptive gated mechanism for visual question answering. PLoS ONE 18(6), e0287557 (2023)","journal-title":"PLoS ONE"},{"key":"827_CR71","doi-asserted-by":"crossref","unstructured":"Yang, Z., Gan, Z., Wang, J., Hu, X., Lu, Y., Liu, Z., Wang, L.: An empirical study of GPT-3 for few-shot knowledge-based VQA. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol.\u00a036, pp. 3081\u20133089 (2022)","DOI":"10.1609\/aaai.v36i3.20215"},{"key":"827_CR72","doi-asserted-by":"crossref","unstructured":"Yang, Z., Qin, Z., Yu, J., Wan, T.: Prior visual relationship reasoning for visual question answering. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 1411\u20131415. IEEE (2020)","DOI":"10.1109\/ICIP40778.2020.9190771"},{"key":"827_CR73","unstructured":"Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., Tenenbaum, J.: Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. Adv. Neural Inf. Process. Syst. 
31, (2018)"},{"key":"827_CR74","doi-asserted-by":"crossref","unstructured":"Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 6281\u20136290 (2019)","DOI":"10.1109\/CVPR.2019.00644"},{"issue":"12","key":"827_CR75","doi-asserted-by":"publisher","first-page":"5947","DOI":"10.1109\/TNNLS.2018.2817340","volume":"29","author":"Z Yu","year":"2018","unstructured":"Yu, Z., Yu, J., Xiang, C., Fan, J., Tao, D.: Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans. Neural Netw. Learn. Syst. 29(12), 5947\u20135959 (2018)","journal-title":"IEEE Trans. Neural Netw. Learn. Syst."},{"key":"827_CR76","doi-asserted-by":"crossref","unstructured":"Zareian, A., Karaman, S., Chang, S.F.: Bridging knowledge graphs to generate scene graphs. In: European Conference on Computer Vision, pp. 606\u2013623. Springer (2020)","DOI":"10.1007\/978-3-030-58592-1_36"},{"key":"827_CR77","doi-asserted-by":"crossref","unstructured":"Zareian, A., Karaman, S., Chang, S.F.: Weakly supervised visual semantic parsing. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 3736\u20133745 (2020)","DOI":"10.1109\/CVPR42600.2020.00379"},{"key":"827_CR78","doi-asserted-by":"crossref","unstructured":"Zellers, R., Yatskar, M., Thomson, S., Choi, Y.: Neural motifs: Scene graph parsing with global context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5831\u20135840 (2018)","DOI":"10.1109\/CVPR.2018.00611"},{"key":"827_CR79","unstructured":"Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et\u00a0al.: GLM-130b: An open bilingual pre-trained model. 
arXiv preprint arXiv:2210.02414 (2022)"},{"issue":"4","key":"827_CR80","doi-asserted-by":"publisher","first-page":"280","DOI":"10.1016\/j.vrih.2023.06.002","volume":"6","author":"H Zhang","year":"2024","unstructured":"Zhang, H., Wei, Z., Liu, G., Wang, R., Mu, R., Liu, C., Yuan, A., Cao, G., Hu, N.: Mkeah: multimodal knowledge extraction and accumulation based on hyperplane embedding for knowledge-based visual question answering. Virtual Real. Intell. Hardw. 6(4), 280\u2013291 (2024)","journal-title":"Virtual Real. Intell. Hardw."},{"key":"827_CR81","doi-asserted-by":"crossref","unstructured":"Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., Gao, J.: Vinvl: Revisiting visual representations in vision-language models. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, pp. 5579\u20135588 (2021)","DOI":"10.1109\/CVPR46437.2021.00553"},{"key":"827_CR82","doi-asserted-by":"publisher","first-page":"2986","DOI":"10.1109\/TMM.2021.3091882","volume":"24","author":"X Zhang","year":"2021","unstructured":"Zhang, X., Zhang, F., Xu, C.: Explicit cross-modal representation learning for visual commonsense reasoning. IEEE Trans. Multimedia 24, 2986\u20132997 (2021)","journal-title":"IEEE Trans. Multimedia"},{"key":"827_CR83","doi-asserted-by":"publisher","first-page":"108367","DOI":"10.1016\/j.patcog.2021.108367","volume":"123","author":"H Zhou","year":"2022","unstructured":"Zhou, H., Yang, Y., Luo, T., Zhang, J., Li, S.: A unified deep sparse graph attention network for scene graph generation. Pattern Recogn. 123, 108367 (2022)","journal-title":"Pattern Recogn."},{"key":"827_CR84","doi-asserted-by":"crossref","unstructured":"Zhou, Y., Mishra, S., Verma, M., Bhamidipati, N., Wang, W.: Recommending themes for ad creative design via visual-linguistic representations. In: Proceedings of The Web Conference 2020, pp. 
2521\u20132527 (2020)","DOI":"10.1145\/3366423.3380001"},{"key":"827_CR85","doi-asserted-by":"crossref","unstructured":"Zhu, Z., Yu, J., Wang, Y., Sun, Y., Hu, Y., Wu, Q.: Mucko: Multi-layer cross-modal knowledge reasoning for fact-based visual question answering. In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 1097\u20131103 (2021)","DOI":"10.24963\/ijcai.2020\/153"}],"container-title":["International Journal of Data Science and Analytics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41060-025-00827-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s41060-025-00827-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41060-025-00827-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,27]],"date-time":"2025-09-27T11:56:49Z","timestamp":1758974209000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s41060-025-00827-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,14]]},"references-count":85,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2025,11]]}},"alternative-id":["827"],"URL":"https:\/\/doi.org\/10.1007\/s41060-025-00827-7","relation":{},"ISSN":["2364-415X","2364-4168"],"issn-type":[{"value":"2364-415X","type":"print"},{"value":"2364-4168","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,14]]},"assertion":[{"value":"23 February 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 May 
2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 June 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}