{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,3]],"date-time":"2025-11-03T13:26:49Z","timestamp":1762176409737,"version":"build-2065373602"},"reference-count":96,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,11,3]],"date-time":"2025-11-03T00:00:00Z","timestamp":1762128000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Artif. Intell."],"abstract":"<jats:p>Individuals with visual disabilities have impairments that affect their ability to perceive visual information, ranging from partial to complete vision loss. Visual disabilities affect about 2.2 billion people globally. In this paper, we introduce a new multi-level Visual Question Answering (VQA) framework for visually disabled people that combines the strengths of several component VQA models to enhance system performance. The model relies on a bi-level architecture that employs two distinct layers. In the first level, the model classifies the question type. This classification routes the visual question to the appropriate component model in the second level. The bi-level architecture incorporates a switch function that enables the system to select the optimal VQA model for each specific question, thereby enhancing overall accuracy. The experimental findings indicate that the multi-level VQA technique is effective. The bi-level VQA model improves overall accuracy over the state-of-the-art from 87.41% to 88.41%. This finding suggests that using multiple levels with different models can boost the performance of VQA systems. This research presents a promising direction for developing advanced, multi-level VQA systems. 
Future work may explore optimizing and experimenting with various model levels to enhance performance further.<\/jats:p>","DOI":"10.3389\/frai.2025.1646176","type":"journal-article","created":{"date-parts":[[2025,11,3]],"date-time":"2025-11-03T12:27:37Z","timestamp":1762172857000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Enhancing accessibility: a multi-level platform for visual question answering in diabetic retinopathy for individuals with disabilities"],"prefix":"10.3389","volume":"8","author":[{"given":"Sarah","family":"Alotaibi","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Suheer","family":"Al-Hadhrami","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Saad","family":"Al-Ahmadi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1965","published-online":{"date-parts":[[2025,11,3]]},"reference":[{"key":"B1","article-title":"\u201cOverview of the vqa-med task at imageclef 2020: Visual question answering and generation in the medical domain,\u201d","author":"Abacha","year":"2020","journal-title":"Proceedings of the CLEF 2020\u2013Conference and Labs of the Evaluation Forum"},{"key":"B2","article-title":"\u201cVQA-Med: Overview of the medical visual question answering task at imageclef 2019,\u201d","volume-title":"CEUR Workshop Proceedings","author":"Abacha","year":"2019"},{"key":"B3","article-title":"\u201cNLM at imageclef 2018 visual question answering in the medical domain,\u201d","volume-title":"Technical Report","author":"Abacha","year":"2018"},{"key":"B4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41746-018-0040-6","article-title":"Pivotal trial of an autonomous ai-based diagnostic system for detection of diabetic retinopathy in primary care 
offices","volume":"1","author":"Abr\u00e0moff","year":"2018","journal-title":"Nat. Digit. Med"},{"key":"B5","doi-asserted-by":"publisher","first-page":"1660","DOI":"10.30574\/wjarr.2024.21.2.0593","article-title":"Ethical considerations in healthcare it: a review of data privacy and patient consent issues","volume":"21","author":"Adeniyi","year":"2024","journal-title":"World J. Adv. Res. Rev"},{"key":"B6","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2505.00153","article-title":"Audo-sight: enabling ambient interaction for blind and visually impaired individuals","author":"Ainary","year":"2025","journal-title":"arXiv"},{"key":"B7","first-page":"54","article-title":"\u201cUWB indoor tracking system for visually impaired people,\u201d","volume-title":"Proceedings of the 13th International Conference on Advances in Mobile Computing and Multimedia","author":"Alhadhrami","year":"2015"},{"key":"B8","doi-asserted-by":"publisher","first-page":"9735","DOI":"10.3390\/app13179735","article-title":"An effective med-vqa method using a transformer with weights fusion of multiple fine-tuned models","volume":"13","author":"Al-Hadhrami","year":"2023","journal-title":"Appl. 
Sci"},{"key":"B9","article-title":"\u201cDeep neural networks and decision tree classifier for visual question answering in the medical domain,\u201d","volume-title":"Technical Report","author":"Allaouzi","year":"2018"},{"key":"B10","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-1181","article-title":"Learning to compose neural networks for question answering","author":"Andreas","year":"","journal-title":"arXiv"},{"key":"B11","first-page":"39","article-title":"\u201cNeural module networks,\u201d","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Andreas","year":""},{"key":"B12","first-page":"2425","article-title":"\u201cVQA: Visual question answering,\u201d","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Antol","year":"2015"},{"key":"B13","first-page":"20","article-title":"\u201cDeep attention neural tensor network for visual question answering,\u201d","author":"Bai","year":"2018","journal-title":"Proceedings of the European Conference on Computer Vision (ECCV)"},{"key":"B14","first-page":"2612","article-title":"\u201cMutan: Multimodal tucker fusion for visual question answering,\u201d","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Ben-Younes","year":"2017"},{"key":"B15","article-title":"\u201cTlemcen university at imageclef 2019 visual question answering task,\u201d","volume-title":"Proceedings of the CLEF (Working Notes)","author":"Bounaama","year":"2019"},{"key":"B16","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1511.05960","article-title":"Abc-cnn: An attention based convolutional neural network for visual question answering","author":"Chen","year":"2015","journal-title":"arXiv"},{"key":"B17","first-page":"10800","article-title":"\u201cCounterfactual samples synthesizing for robust visual question answering,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and 
Pattern Recognition (CVPR)","author":"Chen","year":""},{"key":"B18","first-page":"23","article-title":"\u201cUniter: Universal image-text representation learning,\u201d","volume-title":"Computer Vision\u2013ECCV 2020, 16th European Conference","author":"Chen","year":""},{"key":"B19","doi-asserted-by":"crossref","first-page":"3569","DOI":"10.1145\/3503161.3548122","article-title":"\u201cCaption-aware medical vqa via semantic focusing and progressive cross-modality comprehension,\u201d","volume-title":"Proceedings of the 30th ACM International Conference on Multimedia","author":"Cong","year":"2022"},{"key":"B20","doi-asserted-by":"publisher","first-page":"e0000651","DOI":"10.1371\/journal.pdig.0000651","article-title":"Bias in medical AI: implications for clinical decision-making","volume":"3","author":"Cross","year":"2024","journal-title":"PLOS Digital Health"},{"key":"B21","doi-asserted-by":"crossref","first-page":"886","DOI":"10.1109\/CVPR.2005.177","article-title":"\u201cHistograms of oriented gradients for human detection,\u201d","volume-title":"2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)","author":"Dalal","year":"2005"},{"key":"B22","doi-asserted-by":"publisher","first-page":"8531","DOI":"10.3390\/s22218531","article-title":"Artificial intelligence of things applied to assistive technology: a systematic literature review","volume":"22","author":"de Freitas","year":"2022","journal-title":"Sensors"},{"key":"B23","doi-asserted-by":"publisher","first-page":"845","DOI":"10.14236\/jhi.v22i4.845","article-title":"Using routinely collected health data for surveillance, quality improvement and research: Framework and key questions to assess ethics and privacy and enable data access","volume":"22","author":"De Lusignan","year":"2015","journal-title":"BMJ Health Care Inform"},{"key":"B24","doi-asserted-by":"publisher","first-page":"196","DOI":"10.1016\/j.irbm.2013.01.010","article-title":"Teleophta: Machine learning and image 
processing methods for teleophthalmology","volume":"34","author":"Decenciere","year":"2013","journal-title":"IRBM"},{"key":"B25","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1810.04805","article-title":"Bert: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin","year":"2019","journal-title":"arXiv"},{"key":"B26","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-87240-3_7","article-title":"Multiple meta-model quantifying for medical visual question answering","author":"Do","year":"2021","journal-title":"arXiv"},{"key":"B27","article-title":"\u201cTeams at vqa-med 2021: BBN-orchestra for long-tailed medical visual question answering,\u201d","volume-title":"Working Notes of CLEF, volume 201","author":"Eslami","year":"2021"},{"key":"B28","doi-asserted-by":"crossref","first-page":"457","DOI":"10.18653\/v1\/D16-1044","article-title":"\u201cMultimodal compact bilinear pooling for visual question answering and visual grounding,\u201d","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing","author":"Fukui","year":"2016"},{"key":"B29","doi-asserted-by":"publisher","first-page":"6391","DOI":"10.1609\/aaai.v33i01.33016391","article-title":"Structured two-stream attention network for video question answering","volume":"33","author":"Gao","year":"2019","journal-title":"Proc. AAAI Conf. Artif. 
Intellig"},{"key":"B30","article-title":"\u201cSYSU-HCP at VQA-Med 2021: a data-centric model with efficient training methodology for medical visual question answering,\u201d","volume-title":"Working Notes of CLEF, volume 201","author":"Gong","year":"2021"},{"key":"B31","first-page":"4971","article-title":"\u201cLaPA: Latent prompt assist model for medical visual question answering,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops","author":"Gu","year":"2024"},{"key":"B32","first-page":"3608","article-title":"\u201cVizwiz grand challenge: Answering visual questions from blind people,\u201d","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Gurari","year":"2018"},{"key":"B33","first-page":"3838","article-title":"\u201cMED-GPVS: A deep learning-based joint biomedical image classification and visual question answering system for precision e-health,\u201d","volume-title":"Proceedings of the ICC 2022\u2013IEEE International Conference on Communications","author":"Haridas","year":"2022"},{"key":"B34","first-page":"770","article-title":"\u201cDeep residual learning for image recognition,\u201d","author":"He","year":"2016","journal-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition"},{"key":"B35","doi-asserted-by":"publisher","DOI":"10.36227\/techrxiv.13127537","article-title":"Challenge-pathology visual question answering grand challenge","author":"He","year":"","journal-title":"Grand Challenge"},{"key":"B36","doi-asserted-by":"publisher","DOI":"10.36227\/techrxiv.13127537.v1","article-title":"Pathvqa: 30.000+ questions for medical visual question answering","author":"He","year":"","journal-title":"arXiv"},{"key":"B37","doi-asserted-by":"publisher","first-page":"103241","DOI":"10.1016\/j.ipm.2022.103241","article-title":"Medical knowledge-based network for patient-oriented visual question 
answering","volume":"60","author":"Huang","year":"2023","journal-title":"Inform. Proc. Managem"},{"key":"B38","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1604.01485","article-title":"A focused dynamic attention model for visual question answering","author":"Ilievski","year":"2016","journal-title":"arXiv"},{"key":"B39","article-title":"\u201cISO 9241-171: Ergonomics of human-system interaction-guidance on software accessibility,\u201d","volume-title":"Technical Report","year":"2008"},{"key":"B40","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1007\/978-981-19-2416-3_12","article-title":"\u201cDeep learning for diabetic retinopathy detection: Challenges and opportunities,\u201d","volume-title":"Next Generation Healthcare Informatics","author":"Jagan Mohan","year":"2022"},{"key":"B41","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1511.05676","article-title":"Compositional memory for visual question answering","author":"Jiang","year":"2015","journal-title":"arXiv"},{"key":"B42","first-page":"361","article-title":"\u201cMultimodal residual learning for visual QA,\u201d","volume-title":"Advances in Neural Information Processing Systems","author":"Kim","year":"2016"},{"key":"B43","article-title":"\u201cHadamard product for low-rank bilinear pooling,\u201d","author":"Kim","year":"2017","journal-title":"Proceedings of the 5th International Conference on Learning Representations (ICLR"},{"key":"B44","first-page":"3294","article-title":"\u201cSkip-thought vectors,\u201d","volume-title":"Advances in Neural Information Processing Systems, vol. 
28","author":"Kiros","year":"2015"},{"key":"B45","doi-asserted-by":"crossref","first-page":"60","DOI":"10.18653\/v1\/2020.bionlp-1.6","article-title":"\u201cTowards visual dialog for radiology,\u201d","volume-title":"Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing","author":"Kovaleva","year":"2020"},{"key":"B46","doi-asserted-by":"publisher","first-page":"84","DOI":"10.1145\/3065386","article-title":"ImageNet classification with deep convolutional neural networks","volume":"60","author":"Krizhevsky","year":"2017","journal-title":"Commun. ACM"},{"key":"B47","first-page":"1378","article-title":"\u201cAsk me anything: Dynamic memory networks for natural language processing,\u201d","author":"Kumar","year":"2016","journal-title":"Proceedings of the International Conference on Machine Learning"},{"key":"B48","unstructured":"2024"},{"key":"B49","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1908.03557","article-title":"Visualbert: A simple and performant baseline for vision and language","author":"Li","year":"","journal-title":"arXiv"},{"key":"B50","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1908.03557","article-title":"VisualBERT: a simple and performant baseline for vision and language","author":"Li","year":"","journal-title":"arXiv"},{"key":"B51","article-title":"\u201cAIML at VQA-Med 2020: Knowledge inference via a skeleton-based sentence mapping approach for medical domain visual question answering,\u201d","volume-title":"Proceedings of the CLEF (Working Notes)","author":"Liao","year":"2020"},{"key":"B52","first-page":"900","article-title":"\u201cAn extended set of haar-like features for rapid object detection,\u201d","volume-title":"Proceedings of the IEEE International Conference on Image Processing","author":"Lienhart","year":"2002"},{"key":"B53","doi-asserted-by":"publisher","first-page":"102611","DOI":"10.1016\/j.artmed.2023.102611","article-title":"Medical visual question answering: a 
survey","volume":"143","author":"Lin","year":"2023","journal-title":"Artif. Intellig. Med"},{"key":"B54","first-page":"1650","article-title":"\u201cSlake: a semantically-labeled knowledge-enhanced dataset for medical visual question answering,\u201d","volume-title":"Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI)","author":"Liu","year":"2021"},{"key":"B55","doi-asserted-by":"crossref","first-page":"1150","DOI":"10.1109\/ICCV.1999.790410","article-title":"\u201cObject recognition from local scale-invariant features,\u201d","volume-title":"Proceedings of the Seventh IEEE International Conference on Computer Vision","author":"Lowe","year":"1999"},{"key":"B56","first-page":"13","article-title":"\u201cVilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,\u201d","volume-title":"Advances in Neural Information Processing Systems, vol. 32","author":"Lu","year":"2019"},{"key":"B57","first-page":"289","article-title":"\u201cHierarchical question-image co-attention for visual question answering,\u201d","author":"Lu","year":"2016","journal-title":"Advances in Neural Information Processing Systems, 29"},{"key":"B58","doi-asserted-by":"publisher","first-page":"288","DOI":"10.1001\/journalofethics.2016.18.3.pfor5-1603","article-title":"Federal privacy protections: Ethical foundations, sources of confusion in clinical medicine, and controversies in biomedical research","volume":"18","author":"Majumder","year":"2016","journal-title":"AMA J. 
Ethics"},{"key":"B59","first-page":"1","article-title":"\u201cAsk your neurons: a neural-based approach to answering questions about images,\u201d","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Malinowski","year":"2015"},{"key":"B60","doi-asserted-by":"publisher","first-page":"110","DOI":"10.1007\/s11263-017-1038-2","article-title":"Ask your neurons: a deep learning approach to visual question answering","volume":"125","author":"Malinowski","year":"2017","journal-title":"Int. J. Comp. Vision"},{"key":"B61","doi-asserted-by":"publisher","first-page":"5705","DOI":"10.1007\/s10462-020-09832-7","article-title":"Visual question answering: a state-of-the-art review","volume":"53","author":"Manmadhan","year":"2020","journal-title":"Artif. Intellig. Rev"},{"key":"B62","doi-asserted-by":"publisher","first-page":"31516","DOI":"10.1109\/ACCESS.2018.2844789","article-title":"Cross-modal multistep fusion network with co-attention for visual question answering","volume":"6","author":"Mingrui","year":"2018","journal-title":"IEEE Access"},{"key":"B63","first-page":"451","article-title":"\u201cStraight to the facts: Learning knowledge base retrieval for factual visual question answering,\u201d","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV)","author":"Narasimhan","year":"2018"},{"key":"B64","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1606.03647","article-title":"Training recurrent answering units with joint loss minimization for vqa","author":"Noh","year":"2016","journal-title":"arXiv"},{"key":"B65","doi-asserted-by":"crossref","DOI":"10.1109\/CVPR.2016.11","article-title":"\u201cImage question answering using convolutional neural network with dynamic parameter prediction,\u201d","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Noh","year":"2016"},{"key":"B66","article-title":"\u201cUmass at ImageCLEF medical visual question 
answering (med-vqa) 2018 task,\u201d","author":"Peng","year":"2018","journal-title":"Proceedings of the CEUR Workshop"},{"key":"B67","doi-asserted-by":"publisher","first-page":"25","DOI":"10.3390\/data3030025","article-title":"Indian diabetic retinopathy image dataset (IDRID): A database for diabetic retinopathy screening research","volume":"3","author":"Porwal","year":"2018","journal-title":"Data"},{"key":"B68","doi-asserted-by":"publisher","first-page":"37","DOI":"10.48550\/arXiv.2010.16061","article-title":"Evaluation: From precision, recall and f-measure to roc, informedness, markedness and correlation","volume":"2","author":"Powers","year":"2011","journal-title":"J. Mach. Learn. Technol"},{"key":"B69","first-page":"8748","article-title":"\u201cLearning transferable visual models from natural language supervision,\u201d","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford","year":"2021"},{"key":"B70","first-page":"5","article-title":"\u201cImage question answering: a visual semantic embedding model and a new dataset,\u201d","volume-title":"Advances in Neural Information Processing Systems","author":"Ren","year":"2015"},{"key":"B71","article-title":"\u201cPuc chile team at VQA-Med 2021: Approaching VQA as a classification task via fine-tuning a pretrained CNN,\u201d","volume-title":"Working Notes of CLEF","author":"Schilling","year":"2021"},{"key":"B72","first-page":"151","article-title":"\u201cQuestion type guided attention in visual question answering,\u201d","volume-title":"Proceedings of the ECCV 2018","author":"Shi","year":"2018"},{"key":"B73","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1409.1556","article-title":"Very deep convolutional networks for large-scale image recognition","author":"Simonyan","year":"2015","journal-title":"arXiv"},{"key":"B74","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2312.02959","article-title":"Detecting algorithmic bias in medical-ai models using 
trees","author":"Smith","year":"2023","journal-title":"arXiv"},{"key":"B75","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2206.01923","article-title":"From pixels to objects: Cubic visual attention for visual question answering","author":"Song","year":"2022","journal-title":"arXiv"},{"key":"B76","first-page":"1","article-title":"\u201cGoing deeper with convolutions,\u201d","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Szegedy","year":"2015"},{"key":"B77","article-title":"\u201cJust at VQA-Med: A VGG-Seq2Seq model,\u201d","author":"Talafha","year":"2018","journal-title":"Technical Report"},{"key":"B78","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-16452-1_37","article-title":"\u201cConsistency-preserving visual question answering in medical imaging,\u201d","author":"Tascon-Morales","year":"2022","journal-title":"Medical Image Computing and Computer Assisted Intervention - MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, Vol. 13438"},{"key":"B79","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-43895-0_34","article-title":"\u201cLocalized questions in medical visual question answering,\u201d","author":"Tascon-Morales","year":"2023","journal-title":"Medical Image Computing and Computer Assisted Intervention - MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, Vol. 
14221"},{"key":"B80","article-title":"\u201cHARENDRAKV at VQA-Med 2020: Sequential vqa with attention for medical visual question answering,\u201d","volume-title":"Proceedings of the CLEF (Working Notes)","author":"Verma","year":""},{"key":"B81","article-title":"\u201cHarendrakv at vqa-med 2020: Sequential vqa with attention for medical visual question answering,\u201d","volume-title":"Technical Report","author":"Verma","year":""},{"key":"B82","doi-asserted-by":"publisher","first-page":"2856","DOI":"10.1109\/TMI.2020.2978284","article-title":"A question-centric model for visual question answering in medical imaging","volume":"39","author":"Vu","year":"2020","journal-title":"IEEE Trans. Medical Imag"},{"key":"B83","unstructured":"Web Content Accessibility Guidelines (WCAG) 2.2\n          \n          2023"},{"key":"B84","first-page":"141","article-title":"\u201cM2fNet: multi-granularity feature fusion network for medical visual question answering,\u201d","volume-title":"Proceedings of the PRICAI 2022: Trends in Artificial Intelligence, 19th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2022","author":"Wang","year":""},{"key":"B85","doi-asserted-by":"publisher","first-page":"102346","DOI":"10.1016\/j.artmed.2022.102346","article-title":"Medical visual question answering based on question-type reasoning and semantic space constraint","volume":"131","author":"Wang","year":"","journal-title":"Artif. Intellig. Med"},{"key":"B86","doi-asserted-by":"publisher","first-page":"2413","DOI":"10.1109\/TPAMI.2017.2754246","article-title":"FVQA: Fact-based visual question answering","volume":"40","author":"Wang","year":"2017","journal-title":"IEEE Trans. Pattern Analy. Mach. Intellig"},{"key":"B87","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1511.02570","article-title":"Explicit knowledge-based reasoning for visual question answering","author":"Wang","year":"2015","journal-title":"arXiv"},{"key":"B88","volume-title":"World Report on Vision. 
Technical Report","year":"2019"},{"key":"B89","first-page":"4622","article-title":"\u201cAsk me anything: Free-form visual question answering based on knowledge from external sources,\u201d","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Wu","year":"2016"},{"key":"B90","first-page":"2397","article-title":"\u201cDynamic memory networks for visual and textual question answering,\u201d","author":"Xiong","year":"2016","journal-title":"Proceedings of the International Conference on Machine Learning"},{"key":"B91","first-page":"5753","article-title":"\u201cXLNet: Generalized autoregressive pretraining for language understanding,\u201d","volume-title":"Advances in Neural Information Processing Systems","author":"Yang","year":"2019"},{"key":"B92","doi-asserted-by":"publisher","first-page":"268","DOI":"10.1016\/j.inffus.2019.03.005","article-title":"Information fusion in visual question answering: a survey","volume":"52","author":"Zhang","year":"2019","journal-title":"Inform. Fusion"},{"key":"B93","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2305.10415","article-title":"PMC-VQA: Visual instruction tuning for medical visual question answering","author":"Zhang","year":"2023","journal-title":"arXiv"},{"key":"B94","doi-asserted-by":"publisher","first-page":"5947","DOI":"10.1109\/TNNLS.2018.2817340","article-title":"Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering","volume":"29","author":"Zhou","year":"","journal-title":"IEEE Trans. Neural Netw. Learn. 
Syst"},{"key":"B95","article-title":"\u201cEmploying inception-resnet-v2 and bi-lstm for medical domain visual question answering,\u201d","volume-title":"Technical Report","author":"Zhou","year":""},{"key":"B96","first-page":"4995","article-title":"\u201cVisual7W: Grounded question answering in images,\u201d","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Zhu","year":"2016"}],"container-title":["Frontiers in Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1646176\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,3]],"date-time":"2025-11-03T12:27:49Z","timestamp":1762172869000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1646176\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,3]]},"references-count":96,"alternative-id":["10.3389\/frai.2025.1646176"],"URL":"https:\/\/doi.org\/10.3389\/frai.2025.1646176","relation":{},"ISSN":["2624-8212"],"issn-type":[{"value":"2624-8212","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,3]]},"article-number":"1646176"}}