{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,12]],"date-time":"2026-02-12T14:53:20Z","timestamp":1770908000629,"version":"3.50.1"},"reference-count":89,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,8,28]],"date-time":"2025-08-28T00:00:00Z","timestamp":1756339200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,8,28]],"date-time":"2025-08-28T00:00:00Z","timestamp":1756339200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100007465","name":"UiT The Arctic University of Norway","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100007465","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Discov Artif Intell"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Visual question answering (VQA) aims to answer questions for a given image. The applications of VQA systems are well explored in education, e-commerce, and interactive exhibits. It also enhances accessibility for the visually impaired (VI). Several VQA systems exist in English for various applications. However, VQA developed for VI people is limited, and such VQA in low-resource languages, specifically Hindi and Bengali, does not exist. This article introduces two such datasets in Bengali and Hindi. The datasets are machine-translated from the popular VQA-VI dataset VizWiz, and curated by native speakers. The datasets consist of approximately 20K image-question pairs along with 10 different answers. We also report benchmark results using state-of-the-art VQA methods and explore different pre-trained embeddings. We achieve a maximum answer type prediction accuracy and answer accuracy of 68.00%\/20.35% (Bengali) and 67.09%\/23.06% (Hindi). The low accuracy using recent state-of-the-art methods is evidence of the complexity of the datasets. We hope the datasets will attract researchers and create a baseline for VQA for VI people in resource-constrained Indic languages. The code and the datasets are available in url. The URL (will be updated) when published.<\/jats:p>","DOI":"10.1007\/s44163-025-00482-8","type":"journal-article","created":{"date-parts":[[2025,8,28]],"date-time":"2025-08-28T12:32:44Z","timestamp":1756384364000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Multilingual visual question answering for visually impaired people"],"prefix":"10.1007","volume":"5","author":[{"given":"Ratnabali","family":"Pal","sequence":"first","affiliation":[]},{"given":"Samarjit","family":"Kar","sequence":"additional","affiliation":[]},{"given":"Dilip K.","family":"Prasad","sequence":"additional","affiliation":[]},{"given":"Arif Ahmed","family":"Sekh","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,8,28]]},"reference":[{"key":"482_CR1","doi-asserted-by":"crossref","unstructured":"Srivastava Y, Murali V, Dubey SR, Mukherjee S. Visual question answering using deep learning: A survey and performance analysis. 
In: Computer Vision and Image Processing: 5th International Conference, CVIP 2020, Prayagraj, India, December 4-6, 2020, Revised Selected Papers, Part II 5, 2021:75\u201386.","DOI":"10.1007\/978-981-16-1092-9_7"},{"key":"482_CR2","doi-asserted-by":"publisher","first-page":"104327","DOI":"10.1016\/j.imavis.2021.104327","volume":"116","author":"H Sharma","year":"2021","unstructured":"Sharma H, Jalal AS. A survey of methods, datasets and evaluation metrics for visual question answering. Image Vis Comput. 2021;116:104327.","journal-title":"Image Vis Comput"},{"key":"482_CR3","doi-asserted-by":"publisher","first-page":"325","DOI":"10.1016\/j.patrec.2021.09.008","volume":"151","author":"S Barra","year":"2021","unstructured":"Barra S, Bisogni C, De Marsico M, Ricciardi S. Visual question answering: Which investigated applications? Pattern Recogn Lett. 2021;151:325\u201331.","journal-title":"Pattern Recogn Lett"},{"key":"482_CR4","doi-asserted-by":"crossref","unstructured":"Vivoli E, Biten AF, Mafla A, Karatzas D, Gomez L. Must-VQA: multilingual scene-text VQA. In: European Conference on Computer Vision Workshops, 2022:345\u2013358.","DOI":"10.1007\/978-3-031-25069-9_23"},{"key":"482_CR5","doi-asserted-by":"crossref","unstructured":"Shi B, Wu Z, Mao M, Wang X, Darrell T. When do we not need larger vision models? In: European Conference on Computer Vision, 2024:444\u2013462.","DOI":"10.1007\/978-3-031-73242-3_25"},{"issue":"3","key":"482_CR6","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3641289","volume":"15","author":"Y Chang","year":"2024","unstructured":"Chang Y, Wang X, Wang J, Wu Y, Yang L, Zhu K, Chen H, Yi X, Wang C, Wang Y, et al. A survey on evaluation of large language models. ACM Trans Intel Syst Technol. 2024;15(3):1\u201345.","journal-title":"ACM Trans Intel Syst Technol"},{"issue":"1","key":"482_CR7","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1038\/s43586-021-00018-1","volume":"1","author":"M Hafner","year":"2021","unstructured":"Hafner M, Katsantoni M, K\u00f6ster T, Marks J, Mukherjee J, Staiger D, Ule J, Zavolan M. Clip and complementary methods. Nat Rev Methods Primers. 2021;1(1):20.","journal-title":"Nat Rev Methods Primers"},{"key":"482_CR8","unstructured":"Li J, Li D, Xiong C, Hoi S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, 2022:12888\u201312900."},{"key":"482_CR9","first-page":"23716","volume":"35","author":"J-B Alayrac","year":"2022","unstructured":"Alayrac J-B, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M, et al. Flamingo: a visual language model for few-shot learning. Adv Neural Inf Process Syst. 2022;35:23716\u201336.","journal-title":"Adv Neural Inf Process Syst"},{"key":"482_CR10","doi-asserted-by":"crossref","unstructured":"Lee G and Zhai X. Realizing visual question answering for education: GPT-4v as a multimodal AI. TechTrends, 2025:1\u201317.","DOI":"10.1007\/s11528-024-01035-z"},{"key":"482_CR11","unstructured":"Zhang L and Ng Y. Visual question answering via cross-modal retrieval-augmented generation of large language model. In: Proceedings of the 38th Annual Conference of the Japanese Society for Artificial Intelligence (JSAI), 2024:2\u2013130121301."},{"key":"482_CR12","doi-asserted-by":"crossref","unstructured":"Sun G, Qin C, Wang J, Chen Z, Xu R, Tao Z. SQ-llava: Self-questioning for large vision-language assistant.
In: European Conference on Computer Vision, 2024:156\u2013172.","DOI":"10.1007\/978-3-031-72673-6_9"},{"key":"482_CR13","unstructured":"Kabra R, Matthey L, Lerchner A, Mitra N. Evaluating VLMs for score-based, multi-probe annotation of 3D objects. In: NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI, 2023."},{"key":"482_CR14","doi-asserted-by":"crossref","unstructured":"Bai Y, Geng X, Mangalam K, Bar A, Yuille AL, Darrell T, Malik J, Efros AA. Sequential modeling enables scalable learning for large vision models. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2024:22861\u201322872.","DOI":"10.1109\/CVPR52733.2024.02157"},{"key":"482_CR15","doi-asserted-by":"crossref","unstructured":"Yao Y, Duan J, Xu K, Cai Y, Sun Z, Zhang Y. A survey on large language model (LLM) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, 2024:100211.","DOI":"10.1016\/j.hcc.2024.100211"},{"issue":"7954","key":"482_CR16","doi-asserted-by":"publisher","first-page":"773","DOI":"10.1038\/d41586-023-00816-5","volume":"615","author":"K Sanderson","year":"2023","unstructured":"Sanderson K. GPT-4 is here: what scientists think. Nature. 2023;615(7954):773.","journal-title":"Nature"},{"issue":"4","key":"482_CR17","first-page":"129","volume":"29","author":"C Jeong","year":"2023","unstructured":"Jeong C. Generative AI service implementation using LLM application architecture: based on rag model and Langchain framework. J Intel Inf Syst. 2023;29(4):129\u201364.","journal-title":"J Intel Inf Syst"},{"key":"482_CR18","doi-asserted-by":"crossref","unstructured":"Liu S, Cheng H, Liu H, Zhang H, Li F, Ren T, Zou X, Yang J, Su H, Zhu J, et al.: Llava-plus: learning to use tools for creating multimodal agents. In: European Conference on Computer Vision, 2024:126\u2013142.","DOI":"10.1007\/978-3-031-72970-6_8"},{"key":"482_CR19","unstructured":"Chen X, Wang X, Changpinyo S, Piergiovanni A, Padlewski P, Salz D, Goodman S, Grycner A, Mustafa B, Beyer L, et al.: Pali: a jointly-scaled multilingual language-image model, 2022. arXiv preprint arXiv:2209.06794"},{"key":"482_CR20","doi-asserted-by":"crossref","unstructured":"Sharma H. and Jalal AS. Convolutional neural networks-based VQA model. In: Proceedings of International Conference on Frontiers in Computing and Systems: COMSYS 2021, 2022:109\u2013116.","DOI":"10.1007\/978-981-19-0105-8_11"},{"key":"482_CR21","doi-asserted-by":"publisher","first-page":"353","DOI":"10.7717\/peerj-cs.353","volume":"7","author":"Z Ma","year":"2021","unstructured":"Ma Z, Zheng W, Chen X, Yin L. Joint embedding VQA model based on dynamic word vector. PeerJ Comput Sci. 2021;7:353.","journal-title":"PeerJ Comput Sci"},{"issue":"73","key":"482_CR22","doi-asserted-by":"publisher","first-page":"111","DOI":"10.4114\/intartif.vol27iss73pp111-128","volume":"27","author":"D Koshti","year":"2024","unstructured":"Koshti D, Gupta A, Kalla M, Sharma A. Trans-VQA: fully transformer-based image question-answering model using question-guided vision attention. Intel Artif. 2024;27(73):111\u201328.","journal-title":"Intel Artif"},{"key":"482_CR23","doi-asserted-by":"crossref","unstructured":"Bhavana B and Chaitanya C et al. Visual question answering for enhanced user interaction with resnet and bert. In: 2024 4th International Conference on Ubiquitous Computing and Intelligent Information Systems (ICUIS), 2024:1225\u20131231","DOI":"10.1109\/ICUIS64676.2024.10866516"},{"key":"482_CR24","unstructured":"Gunti RR and Rorissa A. 
A dual of stacked attention networks (san\u2019s) and VGG-16 model-based visual question answering evaluation. In: CLEF (Working Notes), 2023:1478\u20131487"},{"key":"482_CR25","doi-asserted-by":"crossref","unstructured":"Nithish S, Kawinbalaji E, Sudalaimuthu T. Enhanced visual question answering system using densenet. In: 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), 2024:01\u201306.","DOI":"10.1109\/ADICS58448.2024.10533524"},{"issue":"1","key":"482_CR26","doi-asserted-by":"publisher","first-page":"27","DOI":"10.56471\/slujst.v4i.266","volume":"4","author":"HD Abubakar","year":"2022","unstructured":"Abubakar HD, Umar M, Bakale MA. Sentiment classification: review of text vectorization methods: Bag of words, TF-IDF, word2vec and doc2vec. SLU J Sci Technol. 2022;4(1):27\u201333.","journal-title":"SLU J Sci Technol"},{"issue":"1","key":"482_CR27","first-page":"9285324","volume":"2022","author":"L Xiang","year":"2022","unstructured":"Xiang L. Application of an improved TF-IDF method in literary text classification. Adv Multimed. 2022;2022(1):9285324.","journal-title":"Adv Multimed"},{"issue":"7","key":"482_CR28","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3648471","volume":"56","author":"J Wang","year":"2024","unstructured":"Wang J, Huang JX, Tu X, Wang J, Huang AJ, Laskar MTR, Bhuiyan A. Utilizing bert for information retrieval: Survey, applications, resources, and challenges. ACM Comput Surv. 2024;56(7):1\u201333.","journal-title":"ACM Comput Surv"},{"key":"482_CR29","doi-asserted-by":"publisher","first-page":"101998","DOI":"10.1016\/j.inffus.2023.101998","volume":"101","author":"A Kumar","year":"2024","unstructured":"Kumar A, Jain DK, Mallik A, Kumar S. Modified node2vec and attention based fusion framework for next poi recommendation. Inf Fusion. 2024;101:101998.","journal-title":"Inf Fusion"},{"key":"482_CR30","doi-asserted-by":"publisher","first-page":"102611","DOI":"10.1016\/j.artmed.2023.102611","volume":"143","author":"Z Lin","year":"2023","unstructured":"Lin Z, Zhang D, Tao Q, Shi D, Haffari G, Wu Q, He M, Ge Z. Medical visual question answering: a survey. Artif Intell Med. 2023;143:102611.","journal-title":"Artif Intell Med"},{"key":"482_CR31","doi-asserted-by":"crossref","unstructured":"Khare Y, Bagal V, Mathew M, Devi A, Priyakumar UD, Jawahar C. Mmbert: multimodal bert pretraining for improved medical VQA. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021:1033\u20131036.","DOI":"10.1109\/ISBI48211.2021.9434063"},{"key":"482_CR32","doi-asserted-by":"crossref","unstructured":"Wang X, Liu Y, Shen C, Ng CC, Luo C, Jin L, Chan CS, Hengel AVD, Wang L. On the general value of evidence, and bilingual scene-text visual question answering. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2020:10126\u201310135.","DOI":"10.1109\/CVPR42600.2020.01014"},{"key":"482_CR33","doi-asserted-by":"crossref","unstructured":"Qi L, Lv S, Li H, Liu J, Zhang Y, She Q, Wu H, Wang H, Liu T. Dureadervis: a Chinese dataset for open-domain document visual question answering. In: Findings of the Association for Computational Linguistics: ACL 2022, 2022:1338\u20131351.","DOI":"10.18653\/v1\/2022.findings-acl.105"},{"issue":"8","key":"482_CR34","doi-asserted-by":"publisher","first-page":"10803","DOI":"10.1007\/s13369-023-07687-y","volume":"48","author":"SM Kamel","year":"2023","unstructured":"Kamel SM, Hassan SI, Elrefaei L. Vaqa: visual Arabic question answering. 
Arab J Sci Eng. 2023;48(8):10803\u201323.","journal-title":"Arab J Sci Eng"},{"key":"482_CR35","doi-asserted-by":"crossref","unstructured":"Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Doll\u00e1r P, Zitnick CL. Microsoft coco: common objects in context. In: Computer vision\u2013ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V. 2014;13:740\u2013755.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"482_CR36","doi-asserted-by":"publisher","first-page":"335","DOI":"10.1016\/j.procs.2024.10.207","volume":"244","author":"A ElMaghraby","year":"2024","unstructured":"ElMaghraby A, Maged S, Essawey M, ElFaramawy R, Negm E, Khoriba G. Enhancing visual question answering for Arabic language using Llava and reinforcement learning. Procedia Comput Sci. 2024;244:335\u201341.","journal-title":"Procedia Comput Sci"},{"key":"482_CR37","unstructured":"Lee H, Phatale S, Mansoor H, Lu KR, Mesnard T, Ferret J, Bishop C, Hall E, Carbune V, Rastogi A. Rlaif: scaling reinforcement learning from human feedback with AI feedback, 2023."},{"issue":"9","key":"482_CR38","doi-asserted-by":"publisher","first-page":"1477","DOI":"10.3390\/rs16091477","volume":"16","author":"Y Bazi","year":"2024","unstructured":"Bazi Y, Bashmal L, Al Rahhal MM, Ricci R, Melgani F. Rs-llava: a large vision-language model for joint captioning and question answering in remote sensing imagery. Remote Sens. 2024;16(9):1477.","journal-title":"Remote Sens"},{"key":"482_CR39","unstructured":"Tran KQ, Nguyen AT, Le AT-H, Van Nguyen K. Vivqa: Vietnamese visual question answering. In: Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation, 2021:683\u2013691."},{"key":"482_CR40","doi-asserted-by":"crossref","unstructured":"Gupta D, Lenka P, Ekbal A, Bhattacharyya P. A unified framework for multilingual and code-mixed visual question answering. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 2020:900\u2013913.","DOI":"10.18653\/v1\/2020.aacl-main.90"},{"issue":"4","key":"482_CR41","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3573891","volume":"22","author":"SK Mishra","year":"2023","unstructured":"Mishra SK, Sinha S, Saha S, Bhattacharyya P. Dynamic convolution-based encoder-decoder framework for image captioning in Hindi. ACM Trans Asian Low-Resource Language Inf Process. 2023;22(4):1\u201318.","journal-title":"ACM Trans Asian Low-Resource Language Inf Process"},{"key":"482_CR42","doi-asserted-by":"crossref","unstructured":"Rafi MH, Islam S, Labib SHI, Hasan SS, Shah FM, Ahmed S. A deep learning-based Bengali visual question answering system. In: 2022 25th International Conference on Computer and Information Technology (ICCIT), 2022:114\u2013119.","DOI":"10.1109\/ICCIT57492.2022.10055205"},{"issue":"20","key":"482_CR43","doi-asserted-by":"publisher","first-page":"2470","DOI":"10.3390\/electronics10202470","volume":"10","author":"D Bhatt","year":"2021","unstructured":"Bhatt D, Patel C, Talsania H, Patel J, Vaghela R, Pandya S, Modi K, Ghayvat H. CNN variants for computer vision: history, architecture, application, challenges and future scope. Electronics. 2021;10(20):2470.","journal-title":"Electronics"},{"issue":"3","key":"482_CR44","first-page":"4123","volume":"44","author":"R Bensoltane","year":"2023","unstructured":"Bensoltane R, Zaki T.
Combining bert with TCN-Bigru for enhancing Arabic aspect category detection. J Intel Fuzzy Syst. 2023;44(3):4123\u201336.","journal-title":"J Intel Fuzzy Syst"},{"key":"482_CR45","doi-asserted-by":"crossref","unstructured":"Bhuyan MSM, Hossain E, Sathi KA, Hossain MA, Dewan MAA. BVQA: connecting language and vision through multimodal attention for open-ended question answering. IEEE Access, 2025.","DOI":"10.1109\/ACCESS.2025.3540388"},{"key":"482_CR46","doi-asserted-by":"crossref","unstructured":"Batool M, Alotaibi M, Alotaibi SR, AlHammadi DA, Jamal MA, Jalal A, Lee B. Multimodal human action recognition framework using an improved CNNGRU classifier. IEEE Access, 2024.","DOI":"10.1109\/ACCESS.2024.3481631"},{"key":"482_CR47","unstructured":"Parida S, Sahoo S, Sekhar S, Sahoo K, Kotwal K, Khosla S, Dash SR, Bose A, Kohli GS, Lenka SS et al. Ovqa: a dataset for visual question answering and multimodal research in Odia language. In: Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, 2025: 58\u201366."},{"key":"482_CR48","doi-asserted-by":"crossref","unstructured":"Amin D, Govilkar S, Kulkarni S. Visual question answering system for Indian regional languages. In: 2022 5th International Conference on Advances in Science and Technology (ICAST), 2022:22\u201327.","DOI":"10.1109\/ICAST55766.2022.10039528"},{"key":"482_CR49","doi-asserted-by":"crossref","unstructured":"Han C, Wang J, Zhang X. YNU-HPCC at semeval-2022 task 5: multi-modal and multi-label emotion classification based on lxmert. In: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 2022:748\u2013755.","DOI":"10.18653\/v1\/2022.semeval-1.104"},{"key":"482_CR50","doi-asserted-by":"crossref","unstructured":"Joseph J, Ram VA, YadhuKrishna P, Anjali T. Visual question answering in Malayalam text. In: 2024 3rd International Conference on Sentiment Analysis and Deep Learning (ICSADL), 2024:225\u2013232.","DOI":"10.1109\/ICSADL61749.2024.00042"},{"issue":"24","key":"482_CR51","doi-asserted-by":"publisher","first-page":"14691","DOI":"10.1007\/s00521-024-09818-4","volume":"36","author":"AG Kovath","year":"2024","unstructured":"Kovath AG, Nayyar A, Sikha O. Multimodal attention-driven visual question answering for Malayalam. Neural Comput Appl. 2024;36(24):14691\u2013708.","journal-title":"Neural Comput Appl"},{"key":"482_CR52","doi-asserted-by":"crossref","unstructured":"Ch\u2019ng CK and Chan CS. Total-text: a comprehensive dataset for scene text detection and recognition. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017;1:935\u2013942.","DOI":"10.1109\/ICDAR.2017.157"},{"key":"482_CR53","doi-asserted-by":"crossref","unstructured":"Liao M, Shi B, Bai X, Wang X, Liu W. Textboxes: a fast text detector with a single deep neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2017;31.","DOI":"10.1609\/aaai.v31i1.11196"},{"key":"482_CR54","doi-asserted-by":"crossref","unstructured":"Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, Matas J, Neumann L, Chandrasekhar VR, Lu S et al. ICDAR 2015 competition on robust reading. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), 2015:1156\u20131160.","DOI":"10.1109\/ICDAR.2015.7333942"},{"key":"482_CR55","unstructured":"Yuliang L, Lianwen J, Shuaitao Z, Sheng Z. Detecting curve text in the wild: New dataset and new solution, 2017. 
arXiv preprint arXiv:1712.02170"},{"key":"482_CR56","doi-asserted-by":"crossref","unstructured":"Sun Y, Ni Z, Chng C-K, Liu Y, Luo C, Ng CC, Han J, Ding E, Liu J, Karatzas D, et al.: ICDAR 2019 competition on large-scale street view text with partial labeling-RRC-LSVT. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019:1557\u20131562.","DOI":"10.1109\/ICDAR.2019.00250"},{"key":"482_CR57","doi-asserted-by":"crossref","unstructured":"Han X, Wang Y, Zhai B, You Q, Yang H. Coco is \u201call\u201d you need for visual instruction fine-tuning. In: 2024 IEEE International Conference on Multimedia and Expo (ICME), 2024:1\u20135.","DOI":"10.1109\/ICME57554.2024.10687511"},{"key":"482_CR58","doi-asserted-by":"crossref","unstructured":"Zou Y, Xie Q. A survey on VQA: Datasets and approaches. In: 2020 2nd International Conference on Information Technology and Computer Application (ITCA), 2020:289\u2013297.","DOI":"10.1109\/ITCA52113.2020.00069"},{"key":"482_CR59","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","volume":"123","author":"R Krishna","year":"2017","unstructured":"Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA, et al. Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vision. 2017;123:32\u201373.","journal-title":"Int J Comput Vision"},{"key":"482_CR60","doi-asserted-by":"crossref","unstructured":"Yenduri G, Ramalingam M, Selvi GC, Supriya Y, Srivastava G, Maddikunta PKR, Raj GD, Jhaveri RH, Prabadevi B, Wang W et al. GPT (generative pre-trained transformer)\u2013a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. IEEE Access, 2024.","DOI":"10.1109\/ACCESS.2024.3389497"},{"issue":"5","key":"482_CR61","doi-asserted-by":"publisher","first-page":"2685","DOI":"10.1109\/TCOMM.2023.3247733","volume":"71","author":"Z Zhu","year":"2023","unstructured":"Zhu Z, Yu H, Shen C, Du J, Shen Z, Wang Z. Causal language model aided sequential decoding with natural redundancy. IEEE Trans Commun. 2023;71(5):2685\u201397.","journal-title":"IEEE Trans Commun"},{"key":"482_CR62","doi-asserted-by":"crossref","unstructured":"Arefyev N, Kharchev D, Shelmanov A. NB-MLM: efficient domain adaptation of masked language models for sentiment analysis. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021:9114\u20139124.","DOI":"10.18653\/v1\/2021.emnlp-main.717"},{"issue":"21","key":"482_CR63","doi-asserted-by":"publisher","first-page":"7744","DOI":"10.1126\/sciadv.adn7744","volume":"10","author":"S Yu","year":"2024","unstructured":"Yu S, Gu C, Huang K, Li P. Predicting the next sentence (not word) in large language models: what model-brain alignment tells us about discourse comprehension. Sci Adv. 2024;10(21):7744.","journal-title":"Sci Adv"},{"key":"482_CR64","doi-asserted-by":"crossref","unstructured":"Naeve Z, Mitchell L, Reed C, Campbell P, Morgan T, Rogers V. Introducing dynamic token embedding sampling of large language models for improved inference accuracy. Authorea Preprints, 2024.","DOI":"10.36227\/techrxiv.173014793.37761346\/v1"},{"key":"482_CR65","first-page":"16079","volume":"34","author":"T Likhomanenko","year":"2021","unstructured":"Likhomanenko T, Xu Q, Synnaeve G, Collobert R, Rogozhnikov A. Cape: encoding relative positions with continuous augmented positional embeddings. Adv Neural Inf Process Syst. 
2021;34:16079\u201392.","journal-title":"Adv Neural Inf Process Syst"},{"key":"482_CR66","doi-asserted-by":"crossref","unstructured":"Ge W, Lu X, Shen J. Video object segmentation using global and instance embedding learning. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2021:16836\u201316845.","DOI":"10.1109\/CVPR46437.2021.01656"},{"key":"482_CR67","doi-asserted-by":"publisher","first-page":"21517","DOI":"10.1109\/ACCESS.2022.3152828","volume":"10","author":"KL Tan","year":"2022","unstructured":"Tan KL, Lee CP, Anbananthen KSM, Lim KM. Roberta-LSTM: a hybrid model for sentiment analysis with transformer and recurrent neural network. IEEE Access. 2022;10:21517\u201325.","journal-title":"IEEE Access"},{"key":"482_CR68","doi-asserted-by":"crossref","unstructured":"Mozafari J, Fatemi A, Moradi P. A method for answer selection using Distilbert and important words. In: 2020 6th International Conference on Web Research (ICWR), 2020:72\u201376.","DOI":"10.1109\/ICWR49608.2020.9122302"},{"key":"482_CR69","doi-asserted-by":"publisher","first-page":"50150","DOI":"10.2196\/50150","volume":"11","author":"JA Lossio-Ventura","year":"2024","unstructured":"Lossio-Ventura JA, Weger R, Lee AY, Guinee EP, Chung J, Atlas L, Linos E, Pereira F. A comparison of Chatgpt and fine-tuned open pre-trained transformers (opt) against widely used sentiment analysis tools: sentiment analysis of covid-19 survey data. JMIR Mental Health. 2024;11:50150.","journal-title":"JMIR Mental Health"},{"key":"482_CR70","doi-asserted-by":"crossref","unstructured":"Yang ZG, Laki LJ, V\u00e1radi T, Pr\u00f3sz\u00e9ky G. Mono- and multilingual GPT-3 models for Hungarian. In: International Conference on Text, Speech, and Dialogue, 2023:94\u2013104.","DOI":"10.1007\/978-3-031-40498-6_9"},{"issue":"3","key":"482_CR71","first-page":"15","volume":"14","author":"MZ Zaki","year":"2024","unstructured":"Zaki MZ. Revolutionising translation technology: a comparative study of variant transformer models-bert, GPT and t5. Comput Sci Eng Int J. 2024;14(3):15\u201327.","journal-title":"Comput Sci Eng Int J"},{"key":"482_CR72","doi-asserted-by":"publisher","first-page":"271","DOI":"10.1007\/s10844-014-0323-6","volume":"43","author":"F Fauzi","year":"2014","unstructured":"Fauzi F, Belkhatir M. Image understanding and the web: a state-of-the-art review. J Intel Inf Syst. 2014;43:271\u2013306.","journal-title":"J Intel Inf Syst"},{"key":"482_CR73","doi-asserted-by":"crossref","unstructured":"Gurari D, Li Q, Stangl AJ, Guo A, Lin C, Grauman K, Luo J, Bigham JP. Vizwiz grand challenge: Answering visual questions from blind people. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018:3608\u20133617.","DOI":"10.1109\/CVPR.2018.00380"},{"key":"482_CR74","doi-asserted-by":"crossref","unstructured":"Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D. Making the v in VQA matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017:6904\u20136913.","DOI":"10.1109\/CVPR.2017.670"},{"key":"482_CR75","doi-asserted-by":"crossref","unstructured":"Marino K, Rastegari M, Farhadi A, Mottaghi R. Ok-VQA: a visual question answering benchmark requiring external knowledge.
In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2019:3195\u20133204","DOI":"10.1109\/CVPR.2019.00331"},{"key":"482_CR76","doi-asserted-by":"crossref","unstructured":"Changpinyo S, Sharma P, Ding N, Soricut R. Conceptual 12m: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2021:3558\u20133568.","DOI":"10.1109\/CVPR46437.2021.00356"},{"key":"482_CR77","doi-asserted-by":"crossref","unstructured":"Chen Y-C, Li L, Yu L, El\u00a0Kholy A, Ahmed F, Gan Z, Cheng Y, Liu J. Uniter: universal image-text representation learning. In: European Conference on Computer Vision, 2020:104\u2013120.","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"482_CR78","first-page":"25278","volume":"35","author":"C Schuhmann","year":"2022","unstructured":"Schuhmann C, Beaumont R, Vencu R, Gordon C, Wightman R, Cherti M, Coombes T, Katta A, Mullis C, Wortsman M, et al. Laion-5b: an open large-scale dataset for training next generation image-text models. Adv Neural Inf Process Syst. 2022;35:25278\u201394.","journal-title":"Adv Neural Inf Process Syst"},{"key":"482_CR79","doi-asserted-by":"crossref","unstructured":"Yang Z, Lu Y, Wang J, Yin X, Florencio D, Wang L, Zhang C, Zhang L, Luo J. Tap: text-aware pre-training for text-VQA and text-caption. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2021:8751\u20138761.","DOI":"10.1109\/CVPR46437.2021.00864"},{"issue":"3","key":"482_CR80","doi-asserted-by":"publisher","first-page":"289","DOI":"10.1007\/s00799-022-00329-y","volume":"23","author":"T Saikh","year":"2022","unstructured":"Saikh T, Ghosal T, Mittal A, Ekbal A, Bhattacharyya P. Scienceqa: a novel resource for question answering on scholarly articles. Int J Digit Libr. 2022;23(3):289\u2013301.","journal-title":"Int J Digit Libr"},{"key":"482_CR81","doi-asserted-by":"crossref","unstructured":"Hudson DA and Manning CD. GQA: a new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 2019:6700\u20136709.","DOI":"10.1109\/CVPR.2019.00686"},{"key":"482_CR82","doi-asserted-by":"crossref","unstructured":"Yang B, He L, Liu K, Yan Z. Viassist: adapting multi-modal large language models for users with visual impairments. In: 2024 IEEE International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things (FMSys), 2024:32\u201337.","DOI":"10.1109\/FMSys62467.2024.00010"},{"key":"482_CR83","doi-asserted-by":"crossref","unstructured":"De\u00a0Marsico M, Giacanelli C, Manganaro CG, Palma A, Santoro D. Vqask: a multimodal android gpt-based application to help blind users visualize pictures. In: Proceedings of the 2024 International Conference on Advanced Visual Interfaces, 2024:1\u20135.","DOI":"10.1145\/3656650.3656677"},{"issue":"4","key":"482_CR84","doi-asserted-by":"publisher","first-page":"652","DOI":"10.1109\/TPAMI.2016.2587640","volume":"39","author":"O Vinyals","year":"2016","unstructured":"Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: lessons learned from the 2015 Mscoco image captioning challenge. IEEE Trans Pattern Anal Mach Intell. 2016;39(4):652\u201363.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"482_CR85","doi-asserted-by":"crossref","unstructured":"Biten AF, Tito R, Mafla A, Gomez L, Rusinol M, Valveny E, Jawahar C, Karatzas D. Scene text visual question answering. 
In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, 2019:4291\u20134301.","DOI":"10.1109\/ICCV.2019.00439"},{"key":"482_CR86","doi-asserted-by":"crossref","unstructured":"Mishra A, Shekhar S, Singh AK, Chakraborty A. OCR-VQA: visual question answering by reading text in images. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019:947\u2013952.","DOI":"10.1109\/ICDAR.2019.00156"},{"key":"482_CR87","doi-asserted-by":"crossref","unstructured":"Sokolov\u00e1 Z, Harahus M, Juh\u00e1r J, Pleva M, Hl\u00e1dek D, Sta\u0161 J. Comparison of sentiment classifiers on slovak datasets: Original versus machine translated. In: 2023 21st International Conference on Emerging eLearning Technologies and Applications (ICETA), 2023:485\u2013492.","DOI":"10.1109\/ICETA61311.2023.10343600"},{"key":"482_CR88","doi-asserted-by":"crossref","unstructured":"Silva A, Srivastava N, Ngoli TM, R\u00f6der M, Moussallem D, Ngomo A-CN. Benchmarking low-resource machine translation systems. In: Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024), 2024:175\u2013185.","DOI":"10.18653\/v1\/2024.loresmt-1.18"},{"key":"482_CR89","doi-asserted-by":"crossref","unstructured":"Shaw P, Uszkoreit J, Vaswani A. Self-attention with relative position representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2018;2(Short Papers).","DOI":"10.18653\/v1\/N18-2074"}],"container-title":["Discover Artificial Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44163-025-00482-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s44163-025-00482-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s44163-025-00482-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T05:04:52Z","timestamp":1757480692000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s44163-025-00482-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,28]]},"references-count":89,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["482"],"URL":"https:\/\/doi.org\/10.1007\/s44163-025-00482-8","relation":{},"ISSN":["2731-0809"],"issn-type":[{"value":"2731-0809","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,28]]},"assertion":[{"value":"24 April 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 August 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 August 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"All authors read and approved the final 
manuscript.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare no conflict of interest.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"226"}}