{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T15:31:05Z","timestamp":1778167865366,"version":"3.51.4"},"reference-count":65,"publisher":"Springer Science and Business Media LLC","issue":"39","license":[{"start":{"date-parts":[[2025,7,25]],"date-time":"2025-07-25T00:00:00Z","timestamp":1753401600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,7,25]],"date-time":"2025-07-25T00:00:00Z","timestamp":1753401600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001778","name":"Deakin University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001778","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Multimed Tools Appl"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Visual Question Answering (VQA) is a multimodality research domain that intersects the fields of computer vision and natural language processing for visual-textual data processing and understanding. Traditional VQA methods extract visual and textual features from pre-trained architectures, respectively, then combine the features from both modalities in a common feature space. The traditional methods perform well on high-level perception questions. However, attaining high accuracy on low-level perception questions still remains challenging. The difficulties include detecting relevant visual and textual information, building meaningful associations, and extracting insights from the multimodal data. To address these challenges, unlike existing approaches, we propose a novel multi-modality guided cross self-attention mechanism for building semantic relationships within individual modalities as well as between them. Specifically, we examine visual-guided cross-attention (VGCA), textual-guided cross-attention (TGCA), and multi-modality-guided cross-attention (MMGCA). We utilise convolutional neural networks (CNNs) for visual feature learning, and LSTM and FNET for textual feature learning. We evaluate our method on two benchmark datasets, including VQA 1.0 and VQA 2.0. Experimental results demonstrate the superiority of our method over existing baselines by improving the performance on various types of questions in both datasets.<\/jats:p>","DOI":"10.1007\/s11042-025-21049-w","type":"journal-article","created":{"date-parts":[[2025,7,25]],"date-time":"2025-07-25T05:21:01Z","timestamp":1753420861000},"page":"47543-47565","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Multi-modality guided cross-attention for visual question answering"],"prefix":"10.1007","volume":"84","author":[{"given":"Muhammad Zeeshan","family":"Khan","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Duc Thanh","family":"Nguyen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Thanh Thi","family":"Nguyen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Anuroop","family":"Gaddam","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Imran","family":"Razzak","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,7,25]]},"reference":[{"issue":"25","key":"21049_CR1","doi-asserted-by":"publisher","first-page":"38799","DOI":"10.1007\/s11042-023-14333-0","volume":"82","author":"A Falcon","year":"2023","unstructured":"Falcon A, Serra G, Lanz O (2023) Video question answering supported by a multi-task learning objective. Multimedia Tools and Applications 82(25):38799\u201338826","journal-title":"Multimedia Tools and Applications"},{"issue":"19","key":"21049_CR2","doi-asserted-by":"publisher","first-page":"57829","DOI":"10.1007\/s11042-023-17797-2","volume":"83","author":"DS Asudani","year":"2024","unstructured":"Asudani DS, Nagwani NK, Singh P (2024) A comparative evaluation of machine learning and deep learning algorithms for question categorization of vqa datasets. Multimed Tools Appl 83(19):57829\u201357859","journal-title":"Multimed Tools Appl"},{"key":"21049_CR3","doi-asserted-by":"crossref","unstructured":"Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator. In: Proceedings of the IEEE, pp 3156\u20133164","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"21049_CR4","doi-asserted-by":"crossref","unstructured":"Noh H, Araujo A, Sim J, Weyand T, Han B (2017) Large-scale image retrieval with attentive deep local features. In: Proceedings of the IEEE international conference on computer vision, pp 3456\u20133465","DOI":"10.1109\/ICCV.2017.374"},{"key":"21049_CR5","doi-asserted-by":"crossref","unstructured":"Donahue J, Anne\u00a0Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE international conference on computer vision and pattern recognition, pp 2625\u20132634","DOI":"10.1109\/CVPR.2015.7298878"},{"issue":"4","key":"21049_CR6","doi-asserted-by":"publisher","first-page":"643","DOI":"10.1016\/S0031-3203(96)00109-4","volume":"30","author":"HJ Zhang","year":"1997","unstructured":"Zhang HJ, Wu J, Zhong D, Smoliar SW (1997) An integrated system for content-based video retrieval and browsing. Pattern Recogn 30(4):643\u2013658","journal-title":"Pattern Recogn"},{"key":"21049_CR7","doi-asserted-by":"crossref","unstructured":"Gkioxari G, Girshick R, Doll\u00e1r P, He K (2018) Detecting and recognizing human-object interactions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8359\u20138367","DOI":"10.1109\/CVPR.2018.00872"},{"key":"21049_CR8","doi-asserted-by":"crossref","unstructured":"Gurari D, Li Q, Stangl AJ, Guo A, Lin C, Grauman K, Luo J, Bigham JP (2018) Vizwiz grand challenge: Answering visual questions from blind people. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3608\u20133617","DOI":"10.1109\/CVPR.2018.00380"},{"key":"21049_CR9","doi-asserted-by":"crossref","unstructured":"Das A, Kottur S, Gupta K, Singh A, Yadav D, Moura JM, Parikh D, Batra D (2017) Visual dialog. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 326\u2013335","DOI":"10.1109\/CVPR.2017.121"},{"key":"21049_CR10","doi-asserted-by":"crossref","unstructured":"He B, Xia M, Yu X, Jian P, Meng H, Chen Z (2017) An educational robot system of visual question answering for preschoolers. In: 2017 2nd International conference on robotics and automation engineering (ICRAE). IEEE, pp 441\u2013445","DOI":"10.1109\/ICRAE.2017.8291426"},{"key":"21049_CR11","doi-asserted-by":"crossref","unstructured":"Hussain Z, Zhang M, Zhang X, Ye K, Thomas C, Agha Z, Ong N, Kovashka A (2017) Automatic understanding of image and video advertisements. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1705\u20131715","DOI":"10.1109\/CVPR.2017.123"},{"key":"21049_CR12","unstructured":"Ben\u00a0Abacha A, Hasan SA, Datla VV, Demner-Fushman D, M\u00fcller H (2019) Vqa-med. In: Proceedings of CLEF (Conference And Labs Of The Evaluation Forum) 2019 working notes. 9\u201312 September 2019"},{"key":"21049_CR13","doi-asserted-by":"crossref","unstructured":"Yuan Y, Wang S, Jiang M, Chen TY (2021) Perception matters: Detecting perception failures of VQA models using metamorphic testing. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 16908\u201316917","DOI":"10.1109\/CVPR46437.2021.01663"},{"key":"21049_CR14","doi-asserted-by":"crossref","unstructured":"Jing C, Jia Y, Wu Y, Liu X, Wu Q (2022) Maintaining reasoning consistency in compositional visual question answering. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 5099\u20135108","DOI":"10.1109\/CVPR52688.2022.00504"},{"key":"21049_CR15","doi-asserted-by":"crossref","unstructured":"Shih KJ, Singh S, Hoiem D (2016) Where to look: Focus regions for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4613\u20134621","DOI":"10.1109\/CVPR.2016.499"},{"key":"21049_CR16","unstructured":"Lu J, Yang J, Batra D, Parikh D (2016) Hierarchical question-image co-attention for visual question answering. Adv Neural Inform Process Syst 29"},{"key":"21049_CR17","doi-asserted-by":"crossref","unstructured":"Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 21\u201329","DOI":"10.1109\/CVPR.2016.10"},{"key":"21049_CR18","doi-asserted-by":"crossref","unstructured":"Yu Z, Yu J, Fan J, Tao D (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 1821\u20131830","DOI":"10.1109\/ICCV.2017.202"},{"issue":"13","key":"21049_CR19","doi-asserted-by":"publisher","first-page":"16706","DOI":"10.1007\/s10489-022-04355-w","volume":"53","author":"X Shen","year":"2023","unstructured":"Shen X, Han D, Guo Z, Chen C, Hua J, Luo G (2023) Local self-attention in transformer for visual question answering. Appl Intell 53(13):16706\u201316723","journal-title":"Appl Intell"},{"key":"21049_CR20","doi-asserted-by":"publisher","first-page":"70","DOI":"10.1016\/j.inffus.2021.02.006","volume":"72","author":"W Zhang","year":"2021","unstructured":"Zhang W, Yu J, Zhao W, Ran C (2021) Dmrfnet: deep multimodal reasoning and fusion for visual question answering and explanation generation. Inform Fusion 72:70\u201379","journal-title":"Inform Fusion"},{"key":"21049_CR21","doi-asserted-by":"crossref","unstructured":"Liu Y, Zhang X, Zhang Q, Li C, Huang F, Tang X, Li Z (2021) Dual self-attention with co-attention networks for visual question answering. Pattern Recogn 117:107956","DOI":"10.1016\/j.patcog.2021.107956"},{"key":"21049_CR22","doi-asserted-by":"publisher","first-page":"6730","DOI":"10.1109\/TIP.2021.3097180","volume":"30","author":"W Guo","year":"2021","unstructured":"Guo W, Zhang Y, Yang J, Yuan X (2021) Re-attention for visual question answering. IEEE Trans Image Process 30:6730\u20136743","journal-title":"IEEE Trans Image Process"},{"key":"21049_CR23","unstructured":"Zhang H, Goodfellow I, Metaxas D, Odena A (2019) Self-attention generative adversarial networks. In: International conference on machine learning. PMLR, pp. 7354\u20137363"},{"key":"21049_CR24","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems"},{"issue":"4","key":"21049_CR25","doi-asserted-by":"publisher","first-page":"2192","DOI":"10.1109\/TSMC.2023.3342640","volume":"54","author":"Z Xiao","year":"2024","unstructured":"Xiao Z, Xing H, Qu R, Feng L, Luo S, Dai P, Zhao B, Dai Y (2024) Densely knowledge-aware network for multivariate time series classification. IEEE Trans Syst Man, Cybern: Syst 54(4):2192\u20132204","journal-title":"IEEE Trans Syst Man, Cybern: Syst"},{"issue":"8","key":"21049_CR26","doi-asserted-by":"publisher","first-page":"4101","DOI":"10.1109\/TAI.2024.3360180","volume":"5","author":"Z Xiao","year":"2024","unstructured":"Xiao Z, Xing H, Qu R, Li H, Feng L, Zhao B, Yang J (2024) Self-bidirectional decoupled distillation for time series classification. IEEE Trans Artif Intell 5(8):4101\u20134110","journal-title":"IEEE Trans Artif Intell"},{"key":"21049_CR27","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1016\/J.CVIU.2017.05.001","volume":"163","author":"Q Wu","year":"2017","unstructured":"Wu Q, Teney D, Wang P, Shen C, Dick AR, Hengel A (2017) Visual question answering: A survey of methods and datasets. Comput Vis Image Underst 163:21\u201340. https:\/\/doi.org\/10.1016\/J.CVIU.2017.05.001","journal-title":"Comput Vis Image Underst"},{"issue":"8","key":"21049_CR28","doi-asserted-by":"publisher","first-page":"5705","DOI":"10.1007\/S10462-020-09832-7","volume":"53","author":"S Manmadhan","year":"2020","unstructured":"Manmadhan S, Kovoor BC (2020) Visual question answering: a state-of-the-art review. Artif Intell Rev 53(8):5705\u20135745. https:\/\/doi.org\/10.1007\/S10462-020-09832-7","journal-title":"Artif Intell Rev"},{"key":"21049_CR29","doi-asserted-by":"crossref","unstructured":"Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: Visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425\u20132433","DOI":"10.1109\/ICCV.2015.279"},{"key":"21049_CR30","doi-asserted-by":"crossref","unstructured":"Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1\u20139","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"21049_CR31","unstructured":"Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556"},{"key":"21049_CR32","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"21049_CR33","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: Proceedings of the international conference on learning representations"},{"key":"21049_CR34","doi-asserted-by":"crossref","unstructured":"Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on computer vision and pattern recognition, Ieee, pp 248\u2013255","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"21049_CR35","unstructured":"Everingham M, Van\u00a0Gool L, Williams CKI, Winn J, Zisserman A (2012) The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http:\/\/www.pascal-network.org\/challenges\/VOC\/voc2012\/workshop\/index.html"},{"key":"21049_CR36","doi-asserted-by":"crossref","unstructured":"Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Doll\u00e1r P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European conference on computer vision. Springer, pp 740\u2013755","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"21049_CR37","doi-asserted-by":"crossref","unstructured":"Zhu Y, Groth O, Bernstein M, Fei-Fei L (2016) Visual7w: Grounded question answering in images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4995\u20135004","DOI":"10.1109\/CVPR.2016.540"},{"key":"21049_CR38","unstructured":"Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781"},{"key":"21049_CR39","doi-asserted-by":"crossref","unstructured":"Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Proceedings of the conference on empirical methods in natural language processing, pp 1532\u20131543","DOI":"10.3115\/v1\/D14-1162"},{"issue":"8","key":"21049_CR40","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735\u20131780","journal-title":"Neural Comput"},{"key":"21049_CR41","doi-asserted-by":"crossref","unstructured":"Cho K, Van\u00a0Merri\u00ebnboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078","DOI":"10.3115\/v1\/D14-1179"},{"key":"21049_CR42","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805"},{"key":"21049_CR43","unstructured":"Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473"},{"key":"21049_CR44","doi-asserted-by":"crossref","unstructured":"Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077\u20136086","DOI":"10.1109\/CVPR.2018.00636"},{"key":"21049_CR45","unstructured":"Schwartz I, Schwing A, Hazan T (2017) High-order attention models for visual question answering. Adv Neural Inform Process Syst 30"},{"key":"21049_CR46","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1016\/j.cviu.2019.05.001","volume":"185","author":"A Osman","year":"2019","unstructured":"Osman A, Samek W (2019) Drau: dual recurrent attention units for visual question answering. Comput Vis Image Underst 185:24\u201330","journal-title":"Comput Vis Image Underst"},{"key":"21049_CR47","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.inffus.2021.02.022","volume":"73","author":"S Zhang","year":"2021","unstructured":"Zhang S, Chen M, Chen J, Zou F, Li Y-F, Lu P (2021) Multimodal feature-wise co-attention method for visual question answering. Inform Fusion 73:1\u201310","journal-title":"Inform Fusion"},{"key":"21049_CR48","doi-asserted-by":"crossref","unstructured":"Yu L, Park E, Berg AC, Berg TL (2015) Visual madlibs: Fill in the blank description generation and question answering. In: Proceedings of the Ieee international conference on computer vision, pp 2461\u20132469","DOI":"10.1109\/ICCV.2015.283"},{"key":"21049_CR49","doi-asserted-by":"publisher","first-page":"451","DOI":"10.1016\/J.NEUCOM.2021.08.117","volume":"465","author":"T Le","year":"2021","unstructured":"Le T, Nguyen HT, Nguyen ML (2021) Multi visual and textual embedding on visual question answering for blind people. Neurocomputing 465:451\u2013464. https:\/\/doi.org\/10.1016\/J.NEUCOM.2021.08.117","journal-title":"Neurocomputing"},{"key":"21049_CR50","doi-asserted-by":"crossref","unstructured":"Bhardwaj J, Balakrishnan A, Pathak S, Unnarkar I, Gawande A, Ahmadnia B (2023) Multimodal learning for accurate visual question answering: An attention-based approach. In: Mitkov R, Angelova G (eds) Proceedings of the international conference on recent advances in natural language processing, pp 179\u2013186","DOI":"10.26615\/978-954-452-092-2_020"},{"key":"21049_CR51","doi-asserted-by":"crossref","unstructured":"Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847","DOI":"10.18653\/v1\/D16-1044"},{"issue":"12","key":"21049_CR52","doi-asserted-by":"publisher","first-page":"5947","DOI":"10.1109\/TNNLS.2018.2817340","volume":"29","author":"Z Yu","year":"2018","unstructured":"Yu Z, Yu J, Xiang C, Fan J, Tao D (2018) Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans Neural Netw Learn Syst 29(12):5947\u20135959","journal-title":"IEEE Trans Neural Netw Learn Syst"},{"key":"21049_CR53","volume":"80","author":"Y Xi","year":"2020","unstructured":"Xi Y, Zhang Y, Ding S, Wan S (2020) Visual question answering model based on visual relationship detection. Signal Processing: Image Communication 80:115648","journal-title":"Signal Processing: Image Communication"},{"key":"21049_CR54","doi-asserted-by":"crossref","unstructured":"Agrawal A, Batra D, Parikh D (2016) Analyzing the behavior of visual question answering models. In: Proceedings of the conference on empirical methods in natural language processing, pp 1955\u20131960","DOI":"10.18653\/v1\/D16-1203"},{"key":"21049_CR55","doi-asserted-by":"crossref","unstructured":"Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D (2017) Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6904\u20136913","DOI":"10.1109\/CVPR.2017.670"},{"key":"21049_CR56","unstructured":"Ramakrishnan S, Agrawal A, Lee S (2018) Overcoming language priors in visual question answering with adversarial regularization. Adv Neural Inform Process Syst 31"},{"key":"21049_CR57","unstructured":"Cadene R, Dancette C, Cord M, Parikh D et al (2019) Rubi: Reducing unimodal biases for visual question answering. Adv Neural Inform Process Syst 32"},{"key":"21049_CR58","doi-asserted-by":"crossref","unstructured":"Chen L, Yan X, Xiao J, Zhang H, Pu S, Zhuang Y (2020) Counterfactual samples synthesizing for robust visual question answering. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 10800\u201310809","DOI":"10.1109\/CVPR42600.2020.01081"},{"key":"21049_CR59","doi-asserted-by":"crossref","unstructured":"Agrawal A, Batra D, Parikh D, Kembhavi A (2018) Don\u2019t just assume; look and answer: Overcoming priors for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4971\u20134980","DOI":"10.1109\/CVPR.2018.00522"},{"key":"21049_CR60","doi-asserted-by":"crossref","unstructured":"Lee-Thorp J, Ainslie J, Eckstein I, Ontanon S (2021) Fnet: Mixing tokens with fourier transforms. arXiv preprint arXiv:2105.03824","DOI":"10.18653\/v1\/2022.naacl-main.319"},{"key":"21049_CR61","unstructured":"Tan M, Le Q (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. PMLR, pp 6105\u20136114"},{"key":"21049_CR62","doi-asserted-by":"crossref","unstructured":"Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4510\u20134520","DOI":"10.1109\/CVPR.2018.00474"},{"key":"21049_CR63","unstructured":"Kazemi V, Elqursh A (2017) Show, ask, attend, and answer: A strong baseline for visual question answering. arXiv preprint arXiv:1704.03162"},{"key":"21049_CR64","unstructured":"Zhang Y, Hare J, Pr\u00fcgel-Bennett A (2018) Learning to count objects in natural images for visual question answering. arXiv preprint arXiv:1802.05766"},{"key":"21049_CR65","unstructured":"Kim J-H, Jun J, Zhang B-T (2018) Bilinear attention networks. Adv Neural Inform Process Syst 31"}],"container-title":["Multimedia Tools and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11042-025-21049-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11042-025-21049-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11042-025-21049-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,16]],"date-time":"2025-12-16T07:26:03Z","timestamp":1765869963000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11042-025-21049-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,25]]},"references-count":65,"journal-issue":{"issue":"39","published-online":{"date-parts":[[2025,11]]}},"alternative-id":["21049"],"URL":"https:\/\/doi.org\/10.1007\/s11042-025-21049-w","relation":{},"ISSN":["1573-7721"],"issn-type":[{"value":"1573-7721","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,25]]},"assertion":[{"value":"28 August 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"8 July 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 July 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 July 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that this work does not have any competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}