{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T15:17:41Z","timestamp":1774451861473,"version":"3.50.1"},"reference-count":41,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2022,10,30]],"date-time":"2022-10-30T00:00:00Z","timestamp":1667088000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,10,30]],"date-time":"2022-10-30T00:00:00Z","timestamp":1667088000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100014959","name":"Guangxi Key Laboratory of Multi-Source Information Mining and Security","doi-asserted-by":"publisher","award":["MIMS20-04"],"award-info":[{"award-number":["MIMS20-04"]}],"id":[{"id":"10.13039\/501100014959","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100014959","name":"Guangxi Key Laboratory of Multi-Source Information Mining and Security","doi-asserted-by":"publisher","award":["20-A-01-02"],"award-info":[{"award-number":["20-A-01-02"]}],"id":[{"id":"10.13039\/501100014959","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Innovation Project of Guangxi Graduate Education","award":["YCSW2022124"],"award-info":[{"award-number":["YCSW2022124"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Data Sci. Eng."],"published-print":{"date-parts":[[2022,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Visual question answering is a complex multimodal task involving images and text, with broad application prospects in human\u2013computer interaction and medical assistance. Therefore, how to deal with the feature interaction and multimodal feature fusion between the critical regions in the image and the keywords in the question is an important issue. To this end, we propose a neural network based on the encoder\u2013decoder structure of the transformer architecture. Specifically, in the encoder, we use multi-head self-attention to mine word\u2013word connections within question features and stack multiple layers of attention to obtain multi-level question features. We propose a mutual attention module to perform information exchange between modalities for better question features and image features representation on the decoder side. Besides, we connect the encoder and decoder in a meshed manner, perform mutual attention operations with multi-level question features, and aggregate information in an adaptive way. We propose a multi-scale fusion module in the fusion stage, which utilizes feature information at different scales to complete modal fusion. We test and validate the model effectiveness on VQA v1 and VQA v2 datasets. Our model achieves better results than state-of-the-art methods.<\/jats:p>","DOI":"10.1007\/s41019-022-00200-9","type":"journal-article","created":{"date-parts":[[2022,10,30]],"date-time":"2022-10-30T15:02:31Z","timestamp":1667142151000},"page":"339-353","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["A Multi-level Mesh Mutual Attention Model for Visual Question Answering"],"prefix":"10.1007","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0169-7666","authenticated-orcid":false,"given":"Zhi","family":"Lei","sequence":"first","affiliation":[]},{"given":"Guixian","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Lijuan","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Kui","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Rongjiao","family":"Liang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,10,30]]},"reference":[{"key":"200_CR1","doi-asserted-by":"crossref","unstructured":"Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425\u20132433","DOI":"10.1109\/ICCV.2015.279"},{"key":"200_CR2","unstructured":"Weiss M, Chamorro S, Girgis R, Luck M, Kahou SE, Cohen JP, Nowrouzezahrai D, Precup D, Golemo F, Pal C (2020) Navigation agents for the visually impaired: a sidewalk simulator and experiments. In: Conference on robot learning. PMLR, pp 1314\u20131327"},{"key":"200_CR3","doi-asserted-by":"crossref","unstructured":"Bghiel A, Dahdouh Y, Allaouzi I, Ben Ahmed M, Anouar Boudhir A (2019) Visual question answering system for identifying medical images attributes. In: The proceedings of the third international conference on smart city applications. Springer, pp 483\u2013492","DOI":"10.1007\/978-3-030-37629-1_35"},{"key":"200_CR4","doi-asserted-by":"crossref","unstructured":"Malinowski M, Rohrbach M, Fritz M (2015) Ask your neurons: a neural-based approach to answering questions about images. In: Proceedings of the IEEE international conference on computer vision, pp 1\u20139","DOI":"10.1109\/ICCV.2015.9"},{"key":"200_CR5","unstructured":"Zhou B, Tian Y, Sukhbaatar S, Szlam A, Fergus R (2015) Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167"},{"key":"200_CR6","unstructured":"Lu J, Yang J, Batra D, Parikh D (2016) Hierarchical co-attention for visual question answering. In: Advances in neural information processing systems (NIPS) 2"},{"key":"200_CR7","unstructured":"Kim J-H, Jun J, Zhang B-T (2018) Bilinear attention networks. In: Advances in neural information processing systems 31"},{"key":"200_CR8","doi-asserted-by":"crossref","unstructured":"Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847","DOI":"10.18653\/v1\/D16-1044"},{"key":"200_CR9","doi-asserted-by":"crossref","unstructured":"Ma Y, Lu T, Wu Y (2021) Multi-scale relational reasoning with regional attention for visual question answering. In: 2020 25th international conference on pattern recognition (ICPR). IEEE, pp 5642\u20135649","DOI":"10.1109\/ICPR48806.2021.9413140"},{"key":"200_CR10","doi-asserted-by":"crossref","unstructured":"Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077\u20136086","DOI":"10.1109\/CVPR.2018.00636"},{"key":"200_CR11","doi-asserted-by":"crossref","unstructured":"Zhu C, Zhao Y, Huang S, Tu K, Ma Y (2017) Structured attentions for visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 1291\u20131300","DOI":"10.1109\/ICCV.2017.145"},{"key":"200_CR12","doi-asserted-by":"crossref","unstructured":"Shih KJ, Singh S, Hoiem D (2016) Where to look: focus regions for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4613\u20134621","DOI":"10.1109\/CVPR.2016.499"},{"key":"200_CR13","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems 30"},{"key":"200_CR14","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805"},{"key":"200_CR15","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16 $$\\times$$ 16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929"},{"key":"200_CR16","doi-asserted-by":"crossref","unstructured":"Yu W, Luo M, Zhou P, Si C, Zhou Y, Wang X, Feng J, Yan S (2021) Metaformer is actually what you need for vision. arXiv preprint arXiv:2111.11418","DOI":"10.1109\/CVPR52688.2022.01055"},{"key":"200_CR17","doi-asserted-by":"crossref","unstructured":"Nguyen D-K, Okatani T (2018) Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6087\u20136096","DOI":"10.1109\/CVPR.2018.00637"},{"key":"200_CR18","doi-asserted-by":"crossref","unstructured":"Patro B, Namboodiri VP (2018) Differential attention for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7680\u20137688","DOI":"10.1109\/CVPR.2018.00801"},{"key":"200_CR19","doi-asserted-by":"publisher","first-page":"40771","DOI":"10.1109\/ACCESS.2019.2908035","volume":"7","author":"C Yang","year":"2019","unstructured":"Yang C, Jiang M, Jiang B, Zhou W, Li K (2019) Co-attention network with question type for visual question answering. IEEE Access 7:40771\u201340781","journal-title":"IEEE Access"},{"key":"200_CR20","unstructured":"Malinowski M, Fritz M (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. In: Advances in neural information processing systems 27"},{"key":"200_CR21","doi-asserted-by":"crossref","unstructured":"Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6904\u20136913","DOI":"10.1109\/CVPR.2017.670"},{"key":"200_CR22","unstructured":"Gao H, Mao J, Zhou J, Huang Z, Wang L, Xu W (2015) Are you talking to a machine? Dataset and methods for multilingual image question. In: Advances in neural information processing systems 28"},{"key":"200_CR23","doi-asserted-by":"crossref","unstructured":"Noh H, Seo P.H, Han B (2016) Image question answering using convolutional neural network with dynamic parameter prediction. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 30\u201338","DOI":"10.1109\/CVPR.2016.11"},{"key":"200_CR24","unstructured":"Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems 28"},{"key":"200_CR25","doi-asserted-by":"crossref","unstructured":"Lu P, Li H, Zhang W, Wang J, Wang X (2018) Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering. In: Proceedings of the AAAI conference on artificial intelligence, vol 32","DOI":"10.1609\/aaai.v32i1.12240"},{"key":"200_CR26","doi-asserted-by":"crossref","unstructured":"Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 21\u201329","DOI":"10.1109\/CVPR.2016.10"},{"key":"200_CR27","unstructured":"Xiong C, Merity S, Socher R (2016) Dynamic memory networks for visual and textual question answering. In: International conference on machine learning. PMLR, pp 2397\u20132406"},{"key":"200_CR28","doi-asserted-by":"crossref","unstructured":"Nam H, Ha J-W, Kim J (2017) Dual attention networks for multimodal reasoning and matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 299\u2013307","DOI":"10.1109\/CVPR.2017.232"},{"key":"200_CR29","doi-asserted-by":"crossref","unstructured":"Teney D, Anderson P, He X, Van Den Hengel A (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4223\u20134232","DOI":"10.1109\/CVPR.2018.00444"},{"issue":"1","key":"200_CR30","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","volume":"123","author":"R Krishna","year":"2017","unstructured":"Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32\u201373","journal-title":"Int J Comput Vis"},{"key":"200_CR31","doi-asserted-by":"crossref","unstructured":"Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 6281\u20136290","DOI":"10.1109\/CVPR.2019.00644"},{"key":"200_CR32","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"200_CR33","unstructured":"Lei Ba J, Kiros JR, Hinton GE (2016) Layer normalization. arXiv e-prints, 1607"},{"key":"200_CR34","first-page":"740","volume-title":"Microsoft coco: common objects in context","author":"TY Lin","year":"2014","unstructured":"Lin TY, Maire M, Belongie S, Hays J, Zitnick CL (2014) Microsoft coco: common objects in context. Springer, pp 740\u2013755"},{"key":"200_CR35","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.inffus.2021.02.022","volume":"73","author":"S Zhang","year":"2021","unstructured":"Zhang S, Chen M, Chen J, Zou F, Li Y-F, Lu P (2021) Multimodal feature-wise co-attention method for visual question answering. Inf Fusion 73:1\u201310","journal-title":"Inf Fusion"},{"issue":"2","key":"200_CR36","doi-asserted-by":"publisher","first-page":"64","DOI":"10.1145\/2812802","volume":"59","author":"B Thomee","year":"2016","unstructured":"Thomee B, Elizalde B, Shamma DA, Ni K, Friedland G, Poland D, Borth D, Li LJ (2016) Yfcc100m: the new data in multimedia research. Commun ACM 59(2):64\u201373","journal-title":"Commun ACM"},{"key":"200_CR37","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1016\/j.cviu.2017.06.005","volume":"163","author":"K Kafle","year":"2017","unstructured":"Kafle K, Kanan C (2017) Visual question answering: datasets, algorithms, and future challenges. Comput Vis Image Underst 163:3\u201320","journal-title":"Comput Vis Image Underst"},{"key":"200_CR38","unstructured":"Kim J-H, On KW, Lim W, Kim J, Ha J-W, Zhang B-T (2017) Hadamard Product for Low-rank Bilinear Pooling. In: The 5th international conference on learning representations"},{"key":"200_CR39","doi-asserted-by":"publisher","first-page":"334","DOI":"10.1016\/j.patrec.2020.02.031","volume":"133","author":"W Li","year":"2020","unstructured":"Li W, Sun J, Liu G, Zhao L, Fang X (2020) Visual question answering with attention transfer and a cross-modal gating mechanism. Pattern Recognit Lett 133:334\u2013340","journal-title":"Pattern Recognit Lett"},{"issue":"9","key":"200_CR40","doi-asserted-by":"publisher","first-page":"3894","DOI":"10.1109\/TNNLS.2020.3016083","volume":"32","author":"Y Liu","year":"2020","unstructured":"Liu Y, Zhang X, Huang F, Cheng L, Li Z (2020) Adversarial learning with multi-modal attention for visual question answering. IEEE Trans Neural Netw Learn Syst 32(9):3894\u20133908","journal-title":"IEEE Trans Neural Netw Learn Syst"},{"key":"200_CR41","doi-asserted-by":"crossref","unstructured":"Sun Q, Xie B, Fu Y (2020) Second order enhanced multi-glimpse attention in visual question answering. In: Proceedings of the Asian conference on computer vision","DOI":"10.1007\/978-3-030-69538-5_6"}],"container-title":["Data Science and Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41019-022-00200-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s41019-022-00200-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41019-022-00200-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,10,6]],"date-time":"2024-10-06T21:45:13Z","timestamp":1728251113000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s41019-022-00200-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,30]]},"references-count":41,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2022,12]]}},"alternative-id":["200"],"URL":"https:\/\/doi.org\/10.1007\/s41019-022-00200-9","relation":{},"ISSN":["2364-1185","2364-1541"],"issn-type":[{"value":"2364-1185","type":"print"},{"value":"2364-1541","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,10,30]]},"assertion":[{"value":"4 August 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 September 2022","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 October 2022","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 October 2022","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}