{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,14]],"date-time":"2026-05-14T23:27:56Z","timestamp":1778801276863,"version":"3.51.4"},"reference-count":55,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,4,29]],"date-time":"2022-04-29T00:00:00Z","timestamp":1651190400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,4,29]],"date-time":"2022-04-29T00:00:00Z","timestamp":1651190400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Med Imaging"],"published-print":{"date-parts":[[2022,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Visual question answering in medical domain (VQA-Med) exhibits great potential for enhancing confidence in diagnosing diseases and helping patients better understand their medical conditions. One of the challenges in VQA-Med is how to better understand and combine the semantic features of medical images (e.g., X-rays, Magnetic Resonance Imaging(MRI)) and answer the corresponding questions accurately in unlabeled medical datasets.<\/jats:p><\/jats:sec><jats:sec><jats:title>Method<\/jats:title><jats:p>We propose a novel Bi-branched model based on Parallel networks and Image retrieval for Medical Visual Question Answering (BPI-MVQA). The first branch of BPI-MVQA is a transformer structure based on a parallel network to achieve complementary advantages in image sequence feature and spatial feature extraction, and multi-modal features are implicitly fused by using the multi-head self-attention mechanism. The second branch is retrieving the similarity of image features generated by the VGG16 network to obtain similar text descriptions as labels.<\/jats:p><\/jats:sec><jats:sec><jats:title>Result<\/jats:title><jats:p>The BPI-MVQA model achieves state-of-the-art results on three VQA-Med datasets, and the main metric scores exceed the best results so far by 0.2<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\%$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mo>%<\/mml:mo><\/mml:math><\/jats:alternatives><\/jats:inline-formula>, 1.4<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\%$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mo>%<\/mml:mo><\/mml:math><\/jats:alternatives><\/jats:inline-formula>, and 1.1<jats:inline-formula><jats:alternatives><jats:tex-math>$$\\%$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mml:mo>%<\/mml:mo><\/mml:math><\/jats:alternatives><\/jats:inline-formula>.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>The evaluation results support the effectiveness of the BPI-MVQA model in VQA-Med. The design of the bi-branch structure helps the model answer different types of visual questions. The parallel network allows for multi-angle image feature extraction, a unique feature extraction method that helps the model better understand the semantic information of the image and achieve greater accuracy in the multi-classification of VQA-Med. In addition, image retrieval helps the model answer irregular, open-ended type questions from the perspective of understanding the information provided by images. The comparison of our method with state-of-the-art methods on three datasets also shows that our method can bring substantial improvement to the VQA-Med system.<\/jats:p><\/jats:sec>","DOI":"10.1186\/s12880-022-00800-x","type":"journal-article","created":{"date-parts":[[2022,4,29]],"date-time":"2022-04-29T08:04:33Z","timestamp":1651219473000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":24,"title":["BPI-MVQA: a bi-branch model for medical visual question answering"],"prefix":"10.1186","volume":"22","author":[{"given":"Shengyan","family":"Liu","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xuejie","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaobing","family":"Zhou","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jian","family":"Yang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,4,29]]},"reference":[{"key":"800_CR1","unstructured":"Weston J, Bordes A, Chopra S, Rush AM, van Merri\u00ebnboer B, Joulin A, Mikolov T. Towards ai-complete question answering: A set of prerequisite toy tasks. 2015. arXiv preprint arXiv:1502.05698."},{"issue":"7","key":"800_CR2","doi-asserted-by":"publisher","first-page":"6799","DOI":"10.3390\/s110706799","volume":"11","author":"P-C Hii","year":"2011","unstructured":"Hii P-C, Chung W-Y. A comprehensive ubiquitous healthcare solution on an android mobile device. Sensors. 2011;11(7):6799\u2013815.","journal-title":"Sensors"},{"issue":"2","key":"800_CR3","doi-asserted-by":"publisher","first-page":"277","DOI":"10.1016\/j.jbi.2011.01.004","volume":"44","author":"Y Cao","year":"2011","unstructured":"Cao Y, Liu F, Simpson P, Antieau L, Bennett A, Cimino JJ, Ely J, Hong Yu. Askhermes: an online question answering system for complex clinical questions. J Biomed Inform. 2011;44(2):277\u201388.","journal-title":"J Biomed Inform"},{"key":"800_CR4","doi-asserted-by":"crossref","unstructured":"Paramasivam A, Jaya NS. A survey on textual entailment based question answering. J King Saud Univ-Comput Inform Sci. 2021.","DOI":"10.1016\/j.jksuci.2021.11.017"},{"issue":"3","key":"800_CR5","doi-asserted-by":"publisher","first-page":"81","DOI":"10.1136\/ebmed-2014-110146","volume":"20","author":"A Izcovich","year":"2015","unstructured":"Izcovich A, Criniti JM, Ruiz JI, Catalano HN. Impact of a grade-based medical question answering system on physician behaviour: a randomised controlled trial. BMJ Evid-Based Med. 2015;20(3):81\u20137.","journal-title":"BMJ Evid-Based Med"},{"key":"800_CR6","doi-asserted-by":"crossref","unstructured":"Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence ZC, Parikh D. Vqa: Visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), 2016.","DOI":"10.1109\/ICCV.2015.279"},{"key":"800_CR7","unstructured":"Hasan Sadid\u00a0A, Yuan L, Farri O, Liu J, M\u00fcller H. Overview of imageclef 2018 medical domain visual question answering task. In: CLEF working Notes, 2018."},{"issue":"8","key":"800_CR8","doi-asserted-by":"publisher","first-page":"334","DOI":"10.3390\/info12080334","volume":"12","author":"M Sarrouti","year":"2021","unstructured":"Sarrouti M, Ben Abacha A, Demner-Fushman D. Goal-driven visual question generation from radiology images. Information. 2021;12(8):334.","journal-title":"Information"},{"key":"800_CR9","doi-asserted-by":"crossref","unstructured":"Thompson T, Grove L, Brown J, Buchan J, Burge S. Cogconnect: a new visual resource for teaching and learning effective consulting. Patient Educ Counsel. 2021.","DOI":"10.1016\/j.pec.2020.12.016"},{"key":"800_CR10","unstructured":"Sheng-Dong N, Bin Z, Wen L. Design of computer-aided detection and classification of lung nodules using ct images. J Syst Simul. 2007."},{"key":"800_CR11","unstructured":"Cid YD, Liauchuk V, Kovalev V, M\u00fcller H. Overview of image cleftuberculosis 2018-detecting multi-drug resistance, classifying tuberculosis types and assessing severity scores. In CLEF (Working Notes). 2018."},{"issue":"6","key":"800_CR12","first-page":"316","volume":"9","author":"M Nawaz","year":"2018","unstructured":"Nawaz M, Sewissy AA, Soliman THA. Multi-class breast cancer classification using deep learning convolutional neural network. Int J Adv Comput Sci Appl. 2018;9(6):316\u201332.","journal-title":"Int J Adv Comput Sci Appl"},{"key":"800_CR13","first-page":"5998","volume":"30","author":"A Vaswani","year":"2017","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:5998\u20136008.","journal-title":"Adv Neural Inf Process Syst"},{"key":"800_CR14","unstructured":"Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014."},{"key":"800_CR15","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016;770\u2013778.","DOI":"10.1109\/CVPR.2016.90"},{"issue":"3","key":"800_CR16","doi-asserted-by":"publisher","first-page":"4109","DOI":"10.32604\/cmc.2021.016736","volume":"68","author":"K Srinivasan","year":"2021","unstructured":"Srinivasan K, Garg L, Datta D, Alaboudi AA, Jhanjhi NZ, Agarwal R, Thomas AG. Performance comparison of deep cnn models for detecting driver\u2019s distraction. CMC-Comput Mater Continua. 2021;68(3):4109\u201324.","journal-title":"CMC-Comput Mater Continua"},{"key":"800_CR17","doi-asserted-by":"crossref","unstructured":"Cho K, Van\u00a0Merri\u00ebnboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.","DOI":"10.3115\/v1\/D14-1179"},{"issue":"4","key":"800_CR18","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","volume":"36","author":"J Lee","year":"2020","unstructured":"Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234\u201340.","journal-title":"Bioinformatics"},{"key":"800_CR19","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018."},{"key":"800_CR20","unstructured":"Peng Y, Liu F, Rosen MP. Umass at imageclef medical visual question answering (med-vqa) 2018 task. In CLEF (Working Notes), 2018."},{"key":"800_CR21","doi-asserted-by":"crossref","unstructured":"Yu Z, Yu J, Fan J, Tao D. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE international conference on computer vision, 2017;1821\u20131830.","DOI":"10.1109\/ICCV.2017.202"},{"key":"800_CR22","unstructured":"Zhou Y, Kang X, Ren F. Employing inception-resnet-v2 and bi-lstm for medical domain visual question answering. In CLEF (Working Notes), 2018."},{"key":"800_CR23","unstructured":"Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016."},{"issue":"11","key":"800_CR24","doi-asserted-by":"publisher","first-page":"2673","DOI":"10.1109\/78.650093","volume":"45","author":"M Schuster","year":"1997","unstructured":"Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45(11):2673\u201381.","journal-title":"IEEE Trans Signal Process"},{"key":"800_CR25","unstructured":"Abacha AB, Gayen S, Lau JJ, Rajaraman Snan, Demner-Fushman Dina. Nlm at imageclef 2018 visual question answering in the medical domain. In CLEF (Working Notes), 2018."},{"key":"800_CR26","doi-asserted-by":"crossref","unstructured":"Yang Z, He X, Gao J, Deng L, Smola A. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016;21\u201329.","DOI":"10.1109\/CVPR.2016.10"},{"key":"800_CR27","unstructured":"Zhejiang University at ImageCLEF 2019 Visual Question Answering in the Medical Domain. 2019."},{"key":"800_CR28","unstructured":"Kornuta T, Rajan D, Shivade C, Asseman A, Ozcan AS. Leveraging medical visual question answering with supporting facts. arXiv preprint arXiv:1905.12008, 2019."},{"key":"800_CR29","unstructured":"Liao Z, Wu Q, Shen C, Van Den\u00a0Hengel A, Verjans J. Aiml at vqa-med 2020: Knowledge inference via a skeleton-based sentence mapping approach for medical domain visual question answering. 2020."},{"key":"800_CR30","unstructured":"Al-Sadi A, Hana\u2019Al-Theiabat, Al-Ayyoub M. The inception team at vqa-med 2020: Pretrained vgg with data augmentation for medical vqa and vqg. In CLEF (Working Notes), 2020."},{"key":"800_CR31","doi-asserted-by":"crossref","unstructured":"Zhan L-M, Liu B, Fan L, Chen J, Wu X-M. Medical visual question answering via conditional reasoning. In Proceedings of the 28th ACM International Conference on Multimedia, 2020;2345\u20132354.","DOI":"10.1145\/3394171.3413761"},{"key":"800_CR32","unstructured":"Xiao Qian, Zhou Xiaobing, Xiao Y, Zhao K. Yunnan university at vqa-med,. Pretrained biobert for medical domain visual question answering. Working Notes of CLEF. 2021;201:2021."},{"key":"800_CR33","doi-asserted-by":"publisher","first-page":"113993","DOI":"10.1016\/j.eswa.2020.113993","volume":"164","author":"D Gupta","year":"2021","unstructured":"Gupta D, Suman S, Ekbal A. Hierarchical deep multi-modal network for medical visual question answering. Expert Syst Appl. 2021;164:113993.","journal-title":"Expert Syst Appl"},{"key":"800_CR34","doi-asserted-by":"crossref","unstructured":"Do T, Nguyen BX, Tjiputra E, Tran M, Tran QD, Nguyen Anh. Multiple meta-model quantifying for medical visual question answering. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 64\u201374. Springer, 2021.","DOI":"10.1007\/978-3-030-87240-3_7"},{"key":"800_CR35","unstructured":"Lin Z, Zhang D, Tac Q, Shi D, Haffari G, Wu Q, He M, Ge Z. Medical visual question answering: A survey. arXiv preprint arXiv:2111.10056, 2021."},{"key":"800_CR36","first-page":"2953","volume":"28","author":"M Ren","year":"2015","unstructured":"Ren M, Kiros R, Zemel R. Exploring models and data for image question answering. Adv Neural Inf Process Syst. 2015;28:2953\u201361.","journal-title":"Adv Neural Inf Process Syst"},{"key":"800_CR37","first-page":"2296","volume":"28","author":"H Gao","year":"2015","unstructured":"Gao H, Mao J, Zhou J, Huang Z, Wang L, Wei X. Are you talking to a machine? dataset and methods for multilingual image question. Adv Neural Inf Process Syst. 2015;28:2296\u2013304.","journal-title":"Adv Neural Inf Process Syst"},{"issue":"1","key":"800_CR38","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","volume":"123","author":"R Krishna","year":"2017","unstructured":"Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int J Comput Vision. 2017;123(1):32\u201373.","journal-title":"Int J Comput Vision"},{"key":"800_CR39","doi-asserted-by":"crossref","unstructured":"Zhu Y, Groth O, Bernstein M, Fei-Fei L. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016;4995\u20135004.","DOI":"10.1109\/CVPR.2016.540"},{"key":"800_CR40","doi-asserted-by":"crossref","unstructured":"Johnson J, Hariharan B, van\u00a0der Maaten L, Fei-Fei L, Lawrence\u00a0Zitnick C, Girshick R. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017;2901\u20132910.","DOI":"10.1109\/CVPR.2017.215"},{"key":"800_CR41","doi-asserted-by":"crossref","unstructured":"Kafle K, Yousefhussien M, Kanan C. Data augmentation for visual question answering. In Proceedings of the 10th International Conference on Natural Language Generation, 2017;198\u2013202.","DOI":"10.18653\/v1\/W17-3529"},{"key":"800_CR42","doi-asserted-by":"crossref","unstructured":"Li Q, Tao Q, Joty S, Cai J, Luo J. Vqa-e: Explaining, elaborating, and enhancing your answers for visual questions. In Proceedings of the European Conference on Computer Vision (ECCV), 2018;552\u2013567.","DOI":"10.1007\/978-3-030-01234-2_34"},{"key":"800_CR43","unstructured":"Lu J, Batra D, Parikh D, Lee S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in neural information processing systems, 2019;13\u201323."},{"key":"800_CR44","doi-asserted-by":"crossref","unstructured":"Tan H, Bansal M. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.","DOI":"10.18653\/v1\/D19-1514"},{"key":"800_CR45","unstructured":"Lin M, Chen Q, Yan S. Network in network. arXiv preprint arXiv:1312.4400, 2013."},{"key":"800_CR46","unstructured":"Kougia V, Pavlopoulos J, Androutsopoulos I. Aueb nlp group at imageclefmed caption 2019. In CLEF (Working Notes), 2019."},{"key":"800_CR47","first-page":"1682","volume":"27","author":"M Malinowski","year":"2014","unstructured":"Malinowski M, Fritz M. A multi-world approach to question answering about real-world scenes based on uncertain input. Adv Neural Inf Process Syst. 2014;27:1682\u201390.","journal-title":"Adv Neural Inf Process Syst"},{"key":"800_CR48","first-page":"26","volume":"1","author":"AR Aronson","year":"2006","unstructured":"Aronson AR. Metamap: Mapping text to the umls metathesaurus. Bethesda, MD: NLM, NIH, DHHS. 2006;1:26.","journal-title":"Bethesda, MD: NLM, NIH, DHHS"},{"key":"800_CR49","unstructured":"Allaouzi I, Ahmed MB. Deep neural networks and decision tree classifier for visual question answering in the medical domain. In CLEF (Working Notes), 2018."},{"key":"800_CR50","unstructured":"Vu M, Sznitman R, Nyholm T, L\u00f6fstedt T. Ensemble of streamlined bilinear visual question answering models for the imageclef 2019 challenge in the medical domain. In CLEF 2019-Conference and Labs of the Evaluation Forum, Lugano, Switzerland, Sept 9-12, 2019, volume 2380, 2019."},{"key":"800_CR51","unstructured":"Shi L, Liu F, Rosen MP. Deep multimodal learning for medical visual question answering. In CLEF (Working Notes), 2019."},{"key":"800_CR52","doi-asserted-by":"publisher","first-page":"50626","DOI":"10.1109\/ACCESS.2020.2980024","volume":"8","author":"F Ren","year":"2020","unstructured":"Ren F, Zhou Y. Cgmvqa: A new classification and generative model for medical visual question answering. IEEE Access. 2020;8:50626\u201336.","journal-title":"IEEE Access"},{"key":"800_CR53","doi-asserted-by":"crossref","unstructured":"Nguyen BD, Do T-T, Nguyen BX, Do T, Tjiputra E, Tran QD. Overcoming data limitation in medical visual question answering. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 522\u2013530. Springer, 2019.","DOI":"10.1007\/978-3-030-32251-9_57"},{"key":"800_CR54","unstructured":"Li LH, Yatskar M, Yin D, Hsieh C-J, Chang K-W. Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019."},{"key":"800_CR55","unstructured":"Qi D, Su L, Song J, Cui E, Bharti T, Sacheti A. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966, 2020."}],"container-title":["BMC Medical Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12880-022-00800-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12880-022-00800-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12880-022-00800-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,4]],"date-time":"2023-02-04T00:14:13Z","timestamp":1675469653000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcmedimaging.biomedcentral.com\/articles\/10.1186\/s12880-022-00800-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,29]]},"references-count":55,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,12]]}},"alternative-id":["800"],"URL":"https:\/\/doi.org\/10.1186\/s12880-022-00800-x","relation":{},"ISSN":["1471-2342"],"issn-type":[{"value":"1471-2342","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,4,29]]},"assertion":[{"value":"21 July 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 April 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 April 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"There are no conflicting interests known to the authors.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"79"}}