{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T08:47:28Z","timestamp":1764146848765,"version":"3.46.0"},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T00:00:00Z","timestamp":1764115200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"},{"start":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T00:00:00Z","timestamp":1764115200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.springernature.com\/gp\/researchers\/text-and-data-mining"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach. Intell. Res."],"published-print":{"date-parts":[[2025,12]]},"DOI":"10.1007\/s11633-025-1564-2","type":"journal-article","created":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T08:43:39Z","timestamp":1764146619000},"page":"1127-1137","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Answer Semantics-enhanced Medical Visual Question 
Answering"],"prefix":"10.1007","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2946-7910","authenticated-orcid":false,"given":"Yuliang","family":"Liang","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5419-5286","authenticated-orcid":false,"given":"Enneng","family":"Yang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1709-5056","authenticated-orcid":false,"given":"Guibing","family":"Guo","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0704-6621","authenticated-orcid":false,"given":"Wei","family":"Cai","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7492-0473","authenticated-orcid":false,"given":"Linying","family":"Jiang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4492-5075","authenticated-orcid":false,"given":"Jianzhe","family":"Zhao","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2856-4716","authenticated-orcid":false,"given":"Xingwei","family":"Wang","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,11,26]]},"reference":[{"key":"1564_CR1","doi-asserted-by":"publisher","unstructured":"Z. Lin, D. Zhang, Q. Tao, D. Shi, G. Haffari, Q. Wu, M. He, Z. Ge. Medical visual question answering: A survey. Artificial Intelligence in Medicine, vol. 143, Article number 102611, 2023. DOI: https:\/\/doi.org\/10.1016\/j.artmed.2023.102611.","DOI":"10.1016\/j.artmed.2023.102611"},{"key":"1564_CR2","first-page":"679","volume-title":"Proceedings of the 25th International Conference on Medical Image Computing and Computer Assisted Intervention","author":"Z Chen","year":"2022","unstructured":"Z. Chen, Y. Du, J. Hu, Y. Liu, G. Li, X. Wan, T. H. Chang. Multi-modal masked autoencoders for medical vision-and-language pre-training. 
In Proceedings of the 25th International Conference on Medical Image Computing and Computer Assisted Intervention, Springer, Singapore, pp. 679\u2013689, 2022."},{"key":"1564_CR3","doi-asserted-by":"publisher","first-page":"1181","DOI":"10.18653\/v1\/2023.findings-eacl.88","volume-title":"Proceedings of Findings of the Association for Computational Linguistics: EACL 2023","author":"S Eslami","year":"2023","unstructured":"S. Eslami, C. Meinel, G. De Melo. PubMedCLIP: How much does clip benefit visual question answering in the medical domain? In Proceedings of Findings of the Association for Computational Linguistics: EACL 2023, Dubrovnik, Croatia, pp. 1181\u20131193, 2023."},{"key":"1564_CR4","first-page":"4101","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Q Si","year":"2021","unstructured":"Q. Si, Z. Lin, M. Zheng, P. Fu, W. Wang. Check it again: Progressive visual question answering via visual entailment. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4101\u20134110, 2021."},{"key":"1564_CR5","doi-asserted-by":"publisher","first-page":"3569","DOI":"10.1145\/3503161.3548122","volume-title":"Proceedings of the 30th ACM International Conference on Multimedia","author":"F Cong","year":"2022","unstructured":"F. Cong, S. Xu, L. Guo, Y. Tian. Caption-aware medical VQA via semantic focusing and progressive cross-modality comprehension. In Proceedings of the 30th ACM International Conference on Multimedia, Association for Computing Machinery, Lisboa, Portugal, pp. 3569\u20133577, 2022."},{"issue":"2","key":"1564_CR6","doi-asserted-by":"publisher","first-page":"147","DOI":"10.1007\/s11633-022-1382-8","volume":"20","author":"Y Qiu","year":"2023","unstructured":"Y. 
Qiu, F. Lin, W. Chen, M. Xu. Pre-training in medical data: A survey. Machine Intelligence Research, vol. 20, no. 2, pp. 147\u2013179, 2023. DOI: https:\/\/doi.org\/10.1007\/s11633-022-1382-8.","journal-title":"Machine Intelligence Research"},{"issue":"1","key":"1564_CR7","doi-asserted-by":"publisher","first-page":"38","DOI":"10.1007\/s11633-022-1369-5","volume":"20","author":"F L Chen","year":"2023","unstructured":"F. L. Chen, D. Z. Zhang, M. L. Han, X. Y. Chen, J. Shi, S. Xu, B. Xu. VLP: A survey on vision-language pre-training. Machine Intelligence Research, vol. 20, no. 1, pp. 38\u201356, 2023. DOI: https:\/\/doi.org\/10.1007\/s11633-022-1369-5.","journal-title":"Machine Intelligence Research"},{"key":"1564_CR8","volume-title":"BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs","author":"S Zhang","year":"2023","unstructured":"S. Zhang, Y. Xu, N. Usuyama, H. W. Xu, J. Bagga, R. Tinn, S. Preston, R. Rao, M. Wei, N. Valluri, C. Wong, A. Tupini, Y. Wang, M. Mazzola, S. Shukla, L. Liden, J. Gao, A. Crabtree, B. Piening, C. Bifulco, M. P. Lungren, T. Naumann, S. Wang, H. Poon. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs, [Online], Available: https:\/\/arxiv.org\/abs\/2303.00915, 2023."},{"key":"1564_CR9","first-page":"1650","volume-title":"Proceedings of the 18th International Symposium on Biomedical Imaging","author":"B Liu","year":"2021","unstructured":"B. Liu, L. M. Zhan, L. Xu, L. Ma, Y. Yang, X. M. Wu. Slake: A semantically-labeled knowledge-enhanced data-set for medical visual question answering. In Proceedings of the 18th International Symposium on Biomedical Imaging, IEEE, Nice, France, pp. 1650\u20131654, 2021."},{"key":"1564_CR10","first-page":"8748","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"A Radford","year":"2021","unstructured":"A. Radford, J. W. Kim, C. Hallacy, A. 
Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748\u20138763, 2021."},{"key":"1564_CR11","first-page":"4904","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"C Jia","year":"2021","unstructured":"C. Jia, Y. Yang, Y. Xia, Y. T. Chen, Z. Parekh, H. Pham, Q. Le, Y. H. Sung, Z. Li, T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 4904\u20134916, 2021."},{"key":"1564_CR12","doi-asserted-by":"publisher","unstructured":"A. E. W. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Y. Deng, R. G. Mark, S. Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, vol. 6, no. 1, Article number 317, 2019. DOI: https:\/\/doi.org\/10.1038\/s41597-019-0322-0.","DOI":"10.1038\/s41597-019-0322-0"},{"key":"1564_CR13","first-page":"522","volume-title":"Proceedings of the 22nd International Conference on Medical Image Computing and Computer Assisted Intervention","author":"B D Nguyen","year":"2019","unstructured":"B. D. Nguyen, T. T. Do, B. X. Nguyen, T. Do, E. Tjiputra, Q. D. Tran. Overcoming data limitation in medical visual question answering. In Proceedings of the 22nd International Conference on Medical Image Computing and Computer Assisted Intervention, Springer, Shenzhen, China, pp. 522\u2013530, 2019."},{"key":"1564_CR14","doi-asserted-by":"publisher","first-page":"2112","DOI":"10.18653\/v1\/2020.findings-emnlp.191","volume-title":"Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2020","author":"S Subramanian","year":"2020","unstructured":"S. Subramanian, L. L. Wang, B. Bogin, S. Mehta, M. 
van Zuylen, S. Parasa, S. Singh, M. Gardner, H. Hajishirzi. MedICaT: A dataset of medical images, captions, and textual references. In Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2112\u20132120, 2020."},{"key":"1564_CR15","first-page":"210","volume-title":"Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention","author":"B Liu","year":"2021","unstructured":"B. Liu, L. M. Zhan, X. M. Wu. Contrastive pre-training and representation distillation for medical visual question answering based on radiology images. In Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention, Springer, Strasbourg, France, pp. 210\u2013220, 2021."},{"key":"1564_CR16","first-page":"1","volume-title":"Proceedings of the 20th International Symposium on Biomedical Imaging","author":"P Li","year":"2023","unstructured":"P. Li, G. Liu, L. Tan, J. Liao, S. Zhong. Self-supervised vision-language pretraining for medical visual question answering. In Proceedings of the 20th International Symposium on Biomedical Imaging, IEEE, Cartagena, Colombia, pp. 1\u20135, 2023."},{"key":"1564_CR17","first-page":"3876","volume-title":"Proceedings of Conference on Empirical Methods in Natural Language Processing","author":"Z Wang","year":"2022","unstructured":"Z. Wang, Z. Wu, D. Agarwal, J. Sun. MedCLIP: Contrastive learning from unpaired medical images and text. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, UAE, pp. 3876\u20133887, 2022."},{"key":"1564_CR18","first-page":"525","volume-title":"Proceedings of the 26th International Conference on Medical Image Computing and Computer Assisted Intervention","author":"W Lin","year":"2023","unstructured":"W. Lin, Z. Zhao, X. Zhang, C. Wu, Y. Zhang, Y. Wang, W. Xie. 
PMC-CLIP: Contrastive language-image pre-training using biomedical documents. In Proceedings of the 26th International Conference on Medical Image Computing and Computer Assisted Intervention, Springer, Vancouver, Canada, pp. 525\u2013536, 2023."},{"key":"1564_CR19","volume-title":"Proceedings of Working Notes of CLEF - Conference and Labs of the Evaluation Forum","author":"Y Zhou","year":"2018","unstructured":"Y. Zhou, X. Kang, F. Ren. Employing inception-resnet-v2 and bi-LSTM for medical domain visual question answering. In Proceedings of Working Notes of CLEF - Conference and Labs of the Evaluation Forum, CEURWS, Avignon, France, 2018. [Online], Available: https:\/\/ir.webis.de\/anthology\/2018.clef_conference-2018w.55\/."},{"key":"1564_CR20","volume-title":"PathVQA: 30000+ questions for medical visual question answering","author":"X He","year":"2020","unstructured":"X. He, Y. Zhang, L. Mou, E. Xing, P. Xie. PathVQA: 30000+ questions for medical visual question answering, [Online], Available: https:\/\/arxiv.org\/abs\/2003.10286, 2020."},{"key":"1564_CR21","volume-title":"GEMeX: A large-scale, groundable, and explainable medical VQA benchmark for chest X-ray diagnosis","author":"B Liu","year":"2024","unstructured":"B. Liu, K. Zou, L. Zhan, Z. Lu, X. Dong, Y. Chen, C. Xie, J. Cao, X. M. Wu, H. Fu. GEMeX: A large-scale, groundable, and explainable medical VQA benchmark for chest X-ray diagnosis, [Online], Available: https:\/\/arxiv.org\/abs\/2411.16778, 2024."},{"key":"1564_CR22","first-page":"22170","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Hu","year":"2024","unstructured":"Y. Hu, T. Li, Q. Lu, W. Shao, J. He, Y. Qiao, P. Luo. OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 
22170\u201322183, 2024."},{"key":"1564_CR23","doi-asserted-by":"publisher","unstructured":"W. Dong, S. Shen, Y. Han, T. Tan, J. Wu, H. Xu. Generative models in medical visual question answering: A survey. Applied Sciences, vol. 15, no. 6, Article number 2983, 2025. DOI: https:\/\/doi.org\/10.3390\/app15062983.","DOI":"10.3390\/app15062983"},{"key":"1564_CR24","first-page":"770","volume-title":"Proceedings of Conference on Computer Vision and Pattern Recognition","author":"K He","year":"2016","unstructured":"K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 770\u2013778, 2016."},{"key":"1564_CR25","volume-title":"An image is worth 16x16 words: Transformers for image recognition at scale","author":"A Dosovitskiy","year":"2021","unstructured":"A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, [Online], Available: https:\/\/arxiv.org\/abs\/2010.11929, 2021."},{"key":"1564_CR26","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"J Li","year":"2023","unstructured":"J. Li, D. Li, S. Savarese, S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA, Article number 814, 2023."},{"key":"1564_CR27","first-page":"4171","volume-title":"Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"J Devlin","year":"2019","unstructured":"J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. 
In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, USA, pp. 4171\u20134186, 2019."},{"issue":"5","key":"1564_CR28","doi-asserted-by":"publisher","first-page":"1400","DOI":"10.1109\/TC.2024.3365949","volume":"73","author":"L Ma","year":"2024","unstructured":"L. Ma, H. Kang, G. Yu, Q. Li, Q. He. Single-domain generalized predictor for neural architecture search system. IEEE Transactions on Computers, vol. 73, no. 5, pp. 1400\u20131413, 2024. DOI: https:\/\/doi.org\/10.1109\/TC.2024.3365949.","journal-title":"IEEE Transactions on Computers"},{"key":"1564_CR29","first-page":"15079","volume-title":"Proceedings of Conference on Empirical Methods in Natural Language Processing","author":"C Chen","year":"2023","unstructured":"C. Chen, B. Zhang, L. Cao, J. Shen, T. Gunter, A. M. Jose, A. Toshev, J. Shlens, R. Pang, Y. Yang. STAIR: Learning sparse text and image representation in grounded tokens. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, pp. 15079\u201315094, 2023."},{"key":"1564_CR30","doi-asserted-by":"publisher","unstructured":"J. J. Lau, S. Gayen, A. Ben Abacha, D. Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, vol. 5, no. 1, Article number 180251, 2018. DOI: https:\/\/doi.org\/10.1038\/sdata.2018.251.","DOI":"10.1038\/sdata.2018.251"},{"issue":"5","key":"1564_CR31","doi-asserted-by":"publisher","first-page":"1532","DOI":"10.1109\/TMI.2022.3232411","volume":"42","author":"B Liu","year":"2023","unstructured":"B. Liu, L. M. Zhan, L. Xu, X. M. Wu. Medical visual question answering via conditional reasoning and contrastive learning. IEEE Transactions on Medical Imaging, vol. 42, no. 5, pp. 1532\u20131545, 2023. 
DOI: https:\/\/doi.org\/10.1109\/TMI.2022.3232411.","journal-title":"IEEE Transactions on Medical Imaging"},{"key":"1564_CR32","first-page":"18145","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Z Y Dou","year":"2022","unstructured":"Z. Y. Dou, Y. Xu, Z. Gan, J. Wang, S. Wang, L. Wang, C. Zhu, P. Zhang, L. Yuan, N. Peng, Z. Liu, M. Zeng. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 18145\u201318155, 2022."}],"container-title":["Machine Intelligence Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-025-1564-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11633-025-1564-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-025-1564-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,26]],"date-time":"2025-11-26T08:43:43Z","timestamp":1764146623000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11633-025-1564-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,26]]},"references-count":32,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["1564"],"URL":"https:\/\/doi.org\/10.1007\/s11633-025-1564-2","relation":{},"ISSN":["2731-538X","2731-5398"],"issn-type":[{"value":"2731-538X","type":"print"},{"value":"2731-5398","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,26]]},"assertion":[{"value":"12 September 
2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 May 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 November 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that they have no conflicts of interest regarding this work.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations of conflict of interest"}}]}}