{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T05:07:22Z","timestamp":1776402442579,"version":"3.51.2"},"reference-count":215,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2025,2,12]],"date-time":"2025-02-12T00:00:00Z","timestamp":1739318400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Large language models (LLMs) and large vision models (LVMs) have driven significant advancements in natural language processing (NLP) and computer vision (CV), establishing a foundation for multimodal large language models (MLLMs) to integrate diverse data types in real-world applications. This survey explores the evolution of MLLMs in radiology, focusing on radiology report generation (RRG) and radiology visual question answering (RVQA), where MLLMs leverage the combined capabilities of LLMs and LVMs to improve clinical efficiency. We begin by tracing the history of radiology and the development of MLLMs, followed by an overview of MLLM applications in RRG and RVQA, detailing core datasets, evaluation metrics, and leading MLLMs that demonstrate their potential in generating radiology reports and answering image-based questions. We then discuss the challenges MLLMs face in radiology, including dataset scarcity, data privacy and security, and issues within MLLMs such as bias, toxicity, hallucinations, catastrophic forgetting, and limitations in traditional evaluation metrics. Finally, this paper proposes future research directions to address these challenges, aiming to help AI researchers and radiologists overcome these obstacles and advance the study of MLLMs in radiology.<\/jats:p>","DOI":"10.3390\/info16020136","type":"journal-article","created":{"date-parts":[[2025,2,12]],"date-time":"2025-02-12T06:06:06Z","timestamp":1739340366000},"page":"136","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":20,"title":["A Survey on Multimodal Large Language Models in Radiology for Report Generation and Visual Question Answering"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0663-9983","authenticated-orcid":false,"given":"Ziruo","family":"Yi","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, University of North Texas, Denton, TX 76205, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8548-5710","authenticated-orcid":false,"given":"Ting","family":"Xiao","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, University of North Texas, Denton, TX 76205, USA"},{"name":"Department of Information Science, University of North Texas, Denton, TX 76205, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3977-2895","authenticated-orcid":false,"given":"Mark V.","family":"Albert","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, University of North Texas, Denton, TX 76205, USA"}]}],"member":"1968","published-online":{"date-parts":[[2025,2,12]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1191","DOI":"10.1016\/j.acra.2015.05.007","article-title":"The effects of changes in utilization and technological advancements of cross-sectional imaging on radiologist 
workload","volume":"22","author":"McDonald","year":"2015","journal-title":"Acad. Radiol."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1016\/j.ajem.2009.07.011","article-title":"Accuracy of radiographic readings in the emergency department","volume":"29","author":"Petinaux","year":"2011","journal-title":"Am. J. Emerg. Med."},{"key":"ref_3","first-page":"10944","article-title":"What makes multi-modal learning better than single (provably)","volume":"34","author":"Huang","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Waqas, A., Tripathi, A., Ramachandran, R.P., Stewart, P.A., and Rasool, G. (2024). Multimodal data integration for oncology in the era of deep neural networks: A review. Front. Artif. Intell., 7.","DOI":"10.3389\/frai.2024.1408843"},{"key":"ref_5","unstructured":"Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., and Anadkat, S. (2023). GPT-4 Technical Report. arXiv."},{"key":"ref_6","unstructured":"(2025, February 06). Meta LLaMA Team. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date, Available online: https:\/\/ai.meta.com\/blog\/meta-llama-3\/."},{"key":"ref_7","unstructured":"(2025, February 06). OpenAI. DALL-E3, Available online: https:\/\/openai.com\/index\/dall-e-3\/."},{"key":"ref_8","unstructured":"Li, J., Li, D., Savarese, S., and Hoi, S. (2023, January 23\u201329). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA."},{"key":"ref_9","unstructured":"Huang, Y., Meng, Z., Liu, F., Su, Y., Collier, N., and Lu, Y. (2023). Sparkles: Unlocking chats across multiple images for multimodal instruction-following models. arXiv."},{"key":"ref_10","unstructured":"Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., Clark, K., Pfohl, S., Cole-Lewis, H., and Neal, D. (2023). Towards expert-level medical question answering with large language models. arXiv."},{"key":"ref_11","first-page":"28541","article-title":"Llava-med: Training a large language-and-vision assistant for biomedicine in one day","volume":"36","author":"Li","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Grisoni, F. (2023). Chemical language models for de novo drug design: Challenges and opportunities. Curr. Opin. Struct. Biol., 79.","DOI":"10.1016\/j.sbi.2023.102527"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"e179","DOI":"10.1016\/S2589-7500(23)00048-1","article-title":"Using ChatGPT to write patient clinic letters","volume":"5","author":"Ali","year":"2023","journal-title":"Lancet Digit. Health"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Yildirim, N., Richardson, H., Wetscherek, M.T., Bajwa, J., Jacob, J., Pinnock, M.A., Harris, S., Coelho De Castro, D., Bannur, S., and Hyland, S. (2024, January 11\u201316). Multimodal healthcare AI: Identifying and designing clinically relevant vision-language applications for radiology. Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA.","DOI":"10.1145\/3613904.3642013"},{"key":"ref_15","unstructured":"Liu, Z., Li, Y., Shu, P., Zhong, A., Yang, L., Ju, C., Wu, Z., Ma, C., Luo, J., and Chen, C. (2023). Radiology-llama2: Best-in-class large language model for radiology. 
arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1102","DOI":"10.1016\/j.procs.2023.08.094","article-title":"Generation of radiology findings in chest x-ray by leveraging collaborative knowledge","volume":"221","author":"Danu","year":"2023","journal-title":"Procedia Comput. Sci."},{"key":"ref_17","unstructured":"Wang, R., Duan, Y., Li, J., Pang, P., and Tan, T. (2025, February 06). Xrayglm: The First Chinese Medical Multimodal Model That Chest Radiographs Summarization. Available online: https:\/\/github.com\/WangRongsheng\/XrayGLM."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Thawkar, O., Shaker, A., Mullappilly, S.S., Cholakkal, H., Anwer, R.M., Khan, S., Laaksonen, J., and Khan, F.S. (2023). Xraygpt: Chest radiographs summarization using medical vision-language models. arXiv.","DOI":"10.18653\/v1\/2024.bionlp-1.35"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"102611","DOI":"10.1016\/j.artmed.2023.102611","article-title":"Medical visual question answering: A survey","volume":"143","author":"Lin","year":"2023","journal-title":"Artif. Intell. Med."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"50","DOI":"10.3390\/biomedinformatics4010004","article-title":"Survey of Multimodal Medical Question Answering","volume":"4","author":"Demirhan","year":"2023","journal-title":"BioMedInformatics"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Sloan, P., Clatworthy, P., Simpson, E., and Mirmehdi, M. (2024). Automated Radiology Report Generation: A Review of Recent Advances. IEEE Rev. Biomed. Eng., 1\u201320.","DOI":"10.1109\/RBME.2024.3408456"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1002\/hcs2.61","article-title":"Large language models in health care: Development, applications, and challenges","volume":"2","author":"Yang","year":"2023","journal-title":"Health Care Sci."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"80","DOI":"10.4274\/dir.2023.232417","article-title":"Large language models in radiology: Fundamentals, applications, ethical considerations, risks, and future directions","volume":"30","author":"Stanzione","year":"2024","journal-title":"Diagn. Interv. Radiol."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Hartsock, I., and Rasool, G. (2024). Vision-language models for medical report generation and visual question answering: A review. arXiv.","DOI":"10.3389\/frai.2024.1430984"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Xiao, H., Zhou, F., Liu, X., Liu, T., Li, Z., Liu, X., and Huang, X. (2024). A comprehensive survey of large language models and multimodal large language models in medicine. arXiv.","DOI":"10.2139\/ssrn.5031720"},{"key":"ref_26","unstructured":"Bushberg, J.T., and Boone, J.M. (2011). The Essential Physics of Medical Imaging, Lippincott Williams & Wilkins."},{"key":"ref_27","unstructured":"Huang, H.K. (2011). PACS and Imaging Informatics: Basic Principles and Applications, John Wiley & Sons."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"3","DOI":"10.2967\/jnumed.116.184028","article-title":"Total-body PET: Maximizing sensitivity to create new opportunities for clinical research and patient care","volume":"59","author":"Cherry","year":"2018","journal-title":"J. Nucl. Med."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Hutton, B.F., Buvat, I., and Beekman, F.J. (2011). Review and current status of SPECT scatter correction. Phys. Med. 
Biol., 56.","DOI":"10.1088\/0031-9155\/56\/14\/R01"},{"key":"ref_30","first-page":"1227","article-title":"Procedure guideline for SPECT\/CT imaging 1.0","volume":"47","author":"Delbeke","year":"2006","journal-title":"J. Nucl. Med."},{"key":"ref_31","first-page":"826","article-title":"Effect of cardiac resynchronization therapy on longitudinal and circumferential left ventricular mechanics by velocity vector imaging: Description and initial clinical application of a novel method using high-frame rate B-mode echocardiographic images","volume":"22","author":"Vannan","year":"2005","journal-title":"Echocardiogr. A J. Cardiovasc. Ultrasound Allied Tech."},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Lorenz, J.M. (2016). Management of malignant biliary obstruction. Seminars in Interventional Radiology, Thieme Medical Publishers.","DOI":"10.1055\/s-0036-1592330"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"570","DOI":"10.1148\/radiol.2019182210","article-title":"Implementing virtual and augmented reality tools for radiology education and training, communication, and clinical care","volume":"291","author":"Uppot","year":"2019","journal-title":"Radiology"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"von Ende, E., Ryan, S., Crain, M.A., and Makary, M.S. (2023). Artificial intelligence, augmented reality, and virtual reality advances and applications in interventional radiology. Diagnostics, 13.","DOI":"10.3390\/diagnostics13050892"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Mun, S.K., Wong, K.H., Lo, S.C.B., Li, Y., and Bayarsaikhan, S. (2021). Artificial intelligence for the future radiology diagnostic service. Front. Mol. Biosci., 7.","DOI":"10.3389\/fmolb.2020.614258"},{"key":"ref_36","unstructured":"Vaswani, A. (2017). Attention is all you need. Advances in Neural Information Processing Systems, Curran Associates, Inc."},{"key":"ref_37","unstructured":"Brown, T.B. (2020). Language models are few-shot learners. arXiv."},{"key":"ref_38","unstructured":"(2023). OpenAI. Gpt-4 technical report. arXiv."},{"key":"ref_39","first-page":"1","article-title":"Palm: Scaling language modeling with pathways","volume":"24","author":"Chowdhery","year":"2023","journal-title":"J. Mach. Learn. Res."},{"key":"ref_40","first-page":"27730","article-title":"Training language models to follow instructions with human feedback","volume":"35","author":"Ouyang","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_41","first-page":"24824","article-title":"Chain-of-thought prompting elicits reasoning in large language models","volume":"35","author":"Wei","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_42","first-page":"e40895","article-title":"Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge","volume":"15","author":"Li","year":"2023","journal-title":"Cureus"},{"key":"ref_43","unstructured":"Liu, Z., Huang, Y., Yu, X., Zhang, L., Wu, Z., Cao, C., Dai, H., Zhao, L., Li, Y., and Shu, P. (2023). Deid-gpt: Zero-shot medical text de-identification by gpt-4. 
arXiv."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"e230922","DOI":"10.1148\/radiol.230922","article-title":"How AI responds to common lung cancer questions: ChatGPT versus Google Bard","volume":"307","author":"Rahsepar","year":"2023","journal-title":"Radiology"},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"338","DOI":"10.1016\/j.acra.2023.08.020","article-title":"Assessing AI-powered patient education: A case study in radiology","volume":"31","author":"Kuckelman","year":"2024","journal-title":"Acad. Radiol."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"e231362","DOI":"10.1148\/radiol.231362","article-title":"Potential of ChatGPT and GPT-4 for data mining of free-text CT reports on lung cancer","volume":"308","author":"Fink","year":"2023","journal-title":"Radiology"},{"key":"ref_47","unstructured":"Dosovitskiy, A. (2020). An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_48","unstructured":"Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and J\u00e9gou, H. (2021, January 18\u201324). Training data-efficient image transformers & distillation through attention. Proceedings of the International Conference on Machine Learning. PMLR, Virtual."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021, January 10\u201317). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"ref_50","first-page":"15908","article-title":"Transformer in transformer","volume":"34","author":"Han","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"He, K., Chen, X., Xie, S., Li, Y., Doll\u00e1r, P., and Girshick, R. (2022, January 18\u201324). Masked autoencoders are scalable vision learners. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01553"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Chen, X., Xie, S., and He, K. (2021, January 10\u201317). An empirical study of training self-supervised vision transformers. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00950"},{"key":"ref_53","unstructured":"Bao, H., Dong, L., Piao, S., and Wei, F. (2021). Beit: Bert pre-training of image transformers. arXiv."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., and Lo, W.Y. (2023, January 2\u20133). Segment anything. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"ref_55","unstructured":"Cheng, Y., Li, L., Xu, Y., Li, X., Yang, Z., Wang, W., and Yang, Y. (2023). Segment and track anything. arXiv."},{"key":"ref_56","unstructured":"Yang, J., Gao, M., Li, Z., Gao, S., Wang, F., and Zheng, F. (2023). Track anything: Segment anything meets videos. 
arXiv."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"2530","DOI":"10.1177\/13623613231169546","article-title":"Automated movement tracking of young autistic children during free play is correlated with clinical features associated with autism","volume":"27","author":"Yuan","year":"2023","journal-title":"Autism"},{"key":"ref_58","unstructured":"Cao, M., Mou, C., Yu, F., Wang, X., Zheng, Y., Zhang, J., Dong, C., Li, G., Shan, Y., and Timofte, R. (2023, January 2\u20133). Ntire 2023 challenge on 360deg omnidirectional image and video super-resolution: Datasets, methods and results. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Paris, France."},{"key":"ref_59","unstructured":"Shen, Q., Yang, X., and Wang, X. (2023). Anything-3d: Towards single-view anything reconstruction in the wild. arXiv."},{"key":"ref_60","unstructured":"Yu, T., Feng, R., Feng, R., Liu, J., Jin, X., Zeng, W., and Chen, Z. (2023). Inpaint anything: Segment anything meets image inpainting. arXiv."},{"key":"ref_61","unstructured":"Roy, S., Wald, T., Koehler, G., Rokuss, M.R., Disch, N., Holzschuh, J., Zimmerer, D., and Maier-Hein, K.H. (2023). Sam. md: Zero-shot medical image segmentation capabilities of the segment anything model. arXiv."},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Xu, N., Price, B., Cohen, S., Yang, J., and Huang, T.S. (2016, January 27\u201330). Deep interactive object selection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.47"},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"1562","DOI":"10.1109\/TMI.2018.2791721","article-title":"Interactive medical image segmentation using deep learning with image-specific fine tuning","volume":"37","author":"Wang","year":"2018","journal-title":"IEEE Trans. Med. Imaging"},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Jang, W.D., and Kim, C.S. (2019, January 15\u201320). Interactive image segmentation via backpropagating refinement scheme. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00544"},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Lin, Z., Zhang, Z., Chen, L.Z., Cheng, M.M., and Lu, S.P. (2020, January 13\u201319). Interactive image segmentation with first click attention. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01335"},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Chen, X., Zhao, Z., Yu, F., Zhang, Y., and Duan, M. (2021, January 10\u201317). Conditional diffusion for interactive segmentation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Montreal, QC, Canada.","DOI":"10.1109\/ICCV48922.2021.00725"},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Lempitsky, V., Kohli, P., Rother, C., and Sharp, T. (October, January 29). Image segmentation with a bounding box prior. Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan.","DOI":"10.1109\/ICCV.2009.5459262"},{"key":"ref_68","doi-asserted-by":"crossref","unstructured":"Wu, J., Zhao, Y., Zhu, J.Y., Luo, S., and Tu, Z. (2014, January 23\u201328). Milcut: A sweeping line multiple instance learning paradigm for interactive image segmentation. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.40"},{"key":"ref_69","doi-asserted-by":"crossref","first-page":"674","DOI":"10.1109\/TMI.2016.2621185","article-title":"Deepcut: Object segmentation from bounding box annotations using convolutional neural networks","volume":"36","author":"Rajchl","year":"2016","journal-title":"IEEE Trans. Med. Imaging"},{"key":"ref_70","doi-asserted-by":"crossref","unstructured":"Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., and Lim, S.N. (2022, January 23\u201327). Visual prompt tuning. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-19827-4_41"},{"key":"ref_71","first-page":"16664","article-title":"Adaptformer: Adapting vision transformers for scalable visual recognition","volume":"35","author":"Chen","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_72","unstructured":"Jie, S., and Deng, Z.H. (2022). Convolutional bypasses are better vision transformer adapters. arXiv."},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Huang, Q., Dong, X., Chen, D., Zhang, W., Wang, F., Hua, G., and Yu, N. (2023, January 2\u20133). Diversity-aware meta visual prompting. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Paris, France.","DOI":"10.1109\/CVPR52729.2023.01047"},{"key":"ref_74","unstructured":"Zhou, T., Zhang, Y., Zhou, Y., Wu, Y., and Gong, C. (2023). Can sam segment polyps?. arXiv."},{"key":"ref_75","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Zhou, T., Wang, S., Liang, P., Zhang, Y., and Chen, D.Z. (2023, January 8\u201312). Input augmentation with sam: Boosting medical image segmentation with segmentation foundation model. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada.","DOI":"10.1007\/978-3-031-47401-9_13"},{"key":"ref_76","doi-asserted-by":"crossref","unstructured":"Chen, T., Zhu, L., Ding, C., Cao, R., Wang, Y., Li, Z., Sun, L., Mao, P., and Zang, Y. (2023). SAM Fails to Segment Anything?\u2013SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More. arXiv.","DOI":"10.1109\/ICCVW60793.2023.00361"},{"key":"ref_77","doi-asserted-by":"crossref","unstructured":"Li, H., Liu, H., Hu, D., Wang, J., and Oguz, I. (2024, January 6\u201310). Prism: A promptable and robust interactive segmentation model with visual prompts. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco.","DOI":"10.1007\/978-3-031-72384-1_37"},{"key":"ref_78","doi-asserted-by":"crossref","first-page":"654","DOI":"10.1038\/s41467-024-44824-z","article-title":"Segment anything in medical images","volume":"15","author":"Ma","year":"2024","journal-title":"Nat. Commun."},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Shi, P., Qiu, J., Abaxi, S.M.D., Wei, H., Lo, F.P.W., and Yuan, W. (2023). Generalist vision foundation models for medical imaging: A case study of segment anything model on zero-shot medical segmentation. Diagnostics, 13.","DOI":"10.3390\/diagnostics13111947"},{"key":"ref_80","unstructured":"Ranem, A., Babendererde, N., Fuchs, M., and Mukhopadhyay, A. (2023). Exploring sam ablations for enhancing medical segmentation in radiology and pathology. 
arXiv."},{"key":"ref_81","doi-asserted-by":"crossref","unstructured":"Wu, C., Restrepo, D., Shuai, Z., Liu, Z., and Shen, L. (2024, January 6\u201310). Efficient In-Context Medical Segmentation with Meta-driven Visual Prompt Selection. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco.","DOI":"10.1007\/978-3-031-72114-4_25"},{"key":"ref_82","doi-asserted-by":"crossref","first-page":"164","DOI":"10.1016\/j.ins.2022.12.014","article-title":"Analysis of multimodal data fusion from an information theory perspective","volume":"623","author":"Dai","year":"2023","journal-title":"Inf. Sci."},{"key":"ref_83","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1016\/S0004-3702(00)00079-5","article-title":"Blocks world revisited","volume":"125","author":"Slaney","year":"2001","journal-title":"Artif. Intell."},{"key":"ref_84","doi-asserted-by":"crossref","first-page":"2122","DOI":"10.1007\/s11263-023-01784-z","article-title":"Multi-modal 3d object detection in autonomous driving: A survey","volume":"131","author":"Wang","year":"2023","journal-title":"Int. J. Comput. Vis."},{"key":"ref_85","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-term Memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_86","doi-asserted-by":"crossref","first-page":"2673","DOI":"10.1109\/78.650093","article-title":"Bidirectional recurrent neural networks","volume":"45","author":"Schuster","year":"1997","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_87","first-page":"84","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_88","unstructured":"Mao, J. (2014). Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv."},{"key":"ref_89","doi-asserted-by":"crossref","unstructured":"Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., and Parikh, D. (2015, January 7\u201313). Vqa: Visual question answering. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.279"},{"key":"ref_90","doi-asserted-by":"crossref","unstructured":"Karpathy, A., and Fei-Fei, L. (2015, January 7\u201312). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"ref_91","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_92","doi-asserted-by":"crossref","unstructured":"Mroueh, Y., Marcheret, E., and Goel, V. (2015, January 19\u201324). Deep multimodal learning for audio-visual speech recognition. Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia.","DOI":"10.1109\/ICASSP.2015.7178347"},{"key":"ref_93","doi-asserted-by":"crossref","unstructured":"Arandjelovic, R., and Zisserman, A. (2017, January 22\u201329). Look, listen and learn. 
Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.73"},{"key":"ref_94","doi-asserted-by":"crossref","unstructured":"Qi, C.R., Su, H., Nie\u00dfner, M., Dai, A., Yan, M., and Guibas, L.J. (2016, January 27\u201330). Volumetric and multi-view cnns for object classification on 3d data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.609"},{"key":"ref_95","unstructured":"Devlin, J. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_96","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"J. Mach. Learn. Res."},{"key":"ref_97","unstructured":"Radford, A. (2025, February 06). Improving Language Understanding by Generative Pre-Training. Available online: https:\/\/openai.com\/research\/language-unsupervised."},{"key":"ref_98","doi-asserted-by":"crossref","unstructured":"Georgescu, M.I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., and Arnab, A. (2023, January 2\u20133). Audiovisual masked autoencoders. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.01479"},{"key":"ref_99","unstructured":"Guo, Z., Zhang, R., Zhu, X., Tang, Y., Ma, X., Han, J., Chen, K., Gao, P., Li, X., and Li, H. (2023). Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv."},{"key":"ref_100","first-page":"28708","article-title":"Masked autoencoders that listen","volume":"35","author":"Huang","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_101","first-page":"13","article-title":"Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks","volume":"32","author":"Lu","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_102","unstructured":"Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., and Chang, K.W. (2019). Visualbert: A simple and performant baseline for vision and language. arXiv."},{"key":"ref_103","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021, January 18\u201324). Learning transferable visual models from natural language supervision. Proceedings of the International Conference on Machine Learning, PMLR, Virtual."},{"key":"ref_104","unstructured":"Wang, Z., Yu, J., Yu, A.W., Dai, Z., Tsvetkov, Y., and Cao, Y. (2021). Simvlm: Simple visual language model pretraining with weak supervision. arXiv."},{"key":"ref_105","first-page":"23716","article-title":"Flamingo: A visual language model for few-shot learning","volume":"35","author":"Alayrac","year":"2022","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_106","doi-asserted-by":"crossref","unstructured":"Zhu, L., and Yang, Y. (2020, January 13\u201319). Actbert: Learning global-local video-text representations. Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00877"},{"key":"ref_107","first-page":"25","article-title":"Self-supervised multimodal versatile networks","volume":"33","author":"Alayrac","year":"2020","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"ref_108","doi-asserted-by":"crossref","unstructured":"Han, L., Zheng, T., Xu, L., and Fang, L. (2020, January 13\u201319). Occuseg: Occupancy-aware 3d instance segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00301"},{"key":"ref_109","unstructured":"Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. (2021, January 18\u201324). Zero-shot text-to-image generation. Proceedings of the International Conference on Machine Learning. PMLR, Virtual."},{"key":"ref_110","doi-asserted-by":"crossref","unstructured":"Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022, January 18\u201324). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"ref_111","doi-asserted-by":"crossref","unstructured":"Zhang, L., Rao, A., and Agrawala, M. (2023, January 2\u20133). Adding conditional control to text-to-image diffusion models. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00355"},{"key":"ref_112","doi-asserted-by":"crossref","unstructured":"Yi, Z., Blanco, E., Fan, H., and Albert, M.V. (2022, January 2\u20134). BAPO: A Large-Scale Multimodal Corpus for Ball Possession Prediction in American Football Games. Proceedings of the 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR), Virtual.","DOI":"10.1109\/MIPR54900.2022.00077"},{"key":"ref_113","doi-asserted-by":"crossref","unstructured":"Ko, H.K., Park, G., Jeon, H., Jo, J., Kim, J., and Seo, J. (2023, January 27\u201331). Large-scale text-to-image generation models for visual artists\u2019 creative works. Proceedings of the 28th International Conference on Intelligent User Interfaces, Sydney, NSW, Australia.","DOI":"10.1145\/3581641.3584078"},{"key":"ref_114","doi-asserted-by":"crossref","unstructured":"Fu, D., Li, X., Wen, L., Dou, M., Cai, P., Shi, B., and Qiao, Y. (2024, January 3\u20138). Drive like a human: Rethinking autonomous driving with large language models. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA.","DOI":"10.1109\/WACVW60836.2024.00102"},{"key":"ref_115","unstructured":"Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., and Le, Q.V. (2021). Finetuned language models are zero-shot learners. arXiv."},{"key":"ref_116","first-page":"1","article-title":"Scaling instruction-finetuned language models","volume":"25","author":"Chung","year":"2024","journal-title":"J. Mach. Learn. Res."},{"key":"ref_117","unstructured":"Iyer, S., Lin, X.V., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., Shuster, K., Wang, T., Liu, Q., and Koura, P.S. (2022). Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv."},{"key":"ref_118","unstructured":"(2025, February 06). OpenAI. ChatGPT, Available online: https:\/\/openai.com."},{"key":"ref_119","unstructured":"Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., and Shi, Y. (2023). mplug-owl: Modularization empowers large language models with multimodality. arXiv."},{"key":"ref_120","unstructured":"Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. (2023). 
Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv."},{"key":"ref_121","first-page":"34891","article-title":"Visual instruction tuning","volume":"36","author":"Liu","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_122","doi-asserted-by":"crossref","unstructured":"Lu, Y., Bartolo, M., Moore, A., Riedel, S., and Stenetorp, P. (2021). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv.","DOI":"10.18653\/v1\/2022.acl-long.556"},{"key":"ref_123","unstructured":"Dong, Q., Li, L., Dai, D., Zheng, C., Wu, Z., Chang, B., Sun, X., Xu, J., and Sui, Z. (2022). A survey on in-context learning. arXiv."},{"key":"ref_124","unstructured":"Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., and Wang, L. (2023). Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv."},{"key":"ref_125","doi-asserted-by":"crossref","unstructured":"Gupta, T., and Kembhavi, A. (2023, January 2\u20133). Visual programming: Compositional visual reasoning without training. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Paris, France.","DOI":"10.1109\/CVPR52729.2023.01436"},{"key":"ref_126","first-page":"43447","article-title":"Chameleon: Plug-and-play compositional reasoning with large language models","volume":"36","author":"Lu","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_127","unstructured":"Ge, J., Luo, H., Qian, S., Gan, Y., Fu, J., and Zhang, S. (2023). Chain of thought prompt tuning in vision language models. arXiv."},{"key":"ref_128","doi-asserted-by":"crossref","unstructured":"Himakunthala, V., Ouyang, A., Rose, D., He, R., Mei, A., Lu, Y., Sonar, C., Saxon, M., and Wang, W.Y. (2023). Let\u2019s think frame by frame: Evaluating video chain of thought with video infilling and prediction. arXiv.","DOI":"10.18653\/v1\/2023.emnlp-main.15"},{"key":"ref_129","unstructured":"Rose, D., Himakunthala, V., Ouyang, A., He, R., Mei, A., Lu, Y., Saxon, M., Sonar, C., Mirza, D., and Wang, W.Y. (2023). Visual chain of thought: Bridging logical gaps with multimodal infillings. arXiv."},{"key":"ref_130","unstructured":"Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., and Smola, A. (2023). Multimodal chain-of-thought reasoning in language models. arXiv."},{"key":"ref_131","doi-asserted-by":"crossref","unstructured":"Zhang, R., Hu, X., Li, B., Huang, S., Deng, H., Qiao, Y., Gao, P., and Li, H. (2023, January 2\u20133). Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Paris, France.","DOI":"10.1109\/CVPR52729.2023.01460"},{"key":"ref_132","unstructured":"Wang, T., Zhang, J., Fei, J., Zheng, H., Tang, Y., Li, Z., Gao, M., and Zhao, S. (2023). Caption anything: Interactive image description with diverse multimodal controls. arXiv."},{"key":"ref_133","unstructured":"Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., and Duan, N. (2023). Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv."},{"key":"ref_134","doi-asserted-by":"crossref","unstructured":"Zhu, X., Zhang, R., He, B., Guo, Z., Zeng, Z., Qin, Z., Zhang, S., and Gao, P. (2023, January 2\u20133). Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. 
Proceedings of the IEEE\/CVF International Conference on Computer Vision, Paris, France.","DOI":"10.1109\/ICCV51070.2023.00249"},{"key":"ref_135","unstructured":"Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2024). Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Adv. Neural Inf. Process. Syst., 36, Available online: https:\/\/arxiv.org\/abs\/2303.17580."},{"key":"ref_136","unstructured":"(2025, February 06). OpenAI. Gpt-4v(ision) System Card, Available online: https:\/\/api.semanticscholar.org\/CorpusID:263218031."},{"key":"ref_137","unstructured":"Liu, C., Tian, Y., and Song, Y. (2023). A systematic review of deep learning-based research on radiology report generation. arXiv."},{"key":"ref_138","doi-asserted-by":"crossref","first-page":"988","DOI":"10.1007\/s10278-020-00349-7","article-title":"Framework for Extracting Critical Findings in Radiology Reports","volume":"33","author":"Mabotuwana","year":"2020","journal-title":"J. Digit. Imaging"},{"key":"ref_139","doi-asserted-by":"crossref","first-page":"304","DOI":"10.1093\/jamia\/ocv080","article-title":"Preparing a collection of radiology examinations for distribution and retrieval","volume":"23","author":"Kohli","year":"2016","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref_140","doi-asserted-by":"crossref","first-page":"317","DOI":"10.1038\/s41597-019-0322-0","article-title":"MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports","volume":"6","author":"Johnson","year":"2019","journal-title":"Sci. Data"},{"key":"ref_141","doi-asserted-by":"crossref","unstructured":"Johnson, A.E., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., and Horng, S. (2019). MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv.","DOI":"10.1038\/s41597-019-0322-0"},{"key":"ref_142","doi-asserted-by":"crossref","first-page":"101797","DOI":"10.1016\/j.media.2020.101797","article-title":"Padchest: A large chest x-ray image dataset with multi-label annotated reports","volume":"66","author":"Bustos","year":"2020","journal-title":"Med. Image Anal."},{"key":"ref_143","doi-asserted-by":"crossref","first-page":"e210136","DOI":"10.1148\/ryai.2021210136","article-title":"Curation of the candid-ptx dataset with free-text reports","volume":"3","author":"Feng","year":"2021","journal-title":"Radiol. Artif. Intell."},{"key":"ref_144","first-page":"188","article-title":"NegBio: A high-performance tool for negation and uncertainty detection in radiology reports","volume":"2018","author":"Peng","year":"2018","journal-title":"AMIA Summits Transl. Sci. Proc."},{"key":"ref_145","unstructured":"Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., and Shpanskaya, K. (February, January 27). Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. 
Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA."},{"key":"ref_146","doi-asserted-by":"crossref","first-page":"100033","DOI":"10.1016\/j.metrad.2023.100033","article-title":"R2gengpt: Radiology report generation with frozen llms","volume":"1","author":"Wang","year":"2023","journal-title":"Meta-Radiology"},{"key":"ref_147","doi-asserted-by":"crossref","first-page":"e32690","DOI":"10.2196\/32690","article-title":"Vision-language model for generating textual descriptions from clinical images: Model development and validation study","volume":"8","author":"Ji","year":"2024","journal-title":"JMIR Form. Res."},{"key":"ref_148","unstructured":"Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. (2023, January 10\u201316). InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Proceedings of the NIPS\u201923, New Orleans, LA, USA."},{"key":"ref_149","unstructured":"Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv."},{"key":"ref_150","unstructured":"Hyland, S.L., Bannur, S., Bouzid, K., Castro, D.C., Ranjit, M., Schwaighofer, A., P\u00e9rez-Garc\u00eda, F., Salvatelli, V., Srivastav, S., and Thieme, A. (2023). Maira-1: A specialised large multimodal model for radiology report generation. arXiv."},{"key":"ref_151","doi-asserted-by":"crossref","unstructured":"Wang, Z., Wu, Z., Agarwal, D., and Sun, J. (2022). Medclip: Contrastive learning from unpaired medical images and text. arXiv.","DOI":"10.18653\/v1\/2022.emnlp-main.256"},{"key":"ref_152","unstructured":"Liu, C., Tian, Y., Chen, W., Song, Y., and Zhang, Y. (2024, January 20\u201327). Bootstrapping Large Language Models for Radiology Report Generation. Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada."},{"key":"ref_153","unstructured":"Wang, Y., Hao, C., Cui, Y., Su, X., Xie, W., Tan, T., and Yu, Z. (2024). TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model. arXiv."},{"key":"ref_154","unstructured":"Zhou, Z., Shi, M., Wei, M., Alabi, O., Yue, Z., and Vercauteren, T. (2024). Large Model driven Radiology Report Generation with Clinical Quality Reinforcement Learning. arXiv."},{"key":"ref_155","unstructured":"Lu, Y., Hong, S., Shah, Y., and Xu, P. (2023). Effectively fine-tune to improve large multimodal models for radiology report generation. arXiv."},{"key":"ref_156","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002, January 6\u201312). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.","DOI":"10.3115\/1073083.1073135"},{"key":"ref_157","unstructured":"Lin, C.Y. (2004). Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, Association for Computational Linguistics."},{"key":"ref_158","unstructured":"Banerjee, S., and Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization, Association for Computational Linguistics, Ann Arbor, MI, USA."},{"key":"ref_159","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015, January 7\u201312). 
Cider: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"ref_160","doi-asserted-by":"crossref","unstructured":"Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., and Lungren, M.P. (2020). CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv.","DOI":"10.18653\/v1\/2020.emnlp-main.117"},{"key":"ref_161","doi-asserted-by":"crossref","first-page":"100802","DOI":"10.1016\/j.patter.2023.100802","article-title":"Evaluating progress in automatic chest x-ray radiology report generation","volume":"4","author":"Yu","year":"2023","journal-title":"Patterns"},{"key":"ref_162","unstructured":"Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., and Ng, A.Y. (2021). Radgraph: Extracting clinical entities and relations from radiology reports. arXiv."},{"key":"ref_163","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_164","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_165","doi-asserted-by":"crossref","unstructured":"Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. (2016, January 27\u201330). Stacked attention networks for image question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.10"},{"key":"ref_166","doi-asserted-by":"crossref","unstructured":"Bazi, Y., Rahhal, M.M.A., Bashmal, L., and Zuair, M. (2023). Vision\u2013language model for visual question answering in medical imagery. Bioengineering, 10.","DOI":"10.3390\/bioengineering10030380"},{"key":"ref_167","unstructured":"Hasan, S.A., Ling, Y., Farri, O., Liu, J., M\u00fcller, H., and Lungren, M. (2025, February 11). Overview of Imageclef 2018 Medical Domain Visual Question Answering Task. Proceedings of CLEF 2018 Working Notes, Available online: https:\/\/api.semanticscholar.org\/CorpusID:51943124."},{"key":"ref_168","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/sdata.2018.251","article-title":"A dataset of clinically generated visual questions and answers about radiology images","volume":"5","author":"Lau","year":"2018","journal-title":"Sci. Data"},{"key":"ref_169","unstructured":"Ben Abacha, A., Hasan, S.A., Datla, V.V., Demner-Fushman, D., and M\u00fcller, H. (2019, January 9\u201312). Vqa-med: Overview of the medical visual question answering task at imageclef 2019. Proceedings of the CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes, Lugano, Switzerland."},{"key":"ref_170","doi-asserted-by":"crossref","unstructured":"Kovaleva, O., Shivade, C., Kashyap, S., Kanjaria, K., Wu, J., Ballah, D., Coy, A., Karargyris, A., Guo, Y., and Beymer, D.B. (2020, January 9). Towards visual dialog for radiology. Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Online.","DOI":"10.18653\/v1\/2020.bionlp-1.6"},{"key":"ref_171","unstructured":"Ben Abacha, A., Datla, V.V.D., Demner-Fushman, D., Hasan, S.A., and M\u00fcller, H. (2020, January 22\u201325). 
Overview of the VQA-Med task at ImageCLEF 2020: Visual question answering and generation in the medical domain. Proceedings of the CLEF 2020 Conference and Labs of the Evaluation Forum-Working Notes, Thessaloniki, Greece."},{"key":"ref_172","doi-asserted-by":"crossref","unstructured":"Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., and Wu, X.M. (2021, January 13\u201316). Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France.","DOI":"10.1109\/ISBI48211.2021.9434010"},{"key":"ref_173","unstructured":"Ben Abacha, A., Sarrouti, M., Demner-Fushman, D., Hasan, S.A., and M\u00fcller, H. (2021, January 21\u201324). Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-Working Notes, Bucharest, Romania."},{"key":"ref_174","first-page":"3867","article-title":"Ehrxqa: A multi-modal question answering dataset for electronic health records with chest x-ray images","volume":"36","author":"Bae","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_175","unstructured":"Xu, S., Yang, L., Kelly, C., Sieniek, M., Kohlberger, T., Ma, M., Weng, W.H., Kiraly, A., Kazemzadeh, S., and Melamed, Z. (2023). Elixr: Towards a general purpose x-ray artificial intelligence system through alignment of large language models and radiology vision encoders. arXiv."},{"key":"ref_176","doi-asserted-by":"crossref","unstructured":"Demner-Fushman, D., Ananiadou, S., and Cohen, K. (2023, January 13). shs-nlp at RadSum23: Domain-Adaptive Pre-training of Instruction-tuned LLMs for Radiology Report Impression Generation. Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, Toronto, Canada. Available online: https:\/\/doi.org\/10.18653\/v1\/2023.bionlp-1.57.","DOI":"10.18653\/v1\/2023.bionlp-1.57"},{"key":"ref_177","unstructured":"Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., and Valluri, N. (2023). BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv."},{"key":"ref_178","doi-asserted-by":"crossref","unstructured":"Ha, C.N., Asaadi, S., Karn, S.K., Farri, O., Heimann, T., and Runkler, T. (2024). Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering. arXiv.","DOI":"10.18653\/v1\/2024.clinicalnlp-1.21"},{"key":"ref_179","unstructured":"Zhang, S., Xu, Y., Usuyama, N., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., and Wong, C. (2023). Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv, Available online: https:\/\/www.researchgate.net\/publication\/368935664_Large-Scale_Domain-Specific_Pretraining_for_Biomedical_Vision-Language_Processing."},{"key":"ref_180","unstructured":"Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., and Gonzalez, J.E. (2025, February 06). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* Chatgpt Quality. Available online: https:\/\/vicuna.lmsys.org\/."},{"key":"ref_181","doi-asserted-by":"crossref","unstructured":"Li, P., Liu, G., He, J., Zhao, Z., and Zhong, S. (2023, January 8\u201312). Masked vision and language pre-training with unimodal and multimodal contrastive losses for medical visual question answering. 
Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada.","DOI":"10.1007\/978-3-031-43907-0_36"},{"key":"ref_182","doi-asserted-by":"crossref","unstructured":"Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., and Cao, Y. (2023, January 2\u20133). Eva: Exploring the limits of masked visual representation learning at scale. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Paris, France.","DOI":"10.1109\/CVPR52729.2023.01855"},{"key":"ref_183","unstructured":"Liu, G., He, J., Li, P., He, G., Chen, Z., and Zhong, S. (2024). PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging. arXiv."},{"key":"ref_184","doi-asserted-by":"crossref","unstructured":"Chen, Z., Li, G., and Wan, X. (2022, January 10\u201314). Align, reason and learn: Enhancing medical vision-and-language pre-training with knowledge. Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal.","DOI":"10.1145\/3503161.3547948"},{"key":"ref_185","doi-asserted-by":"crossref","unstructured":"Gu, T., Yang, K., Liu, D., and Cai, W. (2024, January 16\u201322). LaPA: Latent Prompt Assist Model For Medical Visual Question Answering. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPRW63382.2024.00502"},{"key":"ref_186","doi-asserted-by":"crossref","unstructured":"Ossowski, T., and Hu, J. (2023). Multimodal prompt retrieval for generative visual question answering. arXiv.","DOI":"10.18653\/v1\/2023.findings-acl.158"},{"key":"ref_187","unstructured":"Park, J., Kim, S., Yoon, B., Hyun, J., and Choi, K. (2024). M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation. arXiv."},{"key":"ref_188","doi-asserted-by":"crossref","unstructured":"Kim, T., Cho, Y., Shin, H., Jo, Y., and Shin, D. (2024). Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model. arXiv.","DOI":"10.3233\/FAIA240501"},{"key":"ref_189","doi-asserted-by":"crossref","unstructured":"Sharma, D., Purushotham, S., and Reddy, C.K. (2021). MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain. Sci. Rep., 11.","DOI":"10.1038\/s41598-021-98390-1"},{"key":"ref_190","doi-asserted-by":"crossref","first-page":"1773","DOI":"10.1038\/s41591-022-01981-2","article-title":"Multimodal biomedical AI","volume":"28","author":"Acosta","year":"2022","journal-title":"Nat. Med."},{"key":"ref_191","unstructured":"Moor, M., Huang, Q., Wu, S., Yasunaga, M., Dalmia, Y., Leskovec, J., Zakka, C., Reis, E.P., and Rajpurkar, P. (2023, January 10). Med-flamingo: A multimodal medical few-shot learner. Proceedings of the Machine Learning for Health (ML4H), PMLR, New Orleans, LA, USA."},{"key":"ref_192","doi-asserted-by":"crossref","unstructured":"Zhao, R., Chen, H., Wang, W., Jiao, F., Do, X.L., Qin, C., Ding, B., Guo, X., Li, M., and Li, X. (2023). Retrieving multimodal information for augmented generation: A survey. 
arXiv.","DOI":"10.18653\/v1\/2023.findings-emnlp.314"},{"key":"ref_193","doi-asserted-by":"crossref","first-page":"106775","DOI":"10.1016\/j.knosys.2021.106775","article-title":"A survey on federated learning","volume":"216","author":"Zhang","year":"2021","journal-title":"Knowl.-Based Syst."},{"key":"ref_194","unstructured":"Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., and Erlingsson, U. (2021, January 11\u201313). Extracting training data from large language models. Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Online."},{"key":"ref_195","doi-asserted-by":"crossref","unstructured":"Li, H., Guo, D., Fan, W., Xu, M., Huang, J., Meng, F., and Song, Y. (2023). Multi-step jailbreaking privacy attacks on chatgpt. arXiv.","DOI":"10.18653\/v1\/2023.findings-emnlp.272"},{"key":"ref_196","unstructured":"Liu, J.M., Li, D., Cao, H., Ren, T., Liao, Z., and Wu, J. (2023). Chatcounselor: A large language models for mental health support. arXiv."},{"key":"ref_197","first-page":"31","article-title":"Perturbation methods for protecting data privacy: A review of techniques and applications","volume":"4","author":"Turgay","year":"2023","journal-title":"Autom. Mach. Learn."},{"key":"ref_198","unstructured":"Tang, R., Han, X., Jiang, X., and Hu, X. (2023). Does synthetic data generation of llms help clinical text mining?. arXiv."},{"key":"ref_199","doi-asserted-by":"crossref","unstructured":"Ferrara, E. (2023). Should chatgpt be biased? challenges and risks of bias in large language models. arXiv.","DOI":"10.2139\/ssrn.4627814"},{"key":"ref_200","doi-asserted-by":"crossref","first-page":"6074","DOI":"10.1109\/JBHI.2023.3316750","article-title":"Large ai models in health informatics: Applications, challenges, and the future","volume":"27","author":"Qiu","year":"2023","journal-title":"IEEE J. Biomed. Health Inform."},{"key":"ref_201","doi-asserted-by":"crossref","unstructured":"Yang, Y., Liu, X., Jin, Q., Huang, F., and Lu, Z. (2024). Unmasking and Quantifying Racial Bias of Large Language Models in Medical Report Generation. arXiv.","DOI":"10.1038\/s43856-024-00601-z"},{"key":"ref_202","doi-asserted-by":"crossref","unstructured":"Kotek, H., Dockum, R., and Sun, D. (2023, January 6\u20139). Gender bias and stereotypes in large language models. Proceedings of the ACM Collective Intelligence Conference, Delft, The Netherlands.","DOI":"10.1145\/3582269.3615599"},{"key":"ref_203","doi-asserted-by":"crossref","first-page":"103654","DOI":"10.1016\/j.artint.2021.103654","article-title":"Quantifying and alleviating political bias in language models","volume":"304","author":"Liu","year":"2022","journal-title":"Artif. Intell."},{"key":"ref_204","doi-asserted-by":"crossref","unstructured":"Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A., and Narasimhan, K. (2023). Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv.","DOI":"10.18653\/v1\/2023.findings-emnlp.88"},{"key":"ref_205","doi-asserted-by":"crossref","unstructured":"Lahnala, A., Welch, C., Neuendorf, B., and Flek, L. (2022). Mitigating toxic degeneration with empathetic data: Exploring the relationship between toxicity and empathy. arXiv.","DOI":"10.18653\/v1\/2022.naacl-main.363"},{"key":"ref_206","unstructured":"Cui, S., Zhang, Z., Chen, Y., Zhang, W., Liu, T., Wang, S., and Liu, T. (2023). Fft: Towards harmlessness evaluation and analysis for llms with factuality, fairness, toxicity. 
arXiv."},{"key":"ref_207","unstructured":"Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., and Peng, W. (2024). A survey on hallucination in large vision-language models. arXiv."},{"key":"ref_208","doi-asserted-by":"crossref","unstructured":"Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L.Y., Wang, Y.X., and Yang, Y. (2023). Aligning large multimodal models with factually augmented rlhf. arXiv.","DOI":"10.18653\/v1\/2024.findings-acl.775"},{"key":"ref_209","unstructured":"Zhai, Y., Tong, S., Li, X., Cai, M., Qu, Q., Lee, Y.J., and Ma, Y. (2023). Investigating the catastrophic forgetting in multimodal large language models. arXiv."},{"key":"ref_210","doi-asserted-by":"crossref","unstructured":"Khan, H., Bouaynaya, N.C., and Rasool, G. (2023, January 9\u201312). The Importance of Robust Features in Mitigating Catastrophic Forgetting. Proceedings of the 2023 IEEE Symposium on Computers and Communications (ISCC), Tunis, Tunisia.","DOI":"10.1109\/ISCC58397.2023.10218203"},{"key":"ref_211","unstructured":"Zhou, D.W., Zhang, Y., Ning, J., Ye, H.J., Zhan, D.C., and Liu, Z. (2023). Learning without forgetting for vision-language models. arXiv."},{"key":"ref_212","doi-asserted-by":"crossref","first-page":"5362","DOI":"10.1109\/TPAMI.2024.3367329","article-title":"A comprehensive survey of continual learning: Theory, method and application","volume":"48","author":"Wang","year":"2024","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_213","doi-asserted-by":"crossref","unstructured":"Cai, Y., and Rostami, M. (2024). Dynamic Transformer Architecture for Continual Learning of Multimodal Tasks. arXiv.","DOI":"10.2139\/ssrn.4713352"},{"key":"ref_214","doi-asserted-by":"crossref","first-page":"34054","DOI":"10.1109\/ACCESS.2024.3369488","article-title":"Brain-Inspired Continual Learning: Robust Feature Distillation and Re-Consolidation for Class Incremental Learning","volume":"12","author":"Khan","year":"2024","journal-title":"IEEE Access"},{"key":"ref_215","doi-asserted-by":"crossref","unstructured":"Zhang, W., Huang, Y., Zhang, T., Zou, Q., Zheng, W.S., and Wang, R. (2023, January 8\u201312). Adapter learning in pretrained feature extractor for continual learning of diseases. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada.","DOI":"10.1007\/978-3-031-43895-0_7"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/2\/136\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:32:04Z","timestamp":1760027524000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/2\/136"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,12]]},"references-count":215,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2025,2]]}},"alternative-id":["info16020136"],"URL":"https:\/\/doi.org\/10.3390\/info16020136","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,12]]}}}