{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,17]],"date-time":"2026-07-17T19:28:20Z","timestamp":1784316500092,"version":"3.55.0"},"reference-count":45,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,2,12]],"date-time":"2025-02-12T00:00:00Z","timestamp":1739318400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,2,12]],"date-time":"2025-02-12T00:00:00Z","timestamp":1739318400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Recent advancements in large language models (LLMs) have created new ways to support radiological diagnostics. While both open-source and proprietary LLMs can address privacy concerns through local or cloud deployment, open-source models provide advantages in continuity of access, and potentially lower costs. This study evaluated the diagnostic performance of fifteen open-source LLMs and one closed-source LLM (GPT-4o) in 1,933 cases from the Eurorad library. LLMs provided differential diagnoses based on clinical history and imaging findings. Responses were considered correct if the true diagnosis appeared in the top three suggestions. Models were further tested on 60 non-public brain MRI cases from a tertiary hospital to assess generalizability. In both datasets, GPT-4o demonstrated superior performance, closely followed by Llama-3-70B, revealing how open-source LLMs are rapidly closing the gap to proprietary models. 
Our findings highlight the potential of open-source LLMs as decision support tools for radiological differential diagnosis in challenging, real-world cases.<\/jats:p>","DOI":"10.1038\/s41746-025-01488-3","type":"journal-article","created":{"date-parts":[[2025,2,12]],"date-time":"2025-02-12T01:13:36Z","timestamp":1739322816000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":36,"title":["Benchmarking the diagnostic performance of open source LLMs in 1933 Eurorad case reports"],"prefix":"10.1038","volume":"8","author":[{"given":"Su Hwan","family":"Kim","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Severin","family":"Schramm","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lisa C.","family":"Adams","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Rickmer","family":"Braren","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Keno K.","family":"Bressem","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Matthias","family":"Keicher","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Paul-S\u00f6ren","family":"Platzek","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Karolin Johanna","family":"Paprottka","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Claus","family":"Zimmer","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dennis 
M.","family":"Hedderich","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Benedikt","family":"Wiestler","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,2,12]]},"reference":[{"key":"1488_CR1","doi-asserted-by":"publisher","first-page":"e230877","DOI":"10.1148\/radiol.230877","volume":"307","author":"RJ Gertz","year":"2023","unstructured":"Gertz, R. J. et al. GPT-4 for automated determination of radiologic study and protocol based on radiology request forms: a feasibility study. Radiology 307, e230877 (2023).","journal-title":"Radiology"},{"key":"1488_CR2","doi-asserted-by":"publisher","first-page":"e230970","DOI":"10.1148\/radiol.230970","volume":"308","author":"A Rau","year":"2023","unstructured":"Rau, A. et al. A context-based chatbot surpasses radiologists and generic ChatGPT in following the ACR appropriateness guidelines. Radiology 308, e230970 (2023).","journal-title":"Radiology"},{"key":"1488_CR3","doi-asserted-by":"publisher","first-page":"e231167","DOI":"10.1148\/radiol.231167","volume":"308","author":"J Kottlors","year":"2023","unstructured":"Kottlors, J. et al. Feasibility of differential diagnosis based on imaging patterns using a large language model. Radiology 308, e231167 (2023).","journal-title":"Radiology"},{"key":"1488_CR4","doi-asserted-by":"publisher","first-page":"e240689","DOI":"10.1148\/radiol.240689","volume":"314","author":"S Schramm","year":"2025","unstructured":"Schramm, S. et al. Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4V in Challenging Brain MRI Cases. Radiology 314, e240689 (2025).","journal-title":"Radiology"},{"key":"1488_CR5","doi-asserted-by":"publisher","first-page":"808","DOI":"10.1007\/s11547-023-01651-4","volume":"128","author":"CA Mallio","year":"2023","unstructured":"Mallio, C. A., Sertorio, A. C., Bernetti, C. & Beomonte Zobel, B. 
Large language models for structured reporting in radiology: performance of GPT-4, ChatGPT-3.5, Perplexity and Bing. Radiol. Med. 128, 808\u2013812 (2023).","journal-title":"Radiol. Med."},{"key":"1488_CR6","doi-asserted-by":"publisher","first-page":"e231593","DOI":"10.1148\/radiol.231593","volume":"310","author":"R Doshi","year":"2024","unstructured":"Doshi, R. et al. Quantitative evaluation of large language models to streamline radiology report impressions: a multimodal retrospective analysis. Radiology 310, e231593 (2024).","journal-title":"Radiology"},{"key":"1488_CR7","doi-asserted-by":"publisher","first-page":"230364","DOI":"10.1148\/ryai.230364","volume":"6","author":"B Le Guellec","year":"2024","unstructured":"Le Guellec, B. et al. Performance of an open-source large language model in extracting information from free-text radiology reports. Radiol. Artif. Intell. 6, 230364 (2024).","journal-title":"Radiol. Artif. Intell."},{"key":"1488_CR8","doi-asserted-by":"publisher","first-page":"e232741","DOI":"10.1148\/radiol.232741","volume":"311","author":"NC Lehnen","year":"2024","unstructured":"Lehnen, N. C. et al. Data extraction from free-text reports on mechanical thrombectomy in acute ischemic stroke using ChatGPT: a retrospective analysis. Radiology 311, e232741 (2024).","journal-title":"Radiology"},{"key":"1488_CR9","doi-asserted-by":"crossref","unstructured":"Katz, U. et al. GPT versus resident physicians \u2014 a benchmark based on official board scores. NEJM AI1, (2024).","DOI":"10.1056\/AIdbp2300192"},{"key":"1488_CR10","doi-asserted-by":"publisher","DOI":"10.1148\/radiol.240273","volume":"312","author":"PS Suh","year":"2024","unstructured":"Suh, P. S. et al. Comparing diagnostic accuracy of radiologists versus GPT-4V and gemini pro vision using image inputs from Diagnosis Please cases. 
Radiology 312, e240273 (2024).","journal-title":"Radiology"},{"key":"1488_CR11","doi-asserted-by":"publisher","first-page":"1231","DOI":"10.1007\/s11604-024-01619-y","volume":"42","author":"Y Sonoda","year":"2024","unstructured":"Sonoda, Y. et al. Diagnostic performances of GPT-4o, Claude 3 \u2013Opus, and Gemini 1.5 Pro in \u201cDiagnosis Please\u201d cases. Jpn. J. Radiol. 42, 1231\u20131235 (2024).","journal-title":"Jpn. J. Radiol."},{"key":"1488_CR12","doi-asserted-by":"crossref","unstructured":"Wu, S. et al. Benchmarking open-source large language models, GPT-4 and claude 2 on multiple-choice questions in nephrology. NEJM AI 1 (2024).","DOI":"10.1056\/AIdbp2300092"},{"key":"1488_CR13","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-024-46411-8","volume":"15","author":"S Sandmann","year":"2024","unstructured":"Sandmann, S., Riepenhausen, S., Plagwitz, L. & Varghese, J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat. Commun. 15, 2050 (2024).","journal-title":"Nat. Commun."},{"key":"1488_CR14","doi-asserted-by":"publisher","first-page":"e241191","DOI":"10.1148\/radiol.241191","volume":"312","author":"LC Adams","year":"2024","unstructured":"Adams, L. C. et al. Llama 3 challenges proprietary state-of-the-art large language models in radiology board\u2013style examination questions. Radiology 312, e241191 (2024).","journal-title":"Radiology"},{"key":"1488_CR15","unstructured":"Eurorad. Homepage. https:\/\/eurorad.org\/ (2024)."},{"key":"1488_CR16","doi-asserted-by":"publisher","unstructured":"Liu, F. et al. Large language models in the clinic: a comprehensive benchmark. Preprint at medRxiv https:\/\/doi.org\/10.1101\/2024.04.24.24306315 (2024).","DOI":"10.1101\/2024.04.24.24306315"},{"key":"1488_CR17","doi-asserted-by":"crossref","unstructured":"Jeong, D. P., Garg, S., Lipton, Z. C. & Oberst, M. Medical adaptation of large language and vision-language models: are we making progress? In Proc. 
2024 Conference on Empirical Methods in Natural Language Processing 12143-12170 (Association for Computational Linguistics, 2024).","DOI":"10.18653\/v1\/2024.emnlp-main.677"},{"key":"1488_CR18","unstructured":"Dorfner, F. J. et al. Biomedical large languages models seem not to be superior to generalist models on unseen medical data. Preprint at arXiv:2408.13833 (2024)."},{"key":"1488_CR19","unstructured":"Kim, S. H. et al. Human-AI collaboration in large language model-assisted brain MRI differential diagnosis: a usability study. European Radiology (in press)."},{"key":"1488_CR20","doi-asserted-by":"publisher","first-page":"6652","DOI":"10.1007\/s00330-024-10727-2","volume":"34","author":"R Siepmann","year":"2024","unstructured":"Siepmann, R. et al. The virtual reference radiologist: comprehensive AI assistance for clinical image reading and interpretation. Eur. Radiol. 34, 6652\u20136666 (2024).","journal-title":"Eur. Radiol."},{"key":"1488_CR21","doi-asserted-by":"crossref","unstructured":"Dratsch, T. et al. Automation bias in mammography: the impact of artificial intelligence BI-RADS suggestions on reader performance. Radiology 307, e222176 (2023).","DOI":"10.1148\/radiol.222176"},{"key":"1488_CR22","unstructured":"Kim, S. H. et al. Automation bias in AI-assisted detection of cerebral aneurysms on time-of-flight MR-angiography. La radiologia medica (in press)."},{"key":"1488_CR23","doi-asserted-by":"publisher","first-page":"e59439","DOI":"10.2196\/59439","volume":"26","author":"Y Ke","year":"2024","unstructured":"Ke, Y. et al. Mitigating cognitive biases in clinical decision-making through multi-agent conversations using large language models: simulation study. J. Med. Internet Res. 26, e59439 (2024).","journal-title":"J. Med. Internet Res."},{"key":"1488_CR24","doi-asserted-by":"publisher","first-page":"320","DOI":"10.1038\/s41746-024-01315-1","volume":"7","author":"E Klang","year":"2024","unstructured":"Klang, E. et al. 
A strategy for cost-effective large language model use at health system-scale. npj Digit. Med. 7, 320 (2024).","journal-title":"npj Digit. Med."},{"key":"1488_CR25","first-page":"2396","volume":"29","author":"S Gilbert","year":"2023","unstructured":"Gilbert, S., Harvey, H., Melvin, T., Vollebregt, E. & Wicks, P. Large language model AI chatbots require approval as medical devices. Nat. Med. 2023 29:10 29, 2396\u20132398 (2023).","journal-title":"Nat. Med. 2023 29:10"},{"key":"1488_CR26","doi-asserted-by":"publisher","first-page":"146045822110112","DOI":"10.1177\/14604582211011215","volume":"27","author":"Z Zhang","year":"2021","unstructured":"Zhang, Z. et al. Patients\u2019 perceptions of using artificial intelligence (AI)-based technology to comprehend radiology imaging data. Health Informatics J. 27, 14604582211011215 (2021).","journal-title":"Health Informatics J."},{"key":"1488_CR27","unstructured":"Golchin, S. & Surdeanu, M. Time travel in LLMs: tracing data contamination in large language models. In 12th International Conference on Learning Representations, ICLR 2024 (2023)."},{"key":"1488_CR28","unstructured":"Golchin, S. & Surdeanu, M. Data contamination quiz: a tool to detect and estimate contamination in large language models. Preprint at arXiv:2311.06233 (2023)."},{"key":"1488_CR29","unstructured":"Balloccu, S., Schmidtov\u00e1, P., Lango, M. & Du\u0161ek, O. Leak, cheat, repeat: data contamination and evaluation malpractices in closed-source LLMs. In Proc. 18th Conference European Chapter Association Computational Linguistics 67\u201393 (Association for Computational Linguistics, 2024)."},{"key":"1488_CR30","doi-asserted-by":"crossref","unstructured":"Dong, Y. et al. Generalization or memorization: data contamination and trustworthy evaluation for large language models. Preprint at arxiv:2402.15938 (2024).","DOI":"10.18653\/v1\/2024.findings-acl.716"},{"key":"1488_CR31","unstructured":"Gemini models. Gemini API. Google AI for developers. 
https:\/\/ai.google.dev\/gemini-api\/docs\/models\/gemini (2024)."},{"key":"1488_CR32","unstructured":"Models. Anthropic. https:\/\/docs.anthropic.com\/en\/docs\/about-claude\/models (2024)."},{"key":"1488_CR33","unstructured":"Models. OpenAI API. https:\/\/platform.openai.com\/docs\/models\/gp (2024)."},{"key":"1488_CR34","unstructured":"Wang, W. et al. CogVLM: visual expert for pretrained language models. Preprint at arXiv:2311.03079 (2023)."},{"key":"1488_CR35","unstructured":"Bai, J. et al. Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond. Preprint at arXiv:2308.12966 (2023)."},{"key":"1488_CR36","doi-asserted-by":"crossref","unstructured":"Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (2024).","DOI":"10.1109\/CVPR52733.2024.02484"},{"key":"1488_CR37","doi-asserted-by":"publisher","first-page":"1814","DOI":"10.1109\/ACCESS.2022.3232719","volume":"11","author":"MM Mohsan","year":"2023","unstructured":"Mohsan, M. M. et al. Vision transformer and language model based radiology report generation. IEEE Access 11, 1814\u20131824 (2023).","journal-title":"IEEE Access"},{"key":"1488_CR38","doi-asserted-by":"publisher","unstructured":"Tanno, R. et al. Collaboration between clinicians and vision\u2013language models in radiology report generation. Nat. Med. https:\/\/doi.org\/10.1038\/s41591-024-03302-1 (2024).","DOI":"10.1038\/s41591-024-03302-1"},{"key":"1488_CR39","doi-asserted-by":"publisher","first-page":"779","DOI":"10.1007\/s00062-024-01426-y","volume":"34","author":"D Horiuchi","year":"2024","unstructured":"Horiuchi, D. et al. Comparing the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in challenging neuroradiology cases. Clin. Neuroradiol. 34, 779\u2013787 (2024).","journal-title":"Clin. Neuroradiol."},{"key":"1488_CR40","unstructured":"Wu, C. 
et al. Can GPT-4V(ision) serve medical applications? Case studies on GPT-4V for multimodal medical diagnosis. Preprint at arXiv:2310.09909 (2023)."},{"key":"1488_CR41","unstructured":"Wu, C., Zhang, X., Zhang, Y., Wang, Y. & Xie, W. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Preprint at arXiv:2308.02463 (2023)."},{"key":"1488_CR42","doi-asserted-by":"publisher","unstructured":"Blankemeier, L. et al. Merlin: a vision language foundation model for 3D computed tomography. Res. Sq. https:\/\/doi.org\/10.21203\/RS.3.RS-4546309\/V1 (2024).","DOI":"10.21203\/RS.3.RS-4546309\/V1"},{"key":"1488_CR43","doi-asserted-by":"publisher","DOI":"10.2196\/55318","volume":"12","author":"S Sivarajkumar","year":"2024","unstructured":"Sivarajkumar, S., Kelley, M., Samolyk-Mazzanti, A., Visweswaran, S. & Wang, Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Med. Inform. 12, e55318 (2024).","journal-title":"JMIR Med. Inform."},{"key":"1488_CR44","first-page":"46595","volume":"36","author":"L Zheng","year":"2023","unstructured":"Zheng, L. et al. Judging LLM-as-a-Judge with MT-bench and Chatbot arena. Adv. Neural Inf. Process Syst. 36, 46595\u201346623 (2023).","journal-title":"Adv. Neural Inf. Process Syst."},{"key":"1488_CR45","unstructured":"Singh Thakur, A., Choudhary, K., Srinik Ramayapally, V., Vaidyanathan, S. & Hupkes, D. Judging the judges: evaluating alignment and vulnerabilities in LLMs-as-judges. 
Preprint at arXiv: 2406.12624 (2024)."}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01488-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01488-3","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01488-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,12]],"date-time":"2025-02-12T13:09:48Z","timestamp":1739365788000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01488-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,12]]},"references-count":45,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1488"],"URL":"https:\/\/doi.org\/10.1038\/s41746-025-01488-3","relation":{},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,12]]},"assertion":[{"value":"14 September 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 January 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 February 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"97"}}