{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,13]],"date-time":"2026-06-13T02:06:38Z","timestamp":1781316398711,"version":"3.54.1"},"reference-count":32,"publisher":"Oxford University Press (OUP)","issue":"6","license":[{"start":{"date-parts":[[2025,4,7]],"date-time":"2025-04-07T00:00:00Z","timestamp":1743984000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/pages\/standard-publication-reuse-rights"}],"funder":[{"DOI":"10.13039\/501100002347","name":"Federal Ministry of Education and Research","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100002347","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Free State of Bavaria"},{"name":"Excellence Strategy of the Federal Government"},{"name":"L\u00e4nder"},{"name":"Technical University of Munich - Institute for Advanced Study"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,6,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objectives<\/jats:title>\n                  <jats:p>Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Discussion<\/jats:title>\n                  <jats:p>Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Conclusion<\/jats:title>\n                  <jats:p>Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocaf045","type":"journal-article","created":{"date-parts":[[2025,4,7]],"date-time":"2025-04-07T08:03:29Z","timestamp":1744013009000},"page":"1015-1024","source":"Crossref","is-referenced-by-count":32,"title":["Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks"],"prefix":"10.1093","volume":"32","author":[{"given":"Felix J","family":"Dorfner","sequence":"first","affiliation":[{"name":"Charit\u00e9\u2014Universit\u00e4tsmedizin Berlin, Corporate Member of Freie Universit\u00e4t Berlin and Humboldt-Universit\u00e4t zu Berlin , Berlin 10117,","place":["Germany"]},{"name":"Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School , Charlestown, MA 02129,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Amin","family":"Dada","sequence":"additional","affiliation":[{"name":"Institute for AI in Medicine (IKIM), University Hospital Essen (A\u00f6R) , Essen 45131,","place":["Germany"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9770-8555","authenticated-orcid":false,"given":"Felix","family":"Busch","sequence":"additional","affiliation":[{"name":"Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich , Munich 81675,","place":["Germany"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Marcus R","family":"Makowski","sequence":"additional","affiliation":[{"name":"Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich , Munich 81675,","place":["Germany"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Tianyu","family":"Han","sequence":"additional","affiliation":[{"name":"Department of Diagnostic and Interventional Radiology, University Hospital Aachen , Aachen 52074,","place":["Germany"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Daniel","family":"Truhn","sequence":"additional","affiliation":[{"name":"Department of Diagnostic and Interventional Radiology, University Hospital Aachen , Aachen 52074,","place":["Germany"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jens","family":"Kleesiek","sequence":"additional","affiliation":[{"name":"Institute for AI in Medicine (IKIM), University Hospital Essen (A\u00f6R) , Essen 45131,","place":["Germany"]},{"name":"Cancer Research Center Cologne Essen (CCCE), West German Cancer Center Essen, University Hospital Essen (A\u00f6R) , Essen 45147,","place":["Germany"]},{"name":"German Cancer Consortium (DKTK, Partner Site Essen) , Heidelberg,","place":["Germany"]},{"name":"Department of Physics, TU Dortmund , Dortmund 44227,","place":["Germany"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7884-0526","authenticated-orcid":false,"given":"Madhumita","family":"Sushil","sequence":"additional","affiliation":[{"name":"Bakar Computational Health Sciences Institute, University of California, San Francisco , San Francisco, CA 94158,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lisa C","family":"Adams","sequence":"additional","affiliation":[{"name":"Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich , Munich 81675,","place":["Germany"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Keno K","family":"Bressem","sequence":"additional","affiliation":[{"name":"Department of Radiology, Klinikum Rechts Der Isar, Technical University Munich , Munich 81675,","place":["Germany"]},{"name":"German Heart Center Munich, Technical University Munich , Munich 80636,","place":["Germany"]}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"286","published-online":{"date-parts":[[2025,4,7]]},"reference":[{"key":"2025052712470856000_ocaf045-B1","author":"Eriksen","year":"2023"},{"key":"2025052712470856000_ocaf045-B2","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1038\/s41591-021-01614-0","article-title":"AI in health and medicine","volume":"28","author":"Rajpurkar","year":"2022","journal-title":"Nat Med"},{"key":"2025052712470856000_ocaf045-B3","doi-asserted-by":"crossref","first-page":"1930","DOI":"10.1038\/s41591-023-02448-8","article-title":"Large language models in medicine","volume":"29","author":"Thirunavukarasu","year":"2023","journal-title":"Nat Med"},{"key":"2025052712470856000_ocaf045-B4","doi-asserted-by":"crossref","first-page":"e2440969","DOI":"10.1001\/jamanetworkopen.2024.40969","article-title":"Large language model influence on diagnostic reasoning: a randomized clinical trial","volume":"7","author":"Goh","year":"2024","journal-title":"JAMA Netw Open"},{"key":"2025052712470856000_ocaf045-B5","doi-asserted-by":"crossref","first-page":"e101102","DOI":"10.1136\/bmjhci-2024-101102","article-title":"Generative artificial intelligence in primary care: an online survey of UK general practitioners","volume":"31","author":"Blease","year":"2024","journal-title":"BMJ Health Care Inf"},{"key":"2025052712470856000_ocaf045-B6","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"2025052712470856000_ocaf045-B7","author":"Pal","year":"2024"},{"key":"2025052712470856000_ocaf045-B8","author":"Labrak","year":"2024"},{"key":"2025052712470856000_ocaf045-B9","doi-asserted-by":"crossref","first-page":"1833","DOI":"10.1093\/jamia\/ocae045","article-title":"PMC-LLaMA: toward building open-source language models for medicine","volume":"31","author":"Wu","year":"2024","journal-title":"J Am Med Inform Assoc"},{"key":"2025052712470856000_ocaf045-B10","author":"Christophe","year":"2024"},{"key":"2025052712470856000_ocaf045-B11","doi-asserted-by":"crossref","first-page":"1320","DOI":"10.1001\/jama.2023.27861","article-title":"Comparative analysis of multimodal large language model performance on clinical vignette questions","volume":"331","author":"Han","year":"2024","journal-title":"JAMA"},{"key":"2025052712470856000_ocaf045-B12","author":"Jiang","year":"2023"},{"key":"2025052712470856000_ocaf045-B13","doi-asserted-by":"crossref","first-page":"e0000198","DOI":"10.1371\/journal.pdig.0000198","article-title":"Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models","volume":"2","author":"Kung","year":"2023","journal-title":"PLoS Digit Health"},{"key":"2025052712470856000_ocaf045-B14","author":"Golchin","year":"2023"},{"key":"2025052712470856000_ocaf045-B15","doi-asserted-by":"crossref","first-page":"3521","DOI":"10.1073\/pnas.1611835114","article-title":"Overcoming catastrophic forgetting in neural networks","volume":"114","author":"Kirkpatrick","year":"2017","journal-title":"Proc Natl Acad Sci USA"},{"key":"2025052712470856000_ocaf045-B16","author":"Dada","year":"2024"},{"key":"2025052712470856000_ocaf045-B17","doi-asserted-by":"crossref","first-page":"160035","DOI":"10.1038\/sdata.2016.35","article-title":"MIMIC-III, a freely accessible critical care database","volume":"3","author":"Johnson","year":"2016","journal-title":"Sci Data"},{"key":"2025052712470856000_ocaf045-B18","author":"Romanov","year":"2018"},{"key":"2025052712470856000_ocaf045-B19","year":"2019"},{"key":"2025052712470856000_ocaf045-B20","year":"2022"},{"key":"2025052712470856000_ocaf045-B21","author":"Adams","year":"2024"},{"key":"2025052712470856000_ocaf045-B22","author":"Touvron","year":"2023"},{"key":"2025052712470856000_ocaf045-B23","volume-title":"Meta AI","author":"Meta","year":"2024"},{"key":"2025052712470856000_ocaf045-B24","author":"Chen","year":"2023"},{"key":"2025052712470856000_ocaf045-B25","author":"Toma","year":"2023"},{"key":"2025052712470856000_ocaf045-B26","author":"Balloccu","year":"2024"},{"key":"2025052712470856000_ocaf045-B27","author":"Han","year":"2023"},{"key":"2025052712470856000_ocaf045-B28","author":"Hoffmann","year":"2022"},{"key":"2025052712470856000_ocaf045-B29","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2023.conll-1.21","article-title":"Med-halt: medical domain hallucination test for large language models","author":"Pal","year":"2023"},{"key":"2025052712470856000_ocaf045-B30","author":"Ahmad","year":"2023"},{"key":"2025052712470856000_ocaf045-B31","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive NLP tasks","volume":"33","author":"Lewis","year":"2020","journal-title":"Adv Neural Inf Process Syst"},{"key":"2025052712470856000_ocaf045-B32","author":"Xiong","year":"2024"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/32\/6\/1015\/62881344\/ocaf045.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/32\/6\/1015\/62881344\/ocaf045.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,27]],"date-time":"2025-05-27T16:47:17Z","timestamp":1748364437000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/32\/6\/1015\/8107422"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,7]]},"references-count":32,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2025,4,7]]},"published-print":{"date-parts":[[2025,6,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocaf045","relation":{},"ISSN":["1067-5027","1527-974X"],"issn-type":[{"value":"1067-5027","type":"print"},{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,6]]},"published":{"date-parts":[[2025,4,7]]}}}