{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,17]],"date-time":"2026-05-17T06:49:27Z","timestamp":1779000567729,"version":"3.51.4"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,12,26]],"date-time":"2025-12-26T00:00:00Z","timestamp":1766707200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,1,29]],"date-time":"2026-01-29T00:00:00Z","timestamp":1769644800000},"content-version":"vor","delay-in-days":34,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 metrics covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and revised 2069 open-ended Q&amp;A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (\n                    <jats:italic>p<\/jats:italic>\n                    \u2009&lt;\u20090.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.\n                  <\/jats:p>","DOI":"10.1038\/s41746-025-02277-8","type":"journal-article","created":{"date-parts":[[2025,12,26]],"date-time":"2025-12-26T07:23:58Z","timestamp":1766733838000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["A novel evaluation benchmark for medical LLMs illuminating safety and effectiveness in clinical domains"],"prefix":"10.1038","volume":"9","author":[{"given":"Shirui","family":"Wang","sequence":"first","affiliation":[]},{"given":"Zhihui","family":"Tang","sequence":"additional","affiliation":[]},{"given":"Huaxia","family":"Yang","sequence":"additional","affiliation":[]},{"given":"Qiuhong","family":"Gong","sequence":"additional","affiliation":[]},{"given":"Tiantian","family":"Gu","sequence":"additional","affiliation":[]},{"given":"Hongyang","family":"Ma","sequence":"additional","affiliation":[]},{"given":"Yongxin","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Wubin","family":"Sun","sequence":"additional","affiliation":[]},{"given":"Zeliang","family":"Lian","sequence":"additional","affiliation":[]},{"given":"Kehang","family":"Mao","sequence":"additional","affiliation":[]},{"given":"Yinan","family":"Jiang","sequence":"additional","affiliation":[]},{"given":"Zhicheng","family":"Huang","sequence":"additional","affiliation":[]},{"given":"Lingyun","family":"Ma","sequence":"additional","affiliation":[]},{"given":"Wenjie","family":"Shen","sequence":"additional","affiliation":[]},{"given":"Yajie","family":"Ji","sequence":"additional","affiliation":[]},{"given":"Yunhui","family":"Tan","sequence":"additional","affiliation":[]},{"given":"Chunbo","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Yunlu","family":"Gao","sequence":"additional","affiliation":[]},{"given":"Qianling","family":"Ye","sequence":"additional","affiliation":[]},{"given":"Rui","family":"Lin","sequence":"additional","affiliation":[]},{"given":"Mingyu","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Lijuan","family":"Niu","sequence":"additional","affiliation":[]},{"given":"Zhihao","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Peng","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Mengran","family":"Lang","sequence":"additional","affiliation":[]},{"given":"Yue","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Huimin","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Haitao","family":"Shen","sequence":"additional","affiliation":[]},{"given":"Long","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Qiguang","family":"Zhao","sequence":"additional","affiliation":[]},{"given":"Si-Xuan","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Lina","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Hua","family":"Gao","sequence":"additional","affiliation":[]},{"given":"Dongqiang","family":"Ye","sequence":"additional","affiliation":[]},{"given":"Lingmin","family":"Meng","sequence":"additional","affiliation":[]},{"given":"Youtao","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Naixin","family":"Liang","sequence":"additional","affiliation":[]},{"given":"Jianxiong","family":"Wu","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,12,26]]},"reference":[{"key":"2277_CR1","doi-asserted-by":"publisher","first-page":"210","DOI":"10.7326\/M23-2772","volume":"177","author":"JA Omiye","year":"2024","unstructured":"Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large Language Models in Medicine: The Potentials and Pitfalls: A Narrative Review. Ann. Intern. Med. 177, 210\u2013220 (2024).","journal-title":"Ann. Intern. Med."},{"key":"2277_CR2","doi-asserted-by":"publisher","first-page":"451","DOI":"10.1038\/s41586-025-08869-4","volume":"642","author":"D McDuff","year":"2025","unstructured":"McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451\u2013457 (2025).","journal-title":"Nature"},{"key":"2277_CR3","doi-asserted-by":"publisher","first-page":"319","DOI":"10.1001\/jama.2024.21700","volume":"333","author":"S Bedi","year":"2025","unstructured":"Bedi, S. et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 333, 319\u2013328 (2025).","journal-title":"JAMA"},{"key":"2277_CR4","doi-asserted-by":"publisher","first-page":"259","DOI":"10.1038\/s41586-023-05881-4","volume":"616","author":"M Moor","year":"2023","unstructured":"Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259\u2013265 (2023).","journal-title":"Nature"},{"key":"2277_CR5","doi-asserted-by":"publisher","unstructured":"Tordjman, M. et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat. Med. https:\/\/doi.org\/10.1038\/s41591-025-03726-3 (2025).","DOI":"10.1038\/s41591-025-03726-3"},{"key":"2277_CR6","unstructured":"Dada, A. et al. 124-136 (Association for Computational Linguistics)"},{"key":"2277_CR7","doi-asserted-by":"publisher","first-page":"1134","DOI":"10.1038\/s41591-024-02855-5","volume":"30","author":"D Van Veen","year":"2024","unstructured":"Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134\u20131142 (2024).","journal-title":"Nat. Med."},{"key":"2277_CR8","unstructured":"Ive, J. et al. Clean & Clear: Feasibility of Safe LLM Clinical Guidance. arXiv:2503.20953. https:\/\/ui.adsabs.harvard.edu\/abs\/2025arXiv250320953I (2025)."},{"key":"2277_CR9","doi-asserted-by":"publisher","first-page":"e441","DOI":"10.1016\/S2589-7500(24)00111-0","volume":"6","author":"A de Hond","year":"2024","unstructured":"de Hond, A. et al. From text to treatment: the crucial role of validation for generative large language models in health care. Lancet Digit Health 6, e441\u2013e443 (2024).","journal-title":"Lancet Digit Health"},{"key":"2277_CR10","doi-asserted-by":"publisher","first-page":"2613","DOI":"10.1038\/s41591-024-03097-1","volume":"30","author":"P Hager","year":"2024","unstructured":"Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613\u20132622 (2024).","journal-title":"Nat. Med."},{"key":"2277_CR11","doi-asserted-by":"publisher","DOI":"10.1186\/s12911-024-02709-7","volume":"24","author":"J Lee","year":"2024","unstructured":"Lee, J., Park, S., Shin, J. & Cho, B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inf. Decis. Mak. 24, 366 (2024).","journal-title":"BMC Med Inf. Decis. Mak."},{"key":"2277_CR12","unstructured":"Liu, M. et al. MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models. arXiv:2407.10990. https:\/\/ui.adsabs.harvard.edu\/abs\/2024arXiv240710990L (2024)."},{"key":"2277_CR13","unstructured":"Ying, Z. et al. SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models. arXiv:2410.18927. https:\/\/ui.adsabs.harvard.edu\/abs\/2024arXiv241018927Y."},{"key":"2277_CR14","unstructured":"Zhang, Z. et al. Agent-SafetyBench: Evaluating the Safety of LLM Agents. arXiv:2412.14470 (2024). https:\/\/ui.adsabs.harvard.edu\/abs\/2024arXiv241214470Z."},{"key":"2277_CR15","unstructured":"Deniz, F. et al. aiXamine: Simplified LLM Safety and Security. arXiv:2504.14985 (2025). https:\/\/ui.adsabs.harvard.edu\/abs\/2025arXiv250414985D."},{"key":"2277_CR16","doi-asserted-by":"publisher","first-page":"263","DOI":"10.1038\/s41746-025-01684-1","volume":"8","author":"F Gaber","year":"2025","unstructured":"Gaber, F. et al. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. NPJ Digit Med 8, 263 (2025).","journal-title":"NPJ Digit Med"},{"key":"2277_CR17","unstructured":"Arora, R. K. et al. HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv:2505.08775. https:\/\/ui.adsabs.harvard.edu\/abs\/2025arXiv250508775A."},{"key":"2277_CR18","unstructured":"Liu, L. et al. Towards Automatic Evaluation for LLMs\u2019 Clinical Capabilities: Metric, Data, and Algorithm. arXiv:2403.16446 (2024). https:\/\/ui.adsabs.harvard.edu\/abs\/2024arXiv240316446L."},{"key":"2277_CR19","unstructured":"Singhal, K. et al. Towards Expert-Level Medical Question Answering with Large Language Models. arXiv:2305.09617 (2023). https:\/\/ui.adsabs.harvard.edu\/abs\/2023arXiv230509617S."},{"key":"2277_CR20","doi-asserted-by":"publisher","first-page":"77","DOI":"10.1038\/s41591-024-03328-5","volume":"31","author":"S Johri","year":"2025","unstructured":"Johri, S. et al. An evaluation framework for clinical use of large language models in patient interaction tasks. Nat. Med 31, 77\u201386 (2025).","journal-title":"Nat. Med"},{"key":"2277_CR21","unstructured":"Tu, T. et al. Towards Conversational Diagnostic AI. arXiv:2401.05654. https:\/\/ui.adsabs.harvard.edu\/abs\/2024arXiv240105654T."},{"key":"2277_CR22","unstructured":"Schmidgall, S. et al. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments. arXiv:2405.07960. https:\/\/ui.adsabs.harvard.edu\/abs\/2024arXiv240507960S."},{"key":"2277_CR23","unstructured":"Liao, Y., Meng, Y., Liu, H., Wang, Y. & Wang, Y. An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models. arXiv:2309.02077. https:\/\/ui.adsabs.harvard.edu\/abs\/2023arXiv230902077L."},{"key":"2277_CR24","unstructured":"Shi, X. et al. LLM-Mini-CEX: Automatic Evaluation of Large Language Model for Diagnostic Conversation. arXiv:2308.07635. https:\/\/ui.adsabs.harvard.edu\/abs\/2023arXiv230807635S (2023)."},{"key":"2277_CR25","doi-asserted-by":"publisher","first-page":"358","DOI":"10.1038\/s41746-024-01356-6","volume":"7","author":"D Fast","year":"2024","unstructured":"Fast, D. et al. Autonomous medical evaluation for guideline adherence of large language models. NPJ Digit Med 7, 358 (2024).","journal-title":"NPJ Digit Med"},{"key":"2277_CR26","doi-asserted-by":"publisher","unstructured":"Sharma, S., Tuli, S. & Badam, N. Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek Family of Models. arXiv 2508.21377, https:\/\/doi.org\/10.48550\/arXiv.2508.21377 (2025).","DOI":"10.48550\/arXiv.2508.21377"},{"key":"2277_CR27","doi-asserted-by":"publisher","DOI":"10.1016\/j.lanepe.2024.101064","volume":"46","author":"L Verlingue","year":"2024","unstructured":"Verlingue, L. et al. Artificial intelligence in oncology: ensuring safe and effective integration of language models in clinical practice. Lancet Reg. Health Eur. 46, 101064 (2024).","journal-title":"Lancet Reg. Health Eur."},{"key":"2277_CR28","unstructured":"Bedi, S. et al. MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks. arXiv:2505.23802 (2025). https:\/\/ui.adsabs.harvard.edu\/abs\/2025arXiv250523802B."},{"key":"2277_CR29","doi-asserted-by":"publisher","first-page":"687","DOI":"10.3390\/bioengineering12070687","volume":"12","author":"B Pingua","year":"2025","unstructured":"Pingua, B. et al. Medical LLMs: Fine-Tuning vs. Retrieval-Augmented Generation. Bioengineering 12, 687 (2025).","journal-title":"Bioengineering"},{"key":"2277_CR30","doi-asserted-by":"publisher","first-page":"3270","DOI":"10.1038\/s41591-025-03983-2","volume":"31","author":"ZL Teo","year":"2025","unstructured":"Teo, Z. L. et al. Generative Artificial Intelligence in Medicine. Nat. Med. 31, 3270\u20133282 (2025).","journal-title":"Nat. Med."},{"key":"2277_CR31","doi-asserted-by":"publisher","first-page":"231","DOI":"10.1007\/s40265-024-02124-2","volume":"85","author":"Y Zhang","year":"2025","unstructured":"Zhang, Y. et al. Aligning Large Language Models with Humans: A Comprehensive Survey of ChatGPT\u2019s Aptitude in Pharmacology. Drugs 85, 231\u2013254 (2025).","journal-title":"Drugs"},{"key":"2277_CR32","doi-asserted-by":"publisher","DOI":"10.1093\/jamiaopen\/ooaf055","volume":"8","author":"MT Dinc","year":"2025","unstructured":"Dinc, M. T., Bardak, A. E., Bahar, F. & Noronha, C. Comparative analysis of large language models in clinical diagnosis: performance evaluation across common and complex medical cases. JAMIA Open 8, ooaf055 (2025).","journal-title":"JAMIA Open"},{"key":"2277_CR33","unstructured":"Zheng, L. et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685 (2023). https:\/\/ui.adsabs.harvard.edu\/abs\/2023arXiv230605685Z."},{"key":"2277_CR34","doi-asserted-by":"publisher","unstructured":"Croxford, E. et al. Automating Evaluation of AI Text Generation in Healthcare with a Large Language Model (LLM)-as-a-Judge. medRxiv, https:\/\/doi.org\/10.1101\/2025.04.22.25326219 (2025).","DOI":"10.1101\/2025.04.22.25326219"},{"key":"2277_CR35","doi-asserted-by":"crossref","unstructured":"La Bella, S. et al. Global Variations in Artificial Intelligence-Generated Information on Juvenile Idiopathic Arthritis. Rheumatology, keaf329, (2025).","DOI":"10.1093\/rheumatology\/keaf329"},{"key":"2277_CR36","doi-asserted-by":"publisher","first-page":"159","DOI":"10.1007\/s00428-015-1878-5","volume":"468","author":"L Mastracci","year":"2016","unstructured":"Mastracci, L. et al. Interobserver Reproducibility in Pathologist Interpretation of Columnar-Lined Esophagus. Virchows Arch. 468, 159\u2013167 (2016).","journal-title":"Virchows Arch."}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02277-8","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02277-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02277-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,29]],"date-time":"2026-01-29T11:54:45Z","timestamp":1769687685000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02277-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,26]]},"references-count":36,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,12]]}},"alternative-id":["2277"],"URL":"https:\/\/doi.org\/10.1038\/s41746-025-02277-8","relation":{},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,26]]},"assertion":[{"value":"14 August 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 December 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 December 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"SW, TG, YW, WS, ZL, KM, DY, HG and LM are employees of Medlinker Intelligent and Digital Technology Co., Ltd, Beijing, China, the developers of the MedGPT model evaluated in this study. These authors contributed to the study concept only. The other authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"91"}}