{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T23:16:25Z","timestamp":1776122185317,"version":"3.50.1"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,5,9]],"date-time":"2025-05-09T00:00:00Z","timestamp":1746748800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,9]],"date-time":"2025-05-09T00:00:00Z","timestamp":1746748800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001656","name":"Helmholtz-Gemeinschaft","doi-asserted-by":"publisher","award":["KA-TVP-60"],"award-info":[{"award-number":["KA-TVP-60"]}],"id":[{"id":"10.13039\/501100001656","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Accurate medical decision-making is critical for both patients and clinicians. Patients often struggle to interpret their symptoms, determine their severity, and select the right specialist. Simultaneously, clinicians face challenges in integrating complex patient data to make timely, accurate diagnoses. Recent advances in large language models (LLMs) offer the potential to bridge this gap by supporting decision-making for both patients and healthcare providers. 
In this study, we benchmark multiple LLM versions and an LLM-based workflow incorporating retrieval-augmented generation (RAG) on a curated dataset of 2000 medical cases derived from the Medical Information Mart for Intensive Care database. Our findings show that these LLMs are capable of providing personalized insights into likely diagnoses, suggesting appropriate specialists, and assessing urgent care needs. These models may also support clinicians in refining diagnoses and decision-making, offering a promising approach to improving patient outcomes and streamlining healthcare delivery.<\/jats:p>","DOI":"10.1038\/s41746-025-01684-1","type":"journal-article","created":{"date-parts":[[2025,5,9]],"date-time":"2025-05-09T11:16:29Z","timestamp":1746789389000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":73,"title":["Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis"],"prefix":"10.1038","volume":"8","author":[{"given":"Farieda","family":"Gaber","sequence":"first","affiliation":[]},{"given":"Maqsood","family":"Shaik","sequence":"additional","affiliation":[]},{"given":"Fabio","family":"Allega","sequence":"additional","affiliation":[]},{"given":"Agnes Julia","family":"Bilecz","sequence":"additional","affiliation":[]},{"given":"Felix","family":"Busch","sequence":"additional","affiliation":[]},{"given":"Kelsey","family":"Goon","sequence":"additional","affiliation":[]},{"given":"Vedran","family":"Franke","sequence":"additional","affiliation":[]},{"given":"Altuna","family":"Akalin","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,5,9]]},"reference":[{"key":"1684_CR1","first-page":"85","volume":"9","author":"A Sutriningsih","year":"2020","unstructured":"Sutriningsih, A., Wahyuni, C. U. & Haksama, S. Factors affecting emergency nurses\u2019 perceptions of the triage systems. J. Pub. Health Res. 
9, 85\u201387 (2020).","journal-title":"J. Pub. Health Res."},{"key":"1684_CR2","doi-asserted-by":"publisher","unstructured":"Ma, M. D. et al. CliBench: Multifaceted evaluation of Large Language Models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions. arXiv https:\/\/doi.org\/10.48550\/arXiv.2406.09923 (2024).","DOI":"10.48550\/arXiv.2406.09923"},{"key":"1684_CR3","doi-asserted-by":"publisher","first-page":"744","DOI":"10.3390\/app14020744","volume":"14","author":"A Testolin","year":"2023","unstructured":"Testolin, A. Can neural networks do arithmetic? A survey on the elementary numerical skills of state-of-the-art deep learning models. Appl. Sci. (Basel) 14, 744 (2023).","journal-title":"Appl. Sci. (Basel)"},{"key":"1684_CR4","first-page":"e55991","volume":"16","author":"A Abbas","year":"2024","unstructured":"Abbas, A., Rehman, M. S. & Rehman, S. S. Comparing the performance of popular large language models on the National Board of Medical Examiners sample questions. Cureus 16, e55991 (2024).","journal-title":"Cureus"},{"key":"1684_CR5","doi-asserted-by":"publisher","unstructured":"Brin, D. et al. How large language models perform on the United States medical licensing examination: A systematic review. medRxiv https:\/\/doi.org\/10.1101\/2023.09.03.23294842 (2023)","DOI":"10.1101\/2023.09.03.23294842"},{"key":"1684_CR6","unstructured":"Resources Optimal Care Injured Patient. 6th. American College Surgeons (2014)."},{"key":"1684_CR7","doi-asserted-by":"publisher","first-page":"38","DOI":"10.1080\/10903127.2022.2043963","volume":"27","author":"JR Lupton","year":"2023","unstructured":"Lupton, J. R. et al. Under-triage and over-triage using the Field Triage Guidelines for injured patients: A systematic review. Prehosp. Emerg. Care 27, 38\u201345 (2023).","journal-title":"Prehosp. Emerg. 
Care"},{"key":"1684_CR8","doi-asserted-by":"publisher","first-page":"2613","DOI":"10.1038\/s41591-024-03097-1","volume":"30","author":"P Hager","year":"2024","unstructured":"Hager, P. et al. Evaluating and mitigating limitations of large language models in clinical decision making. Nat. Med. 30, 2613\u20132622 (2024).","journal-title":"Nat. Med."},{"key":"1684_CR9","doi-asserted-by":"publisher","first-page":"1043","DOI":"10.1093\/mr\/road115","volume":"34","author":"Y Mori","year":"2024","unstructured":"Mori, Y., Izumiyama, T., Kanabuchi, R., Mori, N. & Aizawa, T. Large language model may assist diagnosis of SAPHO syndrome by bone scintigraphy. Mod. Rheumatol. 34, 1043\u20131046 (2024).","journal-title":"Mod. Rheumatol."},{"key":"1684_CR10","first-page":"18417","volume":"38","author":"T Kwon","year":"2023","unstructured":"Kwon, T. et al. Large language models are clinical reasoners: Reasoning-aware diagnosis framework with prompt-generated rationales. ACM AAAI Conf. AI 38, 18417\u201318425 (2023).","journal-title":"ACM AAAI Conf. AI"},{"key":"1684_CR11","doi-asserted-by":"publisher","first-page":"2534","DOI":"10.1016\/j.jseint.2023.07.018","volume":"7","author":"M Daher","year":"2023","unstructured":"Daher, M. et al. Breaking barriers: can ChatGPT compete with a shoulder and elbow specialist in diagnosis and management? JSES Int. 7, 2534\u20132541 (2023).","journal-title":"JSES Int"},{"key":"1684_CR12","doi-asserted-by":"publisher","unstructured":"Madadi, Y. et al. ChatGPT assisting diagnosis of neuro-ophthalmology diseases based on case reports. JNO https:\/\/doi.org\/10.1097\/WNO.000000000000227 (2023).","DOI":"10.1097\/WNO.000000000000227"},{"key":"1684_CR13","doi-asserted-by":"publisher","first-page":"664","DOI":"10.1097\/ICO.0000000000003492","volume":"43","author":"M Delsoz","year":"2023","unstructured":"Delsoz, M. et al. Performance of ChatGPT in diagnosis of corneal eye diseases. 
Cornea 43, 664\u2013670 (2023).","journal-title":"Cornea"},{"key":"1684_CR14","doi-asserted-by":"publisher","unstructured":"Sorin, V. et al. GPT-4 multimodal analysis on ophthalmology clinical cases including text and images. medRxiv https:\/\/doi.org\/10.1101\/2023.11.24.23298953 (2023)","DOI":"10.1101\/2023.11.24.23298953"},{"key":"1684_CR15","first-page":"1","volume":"1","author":"AV Eriksen","year":"2023","unstructured":"Eriksen, A. V., M\u00f6ller, S. & Ryg, J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 1, 1\u20133 (2023).","journal-title":"NEJM AI"},{"key":"1684_CR16","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1186\/s44247-023-00058-5","volume":"2","author":"D Ueda","year":"2024","unstructured":"Ueda, D. et al. Evaluating GPT-4-based ChatGPT\u2019s clinical potential on the NEJM quiz. BMC Digit Health 2, 4 (2024).","journal-title":"BMC Digit Health"},{"key":"1684_CR17","doi-asserted-by":"publisher","unstructured":"Han, T. et al. Comparative analysis of GPT-4Vision, GPT-4 and open source LLMs in clinical diagnostic accuracy: A benchmark against human expertise. medRxiv https:\/\/doi.org\/10.1101\/2023.11.03.23297957 (2023)","DOI":"10.1101\/2023.11.03.23297957"},{"key":"1684_CR18","doi-asserted-by":"publisher","unstructured":"Harsha, N. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv https:\/\/doi.org\/10.48550\/arXiv.2311.16452 (2023).","DOI":"10.48550\/arXiv.2311.16452"},{"key":"1684_CR19","unstructured":"Gilboy, N., Tanabe, T., Travers, D. & Rosenau, M. Emergency Severity Index (ESI): Triage Tool Emergency Department Care, Version 4 (2011)."},{"key":"1684_CR20","doi-asserted-by":"publisher","unstructured":"Johnson, A. et al. MIMIC-IV. 
PhysioNet https:\/\/doi.org\/10.13026\/HXP0-HG59 (2024).","DOI":"10.13026\/HXP0-HG59"},{"key":"1684_CR21","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-022-01899-x","volume":"10","author":"AEW Johnson","year":"2023","unstructured":"Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1 (2023).","journal-title":"Sci. Data"},{"key":"1684_CR22","doi-asserted-by":"publisher","first-page":"E215","DOI":"10.1161\/01.CIR.101.23.e215","volume":"101","author":"AL Goldberger","year":"2000","unstructured":"Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, E215\u2013E220 (2000).","journal-title":"Circulation"},{"key":"1684_CR23","unstructured":"Claude 3.5 Sonnet. https:\/\/www.anthropic.com\/news\/claude-3-5-sonnet."},{"key":"1684_CR24","unstructured":"Introducing the next generation of Claude. https:\/\/www.anthropic.com\/news\/claude-3-family."},{"key":"1684_CR25","doi-asserted-by":"publisher","unstructured":"Johnson, A. et al. MIMIC-IV-ED. PhysioNet https:\/\/doi.org\/10.13026\/5NTK-KM72 (2023).","DOI":"10.13026\/5NTK-KM72"},{"key":"1684_CR26","doi-asserted-by":"publisher","unstructured":"Johnson, A., Pollard, T., Horng, S., Celi, L. A. & Mark, R. MIMIC-IV-Note: Deidentified free-text clinical notes. PhysioNet https:\/\/doi.org\/10.13026\/1N74-NE17 (2023).","DOI":"10.13026\/1N74-NE17"},{"key":"1684_CR27","unstructured":"Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33, 9459\u20139474 (2020)."},{"key":"1684_CR28","doi-asserted-by":"publisher","first-page":"102","DOI":"10.1038\/s41746-024-01091-y","volume":"7","author":"S Kresevic","year":"2024","unstructured":"Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. NPJ Digit. Med. 
7, 102 (2024).","journal-title":"NPJ Digit. Med."},{"key":"1684_CR29","first-page":"3784","volume":"2021","author":"K Shuster","year":"2021","unstructured":"Shuster, K., Poff, S., Chen, M., Kiela, D. & Weston, J. Retrieval Augmentation Reduces Hallucination in Conversation. ACL Find. Assoc. Comput. Linguist.: EMNLP 2021, 3784\u20133803 (2021).","journal-title":"ACL Find. Assoc. Comput. Linguist.: EMNLP"},{"key":"1684_CR30","first-page":"228","volume":"6","author":"O Ayala","year":"2024","unstructured":"Ayala, O. & Bechard, P. Reducing hallucination in structured outputs via Retrieval-Augmented Generation. ACL Proc. Conf. North Am. Chapter Assoc. Comput. Linguist.: Hum. Lang. Technol. 6, 228\u2013238 (2024).","journal-title":"ACL Proc. Conf. North Am. Chapter Assoc. Comput. Linguist.: Hum. Lang. Technol."},{"key":"1684_CR31","doi-asserted-by":"publisher","DOI":"10.2196\/58041","volume":"26","author":"D Wang","year":"2024","unstructured":"Wang, D. et al. Enhancement of the performance of large language models in diabetes education through retrieval-augmented generation: Comparative study. J. Med. Internet Res. 26, e58041 (2024).","journal-title":"J. Med. Internet Res."},{"key":"1684_CR32","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-024-03886-w","volume":"11","author":"YJ Park","year":"2024","unstructured":"Park, Y. J., Jerng, S. E., Yoon, S. & Li, J. 1.5 million materials narratives generated by chatbots. Sci. Data 11, 1060 (2024).","journal-title":"Sci. Data"},{"key":"1684_CR33","first-page":"280","volume":"7","author":"MA Khaliq","year":"2024","unstructured":"Khaliq, M. A., Chang, P. Y.-C., Ma, M., Pflugfelder, B. & Mileti\u0107, F. RAGAR, your falsehood radar: RAG-augmented reasoning for political fact-checking using multimodal large language models. ACL Proc. FEVER Workshop 7, 280\u2013296 (2024).","journal-title":"ACL Proc. FEVER Workshop"},{"key":"1684_CR34","doi-asserted-by":"crossref","unstructured":"Lester, B., Al-Rfou, R. & Constant, N. 
The power of scale for parameter-efficient prompt tuning. ACL Conf. Empir. Methods Nat. Lang. Process, 3045\u20133059 (2021).","DOI":"10.18653\/v1\/2021.emnlp-main.243"},{"key":"1684_CR35","doi-asserted-by":"publisher","unstructured":"Yang, C. et al. Large Language Models as Optimizers. arXiv https:\/\/doi.org\/10.48550\/arXiv.2309.03409 (2023).","DOI":"10.48550\/arXiv.2309.03409"},{"key":"1684_CR36","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1038\/s41746-024-01010-1","volume":"7","author":"T Savage","year":"2024","unstructured":"Savage, T., Nayak, A., Gallo, R., Rangan, E. & Chen, J. H. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit. Med. 7, 20 (2024).","journal-title":"NPJ Digit. Med."},{"key":"1684_CR37","doi-asserted-by":"publisher","first-page":"2629","DOI":"10.1007\/s10439-023-03272-4","volume":"51","author":"L Giray","year":"2023","unstructured":"Giray, L. Prompt engineering with ChatGPT: A guide for academic writers. Ann. Biomed. Eng. 51, 2629\u20132633 (2023).","journal-title":"Ann. Biomed. Eng."},{"key":"1684_CR38","doi-asserted-by":"publisher","first-page":"14","DOI":"10.4236\/jcc.2024.1210002","volume":"12","author":"P Bansal","year":"2024","unstructured":"Bansal, P. Prompt engineering importance and applicability with generative AI. J. Comput. Commun. 12, 14\u201323 (2024).","journal-title":"J. Comput. Commun."},{"key":"1684_CR39","first-page":"22199","volume":"38","author":"T Kojima","year":"2022","unstructured":"Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv Acm. NIPS'22 38, 22199\u201322213 (2022).","journal-title":"arXiv Acm. NIPS'22"},{"key":"1684_CR40","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1038\/s41746-024-01029-4","volume":"7","author":"L Wang","year":"2024","unstructured":"Wang, L. et al. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. 
NPJ Digit. Med. 7, 41 (2024).","journal-title":"NPJ Digit. Med."},{"key":"1684_CR41","unstructured":"Annex III: High-Risk AI Systems Referred to in Article 6(2). https:\/\/artificialintelligenceact.eu\/annex\/3\/."},{"key":"1684_CR42","unstructured":"Claude 3.7 Sonnet. https:\/\/www.anthropic.com\/claude\/sonnet."},{"key":"1684_CR43","doi-asserted-by":"publisher","unstructured":"Gao, M. et al. Human-like Summarization Evaluation with ChatGPT. arXiv https:\/\/doi.org\/10.48550\/arXiv.2304.02554 (2023).","DOI":"10.48550\/arXiv.2304.02554"},{"key":"1684_CR44","doi-asserted-by":"publisher","first-page":"15607","DOI":"10.18653\/v1\/2023.acl-long.870","volume":"1","author":"C-H Chiang","year":"2023","unstructured":"Chiang, C.-H. & Lee, H.-Y. Can large language models be an alternative to human evaluations? ACL Proc. 61st Annu. Meet. Assoc. Comput. Linguist. 1, 15607\u201315631 (2023).","journal-title":"ACL Proc. 61st Annu. Meet. Assoc. Comput. Linguist."},{"key":"1684_CR45","first-page":"46595","volume":"37","author":"L Zheng","year":"2023","unstructured":"Zheng, L. et al. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. arXiv Acm. NIPS'23 37, 46595\u201346623 (2023).","journal-title":"arXiv Acm. NIPS'23"},{"key":"1684_CR46","first-page":"12032","volume":"2024","author":"TR Davidson","year":"2024","unstructured":"Davidson, T. R. et al. Self-Recognition in Language Models. ACL Find. Assoc. Comput. Linguist.: EMNLP 2024, 12032\u201312059 (2024).","journal-title":"ACL Find. Assoc. Comput. 
Linguist.: EMNLP"}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01684-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01684-1","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01684-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,9]],"date-time":"2025-05-09T11:16:36Z","timestamp":1746789396000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01684-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,9]]},"references-count":46,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1684"],"URL":"https:\/\/doi.org\/10.1038\/s41746-025-01684-1","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2024.09.27.24314505","asserted-by":"object"}]},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,9]]},"assertion":[{"value":"27 September 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 April 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 May 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"263"}}