{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,30]],"date-time":"2026-06-30T11:13:47Z","timestamp":1782818027396,"version":"3.54.5"},"reference-count":68,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T00:00:00Z","timestamp":1763424000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T00:00:00Z","timestamp":1763424000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100008460","name":"National Center for Complementary and Integrative Health","doi-asserted-by":"publisher","award":["R01AT009457"],"award-info":[{"award-number":["R01AT009457"]}],"id":[{"id":"10.13039\/100008460","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000049","name":"National Institute on Aging","doi-asserted-by":"publisher","award":["R01AG078154"],"award-info":[{"award-number":["R01AG078154"]}],"id":[{"id":"10.13039\/100000049","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000054","name":"National Cancer 
Institute","doi-asserted-by":"publisher","award":["R01CA287413"],"award-info":[{"award-number":["R01CA287413"]}],"id":[{"id":"10.13039\/100000054","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Explainable disease diagnosis, which leverages patient information (e.g., symptoms) and computational models to generate probable diagnoses and reasoning, holds strong clinical promise. Yet, when clinical notes lack sufficient evidence for a definitive diagnosis, such as the absence of definitive symptoms, diagnostic uncertainty commonly arises, increasing the risk of misdiagnosis. Despite its importance, the explicit identification and explanation of diagnostic uncertainty remain under-explored in artificial intelligence-driven systems. To fill this gap, we introduce ConfiDx, an uncertainty-aware large language model fine-tuned with diagnostic criteria. We formalized the task of uncertainty-aware diagnosis and curated richly annotated datasets that reflect varying degrees of diagnostic ambiguity. Evaluating on real-world datasets demonstrated that ConfiDx excelled in identifying diagnostic uncertainties, achieving superior diagnostic performance, and generating trustworthy explanations for diagnoses and uncertainties. 
Moreover, ConfiDx-assisted experts outperformed standalone experts by 10.7% in uncertainty recognition and 26% in uncertainty explanation, underscoring its substantial potential to improve clinical decision-making.<\/jats:p>","DOI":"10.1038\/s41746-025-02071-6","type":"journal-article","created":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T16:23:45Z","timestamp":1763483025000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":14,"title":["Uncertainty-aware large language models for explainable disease diagnosis"],"prefix":"10.1038","volume":"8","author":[{"given":"Shuang","family":"Zhou","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jiashuo","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zidu","family":"Xu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Song","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"David","family":"Brauer","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lindsay","family":"Welton","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jacob","family":"Cogan","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yuen-Hei","family":"Chung","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lei","family":"Tian","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zaifu","family":"Zhan","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yu","family":"Hou","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref
","role":"author"}]},{"given":"Mingquan","family":"Lin","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Genevieve B.","family":"Melton","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Rui","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,11,18]]},"reference":[{"key":"2071_CR1","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1038\/s41746-024-01010-1","volume":"7","author":"T Savage","year":"2024","unstructured":"Savage, T., Nayak, A., Gallo, R., Rangan, E. & Chen, J. H. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit. Med. 7, 20 (2024).","journal-title":"NPJ Digit. Med."},{"key":"2071_CR2","doi-asserted-by":"publisher","first-page":"2613","DOI":"10.1038\/s41591-024-03097-1","volume":"30","author":"P Hager","year":"2024","unstructured":"Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613\u20132622 (2024).","journal-title":"Nat. Med."},{"key":"2071_CR3","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1038\/s44401-025-00015-6","volume":"2","author":"S Zhou","year":"2025","unstructured":"Zhou, S. et al. Explainable differential diagnosis with dual-inference large language models. Npj Health Syst. 2, 12 (2025).","journal-title":"Npj Health Syst."},{"key":"2071_CR4","doi-asserted-by":"publisher","first-page":"102","DOI":"10.1038\/s41746-024-01091-y","volume":"7","author":"S Kresevic","year":"2024","unstructured":"Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. NPJ Digit. Med. 7, 102 (2024).","journal-title":"NPJ Digit. 
Med."},{"key":"2071_CR5","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1093\/jamia\/ocae254","volume":"32","author":"T Savage","year":"2025","unstructured":"Savage, T. et al. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. J. Am. Med. Inform. Assoc. 32, 139\u2013149 (2025).","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"2071_CR6","unstructured":"Hu, Z. et al. Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in LLMs. In The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)."},{"key":"2071_CR7","doi-asserted-by":"publisher","first-page":"565","DOI":"10.1001\/jama.293.5.565","volume":"293","author":"PC Smith","year":"2005","unstructured":"Smith, P. C. et al. Missing clinical information during primary care visits. JAMA 293, 565\u2013571 (2005).","journal-title":"JAMA"},{"key":"2071_CR8","doi-asserted-by":"publisher","first-page":"369","DOI":"10.1016\/j.jemermed.2007.11.082","volume":"35","author":"M Szpakowicz","year":"2008","unstructured":"Szpakowicz, M. & Herd, A. \u201cMedically cleared\u201d: how well are patients with psychiatric presentations examined by emergency physicians? J. Emerg. Med. 35, 369\u2013372 (2008).","journal-title":"J. Emerg. Med."},{"key":"2071_CR9","doi-asserted-by":"publisher","DOI":"10.1186\/1472-6963-11-114","volume":"11","author":"SJ Burnett","year":"2011","unstructured":"Burnett, S. J., Deelchand, V., Franklin, B. D., Moorthy, K. & Vincent, C. Missing clinical information in NHS hospital outpatient clinics: prevalence, causes and effects on patient care. BMC Health Serv. Res. 11, 114 (2011).","journal-title":"BMC Health Serv. Res."},{"key":"2071_CR10","doi-asserted-by":"publisher","first-page":"233","DOI":"10.1136\/qhc.11.3.233","volume":"11","author":"SM Dovey","year":"2002","unstructured":"Dovey, S. M. et al. A preliminary taxonomy of medical errors in family practice. Qual. Saf. 
Health Care 11, 233\u2013238 (2002).","journal-title":"Qual. Saf. Health Care"},{"key":"2071_CR11","doi-asserted-by":"publisher","first-page":"411","DOI":"10.5694\/j.1326-5377.1999.tb127814.x","volume":"170","author":"RM Wilson","year":"1999","unstructured":"Wilson, R. M., Harrison, B. T., Gibberd, R. W. & Hamilton, J. D. An analysis of the causes of adverse events from the Quality in Australian Health Care Study. Med. J. Aust. 170, 411\u2013415 (1999).","journal-title":"Med. J. Aust."},{"key":"2071_CR12","doi-asserted-by":"publisher","first-page":"E1","DOI":"10.1016\/j.jacc.2004.07.014","volume":"44","author":"EM Antman","year":"2004","unstructured":"Antman, E. M. et al. ACC\/AHA guidelines for the management of patients with ST-elevation myocardial infarction; A report of the American College of Cardiology\/American Heart Association Task Force on Practice Guidelines (Committee to Revise the 1999 Guidelines for the Management of patients with acute myocardial infarction). J. Am. Coll. Cardiol. 44, E1\u2013E211 (2004).","journal-title":"J. Am. Coll. Cardiol."},{"key":"2071_CR13","doi-asserted-by":"publisher","first-page":"2032","DOI":"10.1093\/eurheartj\/ehy076","volume":"39","author":"J-R Ghadri","year":"2018","unstructured":"Ghadri, J.-R. et al. International expert consensus document on takotsubo syndrome (part I): clinical characteristics, diagnostic criteria, and pathophysiology. Eur. Heart J. 39, 2032\u20132046 (2018).","journal-title":"Eur. Heart J."},{"key":"2071_CR14","doi-asserted-by":"publisher","first-page":"8","DOI":"10.1002\/ejhf.424","volume":"18","author":"AR Lyon","year":"2016","unstructured":"Lyon, A. R. et al. Current state of knowledge on Takotsubo syndrome: a Position Statement from the Taskforce on Takotsubo Syndrome of the Heart Failure Association of the European Society of Cardiology. Eur. J. Heart Fail. 18, 8\u201327 (2016).","journal-title":"Eur. J. 
Heart Fail."},{"key":"2071_CR15","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-023-41974-4","volume":"14","author":"M Lin","year":"2023","unstructured":"Lin, M. et al. Improving model fairness in image-based computer-aided diagnosis. Nat. Commun. 14, 6261 (2023).","journal-title":"Nat. Commun."},{"key":"2071_CR16","doi-asserted-by":"publisher","first-page":"106551","DOI":"10.1016\/j.neunet.2024.106551","volume":"179","author":"S Zhou","year":"2024","unstructured":"Zhou, S. et al. Open-world electrocardiogram classification via domain knowledge-driven contrastive learning. Neural Netw. 179, 106551 (2024).","journal-title":"Neural Netw."},{"key":"2071_CR17","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1038\/s44387-025-00011-z","volume":"1","author":"S Zhou","year":"2025","unstructured":"Zhou, S. et al. Large language models for disease diagnosis: A scoping review. npj Artificial Intelligence 1, 9 (2025).","journal-title":"npj Artificial Intelligence"},{"key":"2071_CR18","unstructured":"Vazhentsev, A. et al. Uncertainty-aware abstention in medical diagnosis based on medical texts. Preprint at https:\/\/arxiv.org\/abs\/2502.18050 (2025)."},{"key":"2071_CR19","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-020-15432-4","volume":"11","author":"AH Ribeiro","year":"2020","unstructured":"Ribeiro, A. H. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat. Commun. 11, 1760 (2020).","journal-title":"Nat. Commun."},{"key":"2071_CR20","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1038\/s41746-019-0216-8","volume":"3","author":"A Ghorbani","year":"2020","unstructured":"Ghorbani, A. et al. Deep learning interpretation of echocardiograms. NPJ Digit. Med. 3, 10 (2020).","journal-title":"NPJ Digit. Med."},{"key":"2071_CR21","doi-asserted-by":"publisher","first-page":"1061","DOI":"10.1038\/s42256-021-00423-x","volume":"3","author":"AJ Barnett","year":"2021","unstructured":"Barnett, A. J. et al. 
A case-based interpretable deep learning model for classification of mass lesions in digital mammography. Nat. Mach. Intell. 3, 1061\u20131070 (2021).","journal-title":"Nat. Mach. Intell."},{"key":"2071_CR22","doi-asserted-by":"crossref","unstructured":"McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451\u2013457 (2025).","DOI":"10.1038\/s41586-025-08869-4"},{"key":"2071_CR23","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-023-40260-7","volume":"14","author":"X Zhang","year":"2023","unstructured":"Zhang, X., Wu, C., Zhang, Y., Xie, W. & Wang, Y. Knowledge-enhanced visual-language pre-training on chest radiology images. Nat. Commun. 14, 4542 (2023).","journal-title":"Nat. Commun."},{"key":"2071_CR24","first-page":"18417","volume":"38","author":"T Kwon","year":"2024","unstructured":"Kwon, T. et al. Large language models are clinical reasoners: reasoning-aware diagnosis framework with prompt-generated rationales. Proc. Conf. AAAI Artif. Intell. 38, 18417\u201318425 (2024).","journal-title":"Proc. Conf. AAAI Artif. Intell."},{"key":"2071_CR25","doi-asserted-by":"publisher","DOI":"10.1093\/jamiaopen\/ooae154","volume":"8","author":"Y Gao","year":"2025","unstructured":"Gao, Y. et al. Uncertainty estimation in diagnosis generation from large language models: next-word probability is not pre-test probability. JAMIA Open 8, ooae154 (2025).","journal-title":"JAMIA Open"},{"key":"2071_CR26","doi-asserted-by":"crossref","unstructured":"Geng, J. et al. A survey of confidence estimation and calibration in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds. Duh, K., Gomez, H. & Bethard, S.) 6577\u20136595 (Association for Computational Linguistics, 2024).","DOI":"10.18653\/v1\/2024.naacl-long.366"},{"key":"2071_CR27","unstructured":"OpenAI et al. GPT-4 Technical Report. 
Preprint at https:\/\/arxiv.org\/abs\/2303.08774 (2023)."},{"key":"2071_CR28","unstructured":"Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at https:\/\/arxiv.org\/abs\/2302.13971 (2023)."},{"key":"2071_CR29","doi-asserted-by":"publisher","first-page":"e232715","DOI":"10.1148\/radiol.232715","volume":"311","author":"S Krishna","year":"2024","unstructured":"Krishna, S., Bhambra, N., Bleakney, R. & Bhayana, R. Evaluation of reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 on a radiology board-style examination. Radiology 311, e232715 (2024).","journal-title":"Radiology"},{"key":"2071_CR30","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-024-55628-6","volume":"16","author":"M Griot","year":"2025","unstructured":"Griot, M., Hemptinne, C., Vanderdonckt, J. & Yuksel, D. Large Language Models lack essential metacognition for reliable medical reasoning. Nat. Commun. 16, 642 (2025).","journal-title":"Nat. Commun."},{"key":"2071_CR31","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1111\/bdi.12609","volume":"20","author":"LN Yatham","year":"2018","unstructured":"Yatham, L. N. et al. Canadian Network for Mood and Anxiety Treatments (CANMAT) and International Society for Bipolar Disorders (ISBD) 2018 guidelines for the management of patients with bipolar disorder. Bipolar Disord. 20, 97\u2013170 (2018).","journal-title":"Bipolar Disord."},{"key":"2071_CR32","first-page":"1290","volume":"284","author":"GH Guyatt","year":"2000","unstructured":"Guyatt, G. H. et al. Users\u2019 Guides to the Medical Literature: XXV. Evidence-based medicine: principles for applying the Users\u2019 Guides to patient care. Evid. Based Med. Working Group. JAMA 284, 1290\u20131296 (2000).","journal-title":"Evid. Based Med. Working Group. 
JAMA"},{"key":"2071_CR33","doi-asserted-by":"publisher","first-page":"323","DOI":"10.1136\/bmj.318.7179.323","volume":"318","author":"T Greenhalgh","year":"1999","unstructured":"Greenhalgh, T. Narrative based medicine: narrative based medicine in an evidence based world. BMJ 318, 323\u2013325 (1999).","journal-title":"BMJ"},{"key":"2071_CR34","unstructured":"Zhou, H. et al. A survey of large language models in medicine: progress, application, and challenge. Preprint at https:\/\/arxiv.org\/abs\/2311.05112 (2023)."},{"key":"2071_CR35","doi-asserted-by":"crossref","unstructured":"Wang, J. et al. A survey on large Language Models from general purpose to medical applications: datasets, methodologies, and evaluations. Preprint at https:\/\/arxiv.org\/abs\/2406.10303 (2024).","DOI":"10.2139\/ssrn.4972504"},{"key":"2071_CR36","first-page":"1","volume":"56","author":"B Wang","year":"2023","unstructured":"Wang, B. et al. Pre-trained language models in biomedical domain: A systematic survey. ACM Computing Surveys 56, 1\u201352 (2023).","journal-title":"ACM Computing Surveys"},{"key":"2071_CR37","doi-asserted-by":"crossref","unstructured":"Xu, K., Cheng, Y., Hou, W., Tan, Q. & Li, W. Reasoning like a doctor: Improving medical dialogue systems via diagnostic reasoning process alignment. In Findings of the Association for Computational Linguistics ACL 2024 (eds. Ku, L.-W., Martins, A. & Srikumar, V.) 6796\u20136814 (Association for Computational Linguistics, 2024).","DOI":"10.18653\/v1\/2024.findings-acl.406"},{"key":"2071_CR38","doi-asserted-by":"publisher","first-page":"96449","DOI":"10.52202\/079017-3057","volume":"37","author":"H Cui","year":"2024","unstructured":"Cui, H. et al. Biomedical visual instruction tuning with clinician preference alignment. Adv. Neural Inf. Process. Syst. 37, 96449\u201396467 (2024).","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"2071_CR39","doi-asserted-by":"crossref","unstructured":"Johnson, A. E. W. et al. 
MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1 (2023).","DOI":"10.1038\/s41597-023-01945-2"},{"key":"2071_CR40","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-023-02814-8","volume":"10","author":"Z Zhao","year":"2023","unstructured":"Zhao, Z., Jin, Q., Chen, F., Peng, T. & Yu, S. A large-scale dataset of patient summaries for retrieval-based clinical decision support systems. Sci. Data 10, 909 (2023).","journal-title":"Sci. Data"},{"key":"2071_CR41","unstructured":"Gemini Team et al. Gemini: a family of highly capable multimodal models. Preprint at https:\/\/arxiv.org\/abs\/2312.11805 (2023)."},{"key":"2071_CR42","doi-asserted-by":"publisher","DOI":"10.15620\/cdc\/164020","author":"S Curtin","year":"2024","unstructured":"Curtin, S., Tejada-Vera, B. & Bastian, B. Deaths: leading causes for 2022. CDC https:\/\/doi.org\/10.15620\/cdc\/164020 (2024).","journal-title":"CDC"},{"key":"2071_CR43","doi-asserted-by":"publisher","unstructured":"Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature https:\/\/doi.org\/10.1038\/s41586-025-08866-7 (2025).","DOI":"10.1038\/s41586-025-08866-7"},{"key":"2071_CR44","unstructured":"Wang, B. et al. DiReCT: Diagnostic reasoning for clinical notes via large language models. In Advances in Neural Information Processing Systems (NIPS, 2024)."},{"key":"2071_CR45","doi-asserted-by":"publisher","first-page":"633","DOI":"10.1038\/s41586-025-09422-z","volume":"645","author":"D Guo","year":"2025","unstructured":"Guo, D. et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645, 633\u2013638 (2025).","journal-title":"Nature"},{"key":"2071_CR46","doi-asserted-by":"publisher","first-page":"166","DOI":"10.1038\/s41746-025-01556-8","volume":"8","author":"B Bhasuran","year":"2025","unstructured":"Bhasuran, B. et al. Preliminary analysis of the impact of lab results on large language model generated differential diagnoses. NPJ Digit. Med. 
8, 166 (2025).","journal-title":"NPJ Digit. Med."},{"key":"2071_CR47","doi-asserted-by":"publisher","first-page":"2467","DOI":"10.1001\/jama.2021.22396","volume":"326","author":"J Adler-Milstein","year":"2021","unstructured":"Adler-Milstein, J., Chen, J. H. & Dhaliwal, G. Next-generation artificial intelligence for diagnosis: from predicting diagnostic labels to \u201cwayfinding\u201d. JAMA 326, 2467\u20132468 (2021).","journal-title":"JAMA"},{"key":"2071_CR48","doi-asserted-by":"publisher","first-page":"932","DOI":"10.1038\/s41591-024-03416-6","volume":"31","author":"X Liu","year":"2025","unstructured":"Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 31, 932\u2013942 (2025).","journal-title":"Nat. Med."},{"key":"2071_CR49","doi-asserted-by":"publisher","first-page":"545","DOI":"10.1093\/jamia\/ocaf002","volume":"32","author":"Z Zhan","year":"2025","unstructured":"Zhan, Z., Zhou, S., Li, M. & Zhang, R. RAMIE: retrieval-augmented multi-task information extraction with large language models on dietary supplements. J. Am. Med. Inform. Assoc. 32, 545\u2013554 (2025).","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"2071_CR50","unstructured":"Han, T. et al. MedAlpaca -- an open-source collection of medical conversational AI models and training data. Preprint at https:\/\/arxiv.org\/abs\/2304.08247 (2023)."},{"key":"2071_CR51","doi-asserted-by":"publisher","first-page":"1833","DOI":"10.1093\/jamia\/ocae045","volume":"31","author":"C Wu","year":"2024","unstructured":"Wu, C. et al. PMC-LLaMA: toward building open-source language models for medicine. J. Am. Med. Inform. Assoc. 31, 1833\u20131843 (2024).","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"2071_CR52","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-024-50043-3","volume":"15","author":"J Zhou","year":"2024","unstructured":"Zhou, J. et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat. Commun. 
15, 5649 (2024).","journal-title":"Nat. Commun."},{"key":"2071_CR53","doi-asserted-by":"publisher","first-page":"259","DOI":"10.1038\/s41586-023-05881-4","volume":"616","author":"M Moor","year":"2023","unstructured":"Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259\u2013265 (2023).","journal-title":"Nature"},{"key":"2071_CR54","doi-asserted-by":"publisher","first-page":"S17","DOI":"10.2337\/dc22-S002","volume":"45","author":"American Diabetes Association Professional Practice Committee","year":"2022","unstructured":"American Diabetes Association Professional Practice Committee 2. Classification and diagnosis of diabetes: Standards of Medical Care in diabetes-2022. Diabetes Care 45, S17\u2013S38 (2022).","journal-title":"Diabetes Care"},{"key":"2071_CR55","doi-asserted-by":"publisher","first-page":"305","DOI":"10.1016\/j.gheart.2018.08.004","volume":"13","author":"K Thygesen","year":"2018","unstructured":"Thygesen, K. et al. Fourth universal definition of myocardial infarction (2018). Glob. Heart 13, 305\u2013338 (2018).","journal-title":"Glob. Heart"},{"key":"2071_CR56","doi-asserted-by":"publisher","first-page":"159","DOI":"10.1038\/s41746-025-01550-0","volume":"8","author":"X Chen","year":"2025","unstructured":"Chen, X. et al. Enhancing diagnostic capability with multi-agents conversational large language models. NPJ Digit. Med. 8, 159 (2025).","journal-title":"NPJ Digit. Med."},{"key":"2071_CR57","doi-asserted-by":"crossref","unstructured":"DeYoung, J. et al. ERASER: a benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (eds. Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 
4443\u20134458 (Association for Computational Linguistics, 2020).","DOI":"10.18653\/v1\/2020.acl-main.408"},{"key":"2071_CR58","doi-asserted-by":"publisher","first-page":"427","DOI":"10.1016\/j.ipm.2009.03.002","volume":"45","author":"M Sokolova","year":"2009","unstructured":"Sokolova, M. & Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 45, 427\u2013437 (2009).","journal-title":"Inf. Process. Manag."},{"key":"2071_CR59","unstructured":"Christophe, C. et al. Med42 - Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches. AAAI 2024 Spring Symposium on Clinical Foundation Models (2024)."},{"key":"2071_CR60","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s44401-024-00003-2","volume":"2","author":"R Yang","year":"2025","unstructured":"Yang, R. et al. Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Syst. 2, 1\u20135 (2025).","journal-title":"npj Health Syst."},{"key":"2071_CR61","unstructured":"Hu, E. J. et al. LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations (2022)."},{"key":"2071_CR62","doi-asserted-by":"crossref","unstructured":"Yin, Z. et al. Do large language models know what they don\u2019t know? In Findings of the Association for Computational Linguistics: ACL 2023 (Association for Computational Linguistics, 2023).","DOI":"10.18653\/v1\/2023.findings-acl.551"},{"key":"2071_CR63","doi-asserted-by":"crossref","unstructured":"Amayuelas, A., Wong, K., Pan, L., Chen, W. & Wang, W. Y. Knowledge of knowledge: exploring known-unknowns uncertainty with large language models. In Findings of the Association for Computational Linguistics ACL 2024 (eds. Ku, L.-W., Martins, A. & Srikumar, V.) 
6416\u20136432 (Association for Computational Linguistics, 2024).","DOI":"10.18653\/v1\/2024.findings-acl.383"},{"key":"2071_CR64","unstructured":"Banerjee, S. & Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and\/or Summarization (Association for Computational Linguistics, 2005)."},{"key":"2071_CR65","unstructured":"Zhang, T. et al. BERTScore: Evaluating Text Generation with BERT. International Conference on Learning Representations (2020)."},{"key":"2071_CR66","doi-asserted-by":"publisher","unstructured":"Reimers, N. & Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (Association for Computational Linguistics, 2019). https:\/\/doi.org\/10.18653\/v1\/d19-1410.","DOI":"10.18653\/v1\/d19-1410"},{"key":"2071_CR67","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1038\/s41746-024-01074-z","volume":"7","author":"M Abbasian","year":"2024","unstructured":"Abbasian, M. et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digit. Med. 7, 82 (2024).","journal-title":"NPJ Digit. Med."},{"key":"2071_CR68","doi-asserted-by":"crossref","unstructured":"Croxford, E. et al. Current and future state of evaluation of large language models for medical summarization tasks. Npj Health Syst. 
2, 6 (2025).","DOI":"10.1038\/s44401-024-00011-2"}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02071-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02071-6","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02071-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T05:07:36Z","timestamp":1763528856000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02071-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,18]]},"references-count":68,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["2071"],"URL":"https:\/\/doi.org\/10.1038\/s41746-025-02071-6","relation":{},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,18]]},"assertion":[{"value":"8 July 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 October 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 November 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"690"}}