{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T20:41:56Z","timestamp":1778704916037,"version":"3.51.4"},"reference-count":40,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2024,10,13]],"date-time":"2024-10-13T00:00:00Z","timestamp":1728777600000},"content-version":"vor","delay-in-days":1,"URL":"https:\/\/academic.oup.com\/pages\/standard-publication-reuse-rights"}],"funder":[{"DOI":"10.13039\/100000002","name":"NIH","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000060","name":"National Institute of Allergy and Infectious Diseases","doi-asserted-by":"publisher","award":["1R01AI17812101"],"award-info":[{"award-number":["1R01AI17812101"]}],"id":[{"id":"10.13039\/100000060","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Institute on Drug Abuse Clinical Trials Network","award":["UG1DA015815\u2014CTN-0136"],"award-info":[{"award-number":["UG1DA015815\u2014CTN-0136"]}]},{"DOI":"10.13039\/100000936","name":"Gordon and Betty Moore Foundation","doi-asserted-by":"publisher","award":["#12409"],"award-info":[{"award-number":["#12409"]}],"id":[{"id":"10.13039\/100000936","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Stanford Artificial Intelligence in Medicine and Imaging"},{"name":"Human-Centered Artificial Intelligence"},{"name":"Partnership Grant, Google Inc. Research Collaboration, American Heart Association"},{"name":"Strategically Focused Research Network"},{"name":"Diversity in Clinical Trials","award":["UL1TR003142"],"award-info":[{"award-number":["UL1TR003142"]}]},{"DOI":"10.13039\/100000738","name":"Department of Veterans Affairs","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000738","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Introduction<\/jats:title>\n                  <jats:p>The inability of large language models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to estimate uncertainty in ways that are useful to physician-users.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Objective<\/jats:title>\n                  <jats:p>Evaluate the ability for uncertainty proxies to quantify LLM confidence when performing diagnosis and treatment selection tasks by assessing the properties of discrimination and calibration.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Methods<\/jats:title>\n                  <jats:p>We examined confidence elicitation (CE), token-level probability (TLP), and sample consistency (SC) proxies across GPT3.5, GPT4, Llama2, and Llama3. Uncertainty proxies were evaluated against 3 datasets of open-ended patient scenarios.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>SC discrimination outperformed TLP and CE methods. SC by sentence embedding achieved the highest discriminative performance (ROC AUC 0.68-0.79), yet with poor calibration. SC by GPT annotation achieved the second-best discrimination (ROC AUC 0.66-0.74) with accurate calibration. Verbalized confidence (CE) was found to consistently overestimate model confidence.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Discussion and Conclusions<\/jats:title>\n                  <jats:p>SC is the most effective method for estimating LLM uncertainty of the proxies evaluated. SC by sentence embedding can effectively estimate uncertainty if the user has a set of reference cases with which to re-calibrate their results, while SC by GPT annotation is the more effective method if the user does not have reference cases and requires accurate raw calibration. Our results confirm LLMs are consistently over-confident when verbalizing their confidence (CE).<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocae254","type":"journal-article","created":{"date-parts":[[2024,10,13]],"date-time":"2024-10-13T07:38:53Z","timestamp":1728805133000},"page":"139-149","source":"Crossref","is-referenced-by-count":46,"title":["Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment"],"prefix":"10.1093","volume":"32","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4828-5802","authenticated-orcid":false,"given":"Thomas","family":"Savage","sequence":"first","affiliation":[{"name":"Department of Medicine, Stanford University , Stanford, CA 94304,","place":["United States"]},{"name":"Division of Hospital Medicine, Stanford University , Stanford, CA 94304,","place":["United States"]}]},{"given":"John","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of Medicine, Stanford University , Stanford, CA 94304,","place":["United States"]},{"name":"Division of Gastroenterology and Hepatology, Department of Medicine, Stanford University , Stanford, CA 94304,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2601-0173","authenticated-orcid":false,"given":"Robert","family":"Gallo","sequence":"additional","affiliation":[{"name":"Palo Alto Veterans Affairs Medical Center , Palo Alto, CA 94304,","place":["United States"]},{"name":"Department of Health Policy, Stanford University , Stanford, CA 94304,","place":["United States"]}]},{"given":"Abdessalem","family":"Boukil","sequence":"additional","affiliation":[{"name":"Linguamind AI , Sousse 4000,","place":["Tunisia"]}]},{"given":"Vishwesh","family":"Patel","sequence":"additional","affiliation":[{"name":"M.P. Shah Government Medical College , Jamnagar, Gujarat 361008,","place":["India"]}]},{"given":"Seyed Amir Ahmad","family":"Safavi-Naini","sequence":"additional","affiliation":[{"name":"Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai , New York, NY 10029,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6900-5596","authenticated-orcid":false,"given":"Ali","family":"Soroush","sequence":"additional","affiliation":[{"name":"Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai , New York, NY 10029,","place":["United States"]},{"name":"The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai , New York, NY 10029,","place":["United States"]},{"name":"Henry D. Janowitz Division of Gastroenterology, Icahn School of Medicine at Mount Sinai , New York, NY 10029,","place":["United States"]}]},{"given":"Jonathan H","family":"Chen","sequence":"additional","affiliation":[{"name":"Department of Medicine, Stanford University , Stanford, CA 94304,","place":["United States"]},{"name":"Division of Hospital Medicine, Stanford University , Stanford, CA 94304,","place":["United States"]},{"name":"Stanford Center for Biomedical Informatics Research, Stanford University , Stanford, CA 94304,","place":["United States"]},{"name":"Clinical Excellence Research Center, Stanford University , Stanford, CA 94304,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2024,10,12]]},"reference":[{"key":"2025101120020631300_ocae254-B1","doi-asserted-by":"crossref","first-page":"1930","DOI":"10.1038\/s41591-023-02448-8","article-title":"Large language models in medicine","volume":"29","author":"Thirunavukarasu","year":"2023","journal-title":"Nat Med"},{"key":"2025101120020631300_ocae254-B2","doi-asserted-by":"publisher","first-page":"1028","DOI":"10.1001\/jamainternmed.2023.2909","article-title":"Chatbot vs medical student performance on free-response clinical reasoning examinations","volume":"183","author":"Strong","year":"2023","journal-title":"JAMA Intern Med"},{"key":"2025101120020631300_ocae254-B3","doi-asserted-by":"publisher","author":"Singhal","year":"2022","DOI":"10.48550\/arXiv.2212.13138"},{"key":"2025101120020631300_ocae254-B4","doi-asserted-by":"publisher","author":"Singhal","year":"2023","DOI":"10.48550\/arXiv.2305.09617"},{"key":"2025101120020631300_ocae254-B5","doi-asserted-by":"publisher","author":"McDuff","year":"2023","DOI":"10.48550\/arXiv.2312.00164"},{"key":"2025101120020631300_ocae254-B6","doi-asserted-by":"crossref","first-page":"e2325000","DOI":"10.1001\/jamanetworkopen.2023.25000","article-title":"Use of GPT-4 to analyze medical records of patients with extensive investigations and delayed diagnosis","volume":"6","author":"Shea","year":"2023","journal-title":"JAMA Netw Open"},{"key":"2025101120020631300_ocae254-B7","doi-asserted-by":"crossref","first-page":"181","DOI":"10.1177\/01410768231173123","article-title":"Large language models will not replace healthcare professionals: curbing popular fears and hype","volume":"116","author":"Thirunavukarasu","year":"2023","journal-title":"J R Soc Med"},{"key":"2025101120020631300_ocae254-B8","author":"Gao","year":"2024"},{"key":"2025101120020631300_ocae254-B9","author":"Hu","year":"2024"},{"key":"2025101120020631300_ocae254-B10","doi-asserted-by":"publisher","author":"Saab","year":"2024","DOI":"10.48550\/arXiv.2404.18416"},{"key":"2025101120020631300_ocae254-B11","author":"Labruna","year":"2024"},{"key":"2025101120020631300_ocae254-B12"},{"key":"2025101120020631300_ocae254-B13"},{"key":"2025101120020631300_ocae254-B14","author":"Touvron H"},{"key":"2025101120020631300_ocae254-B15"},{"key":"2025101120020631300_ocae254-B16","author":"Huang","year":"2023"},{"key":"2025101120020631300_ocae254-B17","author":"Hou","year":"2023"},{"key":"2025101120020631300_ocae254-B18","doi-asserted-by":"crossref","first-page":"711","DOI":"10.1038\/s41551-022-00988-x","article-title":"Tackling prediction uncertainty in machine learning for healthcare","volume":"7","author":"Chua","year":"2023","journal-title":"Nat Biomed Eng"},{"key":"2025101120020631300_ocae254-B19","author":"Ye","year":"2024"},{"key":"2025101120020631300_ocae254-B20","author":"Rivera","year":"2024"},{"key":"2025101120020631300_ocae254-B21","author":"Xiong","year":"2023"},{"key":"2025101120020631300_ocae254-B22","doi-asserted-by":"publisher","first-page":"5506","DOI":"10.18653\/v1\/2023.emnlp-main.335","author":"Zhou","year":"2023"},{"key":"2025101120020631300_ocae254-B23","author":"Bakman","year":"2024"},{"key":"2025101120020631300_ocae254-B24","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1038\/s41746-020-00367-3","article-title":"Second opinion needed: communicating uncertainty in medical machine learning","volume":"4","author":"Kompa","year":"2021","journal-title":"npj Digit Med"},{"key":"2025101120020631300_ocae254-B25","author":"Kuhn","year":"2023"},{"key":"2025101120020631300_ocae254-B26","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1016\/j.jbi.2017.10.008","article-title":"Beyond discrimination: a comparison of calibration methods and clinical usefulness of predictive models of readmission risk","volume":"76","author":"Walsh","year":"2017","journal-title":"J Biomed Inform"},{"key":"2025101120020631300_ocae254-B27","doi-asserted-by":"crossref","first-page":"230","DOI":"10.1186\/s12916-019-1466-7","article-title":"Calibration: the Achilles heel of predictive analytics","volume":"17","author":"Van Calster","year":"2019","journal-title":"BMC Med"},{"key":"2025101120020631300_ocae254-B28","doi-asserted-by":"crossref","first-page":"1377","DOI":"10.1001\/jama.2017.12126","article-title":"Discrimination and calibration of clinical prediction models: users\u2019 guides to the medical literature","volume":"318","author":"Alba","year":"2017","journal-title":"JAMA"},{"key":"2025101120020631300_ocae254-B29","doi-asserted-by":"publisher","author":"Jin","year":"2020","DOI":"10.48550\/arXiv.2009.13081"},{"key":"2025101120020631300_ocae254-B30"},{"key":"2025101120020631300_ocae254-B31","author":"Tian","year":"2023"},{"key":"2025101120020631300_ocae254-B32","author":"Manakul","year":"2023"},{"key":"2025101120020631300_ocae254-B33"},{"key":"2025101120020631300_ocae254-B34","year":"2024"},{"key":"2025101120020631300_ocae254-B35"},{"key":"2025101120020631300_ocae254-B36","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1038\/s41746-024-01010-1","article-title":"Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine","volume":"7","author":"Savage","year":"2023","journal-title":"npj Digit Med"},{"key":"2025101120020631300_ocae254-B37"},{"key":"2025101120020631300_ocae254-B38","year":"2023"},{"key":"2025101120020631300_ocae254-B39","year":"2023"},{"key":"2025101120020631300_ocae254-B40","doi-asserted-by":"crossref","first-page":"1315","DOI":"10.1097\/JTO.0b013e3181ec173d","article-title":"Receiver operating characteristic curve in diagnostic test assessment","volume":"5","author":"Mandrekar","year":"2010","journal-title":"J Thorac Oncol"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/32\/1\/139\/61202176\/ocae254.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/32\/1\/139\/61202176\/ocae254.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T00:02:13Z","timestamp":1760227333000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/32\/1\/139\/7819854"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,12]]},"references-count":40,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,10,12]]},"published-print":{"date-parts":[[2025,1,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocae254","relation":{},"ISSN":["1067-5027","1527-974X"],"issn-type":[{"value":"1067-5027","type":"print"},{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,1]]},"published":{"date-parts":[[2024,10,12]]}}}