{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,29]],"date-time":"2026-05-29T17:33:23Z","timestamp":1780076003677,"version":"3.54.0"},"reference-count":86,"publisher":"Oxford University Press (OUP)","issue":"4","license":[{"start":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T00:00:00Z","timestamp":1741564800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/pages\/standard-publication-reuse-rights"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objectives<\/jats:title>\n                  <jats:p>Large language models (LLMs) are increasingly utilized in healthcare, transforming medical practice through advanced language processing capabilities. However, the evaluation of LLMs predominantly relies on human qualitative assessment, which is time-consuming, resource-intensive, and may be subject to variability and bias. There is a pressing need for quantitative metrics to enable scalable, objective, and efficient evaluation.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>We propose a unified evaluation framework that bridges qualitative and quantitative methods to assess LLM performance in healthcare settings. This framework maps evaluation aspects\u2014such as linguistic quality, efficiency, content integrity, trustworthiness, and usefulness\u2014to both qualitative assessments and quantitative metrics. We apply our approach to empirically evaluate the Epic In-Basket feature, which uses LLM to generate patient message replies.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>The empirical evaluation demonstrates that while Artificial Intelligence (AI)-generated replies exhibit high fluency, clarity, and minimal toxicity, they face challenges with coherence and completeness. Clinicians\u2019 manual decision to use AI-generated drafts correlates strongly with quantitative metrics, suggesting that quantitative metrics have the potential to reduce human effort in the evaluation process and make it more scalable.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Discussion<\/jats:title>\n                  <jats:p>Our study highlights the potential of a unified evaluation framework that integrates qualitative and quantitative methods, enabling scalable and systematic assessments of LLMs in healthcare. Automated metrics streamline evaluation and monitoring processes, but their effective use depends on alignment with human judgment, particularly for aspects requiring contextual interpretation. As LLM applications expand, refining evaluation strategies and fostering interdisciplinary collaboration will be critical to maintaining high standards of accuracy, ethics, and regulatory compliance.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Conclusion<\/jats:title>\n                  <jats:p>Our unified evaluation framework bridges the gap between qualitative human assessments and automated quantitative metrics, enhancing the reliability and scalability of LLM evaluations in healthcare. While automated quantitative evaluations are not ready to fully replace qualitative human evaluations, they can be used to enhance the process and, with relevant benchmarks derived from the unified framework proposed here, they can be applied to LLM monitoring and evaluation of updated versions of the original technology evaluated using qualitative human standards.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocaf023","type":"journal-article","created":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T15:43:24Z","timestamp":1741621404000},"page":"626-637","source":"Crossref","is-referenced-by-count":8,"title":["Application of unified health large language model evaluation framework to In-Basket message replies: bridging qualitative and quantitative assessments"],"prefix":"10.1093","volume":"32","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7056-9559","authenticated-orcid":false,"given":"Chuan","family":"Hong","sequence":"first","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5703-113X","authenticated-orcid":false,"given":"Anand","family":"Chowdhury","sequence":"additional","affiliation":[{"name":"Department of Medicine, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Anthony D","family":"Sorrentino","sequence":"additional","affiliation":[{"name":"Department of Medicine, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Haoyuan","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Monica","family":"Agrawal","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6496-7024","authenticated-orcid":false,"given":"Armando","family":"Bedoya","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]},{"name":"Department of Medicine, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Sophia","family":"Bessias","sequence":"additional","affiliation":[{"name":"Duke Clinical and Translational Science Institute, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4078-9809","authenticated-orcid":false,"given":"Nicoleta J","family":"Economou-Zavlanos","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ian","family":"Wong","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]},{"name":"Department of Medicine, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Christian","family":"Pean","sequence":"additional","affiliation":[{"name":"Department of Medicine, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fan","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]},{"name":"Department of Statistical Science, Duke University , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5559-2416","authenticated-orcid":false,"given":"Kathryn I","family":"Pollak","sequence":"additional","affiliation":[{"name":"Cancer Prevention and Control Research Program, Duke Cancer Institute , Durham, NC 27710,","place":["United States"]},{"name":"Department of Population Health Sciences, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7251-5842","authenticated-orcid":false,"given":"Eric G","family":"Poon","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]},{"name":"Department of Medicine, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5798-8855","authenticated-orcid":false,"given":"Michael J","family":"Pencina","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Bioinformatics, Duke University School of Medicine , Durham, NC 27710,","place":["United States"]}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"286","published-online":{"date-parts":[[2025,3,10]]},"reference":[{"key":"2025041716421368600_ocaf023-B1","doi-asserted-by":"publisher","DOI":"10.1016\/S2589-7500(23)00021-3","article-title":"ChatGPT: the future of discharge summaries?","volume":"5","author":"Patel","year":"2023","journal-title":"Lancet Digit Health"},{"key":"2025041716421368600_ocaf023-B2","doi-asserted-by":"crossref","first-page":"e48568","DOI":"10.2196\/48568","article-title":"Utility of ChatGPT in clinical practice","volume":"25","author":"Liu","year":"2023","journal-title":"J Med Internet Res"},{"key":"2025041716421368600_ocaf023-B3","doi-asserted-by":"crossref","first-page":"77","DOI":"10.47392\/irjash.2021.170","article-title":"A diabetic diet suggester and appointment scheduler Chatbot using artificial intelligence and cloud","volume":"3","author":"Kolanu","year":"2021","journal-title":"IRJASH"},{"key":"2025041716421368600_ocaf023-B4","first-page":"9","author":"Lyu","year":"2023"},{"key":"2025041716421368600_ocaf023-B5","first-page":"1148","article-title":"The role of ChatGPT in scientific communication: writing better scientific review articles","volume":"13","author":"Huang","year":"2023","journal-title":"Am J Cancer Res"},{"key":"2025041716421368600_ocaf023-B6"},{"key":"2025041716421368600_ocaf023-B7","author":"Chowdhery"},{"key":"2025041716421368600_ocaf023-B8","doi-asserted-by":"publisher","first-page":"2776","DOI":"10.3390\/healthcare11202776","article-title":"Leveraging generative AI and large language models: a comprehensive roadmap for healthcare integration","volume":"11","author":"Yu","year":"2023","journal-title":"Healthcare (Basel, Switzerland)"},{"key":"2025041716421368600_ocaf023-B9","doi-asserted-by":"publisher","author":"Tan","year":"2023","DOI":"10.48550\/arXiv.2303.07992"},{"key":"2025041716421368600_ocaf023-B10","doi-asserted-by":"publisher","author":"He","year":"2023","DOI":"10.48550\/arXiv.2310.05694"},{"key":"2025041716421368600_ocaf023-B11"},{"key":"2025041716421368600_ocaf023-B12","doi-asserted-by":"publisher","author":"Schmidhuber","year":"1997","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"2025041716421368600_ocaf023-B13","author":"Sai","year":"2020"},{"key":"2025041716421368600_ocaf023-B14","author":"Fabbri","year":"2020"},{"key":"2025041716421368600_ocaf023-B15","author":"Abacha","year":"2023"},{"key":"2025041716421368600_ocaf023-B16","doi-asserted-by":"crossref","first-page":"172","DOI":"10.1038\/s41586-023-06291-2","article-title":"Large language models encode clinical knowledge","volume":"620","author":"Singhal","year":"2023","journal-title":"Nature"},{"key":"2025041716421368600_ocaf023-B17","first-page":"153","author":"Dong","year":"2017"},{"key":"2025041716421368600_ocaf023-B18","author":"Lu","year":"2022"},{"key":"2025041716421368600_ocaf023-B19","author":"Moghe","year":"2022"},{"key":"2025041716421368600_ocaf023-B20","author":"Deutsch","year":"2022"},{"key":"2025041716421368600_ocaf023-B21","author":"Gehrmann","year":"2022"},{"key":"2025041716421368600_ocaf023-B22","author":"Lin","year":"2023"},{"key":"2025041716421368600_ocaf023-B23","author":"Bohnet","year":"2022"},{"key":"2025041716421368600_ocaf023-B24","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1016\/j.jbi.2013.12.004","article-title":"Boland MR, Rusanov A, So Y, et al. From expert-derived user needs to user-perceived ease of use and usefulness: a two-phase mixed-methods evaluation framework","volume":"52","year":"2014","journal-title":"J Biomed Informatics"},{"key":"2025041716421368600_ocaf023-B25","author":"Oniani","year":"2020"},{"key":"2025041716421368600_ocaf023-B26","doi-asserted-by":"publisher","DOI":"10.3115\/992133.992137"},{"key":"2025041716421368600_ocaf023-B27","author":"Kusner","year":"2017"},{"key":"2025041716421368600_ocaf023-B28"},{"key":"2025041716421368600_ocaf023-B29"},{"key":"2025041716421368600_ocaf023-B30","author":"Wang","year":"2021"},{"key":"2025041716421368600_ocaf023-B31","first-page":"1631","author":"Socher","year":"2013"},{"key":"2025041716421368600_ocaf023-B32","author":"Liang","year":"2022"},{"key":"2025041716421368600_ocaf023-B33","author":"Li","year":"2023"},{"key":"2025041716421368600_ocaf023-B34","author":"Yoon","year":"2023"},{"key":"2025041716421368600_ocaf023-B35","first-page":"505","author":"Wang","year":"2016"},{"key":"2025041716421368600_ocaf023-B36","first-page":"514","author":"Stanchev","year":"2019"},{"key":"2025041716421368600_ocaf023-B37","first-page":"392","author":"Popovi\u0107","year":"2015"},{"key":"2025041716421368600_ocaf023-B38","first-page":"74","author":"Lin","year":"2004"},{"key":"2025041716421368600_ocaf023-B39","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"2025041716421368600_ocaf023-B40","first-page":"65","author":"Banerjee","year":"2005"},{"key":"2025041716421368600_ocaf023-B41","first-page":"223","author":"Snover","year":"2006"},{"key":"2025041716421368600_ocaf023-B42","doi-asserted-by":"publisher","author":"Doddington","year":"2002","DOI":"10.3115\/1289189.1289273"},{"key":"2025041716421368600_ocaf023-B43","author":"Vedantam","year":"2014"},{"key":"2025041716421368600_ocaf023-B44","doi-asserted-by":"publisher","author":"Liu","year":"2016","DOI":"10.1109\/ICCV.2017.100"},{"key":"2025041716421368600_ocaf023-B45","author":"Mikolov","year":"2013"},{"key":"2025041716421368600_ocaf023-B46","first-page":"1532","author":"Pennington","year":"2014"},{"key":"2025041716421368600_ocaf023-B47","author":"Devlin","year":"2018"},{"key":"2025041716421368600_ocaf023-B48","author":"Zhang","year":"2019"},{"key":"2025041716421368600_ocaf023-B49","first-page":"1443","author":"Durmus","year":"2022"},{"key":"2025041716421368600_ocaf023-B50","first-page":"2430","author":"Sinha","year":"2020"},{"key":"2025041716421368600_ocaf023-B51","first-page":"4164","author":"Phy","year":"2020"},{"key":"2025041716421368600_ocaf023-B52","first-page":"386","author":"Gao","year":"2020"},{"key":"2025041716421368600_ocaf023-B53","author":"Dziri","year":"2021"},{"key":"2025041716421368600_ocaf023-B54","first-page":"5108","author":"Zhang","year":"2020"},{"key":"2025041716421368600_ocaf023-B55","author":"Kry\u015bci\u0144ski","year":"2019"},{"key":"2025041716421368600_ocaf023-B56","author":"Goyal","year":"2021"},{"key":"2025041716421368600_ocaf023-B57","author":"Wu","year":"2023"},{"key":"2025041716421368600_ocaf023-B58","author":"Liu","year":"2023"},{"key":"2025041716421368600_ocaf023-B59","author":"Zheng","year":"2023"},{"key":"2025041716421368600_ocaf023-B60","year":"2023"},{"key":"2025041716421368600_ocaf023-B61","author":"Fu","year":"2023"},{"key":"2025041716421368600_ocaf023-B62","first-page":"507","author":"Lo","year":"2019"},{"key":"2025041716421368600_ocaf023-B63","first-page":"563","author":"Zhao","year":"2019"},{"key":"2025041716421368600_ocaf023-B64","first-page":"202","author":"Stanojevi\u0107","year":"2014"},{"key":"2025041716421368600_ocaf023-B65","first-page":"598","author":"Ma","year":"2017"},{"key":"2025041716421368600_ocaf023-B66","first-page":"3950","author":"Nema","year":"2018"},{"key":"2025041716421368600_ocaf023-B67","first-page":"14"},{"key":"2025041716421368600_ocaf023-B68","first-page":"4344","author":"Wieting","year":"2019"},{"key":"2025041716421368600_ocaf023-B69","doi-asserted-by":"publisher","author":"Chen","year":"2016","DOI":"10.18653\/v1\/P17-1152"},{"key":"2025041716421368600_ocaf023-B70","first-page":"751","author":"Shimanaka","year":"2018"},{"key":"2025041716421368600_ocaf023-B71","author":"Shimanaka","year":"2019"},{"key":"2025041716421368600_ocaf023-B72","author":"Sellam","year":"2020"},{"key":"2025041716421368600_ocaf023-B73","author":"Kane","year":"2020"},{"key":"2025041716421368600_ocaf023-B74","author":"Zhao","year":"2022"},{"key":"2025041716421368600_ocaf023-B75","first-page":"7100","author":"Chaudhury","year":"2022"},{"key":"2025041716421368600_ocaf023-B76","first-page":"3356","author":"Gehman","year":"2020"},{"key":"2025041716421368600_ocaf023-B77","author":"Dhamala J, Sun T, Kumar V, et al. Bold: dataset and metrics for measuring biases in open-ended language generation. In:"},{"key":"2025041716421368600_ocaf023-B78","first-page":"4902","author":"Ribeiro","year":"2020"},{"key":"2025041716421368600_ocaf023-B79","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1038\/s41746-024-01074-z","article-title":"Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI","volume":"7","author":"Abbasian","year":"2024","journal-title":"NPJ Digit Med"},{"key":"2025041716421368600_ocaf023-B80","doi-asserted-by":"crossref","first-page":"e49240","DOI":"10.2196\/49240","article-title":"Clinical accuracy of large language models and Google search responses to postpartum depression questions: cross-sectional study","volume":"25","author":"Sezgin","year":"2023","journal-title":"J Med Internet Res"},{"key":"2025041716421368600_ocaf023-B81"},{"key":"2025041716421368600_ocaf023-B82","doi-asserted-by":"crossref","first-page":"e243201","DOI":"10.1001\/jamanetworkopen.2024.3201","article-title":"Artificial intelligence-generated draft replies to patient inbox messages","volume":"7","author":"Garcia","year":"2024","journal-title":"JAMA Netw Open"},{"key":"2025041716421368600_ocaf023-B83"},{"key":"2025041716421368600_ocaf023-B84","doi-asserted-by":"crossref","first-page":"120","DOI":"10.1038\/s41746-023-00873-0","article-title":"The imperative for regulatory oversight of large language models (or generative AI) in healthcare","volume":"6","author":"Mesk\u00f3","year":"2023","journal-title":"NPJ Digit Med"},{"key":"2025041716421368600_ocaf023-B85"},{"key":"2025041716421368600_ocaf023-B86","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1038\/s41746-023-00939-z","article-title":"Large language models propagate race-based medicine","volume":"6","author":"Omiye","year":"2023","journal-title":"NPJ Digit Med"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/32\/4\/626\/62367015\/ocaf023.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/32\/4\/626\/62367015\/ocaf023.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,17]],"date-time":"2025-04-17T20:42:25Z","timestamp":1744922545000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/32\/4\/626\/8068783"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,10]]},"references-count":86,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,3,10]]},"published-print":{"date-parts":[[2025,4,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocaf023","relation":{},"ISSN":["1067-5027","1527-974X"],"issn-type":[{"value":"1067-5027","type":"print"},{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,4]]},"published":{"date-parts":[[2025,3,10]]}}}