{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T16:12:17Z","timestamp":1775059937703,"version":"3.50.1"},"reference-count":41,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T00:00:00Z","timestamp":1736380800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Artif. Intell."],"abstract":"<jats:sec><jats:title>Background<\/jats:title><jats:p>Large language models (LLMs) have demonstrated impressive performance on medical licensing and diagnosis-related exams. However, comparative evaluations to optimize LLM performance and ability in the domain of comprehensive medication management (CMM) are lacking. The purpose of this evaluation was to test various LLMs performance optimization strategies and performance on critical care pharmacotherapy questions used in the assessment of Doctor of Pharmacy students.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods<\/jats:title><jats:p>In a comparative analysis using 219 multiple-choice pharmacotherapy questions, five LLMs (GPT-3.5, GPT-4, Claude 2, Llama2-7b and 2-13b) were evaluated. Each LLM was queried five times to evaluate the primary outcome of accuracy (i.e., correctness). Secondary outcomes included variance, the impact of prompt engineering techniques (e.g., chain-of-thought, CoT) and training of a customized GPT on performance, and comparison to third year doctor of pharmacy students on knowledge recall vs. knowledge application questions. 
Accuracy and variance under different model settings were compared using Student\u2019s t-test.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>ChatGPT-4 exhibited the highest accuracy (71.6%), while Llama2-13b had the lowest variance (0.070). All LLMs performed more accurately on knowledge recall vs. knowledge application questions (e.g., ChatGPT-4: 87% vs. 67%). When applied to ChatGPT-4, few-shot CoT across five runs improved accuracy (77.4% vs. 71.5%) with no effect on variance. Self-consistency and the custom-trained GPT demonstrated similar accuracy to ChatGPT-4 with few-shot CoT. Overall pharmacy student accuracy was 81%, compared to an optimal overall LLM accuracy of 73%. Comparing question types, six of the LLMs demonstrated equivalent or higher accuracy than pharmacy students on knowledge recall questions (e.g., self-consistency vs. students: 93% vs. 84%), but pharmacy students achieved higher accuracy than all LLMs on knowledge application questions (e.g., self-consistency vs. students: 68% vs. 80%).<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>ChatGPT-4 was the most accurate LLM on critical care pharmacy questions and few-shot CoT improved accuracy the most. Average student accuracy was similar to LLMs overall, and higher on knowledge application questions. These findings support the need for future assessment of customized training for the type of output needed. 
Reliance on LLMs is only supported with recall-based questions.<\/jats:p><\/jats:sec>","DOI":"10.3389\/frai.2024.1514896","type":"journal-article","created":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T06:14:25Z","timestamp":1736403265000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["Evaluating accuracy and reproducibility of large language model performance on critical care assessments in pharmacy education"],"prefix":"10.3389","volume":"7","author":[{"given":"Huibo","family":"Yang","sequence":"first","affiliation":[]},{"given":"Mengxuan","family":"Hu","sequence":"additional","affiliation":[]},{"given":"Amoreena","family":"Most","sequence":"additional","affiliation":[]},{"given":"W. Anthony","family":"Hawkins","sequence":"additional","affiliation":[]},{"given":"Brian","family":"Murray","sequence":"additional","affiliation":[]},{"given":"Susan E.","family":"Smith","sequence":"additional","affiliation":[]},{"given":"Sheng","family":"Li","sequence":"additional","affiliation":[]},{"given":"Andrea","family":"Sikora","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,1,9]]},"reference":[{"key":"ref1","doi-asserted-by":"publisher","first-page":"e55991","DOI":"10.7759\/cureus.55991","article-title":"Comparing the performance of popular large language models on the National Board of medical examiners sample questions","volume":"16","author":"Abbas","year":"2024","journal-title":"Cureus"},{"key":"ref2","doi-asserted-by":"publisher","first-page":"100324","DOI":"10.1016\/j.xops.2023.100324","article-title":"Evaluating the performance of ChatGPT in ophthalmology: an analysis of its successes and shortcomings","volume":"3","author":"Antaki","year":"2023","journal-title":"Ophthalmol Sci."},{"key":"ref3","doi-asserted-by":"publisher","first-page":"e2343689","DOI":"10.1001\/jamanetworkopen.2023.43689","article-title":"Leveraging large language models 
for decision support in personalized oncology","volume":"6","author":"Benary","year":"2023","journal-title":"JAMA Netw. Open"},{"key":"ref4","doi-asserted-by":"publisher","first-page":"140","DOI":"10.1111\/nyas.15007","article-title":"Holistic evaluation of language models","volume":"1525","author":"Bommasani","year":"2023","journal-title":"Ann. N. Y. Acad. Sci."},{"key":"ref5","doi-asserted-by":"publisher","first-page":"871","DOI":"10.1093\/ajhp\/zxae153","article-title":"How critical is it? Integrating critical care into the pharmacy didactic curriculum","volume":"81","author":"Branan","year":"2024","journal-title":"Am. J. Health Syst. Pharm."},{"key":"ref6","first-page":"2005.14165","volume-title":"Language models are few-shot learners","author":"Brown","year":"2020"},{"key":"ref7","doi-asserted-by":"publisher","first-page":"662","DOI":"10.1111\/bcp.15963","article-title":"Clinical decision making in benzodiazepine deprescribing by healthcare providers vs AI-assisted approach","volume":"90","author":"Bu\u017ean\u010di\u0107","year":"2024","journal-title":"Br. J. Clin. 
Pharmacol."},{"key":"ref8","first-page":"2204.02311","volume-title":"PaLM: scaling language modeling with pathways","author":"Chowdhery","year":"2022"},{"key":"ref9","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1810.04805","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin","year":"2018","journal-title":"arXiv"},{"key":"ref10","volume-title":"CohortGPT: An enhanced GPT for participant recruitment in clinical study","author":"Guan","year":"2023"},{"key":"ref19","first-page":"2312.01639","volume-title":"On the effectiveness of large language models in domain-specific code generation","author":"Gu","year":"2024"},{"key":"ref11","doi-asserted-by":"publisher","first-page":"555","DOI":"10.22454\/FamMed.2024.233738","article-title":"Performance of language models on the family medicine in-training exam","volume":"56","author":"Hanna","year":"2024","journal-title":"Fam. Med."},{"key":"ref12","doi-asserted-by":"publisher","DOI":"10.1093\/ajhp\/zxae366","article-title":"Cultivating expert thinking skills for experiential pharmacy trainees","author":"Hawkins","year":"2024","journal-title":"Am. J. Health Syst. Pharm."},{"key":"ref13","doi-asserted-by":"publisher","DOI":"10.3389\/fonc.2023.1219326","article-title":"Evaluating large language models on a highly-specialized topic, radiation oncology physics","volume":"13","author":"Holmes","year":"2023","journal-title":"Front. Oncol."},{"key":"ref14","doi-asserted-by":"publisher","first-page":"e48433","DOI":"10.2196\/48433","article-title":"Examining real-world medication consultations and drug-herb interactions: ChatGPT performance evaluation","volume":"9","author":"Hsu","year":"2023","journal-title":"JMIR Med. 
Educ."},{"key":"ref15","doi-asserted-by":"publisher","first-page":"1812","DOI":"10.1093\/jamia\/ocad259","article-title":"Improving large language models for clinical named entity recognition via prompt engineering","volume":"31","author":"Hu","year":"2024","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"ref16","doi-asserted-by":"publisher","first-page":"78","DOI":"10.1001\/jama.2023.8288","article-title":"Accuracy of a generative artificial intelligence model in a complex diagnostic challenge","volume":"330","author":"Kanjee","year":"2023","journal-title":"JAMA"},{"key":"ref17","doi-asserted-by":"publisher","DOI":"10.1056\/AIdbp2300192","article-title":"GPT versus resident physicians: a benchmark based on official board scores","volume":"1","author":"Katz","year":"2024","journal-title":"NEJM AI"},{"key":"ref18","doi-asserted-by":"publisher","first-page":"e48452","DOI":"10.2196\/48452","article-title":"The potential of GPT-4 as a support tool for pharmacists: analytical study using the Japanese National Examination for pharmacists","volume":"9","author":"Kunitsu","year":"2023","journal-title":"JMIR Med. Educ."},{"key":"ref20","doi-asserted-by":"publisher","first-page":"e60807","DOI":"10.2196\/60807","article-title":"Performance of ChatGPT across different versions in medical licensing examinations worldwide: systematic review and meta-analysis","volume":"26","author":"Liu","year":"2024","journal-title":"J. Med. 
Internet Res."},{"key":"ref21","author":"Liu","year":"2023"},{"key":"ref22","author":"Liu","year":"2021"},{"key":"ref23","first-page":"2304.08448","volume-title":"An iterative optimizing framework for radiology report summarization with ChatGPT","author":"Ma","year":"2024"},{"key":"ref24","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2307.06435","article-title":"A comprehensive overview of large language models","author":"Naveed","year":"2023","journal-title":"arXiv"},{"key":"ref25","author":"Pryzant","year":"2023"},{"key":"ref26","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2112.11446","article-title":"Scaling language models: methods, analysis & insights from training gopher","author":"Rae","year":"2022","journal-title":"arXiv"},{"key":"ref27","first-page":"1910.10683","volume-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","author":"Raffel","year":"2019"},{"key":"ref28","doi-asserted-by":"publisher","first-page":"887","DOI":"10.3390\/healthcare11060887","article-title":"ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns","volume":"11","author":"Sallam","year":"2023","journal-title":"Healthcare (Basel)"},{"key":"ref29","doi-asserted-by":"publisher","first-page":"503","DOI":"10.1016\/j.ccc.2023.01.006","article-title":"Critical care pharmacists: a focus on horizons","volume":"39","author":"Sikora","year":"2023","journal-title":"Crit. Care Clin."},{"key":"ref30","doi-asserted-by":"publisher","first-page":"876","DOI":"10.1111\/1742-6723.14280","article-title":"Will code one day run a code? Performance of language models on ACEM primary examinations and implications","volume":"35","author":"Smith","year":"2023","journal-title":"Emerg. Med. 
Australas."},{"key":"ref31","author":"Sun","year":"2023"},{"key":"ref32","first-page":"171","article-title":"Medication dispensing errors and prevention","volume-title":"StatPearls","author":"Tariq","year":"2024"},{"key":"ref33","doi-asserted-by":"publisher","first-page":"44","DOI":"10.1038\/s41591-018-0300-7","article-title":"High-performance medicine: the convergence of human and artificial intelligence","volume":"25","author":"Topol","year":"2019","journal-title":"Nat. Med."},{"key":"ref34","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1186\/s13000-024-01464-7","article-title":"Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology \u2013 a recent scoping review","volume":"19","author":"Ullah","year":"2024","journal-title":"Diagn. Pathol."},{"key":"ref35","author":"Wang","year":"2022"},{"key":"ref36","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2109.01652","article-title":"Finetuned language models are zero-shot learners","author":"Wei","year":"","journal-title":"arXiv"},{"key":"ref37","author":"Wei","year":""},{"key":"ref38","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2201.11903","article-title":"Chain-of-thought prompting elicits reasoning in large language models","author":"Wei","year":"2023","journal-title":"arXiv"},{"key":"ref39","first-page":"2304.13712","volume-title":"Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond","author":"Yang","year":"2023"},{"key":"ref40","first-page":"2307.10928","volume-title":"FLASK: Fine-grained language model evaluation based on alignment skill sets","author":"Ye","year":"2024"},{"key":"ref41","doi-asserted-by":"publisher","first-page":"719","DOI":"10.1038\/s41551-018-0305-z","article-title":"Artificial intelligence in healthcare","volume":"2","author":"Yu","year":"2018","journal-title":"Nat. Biomed. 
Eng."}],"container-title":["Frontiers in Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2024.1514896\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T06:14:30Z","timestamp":1736403270000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2024.1514896\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,9]]},"references-count":41,"alternative-id":["10.3389\/frai.2024.1514896"],"URL":"https:\/\/doi.org\/10.3389\/frai.2024.1514896","relation":{},"ISSN":["2624-8212"],"issn-type":[{"value":"2624-8212","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,9]]},"article-number":"1514896"}}