{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,11]],"date-time":"2026-05-11T08:08:29Z","timestamp":1778486909505,"version":"3.51.4"},"reference-count":63,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2026,5,11]],"date-time":"2026-05-11T00:00:00Z","timestamp":1778457600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,5,11]],"date-time":"2026-05-11T00:00:00Z","timestamp":1778457600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100008641","name":"Chung Shan Medical University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100008641","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Med Syst"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Large language models (LLMs) show promise in medical applications, yet their translation into clinical practice requires rigorous validation. Current robustness testing often employs adversarial approaches borrowed from AI safety, raising questions about their alignment with authentic clinical scenarios. To systematically map methodologies used for robustness testing of LLMs in medical contexts and assess their clinical plausibility. A scoping review was conducted following PRISMA-ScR guidelines, searching PubMed, Embase, Web of Science, IEEE Xplore, ACM Digital Library, arXiv, and MedRxiv from January 2023 to September 2025. Two independent physician reviewers screened 5,331 articles, extracting data on testing methodologies, medical domains, expert involvement, and clinical plausibility. Thirty-three studies met inclusion criteria, predominantly from 2025 (82%). The most common robustness testing approaches were misleading prompts (49%) and adversarial prompts (39%). Only 33% of studies designed tests clearly mimicking plausible clinical scenarios. While 58% reported expert involvement, the depth of integration varied considerably. Studies predominantly addressed mixed medical domains (73%) rather than specialized fields. The emerging literature suggests that LLM robustness testing in medicine often emphasizes technical vulnerability detection, with fewer studies examining clinically plausible scenarios of routine use. Future frameworks should complement adversarial testing with clinically grounded, longitudinal, and specialty focused evaluations to support deployment-relevant inference.<\/jats:p>","DOI":"10.1007\/s10916-026-02405-1","type":"journal-article","created":{"date-parts":[[2026,5,11]],"date-time":"2026-05-11T07:14:39Z","timestamp":1778483679000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Clinical Plausibility in Large Language Model Robustness Testing for Medicine: A Scoping Review"],"prefix":"10.1007","volume":"50","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9954-921X","authenticated-orcid":false,"given":"Yu","family":"Chang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ming-Hong","family":"Hsieh","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Po-Chung","family":"Ju","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yi-Chun","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Cheng-Chen","family":"Chang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2026,5,11]]},"reference":[{"key":"2405_CR1","doi-asserted-by":"publisher","first-page":"109713","DOI":"10.1016\/j.isci.2024.109713","volume":"27","author":"X Meng","year":"2024","unstructured":"Meng X, Yan X, Zhang K, Liu D, Cui X, Yang Y, Zhang M, Cao C, Wang J, Wang X, Gao J, Wang Y-G-S, Ji J, Qiu Z, Li M, Qian C, Guo T, Ma S, Wang Z, Guo Z, Lei Y, Shao C, Wang W, Fan H, Tang Y-D (2024) The application of large language models in medicine: A scoping review. iScience 27:109713. https:\/\/doi.org\/10.1016\/j.isci.2024.109713","journal-title":"iScience"},{"key":"2405_CR2","doi-asserted-by":"publisher","first-page":"924","DOI":"10.1038\/s41591-022-01772-9","volume":"28","author":"B Vasey","year":"2022","unstructured":"Vasey B, Nagendran M, Campbell B, Clifton DA, Collins GS, Denaxas S, Denniston AK, Faes L, Geerts B, Ibrahim M, Liu X, Mateen BA, Mathur P, McCradden MD, Morgan L, Ordish J, Rogers C, Saria S, Ting DSW, Watkinson P, Weber W, Wheatstone P, McCulloch P (2022) Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat Med 28:924\u2013933. https:\/\/doi.org\/10.1038\/s41591-022-01772-9","journal-title":"Nat Med"},{"key":"2405_CR3","doi-asserted-by":"publisher","first-page":"1351","DOI":"10.1038\/s41591-020-1037-7","volume":"26","author":"S Cruz Rivera","year":"2020","unstructured":"Cruz Rivera S, Liu X, Chan A-W, Denniston AK, Calvert MJ (2020) Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med 26:1351\u20131363. https:\/\/doi.org\/10.1038\/s41591-020-1037-7","journal-title":"Nat Med"},{"key":"2405_CR4","doi-asserted-by":"publisher","first-page":"1364","DOI":"10.1038\/s41591-020-1034-x","volume":"26","author":"X Liu","year":"2020","unstructured":"Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK (2020) Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 26:1364\u20131374. https:\/\/doi.org\/10.1038\/s41591-020-1034-x","journal-title":"Nat Med"},{"key":"2405_CR5","doi-asserted-by":"publisher","first-page":"1211150","DOI":"10.3389\/frhs.2023.1211150","volume":"3","author":"E Steerling","year":"2023","unstructured":"Steerling E, Siira E, Nilsen P, Svedberg P, Nygren J (2023) Implementing AI in healthcare\u2014the relevance of trust: a scoping review. Front Health Serv 3:1211150. https:\/\/doi.org\/10.3389\/frhs.2023.1211150","journal-title":"Front Health Serv"},{"key":"2405_CR6","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1007\/s11606-017-4164-1","volume":"33","author":"V Bhise","year":"2018","unstructured":"Bhise V, Rajan SS, Sittig DF, Morgan RO, Chaudhary P, Singh H (2018) Defining and Measuring Diagnostic Uncertainty in Medicine: A Systematic Review. J GEN INTERN MED 33:103\u2013115. https:\/\/doi.org\/10.1007\/s11606-017-4164-1","journal-title":"J GEN INTERN MED"},{"key":"2405_CR7","doi-asserted-by":"publisher","first-page":"2305","DOI":"10.3390\/healthcare12222305","volume":"12","author":"Y Chang","year":"2024","unstructured":"Chang Y, Su C-Y, Liu Y-C (2024) Assessing the Performance of Chatbots on the Taiwan Psychiatry Licensing Examination Using the Rasch Model. Healthcare 12:2305. https:\/\/doi.org\/10.3390\/healthcare12222305","journal-title":"Healthcare"},{"key":"2405_CR8","doi-asserted-by":"publisher","first-page":"e0330303","DOI":"10.1371\/journal.pone.0330303","volume":"20","author":"Y Chang","year":"2025","unstructured":"Chang Y, Huang S-S, Hsu W-Y, Liu Y-C (2025) Evaluating chatbots in psychiatry: Rasch-based insights into clinical knowledge and reasoning. PLOS ONE 20:e0330303. https:\/\/doi.org\/10.1371\/journal.pone.0330303","journal-title":"PLOS ONE"},{"key":"2405_CR9","doi-asserted-by":"publisher","first-page":"633","DOI":"10.1038\/s41586-025-09422-z","volume":"645","author":"D Guo","year":"2025","unstructured":"Guo D, Yang D, Zhang H, Song J, Wang P, Zhu Q, Xu R, Zhang R, Ma S, Bi X, Zhang X, Yu X, Wu Y, Wu ZF, Gou Z, Shao Z, Li Z, Gao Z, Liu A, Xue B, Wang B, Wu B, Feng B, Lu C, Zhao C, Deng C, Ruan C, Dai D, Chen D, Ji D, Li E, Lin F, Dai F, Luo F, Hao G, Chen G, Li G, Zhang H, Xu H, Ding H, Gao H, Qu H, Li H, Guo J, Li J, Chen J, Yuan J, Tu J, Qiu J, Li J, Cai JL, Ni J, Liang J, Chen J, Dong K, Hu K, You K, Gao K, Guan K, Huang K, Yu K, Wang L, Zhang L, Zhao L, Wang L, Zhang L, Xu L, Xia L, Zhang M, Zhang M, Tang M, Zhou M, Li M, Wang M, Li M, Tian N, Huang P, Zhang P, Wang Q, Chen Q, Du Q, Ge R, Zhang R, Pan R, Wang R, Chen RJ, Jin RL, Chen R, Lu S, Zhou S, Chen S, Ye S, Wang S, Yu S, Zhou S, Pan S, Li SS, Zhou S, Wu S, Yun T, Pei T, Sun T, Wang T, Zeng W, Liu W, Liang W, Gao W, Yu W, Zhang W, Xiao WL, An W, Liu X, Wang X, Chen X, Nie X, Cheng X, Liu X, Xie X, Liu X, Yang X, Li X, Su X, Lin X, Li XQ, Jin X, Shen X, Chen X, Sun X, Wang X, Song X, Zhou X, Wang X, Shan X, Li YK, Wang YQ, Wei YX, Zhang Y, Xu Y, Li Y, Zhao Y, Sun Y, Wang Y, Yu Y, Zhang Y, Shi Y, Xiong Y, He Y, Piao Y, Wang Y, Tan Y, Ma Y, Liu Y, Guo Y, Ou Y, Wang Y, Gong Y, Zou Y, He Y, Xiong Y, Luo Y, You Y, Liu Y, Zhou Y, Zhu YX, Huang Y, Li Y, Zheng Y, Zhu Y, Ma Y, Tang Y, Zha Y, Yan Y, Ren ZZ, Ren Z, Sha Z, Fu Z, Xu Z, Xie Z, Zhang Z, Hao Z, Ma Z, Yan Z, Wu Z, Gu Z, Zhu Z, Liu Z, Li Z, Xie Z, Song Z, Pan Z, Huang Z, Xu Z, Zhang Z, Zhang Z (2025) DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645:633\u2013638. https:\/\/doi.org\/10.1038\/s41586-025-09422-z","journal-title":"Nature"},{"key":"2405_CR10","doi-asserted-by":"publisher","unstructured":"Braiek HB, Khomh F (2025) Chap. 3 - Machine learning robustness: a primer. In: Lorenzi M, Zuluaga MA (eds) Trustworthy AI in Medical Imaging. Academic Press, pp 37\u201371. https:\/\/doi.org\/10.1016\/B978-0-44-323761-4.00012-2","DOI":"10.1016\/B978-0-44-323761-4.00012-2"},{"key":"2405_CR11","doi-asserted-by":"publisher","first-page":"1808","DOI":"10.1038\/s41467-024-46000-9","volume":"15","author":"D Kiyasseh","year":"2024","unstructured":"Kiyasseh D, Cohen A, Jiang C, Altieri N (2024) A framework for evaluating clinical artificial intelligence systems without ground-truth annotations. Nat Commun 15:1808. https:\/\/doi.org\/10.1038\/s41467-024-46000-9","journal-title":"Nat Commun"},{"key":"2405_CR12","doi-asserted-by":"publisher","first-page":"8236","DOI":"10.1038\/s41467-024-52415-1","volume":"15","author":"CYK Williams","year":"2024","unstructured":"Williams CYK, Miao BY, Kornblith AE, Butte AJ (2024) Evaluating the use of large language models to provide clinical recommendations in the Emergency Department. Nat Commun 15:8236. https:\/\/doi.org\/10.1038\/s41467-024-52415-1","journal-title":"Nat Commun"},{"key":"2405_CR13","doi-asserted-by":"publisher","first-page":"e84120","DOI":"10.2196\/84120","volume":"27","author":"EJ Gong","year":"2025","unstructured":"Gong EJ, Bang CS, Lee JJ, Baik GH (2025) Knowledge-Practice Performance Gap in Clinical Large Language Models: Systematic Review of 39 Benchmarks. Journal of Medical Internet Research 27:e84120. https:\/\/doi.org\/10.2196\/84120","journal-title":"Journal of Medical Internet Research"},{"key":"2405_CR14","doi-asserted-by":"publisher","first-page":"345","DOI":"10.1038\/s41746-025-01725-9","volume":"8","author":"K Sokol","year":"2025","unstructured":"Sokol K, Fackler J, Vogt JE (2025) Artificial intelligence should genuinely support clinical reasoning and decision making to bridge the translational gap. npj Digit Med 8:345. https:\/\/doi.org\/10.1038\/s41746-025-01725-9","journal-title":"npj Digit Med"},{"key":"2405_CR15","doi-asserted-by":"publisher","first-page":"149","DOI":"10.1038\/s41746-025-01542-0","volume":"8","author":"CT Chang","year":"2025","unstructured":"Chang CT, Farah H, Gui H, Rezaei SJ, Bou-Khalil C, Park Y-J, Swaminathan A, Omiye JA, Kolluri A, Chaurasia A, Lozano A, Heiman A, Jia AS, Kaushal A, Jia A, Iacovelli A, Yang A, Salles A, Singhal A, Narasimhan B, Belai B, Jacobson BH, Li B, Poe CH, Sanghera C, Zheng C, Messer C, Kettud DV, Pandya D, Kaur D, Hla D, Dindoust D, Moehrle D, Ross D, Chou E, Lin E, Haredasht FN, Cheng G, Gao I, Chang J, Silberg J, Fries JA, Xu J, Jamison J, Tamaresis JS, Chen JH, Lazaro J, Banda JM, Lee JJ, Matthys KE, Steffner KR, Tian L, Pegolotti L, Srinivasan M, Manimaran M, Schwede M, Zhang M, Nguyen M, Fathzadeh M, Zhao Q, Bajra R, Khurana R, Azam R, Bartlett R, Truong ST, Fleming SL, Raj S, Behr S, Onyeka S, Muppidi S, Bandali T, Eulalio TY, Chen W, Zhou X, Ding Y, Cui Y, Tan Y, Liu Y, Shah N, Daneshjou R (2025) Red teaming ChatGPT in medicine to yield real-world insights on model behavior. NPJ Digit Med 8:149. https:\/\/doi.org\/10.1038\/s41746-025-01542-0","journal-title":"NPJ Digit Med"},{"key":"2405_CR16","doi-asserted-by":"publisher","first-page":"i68","DOI":"10.1136\/qshc.2010.042085","volume":"19","author":"DF Sittig","year":"2010","unstructured":"Sittig DF, Singh H (2010) A New Socio-technical Model for Studying Health Information Technology in Complex Adaptive Healthcare Systems. Qual Saf Health Care 19:i68\u2013i74. https:\/\/doi.org\/10.1136\/qshc.2010.042085","journal-title":"Qual Saf Health Care"},{"key":"2405_CR17","doi-asserted-by":"publisher","first-page":"e251692","DOI":"10.1001\/jamahealthforum.2025.1692","volume":"6","author":"MJ Pencina","year":"2025","unstructured":"Pencina MJ, Silcox C, Economou-Zavlanos N, McClellan M (2025) Bridging the Gap Between Developers and Implementers in Health AI. JAMA Health Forum 6:e251692. https:\/\/doi.org\/10.1001\/jamahealthforum.2025.1692","journal-title":"JAMA Health Forum"},{"key":"2405_CR18","doi-asserted-by":"publisher","DOI":"10.6028\/NIST.AI.600-1","volume-title":"Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile","author":"C Autio","year":"2024","unstructured":"Autio C, Schwartz R, Dunietz J, Jain S, Stanley M, Tabassi E, Hall P, Roberts K (2024) Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. National Institute of Standards and Technology, Gaithersburg, MD. https:\/\/doi.org\/10.6028\/NIST.AI.600-1"},{"key":"2405_CR19","doi-asserted-by":"publisher","first-page":"706","DOI":"10.3390\/bioengineering12070706","volume":"12","author":"M Trabilsy","year":"2025","unstructured":"Trabilsy M, Prabha S, Gomez-Cabello CA, Haider SA, Genovese A, Borna S, Wood N, Gopala N, Tao C, Forte AJ (2025) The PIEE Cycle: A Structured Framework for Red Teaming Large Language Models in Clinical Decision-Making. Bioengineering (Basel) 12:706. https:\/\/doi.org\/10.3390\/bioengineering12070706","journal-title":"Bioengineering (Basel)"},{"key":"2405_CR20","doi-asserted-by":"publisher","unstructured":"Feffer M, Sinha A, Deng WH, Lipton ZC, Heidari H (2024) Red-Teaming for Generative AI: Silver Bullet or Security Theater? In: Proceedings of the AAAI\/ACM Conference on AI, Ethics, and Society. AAAI Press, pp 421\u2013437. https:\/\/doi.org\/10.1609\/aies.v7i1.31647","DOI":"10.1609\/aies.v7i1.31647"},{"key":"2405_CR21","doi-asserted-by":"publisher","DOI":"10.1007\/s41666-025-00218-4","author":"KH Lim","year":"2025","unstructured":"Lim KH, Kang U, Li X, Kim JS, Jung Y-C, Park S, Kim B-H (2025) Susceptibility of Large Language Models to User-Driven Factors in Medical Queries. J Healthc Inform Res. https:\/\/doi.org\/10.1007\/s41666-025-00218-4","journal-title":"J Healthc Inform Res"},{"key":"2405_CR22","doi-asserted-by":"publisher","first-page":"6750","DOI":"10.1038\/s41598-026-38019-3","volume":"16","author":"Y Chang","year":"2026","unstructured":"Chang Y, Ju P-C, Hsieh M-H, Chang C-C (2026) Impact of authoritative and subjective cues on large language model reliability for clinical inquiries: an experimental study. Sci Rep 16:6750. https:\/\/doi.org\/10.1038\/s41598-026-38019-3","journal-title":"Sci Rep"},{"key":"2405_CR23","doi-asserted-by":"publisher","first-page":"478","DOI":"10.1186\/s12888-025-06912-2","volume":"25","author":"DH Shoval","year":"2025","unstructured":"Shoval DH, Gigi K, Haber Y, Itzhaki A, Asraf K, Piterman D, Elyoseph Z (2025) A controlled trial examining large Language model conformity in psychiatric assessment using the Asch paradigm. BMC Psychiatry 25:478. https:\/\/doi.org\/10.1186\/s12888-025-06912-2","journal-title":"BMC Psychiatry"},{"key":"2405_CR24","doi-asserted-by":"publisher","first-page":"1277756","DOI":"10.3389\/fpsyt.2023.1277756","volume":"14","author":"I Dergaa","year":"2024","unstructured":"Dergaa I, Fekih-Romdhane F, Hallit S, Loch AA, Glenn JM, Fessi MS, Ben Aissa M, Souissi N, Guelmami N, Swed S, El Omri A, Bragazzi NL, Ben Saad H (2024) ChatGPT is not ready yet for use in providing mental health assessment and interventions. Front Psychiatry 14:1277756. https:\/\/doi.org\/10.3389\/fpsyt.2023.1277756","journal-title":"Front Psychiatry"},{"key":"2405_CR25","doi-asserted-by":"publisher","first-page":"600","DOI":"10.1038\/s41746-025-01963-x","volume":"8","author":"M Agrawal","year":"2025","unstructured":"Agrawal M, Chen IY, Gulamali F, Joshi S (2025) The evaluation illusion of large language models in medicine. NPJ Digit Med 8:600. https:\/\/doi.org\/10.1038\/s41746-025-01963-x","journal-title":"NPJ Digit Med"},{"key":"2405_CR26","doi-asserted-by":"crossref","unstructured":"Lyu C, Du Z, Xu J, Duan Y, Wu M, Lynn T, Aji AF, Wong DF, Wang L (2024) A Paradigm Shift: The Future of Machine Translation Lies with Large Language Models. In: Calzolari N, Kan M-Y, Hoste V, Lenci A, Sakti S, Xue N (eds) Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, Torino, Italia, pp 1339\u20131352","DOI":"10.63317\/3dqcen72ckgz"},{"key":"2405_CR27","doi-asserted-by":"publisher","first-page":"e69485","DOI":"10.2196\/69485","volume":"13","author":"Z Yao","year":"2025","unstructured":"Yao Z, Duan L, Xu S, Chi L, Sheng D (2025) Performance of Large Language Models in the Non-English Context: Qualitative Study of Models Trained on Different Languages in Chinese Medical Examinations. JMIR Medical Informatics 13:e69485. https:\/\/doi.org\/10.2196\/69485","journal-title":"JMIR Medical Informatics"},{"key":"2405_CR28","doi-asserted-by":"publisher","first-page":"116501","DOI":"10.1016\/j.psychres.2025.116501","volume":"348","author":"Y Chang","year":"2025","unstructured":"Chang Y, Liu Y-C, Huang S-S, Hsu W-Y (2025) Assessing bias in AI-driven psychiatric recommendations: A comparative cross-sectional study of chatbot-classified and CANMAT 2023 guideline for adjunctive therapy in difficult-to-treat depression. Psychiatry Research 348:116501. https:\/\/doi.org\/10.1016\/j.psychres.2025.116501","journal-title":"Psychiatry Research"},{"key":"2405_CR29","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.ijmedinf.2017.06.003","volume":"106","author":"S Richardson","year":"2017","unstructured":"Richardson S, Mishuris R, O\u2019Connell A, Feldstein D, Hess R, Smith P, McCullagh L, McGinn T, Mann D (2017) \u201cThink Aloud\u201d and \u201cNear Live\u201d Usability Testing of Two Complex Clinical Decision Support Tools. Int J Med Inform 106:1\u20138. https:\/\/doi.org\/10.1016\/j.ijmedinf.2017.06.003","journal-title":"Int J Med Inform"},{"key":"2405_CR30","doi-asserted-by":"publisher","first-page":"e100444","DOI":"10.1136\/bmjhci-2021-100444","volume":"28","author":"S Reddy","year":"2021","unstructured":"Reddy S, Rogers W, Makinen V-P, Coiera E, Brown P, Wenzel M, Weicken E, Ansari S, Mathur P, Casey A, Kelly B (2021) Evaluation framework to guide implementation of AI systems into healthcare settings. BMJ Health Care Inform 28:e100444. https:\/\/doi.org\/10.1136\/bmjhci-2021-100444","journal-title":"BMJ Health Care Inform"},{"key":"2405_CR31","doi-asserted-by":"publisher","first-page":"263","DOI":"10.1038\/s41746-025-01684-1","volume":"8","author":"F Gaber","year":"2025","unstructured":"Gaber F, Shaik M, Allega F, Bilecz AJ, Busch F, Goon K, Franke V, Akalin A (2025) Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digit Med 8:263. https:\/\/doi.org\/10.1038\/s41746-025-01684-1","journal-title":"npj Digit Med"},{"key":"2405_CR32","doi-asserted-by":"publisher","first-page":"386","DOI":"10.1093\/jamia\/ocae294","volume":"32","author":"RJ Gallo","year":"2024","unstructured":"Gallo RJ, Baiocchi M, Savage TR, Chen JH (2024) Establishing best practices in large language model research: an application to repeat prompting. J Am Med Inform Assoc 32:386\u2013390. https:\/\/doi.org\/10.1093\/jamia\/ocae294","journal-title":"J Am Med Inform Assoc"},{"key":"2405_CR33","doi-asserted-by":"publisher","first-page":"8384","DOI":"10.1038\/s41467-024-52417-z","volume":"15","author":"P Qiu","year":"2024","unstructured":"Qiu P, Wu C, Zhang X, Lin W, Wang H, Zhang Y, Wang Y, Xie W (2024) Towards building multilingual language model for medicine. Nat Commun 15:8384. https:\/\/doi.org\/10.1038\/s41467-024-52417-z","journal-title":"Nat Commun"},{"key":"2405_CR34","doi-asserted-by":"publisher","first-page":"213","DOI":"10.3233\/SHTI250304","volume":"327","author":"J-B Lamy","year":"2025","unstructured":"Lamy J-B, Falcoff H, Dubois S, Meneton P, Tsopra R, Saab A (2025) Simulation Trials for Evaluating Clinical Decision Support Systems. Stud Health Technol Inform 327:213\u2013214. https:\/\/doi.org\/10.3233\/SHTI250304","journal-title":"Stud Health Technol Inform"},{"key":"2405_CR35","doi-asserted-by":"publisher","first-page":"560","DOI":"10.3171\/2024.12.JNS241607","volume":"143","author":"R Ali","year":"2025","unstructured":"Ali R, Abdulrazeq HF, Patil A, Cheatham M, Connolly ID, Tang OY, Doberstein CA, Riccelli T, Huang KT, Shankar GM, Williamson T, Shin JH, Carter B, Torabi R, Lee CK, Cielo D, Telfeian AE, Gokaslan ZL, Cohen-Gadol AA, Zou J, Asaad WF (2025) AtlasGPT: a language model grounded in neurosurgery with domain-specific data and document retrieval. J Neurosurg 143:560\u2013567. https:\/\/doi.org\/10.3171\/2024.12.JNS241607","journal-title":"J Neurosurg"},{"key":"2405_CR36","doi-asserted-by":"publisher","first-page":"e66207","DOI":"10.2196\/66207","volume":"9","author":"T Zada","year":"2025","unstructured":"Zada T, Tam N, Barnard F, Sittert MV, Bhat V, Rambhatla S (2025) Medical Misinformation in AI-Assisted Self-Diagnosis: Development of a Method (EvalPrompt) for Analyzing Large Language Models. JMIR Formative Research 9:e66207. https:\/\/doi.org\/10.2196\/66207","journal-title":"JMIR Formative Research"},{"key":"2405_CR37","doi-asserted-by":"publisher","first-page":"1239","DOI":"10.1038\/s41467-024-55631-x","volume":"16","author":"J Clusmann","year":"2025","unstructured":"Clusmann J, Ferber D, Wiest IC, Schneider CV, Brinker TJ, Foersch S, Truhn D, Kather JN (2025) Prompt injection attacks on vision language models in oncology. Nat Commun 16:1239. https:\/\/doi.org\/10.1038\/s41467-024-55631-x","journal-title":"Nat Commun"},{"key":"2405_CR38","doi-asserted-by":"publisher","first-page":"330","DOI":"10.1038\/s43856-025-01021-3","volume":"5","author":"M Omar","year":"2025","unstructured":"Omar M, Sorin V, Collins JD, Reich D, Freeman R, Gavin N, Charney A, Stump L, Bragazzi NL, Nadkarni GN, Klang E (2025) Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support. Commun Med 5:330. https:\/\/doi.org\/10.1038\/s43856-025-01021-3","journal-title":"Commun Med"},{"key":"2405_CR39","doi-asserted-by":"publisher","first-page":"332","DOI":"10.1111\/eje.13073","volume":"29","author":"Y-T Xiong","year":"2025","unstructured":"Xiong Y-T, Zhan Z-Z, Zhong C-L, Zeng W, Guo J-X, Tang W, Liu C (2025) Evaluating the Performance of Large Language Models (LLMs) in Answering and Analysing the Chinese Dental Licensing Examination. Eur J Dent Educ 29:332\u2013340. https:\/\/doi.org\/10.1111\/eje.13073","journal-title":"Eur J Dent Educ"},{"key":"2405_CR40","doi-asserted-by":"publisher","first-page":"1255","DOI":"10.1016\/j.cgh.2024.10.033","volume":"23","author":"BD Liu","year":"2025","unstructured":"Liu BD, D\u2019Souza S, Roy M, Dietitians M, Saleh S, Fass R, Song G (2025) Assessing the Quality of Artificial Intelligence Responses and Resistance to Sycophancy in Providing Patient-centered Medical Advice on Gastroesophageal Reflux Disease. Clin Gastroenterol Hepatol 23:1255\u20131257.e4. https:\/\/doi.org\/10.1016\/j.cgh.2024.10.033","journal-title":"Clin Gastroenterol Hepatol"},{"key":"2405_CR41","doi-asserted-by":"publisher","first-page":"e0302217","DOI":"10.1371\/journal.pone.0302217","volume":"19","author":"M Safrai","year":"2024","unstructured":"Safrai M, Azaria A (2024) Does small talk with a medical provider affect ChatGPT\u2019s medical counsel? Performance of ChatGPT on USMLE with and without distractions. PLoS One 19:e0302217. https:\/\/doi.org\/10.1371\/journal.pone.0302217","journal-title":"PLoS One"},{"key":"2405_CR42","doi-asserted-by":"publisher","first-page":"52","DOI":"10.1007\/s13755-025-00368-0","volume":"13","author":"VMS Campos","year":"2025","unstructured":"Campos VMS, Prudente TP, Le\u00e3o LL, da Costa MS, Oliva HNP, Monteiro-Junior RS (2025) Analyses of different prescriptions for health using artificial intelligence: a critical approach based on the international guidelines of health institutions. Health Inf Sci Syst 13:52. https:\/\/doi.org\/10.1007\/s13755-025-00368-0","journal-title":"Health Inf Sci Syst"},{"key":"2405_CR43","doi-asserted-by":"publisher","first-page":"295","DOI":"10.1038\/s41746-024-01283-6","volume":"7","author":"S Schmidgall","year":"2024","unstructured":"Schmidgall S, Harris C, Essien I, Olshvang D, Rahman T, Kim JW, Ziaei R, Eshraghian J, Abadir P, Chellappa R (2024) Evaluation and mitigation of cognitive biases in medical language models. npj Digit Med 7:295. https:\/\/doi.org\/10.1038\/s41746-024-01283-6","journal-title":"npj Digit Med"},{"key":"2405_CR44","doi-asserted-by":"publisher","unstructured":"Lee RW, Jun TJ, Lee J-M, Cho SI, Park HJ, Suh J (2025) Manipulating Medical Advice Through Stealth Prompt Injection in Large Language Models: An Experimental Study on Vulnerabilities and Patient Safety Risks. SSRN:5284059 [Preprint]. https:\/\/doi.org\/10.2139\/ssrn.5284059","DOI":"10.2139\/ssrn.5284059"},{"key":"2405_CR45","doi-asserted-by":"publisher","first-page":"605","DOI":"10.1038\/s41746-025-02008-z","volume":"8","author":"S Chen","year":"2025","unstructured":"Chen S, Gao M, Sasse K, Hartvigsen T, Anthony B, Fan L, Aerts H, Gallifant J, Bitterman DS (2025) When helpfulness backfires: LLMs and the risk of false medical information due to sycophantic behavior. npj Digit Med 8:605. https:\/\/doi.org\/10.1038\/s41746-025-02008-z","journal-title":"npj Digit Med"},{"key":"2405_CR46","doi-asserted-by":"publisher","unstructured":"Yang Y, Jin Q, Huang F, Lu Z (2024) Adversarial Attacks on Large Language Models in Medicine. arXiv:2406.12259 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2406.12259","DOI":"10.48550\/arXiv.2406.12259"},{"key":"2405_CR47","doi-asserted-by":"publisher","unstructured":"Huang X, Wang X, Zhang H, Zhu Y, Xi J, An J, Wang H, Liang H, Pan C (2025) Medical MLLM Is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence 39:3797\u20133805. https:\/\/doi.org\/10.1609\/aaai.v39i4.32396","DOI":"10.1609\/aaai.v39i4.32396"},{"key":"2405_CR48","doi-asserted-by":"publisher","unstructured":"Ness RO, Matton K, Helm H, Zhang S, Bajwa J, Priebe CE, Horvitz E (2024) MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering. arXiv:2406.06573 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2406.06573","DOI":"10.48550\/arXiv.2406.06573"},{"key":"2405_CR49","doi-asserted-by":"publisher","unstructured":"Yang Y, Jin Q, Leaman R, Liu X, Xiong G, Sarfo-Gyamfi M, Gong C, Ferri\u00e8re-Steinert S, Wilbur WJ, Li X, Yuan J, An B, Castro KS, \u00c1lvarez FE, Stockle M, Zhang A, Huang F, Lu Z (2024) Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine. arXiv:2411.14487 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2411.14487","DOI":"10.48550\/arXiv.2411.14487"},{"key":"2405_CR50","doi-asserted-by":"publisher","first-page":"39426","DOI":"10.1038\/s41598-025-22940-0","volume":"15","author":"J Kim","year":"2025","unstructured":"Kim J, Podlasek A, Shidara K, Liu F, Alaa A, Bernardo D (2025) Limitations of large language models in clinical problem-solving arising from inflexible reasoning. Sci Rep 15:39426. https:\/\/doi.org\/10.1038\/s41598-025-22940-0","journal-title":"Sci Rep"},{"key":"2405_CR51","doi-asserted-by":"publisher","DOI":"10.56147\/aaiet.1.1.3","author":"K Subedi","year":"2025","unstructured":"Subedi K (2025) The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation and Contextual Awareness. Journal of Advanced Artificial Intelligence, Engineering and Technology. https:\/\/doi.org\/10.56147\/aaiet.1.1.3","journal-title":"Journal of Advanced Artificial Intelligence, Engineering and Technology"},{"key":"2405_CR52","doi-asserted-by":"publisher","unstructured":"Zhu WB, Chen T, Lin CY, Law J, Jizzini M, Nieva JJ, Liu R, Jia R (2025) Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions. arXiv:2504.11373 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2504.11373","DOI":"10.48550\/arXiv.2504.11373"},{"key":"2405_CR53","doi-asserted-by":"publisher","unstructured":"Vishwanath K, Alyakin A, Alber DA, Lee JV, Kondziolka D, Oermann EK (2025) Medical large language models are easily distracted. arXiv:2504.01201 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2504.01201","DOI":"10.48550\/arXiv.2504.01201"},{"key":"2405_CR54","doi-asserted-by":"publisher","unstructured":"Chen K, Zhen T, Wang H, Liu K, Li X, Huo J, Yang T, Xu J, Dong W, Gao Y (2025) MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems. arXiv:2505.20824 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2505.20824","DOI":"10.48550\/arXiv.2505.20824"},{"key":"2405_CR55","doi-asserted-by":"publisher","unstructured":"Chen S, Li X, Zhang M, Jiang EH, Zeng Q, Yu C-H (2025) CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs. arXiv:2505.11413 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2505.11413","DOI":"10.48550\/arXiv.2505.11413"},{"key":"2405_CR56","doi-asserted-by":"publisher","unstructured":"Balazadeh V, Cooper M, Pellow D, Assadi A, Bell J, Coatsworth M, Deshpande K, Fackler J, Funingana G, Gable-Cook S, Gangadhar A, Jaiswal A, Kaja S, Khoury C, Krishnan A, Lin R, McKeen K, Naimimohasses S, Namdar K, Newatia A, Pang A, Pattoo A, Peesapati S, Prepelita D, Rakova B, Sadatamin S, Schulman R, Shah A, Shah SA, Shah SA, Taati B, Unnikrishnan B, Urteaga I, Williams S, Krishnan RG (2025) Red Teaming Large Language Models for Healthcare. arXiv:2505.00467 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2505.00467","DOI":"10.48550\/arXiv.2505.00467"},{"key":"2405_CR57","doi-asserted-by":"publisher","unstructured":"Sadanandan B, Behzadan V (2025) VSF-Med: A Vulnerability Scoring Framework for Medical Vision-Language Models. arXiv:2507.00052 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2507.00052","DOI":"10.48550\/arXiv.2507.00052"},{"key":"2405_CR58","doi-asserted-by":"publisher","unstructured":"Gourabathina A, Hao Y, Gerych W, Ghassemi M (2025) The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making. arXiv:2506.17163 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2506.17163","DOI":"10.48550\/arXiv.2506.17163"},{"key":"2405_CR59","doi-asserted-by":"publisher","unstructured":"Li Y, Yao J, Bunyi JBS, Frank AC, Hwang A, Liu R (2025) CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering. arXiv:2506.08584 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2506.08584","DOI":"10.48550\/arXiv.2506.08584"},{"key":"2405_CR60","doi-asserted-by":"publisher","unstructured":"Pan J, Jian B, Hager P, Zhang Y, Liu C, Jungmann F, Li HB, You C, Wu J, Zhu J, Liu F, Liu Y, Bubeck N, Wachinger C, Chen C, Gong Z, Ouyang C, Kaissis G, Wiestler B, Rueckert D (2025) Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models. arXiv:2508.00923 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2508.00923","DOI":"10.48550\/arXiv.2508.00923"},{"key":"2405_CR61","doi-asserted-by":"publisher","unstructured":"Vijayaraj RK (2025) Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs. arXiv:2507.21188 [Preprint]. https:\/\/doi.org\/10.48550\/arXiv.2507.21188","DOI":"10.48550\/arXiv.2507.21188"},{"key":"2405_CR62","doi-asserted-by":"publisher","unstructured":"Zhao S, Zhang Y, Xiao L, Wu X, Jia Y, Guo Z, Wu X, Nguyen CD, Zhang G, Luu AT (2025) Affective-ROPTester: capability and bias analysis of LLMs in predicting retinopathy of prematurity. IEEE Trans Affect Comput Early Access:1\u201314, Article 11240127. https:\/\/doi.org\/10.1109\/TAFFC.2025.3631581","DOI":"10.1109\/TAFFC.2025.3631581"},{"key":"2405_CR63","doi-asserted-by":"publisher","unstructured":"Ji K, Guo Y, Zhang Z, Zhu X, Tian Y, Liu N, Zhai G (2026) MedOmni-45\u00b0: A Safety\u2013Performance Benchmark for Reasoning-Oriented LLMs in Medicine. In: Proceedings of the AAAI Conference on Artificial Intelligence. AAAI Press, pp 35536\u201335544. https:\/\/doi.org\/10.1609\/aaai.v40i42.40864","DOI":"10.1609\/aaai.v40i42.40864"}],"container-title":["Journal of Medical Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10916-026-02405-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10916-026-02405-1","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10916-026-02405-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,11]],"date-time":"2026-05-11T07:14:46Z","timestamp":1778483686000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10916-026-02405-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,5,11]]},"references-count":63,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,12]]}},"alternative-id":["2405"],"URL":"https:\/\/doi.org\/10.1007\/s10916-026-02405-1","relation":{},"ISSN":["1573-689X"],"issn-type":[{"value":"1573-689X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,5,11]]},"assertion":[{"value":"18 October 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"1 May 2026","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 May 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics Approval and Consent to Participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for Publication"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Clinical Trial Number"}},{"value":"The authors declare no competing interests.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"77"}}