{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,27]],"date-time":"2026-05-27T12:13:06Z","timestamp":1779883986634,"version":"3.53.1"},"reference-count":32,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T00:00:00Z","timestamp":1765238400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Artif. Intell."],"abstract":"<jats:sec>\n                    <jats:title>Background<\/jats:title>\n                    <jats:p>Evidence-based medicine is crucial for clinical decision-making, yet studies suggest that a significant proportion of treatment decisions do not fully incorporate the latest evidence. Large Language Models (LLMs) show promise in bridging this gap, but their reliability for medical recommendations remains uncertain.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Methods<\/jats:title>\n                    <jats:p>We conducted an evaluation study comparing five LLMs\u2019 recommendations across 50 clinical scenarios related to multiple myeloma diagnosis, staging, treatment, and management, using a unified evidence cutoff of June 2024. The evaluation included three general-purpose LLMs (OpenAI o1-preview, Claude 3.5 Sonnet, Gemini 1.5 Pro), one retrieval-augmented generation (RAG) system (Myelo), and one agentic workflow-based system (HopeAI). General-purpose LLMs generated responses based solely on their internal knowledge, while the RAG system enhanced these capabilities by incorporating external knowledge retrieval. The agentic workflow system extended the RAG approach by implementing multi-step reasoning and coordinating with multiple tools and external systems for complex task execution. Three independent hematologist-oncologists evaluated the LLM-generated responses using standardized scoring criteria developed specifically for this study. Performance assessment encompassed five dimensions: accuracy, relevance, comprehensiveness, hallucination rate, and clinical use readiness.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>HopeAI demonstrated superior performance across accuracy (82.0%), relevance (85.3%), and comprehensiveness (74.0%), compared to OpenAI o1-preview (64.7, 57.3, 36.0%), Claude 3.5 Sonnet (50.0, 51.3, 29.3%), Gemini 1.5 Pro (48.0, 46.0, 30.0%), and Myelo (58.7, 56, 32.7%). Hallucination rates were consistently low across all systems: HopeAI (5.3%), OpenAI o1-preview (3.3%), Claude 3.5 Sonnet (10.0%), Gemini 1.5 Pro (8.0%), and Myelo (5.3%). Clinical use readiness scores were relatively low for all systems: HopeAI (25.3%), OpenAI o1-preview (6.0%), Claude 3.5 Sonnet (2.7%), Gemini 1.5 Pro (4.0%), and Myelo (4.0%).<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Conclusion<\/jats:title>\n                    <jats:p>This study demonstrates that while current LLMs show promise in medical decision support, their recommendations require careful clinical supervision to ensure patient safety and optimal care. Further research is needed to improve their clinical use readiness before integration into oncology workflows. These findings provide valuable insights into the capabilities and limitations of LLMs in oncology, guiding future research and development efforts toward integrating AI into clinical workflows.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.3389\/frai.2025.1683322","type":"journal-article","created":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T06:30:09Z","timestamp":1765261809000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["AI for evidence-based treatment recommendation in oncology: a blinded evaluation of large language models and agentic workflows"],"prefix":"10.3389","volume":"8","author":[{"given":"Guannan","family":"Zhai","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Merav","family":"Bar","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Andrew J.","family":"Cowan","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Samuel","family":"Rubinstein","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Qian","family":"Shi","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ningjie","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"En","family":"Xie","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Will","family":"Ma","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1965","published-online":{"date-parts":[[2025,12,9]]},"reference":[{"key":"ref1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2303.08774","article-title":"GPT-4 technical report","author":"Achiam","year":"2023","journal-title":"arXiv preprint arXiv:2303.08774."},{"key":"ref2","volume-title":"Introducing Claude 3.5 Sonnet","year":"2024"},{"key":"ref3","doi-asserted-by":"publisher","first-page":"101116","DOI":"10.1016\/j.blre.2023.101116","article-title":"Aiming for the cure in myeloma: putting our best foot forward","volume":"62","author":"Bar","year":"2023","journal-title":"Blood Rev."},{"key":"ref4","doi-asserted-by":"publisher","first-page":"e2343689","DOI":"10.1001\/jamanetworkopen.2023.43689","article-title":"Leveraging large language models for decision support in personalized oncology","volume":"6","author":"Benary","year":"2023","journal-title":"JAMA Netw. Open"},{"key":"ref5","doi-asserted-by":"publisher","first-page":"868","DOI":"10.1007\/s10439-023-03172-7","article-title":"Role of chat GPT in public health","volume":"51","author":"Biswas","year":"2023","journal-title":"Ann. Biomed. Eng."},{"key":"ref6","doi-asserted-by":"publisher","first-page":"723","DOI":"10.1002\/(sici)1097-0258(20000315)19:5<723::aid-sim379>3.0.co;2-a","article-title":"Interval estimation for Cohen's kappa as a measure of agreement","volume":"19","author":"Blackman","year":"2000","journal-title":"Stat. Med."},{"key":"ref7","doi-asserted-by":"publisher","first-page":"12712","DOI":"10.48550\/arXiv.2303.12712","article-title":"Sparks of artificial general intelligence: early experiments with gpt-4","volume":"22","author":"Bubeck","year":"2023","journal-title":"arXiv preprint arXiv:2303"},{"key":"ref8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3648699.3648939","article-title":"Palm: scaling language modeling with pathways","volume":"24","author":"Chowdhery","year":"2023","journal-title":"J. Mach. Learn. Res."},{"key":"ref9","doi-asserted-by":"publisher","first-page":"551","DOI":"10.1016\/0895-4356(90)90159-m","article-title":"High agreement but low kappa: II. Resolving the paradoxes","volume":"43","author":"Cicchetti","year":"1990","journal-title":"J. Clin. Epidemiol."},{"key":"ref10","doi-asserted-by":"publisher","first-page":"464","DOI":"10.1001\/jama.2022.0003","article-title":"Diagnosis and management of multiple myeloma: a review","volume":"327","author":"Cowan","year":"2022","journal-title":"JAMA"},{"key":"ref11","doi-asserted-by":"publisher","first-page":"309","DOI":"10.1016\/j.annonc.2020.11.014","article-title":"Multiple myeloma: EHA\u2013ESMO clinical practice guidelines for diagnosis, treatment and follow-up","volume":"32","author":"Dimopoulos","year":"2021","journal-title":"Ann. Oncol."},{"key":"ref12","doi-asserted-by":"publisher","first-page":"7357","DOI":"10.1182\/blood-2023-180951","article-title":"Multiple myeloma treatment-related decision-making and preferences of patients and care partners in the United States","volume":"142","author":"Flora","year":"2023","journal-title":"Blood"},{"key":"ref13","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1146\/annurev-med-050522-033815","article-title":"New biological therapies for multiple myeloma","volume":"75","author":"Garfall","year":"2024","journal-title":"Annu. Rev. Med."},{"key":"ref14","doi-asserted-by":"publisher","first-page":"2396","DOI":"10.1038\/s41591-023-02412-6","article-title":"Large language model AI chatbots require approval as medical devices","volume":"29","author":"Gilbert","year":"2023","journal-title":"Nat. Med."},{"key":"ref15","volume-title":"Introducing Gemini 1.5 pro","year":"2024"},{"key":"ref16","doi-asserted-by":"publisher","DOI":"10.36227\/techrxiv.23589741.v7","article-title":"A survey on large language models: applications, challenges, limitations, and practical usage","author":"Hadi","year":"2023","journal-title":"Authorea Preprints"},{"key":"ref17","year":"2024"},{"key":"ref18","doi-asserted-by":"publisher","first-page":"e39305","DOI":"10.7759\/cureus.39305","article-title":"Embracing large language models for medical applications: opportunities and challenges","volume":"15","author":"Karabacak","year":"2023","journal-title":"Cureus."},{"key":"ref19","doi-asserted-by":"publisher","first-page":"1331","DOI":"10.2967\/jnumed.122.264972","article-title":"New developments in myeloma treatment and response assessment","volume":"64","author":"Kraeber-Bod\u00e9r\u00e9","year":"2023","journal-title":"J. Nucl. Med."},{"key":"ref20","doi-asserted-by":"publisher","first-page":"132","DOI":"10.6004\/jnccn.2025.0023","article-title":"NCCN guidelines\u00ae insights: multiple myeloma, version 1.2025","volume":"23","author":"Kumar","year":"2025","journal-title":"J. Natl. Compr. Cancer Netw."},{"key":"ref21","doi-asserted-by":"publisher","first-page":"1233","DOI":"10.1056\/NEJMsr2214184","article-title":"Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine","volume":"388","author":"Lee","year":"2023","journal-title":"N. Engl. J. Med."},{"key":"ref22","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1007\/s44336-024-00009-2","article-title":"A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges","volume":"1","author":"Li","year":"2024","journal-title":"Vicinagearth"},{"key":"ref23","doi-asserted-by":"publisher","first-page":"1228","DOI":"10.1200\/JCO.18.02096","article-title":"Treatment of multiple myeloma: ASCO and CCO joint clinical practice guideline","volume":"37","author":"Mikhael","year":"2019","journal-title":"J. Clin. Oncol."},{"key":"ref24","doi-asserted-by":"publisher","first-page":"705","DOI":"10.1056\/NEJMoa2024850","article-title":"Idecabtagene vicleucel in relapsed and refractory multiple myeloma","volume":"384","author":"Munshi","year":"2021","journal-title":"N. Engl. J. Med."},{"key":"ref25","doi-asserted-by":"publisher","first-page":"1394048","DOI":"10.3389\/fonc.2024.1394048","article-title":"Bispecific antibodies for the treatment of relapsed\/refractory multiple myeloma: updates and future perspectives","volume":"14","author":"Parrondo","year":"2024","journal-title":"Front. Oncol."},{"key":"ref26","doi-asserted-by":"publisher","first-page":"2024","DOI":"10.1101\/2024.03.14.24304293","article-title":"A RAG chatbot for precision medicine of multiple myeloma","volume":"12","author":"Quidwai","year":"2024","journal-title":"MedRxiv."},{"key":"ref27","doi-asserted-by":"publisher","first-page":"24","DOI":"10.1038\/s41408-023-00966-9","article-title":"GPRC5D as a novel target for the treatment of multiple myeloma: a narrative review","volume":"14","author":"Rodriguez-Otero","year":"2024","journal-title":"Blood Cancer J."},{"key":"ref28","doi-asserted-by":"publisher","first-page":"667","DOI":"10.1038\/s41417-024-00750-2","article-title":"CAR T therapies in multiple myeloma: unleashing the future","volume":"31","author":"Sheykhhasan","year":"2024","journal-title":"Cancer Gene Ther."},{"key":"ref29","author":"Singh","year":"2024"},{"key":"ref30","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1038\/s41591-022-02160-z","article-title":"The next generation of evidence-based medicine","volume":"29","author":"Subbiah","year":"2023","journal-title":"Nat. Med."},{"key":"ref31","doi-asserted-by":"publisher","first-page":"830","DOI":"10.1200\/JCO.2009.25.4177","article-title":"Patterns of improved survival in patients with multiple myeloma in the twenty-first century: a population-based study","volume":"28","author":"Turesson","year":"2010","journal-title":"J. Clin. Oncol."},{"key":"ref32","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1200\/JOP.2013.001319","article-title":"Projected supply of and demand for oncologists and radiation oncologists through 2025: an aging, better-insured population will result in shortage","volume":"10","author":"Yang","year":"2014","journal-title":"J. Oncol. Pract."}],"container-title":["Frontiers in Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1683322\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T06:30:11Z","timestamp":1765261811000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1683322\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,9]]},"references-count":32,"alternative-id":["10.3389\/frai.2025.1683322"],"URL":"https:\/\/doi.org\/10.3389\/frai.2025.1683322","relation":{},"ISSN":["2624-8212"],"issn-type":[{"value":"2624-8212","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,9]]},"article-number":"1683322"}}