{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T07:45:19Z","timestamp":1774597519862,"version":"3.50.1"},"reference-count":28,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,10,7]],"date-time":"2025-10-07T00:00:00Z","timestamp":1759795200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Digit. Health"],"abstract":"<jats:sec><jats:title>Background<\/jats:title><jats:p>Esophageal cancer has high incidence and mortality rates, leading to increased public demand for accurate information. However, the reliability of online medical information is often questionable. This study systematically compared the accuracy, completeness, and comprehensibility of mainstream large language models (LLMs) in answering esophageal cancer-related questions.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods<\/jats:title><jats:p>In total, 65 questions covering fundamental knowledge, preoperative preparation, surgical treatment, and postoperative management were selected. Each model, namely, ChatGPT 5, Claude Sonnet 4.0, DeepSeek-R1, Gemini 2.5 Pro, and Grok-4, was queried independently using standardized prompts. Five senior clinical experts, including three thoracic surgeons, one radiologist, and one medical oncologist, evaluated the responses using a five-point Likert scale. A retesting mechanism was applied for the low-scoring responses, and intraclass correlation coefficients were used to assess the rating consistency. The statistical analyses were conducted using the Friedman test, the Wilcoxon signed-rank test, and the Bonferroni correction.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>All the models performed well, with average scores exceeding 4.0. 
However, significant differences emerged: Gemini excelled in accuracy, while ChatGPT led in completeness, particularly in surgical and postoperative contexts. Minor differences appeared in fundamental knowledge, but notable disparities were found in complex areas. Retesting showed improvements in overall quality, yet some responses declined in completeness and relevance.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>Large language models have considerable potential in answering questions about esophageal cancer, with significant differences in completeness. ChatGPT is more comprehensive in complex scenarios, while Gemini excels in accuracy. This study offers guidance for selecting artificial intelligence tools in clinical settings, advocating for a tiered application strategy tailored to specific scenarios and highlighting the importance of user education to understand the limitations and applicability of LLMs.<\/jats:p><\/jats:sec>","DOI":"10.3389\/fdgth.2025.1670510","type":"journal-article","created":{"date-parts":[[2025,10,7]],"date-time":"2025-10-07T05:36:09Z","timestamp":1759815369000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Comparative performance evaluation of large language models in answering esophageal cancer-related questions: a multi-model assessment 
study"],"prefix":"10.3389","volume":"7","author":[{"given":"Zijie","family":"He","sequence":"first","affiliation":[]},{"given":"Lilan","family":"Zhao","sequence":"additional","affiliation":[]},{"given":"Genglin","family":"Li","sequence":"additional","affiliation":[]},{"given":"Jintao","family":"Wang","sequence":"additional","affiliation":[]},{"given":"Songyu","family":"Cai","sequence":"additional","affiliation":[]},{"given":"Pengjie","family":"Tu","sequence":"additional","affiliation":[]},{"given":"Jingbo","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Jianman","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Juan","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Ruiqi","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Yangyun","family":"Huang","sequence":"additional","affiliation":[]},{"given":"Xiaojie","family":"Pan","sequence":"additional","affiliation":[]},{"given":"Wenshu","family":"Chen","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2025,10,7]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"209","DOI":"10.3322\/caac.21660","article-title":"Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries","volume":"71","author":"Sung","year":"2021","journal-title":"CA Cancer J Clin"},{"key":"B2","doi-asserted-by":"publisher","first-page":"233","DOI":"10.2188\/jea.je20120162","article-title":"Epidemiology of esophageal cancer in Japan and China","volume":"23","author":"Lin","year":"2013","journal-title":"J Epidemiol"},{"key":"B3","doi-asserted-by":"publisher","first-page":"3253","DOI":"10.1007\/s00405-024-08581-5","article-title":"ChatGPT as an information tool in rhinology. 
Can we trust each other today?","volume":"281","author":"Riestra-Ayora","year":"2024","journal-title":"Eur Arch Otorhinolaryngol"},{"key":"B4","doi-asserted-by":"publisher","first-page":"5928","DOI":"10.1097\/JS9.0000000000001696","article-title":"The rise of ChatGPT-4: exploring its efficacy as a decision support tool in esophageal surgery\u2014a research letter","volume":"110","author":"Zhou","year":"2024","journal-title":"Int J Surg"},{"key":"B5","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1186\/s12929-025-01131-z","article-title":"Impact of large language model (ChatGPT) in healthcare: an umbrella review and evidence synthesis","volume":"32","author":"Iqbal","year":"2025","journal-title":"J Biomed Sci"},{"key":"B6","doi-asserted-by":"publisher","first-page":"721","DOI":"10.3350\/cmh.2023.0089","article-title":"Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma","volume":"29","author":"Yeo","year":"2023","journal-title":"Clin Mol Hepatol"},{"key":"B7","doi-asserted-by":"publisher","first-page":"1028","DOI":"10.1001\/jamainternmed.2023.2909","article-title":"Chatbot vs medical student performance on free-response clinical reasoning examinations","volume":"183","author":"Strong","year":"2023","journal-title":"JAMA Intern Med"},{"key":"B8","doi-asserted-by":"publisher","first-page":"707","DOI":"10.1093\/ced\/llad402","article-title":"ChatGPT versus clinician: challenging the diagnostic capabilities of artificial intelligence in dermatology","volume":"49","author":"Stoneham","year":"2024","journal-title":"Clin Exp Dermatol"},{"key":"B9","doi-asserted-by":"publisher","first-page":"1518049","DOI":"10.3389\/frai.2025.1518049","article-title":"Benefits, limits, and risks of ChatGPT in medicine","volume":"8","author":"Tangsrivimol","year":"2025","journal-title":"Front Artif 
Intell"},{"key":"B10","doi-asserted-by":"publisher","first-page":"1477898","DOI":"10.3389\/fmed.2024.1477898","article-title":"Large language models in patient education: a scoping review of applications in medicine","volume":"11","author":"Aydin","year":"2024","journal-title":"Front Med (Lausanne)"},{"key":"B11","doi-asserted-by":"publisher","first-page":"220","DOI":"10.14366\/usg.25012","article-title":"Diagnostic performance of multimodal large language models in radiological quiz cases: the effects of prompt engineering and input conditions","volume":"44","author":"Han","year":"2025","journal-title":"Ultrasonography"},{"key":"B12","doi-asserted-by":"publisher","first-page":"197","DOI":"10.1186\/s12885-025-13596-0","article-title":"Benchmarking LLM chatbots\u2019 oncological knowledge with the Turkish Society of Medical Oncology\u2019s annual board examination questions","volume":"25","author":"Erdat","year":"2025","journal-title":"BMC Cancer"},{"key":"B13","doi-asserted-by":"publisher","first-page":"4252","DOI":"10.1097\/JS9.0000000000002507","article-title":"Evaluating generative AI models for explainable pathological feature extraction in lung adenocarcinoma: grading assessment and prognostic model construction","volume":"111","author":"Shen","year":"2025","journal-title":"Int J Surg"},{"key":"B14","doi-asserted-by":"publisher","first-page":"2546","DOI":"10.1038\/s41591-025-03727-2","article-title":"Benchmark evaluation of DeepSeek large language models in clinical decision-making","volume":"31","author":"Sandmann","year":"2025","journal-title":"Nat Med"},{"key":"B15","doi-asserted-by":"publisher","first-page":"1616145","DOI":"10.3389\/frai.2025.1616145","article-title":"Medical reasoning in LLMs: an in-depth analysis of DeepSeek R1","volume":"8","author":"Moell","year":"2025","journal-title":"Front Artif Intell"},{"key":"B16","doi-asserted-by":"publisher","first-page":"1754","DOI":"10.1007\/s10439-025-03738-7","article-title":"Deepseek\u2019s readiness for 
medical research and practice: prospects, bottlenecks, and global regulatory constraints","volume":"53","author":"MohanaSundaram","year":"2025","journal-title":"Ann Biomed Eng"},{"key":"B17","doi-asserted-by":"publisher","first-page":"e81618","DOI":"10.7759\/cureus.81618","article-title":"Performance of large language models (ChatGPT and Gemini Advanced) in gastrointestinal pathology and clinical review of applications in gastroenterology","volume":"17","author":"Jain","year":"2025","journal-title":"Cureus"},{"key":"B18","doi-asserted-by":"publisher","first-page":"527","DOI":"10.1007\/s00417-024-06625-4","article-title":"Gemini AI vs. ChatGPT: a comprehensive examination alongside ophthalmology residents in medical knowledge","volume":"263","author":"Bahir","year":"2025","journal-title":"Graefes Arch Clin Exp Ophthalmol"},{"key":"B19","doi-asserted-by":"publisher","first-page":"903","DOI":"10.1186\/s12909-025-07493-0","article-title":"A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection","volume":"25","author":"Temizsoy Korkmaz","year":"2025","journal-title":"BMC Med Educ"},{"key":"B20","doi-asserted-by":"publisher","first-page":"1336","DOI":"10.1016\/j.cmi.2025.03.002","article-title":"Comparing large language models for antibiotic prescribing in different clinical scenarios: which performs better?","volume":"31","author":"De Vito","year":"2025","journal-title":"Clin Microbiol Infect"},{"key":"B21","doi-asserted-by":"publisher","first-page":"1136","DOI":"10.2106\/JBJS.23.00914","article-title":"Assessing the accuracy and reliability of AI-generated responses to patient questions regarding spine surgery","volume":"106","author":"Kasthuri","year":"2024","journal-title":"J Bone Joint Surg Am"},{"key":"B22","doi-asserted-by":"publisher","first-page":"e244630","DOI":"10.1001\/jamanetworkopen.2024.4630","article-title":"Quality of large language model responses to radiation oncology patient care 
questions","volume":"7","author":"Yalamanchili","year":"2024","journal-title":"JAMA Netw Open"},{"key":"B23","doi-asserted-by":"publisher","first-page":"rs.3.rs-2566942","DOI":"10.21203\/rs.3.rs-2566942\/v1","article-title":"Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model","author":"Johnson","year":"2023","journal-title":"Res Sq"},{"key":"B24","doi-asserted-by":"publisher","first-page":"e86537","DOI":"10.7759\/cureus.86537","article-title":"How well do different AI language models inform patients about radiofrequency ablation for varicose veins?","volume":"17","author":"Zyada","year":"2025","journal-title":"Cureus"},{"key":"B25","doi-asserted-by":"publisher","first-page":"172","DOI":"10.1038\/s41586-023-06291-2","article-title":"Large language models encode clinical knowledge","volume":"620","author":"Singhal","year":"2023","journal-title":"Nature"},{"key":"B26","doi-asserted-by":"publisher","first-page":"e0000651","DOI":"10.1371\/journal.pdig.0000651","article-title":"Bias in medical AI: implications for clinical decision-making","volume":"3","author":"Cross","year":"2024","journal-title":"PLoS Digit Health"},{"key":"B27","doi-asserted-by":"publisher","first-page":"819","DOI":"10.1001\/jamaophthalmol.2023.3119","article-title":"Evaluation and comparison of ophthalmic scientific abstracts and references by current artificial intelligence chatbots","volume":"141","author":"Hua","year":"2023","journal-title":"JAMA Ophthalmol"},{"key":"B28","doi-asserted-by":"publisher","first-page":"e53164","DOI":"10.2196\/53164","article-title":"Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: comparative analysis","volume":"26","author":"Chelli","year":"2024","journal-title":"J Med Internet Res"}],"container-title":["Frontiers in Digital 
Health"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fdgth.2025.1670510\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,7]],"date-time":"2025-10-07T05:36:10Z","timestamp":1759815370000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fdgth.2025.1670510\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,7]]},"references-count":28,"alternative-id":["10.3389\/fdgth.2025.1670510"],"URL":"https:\/\/doi.org\/10.3389\/fdgth.2025.1670510","relation":{},"ISSN":["2673-253X"],"issn-type":[{"value":"2673-253X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,7]]},"article-number":"1670510"}}