{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T13:16:35Z","timestamp":1781270195539,"version":"3.54.1"},"reference-count":28,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T00:00:00Z","timestamp":1759968000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Artif. Intell."],"abstract":"<jats:sec><jats:title>Introduction<\/jats:title><jats:p>This study investigates the efficacy of large language models (LLMs) for generating accurate scientific responses through a comparative evaluation of five prominent free models: Claude 3.5 Sonnet, Gemini, ChatGPT 4o, Mistral Large 2, and Llama 3.1 70B.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods<\/jats:title><jats:p>Sixteen expert scientific reviewers assessed these models in terms of depth, accuracy, relevance, and clarity.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Claude 3.5 Sonnet emerged as the highest scoring model, followed by Gemini, with notable variability among the other models. Additionally, retrieval-augmented generation (RAG) techniques were applied to improve LLM performance, and prompts were refined to improve answers. The results indicate that although LLMs such as Claude 3.5 Sonnet have potential for scientific tasks, other models may require more development or additional prompt engineering to reach comparable accuracy. Reviewers\u2019 perceptions of artificial intelligence (AI) utility and trustworthiness showed a positive shift after evaluation. However, ethical concerns, particularly with respect to transparency and disclosure, remained consistent.<\/jats:p><\/jats:sec><jats:sec><jats:title>Discussion<\/jats:title><jats:p>The study highlights the need for structured frameworks for evaluating LLMs and ethical considerations essential for responsible AI integration in scientific research. These findings should be interpreted with caution, as the limited sample size and domain-specific focus of the exam questions restrict the generalizability of the results.<\/jats:p><\/jats:sec>","DOI":"10.3389\/frai.2025.1664303","type":"journal-article","created":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T11:56:47Z","timestamp":1760011007000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["There are significant differences among artificial intelligence large language models when answering scientific questions"],"prefix":"10.3389","volume":"8","author":[{"given":"Francisco Javier","family":"\u00c1lvarez-Mart\u00ednez","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Luis","family":"Esteban","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lucas","family":"Frungillo","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Estefan\u00eda","family":"Butassi","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Alessandro","family":"Zambon","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Mar\u00eda","family":"Herranz-L\u00f3pez","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Mario","family":"Aranda","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Federica","family":"Pollastro","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Anne Sylvie","family":"Tixier","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jose V.","family":"Garcia-Perez","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"David","family":"Arr\u00e1ez-Rom\u00e1n","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Andrew","family":"Ross","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Pedro","family":"Mena","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ru Angelie","family":"Edrada-Ebel","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"James","family":"Lyng","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Vicente","family":"Micol","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fernando","family":"Borr\u00e1s-Rocher","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Enrique","family":"Barraj\u00f3n-Catal\u00e1n","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1965","published-online":{"date-parts":[[2025,10,9]]},"reference":[{"key":"ref1","doi-asserted-by":"publisher","first-page":"101805","DOI":"10.1016\/j.inffus.2023.101805","article-title":"Explainable artificial intelligence (XAI): what we know and what is left to attain trustworthy artificial intelligence","volume":"99","author":"Ali","year":"2023","journal-title":"Inf. Fusion"},{"key":"ref2","doi-asserted-by":"publisher","first-page":"923","DOI":"10.1111\/imcb.12689","article-title":"Artificial intelligence takes center stage: exploring the capabilities and implications of ChatGPT and other AI-assisted technologies in scientific research and education","volume":"101","author":"Borger","year":"2023","journal-title":"Immunol. Cell Biol."},{"key":"ref3","first-page":"364","article-title":"Challenges and opportunities in integrating LLMs into continuous integration\/continuous deployment (CI\/CD) pipelines","author":"Chen","year":"2024"},{"key":"ref4","doi-asserted-by":"publisher","first-page":"215","DOI":"10.1097\/APO.0000000000000469","article-title":"Promoting transparency and standardization in ophthalmologic artificial intelligence: a call for artificial intelligence model card","volume":"11","author":"Chen","year":"2022","journal-title":"Asia Pac. J. Ophthalmol."},{"key":"ref5","doi-asserted-by":"publisher","first-page":"e51229","DOI":"10.2196\/51229","article-title":"Comparisons of quality, correctness, and similarity between ChatGPT-generated and human-written abstracts for basic research: cross-sectional study","volume":"25","author":"Cheng","year":"2023","journal-title":"J. Med. Internet Res."},{"key":"ref6","doi-asserted-by":"publisher","first-page":"e59641","DOI":"10.2196\/59641","article-title":"Large language models can enable inductive thematic analysis of a social media corpus in a single prompt: human validation study","volume":"4","author":"Deiner","year":"2024","journal-title":"JMIR Infodemiol."},{"key":"ref7","doi-asserted-by":"publisher","first-page":"e53043","DOI":"10.2196\/53043","article-title":"Comparing the perspectives of generative AI, mental health experts, and the general public on schizophrenia recovery: case vignette study","volume":"11","author":"Elyoseph","year":"2024","journal-title":"JMIR Ment. Health"},{"key":"ref8","doi-asserted-by":"publisher","first-page":"296","DOI":"10.48550\/arXiv.2312.02296","article-title":"LLMs accelerate annotation for medical information extraction","volume":"225","author":"Goel","year":"2023","journal-title":"Proc. Mach. Learn. Res."},{"key":"ref9","doi-asserted-by":"publisher","first-page":"449","DOI":"10.1177\/17470161231180449","article-title":"The ethics of disclosing the use of artificial intelligence tools in writing scholarly manuscripts","volume":"19","author":"Hosseini","year":"2023","journal-title":"Res. Ethics"},{"key":"ref10","doi-asserted-by":"publisher","first-page":"108189","DOI":"10.1016\/j.compbiomed.2024.108189","article-title":"A comprehensive evaluation of large language models on benchmark biomedical text processing tasks","volume":"171","author":"Jahan","year":"2024","journal-title":"Comput. Biol. Med."},{"key":"ref11","doi-asserted-by":"publisher","first-page":"895","DOI":"10.1016\/j.jds.2024.08.020","article-title":"Can a large language model create acceptable dental board-style examination questions? A cross-sectional prospective study","volume":"20","author":"Kim","year":"2024","journal-title":"J. Dent. Sci."},{"key":"ref12","doi-asserted-by":"publisher","first-page":"879603","DOI":"10.3389\/frai.2022.879603","article-title":"Transparency of AI in healthcare as a multilayered system of accountabilities: between legal requirements and technical limitations","volume":"5","author":"Kiseleva","year":"2022","journal-title":"Front. Artif. Intell."},{"key":"ref13","author":"Kostic","year":"2024"},{"key":"ref14","doi-asserted-by":"publisher","first-page":"e55927","DOI":"10.2196\/55927","article-title":"Benchmarking state-of-the-art large language models for migraine patient education: performance comparison of responses to common queries","volume":"26","author":"Li","year":"2024","journal-title":"J. Med. Internet Res."},{"key":"ref15","volume-title":"A systematic investigation of knowledge retrieval and selection for retrieval augmented generation","author":"Li","year":"2024"},{"key":"ref16","volume-title":"Structure-aware domain knowledge injection for large language models","author":"Liu","year":"2024"},{"key":"ref17","doi-asserted-by":"publisher","first-page":"313","DOI":"10.1002\/ski2.313","article-title":"Comparison of large language models in management advice for melanoma: Google\u2019s AI bard, BingAI and ChatGPT","volume":"4","author":"Mu","year":"2024","journal-title":"Skin Health Dis."},{"key":"ref18","doi-asserted-by":"publisher","first-page":"1","DOI":"10.53761\/1.20.02.07","article-title":"Academic integrity considerations of AI large language models in the post-pandemic era: ChatGPT and beyond","volume":"20","author":"Perkins","year":"2023","journal-title":"J. Univ. Teach. Learn. Pract."},{"key":"ref19","doi-asserted-by":"publisher","first-page":"455","DOI":"10.1007\/s00345-024-05146-3","article-title":"Using artificial intelligence to generate medical literature for urology patients: a comparison of three different large language models","volume":"42","author":"Pompili","year":"2024","journal-title":"World. J. Urol."},{"key":"ref20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1057\/s41599-024-03609-x","article-title":"Performance and biases of large language models in public opinion simulation","volume":"11","author":"Qu","year":"2024","journal-title":"Humanit. Soc. Sci. Commun."},{"key":"ref21","doi-asserted-by":"publisher","first-page":"364","DOI":"10.37074\/jalt.2023.6.1.23","article-title":"War of the chatbots: bard, Bing chat, ChatGPT, Ernie and beyond. The new AI gold rush and its impact on higher education","volume":"6","author":"Rudolph","year":"2023","journal-title":"J. Appl. Learn. Teach."},{"key":"ref22","doi-asserted-by":"publisher","first-page":"2830","DOI":"10.3390\/cancers16162830","article-title":"Leveraging large language models for precision monitoring of chemotherapy-induced toxicities: a pilot study with expert comparisons and future directions","volume":"16","author":"Ruiz Sarrias","year":"2024","journal-title":"Cancers (Basel)"},{"key":"ref23","doi-asserted-by":"publisher","first-page":"1835","DOI":"10.1007\/s00405-023-08372-4","article-title":"Reliability of large language models in managing odontogenic sinusitis clinical scenarios: a preliminary multidisciplinary evaluation","volume":"281","author":"Saibene","year":"2024","journal-title":"Eur. Arch. Otorrinolaringol."},{"key":"ref24","doi-asserted-by":"publisher","first-page":"712","DOI":"10.3233\/SHTI240513","article-title":"Healthcare LLMs go to market: a realist review of product launch news","volume":"316","author":"Sharifi","year":"2024","journal-title":"Digit. Health Inform. Innov. Sustain. Health Care Syst."},{"key":"ref25","doi-asserted-by":"publisher","first-page":"e34391","DOI":"10.1016\/j.heliyon.2024.e34391","article-title":"Benchmarking four large language models\u2019 performance of addressing Chinese patients' inquiries about dry eye disease: a two-phase study","volume":"10","author":"Shi","year":"2024","journal-title":"Heliyon"},{"key":"ref26","doi-asserted-by":"publisher","first-page":"117","DOI":"10.1186\/s12911-025-02954-4","article-title":"A systematic review of large language model (LLM) evaluations in clinical medicine","volume":"25","author":"Shool","year":"2023","journal-title":"BMC Med. Inform. Decis. Mak."},{"key":"ref27","article-title":"Omnieval: an omnidirectional and automatic RAG evaluation benchmark in financial domain","volume-title":"Arxiv","author":"Wang","year":"2024"},{"key":"ref28","volume-title":"Unveiling scoring processes: dissecting the differences between LLMs and human graders in automatic scoring","author":"Wu","year":"2024"}],"container-title":["Frontiers in Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1664303\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T11:56:49Z","timestamp":1760011009000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1664303\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,9]]},"references-count":28,"alternative-id":["10.3389\/frai.2025.1664303"],"URL":"https:\/\/doi.org\/10.3389\/frai.2025.1664303","relation":{},"ISSN":["2624-8212"],"issn-type":[{"value":"2624-8212","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,9]]},"article-number":"1664303"}}