{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T02:31:08Z","timestamp":1778121068429,"version":"3.51.4"},"reference-count":66,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,12,22]],"date-time":"2025-12-22T00:00:00Z","timestamp":1766361600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,12,29]],"date-time":"2025-12-29T00:00:00Z","timestamp":1766966400000},"content-version":"vor","delay-in-days":7,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"BayernKI"},{"DOI":"10.13039\/501100001659","name":"Deutsche Forschungsgemeinschaft","doi-asserted-by":"publisher","award":["NE 2136\/3-1, LI3893\/6-1, TR 1700\/7-1"],"award-info":[{"award-number":["NE 2136\/3-1, LI3893\/6-1, TR 1700\/7-1"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001659","name":"Deutsche Forschungsgemeinschaft","doi-asserted-by":"publisher","award":["440719683"],"award-info":[{"award-number":["440719683"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]},{"name":"German Federal Ministry of Education","award":["TRANSFORM LIVER, 031L0312A; SWAG, 01KD2215B"],"award-info":[{"award-number":["TRANSFORM LIVER, 031L0312A; SWAG, 01KD2215B"]}]},{"name":"European Union\u2019s Horizon Europe and innovation programme","award":["ODELIA [Open Consortium for Decentralized Medical Artificial Intelligence], 101057091"],"award-info":[{"award-number":["ODELIA [Open Consortium for Decentralized Medical Artificial Intelligence], 101057091"]}]},{"DOI":"10.13039\/501100000780","name":"European 
Union","doi-asserted-by":"crossref","award":["101079894"],"award-info":[{"award-number":["101079894"]}],"id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100019627","name":"Bayern Innovativ","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100019627","id-type":"DOI","asserted-by":"publisher"}]},{"name":"German Federal Ministry of Education and Research"},{"DOI":"10.13039\/100015330","name":"Max Kade Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100015330","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Wilhelm-Sander Foundation"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Large language models (LLMs) show promise for radiology decision support, yet conventional retrieval-augmented generation (RAG) relies on single-step retrieval and struggles with complex reasoning. We introduce radiology Retrieval and Reasoning (RaR), a multi-step retrieval framework that iteratively summarizes clinical questions, retrieves evidence, and synthesizes answers. We evaluated 25 LLMs spanning general-purpose, reasoning-optimized, and clinically fine-tuned models (0.5B\u2009\u2192\u2009670B parameters) on 104 expert-curated radiology questions and an independent set of 65 real radiology board-exam questions. RaR significantly improved mean diagnostic accuracy versus zero-shot prompting (75% vs. 67%;\n                    <jats:italic>P<\/jats:italic>\n                    \u2009=\u20091.1\u2009\u00d7\u200910\n                    <jats:sup>\u22127<\/jats:sup>\n                    ) and conventional online RAG (75% vs. 69%;\n                    <jats:italic>P<\/jats:italic>\n                    \u2009=\u20091.9\u2009\u00d7\u200910\n                    <jats:sup>\u22126<\/jats:sup>\n                    ). 
Gains were largest in mid-sized and small models (e.g., Mistral Large: 72%\u2009\u2192\u200981%), while very large models showed minimal change. RaR reduced hallucinations and provided clinically relevant evidence in 46% of cases, improving factual grounding. These results show that multi-step retrieval enhances diagnostic reliability, especially in deployable mid-sized LLMs. Code, datasets, and RaR are publicly available.\n                  <\/jats:p>","DOI":"10.1038\/s41746-025-02250-5","type":"journal-article","created":{"date-parts":[[2025,12,22]],"date-time":"2025-12-22T11:03:39Z","timestamp":1766401419000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Multi-step retrieval and reasoning improves radiology question answering with large language models"],"prefix":"10.1038","volume":"8","author":[{"given":"Sebastian","family":"Wind","sequence":"first","affiliation":[]},{"given":"Jeta","family":"Sopa","sequence":"additional","affiliation":[]},{"given":"Daniel","family":"Truhn","sequence":"additional","affiliation":[]},{"given":"Mahshad","family":"Lotfinia","sequence":"additional","affiliation":[]},{"given":"Tri-Thien","family":"Nguyen","sequence":"additional","affiliation":[]},{"given":"Keno","family":"Bressem","sequence":"additional","affiliation":[]},{"given":"Lisa","family":"Adams","sequence":"additional","affiliation":[]},{"given":"Mirabela","family":"Rusu","sequence":"additional","affiliation":[]},{"given":"Harald","family":"K\u00f6stler","sequence":"additional","affiliation":[]},{"given":"Gerhard","family":"Wellein","sequence":"additional","affiliation":[]},{"given":"Andreas","family":"Maier","sequence":"additional","affiliation":[]},{"given":"Soroosh","family":"Tayebi Arasteh","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,12,22]]},"reference":[{"key":"2250_CR1","doi-asserted-by":"publisher","unstructured":"Akinci 
D\u2019Antonoli, T. et al. Large language models in radiology: fundamentals, applications, ethical considerations, risks, and future directions. Diagn. Interv. Radiol. https:\/\/doi.org\/10.4274\/dir.2023.232417 (2023).","DOI":"10.4274\/dir.2023.232417"},{"key":"2250_CR2","doi-asserted-by":"crossref","unstructured":"Buess, L., Keicher, M., Navab, N., Maier, A. & Tayebi Arasteh, S. From large language models to multimodal AI: a scoping review on the potential of generative AI in medicine. Biomed. Eng. Lett. 15, 1\u201319 (2025).","DOI":"10.1007\/s13534-025-00497-1"},{"key":"2250_CR3","doi-asserted-by":"publisher","DOI":"10.1148\/radiol.233441","volume":"313","author":"S Tayebi Arasteh","year":"2024","unstructured":"Tayebi Arasteh, S. et al. The treasure trove hidden in plain sight: the utility of GPT-4 in chest radiograph evaluation. Radiology 313, e233441 (2024).","journal-title":"Radiology"},{"key":"2250_CR4","doi-asserted-by":"publisher","first-page":"141","DOI":"10.1038\/s43856-023-00370-1","volume":"3","author":"J Clusmann","year":"2023","unstructured":"Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Med. 3, 141 (2023).","journal-title":"Commun. Med."},{"key":"2250_CR5","doi-asserted-by":"publisher","first-page":"1930","DOI":"10.1038\/s41591-023-02448-8","volume":"29","author":"AJ Thirunavukarasu","year":"2023","unstructured":"Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930\u20131940 (2023).","journal-title":"Nat. Med."},{"key":"2250_CR6","doi-asserted-by":"publisher","first-page":"172","DOI":"10.1038\/s41586-023-06291-2","volume":"620","author":"K Singhal","year":"2023","unstructured":"Singhal, K. et al. Large language models encode clinical knowledge. 
Nature 620, 172\u2013180 (2023).","journal-title":"Nature"},{"key":"2250_CR7","doi-asserted-by":"publisher","first-page":"641","DOI":"10.1016\/S0140-6736(23)00216-7","volume":"401","author":"A Arora","year":"2023","unstructured":"Arora, A. & Arora, A. The promise of large language models in health care. Lancet 401, 641 (2023).","journal-title":"Lancet"},{"key":"2250_CR8","unstructured":"OpenAI. GPT-4 Technical Report. Preprint at http:\/\/arxiv.org\/abs\/2303.08774 (2023)."},{"key":"2250_CR9","doi-asserted-by":"publisher","DOI":"10.1148\/radiol.231362","volume":"308","author":"MA Fink","year":"2023","unstructured":"Fink, M. A. et al. Potential of ChatGPT and GPT-4 for data mining of free-text CT reports on lung cancer. Radiology 308, e231362 (2023).","journal-title":"Radiology"},{"key":"2250_CR10","doi-asserted-by":"publisher","DOI":"10.1148\/radiol.230725","volume":"307","author":"LC Adams","year":"2023","unstructured":"Adams, L. C. et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology 307, e230725 (2023).","journal-title":"Radiology"},{"key":"2250_CR11","doi-asserted-by":"publisher","DOI":"10.1148\/radiol.231167","volume":"308","author":"J Kottlors","year":"2023","unstructured":"Kottlors, J. et al. Feasibility of differential diagnosis based on imaging patterns using a large language model. Radiology 308, e231167 (2023).","journal-title":"Radiology"},{"key":"2250_CR12","doi-asserted-by":"publisher","DOI":"10.1148\/ryai.230205","volume":"6","author":"RA Schmidt","year":"2024","unstructured":"Schmidt, R. A. et al. Generative large language models for detection of speech recognition errors in radiology reports. Radiol. Artif. Intell. 6, e230205 (2024).","journal-title":"Radiol. Artif. Intell."},{"key":"2250_CR13","unstructured":"Lewis, P. et al. Retrieval-Augmented generation for knowledge-intensive NLP Tasks. in Adv. Neural Inform. Proces. Sys. (eds. 
Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F. & Lin, H.) 33, 9459\u20139474 (Curran Associates, Inc., 2020)."},{"key":"2250_CR14","first-page":"e35179","volume":"15","author":"H Alkaissi","year":"2023","unstructured":"Alkaissi, H. & McFarlane, S. I. Artificial Hallucinations in ChatGPT: implications in scientific writing. Cureus 15, e35179 (2023).","journal-title":"Cureus"},{"key":"2250_CR15","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3571730","volume":"55","author":"Z Ji","year":"2023","unstructured":"Ji, Z. et al. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 55, 1\u201338 (2023).","journal-title":"ACM Comput. Surv."},{"key":"2250_CR16","doi-asserted-by":"crossref","unstructured":"Zakka, C. et al. Almanac\u2014retrieval-augmented language models for clinical medicine. NEJM AI 1, (2024).","DOI":"10.1056\/AIoa2300068"},{"key":"2250_CR17","doi-asserted-by":"crossref","unstructured":"Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. In Findings of the Association for Computational Linguistics: ACL 2024, (eds Ku, L-W., Martins, A. & Srikumar, V.) 6233\u20136251 (Association for Computational Linguistics, Bangkok, Thailand, 2024).","DOI":"10.18653\/v1\/2024.findings-acl.372"},{"key":"2250_CR18","doi-asserted-by":"publisher","DOI":"10.1148\/ryai.240476","volume":"7","author":"S Tayebi Arasteh","year":"2025","unstructured":"Tayebi Arasteh, S. et al. RadioRAG: online retrieval\u2013augmented generation for radiology question answering. Radiol. Artif. Intell. 7, e240476 (2025).","journal-title":"Radiol. Artif. Intell."},{"key":"2250_CR19","unstructured":"Radiopaedia Australia Pty Ltd ACN 133 562 722. Radiopaedia https:\/\/radiopaedia.org\/."},{"key":"2250_CR20","doi-asserted-by":"crossref","unstructured":"Fink, A., Rau, A., Reisert, M., Bamberg, F. & Russe, M. F. Retrieval-augmented generation with large language models in radiology: from theory to practice. Radiol. 
Artif. Intell. 7, e240790 (2025).","DOI":"10.1148\/ryai.240790"},{"key":"2250_CR21","unstructured":"Brown, T. B. et al. Language models are few-shot learners. in Proc. 34th International Conference on Neural Information Processing Systems. 159, 1877\u20131901 (2020)."},{"key":"2250_CR22","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-024-45879-8","volume":"15","author":"S Tayebi Arasteh","year":"2024","unstructured":"Tayebi Arasteh, S. et al. Large language models streamline automated machine learning for clinical studies. Nat. Commun. 15, 1603 (2024).","journal-title":"Nat. Commun."},{"key":"2250_CR23","doi-asserted-by":"publisher","unstructured":"Ferber, D. et al. Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology. Nat. Cancer https:\/\/doi.org\/10.1038\/s43018-025-00991-6 (2025).","DOI":"10.1038\/s43018-025-00991-6"},{"key":"2250_CR24","doi-asserted-by":"crossref","unstructured":"Wang, L. et al. A survey on large language model based autonomous agents. Front. Comput. Sci. 18, 186345 (2024).","DOI":"10.1007\/s11704-024-40231-1"},{"key":"2250_CR25","doi-asserted-by":"crossref","unstructured":"Xiong, G. et al. Improving Retrieval-augmented generation in medicine with iterative follow-up questions. In Biocomputing 2025: Proceedings of the Pacific Symposium, Vol. 30, 199\u2013214 (2024).","DOI":"10.1142\/9789819807024_0015"},{"key":"2250_CR26","doi-asserted-by":"publisher","first-page":"103743","DOI":"10.1016\/j.inffus.2025.103743","volume":"127","author":"D Yang","year":"2026","unstructured":"Yang, D. et al. MedAide: information fusion and anatomy of medical intents via LLM-based agent collaboration. Inform. Fusion 127, 103743 (2026).","journal-title":"Inform. Fusion"},{"key":"2250_CR27","doi-asserted-by":"publisher","first-page":"AIdbp2500144","DOI":"10.1056\/AIdbp2500144","volume":"2","author":"Y Jiang","year":"2025","unstructured":"Jiang, Y. et al. 
MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. NEJM AI 2, AIdbp2500144 (2025).","journal-title":"NEJM AI"},{"key":"2250_CR28","doi-asserted-by":"publisher","unstructured":"Liu, J. et al. Medchain: Bridging the Gap Between LLM agents and clinical practice through interactive sequential benchmarking. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2412.01605 (2024).","DOI":"10.48550\/arXiv.2412.01605"},{"key":"2250_CR29","doi-asserted-by":"publisher","unstructured":"Mao, Y., Xu, W., Qin, Y. & Gao, Y. CT-Agent: a multimodal-LLM agent for 3D CT radiology question answering. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2505.16229 (2025).","DOI":"10.48550\/arXiv.2505.16229"},{"key":"2250_CR30","doi-asserted-by":"publisher","unstructured":"Zeng, F., Lyu, Z., Li, Q. & Li, X. Enhancing LLMs for Impression Generation in Radiology Reports through a Multi-Agent System. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2412.06828 (2024).","DOI":"10.48550\/arXiv.2412.06828"},{"key":"2250_CR31","doi-asserted-by":"publisher","unstructured":"Yi, Z., Xiao, T. & Albert, M. V. A Multimodal Multi-Agent Framework for Radiology Report Generation. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2505.09787 (2025).","DOI":"10.48550\/arXiv.2505.09787"},{"key":"2250_CR32","doi-asserted-by":"publisher","unstructured":"Ben-Atya, H. et al. Agent-based uncertainty awareness improves automated radiology report labeling with an open-source large language model. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2502.01691 (2025).","DOI":"10.48550\/arXiv.2502.01691"},{"key":"2250_CR33","doi-asserted-by":"publisher","unstructured":"Zhou, H.-Y. et al. MedVersa: a generalist foundation model for medical image interpretation. Preprint at https:\/\/doi.org\/10.48550\/ARXIV.2405.07988 (2024).","DOI":"10.48550\/ARXIV.2405.07988"},{"key":"2250_CR34","first-page":"68539","volume":"36","author":"T Schick","year":"2023","unstructured":"Schick, T. et al. 
Toolformer: language models can teach themselves to use tools. Adv. Neural Inform. Process. Syst. 36, 68539\u201368551 (2023).","journal-title":"Adv. Neural Inform. Process. Syst."},{"key":"2250_CR35","unstructured":"Yao, S. et al. React: synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR) (ICLR, Kigali, Rwanda, 2023)."},{"key":"2250_CR36","doi-asserted-by":"publisher","first-page":"2983","DOI":"10.1038\/s41591-023-02594-z","volume":"29","author":"D Truhn","year":"2023","unstructured":"Truhn, D., Reis-Filho, J. S. & Kather, J. N. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat. Med. 29, 2983\u20132984 (2023).","journal-title":"Nat. Med."},{"key":"2250_CR37","first-page":"24824","volume":"35","author":"J Wei","year":"2022","unstructured":"Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inform. Process. Syst. 35, 24824\u201324837 (2022).","journal-title":"Adv. Neural Inform. Process. Syst."},{"key":"2250_CR38","doi-asserted-by":"publisher","first-page":"73","DOI":"10.1016\/j.infoh.2025.03.001","volume":"2","author":"N Karunanayake","year":"2025","unstructured":"Karunanayake, N. Next-generation agentic AI for transforming healthcare. Inform. Health 2, 73\u201383 (2025).","journal-title":"Inform. Health"},{"key":"2250_CR39","unstructured":"Khattab, O. et al. Dspy: Compiling declarative language model calls into self-improving pipelines. Poster in Workshop: Workshop on robustness of zero\/few-shot learning in foundation models (R0-FoMo) https:\/\/neurips.cc\/virtual\/2023\/76693 (2023)."},{"key":"2250_CR40","doi-asserted-by":"publisher","DOI":"10.4274\/dir.2025.253470","author":"B Ko\u00e7ak","year":"2025","unstructured":"Ko\u00e7ak, B. & Me\u015fe, \u0130. AI agents in radiology: toward autonomous and adaptive intelligence. 
dir https:\/\/doi.org\/10.4274\/dir.2025.253470 (2025).","journal-title":"dir"},{"key":"2250_CR41","doi-asserted-by":"publisher","unstructured":"Bai, J. et al. Qwen technical report. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2309.16609 (2023).","DOI":"10.48550\/arXiv.2309.16609"},{"key":"2250_CR42","doi-asserted-by":"publisher","unstructured":"Sellergren, A. et al. MedGemma Technical Report. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2507.05201 (2025).","DOI":"10.48550\/arXiv.2507.05201"},{"key":"2250_CR43","doi-asserted-by":"publisher","unstructured":"Christophe, C., Kanithi, P. K., Raha, T., Khan, S. & Pimentel, M. A. Med42-v2: a suite of clinical LLMs. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2408.06142 (2024).","DOI":"10.48550\/arXiv.2408.06142"},{"key":"2250_CR44","doi-asserted-by":"publisher","first-page":"633","DOI":"10.1038\/s41586-025-09422-z","volume":"645","author":"D Guo","year":"2025","unstructured":"Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633\u2013638 (2025).","journal-title":"Nature"},{"key":"2250_CR45","unstructured":"Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at http:\/\/arxiv.org\/abs\/2302.13971 (2023)."},{"key":"2250_CR46","doi-asserted-by":"publisher","unstructured":"Grattafiori, A.et al. The Llama 3 herd of models. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2407.21783 (2024).","DOI":"10.48550\/arXiv.2407.21783"},{"key":"2250_CR47","doi-asserted-by":"publisher","unstructured":"Liu, A. et al. Deepseek-v3 technical report. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2412.19437 (2024).","DOI":"10.48550\/arXiv.2412.19437"},{"key":"2250_CR48","doi-asserted-by":"publisher","unstructured":"Yang, A. et al. Qwen3 technical report. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2505.09388 (2025).","DOI":"10.48550\/arXiv.2505.09388"},{"key":"2250_CR49","unstructured":"OpenAI. Introducing GPT-5. 
https:\/\/openai.com\/index\/introducing-gpt-5\/ (2025)."},{"key":"2250_CR50","doi-asserted-by":"publisher","unstructured":"Team, G. et al. Gemma: Open models based on gemini research and technology. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2403.08295 (2024).","DOI":"10.48550\/arXiv.2403.08295"},{"key":"2250_CR51","doi-asserted-by":"publisher","unstructured":"Team, G. et al. Gemma 3 technical report. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2503.19786 (2025).","DOI":"10.48550\/arXiv.2503.19786"},{"key":"2250_CR52","doi-asserted-by":"publisher","first-page":"79","DOI":"10.1162\/neco.1991.3.1.79","volume":"3","author":"RA Jacobs","year":"1991","unstructured":"Jacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. Adaptive mixtures of local experts. Neural Comput. 3, 79\u201387 (1991).","journal-title":"Neural Comput."},{"key":"2250_CR53","doi-asserted-by":"publisher","first-page":"543","DOI":"10.1038\/s44222-023-00097-7","volume":"1","author":"S Bakhshandeh","year":"2023","unstructured":"Bakhshandeh, S. Benchmarking medical large language models. Nat. Rev. Bioeng. 1, 543\u2013543 (2023).","journal-title":"Nat. Rev. Bioeng."},{"key":"2250_CR54","doi-asserted-by":"publisher","first-page":"1115","DOI":"10.1007\/s10439-023-03327-6","volume":"52","author":"C Wang","year":"2024","unstructured":"Wang, C. et al. Potential for GPT technology to optimize future clinical decision-making using retrieval-augmented generation. Ann. Biomed. Eng. 52, 1115\u20131118 (2024).","journal-title":"Ann. Biomed. Eng."},{"key":"2250_CR55","doi-asserted-by":"publisher","first-page":"102","DOI":"10.1038\/s41746-024-01091-y","volume":"7","author":"S Kresevic","year":"2024","unstructured":"Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. npj Digit. Med. 7, 102 (2024).","journal-title":"npj Digit. Med."},{"key":"2250_CR56","unstructured":"Hoffmann, J. et al. 
Training compute-optimal large language models. NIPS'22: Proc. 36th International Conference on Neural Information Processing Systems, 30016\u201330030 (Curran Associates Inc., 2022)."},{"key":"2250_CR57","doi-asserted-by":"publisher","unstructured":"Kaplan, J. et al. Scaling laws for neural language models. Preprint at https:\/\/doi.org\/10.48550\/arXiv.2001.08361 (2020).","DOI":"10.48550\/arXiv.2001.08361"},{"key":"2250_CR58","doi-asserted-by":"publisher","first-page":"100","DOI":"10.1038\/s41746-024-01081-0","volume":"7","author":"S Gilbert","year":"2024","unstructured":"Gilbert, S., Kather, J. N. & Hogan, A. Augmented non-hallucinating large language models as medical information curators. npj Digit. Med. 7, 100 (2024).","journal-title":"npj Digit. Med."},{"key":"2250_CR59","unstructured":"Ren, R. et al. Investigating the factual knowledge boundary of large language models with retrieval augmentation. In Proc. 31st International Conference on Computational Linguistics, 3697\u20133715 (Association for Computational Linguistics, 2025)."},{"key":"2250_CR60","doi-asserted-by":"publisher","first-page":"156","DOI":"10.1038\/s41746-022-00699-2","volume":"5","author":"H Chen","year":"2022","unstructured":"Chen, H., Gomez, C., Huang, C.-M. & Unberath, M. Explainable medical imaging AI needs human-centered design: guidelines and evidence from a systematic review. npj Digit. Med. 5, 156 (2022).","journal-title":"npj Digit. Med."},{"key":"2250_CR61","doi-asserted-by":"crossref","unstructured":"Chen, Y.-S., Jin, J., Kuo, P.-T., Huang, C.-W. & Chen, Y.-N. LLMs are Biased Evaluators But Not Biased for Fact-Centric Retrieval Augmented Generation. In Findings of the Association for Computational Linguistics: ACL 2025, (eds Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T.) 
26669\u201326684 (Association for Computational Linguistics, Vienna, Austria, 2025).","DOI":"10.18653\/v1\/2025.findings-acl.1369"},{"key":"2250_CR62","doi-asserted-by":"publisher","first-page":"636","DOI":"10.1007\/s42979-024-02963-6","volume":"5","author":"C Gr\u00e9visse","year":"2024","unstructured":"Gr\u00e9visse, C., Pavlou, M. A. S. & Schneider, J. G. Docimological quality analysis of LLM-Generated multiple choice questions in computer science and medicine. SN Comput. Sci. 5, 636 (2024).","journal-title":"SN Comput. Sci."},{"key":"2250_CR63","unstructured":"Tukey, J. W. Exploratory Data Analysis (Springer, 1977)."},{"key":"2250_CR64","doi-asserted-by":"publisher","first-page":"283","DOI":"10.1007\/s11222-012-9370-4","volume":"24","author":"F Konietschke","year":"2014","unstructured":"Konietschke, F. & Pauly, M. Bootstrapping and permuting paired t-test type statistics. Stat. Comput. 24, 283\u2013296 (2014).","journal-title":"Stat. Comput."},{"key":"2250_CR65","doi-asserted-by":"publisher","DOI":"10.1148\/radiol.220510","volume":"307","author":"F Khader","year":"2022","unstructured":"Khader, F. et al. Artificial intelligence for clinical interpretation of bedside chest radiographs. Radiology 307, e220510 (2022).","journal-title":"Radiology"},{"key":"2250_CR66","doi-asserted-by":"publisher","first-page":"153","DOI":"10.1007\/BF02295996","volume":"12","author":"Q McNemar","year":"1947","unstructured":"McNemar, Q. Note on the sampling error of the difference between correlated proportions or percentages. 
Psychometrika 12, 153\u2013157 (1947).","journal-title":"Psychometrika"}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02250-5","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02250-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02250-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,29]],"date-time":"2025-12-29T20:33:04Z","timestamp":1767040384000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-02250-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,22]]},"references-count":66,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["2250"],"URL":"https:\/\/doi.org\/10.1038\/s41746-025-02250-5","relation":{},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,22]]},"assertion":[{"value":"24 August 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 December 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 December 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"S.W. is partially employed by DATEV eG, Germany. D.T. received honoraria for lectures by Bayer, G.E., Roche, AstraZeneca, and Philips and holds shares in StratifAI GmbH, Germany, and in Synagen GmbH, Germany. M.L. 
is employed by Generali Deutschland Services GmbH, Germany, and is an editorial board member at European Radiology Experimental. K.B. and L.A. are trainee editorial board members at Radiology: Artificial Intelligence. A.M. is an associate editor at IEEE Transactions on Medical Imaging. S.T.A. is an editorial board member at Communications Medicine and at European Radiology Experimental, a trainee editorial board member at Radiology: Artificial Intelligence, and was partially employed by Synagen GmbH, Germany. The other authors do not have any competing interests to disclose.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"790"}}