{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,17]],"date-time":"2026-06-17T12:57:36Z","timestamp":1781701056316,"version":"3.54.5"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,12,12]],"date-time":"2024-12-12T00:00:00Z","timestamp":1733961600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,12,12]],"date-time":"2024-12-12T00:00:00Z","timestamp":1733961600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models\u2019 adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It includes an evaluation framework and methodology to assess models\u2019 capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions. It includes 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9\/50, followed closely by Llama-3 70B and WizardLM-2-8x22B. For comparison, a recent medical graduate scored 25.8\/50. The benchmark introduces novel content to avoid the issue of LLMs memorizing existing medical data. AMEGA\u2019s publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints.<\/jats:p>","DOI":"10.1038\/s41746-024-01356-6","type":"journal-article","created":{"date-parts":[[2024,12,12]],"date-time":"2024-12-12T13:18:56Z","timestamp":1734009536000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":29,"title":["Autonomous medical evaluation for guideline adherence of large language models"],"prefix":"10.1038","volume":"7","author":[{"given":"Dennis","family":"Fast","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5836-4542","authenticated-orcid":false,"given":"Lisa C.","family":"Adams","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9770-8555","authenticated-orcid":false,"given":"Felix","family":"Busch","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Conor","family":"Fallon","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Marc","family":"Huppertz","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Robert","family":"Siepmann","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Philipp","family":"Prucker","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8530-6783","authenticated-orcid":false,"given":"Nadine","family":"Bayerl","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9605-0728","authenticated-orcid":false,"given":"Daniel","family":"Truhn","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8778-647X","authenticated-orcid":false,"given":"Marcus","family":"Makowski","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Alexander","family":"L\u00f6ser","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Keno K.","family":"Bressem","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2024,12,12]]},"reference":[{"key":"1356_CR1","doi-asserted-by":"publisher","first-page":"145","DOI":"10.1053\/j.semnuclmed.2018.11.001","volume":"49","author":"MC Brouwers","year":"2019","unstructured":"Brouwers, M. C., Florez, I. D., McNair, S. A., Vella, E. T. & Yao, X. Clinical Practice Guidelines: Tools to Support High Quality Patient Care. Semin. Nucl. Med. 49, 145\u2013152 (2019).","journal-title":"Semin. Nucl. Med."},{"key":"1356_CR2","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1007\/s00432-024-05673-x","volume":"150","author":"A Lawson McLean","year":"2024","unstructured":"Lawson McLean, A., Wu, Y., Lawson McLean, A. C. & Hristidis, V. Large language models as decision aids in neuro-oncology: a review of shared decision-making applications. J. Cancer Res. Clin. Oncol. 150, 139 (2024).","journal-title":"J. Cancer Res. Clin. Oncol."},{"key":"1356_CR3","doi-asserted-by":"publisher","first-page":"e51346","DOI":"10.2196\/51346","volume":"8","author":"A Skryd","year":"2024","unstructured":"Skryd, A. & Lawrence, K. ChatGPT as a tool for medical education and clinical decision-making on the wards: case study. JMIR Form. Res. 8, e51346 (2024).","journal-title":"JMIR Form. Res."},{"key":"1356_CR4","unstructured":"Achiam, J. et al. Gpt-4 technical report. Preprint at https:\/\/arxiv.org\/abs\/2303.08774 (2023)."},{"key":"1356_CR5","unstructured":"Team, G. et al. Gemini: a family of highly capable multimodal models. Preprint at https:\/\/arxiv.org\/abs\/2312.11805 (2023)."},{"key":"1356_CR6","unstructured":"Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https:\/\/arxiv.org\/abs\/2302.13971 (2023)."},{"key":"1356_CR7","unstructured":"Jiang, A. Q. et al. Mistral 7B. Preprint at https:\/\/arxiv.org\/abs\/2310.06825 (2023)."},{"key":"1356_CR8","unstructured":"Jiang, A. Q. et al. Mixtral of experts. Preprint at https:\/\/arxiv.org\/abs\/2401.04088 (2024)."},{"key":"1356_CR9","doi-asserted-by":"publisher","first-page":"e22769","DOI":"10.2196\/22769","volume":"26","author":"L Wang","year":"2024","unstructured":"Wang, L. et al. Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J Med Internet Res. 26, e22769 (2024).","journal-title":"J Med Internet Res."},{"key":"1356_CR10","doi-asserted-by":"crossref","unstructured":"Pampari, A., Raghavan, P., Liang, J. & Peng, J. emrQA: A Large Corpus for Question Answering on Electronic Medical Records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2357\u20132368 (Association for Computational Linguistics, Brussels, Belgium, 2018).","DOI":"10.18653\/v1\/D18-1258"},{"key":"1356_CR11","unstructured":"Lee, M. et al. Beyond Information Retrieval\u2014Medical Question Answering. In AMIA Annual Symposium Proceedings, 469\u2013473 (2006)."},{"key":"1356_CR12","doi-asserted-by":"crossref","unstructured":"\u0160uster, S. & Daelemans, W. CliCR: a Dataset of Clinical Case Reports for Machine Reading Comprehension. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1551\u20131563 (Association for Computational Linguistics, New Orleans, Louisiana, 2018).","DOI":"10.18653\/v1\/N18-1140"},{"key":"1356_CR13","unstructured":"Liu, F. et al. Large language models in the clinic: a comprehensive benchmark. Preprint at https:\/\/arxiv.org\/abs\/2405.00716 (2024)."},{"key":"1356_CR14","doi-asserted-by":"crossref","unstructured":"Hu, Y. et al. OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM. In Proceedings IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 22170\u201322183 (2024).","DOI":"10.1109\/CVPR52733.2024.02093"},{"key":"1356_CR15","doi-asserted-by":"publisher","first-page":"108189","DOI":"10.1016\/j.compbiomed.2024.108189","volume":"171","author":"I Jahan","year":"2024","unstructured":"Jahan, I., Laskar, M. T. R., Peng, C. & Huang, J. X. A comprehensive evaluation of large language models on benchmark biomedical text processing tasks. Comput. Biol. Med. 171, 108189 (2024).","journal-title":"Comput. Biol. Med."},{"key":"1356_CR16","unstructured":"Panagoulias, D. P. et al. COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain. Preprint at https:\/\/arxiv.org\/abs\/2405.10893 (2024)."},{"key":"1356_CR17","doi-asserted-by":"publisher","first-page":"415","DOI":"10.1016\/S0140-6736(16)31592-6","volume":"390","author":"B Djulbegovic","year":"2017","unstructured":"Djulbegovic, B. & Guyatt, G. H. Progress in evidence-based medicine: a quarter century on. Lancet 390, 415\u2013423 (2017).","journal-title":"Lancet"},{"key":"1356_CR18","doi-asserted-by":"crossref","unstructured":"Sarlin, P.-E., DeTone, D., Malisiewicz, T. & Rabinovich, A. SuperGlue: Learning Feature Matching With Graph Neural Networks. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 4938\u20134947 (2020)","DOI":"10.1109\/CVPR42600.2020.00499"},{"key":"1356_CR19","unstructured":"Liang, P. et al. Holistic evaluation of language models. Transactions on Machine Learning Research. (2023)."},{"key":"1356_CR20","unstructured":"Hendrycks, D. et al. Measuring massive multitask language understanding. International Conference on Learning Representations (2021)."},{"key":"1356_CR21","doi-asserted-by":"crossref","unstructured":"Rajpurkar, P. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383\u20132392 (Association for Computational Linguistics, Austin, Texas, 2016).","DOI":"10.18653\/v1\/D16-1264"},{"key":"1356_CR22","doi-asserted-by":"publisher","first-page":"249","DOI":"10.1162\/tacl_a_00266","volume":"7","author":"S Reddy","year":"2019","unstructured":"Reddy, S., Chen, D. & Manning, C. D. Coqa: a conversational question answering challenge. Trans. Assoc. Comput. Linguist. 7, 249\u2013266 (2019).","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"1356_CR23","unstructured":"Dua, D. et al. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368\u20132378 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019)."},{"key":"1356_CR24","doi-asserted-by":"crossref","unstructured":"Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. & Choi, Y. Hellaswag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019).","DOI":"10.18653\/v1\/P19-1472"},{"key":"1356_CR25","unstructured":"Cobbe, K. et al. Training verifiers to solve math word problems. Preprint at https:\/\/arxiv.org\/abs\/2110.14168 (2021)."},{"key":"1356_CR26","doi-asserted-by":"publisher","first-page":"6421","DOI":"10.3390\/app11146421","volume":"11","author":"D Jin","year":"2021","unstructured":"Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).","journal-title":"Appl. Sci."},{"key":"1356_CR27","doi-asserted-by":"crossref","unstructured":"Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. Pubmedqa: a dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567\u20132577 (Association for Computational Linguistics, Hong Kong, China, 2019).","DOI":"10.18653\/v1\/D19-1259"},{"key":"1356_CR28","unstructured":"Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. Conference on Health, inference, and Learning. PMLR 174:248\u2013260 (2022)."},{"key":"1356_CR29","unstructured":"Nentidis, A. et al. Overview of BioASQ 2023: The Eleventh BioASQ Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering. In International Conference of the Cross-Language Evaluation Forum for European Languages. 337\u2013361 (Springer, 2023)."},{"key":"1356_CR30","unstructured":"Schmidgall, S. et al. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments. Preprint at https:\/\/arxiv.org\/abs\/2405.07960 (2024)."},{"key":"1356_CR31","doi-asserted-by":"crossref","unstructured":"Xiong, G. et al. Improving retrieval-augmented generation in medicine with iterative follow-up questions. Preprint at https:\/\/arxiv.org\/abs\/2408.00727 (2024).","DOI":"10.1142\/9789819807024_0015"},{"key":"1356_CR32","unstructured":"Christophe, C. et al. Med42--evaluating fine-tuning strategies for medical LLMs: full-parameter vs. parameter-efficient approaches. In AAAI 2024 Spring Symposium - Clinical Foundation Models (2024)."},{"key":"1356_CR33","unstructured":"Tu, T. et al. Towards conversational diagnostic AI. Preprint at https:\/\/arxiv.org\/abs\/2401.05654 (2024)."},{"key":"1356_CR34","doi-asserted-by":"publisher","first-page":"1236","DOI":"10.1093\/bib\/bbx044","volume":"19","author":"R Miotto","year":"2018","unstructured":"Miotto, R., Wang, F., Wang, S., Jiang, X. & Dudley, J. T. Deep learning for healthcare: review, opportunities and challenges. Brief. Bioinform. 19, 1236\u20131246 (2018).","journal-title":"Brief. Bioinform."},{"key":"1356_CR35","doi-asserted-by":"publisher","first-page":"e10010","DOI":"10.2196\/10010","volume":"7","author":"J Shen","year":"2019","unstructured":"Shen, J. et al. Artificial intelligence versus clinicians in disease diagnosis: systematic review. JMIR Med. Inform. 7, e10010 (2019).","journal-title":"JMIR Med. Inform."},{"key":"1356_CR36","doi-asserted-by":"publisher","DOI":"10.1038\/s41746-024-01097-6","volume":"7","author":"C Silcox","year":"2024","unstructured":"Silcox, C. et al. The potential for artificial intelligence to transform healthcare: perspectives from international health leaders. NPJ Digital Med. 7, 88 (2024).","journal-title":"NPJ Digital Med."},{"key":"1356_CR37","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1145\/3624730","volume":"67","author":"G Faggioli","year":"2024","unstructured":"Faggioli, G. et al. Who determines what is relevant? Humans or AI? Why not both? Commun. ACM 67, 31\u201334 (2024).","journal-title":"Commun. ACM"},{"key":"1356_CR38","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3650209","volume":"16","author":"G Demartini","year":"2024","unstructured":"Demartini, G., Sadiq, S. & Yang, J. Special Issue on Human in the Loop Data Curation. J Data Inf Qual. 16, 1\u20132 (2024).","journal-title":"J Data Inf Qual."},{"key":"1356_CR39","unstructured":"Srivastava, A. et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Transact Mach Learn Res. (2023)."}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01356-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01356-6","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01356-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,12]],"date-time":"2024-12-12T14:06:19Z","timestamp":1734012379000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-024-01356-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,12]]},"references-count":39,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["1356"],"URL":"https:\/\/doi.org\/10.1038\/s41746-024-01356-6","relation":{},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,12,12]]},"assertion":[{"value":"16 June 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 November 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 December 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"358"}}