{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,31]],"date-time":"2025-10-31T11:09:49Z","timestamp":1761908989873,"version":"build-2065373602"},"reference-count":44,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T00:00:00Z","timestamp":1761523200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T00:00:00Z","timestamp":1761523200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>The use of large language models (LLMs) in clinical diagnostics and intervention planning is expanding, yet their utility for personalized recommendations for longevity interventions remains opaque. We extended the BioChatter framework to benchmark LLMs\u2019 ability to generate personalized longevity intervention recommendations based on biomarker profiles while adhering to key medical validation requirements. Using 25 individual profiles across three different age groups, we generated 1000 diverse test cases covering interventions such as caloric restriction, fasting and supplements. Evaluating 56000 model responses via an LLM-as-a-Judge system with clinician validated ground truths, we found that proprietary models outperformed open-source models especially in comprehensiveness. However, even with Retrieval-Augmented Generation (RAG), all models exhibited limitations in addressing key medical validation requirements, prompt stability, and handling age-related biases. Our findings highlight limited suitability of LLMs for unsupervised longevity intervention recommendations. Our open-source framework offers a foundation for advancing AI benchmarking in various medical contexts.<\/jats:p>","DOI":"10.1038\/s41746-025-01996-2","type":"journal-article","created":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T12:46:21Z","timestamp":1761569181000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Benchmarking large language models for personalized, biomarker-based health intervention recommendations"],"prefix":"10.1038","volume":"8","author":[{"given":"Hans","family":"Jarchow","sequence":"first","affiliation":[]},{"given":"Christoph","family":"Bobrowski","sequence":"additional","affiliation":[]},{"given":"Steffi","family":"Falk","sequence":"additional","affiliation":[]},{"given":"Andreas","family":"Hermann","sequence":"additional","affiliation":[]},{"given":"Anton","family":"Kulaga","sequence":"additional","affiliation":[]},{"given":"Johann-Christian","family":"P\u00f5der","sequence":"additional","affiliation":[]},{"given":"Maximilian","family":"Unfried","sequence":"additional","affiliation":[]},{"given":"Nikolay","family":"Usanov","sequence":"additional","affiliation":[]},{"given":"Bijan","family":"Zendeh","sequence":"additional","affiliation":[]},{"given":"Brian K.","family":"Kennedy","sequence":"additional","affiliation":[]},{"given":"Sebastian","family":"Lobentanzer","sequence":"additional","affiliation":[]},{"given":"Georg","family":"Fuellen","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"key":"1996_CR1","doi-asserted-by":"publisher","DOI":"10.1186\/s12909-023-04698-z","volume":"23","author":"SA Alowais","year":"2023","unstructured":"Alowais, S. A. et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med. Educ. 23, 689 (2023).","journal-title":"BMC Med. Educ."},{"key":"1996_CR2","doi-asserted-by":"publisher","DOI":"10.1186\/s12911-021-01488-9","volume":"21","author":"S Secinaro","year":"2021","unstructured":"Secinaro, S., Calandra, D., Secinaro, A., Muthurangu, V. & Biancone, P. The role of artificial intelligence in healthcare: a structured literature review. BMC Med. Inform. Decis. Mak. 21, 125 (2021).","journal-title":"BMC Med. Inform. Decis. Mak."},{"key":"1996_CR3","doi-asserted-by":"publisher","DOI":"10.1016\/j.isci.2024.109713","volume":"27","author":"X Meng","year":"2024","unstructured":"Meng, X. et al. The application of large language models in medicine: A scoping review. iScience 27, 109713 (2024).","journal-title":"iScience"},{"key":"1996_CR4","doi-asserted-by":"publisher","first-page":"88","DOI":"10.1038\/s41746-024-01097-6","volume":"7","author":"C Silcox","year":"2024","unstructured":"Silcox, C. et al. The potential for artificial intelligence to transform healthcare: perspectives from international health leaders. NPJ Digit. Med. 7, 88 (2024).","journal-title":"NPJ Digit. Med."},{"key":"1996_CR5","doi-asserted-by":"publisher","first-page":"2043","DOI":"10.1016\/j.cell.2025.03.011","volume":"188","author":"G Kroemer","year":"2025","unstructured":"Kroemer, G. et al. From geroscience to precision geromedicine: Understanding and managing aging. Cell 188, 2043\u20132062 (2025).","journal-title":"Cell"},{"key":"1996_CR6","doi-asserted-by":"publisher","first-page":"6269","DOI":"10.1007\/s11357-024-01229-6","volume":"46","author":"N Parchmann","year":"2024","unstructured":"Parchmann, N., Hansen, D., Orzechowski, M. & Steger, F. An ethical assessment of professional opinions on concerns, chances, and limitations of the implementation of an artificial intelligence-based technology into the geriatric patient treatment and continuity of care. Geroscience 46, 6269\u20136282 (2024).","journal-title":"Geroscience"},{"key":"1996_CR7","doi-asserted-by":"publisher","first-page":"267","DOI":"10.1016\/j.jagp.2024.01.011","volume":"32","author":"IV Vahia","year":"2024","unstructured":"Vahia, I. V. Navigating New Realities in Aging Care as Artificial Intelligence Enters Clinical Practice. Am. J. Geriatr. Psychiatry 32, 267\u2013269 (2024).","journal-title":"Am. J. Geriatr. Psychiatry"},{"key":"1996_CR8","doi-asserted-by":"publisher","first-page":"3651","DOI":"10.1111\/jgs.18569","volume":"71","author":"RG Stefanacci","year":"2023","unstructured":"Stefanacci, R. G. Artificial intelligence in geriatric medicine: Potential and pitfalls. J. Am. Geriatr. Soc. 71, 3651\u20133652 (2023).","journal-title":"J. Am. Geriatr. Soc."},{"key":"1996_CR9","doi-asserted-by":"publisher","first-page":"e635","DOI":"10.1016\/S2589-7500(23)00155-3","volume":"5","author":"UK Wiil","year":"2023","unstructured":"Wiil, U. K. Important steps for artificial intelligence-based risk assessment of older adults. Lancet Digit. Health 5, e635\u2013e636 (2023).","journal-title":"Lancet Digit. Health"},{"key":"1996_CR10","doi-asserted-by":"publisher","DOI":"10.1016\/j.arr.2022.101808","volume":"83","author":"B Ma","year":"2023","unstructured":"Ma, B. et al. Artificial intelligence in elderly healthcare: A scoping review. Ageing Res Rev. 83, 101808 (2023).","journal-title":"Ageing Res Rev."},{"key":"1996_CR11","doi-asserted-by":"publisher","unstructured":"Jin, D. et al. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Appl. Sci. 11, (2021). https:\/\/doi.org\/10.3390\/app11146421.","DOI":"10.3390\/app11146421"},{"key":"1996_CR12","unstructured":"Pal, A., Umapathi, L. K. & Sankarasubbu, M. in Proceedings of the Conference on Health, Inference, and Learning 174, 248-260 (PMLR, 2022)."},{"key":"1996_CR13","doi-asserted-by":"publisher","first-page":"172","DOI":"10.1038\/s41586-023-06291-2","volume":"620","author":"K Singhal","year":"2023","unstructured":"Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172\u2013180 (2023).","journal-title":"Nature"},{"key":"1996_CR14","unstructured":"Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 2567\u20132577 (2019)."},{"key":"1996_CR15","unstructured":"\u0160uster, S. & Daelemans, W. in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1551-1563 (2018)."},{"key":"1996_CR16","unstructured":"Wang, L. L., deYoung, J. & Wallace, B. in Proceedings of the Third Workshop on Scholarly Document Processing 175-180 (2022)."},{"key":"1996_CR17","doi-asserted-by":"publisher","unstructured":"Li, J. et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxford) 2016 (2016). https:\/\/doi.org\/10.1093\/database\/baw068.","DOI":"10.1093\/database\/baw068"},{"key":"1996_CR18","doi-asserted-by":"publisher","unstructured":"Krallinger, M. et al. CHEMDNER: The drugs and chemical names extraction challenge. J. Cheminform. 7, S1 (2015). https:\/\/doi.org\/10.1186\/1758-2946-7-S1-S1.","DOI":"10.1186\/1758-2946-7-S1-S1"},{"key":"1996_CR19","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-020-00620-0","volume":"7","author":"F Kury","year":"2020","unstructured":"Kury, F. et al. Chia, a large annotated corpus of clinical trial eligibility criteria. Sci. Data 7, 281 (2020).","journal-title":"Sci. Data"},{"key":"1996_CR20","doi-asserted-by":"publisher","first-page":"295","DOI":"10.1038\/s41746-024-01283-6","volume":"7","author":"S Schmidgall","year":"2024","unstructured":"Schmidgall, S. et al. Evaluation and mitigation of cognitive biases in medical language models. NPJ Digit. Med. 7, 295 (2024).","journal-title":"NPJ Digit. Med."},{"key":"1996_CR21","doi-asserted-by":"publisher","first-page":"58","DOI":"10.1038\/s41746-024-01390-4","volume":"8","author":"C Wu","year":"2025","unstructured":"Wu, C. et al. Towards evaluating and building versatile large language models for medicine. NPJ Digit. Med. 8, 58 (2025).","journal-title":"NPJ Digit. Med."},{"key":"1996_CR22","unstructured":"Kanithi, P. K. et al. MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications. Preprint at: https:\/\/arxiv.org\/abs\/2409.07314 (2024)."},{"key":"1996_CR23","doi-asserted-by":"publisher","first-page":"358","DOI":"10.1038\/s41746-024-01356-6","volume":"7","author":"D Fast","year":"2024","unstructured":"Fast, D. et al. Autonomous medical evaluation for guideline adherence of large language models. NPJ Digit. Med. 7, 358 (2024).","journal-title":"NPJ Digit. Med."},{"key":"1996_CR24","unstructured":"Li, D. et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge, 2025. Preprint at: https:\/\/arxiv.org\/abs\/2411.16594 (2025)."},{"key":"1996_CR25","doi-asserted-by":"publisher","DOI":"10.1016\/j.arr.2024.102617","volume":"104","author":"G Fuellen","year":"2025","unstructured":"Fuellen, G. et al. Validation requirements for AI-based intervention-evaluation in aging and longevity research and practice. Ageing Res. Rev. 104, 102617 (2025).","journal-title":"Ageing Res. Rev."},{"key":"1996_CR26","doi-asserted-by":"publisher","unstructured":"Zakka, C. et al. Almanac - Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI 1 (2024). https:\/\/doi.org\/10.1056\/aioa2300068.","DOI":"10.1056\/aioa2300068"},{"key":"1996_CR27","doi-asserted-by":"publisher","first-page":"166","DOI":"10.1038\/s41587-024-02534-3","volume":"43","author":"S Lobentanzer","year":"2025","unstructured":"Lobentanzer, S. et al. A platform for the biomedical application of large language models. Nat. Biotechnol. 43, 166\u2013169 (2025).","journal-title":"Nat. Biotechnol."},{"key":"1996_CR28","doi-asserted-by":"publisher","first-page":"26","DOI":"10.1038\/s43856-024-00717-2","volume":"5","author":"F Busch","year":"2025","unstructured":"Busch, F. et al. Current applications and challenges in large language models for patient care: a systematic review. Commun. Med. (Lond.) 5, 26 (2025).","journal-title":"Commun. Med. (Lond.)"},{"key":"1996_CR29","unstructured":"Beauchamp, T. L. & Childress, J. F. Principles of Biomedical Ethics. (Oxford University Press, 2012)."},{"key":"1996_CR30","doi-asserted-by":"publisher","first-page":"131","DOI":"10.1057\/s41599-023-01619-9","volume":"10","author":"C Pang","year":"2023","unstructured":"Pang, C. Is a partially informed choice less autonomous?: a probabilistic account for autonomous choice and information. Humanit. Soc. Sci. Commun. 10, 131 (2023).","journal-title":"Humanit. Soc. Sci. Commun."},{"key":"1996_CR31","doi-asserted-by":"publisher","first-page":"2613","DOI":"10.1038\/s41591-024-03097-1","volume":"30","author":"P Hager","year":"2024","unstructured":"Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613\u20132622 (2024).","journal-title":"Nat. Med."},{"key":"1996_CR32","unstructured":"Mirzadeh, I. et al. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. Preprint at: https:\/\/arxiv.org\/abs\/2410.05229 (2024)."},{"key":"1996_CR33","doi-asserted-by":"publisher","first-page":"947","DOI":"10.1093\/geront\/gnab167","volume":"62","author":"CH Chu","year":"2022","unstructured":"Chu, C. H. et al. Digital Ageism: Challenges and Opportunities in Artificial Intelligence for Older Adults. Gerontologist 62, 947\u2013955 (2022).","journal-title":"Gerontologist"},{"key":"1996_CR34","doi-asserted-by":"publisher","unstructured":"Ng, K. K. Y., Matsuba, I. & Zhang, P. C. RAG in Health Care: A Novel Framework for Improving Communication and Decision-Making by Addressing LLM Limitations. NEJM AI 2 (2024). https:\/\/doi.org\/10.1056\/AIra2400380.","DOI":"10.1056\/AIra2400380"},{"key":"1996_CR35","doi-asserted-by":"publisher","first-page":"102","DOI":"10.1038\/s41746-024-01091-y","volume":"7","author":"S Kresevic","year":"2024","unstructured":"Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. NPJ Digit. Med. 7, 102 (2024).","journal-title":"NPJ Digit. Med."},{"key":"1996_CR36","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-019-0103-9","volume":"6","author":"H Harutyunyan","year":"2019","unstructured":"Harutyunyan, H., Khachatrian, H., Kale, D. C., Ver Steeg, G. & Galstyan, A. Multitask learning and benchmarking with clinical time series data. Sci. Data. 6, 96 (2019).","journal-title":"Sci. Data."},{"key":"1996_CR37","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-022-01782-9","volume":"9","author":"F Xie","year":"2022","unstructured":"Xie, F. et al. Benchmarking emergency department prediction models with machine learning and public electronic health records. Sci. Data. 9, 658 (2022).","journal-title":"Sci. Data."},{"key":"1996_CR38","unstructured":"Nguyen, T.-T. et al. Mimic-IV-ICD: A new benchmark for eXtreme MultiLabel Classification. Preprint at: https:\/\/arxiv.org\/abs\/2304.13998 (2023)."},{"key":"1996_CR39","unstructured":"Grattafiori, A. et al. The Llama 3 Herd of Models. Preprint at: https:\/\/arxiv.org\/abs\/2407.21783 (2024)."},{"key":"1996_CR40","unstructured":"Christophe, C. et al. Med42-v2: A suite of clinical llms. Preprint at: https:\/\/arxiv.org\/abs\/2408.06142 (2024)."},{"key":"1996_CR41","doi-asserted-by":"publisher","first-page":"1056","DOI":"10.1038\/s41587-023-01848-y","volume":"41","author":"S Lobentanzer","year":"2023","unstructured":"Lobentanzer, S. et al. Democratizing knowledge representation with BioCypher. Nat. Biotechnol. 41, 1056\u20131059 (2023).","journal-title":"Nat. Biotechnol."},{"key":"1996_CR42","doi-asserted-by":"publisher","first-page":"1026","DOI":"10.21105\/joss.01026","volume":"3","author":"R Vallat","year":"2018","unstructured":"Vallat, R. Pingouin: statistics in Python. J. Open Source Softw. 3, 1026 (2018).","journal-title":"J. Open Source Softw."},{"key":"1996_CR43","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825\u20132830 (2011).","journal-title":"J. Mach. Learn. Res."},{"key":"1996_CR44","doi-asserted-by":"publisher","first-page":"261","DOI":"10.1038\/s41592-019-0686-2","volume":"17","author":"P Virtanen","year":"2020","unstructured":"Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261\u2013272 (2020).","journal-title":"Nat. Methods"}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01996-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01996-2","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01996-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,31]],"date-time":"2025-10-31T11:05:38Z","timestamp":1761908738000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-025-01996-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":44,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1996"],"URL":"https:\/\/doi.org\/10.1038\/s41746-025-01996-2","relation":{},"ISSN":["2398-6352"],"issn-type":[{"type":"electronic","value":"2398-6352"}],"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"14 May 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 September 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 October 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"B.K.K. reports a relationship with Ponce de Leon Health that includes: consulting or advisory and equity or stocks. C.B. has received lecturing fees from Novartis Deutschland GmbH and Bayer Vital GmbH. C.B. serves on the expert board for statutory health insurance data of IQTIG, the Institute for Quality and Transparency in German Healthcare (Institut f\u00fcr Qualit\u00e4tssicherung und Transparenz im Gesundheitswesen). G.F. is a consultant to BlueZoneTech GmbH, who distribute supplements.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}},{"value":"The first draft was written by H.J., with help from G.F. and S.L.; No writing assistance was employed. While the topic of the paper is the use of generative AI\/LLMs, no such tools were used to generate text or content of the manuscript. GPT4o was used for copy-editing (grammar, spelling) assistance and research queries on related work and references.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Statement on the use of AI"}}],"article-number":"631"}}