{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T16:12:19Z","timestamp":1775059939085,"version":"3.50.1"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2025,6,14]],"date-time":"2025-06-14T00:00:00Z","timestamp":1749859200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,6,14]],"date-time":"2025-06-14T00:00:00Z","timestamp":1749859200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100005713","name":"Technische Universit\u00e4t M\u00fcnchen","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005713","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Healthc Inform Res"],"published-print":{"date-parts":[[2025,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Recent advancements in large language models (LLMs) offer potential benefits in healthcare, particularly in processing extensive patient records. However, existing benchmarks do not fully assess LLMs\u2019 capability in handling real-world, lengthy clinical data. We present the <jats:italic>LongHealth<\/jats:italic> benchmark, comprising 20 detailed fictional patient cases across various diseases, with each case containing 5090 to 6754 words. The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting, challenging LLMs to extract and interpret information from large clinical documents. We evaluated eleven open-source LLMs with a minimum of 16,000 tokens and also included OpenAI\u2019s proprietary and cost-efficient Generative Pre-trained Transformers-3.5 Turbo for comparison. 
The highest accuracy was observed for Mistral-Small-24B-Instruct-2501 and Llama-4-Scout-17B-16E-Instruct, particularly in tasks focused on information retrieval from single and multiple patient documents. However, all models struggled significantly in tasks requiring the identification of missing information, highlighting a critical area for improvement in clinical data interpretation. In conclusion, while LLMs show considerable potential for processing long clinical documents, their current accuracy levels are insufficient for reliable clinical use, especially in scenarios requiring the identification of missing information. The <jats:italic>LongHealth<\/jats:italic> benchmark provides a more realistic assessment of LLMs in a healthcare setting and highlights the need for further model refinement for safe and effective clinical application. We make the benchmark and evaluation code publicly available.<\/jats:p>","DOI":"10.1007\/s41666-025-00204-w","type":"journal-article","created":{"date-parts":[[2025,6,14]],"date-time":"2025-06-14T11:53:40Z","timestamp":1749902020000},"page":"280-296","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["LongHealth: A Question Answering Benchmark with Long Clinical Documents"],"prefix":"10.1007","volume":"9","author":[{"given":"Lisa","family":"Adams","sequence":"first","affiliation":[]},{"given":"Felix","family":"Busch","sequence":"additional","affiliation":[]},{"given":"Tianyu","family":"Han","sequence":"additional","affiliation":[]},{"given":"Jean-Baptiste","family":"Excoffier","sequence":"additional","affiliation":[]},{"given":"Matthieu","family":"Ortala","sequence":"additional","affiliation":[]},{"given":"Alexander","family":"L\u00f6ser","sequence":"additional","affiliation":[]},{"given":"Hugo J. W. 
L.","family":"Aerts","sequence":"additional","affiliation":[]},{"given":"Jakob Nikolas","family":"Kather","sequence":"additional","affiliation":[]},{"given":"Daniel","family":"Truhn","sequence":"additional","affiliation":[]},{"given":"Keno","family":"Bressem","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,6,14]]},"reference":[{"issue":"1","key":"204_CR1","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s43856-023-00370-1","volume":"3","author":"J Clusmann","year":"2023","unstructured":"Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Ghaffari Laleh N et al (2023) The future landscape of large language models in medicine. Commun Med 3(1):1\u20138","journal-title":"Commun Med"},{"issue":"2","key":"204_CR2","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1097\/MLR.0b013e3181c12e6a","volume":"48","author":"TR Konrad","year":"2010","unstructured":"Konrad TR, Link CL, Shackelton RJ, Marceau LD, von dem Knesebeck O, Siegrist J et al (2010) It\u2019s about time: physicians\u2019 perceptions of time constraints in primary care medical practice in three national healthcare systems. Med Care 48(2):95","journal-title":"Med Care"},{"issue":"7","key":"204_CR3","doi-asserted-by":"publisher","first-page":"e2115334","DOI":"10.1001\/jamanetworkopen.2021.15334","volume":"4","author":"A Rule","year":"2021","unstructured":"Rule A, Bedrick S, Chiang MF, Hribar MR (2021) Length and redundancy of outpatient progress notes across a decade at an academic medical center. JAMA Netw Open 4(7):e2115334\u2013e2115334","journal-title":"JAMA Netw Open"},{"key":"204_CR4","doi-asserted-by":"publisher","first-page":"103301","DOI":"10.1016\/j.jbi.2019.103301","volume":"100","author":"S Datta","year":"2019","unstructured":"Datta S, Bernstam E, Roberts K (2019) A frame semantic overview of NLP-based information extraction for cancer-related EHR notes. 
J Biomed Inform 100:103301","journal-title":"J Biomed Inform"},{"key":"204_CR5","doi-asserted-by":"publisher","unstructured":"Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D (2022) Large language models are few-shot clinical information extractors. In: Proceedings of the 2022 conference on empirical methods in natural language processing. Association for Computational Linguistics, Abu Dhabi, pp 1998\u20132022. https:\/\/doi.org\/10.18653\/v1\/2022.emnlp-main.130","DOI":"10.18653\/v1\/2022.emnlp-main.130"},{"issue":"14","key":"204_CR6","doi-asserted-by":"publisher","first-page":"6421","DOI":"10.3390\/app11146421","volume":"11","author":"D Jin","year":"2021","unstructured":"Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P (2021) What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl Sci 11(14):6421","journal-title":"Appl Sci"},{"key":"204_CR7","doi-asserted-by":"publisher","unstructured":"Jin Q, Dhingra B, Liu Z, Cohen W, Lu X (2019) PubMedQA: a dataset for biomedical research question answering. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th International joint conference on natural language processing, (EMNLP-IJCNLP). Association for Computational Linguistics,\u00a0Hong Kong, pp 2567\u20132577. https:\/\/doi.org\/10.18653\/v1\/D19-1259","DOI":"10.18653\/v1\/D19-1259"},{"key":"204_CR8","unstructured":"Pal A, Umapathi LK, Sankarasubbu M (2022) MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In: Proceedings of the conference on health, inference, and learning. Proceedings of machine learning research,\u00a0vol 174. 
pp 248\u2013260"},{"key":"204_CR9","unstructured":"Hendrycks D, Burns C, Basart S, Critch A, Li J, Song D,\u00a0Steinhardt J (2021) Aligning AI with shared human values.\u00a0arXiv preprint https:\/\/arxiv.org\/abs\/2008.02275"},{"key":"204_CR10","unstructured":"Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D,\u00a0Steinhardt J (2021) Measuring massive multitask language understanding.\u00a0arXiv preprint https:\/\/arxiv.org\/abs\/2009.03300"},{"key":"204_CR11","unstructured":"Rein D, Hou BL, Stickland AC, Petty J, Pang RY, Dirani J, et al (2023) GPQA: a graduate-level google-proof Q&A benchmark. arXiv preprint https:\/\/arxiv.org\/abs\/2311.12022"},{"key":"204_CR12","unstructured":"Tay Y, Dehghani M, Abnar S, Shen Y, Bahri D, Pham P, et al (2020) Long range arena: a benchmark for efficient transformers. arXiv preprint https:\/\/arxiv.org\/abs\/2011.04006"},{"issue":"1\u20132","key":"204_CR13","first-page":"32","volume":"60","author":"JH Holmes","year":"2021","unstructured":"Holmes JH, Beinlich J, Boland MR, Bowles KH, Chen Y, Cook TS et al (2021) Why is the electronic health record so challenging for research and clinical care? Methods Inf Med 60(1\u20132):32\u201348","journal-title":"Methods Inf Med"},{"key":"204_CR14","doi-asserted-by":"publisher","unstructured":"van Aken B, Trajanovska I, Siu A, Mayrdorfer M, Budde K, Loeser A (2021) Assertion detection in clinical notes: medical language models to the rescue? In: Proceedings of the second workshop on natural language processing for medical conversations,\u00a0Online. Association for Computational Linguistics, pp 35\u201340. https:\/\/doi.org\/10.18653\/v1\/2021.nlpmc-1.5","DOI":"10.18653\/v1\/2021.nlpmc-1.5"},{"key":"204_CR15","unstructured":"Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training [Internet]. Available from: https:\/\/cdn.openai.com\/research-covers\/language-unsupervised\/language_understanding_paper.pdf. 
Accessed 14 Mar 2025"},{"key":"204_CR16","unstructured":"Srivastava A, Rastogi A, Rao A, Shoeb AAM, Abid A, Fisch A, et al (2022) Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv preprint https:\/\/arxiv.org\/abs\/2206.04615"},{"key":"204_CR17","unstructured":"Mistral AI (2025) Mistral Small 3.1 [Internet]. https:\/\/mistral.ai\/news\/mistral-small-3-1. Accessed 6 May 2025"},{"key":"204_CR18","unstructured":"Meta (2025) The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation [Internet]. https:\/\/ai.meta.com\/blog\/llama-4-multimodal-intelligence\/. Accessed 6 May 2025"},{"key":"204_CR19","unstructured":"Jiang AQ, Sablayrolles A, Roux A, Mensch A, Savary B, Bamford C, et al (2024) Mixtral of experts. arXiv preprint https:\/\/arxiv.org\/abs\/2401.04088"},{"key":"204_CR20","unstructured":"OpenAI [Internet]. https:\/\/platform.openai.com\/docs\/models\/gpt-3-5. Accessed 26 Dec 2023"},{"key":"204_CR21","unstructured":"01-ai (2023) Yi GitHub repository [Internet]. https:\/\/github.com\/01-ai\/Yi. Accessed 26 Dec 2023"},{"key":"204_CR22","unstructured":"Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, de las Casas D, et al (2023) Mistral 7B. arXiv preprint https:\/\/arxiv.org\/abs\/2310.06825"},{"key":"204_CR23","unstructured":"Zheng L, Chiang WL, Sheng Y, Zhuang S, Wu Z, Zhuang Y, et al (2023) Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint https:\/\/arxiv.org\/abs\/2306.05685"},{"key":"204_CR24","unstructured":"Tunstall L, Beeching E, Lambert N, Rajani N, Rasul K, Belkada Y, et al (2023) Zephyr: direct distillation of LM alignment. arXiv preprint https:\/\/arxiv.org\/abs\/2310.16944"},{"key":"204_CR25","unstructured":"Li D, Shao R, Xi A, Sheng Y, Zheng L, Gonzalez JE, et al (2023) How long can open-source LLMs truly promise on context length? 
In: Instruction Workshop @ NeurIPS"},{"key":"204_CR26","unstructured":"Basmov V, Goldberg Y, Tsarfaty R (2024) LLMs\u2019 reading comprehension is affected by parametric knowledge and struggles with hypothetical statements. arXiv preprint https:\/\/arxiv.org\/abs\/2404.06283"},{"key":"204_CR27","doi-asserted-by":"crossref","unstructured":"Chuang YN, Tang R, Jiang X, Hu X (2023) SPEC: a soft prompt-based calibration on performance variability of large language model in clinical notes summarization. arXiv preprint https:\/\/arxiv.org\/abs\/2303.13035","DOI":"10.1016\/j.jbi.2024.104606"},{"key":"204_CR28","unstructured":"Johnson A, Lungren M, Peng Y, Lu Z, Mark R, Berkowitz S, et al (2019) MIMIC-CXR-JPG-chest radiographs with structured labels (version 2.1.0). PhysioNet"},{"issue":"2","key":"204_CR29","doi-asserted-by":"publisher","first-page":"299","DOI":"10.1136\/amiajnl-2012-001506","volume":"21","author":"S Moon","year":"2014","unstructured":"Moon S, Pakhomov S, Liu N, Ryan JO, Melton GB (2014) A sense inventory for clinical abbreviations and acronyms created using clinical notes and medical dictionary resources. J Am Med Inform Assoc 21(2):299\u2013307","journal-title":"J Am Med Inform Assoc"},{"issue":"1","key":"204_CR30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/sdata.2016.35","volume":"3","author":"AEW Johnson","year":"2016","unstructured":"Johnson AEW, Pollard TJ, Shen L, Lehman L-H, Feng M, Ghassemi M et al (2016) MIMIC-III, a freely accessible critical care database. Sci Data 3(1):1\u20139","journal-title":"Sci Data"},{"key":"204_CR31","unstructured":"Goel A, Gueta A, Gilon O, Liu C, Erell S, Nguyen LH, et al (2023) LLMs accelerate annotation for medical information extraction. In: Hegselmann S, Parziale A, Shanmugam D, Tang S, Asiedu MN, Chang S, et al., editors. Proceedings of the 3rd Machine learning for health symposium, volume 225 of Proceedings of Machine Learning Research. 
PMLR, p 82\u2013100."},{"key":"204_CR32","doi-asserted-by":"crossref","unstructured":"Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, et al (2019) Publicly available clinical BERT embeddings. arXiv preprint https:\/\/arxiv.org\/abs\/1904.03323","DOI":"10.18653\/v1\/W19-1909"},{"key":"204_CR33","doi-asserted-by":"publisher","first-page":"438","DOI":"10.1007\/s41666-023-00155-0","volume":"8","author":"C Shyr","year":"2024","unstructured":"Shyr C, Hu Y, Harris PA, Xu H (2024) Identifying and extracting rare disease phenotypes with large language models. J Healthc Inform Res 8:438\u2013461","journal-title":"J Healthc Inform Res"},{"key":"204_CR34","unstructured":"Gu Y, Zhang S, Usuyama N, Woldesenbet Y, Wong C, Sanapathi P, et al (2023) Distilling large language models for biomedical knowledge extraction: a case study on adverse drug events. arXiv preprint https:\/\/arxiv.org\/abs\/2307.06439"},{"key":"204_CR35","doi-asserted-by":"publisher","unstructured":"Ma Y, Cao Y, Hong Y, Sun A (2023) Large language model is not a good few-shot information extractor, but a good reranker for hard samples! In: Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, pp 10572\u201310601. https:\/\/doi.org\/10.18653\/v1\/2023.findings-emnlp.710","DOI":"10.18653\/v1\/2023.findings-emnlp.710"},{"issue":"1","key":"204_CR36","doi-asserted-by":"publisher","first-page":"26","DOI":"10.1038\/s43856-024-00717-2","volume":"5","author":"F Busch","year":"2025","unstructured":"Busch F, Hoffmann L, Rueger C, van Dijk EH, Kader R, Ortiz-Prado E et al (2025) Current applications and challenges in large language models for patient care: a systematic review. Commun Med 5(1):26","journal-title":"Commun Med"},{"key":"204_CR37","unstructured":"Anthropic (2023) Long context prompting for Claude 2.1 [Internet]. Available from: https:\/\/www.anthropic.com\/news\/claude-2-1-prompting. 
Accessed 26 Dec 2023"},{"key":"204_CR38","first-page":"24824","volume":"35","author":"J Wei","year":"2022","unstructured":"Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E et al (2022) Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst 35:24824\u201324837","journal-title":"Adv Neural Inf Process Syst"},{"key":"204_CR39","unstructured":"Dorfner FJ, Dada A, Busch F, Makowski MR, Han T, Truhn D, et al (2024) Biomedical large languages models seem not to be superior to generalist models on unseen medical data. arXiv preprint https:\/\/arxiv.org\/abs\/2408.13833"}],"container-title":["Journal of Healthcare Informatics Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41666-025-00204-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s41666-025-00204-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41666-025-00204-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,6]],"date-time":"2025-09-06T19:33:41Z","timestamp":1757187221000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s41666-025-00204-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,14]]},"references-count":39,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,9]]}},"alternative-id":["204"],"URL":"https:\/\/doi.org\/10.1007\/s41666-025-00204-w","relation":{},"ISSN":["2509-4971","2509-498X"],"issn-type":[{"value":"2509-4971","type":"print"},{"value":"2509-498X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,14]]},"assertion":[{"value":"25 January 
2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 May 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 June 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 June 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"JNK declares consulting services for Owkin, France; DoMore Diagnostics, Norway, Panakeia, UK and Histofy, UK; furthermore he holds shares in StratifAI GmbH and has received honoraria for lectures by AstraZeneca, Bayer, Eisai, MSD, BMS, Roche, Pfizer and Fresenius. KKB reports grants from the European Union (101079894) and Wilhelm-Sander Foundation; participation on a Data Safety Monitoring Board or Advisory Board for the EU Horizon 2020 LifeChamps project (875329) and the EU IHI Project IMAGIO (101112053); speaker Fees for Canon Medical Systems Corporation and GE HealthCare. The other authors declare no financial or non-financial competing interests for this work.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}