{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T06:16:01Z","timestamp":1778048161053,"version":"3.51.4"},"reference-count":24,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2023,9,28]],"date-time":"2023-09-28T00:00:00Z","timestamp":1695859200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Digit. Health"],"abstract":"<jats:sec><jats:title>Background<\/jats:title><jats:p>Natural language processing (NLP) has the potential to automate the reading of radiology reports, but there is a need to demonstrate that NLP methods are adaptable and reliable for use in real-world clinical applications.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods<\/jats:title><jats:p>We tested the F1 score, precision, and recall to compare NLP tools on a cohort from a study on delirium using images and radiology reports from NHS Fife and a population-based cohort (Generation Scotland) that spans multiple National Health Service health boards. We compared four off-the-shelf rule-based and neural NLP tools (namely, EdIE-R, ALARM+, ESPRESSO, and Sem-EHR) and reported on their performance for three cerebrovascular phenotypes, namely, ischaemic stroke, small vessel disease (SVD), and atrophy. Clinical experts from the EdIE-R team defined phenotypes using labelling techniques developed in the development of EdIE-R, in conjunction with an expert researcher who read underlying images.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>EdIE-R obtained the highest F1 score in both cohorts for ischaemic stroke, \u226593%, followed by ALARM+, \u226587%. The F1 score of ESPRESSO was \u226574%, whilst that of Sem-EHR is \u226566%, although ESPRESSO had the highest precision in both cohorts, 90% and 98%. For F1 scores for SVD, EdIE-R scored \u226598% and ALARM+ \u226590%. ESPRESSO scored lowest with \u226577% and Sem-EHR \u226581%. In NHS Fife, F1 scores for atrophy by EdIE-R and ALARM+ were 99%, dropping in Generation Scotland to 96% for EdIE-R and 91% for ALARM+. Sem-EHR performed lowest for atrophy at 89% in NHS Fife and 73% in Generation Scotland. When comparing NLP tool output with brain image reads using F1 scores, ALARM+ scored 80%, outperforming EdIE-R at 66% in ischaemic stroke. For SVD, EdIE-R performed best, scoring 84%, with Sem-EHR 82%. For atrophy, EdIE-R and both ALARM+ versions were comparable at 80%.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusions<\/jats:title><jats:p>The four NLP tools show varying F1 (and precision\/recall) scores across all three phenotypes, although more apparent for ischaemic stroke. If NLP tools are to be used in clinical settings, this cannot be performed \u201cout of the box.\u201d It is essential to understand the context of their development to assess whether they are suitable for the task at hand or whether further training, re-training, or modification is required to adapt tools to the target task.<\/jats:p><\/jats:sec>","DOI":"10.3389\/fdgth.2023.1184919","type":"journal-article","created":{"date-parts":[[2023,9,29]],"date-time":"2023-09-29T05:30:16Z","timestamp":1695965416000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports"],"prefix":"10.3389","volume":"5","author":[{"given":"Arlene","family":"Casey","sequence":"first","affiliation":[]},{"given":"Emma","family":"Davidson","sequence":"additional","affiliation":[]},{"given":"Claire","family":"Grover","sequence":"additional","affiliation":[]},{"given":"Richard","family":"Tobin","sequence":"additional","affiliation":[]},{"given":"Andreas","family":"Grivas","sequence":"additional","affiliation":[]},{"given":"Huayu","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Patrick","family":"Schrempf","sequence":"additional","affiliation":[]},{"given":"Alison Q.","family":"O\u2019Neil","sequence":"additional","affiliation":[]},{"given":"Liam","family":"Lee","sequence":"additional","affiliation":[]},{"given":"Michael","family":"Walsh","sequence":"additional","affiliation":[]},{"given":"Freya","family":"Pellie","sequence":"additional","affiliation":[]},{"given":"Karen","family":"Ferguson","sequence":"additional","affiliation":[]},{"given":"Vera","family":"Cvoro","sequence":"additional","affiliation":[]},{"given":"Honghan","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Heather","family":"Whalley","sequence":"additional","affiliation":[]},{"given":"Grant","family":"Mair","sequence":"additional","affiliation":[]},{"given":"William","family":"Whiteley","sequence":"additional","affiliation":[]},{"given":"Beatrice","family":"Alex","sequence":"additional","affiliation":[]}],"member":"1965","published-online":{"date-parts":[[2023,9,28]]},"reference":[{"key":"B1","doi-asserted-by":"publisher","first-page":"179","DOI":"10.1186\/s12911-021-01533-7","article-title":"A systematic review of natural language processing applied to radiology reports","volume":"21","author":"Casey","year":"2021","journal-title":"BMC Med Inform Decis Mak"},{"key":"B2","doi-asserted-by":"publisher","first-page":"329","DOI":"10.1148\/radiol.16142770","article-title":"Natural language processing in radiology: a systematic review","volume":"279","author":"Pons","year":"2016","journal-title":"Radiology"},{"key":"B3","doi-asserted-by":"publisher","first-page":"191","DOI":"10.1186\/s12911-021-01556-0","article-title":"Developing automated methods for disease subtyping in UK Biobank: an exemplar study on stroke","volume":"21","author":"Rannikm\u00e4e","year":"2021","journal-title":"BMC Med Inform Decis Mak"},{"key":"B4","doi-asserted-by":"publisher","first-page":"e113","DOI":"10.1093\/jamia\/ocv155","article-title":"Classification of radiology reports for falls in an HIV study cohort","volume":"23","author":"Bates","year":"2016","journal-title":"J Am Med Inform Assoc"},{"key":"B5","doi-asserted-by":"publisher","first-page":"e0214775","DOI":"10.1371\/journal.pone.0214775","article-title":"tbiExtractor: a framework for extracting traumatic brain injury common data elements from radiology reports","volume":"15","author":"Mahan","year":"2020","journal-title":"PLoS One"},{"key":"B6","doi-asserted-by":"publisher","first-page":"757","DOI":"10.1016\/j.jacr.2017.01.044","article-title":"Focal cystic pancreatic lesion follow-up recommendations after publication of ACR white paper on managing incidental findings","volume":"14","author":"Bobbin","year":"2017","journal-title":"J Am Coll Radiol"},{"key":"B7","doi-asserted-by":"publisher","first-page":"422","DOI":"10.1016\/j.jacr.2017.11.022","article-title":"Determining adherence to follow-up imaging recommendations","volume":"15","author":"Mabotuwana","year":"2018","journal-title":"J Am Coll Radiol"},{"key":"B8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41746-022-00730-6","article-title":"A survey on clinical natural language processing in the United Kingdom from 2007 to 2022","volume":"5","author":"Wu","year":"2022","journal-title":"NPJ Digit Med"},{"key":"B9","first-page":"220","author":"Mitchell","year":"2019"},{"key":"B10","doi-asserted-by":"publisher","first-page":"587","DOI":"10.1162\/tacl_a_00041","article-title":"Data statements for natural language processing: toward mitigating system bias and enabling better science","volume":"6","author":"Bender","year":"2018","journal-title":"Trans Assoc Comput Linguist"},{"key":"B11","doi-asserted-by":"publisher","first-page":"74","DOI":"10.1186\/1471-2350-7-74","article-title":"Generation Scotland: the Scottish family health study; a new resource for researching genes and heritability","volume":"7","author":"Smith","year":"2006","journal-title":"BMC Med Genet"},{"key":"B12","doi-asserted-by":"publisher","first-page":"23","DOI":"10.1186\/s13326-019-0211-7","article-title":"Text mining brain imaging reports","volume":"10","author":"Alex","year":"2019","journal-title":"J Biomed Semantics"},{"key":"B13","first-page":"102","author":"Stenetorp","year":"2012"},{"key":"B14","doi-asserted-by":"publisher","first-page":"184","DOI":"10.1186\/s12911-019-0908-7","article-title":"A validated natural language processing algorithm for brain imaging phenotypes from radiology reports in UK electronic health records","volume":"19","author":"Wheater","year":"2019","journal-title":"BMC Med Inform Decis Mak"},{"key":"B15","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1186\/s12911-020-1072-9","article-title":"Assessment of the impact of EHR heterogeneity for clinical research through a case study of silent brain infarction","volume":"20","author":"Fu","year":"2020","journal-title":"BMC Med Inform Decis Mak"},{"key":"B16","doi-asserted-by":"publisher","first-page":"e12109","DOI":"10.2196\/12109","article-title":"Natural language processing for the identification of silent brain infarcts from neuroimaging reports","volume":"7","author":"Fu","year":"2019","journal-title":"JMIR Med Inform"},{"key":"B17","first-page":"277","author":"Schrempf","year":"2020"},{"key":"B18","doi-asserted-by":"publisher","first-page":"299","DOI":"10.3390\/make3020015","article-title":"Templated text synthesis for expert-guided multi-label extraction from radiology reports","volume":"3","author":"Schrempf","year":"2021","journal-title":"Mach Learn Knowl Extr"},{"key":"B19","first-page":"345","article-title":"A probabilistic interpretation of precision, recall and F-score, with implication for evaluation","volume-title":"Advances in information retrieval. ECIR 2005. Lecture notes in computer science","author":"Goutte","year":"2005"},{"key":"B20","doi-asserted-by":"publisher","first-page":"1194","DOI":"10.1007\/s10278-020-00379-1","article-title":"Between always and never: evaluating uncertainty in radiology reports using natural language processing","volume":"33","author":"Callen","year":"2020","journal-title":"J Digit Imaging"},{"key":"B21","first-page":"590","author":"Irvin","year":"2019"},{"key":"B22","first-page":"3986","author":"Hollenstein","year":"2016"},{"key":"B23","first-page":"81","article-title":"Context: an algorithm for identifying contextual features from clinical text","volume-title":"Biological, translational, and clinical language processing","author":"Chapman","year":"2007"},{"key":"B24","first-page":"254","article-title":"Labelling imaging datasets on the basis of neuroradiology reports: a validation study","volume-title":"Interpretable and annotation-efficient learning for medical image computing. Lecture notes in computer science","author":"Wood","year":"2020"}],"container-title":["Frontiers in Digital Health"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fdgth.2023.1184919\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,29]],"date-time":"2023-09-29T05:30:21Z","timestamp":1695965421000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fdgth.2023.1184919\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,28]]},"references-count":24,"alternative-id":["10.3389\/fdgth.2023.1184919"],"URL":"https:\/\/doi.org\/10.3389\/fdgth.2023.1184919","relation":{},"ISSN":["2673-253X"],"issn-type":[{"value":"2673-253X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,28]]},"article-number":"1184919"}}