{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,19]],"date-time":"2026-01-19T05:08:26Z","timestamp":1768799306031,"version":"3.49.0"},"reference-count":42,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T00:00:00Z","timestamp":1715126400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100010269","name":"Wellcome Trust","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100010269","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100006662","name":"NIHR","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100006662","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100013342","name":"Imperial Biomedical Research Centre","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100013342","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100010269","name":"Wellcome Trust","doi-asserted-by":"publisher","award":["215938\/Z\/19\/Z"],"award-info":[{"award-number":["215938\/Z\/19\/Z"]}],"id":[{"id":"10.13039\/100010269","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000272","name":"National Institute for Health and Care Research","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000272","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Applied Research Collaboration Northwest London"},{"DOI":"10.13039\/501100000266","name":"EPSRC","doi-asserted-by":"publisher","award":["EP\/N014529\/1"],"award-info":[{"award-number":["EP\/N014529\/1"]}],"id":[{"id":"10.13039\/501100000266","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Centre for Mathematics of Precision Healthcare"},{"DOI":"10.13039\/501100013342","name":"Imperial Biomedical Research Centre","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100013342","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100030827","name":"NHS","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100030827","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000276","name":"Department of Health and Social Care","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000276","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,6,20]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objective<\/jats:title>\n                  <jats:p>Natural language processing (NLP) algorithms are increasingly being applied to obtain unsupervised representations of electronic health record (EHR) data, but their comparative performance at predicting clinical endpoints remains unclear. Our objective was to compare the performance of unsupervised representations of sequences of disease codes generated by bag-of-words versus sequence-based NLP algorithms at predicting clinically relevant outcomes.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>This cohort study used primary care EHRs from 6\u00a0286\u00a0233 people with Multiple Long-Term Conditions in England. For each patient, an unsupervised vector representation of their time-ordered sequences of diseases was generated using 2 input strategies (212 disease categories versus 9462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec, and 2 transformer models designed for EHRs). We also developed a transformer architecture, named EHR-BERT, incorporating sociodemographic information. We compared the performance of each of these representations (without fine-tuning) as inputs into a logistic classifier to predict 1-year mortality, healthcare use, and new disease diagnosis.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>Patient representations generated by sequence-based algorithms performed consistently better than bag-of-words methods in predicting clinical endpoints, with the highest performance for EHR-BERT across all tasks, although the absolute improvement was small. Representations generated using disease categories perform similarly to those using diagnostic codes as inputs, suggesting models can equally manage smaller or larger vocabularies for prediction of these outcomes.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Discussion and Conclusion<\/jats:title>\n                  <jats:p>Patient representations produced by sequence-based NLP algorithms from sequences of disease codes demonstrate improved predictive content for patient outcomes compared with representations generated by co-occurrence-based algorithms. This suggests transformer models may be useful for generating multi-purpose representations, even without fine-tuning.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocae091","type":"journal-article","created":{"date-parts":[[2024,5,8]],"date-time":"2024-05-08T23:49:31Z","timestamp":1715212171000},"page":"1451-1462","source":"Crossref","is-referenced-by-count":9,"title":["Comparing natural language processing representations of coded disease sequences for prediction in electronic health records"],"prefix":"10.1093","volume":"31","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9709-7264","authenticated-orcid":false,"given":"Thomas","family":"Beaney","sequence":"first","affiliation":[{"name":"Department of Primary Care and Public Health, Imperial College London , London, W12 0BZ, United Kingdom"},{"name":"Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London , London, SW7 2AZ, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sneha","family":"Jha","sequence":"additional","affiliation":[{"name":"Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London , London, SW7 2AZ, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Asem","family":"Alaa","sequence":"additional","affiliation":[{"name":"Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London , London, SW7 2AZ, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alexander","family":"Smith","sequence":"additional","affiliation":[{"name":"Department of Epidemiology and Biostatistics, Imperial College London , London, W2 1PG, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jonathan","family":"Clarke","sequence":"additional","affiliation":[{"name":"Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London , London, SW7 2AZ, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4735-4856","authenticated-orcid":false,"given":"Thomas","family":"Woodcock","sequence":"additional","affiliation":[{"name":"Department of Primary Care and Public Health, Imperial College London , London, W12 0BZ, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Azeem","family":"Majeed","sequence":"additional","affiliation":[{"name":"Department of Primary Care and Public Health, Imperial College London , London, W12 0BZ, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4589-1743","authenticated-orcid":false,"given":"Paul","family":"Aylin","sequence":"additional","affiliation":[{"name":"Department of Primary Care and Public Health, Imperial College London , London, W12 0BZ, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1089-5675","authenticated-orcid":false,"given":"Mauricio","family":"Barahona","sequence":"additional","affiliation":[{"name":"Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London , London, SW7 2AZ, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2024,5,8]]},"reference":[{"issue":"1","key":"2024062008075185600_ocae091-B1","doi-asserted-by":"crossref","first-page":"182","DOI":"10.1093\/eurpub\/cky098","article-title":"Defining and measuring multimorbidity: a systematic review of systematic reviews","volume":"29","author":"Johnston","year":"2019","journal-title":"Eur J Public Health"},{"issue":"12","key":"2024062008075185600_ocae091-B2","doi-asserted-by":"crossref","first-page":"e599","DOI":"10.1016\/S2468-2667(19)30222-1","article-title":"Multimorbidity\u2014a defining challenge for health systems","volume":"4","author":"Pearson-Stuttard","year":"2019","journal-title":"Lancet Public Health"},{"issue":"7800","key":"2024062008075185600_ocae091-B3","doi-asserted-by":"crossref","first-page":"494","DOI":"10.1038\/d41586-020-00837-4","article-title":"Map clusters of diseases to tackle multimorbidity","volume":"579","author":"Whitty","year":"2020","journal-title":"Nature"},{"key":"2024062008075185600_ocae091-B4","author":"The Academy of Medical Sciences","year":"2018"},{"issue":"1","key":"2024062008075185600_ocae091-B5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41746-021-00455-y","article-title":"Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction","volume":"4","author":"Rasmy","year":"2021","journal-title":"NPJ Digit Med"},{"key":"2024062008075185600_ocae091-B6","author":"Choi","year":"2017"},{"key":"2024062008075185600_ocae091-B7","first-page":"1495","author":"Choi","year":"2016"},{"key":"2024062008075185600_ocae091-B8","author":"Solares"},{"issue":"1","key":"2024062008075185600_ocae091-B9","doi-asserted-by":"crossref","first-page":"7155","DOI":"10.1038\/s41598-020-62922-y","article-title":"BEHRT: transformer for electronic health records","volume":"10","author":"Li","year":"2020","journal-title":"Sci Rep"},{"issue":"1","key":"2024062008075185600_ocae091-B10","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1186\/s12874-018-0584-9","article-title":"A systematic review of the clinical application of data-driven population segmentation analysis","volume":"18","author":"Yan","year":"2018","journal-title":"BMC Med Res Methodol"},{"issue":"6","key":"2024062008075185600_ocae091-B11","doi-asserted-by":"crossref","first-page":"1740","DOI":"10.1093\/ije\/dyz034","article-title":"Data resource profile: clinical practice research datalink (CPRD) aurum","volume":"48","author":"Wolf","year":"2019","journal-title":"Int J Epidemiol"},{"issue":"3","key":"2024062008075185600_ocae091-B12","doi-asserted-by":"crossref","first-page":"827","DOI":"10.1093\/ije\/dyv098","article-title":"Data resource profile: clinical practice research datalink (CPRD)","volume":"44","author":"Herrett","year":"2015","journal-title":"Int J Epidemiol"},{"issue":"7","key":"2024062008075185600_ocae091-B13","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1002\/pds.1115","article-title":"The relationship between time since registration and measured incidence rates in the general practice research database","volume":"14","author":"Lewis","year":"2005","journal-title":"Pharmacoepidemiol Drug Saf"},{"key":"2024062008075185600_ocae091-B14","author":"Clinical Practice Research Datalink","year":"2022"},{"key":"2024062008075185600_ocae091-B15","author":"Ministry of Housing & Communities & Local Government","year":"2019"},{"key":"2024062008075185600_ocae091-B16","author":"NHS Digital"},{"key":"2024062008075185600_ocae091-B17","doi-asserted-by":"crossref","first-page":"104038","DOI":"10.1016\/j.ijmedinf.2019.104038","article-title":"CPRD GOLD and linked ONS mortality records: reconciling guidelines","volume":"136","author":"Delmestri","year":"2020","journal-title":"Int J Med Inform"},{"issue":"2","key":"2024062008075185600_ocae091-B18","doi-asserted-by":"crossref","first-page":"222","DOI":"10.1093\/jamia\/ocac158","article-title":"Translating and evaluating historic phenotyping algorithms using SNOMED CT","volume":"30","author":"Elkheder","year":"2022","journal-title":"J Am Med Inform Assoc"},{"issue":"2","key":"2024062008075185600_ocae091-B19","doi-asserted-by":"crossref","first-page":"e63","DOI":"10.1016\/S2589-7500(19)30012-3","article-title":"A chronological map of 308 physical and mental health conditions from 4 million individuals in the English National Health Service","volume":"1","author":"Kuan","year":"2019","journal-title":"Lancet Digit Health"},{"issue":"8","key":"2024062008075185600_ocae091-B20","doi-asserted-by":"crossref","first-page":"e489","DOI":"10.1016\/S2666-7568(21)00146-X","article-title":"Inequalities in incident and prevalent multimorbidity in England, 2004-19: a population-based, descriptive study","volume":"2","author":"Head","year":"2021","journal-title":"Lancet Healthy Longev"},{"key":"2024062008075185600_ocae091-B21","author":"Beaney","year":"2023"},{"key":"2024062008075185600_ocae091-B22","first-page":"993","article-title":"Latent Dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J Mach Learn Res"},{"key":"2024062008075185600_ocae091-B23","author":"R\u00f6der","year":"2015"},{"key":"2024062008075185600_ocae091-B24","first-page":"1188","author":"Le","year":"2014"},{"key":"2024062008075185600_ocae091-B25","author":"Liu","year":"2019"},{"issue":"9","key":"2024062008075185600_ocae091-B26","doi-asserted-by":"crossref","first-page":"e072884","DOI":"10.1136\/bmjopen-2023-072884","article-title":"Identifying potential biases in code sequences in primary care electronic healthcare records: a retrospective cohort study of the determinants of code frequency","volume":"13","author":"Beaney","year":"2023","journal-title":"BMJ Open"},{"key":"2024062008075185600_ocae091-B27","author":"Xiao"},{"key":"2024062008075185600_ocae091-B28","first-page":"4171","author":"Devlin","year":"2019"},{"key":"2024062008075185600_ocae091-B29","author":"Davis","year":"2006"},{"key":"2024062008075185600_ocae091-B30","author":"The Python Language Reference"},{"key":"2024062008075185600_ocae091-B31","first-page":"56","author":"McKinney"},{"key":"2024062008075185600_ocae091-B32","first-page":"45","author":"Rehurek","year":"2010"},{"key":"2024062008075185600_ocae091-B33","author":"Wolf","year":"2020"},{"key":"2024062008075185600_ocae091-B34","author":"Lannou","year":"2021"},{"issue":"1","key":"2024062008075185600_ocae091-B35","doi-asserted-by":"crossref","first-page":"e022820","DOI":"10.1136\/bmjopen-2018-022820","article-title":"What are the social predictors of accident and emergency attendance in disadvantaged neighbourhoods? results from a cross-sectional household health survey in the north west of England","volume":"9","author":"Giebel","year":"2019","journal-title":"BMJ Open"},{"issue":"1","key":"2024062008075185600_ocae091-B36","doi-asserted-by":"crossref","first-page":"202","DOI":"10.1186\/s13643-019-1105-6","article-title":"Population segmentation based on healthcare needs: a systematic review","volume":"8","author":"Chong","year":"2019","journal-title":"Syst Rev"},{"issue":"5","key":"2024062008075185600_ocae091-B37","doi-asserted-by":"crossref","first-page":"e185","DOI":"10.2196\/jmir.9134","article-title":"Possible sources of bias in primary care electronic health record data use and reuse","volume":"20","author":"Verheij","year":"2018","journal-title":"J Med Internet Res"},{"issue":"6","key":"2024062008075185600_ocae091-B38","doi-asserted-by":"crossref","first-page":"e010393","DOI":"10.1136\/bmjopen-2015-010393","article-title":"What evidence is there for a delay in diagnostic coding of RA in UK general practice records? an observational study of free text","volume":"6","author":"Ford","year":"2016","journal-title":"BMJ Open"},{"issue":"9","key":"2024062008075185600_ocae091-B39","doi-asserted-by":"crossref","first-page":"874","DOI":"10.1056\/NEJMms2004740","article-title":"Hidden in plain sight\u2013reconsidering the use of race correction in clinical algorithms","volume":"383","author":"Vyas","year":"2020","journal-title":"N Engl J Med"},{"issue":"6464","key":"2024062008075185600_ocae091-B40","doi-asserted-by":"crossref","first-page":"447","DOI":"10.1126\/science.aax2342","article-title":"Dissecting racial bias in an algorithm used to manage the health of populations","volume":"366","author":"Obermeyer","year":"2019","journal-title":"Science"},{"key":"2024062008075185600_ocae091-B41","doi-asserted-by":"crossref","first-page":"e071950","DOI":"10.1136\/bmj-2022-071950","article-title":"How can we improve the quality of data collected in general practice?","volume":"380","author":"Shemtob","year":"2023","journal-title":"BMJ"},{"key":"2024062008075185600_ocae091-B42"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/31\/7\/1451\/58243723\/ocae091.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/jamia\/article-pdf\/31\/7\/1451\/58243723\/ocae091.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,20]],"date-time":"2024-06-20T08:14:16Z","timestamp":1718871256000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/31\/7\/1451\/7667337"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,8]]},"references-count":42,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2024,5,8]]},"published-print":{"date-parts":[[2024,6,20]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocae091","relation":{},"ISSN":["1067-5027","1527-974X"],"issn-type":[{"value":"1067-5027","type":"print"},{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,7]]},"published":{"date-parts":[[2024,5,8]]}}}