{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,3]],"date-time":"2026-05-03T23:45:43Z","timestamp":1777851943875,"version":"3.51.4"},"reference-count":22,"publisher":"SAGE Publications","issue":"1","license":[{"start":{"date-parts":[[2023,1,1]],"date-time":"2023-01-01T00:00:00Z","timestamp":1672531200000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000024","name":"Canadian Institutes of Health Research","doi-asserted-by":"publisher","award":["143303"],"award-info":[{"award-number":["143303"]}],"id":[{"id":"10.13039\/501100000024","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Health Informatics J"],"published-print":{"date-parts":[[2023,1]]},"abstract":"<jats:p>Background\/Objectives: Unsupervised topic models are often used to facilitate improved understanding of large unstructured clinical text datasets. In this study we investigated how ICD-9 diagnostic codes, collected alongside clinical text data, could be used to establish concurrent-, convergent- and discriminant-validity of learned topic models. Design\/Setting: Retrospective open cohort design. Data were collected from primary care clinics located in Toronto, Canada between 01\/01\/2017 through 12\/31\/2020. Methods: We fit a non-negative matrix factorization topic model, with K = 50 latent topics\/themes, to our input document term matrix (DTM). We estimated the magnitude of association between each Boolean-valued ICD-9 diagnostic code and each continuous latent topical vector. We identified ICD-9 diagnostic codes most strongly associated with each latent topical vector; and qualitatively interpreted how these codes could be used for external validation of the learned topic model. Results: The DTM consisted of 382,666 documents and 2210 words\/tokens. We correlated concurrently assigned ICD-9 diagnostic codes with learned topical vectors, and observed semantic agreement for a subset of latent constructs (e.g. conditions of the breast, disorders of the female genital tract, respiratory disease, viral infection, eye\/ear\/nose\/throat conditions, conditions of the urinary system, and dermatological conditions, etc.). Conclusions: When fitting topic models to clinical text corpora, researchers can leverage contemporaneously collected electronic medical record data to investigate the external validity of fitted latent variable models.<\/jats:p>","DOI":"10.1177\/14604582221115667","type":"journal-article","created":{"date-parts":[[2023,1,14]],"date-time":"2023-01-14T02:23:37Z","timestamp":1673663017000},"update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":1,"title":["Using ICD-9 diagnostic codes for external validation of topic models derived from primary care electronic medical record clinical text data"],"prefix":"10.1177","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5429-5233","authenticated-orcid":false,"given":"Christopher","family":"Meaney","sequence":"first","affiliation":[{"name":"University of Toronto, Toronto, ON, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Escobar","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, ON, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Therese A","family":"Stukel","sequence":"additional","affiliation":[{"name":"ICES, Toronto, ON, Canada; University of Toronto, Toronto, ON, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Peter C","family":"Austin","sequence":"additional","affiliation":[{"name":"ICES, Toronto, ON, Canada; University of Toronto, Toronto, ON, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sumeet","family":"Kalia","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, ON, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Babak","family":"Aliarzadeh","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, ON, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5506-084X","authenticated-orcid":false,"family":"Rahim Moineddin","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, ON, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8957-0285","authenticated-orcid":false,"given":"Michelle","family":"Greiver","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, ON, Canada; North York General Hospital, Toronto, ON, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","published-online":{"date-parts":[[2023,1,13]]},"reference":[{"key":"bibr1-14604582221115667","volume-title":"International profiles of health care systems","author":"International Commonwealth Fund","year":"2020"},{"key":"bibr2-14604582221115667","unstructured":"American Education Research Association. Standard for educational and psychological testing. Washington, DC, USA: American Educational Research Association, 2018, pp. 11\u201331."},{"key":"bibr3-14604582221115667","unstructured":"Krippendorff K. Content analysis: an introduction to its methodology. Thousand Oaks, CA: Sage Publications, 2008, pp. 313\u2013338."},{"key":"bibr4-14604582221115667","unstructured":"Cunningham P. Unsupervised learning and clustering. Springer, 2008, pp. 1\u201331."},{"key":"bibr5-14604582221115667","volume-title":"Evaluation metrics for unsupervised learning algorithms","author":"Palacio-Nino J","year":"2019"},{"key":"bibr6-14604582221115667","unstructured":"Rothman K, Greenland S, Lash T. Modern epidemiology. Philadelphia: Lippincott, Williams, and Wilkens Publishers, 2018, pp. 129\u2013131."},{"key":"bibr7-14604582221115667","doi-asserted-by":"publisher","DOI":"10.1016\/j.csda.2006.11.006"},{"key":"bibr8-14604582221115667","doi-asserted-by":"publisher","DOI":"10.1561\/2200000055"},{"key":"bibr9-14604582221115667","doi-asserted-by":"publisher","DOI":"10.1002\/env.3170050203"},{"key":"bibr10-14604582221115667","doi-asserted-by":"publisher","DOI":"10.1038\/44565"},{"key":"bibr11-14604582221115667","first-page":"556","volume":"13","author":"Lee D","year":"2001","journal-title":"Adv Neural Inf Process Syst"},{"key":"bibr12-14604582221115667","unstructured":"Matthews P. Human in the loop topic modelling. International Society for Knowledge Organization, 2019, pp. 1\u201331."},{"key":"bibr13-14604582221115667","first-page":"3824","author":"Doogan C","year":"2021","journal-title":"NAACL"},{"key":"bibr14-14604582221115667","first-page":"399","author":"Roder M","year":"2015","journal-title":"ACM"},{"key":"bibr15-14604582221115667","doi-asserted-by":"publisher","DOI":"10.1016\/j.fss.2007.03.004"},{"key":"bibr16-14604582221115667","doi-asserted-by":"publisher","DOI":"10.1080\/00401706.1978.10489693"},{"key":"bibr17-14604582221115667","volume-title":"Inferring concepts from topics: towards procedures for validating topics as measures","author":"Ying L","year":"2020"},{"key":"bibr18-14604582221115667","first-page":"1","volume":"13","author":"Katz A","year":"2012","journal-title":"BMC Med Inform Decis Making"},{"key":"bibr19-14604582221115667","first-page":"993","volume":"3","author":"Blei D","year":"2003","journal-title":"J Machine Learn Res"},{"key":"bibr20-14604582221115667","doi-asserted-by":"publisher","DOI":"10.1145\/2133806.2133826"},{"key":"bibr21-14604582221115667","volume-title":"TOP2VEC: Distributed representations of topics","author":"Angelov D","year":"2020"},{"key":"bibr22-14604582221115667","volume-title":"BERTopic: neural topic modeling with a class-based TF-IDF procedure","author":"Grootendorst M","year":"2022"}],"container-title":["Health Informatics Journal"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/14604582221115667","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/14604582221115667","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/14604582221115667","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T22:28:08Z","timestamp":1777501688000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/14604582221115667"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1]]},"references-count":22,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1]]}},"alternative-id":["10.1177\/14604582221115667"],"URL":"https:\/\/doi.org\/10.1177\/14604582221115667","relation":{},"ISSN":["1460-4582","1741-2811"],"issn-type":[{"value":"1460-4582","type":"print"},{"value":"1741-2811","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1]]},"article-number":"14604582221115667"}}