{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T07:23:29Z","timestamp":1773213809429,"version":"3.50.1"},"reference-count":35,"publisher":"Oxford University Press (OUP)","issue":"10","license":[{"start":{"date-parts":[[2020,9,15]],"date-time":"2020-09-15T00:00:00Z","timestamp":1600128000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"University of Texas Health Science Center in Houston School of Biomedical Informatics Data Service team"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,10,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objective<\/jats:title>\n                  <jats:p>Predictive disease modeling using electronic health record data is a growing field. Although clinical data in their raw form can be used directly for predictive modeling, it is a common practice to map data to standard terminologies to facilitate data aggregation and reuse. There is, however, a lack of systematic investigation of how different representations could affect the performance of predictive models, especially in the context of machine learning and deep learning.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>We projected the input diagnoses data in the Cerner HealthFacts database to Unified Medical Language System (UMLS) and 5 other terminologies, including CCS, CCSR, ICD-9, ICD-10, and PheWAS, and evaluated the prediction performances of these terminologies on 2 different tasks: the risk prediction of heart failure in diabetes patients and the risk prediction of pancreatic cancer. 
Two popular models were evaluated: logistic regression and a recurrent neural network.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>For logistic regression, using UMLS delivered the optimal area under the receiver operating characteristics (AUROC) results in both the heart failure in diabetes patients (81.15%) and pancreatic cancer (80.53%) tasks. For the recurrent neural network, UMLS worked best for pancreatic cancer prediction (AUROC 82.24%) and was second (AUROC 85.55%) only to PheWAS (AUROC 85.87%) for prediction of heart failure in diabetes patients.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Discussion\/Conclusion<\/jats:title>\n                  <jats:p>In our experiments, terminologies with larger vocabularies and finer-grained representations were associated with better prediction performances. In particular, UMLS is consistently 1 of the best-performing ones. We believe that our work may help to inform better designs of predictive models, although further investigation is warranted.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocaa180","type":"journal-article","created":{"date-parts":[[2020,7,25]],"date-time":"2020-07-25T04:29:02Z","timestamp":1595651342000},"page":"1593-1599","source":"Crossref","is-referenced-by-count":24,"title":["Representation of EHR data for predictive modeling: a comparison between UMLS and other terminologies"],"prefix":"10.1093","volume":"27","author":[{"given":"Laila","family":"Rasmy","sequence":"first","affiliation":[{"name":"School of Biomedical Informatics University of Texas Health Science Center, Houston, Texas, USA"}]},{"given":"Firat","family":"Tiryaki","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics University of Texas Health Science Center, Houston, Texas, USA"}]},{"given":"Yujia","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of 
Biomedical Informatics University of Texas Health Science Center, Houston, Texas, USA"}]},{"given":"Yang","family":"Xiang","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics University of Texas Health Science Center, Houston, Texas, USA"}]},{"given":"Cui","family":"Tao","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics University of Texas Health Science Center, Houston, Texas, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5274-4672","authenticated-orcid":false,"given":"Hua","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics University of Texas Health Science Center, Houston, Texas, USA"}]},{"given":"Degui","family":"Zhi","sequence":"additional","affiliation":[{"name":"School of Biomedical Informatics University of Texas Health Science Center, Houston, Texas, USA"}]}],"member":"286","published-online":{"date-parts":[[2020,9,15]]},"reference":[{"issue":"5","key":"2020110613120465400_ocaa180-B1","doi-asserted-by":"crossref","first-page":"111","DOI":"10.1007\/s10916-019-1243-3","article-title":"LSTM model for prediction of heart failure in big data","volume":"43","author":"Maragatham","year":"2019","journal-title":"J Med Syst"},{"key":"2020110613120465400_ocaa180-B2","first-page":"3504","article-title":"RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism","author":"Choi","year":"2016","journal-title":"Adv Neural Inf Process Syst"},{"issue":"2","key":"2020110613120465400_ocaa180-B3","doi-asserted-by":"crossref","first-page":"361","DOI":"10.1093\/jamia\/ocw112","article-title":"Using recurrent neural network models for early detection of heart failure onset","volume":"24","author":"Choi","year":"2017","journal-title":"J Am Med Inform Assoc"},{"key":"2020110613120465400_ocaa180-B4","doi-asserted-by":"crossref","DOI":"10.1016\/j.jbi.2018.06.011","article-title":"A study of generalizability of recurrent neural network-based 
predictive models for heart failure onset risk using a large and heterogeneous EHR data set","volume":"84","author":"Rasmy","year":"2018","journal-title":"J Biomed Inform"},{"key":"2020110613120465400_ocaa180-B5","doi-asserted-by":"crossref","first-page":"9256","DOI":"10.1109\/ACCESS.2017.2789324","article-title":"Predicting the risk of heart failure with EHR sequential data modeling","volume":"6","author":"Jin","year":"2018","journal-title":"IEEE Access"},{"key":"2020110613120465400_ocaa180-B6","doi-asserted-by":"crossref","first-page":"2","DOI":"10.3389\/frai.2019.00002","article-title":"Pancreatic cancer prediction through an artificial neural network","volume":"2","author":"Muhammad","year":"2019","journal-title":"Front Artif Intell"},{"key":"2020110613120465400_ocaa180-B7","doi-asserted-by":"crossref","first-page":"6317","DOI":"10.2147\/CMAR.S180791","article-title":"Development of a prediction model for pancreatic cancer in patients with type 2 diabetes using logistic regression and artificial neural network models","volume":"10","author":"Hsieh","year":"2018","journal-title":"Cancer Manag Res"},{"key":"2020110613120465400_ocaa180-B8","doi-asserted-by":"crossref","first-page":"103337","DOI":"10.1016\/j.jbi.2019.103337","article-title":"Deep learning for electronic health records: A comparative review of multiple deep neural architectures","volume":"101","author":"Ayala Solares","year":"2020","journal-title":"J. Biomed. 
Inform"},{"issue":"1","key":"2020110613120465400_ocaa180-B9","doi-asserted-by":"crossref","DOI":"10.1038\/s41598-019-39071-y","article-title":"Predictive modeling of the hospital readmission risk from patients\u2019 claims data using machine learning: a case study on COPD","volume":"9","author":"Min","year":"2019","journal-title":"Sci Rep"},{"key":"2020110613120465400_ocaa180-B10","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1038\/s41746-018-0029-1","article-title":"Scalable and accurate deep learning with electronic health records","volume":"1","author":"Rajkomar","year":"2018","journal-title":"NPJ Digit Med"},{"key":"2020110613120465400_ocaa180-B11","doi-asserted-by":"crossref","first-page":"1353","DOI":"10.1016\/j.procs.2020.04.145","article-title":"Deep contextualized medical concept normalization in social media text","volume":"171","author":"Subramanyam","year":"2020","journal-title":"Proc Comput Sci"},{"issue":"7","key":"2020110613120465400_ocaa180-B12","doi-asserted-by":"crossref","first-page":"e0175508","DOI":"10.1371\/journal.pone.0175508","article-title":"Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record","volume":"12","author":"Wei","year":"2017","journal-title":"PLoS One"},{"key":"2020110613120465400_ocaa180-B13","first-page":"462077","article-title":"Developing and evaluating mappings of ICD-10 and ICD-10-CM codes to Phecodes","author":"Wu","year":"2018","journal-title":"bioRxiv"},{"key":"2020110613120465400_ocaa180-B14","first-page":"911","article-title":"An evaluation of the NQF quality data model for representing electronic health record driven phenotyping algorithms","volume":"2012","author":"Thompson","year":"2012","journal-title":"AMIA Ann Symp 
Proc"},{"key":"2020110613120465400_ocaa180-B15","first-page":"4547","author":"Choi","year":"2018"},{"key":"2020110613120465400_ocaa180-B16","author":"Beam","year":"2018"},{"key":"2020110613120465400_ocaa180-B17","author":"Alawad"},{"issue":"S2","key":"2020110613120465400_ocaa180-B18","doi-asserted-by":"crossref","first-page":"58","DOI":"10.1186\/s12911-019-0766-3","article-title":"Time-sensitive clinical concept embeddings learned from large electronic health records","volume":"19","author":"Xiang","year":"2019","journal-title":"BMC Med Inform Decis Mak"},{"key":"2020110613120465400_ocaa180-B19","author":"Feng","year":"2019"},{"key":"2020110613120465400_ocaa180-B20","doi-asserted-by":"crossref","first-page":"103115","DOI":"10.1016\/j.jbi.2019.103115","article-title":"Predicting need for advanced illness or palliative care in a primary care population using electronic health record data","volume":"92","author":"Jung","year":"2019","journal-title":"J Biomed Inform"},{"key":"2020110613120465400_ocaa180-B21","doi-asserted-by":"crossref","first-page":"D267","DOI":"10.1093\/nar\/gkh061","article-title":"The Unified Medical Language System (UMLS): integrating biomedical terminology","volume":"32 (Database issue","author":"Bodenreider","year":"2004","journal-title":"Nucleic Acids Res"},{"key":"2020110613120465400_ocaa180-B22","first-page":"41","article-title":"Learning low-dimensional representations of medical concepts","author":"Choi","year":"2016","journal-title":"AMIA Joint Summits Translational Science Proceedings"},{"key":"2020110613120465400_ocaa180-B23","first-page":"543","article-title":"Adversarial learning of knowledge embeddings for the unified medical language system","author":"Maldonado","year":"2019","journal-title":"AMIA Jt Summits Transl Sci Proc 2019"},{"key":"2020110613120465400_ocaa180-B24","author":"UMLS Knowledge Sources: File 
Downloads","year":"2019"},{"key":"2020110613120465400_ocaa180-B25","author":"2018-ICD-10-CM-and-GEMs;","year":"2017"},{"key":"2020110613120465400_ocaa180-B26","author":"PheWAS-Phenome Wide Association Studies","year":"2019"},{"key":"2020110613120465400_ocaa180-B27","author":"Beta Clinical Classifications Software (CCS) for ICD-10-CM\/PCS","year":"2019"},{"key":"2020110613120465400_ocaa180-B28","author":"HCUP CCS"},{"key":"2020110613120465400_ocaa180-B29","author":"Clinical Classifications Software Refined (CCSR) for ICD-10-CM Diagnoses","year":"1330"},{"key":"2020110613120465400_ocaa180-B30","year":"2018"},{"key":"2020110613120465400_ocaa180-B31","author":"sklearn.linear_model.LogisticRegression\u2014scikit-learn 0.20.3 documentation","year":"2019"},{"key":"2020110613120465400_ocaa180-B32","author":"Ma","year":"2017"},{"key":"2020110613120465400_ocaa180-B33","author":"Ma","year":"2017"},{"key":"2020110613120465400_ocaa180-B34","volume-title":"Medinfo 2019 (podium abstract submitted Nov 2018). Simple Recurrent Neural Networks is all we need for clinical events predictions using EHR data. 
Lyon, France: MedInfo","author":"Rasmy","year":"2019"},{"issue":"3","key":"2020110613120465400_ocaa180-B35","doi-asserted-by":"crossref","first-page":"837","DOI":"10.2307\/2531595","article-title":"Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach","volume":"44","author":"DeLong","year":"1988","journal-title":"Biometrics"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/27\/10\/1593\/34153830\/ocaa180.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/27\/10\/1593\/34153830\/ocaa180.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,11,6]],"date-time":"2020-11-06T19:38:30Z","timestamp":1604691510000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/27\/10\/1593\/5905876"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,9,15]]},"references-count":35,"journal-issue":{"issue":"10","published-online":{"date-parts":[[2020,9,15]]},"published-print":{"date-parts":[[2020,10,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocaa180","relation":{},"ISSN":["1067-5027","1527-974X"],"issn-type":[{"value":"1067-5027","type":"print"},{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,10]]},"published":{"date-parts":[[2020,9,15]]}}}