{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T01:04:51Z","timestamp":1775264691888,"version":"3.50.1"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,10,27]],"date-time":"2021-10-27T00:00:00Z","timestamp":1635292800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,10,27]],"date-time":"2021-10-27T00:00:00Z","timestamp":1635292800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Million Veteran Program, #MVP000"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["npj Digit. Med."],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>The increasing availability of electronic health record (EHR) systems has created enormous potential for translational research. However, it is difficult to know all the relevant codes related to a phenotype due to the large number of codes available. Traditional data mining approaches often require the use of patient-level data, which hinders the ability to share data across institutions. In this project, we demonstrate that multi-center large-scale code embeddings can be used to efficiently identify relevant features related to a disease of interest. We constructed large-scale code embeddings for a wide range of codified concepts from EHRs from two large medical centers. We developed knowledge extraction via sparse embedding regression (KESER) for feature selection and integrative network analysis. We evaluated the quality of the code embeddings and assessed the performance of KESER in feature selection for eight diseases. Besides, we developed an integrated clinical knowledge map combining embedding data from both institutions. 
The features selected by KESER were comprehensive compared to lists of codified data generated by domain experts. Features identified via KESER resulted in comparable performance to those built upon features selected manually or with patient-level data. The knowledge map created using an integrative analysis identified disease-disease and disease-drug pairs more accurately compared to those identified using single institution data. Analysis of code embeddings via KESER can effectively reveal clinical knowledge and infer relatedness among codified concepts. KESER bypasses the need for patient-level data in individual analyses providing a significant advance in enabling multi-center studies using EHR data.<\/jats:p>","DOI":"10.1038\/s41746-021-00519-z","type":"journal-article","created":{"date-parts":[[2021,10,27]],"date-time":"2021-10-27T06:04:13Z","timestamp":1635314653000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":57,"title":["Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data"],"prefix":"10.1038","volume":"4","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7056-9559","authenticated-orcid":false,"given":"Chuan","family":"Hong","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5632-5723","authenticated-orcid":false,"given":"Everett","family":"Rush","sequence":"additional","affiliation":[]},{"given":"Molei","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Doudou","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Jiehuan","family":"Sun","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6165-9082","authenticated-orcid":false,"given":"Aaron","family":"Sonabend","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7390-6354","authenticated-orcid":false,"given":"Victor 
M.","family":"Castro","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6135-0173","authenticated-orcid":false,"given":"Petra","family":"Schubert","sequence":"additional","affiliation":[]},{"given":"Vidul A.","family":"Panickan","sequence":"additional","affiliation":[]},{"given":"Tianrun","family":"Cai","sequence":"additional","affiliation":[]},{"given":"Lauren","family":"Costa","sequence":"additional","affiliation":[]},{"given":"Zeling","family":"He","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4078-4842","authenticated-orcid":false,"given":"Nicholas","family":"Link","sequence":"additional","affiliation":[]},{"given":"Ronald","family":"Hauser","sequence":"additional","affiliation":[]},{"given":"J. Michael","family":"Gaziano","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1905-8806","authenticated-orcid":false,"given":"Shawn N.","family":"Murphy","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2043-6026","authenticated-orcid":false,"given":"George","family":"Ostrouchov","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3305-3830","authenticated-orcid":false,"given":"Yuk-Lam","family":"Ho","sequence":"additional","affiliation":[]},{"given":"Edmon","family":"Begoli","sequence":"additional","affiliation":[]},{"given":"Junwei","family":"Lu","sequence":"additional","affiliation":[]},{"given":"Kelly","family":"Cho","sequence":"additional","affiliation":[]},{"given":"Katherine P.","family":"Liao","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5379-2502","authenticated-orcid":false,"given":"Tianxi","family":"Cai","sequence":"additional","affiliation":[]},{"name":"VA Million Veteran 
Program","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,10,27]]},"reference":[{"key":"519_CR1","doi-asserted-by":"publisher","first-page":"147","DOI":"10.1002\/cpt.359","volume":"100","author":"K Lin","year":"2016","unstructured":"Lin, K. & Schneeweiss, S. Considerations for the analysis of longitudinal electronic health records linked to claims data to study the effectiveness and safety of drugs. Clin. Pharmacol. Ther. 100, 147\u2013159 (2016).","journal-title":"Clin. Pharmacol. Ther."},{"key":"519_CR2","doi-asserted-by":"publisher","first-page":"198","DOI":"10.1093\/jamia\/ocw042","volume":"24","author":"B Goldstein","year":"2017","unstructured":"Goldstein, B., Navar, A., Pencina, M. & Ioannidis, J. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. J. Am. Med. Inform. Assoc. 24, 198\u2013208 (2017).","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"519_CR3","doi-asserted-by":"publisher","first-page":"417","DOI":"10.1038\/nrg2999","volume":"12","author":"IS Kohane","year":"2011","unstructured":"Kohane, I. S. Using electronic health records to drive discovery in disease genomics. Nat. Rev. Genet. 12, 417\u2013428 (2011).","journal-title":"Nat. Rev. Genet."},{"key":"519_CR4","doi-asserted-by":"publisher","first-page":"1102","DOI":"10.1038\/nbt.2749","volume":"31","author":"JC Denny","year":"2013","unstructured":"Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102\u20131111 (2013).","journal-title":"Nat. Biotechnol."},{"key":"519_CR5","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1016\/j.hlpt.2012.03.001","volume":"1","author":"C Bennett","year":"2012","unstructured":"Bennett, C., Doub, T. & Selove, R. 
EHRs connect research and practice: where predictive modeling, artificial intelligence, and clinical decision support intersect. Heal. Policy Technol. 1, 105\u2013114 (2012).","journal-title":"Heal. Policy Technol."},{"key":"519_CR6","doi-asserted-by":"publisher","first-page":"E2","DOI":"10.3390\/jpm6010002","volume":"6","author":"E Karlson","year":"2016","unstructured":"Karlson, E., Boutin, N., Hoffnagle, A. & Allen, N. Building the partners healthcare biobank at partners personalized medicine: informed consent, return of research results, recruitment lessons and operational considerations. J. Pers. Med. 6, E2 (2016).","journal-title":"J. Pers. Med."},{"key":"519_CR7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s00392-016-1025-6","volume":"106","author":"M Cowie","year":"2017","unstructured":"Cowie, M. et al. Electronic health records to facilitate clinical research. Clin. Res. Cardiol. 106, 1\u20139 (2017).","journal-title":"Clin. Res. Cardiol."},{"key":"519_CR8","unstructured":"Organization, W. H. & others. International classification of diseases:[9th] ninth revision, basic tabulation list with alphabetic index (World Health Organization, 1978)."},{"key":"519_CR9","unstructured":"Organization, W. H. International statistical classification of diseases and related health problems. vol. 1 (World Health Organization, 2004)."},{"key":"519_CR10","doi-asserted-by":"publisher","first-page":"624","DOI":"10.1373\/49.4.624","volume":"49","author":"CJ McDonald","year":"2003","unstructured":"McDonald, C. J. et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin. Chem. 49, 624\u2013633 (2003).","journal-title":"Clin. Chem."},{"key":"519_CR11","unstructured":"Abraham, M., Ahlman, J. T., Boudreau, A. J., Connelly, J. L. & Evans, D. D. CPT 2011: standard edition. (American Medical Association Press, 2010)."},{"key":"519_CR12","unstructured":"Elixhauser, A. Clinical Classifications Software (CCS) 2009. 
https:\/\/www.hcup-us.ahrq.gov\/toolssoftware\/ccs\/ccs.jsp (2009)."},{"key":"519_CR13","doi-asserted-by":"publisher","first-page":"634","DOI":"10.1016\/j.jbi.2012.02.011","volume":"45","author":"CC Bennett","year":"2012","unstructured":"Bennett, C. C. Utilizing RxNorm to support practical computing applications: capturing medication history in live electronic health records. J. Biomed. Inform. 45, 634\u2013641 (2012).","journal-title":"J. Biomed. Inform."},{"key":"519_CR14","doi-asserted-by":"publisher","first-page":"156","DOI":"10.1016\/j.jbi.2015.10.001","volume":"58","author":"R Pivovarov","year":"2015","unstructured":"Pivovarov, R. et al. Learning probabilistic phenotypes from heterogeneous EHR data. J. Biomed. Inform. 58, 156\u2013165 (2015).","journal-title":"J. Biomed. Inform."},{"key":"519_CR15","doi-asserted-by":"publisher","first-page":"e143","DOI":"10.1093\/jamia\/ocw135","volume":"24","author":"S Yu","year":"2017","unstructured":"Yu, S. et al. Surrogate-assisted feature extraction for high-throughput phenotyping. J. Am. Med. Inform. Assoc. 24, e143\u2013e149 (2017).","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"519_CR16","first-page":"48","volume":"48","author":"J Banda","year":"2017","unstructured":"Banda, J., Halpern, Y., Sontag, D. & Shah, N. Electronic phenotyping with APHRODITE and the Observational Health Sciences and Informatics (OHDSI) data network. AMIA Summits Transl. Sci. Proc 48, 48\u201357 (2017).","journal-title":"AMIA Summits Transl. Sci. Proc"},{"key":"519_CR17","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inform. Process. Syst. 3111\u20133119 (2013)."},{"key":"519_CR18","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2014.32","volume":"1","author":"S Finlayson","year":"2014","unstructured":"Finlayson, S., LePendu, P. & Shah, N. Building the graph of medicine from millions of clinical narratives. Sci. 
Data 1, 140032 (2014).","journal-title":"Sci. Data"},{"key":"519_CR19","doi-asserted-by":"crossref","unstructured":"Kartchner, D., Christensen, T., Humpherys, J. & Wade, S. Code2vec: Embedding and clustering medical diagnosis data. in 2017 IEEE International Conference on Healthcare Informatics (ICHI) 386\u2013390 (2017).","DOI":"10.1109\/ICHI.2017.94"},{"key":"519_CR20","first-page":"295","volume":"25","author":"A Beam","year":"2020","unstructured":"Beam, A. et al. Clinical concept embeddings learned from massive sources of multimodal medical data. Pac. Symp. Biocomput. 25, 295\u2013306 (2020).","journal-title":"Pac. Symp. Biocomput."},{"key":"519_CR21","doi-asserted-by":"publisher","first-page":"1495","DOI":"10.1145\/2939672.2939823","volume":"22","author":"E Choi","year":"2016","unstructured":"Choi, E. et al. Multi-layer representation learning for medical concepts. Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. 22, 1495\u20131504 (2016).","journal-title":"Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min."},{"key":"519_CR22","unstructured":"Choi, E., Schuetz, A., Stewart, W. & Sun, J. Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv preprint arXiv:1602.03686 (2016)."},{"key":"519_CR23","doi-asserted-by":"publisher","first-page":"362","DOI":"10.1093\/jamia\/ocw112","volume":"24","author":"E Choi","year":"2017","unstructured":"Choi, E., Schuetz, A., Stewart, W. & Sun, J. Using recurrent neural network models for early detection of heart failure onset. J. Am. Med. Inform. Assoc. 24, 362\u2013370 (2017).","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"519_CR24","first-page":"417","volume":"2016","author":"Y Choi","year":"2016","unstructured":"Choi, Y., Chiu, C. & Sontag, D. Learning low-dimensional representations of medical concepts. AMIA Summits Transl. Sci. Proc. 2016, 417\u2013428 (2016).","journal-title":"AMIA Summits Transl. Sci. 
Proc."},{"key":"519_CR25","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R. & Manning, C. D. (eds Moschitti, A., Pang, B., Daelemans, W.) Glove: Global vectors for word representation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). (Association for Computational Linguistics: 2014) 1532\u20131543.","DOI":"10.3115\/v1\/D14-1162"},{"key":"519_CR26","unstructured":"Smith, S. L., Turban, D. H. P., Hamblin, S. & Hammerla, N. Y. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In Proceedings of the Fifth International Conference on Learning Representations (ICLR) (2017)."},{"key":"519_CR27","doi-asserted-by":"crossref","unstructured":"Artetxe, M., Labaka, G. & Agirre, E. (eds Su, J., Duh, K., Carreras, X.) Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing. (Association for Computational Linguistics: 2016) 2289\u20132294.","DOI":"10.18653\/v1\/D16-1250"},{"key":"519_CR28","unstructured":"Bass, E., Ellis, P. & Golding, H. Comparing the costs of the veterans\u2019 health care system with private-sector costs. Congressional Budget Office. (2017)."},{"key":"519_CR29","doi-asserted-by":"publisher","first-page":"441","DOI":"10.1136\/amiajnl-2011-000116","volume":"18","author":"S Nelson","year":"2011","unstructured":"Nelson, S., Zeng, K., Kilbourne, J., Powell, T. & Moore, R. Normalized names for clinical drugs: RxNorm at 6 years. J. Am. Med. Inform. Assoc. 18, 441\u2013448 (2011).","journal-title":"J. Am. Med. Inform. Assoc."},{"key":"519_CR30","unstructured":"Goldberg, Y. & Levy, O. word2vec Explained: deriving Mikolov et al.\u2019s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014)."},{"key":"519_CR31","first-page":"2177","volume":"27","author":"O Levy","year":"2014","unstructured":"Levy, O. & Goldberg, Y. 
Neural word embedding as implicit matrix factorization. Adv. Neural Inf. Process. Syst. 27, 2177\u20132185 (2014).","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"519_CR32","doi-asserted-by":"publisher","first-page":"301","DOI":"10.1111\/j.1467-9868.2005.00503.x","volume":"67","author":"H Zou","year":"2005","unstructured":"Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67, 301\u2013320 (2005).","journal-title":"J. R. Stat. Soc. Ser. B"},{"key":"519_CR33","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1075\/li.30.1.03nad","volume":"30","author":"D Nadeau","year":"2007","unstructured":"Nadeau, D. & Sekine, S. A survey of named entity recognition and classification. Lingvisticae Investig. 30, 3\u201326 (2007).","journal-title":"Lingvisticae Investig."},{"key":"519_CR34","doi-asserted-by":"publisher","first-page":"3426","DOI":"10.1038\/s41596-019-0227-6","volume":"14","author":"Y Zhang","year":"2019","unstructured":"Zhang, Y. et al. High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP). Nat. Protoc. 14, 3426\u20133444 (2019).","journal-title":"Nat. Protoc."},{"key":"519_CR35","first-page":"548","volume":"92","author":"B Efron","year":"1997","unstructured":"Efron, B. & Tibshirani, R. Improvements on cross-validation: the 632+ bootstrap method. J. Am. Stat. Assoc. 92, 548\u2013560 (1997).","journal-title":"J. Am. Stat. Assoc."},{"key":"519_CR36","doi-asserted-by":"publisher","first-page":"1255","DOI":"10.1093\/jamia\/ocz066","volume":"26","author":"KP Liao","year":"2019","unstructured":"Liao, K. P. et al. High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. J. Am. Med. Inform. Assoc. 26, 1255\u20131262 (2019).","journal-title":"J. Am. Med. Inform. 
Assoc."}],"container-title":["npj Digital Medicine"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s41746-021-00519-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-021-00519-z","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s41746-021-00519-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,3]],"date-time":"2022-12-03T14:02:39Z","timestamp":1670076159000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s41746-021-00519-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,27]]},"references-count":36,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2021,12]]}},"alternative-id":["519"],"URL":"https:\/\/doi.org\/10.1038\/s41746-021-00519-z","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2021.03.13.21253486","asserted-by":"object"}]},"ISSN":["2398-6352"],"issn-type":[{"value":"2398-6352","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,10,27]]},"assertion":[{"value":"30 September 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 September 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 October 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}},{"value":"The study protocol was approved by the MGB Human Research Committee (IRB00010756). 
No patient contact occurred in this study which relied on secondary use of data allowing for waiver of informed consent as detailed by 45 CFR 46.116. These activities were approved through the VA Central IRB. They were supported by Million Veteran Program, VA Central IRB 10-02, and approved under VA Central IRB protocol 18\u201338.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics"}}],"article-number":"151"}}