{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T09:59:45Z","timestamp":1777543185734,"version":"3.51.4"},"reference-count":35,"publisher":"Oxford University Press (OUP)","issue":"e1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2016,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Objective Enormous amounts of healthcare data are becoming increasingly accessible through the large-scale adoption of electronic health records. In this work, structured and unstructured (textual) data are combined to assign clinical diagnostic and procedural codes (specifically ICD-9-CM) to patient stays. We investigate whether integrating these heterogeneous data types improves prediction strength compared to using the data types in isolation.<\/jats:p><jats:p>Methods Two separate data integration approaches were evaluated. Early data integration combines features of several sources within a single model, and late data integration learns a separate model per data source and combines these predictions with a meta-learner. This is evaluated on data sources and clinical codes from a broad set of medical specialties.<\/jats:p><jats:p>Results When compared with the best individual prediction source, late data integration leads to improvements in predictive power (eg, overall F-measure increased from 30.6% to 38.3% for International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) diagnostic codes), while early data integration is less consistent. The predictive strength strongly differs between medical specialties, both for ICD-9-CM diagnostic and procedural codes.<\/jats:p><jats:p>Discussion Structured data provides complementary information to unstructured data (and vice versa) for predicting ICD-9-CM codes. This can be captured most effectively by the proposed late data integration approach.<\/jats:p><jats:p>Conclusions We demonstrated that models using multiple electronic health record data sources systematically outperform models using data sources in isolation in the task of predicting ICD-9-CM codes over a broad range of medical specialties.<\/jats:p>","DOI":"10.1093\/jamia\/ocv115","type":"journal-article","created":{"date-parts":[[2015,8,28]],"date-time":"2015-08-28T02:09:29Z","timestamp":1440727769000},"page":"e11-e19","source":"Crossref","is-referenced-by-count":64,"title":["Data integration of structured and unstructured sources for assigning clinical codes to patient stays"],"prefix":"10.1093","volume":"23","author":[{"given":"Elyne","family":"Scheurwegs","sequence":"first","affiliation":[{"name":"ADReM (Advanced Database Research and Modelling), Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp, Antwerp, Belgium"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kim","family":"Luyckx","sequence":"additional","affiliation":[{"name":"Department of Medical Information, Antwerp University Hospital, Antwerp, Belgium"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"L\u00e9on","family":"Luyten","sequence":"additional","affiliation":[{"name":"Department of Medical Information, Antwerp University Hospital, Antwerp, Belgium"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Walter","family":"Daelemans","sequence":"additional","affiliation":[{"name":"Computational Linguistics and Psycholinguistics (CLiPS) Research Center, University of Antwerp, Antwerp, Belgium"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tim","family":"Van den Bulcke","sequence":"additional","affiliation":[{"name":"Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp - Antwerp University Hospital, Belgium; ADReM (Advanced Database Research and Modelling), University of Antwerp, Antwerp, Belgium"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2015,8,27]]},"reference":[{"key":"2020110612360247400_ocv115-B1","volume-title":"Use and Characteristics of Electronic Health Record Systems Among Office-Based Physician Practices, United States, 2001-2012","author":"Hsiao"},{"issue":"10","key":"2020110612360247400_ocv115-B2","doi-asserted-by":"crossref","first-page":"991","DOI":"10.1001\/jama.2013.890","article-title":"Improving the electronic health record\u2014are clinicians getting what they wished for?","volume":"309","author":"Cimino","year":"2013","journal-title":"JAMA."},{"key":"2020110612360247400_ocv115-B3","author":"WHO"},{"key":"2020110612360247400_ocv115-B4","author":"WHO"},{"key":"2020110612360247400_ocv115-B5","author":"WHO."},{"issue":"5","key":"2020110612360247400_ocv115-B6","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1089\/sur.2013.089","article-title":"Validity of International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) Screening for Sepsis in Surgical Mortalities","volume":"15","author":"Ramanathan","year":"2014","journal-title":"Surg Infect."},{"issue":"16","key":"2020110612360247400_ocv115-B7","doi-asserted-by":"crossref","first-page":"2375","DOI":"10.1093\/bioinformatics\/btu197","article-title":"R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment","volume":"30","author":"Carroll","year":"2014","journal-title":"Bioinformatics."},{"key":"2020110612360247400_ocv115-B8","article-title":"Three Approaches to Automatic Assignment of ICD-9-CM Codes to Radiology Reports","volume":"279","author":"Goldstein","year":"2007","journal-title":"AMIA Ann Symp Proc."},{"key":"2020110612360247400_ocv115-B9","doi-asserted-by":"crossref","first-page":"197","DOI":"10.1007\/978-3-642-14770-8_23","article-title":"Symbolic classification methods for patient discharge summaries encoding into ICD","volume-title":"Advances in Natural Language Processing","author":"Kevers","year":"2010"},{"issue":"5","key":"2020110612360247400_ocv115-B10","first-page":"516","article-title":"Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques","volume":"13","author":"Pakhomov","year":"2006","journal-title":"JAMIA."},{"issue":"3","key":"2020110612360247400_ocv115-B11","doi-asserted-by":"crossref","first-page":"S10","DOI":"10.1186\/1471-2105-9-S3-S10","article-title":"Automatic construction of rule-based ICD-9-CM coding systems","volume":"9","author":"Farkas","year":"2008","journal-title":"BMC Bioinformatics."},{"key":"2020110612360247400_ocv115-B12","first-page":"97","article-title":"A shared task involving multi-label classification of clinical free text","author":"Pestian","year":"2007","journal-title":"Assoc Computational Linguistics, In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing"},{"issue":"6","key":"2020110612360247400_ocv115-B13","first-page":"646","article-title":"A systematic literature review of automated clinical coding and classification systems","volume":"17","author":"Stanfill","year":"2010","journal-title":"JAMIA."},{"issue":"2","key":"2020110612360247400_ocv115-B14","first-page":"231","article-title":"Diagnosis code assignment: models and evaluation metrics","volume":"21","author":"Perotte","year":"2014","journal-title":"JAMIA."},{"issue":"5","key":"2020110612360247400_ocv115-B15","doi-asserted-by":"crossref","first-page":"952","DOI":"10.1097\/CCM.0b013e31820a92c6","article-title":"Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public-access intensive care unit database","volume":"39","author":"Saeed","year":"2011","journal-title":"Crit Care Med."},{"issue":"5","key":"2020110612360247400_ocv115-B16","first-page":"801","article-title":"Combining structured and unstructured data to identify a cohort of ICU patients who received dialysis","volume":"21","author":"Abhyankar","year":"2014","journal-title":"JAMIA."},{"issue":"e2","key":"2020110612360247400_ocv115-B17","first-page":"e341","article-title":"Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium","volume":"20","author":"Pathak","year":"2013","journal-title":"JAMIA."},{"key":"2020110612360247400_ocv115-B18","article-title":"DISEASES: Text mining and data integration of disease--gene associations","author":"Pletscher-Frankild","journal-title":"Methods."},{"issue":"4","key":"2020110612360247400_ocv115-B19","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1186\/gm39","article-title":"A kernel-based integration of genome-wide data for clinical decision support","volume":"1","author":"Daemen","year":"2009","journal-title":"Genome Med."},{"issue":"4","key":"2020110612360247400_ocv115-B20","doi-asserted-by":"crossref","first-page":"247","DOI":"10.1108\/eb026457","article-title":"The derivation and application of the Bradford-Zipf distribution","volume":"24","author":"Brookes","year":"1968","journal-title":"J Document."},{"key":"2020110612360247400_ocv115-B21","author":"WHO"},{"key":"2020110612360247400_ocv115-B22","author":"BDSP"},{"key":"2020110612360247400_ocv115-B23","author":"RIZIV"},{"issue":"1","key":"2020110612360247400_ocv115-B24","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1111\/j.1749-4486.2008.01863.x","article-title":"A multidisciplinary audit of clinical coding accuracy in otolaryngology: financial, managerial and clinical governance considerations under payment-by-results","volume":"34","author":"Nouraei","year":"2009","journal-title":"Clin Otolaryngol."},{"key":"2020110612360247400_ocv115-B25","first-page":"191","article-title":"An efficient memory-based morphosyntactic Tagger and Parser for Dutch","volume":"7","author":"Bosch AVd, Busser","year":"2007","journal-title":"LOT Occasional Series."},{"key":"2020110612360247400_ocv115-B26","author":"McCallum"},{"issue":"1","key":"2020110612360247400_ocv115-B27","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach Learn."},{"issue":"1","key":"2020110612360247400_ocv115-B28","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1145\/1656274.1656278","article-title":"The WEKA Data Mining Software: an Update","volume":"11","author":"Hall","year":"2009","journal-title":"SIGKDD Explorations."},{"issue":"5","key":"2020110612360247400_ocv115-B29","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","article-title":"Term weighting approaches in automatic text retrieval","volume":"24","author":"Salton","year":"1988","journal-title":"Inform Process Manag."},{"issue":"8","key":"2020110612360247400_ocv115-B30","doi-asserted-by":"crossref","first-page":"1226","DOI":"10.1109\/TPAMI.2005.159","article-title":"Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy","volume":"27","author":"Peng","year":"2005","journal-title":"Pattern Analysis and Machine Intelligence, IEEE Transactions on"},{"key":"2020110612360247400_ocv115-B31","doi-asserted-by":"crossref","DOI":"10.4061\/2009\/869093","article-title":"Data integration in genetics and genomics: methods and challenges","author":"Hamid","year":"2009","journal-title":"Hum Genom Proteomics."},{"issue":"4","key":"2020110612360247400_ocv115-B32","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1007\/BF00994110","article-title":"A Bayesian method for the induction of probabilistic networks from data","volume":"9","author":"Cooper","year":"1992","journal-title":"Mach Learn."},{"key":"2020110612360247400_ocv115-B33","author":"Jackson"},{"key":"2020110612360247400_ocv115-B34","doi-asserted-by":"crossref","first-page":"785","DOI":"10.1097\/00005650-200008000-00003","article-title":"Identification of in-hospital complications from claims data: is it valid?","volume":"38","author":"Lawthers","year":"2000","journal-title":"Med Care."},{"key":"2020110612360247400_ocv115-B35","doi-asserted-by":"crossref","first-page":"856","DOI":"10.1097\/00005650-200210000-00004","article-title":"Can administrative data be used to compare postoperative complication rates across hospitals?","volume":"40","author":"Romano","year":"2002","journal-title":"Med Care."}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/23\/e1\/e11\/34147958\/ocv115.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/23\/e1\/e11\/34147958\/ocv115.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,5,21]],"date-time":"2022-05-21T01:39:32Z","timestamp":1653097172000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/23\/e1\/e11\/2379791"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,8,27]]},"references-count":35,"journal-issue":{"issue":"e1","published-online":{"date-parts":[[2015,8,27]]},"published-print":{"date-parts":[[2016,4,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocv115","relation":{},"ISSN":["1527-974X","1067-5027"],"issn-type":[{"value":"1527-974X","type":"electronic"},{"value":"1067-5027","type":"print"}],"subject":[],"published-other":{"date-parts":[[2016,4]]},"published":{"date-parts":[[2015,8,27]]}}}