{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T22:11:54Z","timestamp":1773871914237,"version":"3.50.1"},"reference-count":55,"publisher":"Oxford University Press (OUP)","issue":"12","license":[{"start":{"date-parts":[[2020,10,11]],"date-time":"2020-10-11T00:00:00Z","timestamp":1602374400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"DOI":"10.13039\/100015711","name":"Michigan Institute for Data Science","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100015711","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["IIS-1553146"],"award-info":[{"award-number":["IIS-1553146"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000050","name":"National Heart, Lung, and Blood Institute","doi-asserted-by":"publisher","award":["R25HL147207"],"award-info":[{"award-number":["R25HL147207"]}],"id":[{"id":"10.13039\/100000050","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000092","name":"National Library of Medicine","doi-asserted-by":"publisher","award":["R01LM013325"],"award-info":[{"award-number":["R01LM013325"]}],"id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100015711","name":"Michigan Institute for Data Science","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100015711","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"the National Science Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000050","name":"the National Heart, Lung and Blood Institute","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100000050","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000092","name":"National Library of Medicine","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,12,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objective<\/jats:title>\n                  <jats:p>In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied; such preprocessing requires substantial effort and can be labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the EHR.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>Largely data-driven, FIDDLE systematically transforms structured EHR data into feature vectors, limiting the number of decisions a user must make while incorporating good practices from the literature. To demonstrate its utility and flexibility, we conducted a proof-of-concept experiment in which we applied FIDDLE to 2 publicly available EHR data sets collected from intensive care units: MIMIC-III and the eICU Collaborative Research Database. We trained different ML models to predict 3 clinically important outcomes: in-hospital mortality, acute respiratory failure, and shock. We evaluated models using the area under the receiver operating characteristics curve (AUROC), and compared it to several baselines.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>Across tasks, FIDDLE extracted 2,528 to 7,403 features from MIMIC-III and eICU, respectively. On all tasks, FIDDLE-based models achieved good discriminative performance, with AUROCs of 0.757\u20130.886, comparable to the performance of MIMIC-Extract, a preprocessing pipeline designed specifically for MIMIC-III. Furthermore, our results showed that FIDDLE is generalizable across different prediction times, ML algorithms, and data sets, while being relatively robust to different settings of user-defined arguments.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Conclusions<\/jats:title>\n                  <jats:p>FIDDLE, an open-source preprocessing pipeline, facilitates applying ML to structured EHR data. By accelerating and standardizing labor-intensive preprocessing, FIDDLE can help stimulate progress in building clinically useful ML tools for EHR data.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocaa139","type":"journal-article","created":{"date-parts":[[2020,6,23]],"date-time":"2020-06-23T11:34:15Z","timestamp":1592912055000},"page":"1921-1934","source":"Crossref","is-referenced-by-count":77,"title":["Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data"],"prefix":"10.1093","volume":"27","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4213-2015","authenticated-orcid":false,"given":"Shengpu","family":"Tang","sequence":"first","affiliation":[{"name":"Department of Electrical Engineering and Computer Science, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, USA"}]},{"given":"Parmida","family":"Davarmanesh","sequence":"additional","affiliation":[{"name":"Department of Mathematics, University of Michigan, Ann Arbor, USA"}]},{"given":"Yanmeng","family":"Song","sequence":"additional","affiliation":[{"name":"Department of Statistics, University of Michigan, Ann Arbor, USA"}]},{"given":"Danai","family":"Koutra","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering and Computer Science, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, USA"}]},{"given":"Michael W","family":"Sjoding","sequence":"additional","affiliation":[{"name":"Department of Internal Medicine, University of Michigan, Ann Arbor, USA"},{"name":"Institution for Healthcare Policy & Innovation, University of Michigan, Ann Arbor, USA"},{"name":"Michigan Integrated Center for Health Analytics and Medical Prediction, University of Michigan, Ann Arbor, USA"},{"name":"Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA"}]},{"given":"Jenna","family":"Wiens","sequence":"additional","affiliation":[{"name":"Department of Electrical Engineering and Computer Science, Division of Computer Science and Engineering, University of Michigan, Ann Arbor, USA"},{"name":"Institution for Healthcare Policy & Innovation, University of Michigan, Ann Arbor, USA"},{"name":"Michigan Integrated Center for Health Analytics and Medical Prediction, University of Michigan, Ann Arbor, USA"}]}],"member":"286","published-online":{"date-parts":[[2020,10,11]]},"reference":[{"key":"2020121009243724300_ocaa139-B1","first-page":"467","article-title":"Patient risk stratification for hospital-associated C. diff as a time-series classification task","volume":"2012","author":"Wiens"},{"issue":"4","key":"2020121009243724300_ocaa139-B2","doi-asserted-by":"crossref","first-page":"425","DOI":"10.1017\/ice.2018.16","article-title":"A generalizable, data-driven approach to\u00a0predict daily risk of Clostridium difficile infection at two large academic\u00a0health centers","volume":"39","author":"Oh","year":"2018","journal-title":"Infect Control Hosp Epidemiol"},{"issue":"5","key":"2020121009243724300_ocaa139-B3","doi-asserted-by":"publisher","DOI":"10.1093\/ofid\/ofz186","article-title":"Using machine learning and the electronic health record to predict complicated Clostridium difficile infection","volume":"6","author":"Li","year":"2019","journal-title":"Open Forum Infect Dis"},{"issue":"3","key":"2020121009243724300_ocaa139-B4","doi-asserted-by":"crossref","first-page":"e28","DOI":"10.2196\/medinform.5909","article-title":"Prediction of sepsis in the intensive care unit with minimal electronic health record data: a machine learning approach","volume":"4","author":"Desautels","year":"2016","journal-title":"JMIR Med Inform"},{"issue":"299","key":"2020121009243724300_ocaa139-B5","doi-asserted-by":"crossref","first-page":"299ra122","DOI":"10.1126\/scitranslmed.aab3719","article-title":"A targeted real-time early warning score (TREWScore) for septic shock","volume":"7","author":"Henry","year":"2015","journal-title":"Sci Transl Med"},{"issue":"3","key":"2020121009243724300_ocaa139-B6","doi-asserted-by":"crossref","first-page":"e0214465","DOI":"10.1371\/journal.pone.0214465","article-title":"Machine learning for patient risk stratification for acute respiratory distress syndrome","volume":"14","author":"Zeiberg","year":"2019","journal-title":"PLOS One"},{"issue":"7","key":"2020121009243724300_ocaa139-B7","doi-asserted-by":"crossref","first-page":"1070","DOI":"10.1097\/CCM.0000000000003123","article-title":"The development of a machine learning inpatient acute kidney injury prediction model","volume":"46","author":"Koyner","year":"2018","journal-title":"Crit Care Med"},{"issue":"7767","key":"2020121009243724300_ocaa139-B8","doi-asserted-by":"crossref","first-page":"116","DOI":"10.1038\/s41586-019-1390-1","article-title":"A clinically applicable approach to continuous prediction of future acute kidney injury","volume":"572","author":"Toma\u0161ev","year":"2019","journal-title":"Nature"},{"issue":"1","key":"2020121009243724300_ocaa139-B9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/srep26094","article-title":"Deep patient: an unsupervised representation to predict the future of patients from the electronic health records","volume":"6","author":"Miotto","year":"2016","journal-title":"Sci Rep"},{"key":"2020121009243724300_ocaa139-B10","first-page":"245","article-title":"Predicting in-hospital mortality of ICU patients: the PhysioNet\/computing in cardiology challenge 2012","volume":"39","author":"Silva","year":"2010","journal-title":"Comput Cardiol"},{"issue":"1","key":"2020121009243724300_ocaa139-B11","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1038\/s41597-019-0103-9","article-title":"Multitask learning and benchmarking with clinical time series data","volume":"6","author":"Harutyunyan","year":"2019","journal-title":"Sci Data"},{"key":"2020121009243724300_ocaa139-B12","doi-asserted-by":"crossref","first-page":"112","DOI":"10.1016\/j.jbi.2018.04.007","article-title":"Benchmarking deep learning models on large healthcare datasets","volume":"83","author":"Purushotham","year":"2018","journal-title":"J Biomed Inform"},{"key":"2020121009243724300_ocaa139-B13","author":"Wang"},{"issue":"1","key":"2020121009243724300_ocaa139-B14","doi-asserted-by":"crossref","first-page":"160035","DOI":"10.1038\/sdata.2016.35","article-title":"MIMIC-III, a freely accessible critical care database","volume":"3","author":"Johnson","year":"2016","journal-title":"Sci Data"},{"issue":"1","key":"2020121009243724300_ocaa139-B15","doi-asserted-by":"crossref","first-page":"180178","DOI":"10.1038\/sdata.2018.178","article-title":"The eICU Collaborative Research Database, a freely available multi-center database for critical care research","volume":"5","author":"Pollard","year":"2018","journal-title":"Sci Data"},{"key":"2020121009243724300_ocaa139-B16","author":"Fiterau","year":"18\u201319, 2017; , ."},{"issue":"6","key":"2020121009243724300_ocaa139-B17","doi-asserted-by":"crossref","first-page":"S106","DOI":"10.1097\/MLR.0b013e3181de9e17","article-title":"Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches","volume":"48","author":"Wu","year":"2010","journal-title":"Med Care"},{"key":"2020121009243724300_ocaa139-B18","doi-asserted-by":"crossref","DOI":"10.4135\/9781412985628","volume-title":"Regression with dummy variables","author":"Hardy","year":"1993"},{"issue":"2","key":"2020121009243724300_ocaa139-B19","doi-asserted-by":"crossref","first-page":"295","DOI":"10.2307\/2528036","article-title":"The effectiveness of adjustment by subclassification in removing bias in observational studies","volume":"24","author":"Cochran","year":"1968","journal-title":"Biometrics"},{"issue":"23","key":"2020121009243724300_ocaa139-B20","doi-asserted-by":"crossref","first-page":"4124","DOI":"10.1002\/sim.6986","article-title":"Quantifying the impact of different approaches for handling continuous predictors on the performance of a prognostic model","volume":"35","author":"Collins","year":"2016","journal-title":"Stat Med"},{"issue":"30","key":"2020121009243724300_ocaa139-B21","first-page":"227","article-title":"World Health Organization. The ICD-10 classification of mental and behavioural disorders: clinical descriptions and diagnostic guidelines","volume":"67","year":"1992","journal-title":"Wkly Epidemiol Rec"},{"key":"2020121009243724300_ocaa139-B22","author":"Zhang","year":"2008"},{"key":"2020121009243724300_ocaa139-B23","author":"Sherman","year":"4\u20138, 2017; ,"},{"key":"2020121009243724300_ocaa139-B24","volume-title":"Statistical analysis with missing data","author":"Little","year":"1987"},{"key":"2020121009243724300_ocaa139-B25","author":"Nemati"},{"key":"2020121009243724300_ocaa139-B26","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.resuscitation.2016.02.005","article-title":"The value of vital sign trends for detecting clinical deterioration on the wards","volume":"102","author":"Churpek","year":"2016","journal-title":"Resuscitation"},{"issue":"1","key":"2020121009243724300_ocaa139-B27","doi-asserted-by":"crossref","first-page":"6085","DOI":"10.1038\/s41598-018-24271-9","article-title":"Recurrent neural networks for multivariate time series with missing values","volume":"8","author":"Che","year":"2018","journal-title":"Sci Rep"},{"issue":"3","key":"2020121009243724300_ocaa139-B28","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1093\/biomet\/63.3.581","article-title":"Inference and missing data","volume":"63","author":"Rubin","year":"1976","journal-title":"Biometrika"},{"issue":"5","key":"2020121009243724300_ocaa139-B29","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v028.i05","article-title":"Building predictive models in R using the caret package","volume":"28","author":"Kuhn","year":"2008","journal-title":"J Stat Soft"},{"key":"2020121009243724300_ocaa139-B30","first-page":"37","article-title":"Feature selection for classification: a review","author":"Tang","year":"2014","journal-title":"Data Classification: Algorithms and Applications"},{"key":"2020121009243724300_ocaa139-B31","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1016\/j.ymeth.2016.08.014","article-title":"Feature selection methods for big data bioinformatics: a survey from the search perspective","volume":"111","author":"Wang","year":"2016","journal-title":"Methods"},{"issue":"3","key":"2020121009243724300_ocaa139-B32","doi-asserted-by":"crossref","first-page":"301","DOI":"10.1109\/34.990133","article-title":"Unsupervised feature selection using feature similarity","volume":"24","author":"Mitra","year":"2002","journal-title":"IEEE Trans Pattern Anal Machine Intell"},{"key":"2020121009243724300_ocaa139-B33","first-page":"1205","article-title":"Efficient feature selection via analysis of relevance and redundancy","volume":"5","author":"Yu","year":"2004","journal-title":"J Mach Learn Res"},{"issue":"2","key":"2020121009243724300_ocaa139-B34","doi-asserted-by":"crossref","first-page":"907","DOI":"10.1007\/s10462-019-09682-y","article-title":"A review of unsupervised feature selection methods","volume":"53","author":"Solorio-Fern\u00e1ndez","year":"2020","journal-title":"Artif Intell Rev"},{"key":"2020121009243724300_ocaa139-B35","author":"Oh"},{"key":"2020121009243724300_ocaa139-B36","author":"Zhang","year":"26\u201328, 2020."},{"issue":"5p2","key":"2020121009243724300_ocaa139-B37","doi-asserted-by":"crossref","first-page":"1620","DOI":"10.1111\/j.1475-6773.2005.00444.x","article-title":"Measuring diagnoses: ICD code accuracy","volume":"40","author":"O'Malley","year":"2005","journal-title":"Health Serv Res"},{"issue":"9","key":"2020121009243724300_ocaa139-B38","doi-asserted-by":"crossref","first-page":"1337","DOI":"10.1038\/s41591-019-0548-6","article-title":"Do no harm: a roadmap for responsible machine learning for health care","volume":"25","author":"Wiens","year":"2019","journal-title":"Nat Med"},{"issue":"1","key":"2020121009243724300_ocaa139-B39","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1038\/s41746-018-0029-1","article-title":"Scalable and accurate deep learning with electronic health records","volume":"1","author":"Rajkomar","year":"2018","journal-title":"NPJ Digit Med"},{"key":"2020121009243724300_ocaa139-B40","first-page":"281","article-title":"Random search for hyper-parameter optimization","volume":"13","author":"Bergstra","year":"2012","journal-title":"J Machine Learn Res"},{"issue":"2","key":"2020121009243724300_ocaa139-B41","doi-asserted-by":"crossref","first-page":"286","DOI":"10.1080\/15374410902740411","article-title":"Introduction to permutation and resampling-based hypothesis tests","volume":"38","author":"LaFleur","year":"2009","journal-title":"J Clin Child Adolesc Psychol"},{"issue":"3","key":"2020121009243724300_ocaa139-B42","doi-asserted-by":"crossref","first-page":"751","DOI":"10.1093\/biomet\/73.3.751","article-title":"An improved Bonferroni procedure for multiple tests of significance","volume":"73","author":"Simes","year":"1986","journal-title":"Biometrika"},{"key":"2020121009243724300_ocaa139-B43"},{"key":"2020121009243724300_ocaa139-B44","first-page":"2825","article-title":"Scikit-learn: machine learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J Mach Learn Res"},{"key":"2020121009243724300_ocaa139-B45","author":"Paszke"},{"key":"2020121009243724300_ocaa139-B46","doi-asserted-by":"crossref","first-page":"295","DOI":"10.1007\/978-3-319-43742-2_20","article-title":"Mortality prediction in the ICU based on MIMIC-II results from the super ICU learner algorithm (SICULA) project","author":"Pirracchio","year":"2016","journal-title":"Secondary Analysis of Electronic Health Records"},{"key":"2020121009243724300_ocaa139-B47","author":"Johnson","year":"18\u201319, 2017; ,"},{"issue":"3","key":"2020121009243724300_ocaa139-B48","doi-asserted-by":"crossref","first-page":"826","DOI":"10.4338\/ACI-2017-03-CR-0046","article-title":"Barriers to achieving economies of scale in analysis of EHR data","volume":"8","author":"Sendak","year":"2017","journal-title":"Appl Clin Inform"},{"key":"2020121009243724300_ocaa139-B49","author":"Bender","year":"20\u201322, 2013; ,"},{"issue":"9","key":"2020121009243724300_ocaa139-B50","doi-asserted-by":"crossref","first-page":"600","DOI":"10.7326\/0003-4819-153-9-201011020-00010","article-title":"Advancing the science for active surveillance: rationale and design for the observational medical outcomes partnership","volume":"153","author":"Stang","year":"2010","journal-title":"Ann Intern Med"},{"issue":"4","key":"2020121009243724300_ocaa139-B51","doi-asserted-by":"crossref","first-page":"578","DOI":"10.1136\/amiajnl-2014-002747","article-title":"Launching PCORnet, a national patient-centered clinical research network","volume":"21","author":"Fleurence","year":"2014","journal-title":"J Am Med Inform Assoc"},{"issue":"4","key":"2020121009243724300_ocaa139-B52","doi-asserted-by":"crossref","first-page":"699","DOI":"10.1136\/amiajnl-2013-002162","article-title":"A study in transfer learning: leveraging data from multiple hospitals to enhance hospital-specific predictions","volume":"21","author":"Wiens","year":"2014","journal-title":"J Am Med Inform Assoc"},{"issue":"1","key":"2020121009243724300_ocaa139-B53","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1093\/cid\/cix731","article-title":"Machine learning for healthcare: on the verge of a major shift in healthcare epidemiology","volume":"66","author":"Wiens","year":"2018","journal-title":"Clin Infect Dis"},{"key":"2020121009243724300_ocaa139-B54","author":"Sculley","year":"2014"},{"issue":"1","key":"2020121009243724300_ocaa139-B55","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1016\/j.compeleceng.2013.11.024","article-title":"A survey on feature selection methods","volume":"40","author":"Chandrashekar","year":"2014","journal-title":"Comput Electr Eng"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/27\/12\/1921\/34838612\/ocaa139.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/27\/12\/1921\/34838612\/ocaa139.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,12,10]],"date-time":"2020-12-10T14:39:45Z","timestamp":1607611185000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/27\/12\/1921\/5920826"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,11]]},"references-count":55,"journal-issue":{"issue":"12","published-online":{"date-parts":[[2020,10,11]]},"published-print":{"date-parts":[[2020,12,9]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocaa139","relation":{},"ISSN":["1527-974X"],"issn-type":[{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,12]]},"published":{"date-parts":[[2020,10,11]]}}}