{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T01:44:40Z","timestamp":1769910280822,"version":"3.49.0"},"reference-count":80,"publisher":"Georg Thieme Verlag KG","issue":"04","funder":[{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["SFRH\/ BDE\/51605\/2011"],"award-info":[{"award-number":["SFRH\/ BDE\/51605\/2011"]}]},{"name":"Siemens Healthcare and the Centre for Management Studies of Instituto Superior T\u00e9cnico (CEG-IST, University of Lisbon)"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Appl Clin Inform"],"published-print":{"date-parts":[[2016,10]]},"abstract":"<jats:title>Summary<\/jats:title><jats:p>Background EHR systems have high potential to improve healthcare delivery and management. Although structured EHR data generates information in machine-readable formats, their use for decision support still poses technical challenges for researchers due to the need to preprocess and convert data into a matrix format. During our research, we observed that clinical informatics literature does not provide guidance for researchers on how to build this matrix while avoiding potential pitfalls.<\/jats:p><jats:p>Objectives This article aims to provide researchers a roadmap of the main technical challenges of preprocessing structured EHR data and possible strategies to overcome them.<\/jats:p><jats:p>Methods Along standard data processing stages \u2013 extracting database entries, defining features, processing data, assessing feature values and integrating data elements, within an EDPAI framework \u2013, we identified the main challenges faced by researchers and reflect on how to address those challenges based on lessons learned from our research experience and on best practices from related literature. 
We highlight the main potential sources of error, present strategies to approach those challenges and discuss implications of these strategies.<\/jats:p><jats:p>Results Following the EDPAI framework, researchers face five key challenges: (1) gathering and integrating data, (2) identifying and handling different feature types, (3) combining features to handle redundancy and granularity, (4) addressing data missingness, and (5) handling multiple feature values. Strategies to address these challenges include: crosschecking identifiers for robust data retrieval and integration; applying clinical knowledge in identifying feature types, in addressing redundancy and granularity, and in accommodating multiple feature values; and investigating missing patterns adequately.<\/jats:p><jats:p>Conclusions This article contributes to literature by providing a roadmap to inform structured EHR data preprocessing. It may advise researchers on potential pitfalls and implications of methodological decisions in handling structured data, so as to avoid biases and help realize the benefits of the secondary use of EHR data.<\/jats:p><jats:p>Citation: Ferr\u00e3o JC, Oliveira MD, Janela F, Martins HMG. 
Preprocessing structured clinical data for predictive modeling and decision support \u2013 a roadmap to tackle the challenges.<\/jats:p>","DOI":"10.4338\/aci-2016-03-soa-0035","type":"journal-article","created":{"date-parts":[[2016,12,7]],"date-time":"2016-12-07T03:06:08Z","timestamp":1481079968000},"page":"1135-1153","source":"Crossref","is-referenced-by-count":28,"title":["Preprocessing structured clinical data for predictive modeling and decision support"],"prefix":"10.4338","volume":"07","author":[{"given":"M\u00f3nica","family":"Oliveira","sequence":"first","affiliation":[]},{"given":"Filipe","family":"Janela","sequence":"first","affiliation":[]},{"given":"Henrique","family":"Martins","sequence":"first","affiliation":[]},{"given":"Jos\u00e9","family":"Ferr\u00e3o","sequence":"additional","affiliation":[]}],"member":"194","published-online":{"date-parts":[[2017,12,18]]},"reference":[{"key":"10.4338\/ACI-2016-03-SOA-0035-1","doi-asserted-by":"publisher","DOI":"10.1136\/amiajnl-2013-002117"},{"key":"10.4338\/ACI-2016-03-SOA-0035-2","doi-asserted-by":"publisher","DOI":"10.1056\/NEJMp1401111"},{"key":"10.4338\/ACI-2016-03-SOA-0035-3","doi-asserted-by":"publisher","DOI":"10.1197\/jamia.M2273"},{"key":"10.4338\/ACI-2016-03-SOA-0035-4","doi-asserted-by":"publisher","DOI":"10.1097\/MLR.0b013e3181de9e17"},{"key":"10.4338\/ACI-2016-03-SOA-0035-5","doi-asserted-by":"crossref","unstructured":"Berner ES. Clinical Decision Support Systems. 2nded. New York: Springer; 2007","DOI":"10.1007\/978-0-387-38319-4"},{"key":"10.4338\/ACI-2016-03-SOA-0035-6","doi-asserted-by":"publisher","DOI":"10.1016\/j.artmed.2007.04.005"},{"key":"10.4338\/ACI-2016-03-SOA-0035-7","doi-asserted-by":"crossref","unstructured":"Carter EM, Potts HWW. Predicting length of stay from an electronic patient record system: a primary total knee replacement example. 
BMC Med Inform Decis Mak 2014; 14(26).","DOI":"10.1186\/1472-6947-14-26"},{"key":"10.4338\/ACI-2016-03-SOA-0035-8","doi-asserted-by":"publisher","DOI":"10.7326\/0003-4819-144-10-200605160-00125"},{"key":"10.4338\/ACI-2016-03-SOA-0035-9","doi-asserted-by":"publisher","DOI":"10.1197\/jamia.M2334"},{"issue":"1","key":"10.4338\/ACI-2016-03-SOA-0035-10","doi-asserted-by":"crossref","first-page":"38","DOI":"10.3414\/ME9132","volume":"48","author":"Prokosch","year":"2009","journal-title":"Methods Inf Med"},{"key":"10.4338\/ACI-2016-03-SOA-0035-11","doi-asserted-by":"publisher","DOI":"10.1016\/S0933-3657(02)00049-0"},{"key":"10.4338\/ACI-2016-03-SOA-0035-12","unstructured":"Lin JH, Haug PJ. Data preparation framework for preprocessing clinical data in data mining. Proceedings of AMIA Annu Symp; 2006 Nov 11-15; Washington DC, USA. 2006. p. 489-93"},{"key":"10.4338\/ACI-2016-03-SOA-0035-13","first-page":"249","volume":"31","author":"Kotsiantis","year":"2007","journal-title":"Informatica"},{"key":"10.4338\/ACI-2016-03-SOA-0035-14","doi-asserted-by":"publisher","DOI":"10.1001\/jama.1988.03720230043028"},{"issue":"Suppl. 1","key":"10.4338\/ACI-2016-03-SOA-0035-15","first-page":"1","volume":"48","author":"Iavindrasana","year":"2009","journal-title":"Yearb Med Inform"},{"key":"10.4338\/ACI-2016-03-SOA-0035-16","unstructured":"Hand DJ, Mannila H, Smyth P. Principles of Data Mining. 3rd edition. Cambridge, USA: MIT Press; 2001"},{"key":"10.4338\/ACI-2016-03-SOA-0035-17","unstructured":"International Organization For Standardization. ISO\/TR 20514 Electronic health record - Definition, scope and context. 2005. doi:ISO\/TR 20514:2005(E)"},{"key":"10.4338\/ACI-2016-03-SOA-0035-18","unstructured":"International Organization For Standardization. ISO 18308 - Health informatics - Requirements for an electronic health record architecture. 2011"},{"key":"10.4338\/ACI-2016-03-SOA-0035-19","unstructured":"International Organization For Standardization. 
ISO 21090 - Health informatics - Harmonized data types for information interchange. 2011"},{"key":"10.4338\/ACI-2016-03-SOA-0035-20","unstructured":"International Organization For Standardization. ISO\/EN 13606 - Health Informatics - Electronic Health Record Communication. 2010"},{"issue":"Pt 1","key":"10.4338\/ACI-2016-03-SOA-0035-21","first-page":"161","volume":"160","author":"Santos","year":"2010","journal-title":"Stud Health Technol Inform"},{"key":"10.4338\/ACI-2016-03-SOA-0035-22","doi-asserted-by":"publisher","DOI":"10.1197\/jamia.M1888"},{"key":"10.4338\/ACI-2016-03-SOA-0035-23","unstructured":"Beale T, Heard S. OpenEHR Architecture Overview. 2006"},{"key":"10.4338\/ACI-2016-03-SOA-0035-24","unstructured":"Atzeni P, De Antonellis V. Relational database theory. Redwood City, USA: Benjamin-Cummings Publishing; 1993"},{"key":"10.4338\/ACI-2016-03-SOA-0035-25","doi-asserted-by":"publisher","DOI":"10.1016\/j.cmpb.2012.10.018"},{"key":"10.4338\/ACI-2016-03-SOA-0035-26","doi-asserted-by":"publisher","DOI":"10.1145\/1978915.1978919"},{"key":"10.4338\/ACI-2016-03-SOA-0035-27","doi-asserted-by":"crossref","unstructured":"Stalidis G, Prentza A, Vlachos IN, Maglavera S, Koutsouris D. Medical support system for continuation of care based on XML web technology. Int J Med Inform 2001; 64(2-3): 385-400.","DOI":"10.1016\/S1386-5056(01)00195-2"},{"key":"10.4338\/ACI-2016-03-SOA-0035-28","doi-asserted-by":"crossref","unstructured":"Catley C, Frize M. A prototype XML-based implementation of an integrated \u201cintelligent\u201d neonatal intensive care unit. Proceedings of the 4th Int IEEE EMBS Spec Top Conf Inf Technol Appl Biomed; Apr 24-26 2003; Birmingham, UK. 2003. p. 322-325.","DOI":"10.1109\/ITAB.2003.1222543"},{"key":"10.4338\/ACI-2016-03-SOA-0035-29","unstructured":"Gainer V, Hackett K, Mendis M, Kuttan R, Pan W, Phillips LC, et al. Using the i2b2 hive for clinical discovery: an example. Proceedings of AMIA Annu Symp; 2007 Nov 10-14; Chicago, USA. 2007. p. 
959"},{"key":"10.4338\/ACI-2016-03-SOA-0035-30","doi-asserted-by":"publisher","DOI":"10.1136\/jamia.2009.000893"},{"key":"10.4338\/ACI-2016-03-SOA-0035-31","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2012.01.009"},{"key":"10.4338\/ACI-2016-03-SOA-0035-32","unstructured":"Chute CG, Pathak J, Savova GK, Bailey KR, Schor MI, Hart LA, et al. The SHARPn project on secondary use of Electronic Medical Record data: progress, plans, and possibilities. Proceedings of AMIA Annu Symp; 2011 Oct 22-26; Washington DC, USA. 2011. p. 248-56"},{"key":"10.4338\/ACI-2016-03-SOA-0035-33","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2014.10.006"},{"key":"10.4338\/ACI-2016-03-SOA-0035-34","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2011.07.007"},{"key":"10.4338\/ACI-2016-03-SOA-0035-35","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2014.02.003"},{"key":"10.4338\/ACI-2016-03-SOA-0035-36","doi-asserted-by":"publisher","DOI":"10.1197\/jamia.M2522"},{"key":"10.4338\/ACI-2016-03-SOA-0035-37","unstructured":"Bradshaw RL, Matney S, Livne OE, et al. Architecture of a federated query engine for heterogeneous resources. Proceedings of AMIA Annu Symp; 2009 Nov 14-18; San Francisco, USA. 2009. p. 70-4"},{"key":"10.4338\/ACI-2016-03-SOA-0035-38","doi-asserted-by":"crossref","unstructured":"Tsoumakas G, Katakis I, Vlahavas I. Mining Multi-label Data. In: Maimon O, Rokach L, editors. Data Mining and Knowledge Discovery Handbook. New York: Springer; 2010. p. 667-685","DOI":"10.1007\/978-0-387-09823-4_34"},{"key":"10.4338\/ACI-2016-03-SOA-0035-39","doi-asserted-by":"crossref","unstructured":"Wu L, Barash G, Bartolini C. A Service-oriented Architecture for Business Intelligence. Proceedings of the IEEE Int Conf Serv Comput Appl (SOCA); 2007 Jun 19-20; Newport Beach, USA. 2007. p. 
279-285.","DOI":"10.1109\/SOCA.2007.6"},{"key":"10.4338\/ACI-2016-03-SOA-0035-40","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2013.12.012"},{"key":"10.4338\/ACI-2016-03-SOA-0035-41","doi-asserted-by":"crossref","unstructured":"Pietka E. Large-Scale Hospital Information System in clinical practice. Int Congr Ser 2003; 1256: 843-848.","DOI":"10.1016\/S0531-5131(03)00458-8"},{"issue":"9","key":"10.4338\/ACI-2016-03-SOA-0035-42","first-page":"73","volume":"82","author":"AHIMA","year":"2008","journal-title":"J AHIMA"},{"key":"10.4338\/ACI-2016-03-SOA-0035-43","first-page":"30","volume":"82","author":"Holmes","year":"2011","journal-title":"J AHIMA"},{"key":"10.4338\/ACI-2016-03-SOA-0035-44","doi-asserted-by":"crossref","unstructured":"Moshkovich H. Rule induction in data mining: effect of ordinal scales. Expert Syst Appl 2001; 22(4): 303-311.","DOI":"10.1016\/S0957-4174(02)00018-0"},{"issue":"4-5","key":"10.4338\/ACI-2016-03-SOA-0035-45","first-page":"273","volume":"35","author":"Cimino","year":"1996","journal-title":"Methods Inf Med"},{"key":"10.4338\/ACI-2016-03-SOA-0035-46","doi-asserted-by":"publisher","DOI":"10.1109\/MITP.2005.122"},{"key":"10.4338\/ACI-2016-03-SOA-0035-47","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2012.02.011"},{"key":"10.4338\/ACI-2016-03-SOA-0035-48","doi-asserted-by":"publisher","DOI":"10.1136\/jamia.1998.0050276"},{"key":"10.4338\/ACI-2016-03-SOA-0035-49","doi-asserted-by":"crossref","unstructured":"Doan A, Halevy A, Ives Z. Principles of Data Integration. 1st ed. Morgan Kaufmann; 2012","DOI":"10.1016\/B978-0-12-416044-6.00001-6"},{"key":"10.4338\/ACI-2016-03-SOA-0035-50","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2006.09.001"},{"key":"10.4338\/ACI-2016-03-SOA-0035-51","doi-asserted-by":"crossref","unstructured":"Burgun A, Bodenreider O. Accessing and integrating data and knowledge for biomedical research. 
Yearb Med Inform 2008: 91-101","DOI":"10.1055\/s-0038-1638588"},{"key":"10.4338\/ACI-2016-03-SOA-0035-52","doi-asserted-by":"crossref","unstructured":"Giuse D. Health information systems challenges: the Heidelberg conference and the future. Int J Med Inform 2003; 69(2-3): 105-114.","DOI":"10.1016\/S1386-5056(02)00182-X"},{"key":"10.4338\/ACI-2016-03-SOA-0035-53","doi-asserted-by":"publisher","DOI":"10.1097\/00001888-200501000-00009"},{"key":"10.4338\/ACI-2016-03-SOA-0035-54","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijmedinf.2006.11.006"},{"key":"10.4338\/ACI-2016-03-SOA-0035-55","first-page":"1157","volume":"3","author":"Guyon","year":"2003","journal-title":"J Mach Learn Res"},{"key":"10.4338\/ACI-2016-03-SOA-0035-56","first-page":"4","volume":"10","author":"Liu","year":"2010","journal-title":"JMLR Work Conf Proc"},{"key":"10.4338\/ACI-2016-03-SOA-0035-57","doi-asserted-by":"crossref","unstructured":"Dash M, Liu H. Feature selection for classification. Intell Data Anal 1997; 1: 131-156. 
","DOI":"10.3233\/IDA-1997-1302"},{"issue":"4-5","key":"10.4338\/ACI-2016-03-SOA-0035-58","first-page":"394","volume":"37","author":"Cimino","year":"1998","journal-title":"Methods Inf Med"},{"key":"10.4338\/ACI-2016-03-SOA-0035-59","doi-asserted-by":"publisher","DOI":"10.1109\/MASSP.1987.1165576"},{"key":"10.4338\/ACI-2016-03-SOA-0035-60","doi-asserted-by":"publisher","DOI":"10.1109\/21.52545"},{"key":"10.4338\/ACI-2016-03-SOA-0035-61","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btm344"},{"key":"10.4338\/ACI-2016-03-SOA-0035-62","doi-asserted-by":"publisher","DOI":"10.1109\/TCBB.2012.33"},{"key":"10.4338\/ACI-2016-03-SOA-0035-63","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1957.10501412"},{"key":"10.4338\/ACI-2016-03-SOA-0035-64","doi-asserted-by":"publisher","DOI":"10.1023\/A:1016304305535"},{"key":"10.4338\/ACI-2016-03-SOA-0035-65","doi-asserted-by":"publisher","DOI":"10.1377\/hlthaff.28.2.323"},{"key":"10.4338\/ACI-2016-03-SOA-0035-66","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijmedinf.2009.07.001"},{"key":"10.4338\/ACI-2016-03-SOA-0035-67","doi-asserted-by":"publisher","DOI":"10.2105\/AJPH.87.4.548"},{"key":"10.4338\/ACI-2016-03-SOA-0035-68","doi-asserted-by":"publisher","DOI":"10.1136\/bmj.b2393"},{"key":"10.4338\/ACI-2016-03-SOA-0035-69","doi-asserted-by":"publisher","DOI":"10.1016\/j.jclinepi.2004.11.029"},{"key":"10.4338\/ACI-2016-03-SOA-0035-70","doi-asserted-by":"publisher","DOI":"10.13063\/2327-9214.1035"},{"key":"10.4338\/ACI-2016-03-SOA-0035-71","doi-asserted-by":"crossref","unstructured":"Allison PD. Missing Data. 
SAGE Publications, Inc.; 2001","DOI":"10.4135\/9781412985079"},{"key":"10.4338\/ACI-2016-03-SOA-0035-72","doi-asserted-by":"publisher","DOI":"10.1016\/j.jclinepi.2006.01.014"},{"key":"10.4338\/ACI-2016-03-SOA-0035-73","doi-asserted-by":"publisher","DOI":"10.1016\/j.artmed.2013.01.003"},{"key":"10.4338\/ACI-2016-03-SOA-0035-74","doi-asserted-by":"publisher","DOI":"10.1016\/j.artmed.2010.05.002"},{"key":"10.4338\/ACI-2016-03-SOA-0035-75","doi-asserted-by":"publisher","DOI":"10.4338\/ACI-2013-02-CR-0008"},{"key":"10.4338\/ACI-2016-03-SOA-0035-76","doi-asserted-by":"publisher","DOI":"10.4338\/ACI-2010-03-RA-0019"},{"key":"10.4338\/ACI-2016-03-SOA-0035-77","doi-asserted-by":"publisher","DOI":"10.3414\/ME10-01-0038"},{"key":"10.4338\/ACI-2016-03-SOA-0035-78","doi-asserted-by":"publisher","DOI":"10.1136\/amiajnl-2013-001684"},{"key":"10.4338\/ACI-2016-03-SOA-0035-79","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijmedinf.2014.01.010"},{"key":"10.4338\/ACI-2016-03-SOA-0035-80","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijmedinf.2012.05.018"}],"container-title":["Applied Clinical Informatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/www.thieme-connect.de\/products\/ejournals\/pdf\/10.4338\/ACI-2016-03-SOA-0035.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,9,16]],"date-time":"2019-09-16T04:53:23Z","timestamp":1568609603000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.thieme-connect.de\/DOI\/DOI?10.4338\/ACI-2016-03-SOA-0035"}},"subtitle":["A roadmap to tackle the 
challenges"],"short-title":[],"issued":{"date-parts":[[2016,10]]},"references-count":80,"journal-issue":{"issue":"04","published-online":{"date-parts":[[2017,12,18]]},"published-print":{"date-parts":[[2016]]}},"URL":"https:\/\/doi.org\/10.4338\/aci-2016-03-soa-0035","relation":{},"ISSN":["1869-0327"],"issn-type":[{"value":"1869-0327","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,10]]}}}