{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T04:55:56Z","timestamp":1780462556814,"version":"3.54.1"},"reference-count":22,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,9,17]],"date-time":"2021-09-17T00:00:00Z","timestamp":1631836800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,9,17]],"date-time":"2021-09-17T00:00:00Z","timestamp":1631836800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100004040","name":"KU Leuven","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100004040","id-type":"DOI","asserted-by":"crossref"}]},{"name":"the Flemish Government"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Med Inform Decis Mak"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>The use of Electronic Health Records (EHR) data in clinical research is incredibly increasing, but the abundancy of data resources raises the challenge of data cleaning. It can save time if the data cleaning can be done automatically. In addition, the automated data cleaning tools for data in other domains often process all variables uniformly, meaning that they cannot serve well for clinical data, as there is variable-specific information that needs to be considered. 
This paper proposes an automated data cleaning method for EHR data with clinical knowledge taken into consideration.\n<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Methods<\/jats:title>\n                <jats:p>We used EHR data collected from primary care in Flanders, Belgium during 1994\u20132015. We constructed a Clinical Knowledge Database to store all the variable-specific information that is necessary for data cleaning. We applied Fuzzy search to automatically detect and replace the wrongly spelled units, and performed the unit conversion following the variable-specific conversion formula. Then the numeric values were corrected and outliers were detected considering the clinical knowledge. In total, 52 clinical variables were cleaned, and the percentage of missing values (completeness) and percentage of values within the normal range (correctness) before and after the cleaning process were compared.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>All variables were 100% complete before data cleaning. 42 variables had a drop of less than 1% in the percentage of missing values and 9 variables declined by 1\u201310%. Only 1 variable experienced large decline in completeness (13.36%). All variables had more than 50% values within the normal range after cleaning, of which 43 variables had a percentage higher than 70%.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>We propose a general method for clinical variables, which achieves high automation and is capable to deal with large-scale data. 
This method largely improved the efficiency to clean the data and removed the technical barriers for non-technical people.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s12911-021-01630-7","type":"journal-article","created":{"date-parts":[[2021,9,17]],"date-time":"2021-09-17T17:03:03Z","timestamp":1631898183000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":31,"title":["An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge"],"prefix":"10.1186","volume":"21","author":[{"given":"Xi","family":"Shi","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Charlotte","family":"Prins","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Gijs","family":"Van Pottelbergh","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Pavlos","family":"Mamouris","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bert","family":"Vaes","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bart","family":"De Moor","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2021,9,17]]},"reference":[{"issue":"16","key":"1630_CR1","doi-asserted-by":"publisher","first-page":"1481","DOI":"10.1093\/eurheartj\/ehx487","volume":"39","author":"H Hemingway","year":"2018","unstructured":"Hemingway H, Asselbergs FW, Danesh J, et al. Big data from electronic health records for early and late translational cardiovascular research: challenges and potential. Eur Heart J. 
2018;39(16):1481\u201395.","journal-title":"Eur Heart J"},{"issue":"1","key":"1630_CR2","doi-asserted-by":"publisher","first-page":"144","DOI":"10.1136\/amiajnl-2011-000681","volume":"20","author":"NG Weiskopf","year":"2013","unstructured":"Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):144\u201351.","journal-title":"J Am Med Inform Assoc"},{"issue":"5","key":"1630_CR3","doi-asserted-by":"publisher","first-page":"753","DOI":"10.1177\/0193945916689084","volume":"40","author":"SL Feder","year":"2018","unstructured":"Feder SL. Data quality in electronic health records research: quality domains and assessment methods. West J Nurs Res. 2018;40(5):753\u201366.","journal-title":"West J Nurs Res"},{"key":"1630_CR4","doi-asserted-by":"publisher","first-page":"30","DOI":"10.1186\/s12911-019-0740-0","volume":"19","author":"AL Terry","year":"2019","unstructured":"Terry AL, Stewart M, Cejic S, et al. A basic model for assessing primary health care electronic medical record data quality. BMC Med Inform Decis Mak. 2019;19:30.","journal-title":"BMC Med Inform Decis Mak"},{"key":"1630_CR5","doi-asserted-by":"publisher","first-page":"19","DOI":"10.2174\/1874431101812010019","volume":"12","author":"M Mashoufi","year":"2018","unstructured":"Mashoufi M, Ayatollahi H, Khorasani-Zavareh D. A review of data quality assessment in emergency medical services. Open Med Inform J. 2018;12:19\u201332. https:\/\/doi.org\/10.2174\/1874431101812010019.","journal-title":"Open Med Inform J"},{"issue":"Suppl","key":"1630_CR6","doi-asserted-by":"publisher","first-page":"S21","DOI":"10.1097\/MLR.0b013e318257dd67","volume":"50","author":"MG Kahn","year":"2012","unstructured":"Kahn MG, Raebel MA, Glanz JM, et al. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Med Care. 2012;50(Suppl):S21\u20139. 
https:\/\/doi.org\/10.1097\/MLR.0b013e318257dd67.","journal-title":"Med Care"},{"issue":"1","key":"1630_CR7","doi-asserted-by":"publisher","first-page":"69","DOI":"10.4338\/ACI-2015-08-RA-0107","volume":"7","author":"SG Johnson","year":"2016","unstructured":"Johnson SG, Speedie S, Simon G, et al. Application of an ontology for characterizing data quality for a secondary use of EHR data. Appl Clin Inform. 2016;7(1):69\u201388. https:\/\/doi.org\/10.4338\/ACI-2015-08-RA-0107.","journal-title":"Appl Clin Inform"},{"issue":"1","key":"1630_CR8","doi-asserted-by":"publisher","first-page":"724","DOI":"10.1186\/s12913-020-05591-x","volume":"20","author":"C Njuguna","year":"2020","unstructured":"Njuguna C, Vandi M, Mugagga M, et al. Institutionalized data quality assessments: a critical pathway to improving the accuracy of integrated disease surveillance data in Sierra Leone. BMC Health Serv Res. 2020;20(1):724. https:\/\/doi.org\/10.1186\/s12913-020-05591-x.","journal-title":"BMC Health Serv Res"},{"issue":"1","key":"1630_CR9","doi-asserted-by":"publisher","first-page":"3","DOI":"10.13063\/2327-9214.1277","volume":"5","author":"H Estiri","year":"2017","unstructured":"Estiri H, Stephens K. DQe-v: a database-agnostic framework for exploring variability in electronic health record data across time and site location. EGEMS (Wash DC). 2017;5(1):3. https:\/\/doi.org\/10.13063\/2327-9214.1277.","journal-title":"EGEMS (Wash DC)"},{"issue":"1","key":"1630_CR10","doi-asserted-by":"publisher","first-page":"32","DOI":"10.5334\/egems.286","volume":"7","author":"JF Diaz-Garelli","year":"2019","unstructured":"Diaz-Garelli JF, Bernstam EV, Lee M, et al. DataGauge: a practical process for systematically designing and implementing quality assessments of repurposed clinical data. EGEMS (Wash DC). 2019;7(1):32. 
https:\/\/doi.org\/10.5334\/egems.286.","journal-title":"EGEMS (Wash DC)"},{"issue":"1","key":"1630_CR11","first-page":"1201","volume":"4","author":"O Dziadkowiec","year":"2016","unstructured":"Dziadkowiec O, Callahan T, Ozkaynak M, et al. Using a data quality framework to clean data extracted from the electronic health record: a case study. EGEMS (Wash DC). 2016;4(1):1201.","journal-title":"EGEMS (Wash DC)"},{"issue":"1","key":"1630_CR12","doi-asserted-by":"publisher","first-page":"14","DOI":"10.5334\/egems.218","volume":"5","author":"NG Weiskopf","year":"2017","unstructured":"Weiskopf NG, Bakken S, Hripcsak G, et al. A data quality assessment guideline for electronic health record data reuse. EGEMS (Wash DC). 2017;5(1):14. https:\/\/doi.org\/10.5334\/egems.218.","journal-title":"EGEMS (Wash DC)"},{"key":"1630_CR13","unstructured":"Miao Z, Sathyanarayanan S, Fong E, Paiva W, Delen D. An assessment and cleaning framework for electronic health records data. In: Industrial and systems engineering research conference. 2018."},{"key":"1630_CR14","doi-asserted-by":"publisher","first-page":"10164","DOI":"10.1038\/s41598-020-66925-7","volume":"10","author":"HTT Phan","year":"2020","unstructured":"Phan HTT, Borca F, Cable D, et al. Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort. Sci Rep. 2020;10:10164.","journal-title":"Sci Rep"},{"issue":"12","key":"1630_CR15","doi-asserted-by":"publisher","first-page":"1921","DOI":"10.1093\/jamia\/ocaa139","volume":"27","author":"S Tang","year":"2020","unstructured":"Tang S, Davarmanesh P, Song Y, et al. Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data. J Am Med Inform Assoc. 2020;27(12):1921\u201334. 
https:\/\/doi.org\/10.1093\/jamia\/ocaa139.","journal-title":"J Am Med Inform Assoc"},{"key":"1630_CR16","doi-asserted-by":"publisher","first-page":"48","DOI":"10.1186\/1472-6947-14-48","volume":"14","author":"C Truyers","year":"2014","unstructured":"Truyers C, Goderis G, Dewitte H, et al. The Intego database: background, methods and basic results of a Flemish general practice-based continuous morbidity registration project. BMC Med Inform Decis Mak. 2014;14:48.","journal-title":"BMC Med Inform Decis Mak"},{"issue":"1","key":"1630_CR17","doi-asserted-by":"publisher","first-page":"67","DOI":"10.1007\/s10032-002-0082-8","volume":"5","author":"KU Schulz","year":"2002","unstructured":"Schulz KU, Mihov S. Fast string correction with levenshtein automata. Int J Doc Anal Recogn. 2002;5(1):67\u201385.","journal-title":"Int J Doc Anal Recogn"},{"key":"1630_CR18","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1016\/j.jbi.2018.11.007","volume":"88","author":"A Sarker","year":"2018","unstructured":"Sarker A, Gonzalez-Hernandez G. An unsupervised and customizable misspelling generator for mining noisy health-related text sources. J Biomed Inform. 2018;88:98\u2013107.","journal-title":"J Biomed Inform"},{"issue":"10","key":"1630_CR19","doi-asserted-by":"publisher","first-page":"e267","DOI":"10.1371\/journal.pmed.0020267","volume":"2","author":"J Van den Broeck","year":"2005","unstructured":"Van den Broeck J, Cunningham SA, Eeckels R, et al. Data cleaning: detecting, diagnosing, and editing data abnormalities. Plos Med. 2005;2(10):e267.","journal-title":"Plos Med"},{"key":"1630_CR20","doi-asserted-by":"publisher","DOI":"10.1136\/bmjopen-2018-023594","volume":"8","author":"G Van Pottelbergh","year":"2018","unstructured":"Van Pottelbergh G, Mamouris P, Opdeweegh N, et al. 
Is there a correlation between an eGFR slope measured over a 5-year period and incident cardiovascular events in the following 5 years among a Flemish general practice population: a retrospective cohort study. BMJ Open. 2018;8: e023594. https:\/\/doi.org\/10.1136\/bmjopen-2018-023594.","journal-title":"BMJ Open"},{"issue":"14","key":"1630_CR21","doi-asserted-by":"publisher","first-page":"2678","DOI":"10.1080\/00949655.2019.1630411","volume":"89","author":"A Florez","year":"2019","unstructured":"Florez A, Molenberghs B, Verbeke G, et al. Fast two-stage estimator for clustered count data with overdispersion. J Stat Comput Simul. 2019;89(14):2678\u201393. https:\/\/doi.org\/10.1080\/00949655.2019.1630411.","journal-title":"J Stat Comput Simul"},{"key":"1630_CR22","unstructured":"dataQualityR. https:\/\/rdrr.io\/cran\/dataQualityR\/man\/dataQualityR-package.html. Accessed 09 July 2021."}],"container-title":["BMC Medical Informatics and Decision Making"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12911-021-01630-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12911-021-01630-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12911-021-01630-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,17]],"date-time":"2021-09-17T17:03:55Z","timestamp":1631898235000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcmedinformdecismak.biomedcentral.com\/articles\/10.1186\/s12911-021-01630-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,17]]},"references-count":22,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["1630"],"URL":"https:\/\/doi.org\/10.1186\/s12911-021-01630-7","relation":{},"ISSN":["1472-6947"],"issn-type":[{"value":"1472-6947","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,9,17]]},"assertion":[{"value":"16 January 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 September 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 September 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Intego uses an opt-out methodology and is approved by the local ethical committee of the KU Leuven and in line with Belgian privacy regulations.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"267"}}