{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T16:11:24Z","timestamp":1772554284726,"version":"3.50.1"},"reference-count":55,"publisher":"Oxford University Press (OUP)","issue":"8","license":[{"start":{"date-parts":[[2020,7,4]],"date-time":"2020-07-04T00:00:00Z","timestamp":1593820800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100016308","name":"Bpifrance","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100016308","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,8,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objective<\/jats:title>\n                  <jats:p>We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Discussion<\/jats:title>\n                  <jats:p>Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient\u2019s date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Conclusion<\/jats:title>\n                  <jats:p>Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocaa096","type":"journal-article","created":{"date-parts":[[2020,5,11]],"date-time":"2020-05-11T11:08:19Z","timestamp":1589195299000},"page":"1244-1251","source":"Crossref","is-referenced-by-count":54,"title":["Fold-stratified cross-validation for unbiased and privacy-preserving federated learning"],"prefix":"10.1093","volume":"27","author":[{"given":"Romain","family":"Bey","sequence":"first","affiliation":[{"name":"Centre of Research in Epidemiology and Statistics (CRESS), Universit\u00e9 de Paris, French Institute of Health and Medical Research (INSERM), National Institute of Agricultural Research (INRA), Paris, France"}]},{"given":"Romain","family":"Goussault","sequence":"additional","affiliation":[{"name":"CIC 1413, Center for Research in Cancerology and Immunology Nantes-Angers (CRCINA), Dermatology Department, Centre Hospitalier Universitaire Nantes, Nantes University, Nantes, France"}]},{"given":"Fran\u00e7ois","family":"Grolleau","sequence":"additional","affiliation":[{"name":"Centre of Research in Epidemiology and Statistics (CRESS), Universit\u00e9 de Paris, French Institute of Health and Medical Research (INSERM), National Institute of Agricultural Research (INRA), Paris, France"}]},{"given":"Mehdi","family":"Benchoufi","sequence":"additional","affiliation":[{"name":"Centre of Research in Epidemiology and Statistics (CRESS), Universit\u00e9 de Paris, French Institute of Health and Medical Research (INSERM), National Institute of Agricultural Research (INRA), Paris, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5277-4679","authenticated-orcid":false,"given":"Rapha\u00ebl","family":"Porcher","sequence":"additional","affiliation":[{"name":"Centre of Research in Epidemiology and Statistics (CRESS), Universit\u00e9 de Paris, French Institute of Health and Medical Research (INSERM), National Institute of Agricultural Research (INRA), Paris, France"}]}],"member":"286","published-online":{"date-parts":[[2020,7,4]]},"reference":[{"issue":"7639","key":"2020110613100564500_ocaa096-B1","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1038\/nature21056","article-title":"Dermatologist-level classification of skin cancer with deep neural networks","volume":"542","author":"Esteva","year":"2017","journal-title":"Nature"},{"issue":"8","key":"2020110613100564500_ocaa096-B2","doi-asserted-by":"crossref","first-page":"500","DOI":"10.1038\/s41568-018-0016-5","article-title":"Artificial intelligence in radiology","volume":"18","author":"Hosny","year":"2018","journal-title":"Nat Rev Cancer"},{"issue":"11","key":"2020110613100564500_ocaa096-B3","doi-asserted-by":"crossref","first-page":"1716","DOI":"10.1038\/s41591-018-0213-5","article-title":"The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care","volume":"24","author":"Komorowski","year":"2018","journal-title":"Nat Med"},{"key":"2020110613100564500_ocaa096-B4","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1038\/s41746-018-0029-1","article-title":"Scalable and accurate deep learning with electronic health records","volume":"1","author":"Rajkomar","year":"2018","journal-title":"NPJ Digit Med"},{"issue":"11","key":"2020110613100564500_ocaa096-B5","doi-asserted-by":"crossref","first-page":"e1002695","DOI":"10.1371\/journal.pmed.1002695","article-title":"Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records","volume":"15","author":"Rahimian","year":"2018","journal-title":"PLoS Med"},{"key":"2020110613100564500_ocaa096-B6","volume-title":"Inference, and Prediction","author":"Hastie","year":"2009","edition":"2nd ed"},{"issue":"1","key":"2020110613100564500_ocaa096-B7","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1186\/1471-2288-14-137","article-title":"Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints","volume":"14","author":"van der Ploeg","year":"2014","journal-title":"BMC Med Res Methodol"},{"issue":"4","key":"2020110613100564500_ocaa096-B8","doi-asserted-by":"crossref","first-page":"351","DOI":"10.1007\/s12553-017-0179-1","article-title":"Google DeepMind and healthcare in an age of algorithms","volume":"7","author":"Powles","year":"2017","journal-title":"Health Technol"},{"key":"2020110613100564500_ocaa096-B9","author":"Caldicott","year":"2016"},{"issue":"8","key":"2020110613100564500_ocaa096-B10","doi-asserted-by":"crossref","first-page":"e1000167","DOI":"10.1371\/journal.pgen.1000167","article-title":"Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays","volume":"4","author":"Homer","year":"2008","journal-title":"PLoS Genet"},{"issue":"6117","key":"2020110613100564500_ocaa096-B11","doi-asserted-by":"crossref","first-page":"262","DOI":"10.1126\/science.339.6117.262","article-title":"Genealogy databases enable naming of anonymous DNA donors","volume":"339","author":"Bohannon","year":"2013","journal-title":"Science"},{"issue":"6117","key":"2020110613100564500_ocaa096-B12","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1126\/science.1229566","article-title":"Identifying personal genomes by surname inference","volume":"339","author":"Gymrek","year":"2013","journal-title":"Science"},{"issue":"1","key":"2020110613100564500_ocaa096-B13","doi-asserted-by":"crossref","first-page":"3069","DOI":"10.1038\/s41467-019-10933-3","article-title":"Estimating the success of re-identifications in incomplete datasets using generative models","volume":"10","author":"Rocher","year":"2019","journal-title":"Nat Commun"},{"issue":"1","key":"2020110613100564500_ocaa096-B14","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1038\/s41591-018-0272-7","article-title":"Privacy in the age of medical big data","volume":"25","author":"Price","year":"2019","journal-title":"Nat Med"},{"key":"2020110613100564500_ocaa096-B15","first-page":"901","author":"Aggarwal","year":"2005"},{"key":"2020110613100564500_ocaa096-B16","first-page":"70","author":"Brickell","year":"2008"},{"issue":"1","key":"2020110613100564500_ocaa096-B17","doi-asserted-by":"crossref","first-page":"180286","DOI":"10.1038\/sdata.2018.286","article-title":"On the privacy-conscientious use of mobile phone data","volume":"5","author":"de Montjoye","year":"2018","journal-title":"Sci Data"},{"issue":"9","key":"2020110613100564500_ocaa096-B18","doi-asserted-by":"crossref","first-page":"1189","DOI":"10.1093\/jamia\/ocy058","article-title":"Hospitals\u2019 adoption of intra-system information exchange is negatively associated with inter-system information exchange","volume":"25","author":"Vest","year":"2018","journal-title":"J Am Med Inf Assoc"},{"issue":"5","key":"2020110613100564500_ocaa096-B19","doi-asserted-by":"crossref","first-page":"758","DOI":"10.1136\/amiajnl-2012-000862","article-title":"Grid Binary LOgistic REgression (GLORE): building shared models without sharing data","volume":"19","author":"Wu","year":"2012","journal-title":"J Am Med Inf Assoc"},{"key":"2020110613100564500_ocaa096-B20","doi-asserted-by":"crossref","first-page":"1212","DOI":"10.1093\/jamia\/ocv083","article-title":"WebDISCO: a web service for distributed cox model learning without patient-level data sharing","volume":"22","author":"Lu","year":"2015","journal-title":"J Am Med Inf Assoc"},{"key":"2020110613100564500_ocaa096-B21","first-page":"1310","author":"Shokri","year":"2015"},{"key":"2020110613100564500_ocaa096-B22","author":"McMahan","year":"2019"},{"key":"2020110613100564500_ocaa096-B23","first-page":"1175","author":"Bonawitz","year":"2017"},{"key":"2020110613100564500_ocaa096-B24","author":"Kairouz","year":"2019"},{"key":"2020110613100564500_ocaa096-B25","author":"Bonawitz","year":"2019"},{"issue":"4","key":"2020110613100564500_ocaa096-B26","doi-asserted-by":"crossref","first-page":"799","DOI":"10.1093\/jamia\/ocw167","article-title":"Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks","volume":"24","author":"Raisaro","year":"2017","journal-title":"J Am Med Inf Assoc"},{"issue":"4","key":"2020110613100564500_ocaa096-B27","doi-asserted-by":"crossref","first-page":"1328","DOI":"10.1109\/TCBB.2018.2854776","article-title":"MedCo: enabling secure and privacy-preserving exploration of distributed clinical and genomic data","volume":"16","author":"Raisaro","year":"2019","journal-title":"IEEE\/ACM Trans Comput Biol Bioinf"},{"key":"2020110613100564500_ocaa096-B28","author":"Ryffel","year":"2019"},{"key":"2020110613100564500_ocaa096-B29","author":"Galtier","year":"2019"},{"issue":"3","key":"2020110613100564500_ocaa096-B30","doi-asserted-by":"publisher","first-page":"376","DOI":"10.1093\/jamia\/ocz199","article-title":"Learning from electronic health records across multiple sites: a communication-efficient and privacy-preserving distributed algorithm","volume":"27","author":"Duan","year":"2020","journal-title":"J Am Med Inf Assoc"},{"issue":"6176","key":"2020110613100564500_ocaa096-B31","doi-asserted-by":"crossref","first-page":"1203","DOI":"10.1126\/science.1248506","article-title":"Big data. The parable of Google Flu: traps in big data analysis","volume":"343","author":"Lazer","year":"2014","journal-title":"Science"},{"issue":"1","key":"2020110613100564500_ocaa096-B32","doi-asserted-by":"crossref","first-page":"eaao5580","DOI":"10.1126\/sciadv.aao5580","article-title":"The accuracy, fairness, and limits of predicting recidivism","volume":"4","author":"Dressel","year":"2018","journal-title":"Sci Adv"},{"key":"2020110613100564500_ocaa096-B33","author":"Kir\u00e1ly","year":"2019"},{"issue":"3","key":"2020110613100564500_ocaa096-B34","doi-asserted-by":"crossref","first-page":"800","DOI":"10.1148\/radiol.2017171920","article-title":"Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction","volume":"286","author":"Park","year":"2018","journal-title":"Radiology"},{"key":"2020110613100564500_ocaa096-B35","author":"Vollmer","year":"2019"},{"issue":"4","key":"2020110613100564500_ocaa096-B36","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2382577.2382579","article-title":"Leakage in data mining: formulation, detection, and avoidance","volume":"6","author":"Kaufman","year":"2012","journal-title":"ACM Trans Knowl Discov Data"},{"issue":"1","key":"2020110613100564500_ocaa096-B37","doi-asserted-by":"crossref","first-page":"36","DOI":"10.1186\/1471-2288-14-36","article-title":"Evaluating bias due to data linkage error in electronic healthcare records","volume":"14","author":"Harron","year":"2014","journal-title":"BMC Med Res Methodol"},{"issue":"12","key":"2020110613100564500_ocaa096-B38","doi-asserted-by":"crossref","first-page":"e323","DOI":"10.2196\/jmir.5870","article-title":"Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view","volume":"18","author":"Luo","year":"2016","journal-title":"J Med Internet Res"},{"issue":"5","key":"2020110613100564500_ocaa096-B39","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/gigascience\/gix019","article-title":"The need to approximate the use-case in clinical machine learning","volume":"6","author":"Saeb","year":"2017","journal-title":"Gigascience"},{"issue":"3","key":"2020110613100564500_ocaa096-B40","doi-asserted-by":"crossref","first-page":"219","DOI":"10.1136\/bmjqs-2012-001419","article-title":"Matching identifiers in electronic health records: implications for duplicate records and patient safety","volume":"22","author":"McCoy","year":"2013","journal-title":"BMJ Qual Saf"},{"issue":"9","key":"2020110613100564500_ocaa096-B41","doi-asserted-by":"crossref","first-page":"1114","DOI":"10.1093\/jamia\/ocy089","article-title":"Gaps in health information exchange between hospitals that treat many shared patients","volume":"25","author":"Everson","year":"2018","journal-title":"J Am Med Inf Assoc"},{"key":"2020110613100564500_ocaa096-B42","doi-asserted-by":"crossref","DOI":"10.1002\/9781119072454","volume-title":"Methodological Developments in Data Linkage","author":"Harron","year":"2015"},{"issue":"6","key":"2020110613100564500_ocaa096-B43","doi-asserted-by":"crossref","first-page":"946","DOI":"10.1016\/j.is.2012.11.005","article-title":"A taxonomy of privacy-preserving record linkage techniques","volume":"38","author":"Vatsalan","year":"2013","journal-title":"Inf Syst"},{"issue":"e1","key":"2020110613100564500_ocaa096-B44","doi-asserted-by":"crossref","first-page":"e155","DOI":"10.1136\/amiajnl-2012-001299","article-title":"Federated queries of clinical data repositories: the sum of the parts does not equal the whole","volume":"20","author":"Weber","year":"2013","journal-title":"J Am Med Inform Assoc"},{"issue":"1","key":"2020110613100564500_ocaa096-B45","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12911-016-0389-x","article-title":"Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation","volume":"17","author":"Yigzaw","year":"2017","journal-title":"BMC Med Inf Decis Mak"},{"issue":"S4","key":"2020110613100564500_ocaa096-B46","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1186\/s12920-018-0400-8","article-title":"Privacy-preserving record linkage in large databases using secure multiparty computation","volume":"11","author":"Laud","year":"2018","journal-title":"BMC Med Genomics"},{"issue":"1\u20132","key":"2020110613100564500_ocaa096-B47","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/S0004-3702(99)00094-6","article-title":"Unsupervised stratification of cross-validation for accuracy estimation","volume":"116","author":"Diamantidis","year":"2000","journal-title":"Art Int"},{"issue":"1","key":"2020110613100564500_ocaa096-B48","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/sdata.2016.35","article-title":"MIMIC-III, a freely accessible critical care database","volume":"3","author":"Johnson","year":"2016","journal-title":"Sci Data"},{"issue":"24","key":"2020110613100564500_ocaa096-B49","doi-asserted-by":"crossref","first-page":"2957","DOI":"10.1001\/jama.1993.03510240069035","article-title":"A new Simplified Acute Physiology Score (SAPS II) based on a European\/North American multicenter study","volume":"270","author":"Le Gall","year":"1993","journal-title":"JAMA"},{"key":"2020110613100564500_ocaa096-B50","first-page":"785","author":"Chen","year":"2016"},{"key":"2020110613100564500_ocaa096-B51","author":"Liu","year":"2019"},{"key":"2020110613100564500_ocaa096-B52","author":"Cheng","year":"2019"},{"issue":"1","key":"2020110613100564500_ocaa096-B53","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1016\/S2213-2600(14)70239-5","article-title":"Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study","volume":"3","author":"Pirracchio","year":"2015","journal-title":"Lancet Respir Med"},{"key":"2020110613100564500_ocaa096-B54","volume-title":"Anonymizing Health Data: Case Studies and Methods to Get You Started","author":"Emam","year":"2013"},{"issue":"3\u20134","key":"2020110613100564500_ocaa096-B55","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1561\/0400000042","article-title":"The algorithmic foundations of differential privacy","volume":"9","author":"Dwork","year":"2013","journal-title":"FNT Theor Comput Sci"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/27\/8\/1244\/34152945\/ocaa096.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/27\/8\/1244\/34152945\/ocaa096.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,11,6]],"date-time":"2020-11-06T19:31:56Z","timestamp":1604691116000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/27\/8\/1244\/5867235"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,4]]},"references-count":55,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2020,7,4]]},"published-print":{"date-parts":[[2020,8,1]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocaa096","relation":{},"ISSN":["1527-974X"],"issn-type":[{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,8]]},"published":{"date-parts":[[2020,7,4]]}}}