{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T07:43:46Z","timestamp":1777535026983,"version":"3.51.4"},"reference-count":28,"publisher":"Oxford University Press (OUP)","issue":"2","license":[{"start":{"date-parts":[[2020,10,7]],"date-time":"2020-10-07T00:00:00Z","timestamp":1602028800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/501100004233","name":"Universitat Polit\u00e8cnica de Val\u00e8ncia","doi-asserted-by":"publisher","award":["UPV-SUB.2-1302"],"award-info":[{"award-number":["UPV-SUB.2-1302"]}],"id":[{"id":"10.13039\/501100004233","id-type":"DOI","asserted-by":"publisher"}]},{"name":"FONDO SUPERA COVID-19 by CRUE-Santander Bank grant \u201cSeverity Subgroup Discovery and Classification on COVID-19 Real World Data"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,2,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Objective<\/jats:title>\n                  <jats:p>The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Materials and Methods<\/jats:title>\n                  <jats:p>We used the publicly available nCov2019 dataset, including patient-level data from several countries. We aimed to the discovery and classification of severity subgroups using symptoms and comorbidities.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>Cases from the 2 countries with the highest prevalence were divided into separate subgroups with distinct severity manifestations. This variability can reduce the representativeness of training data with respect the model target populations and increase model complexity at risk of overfitting.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Conclusions<\/jats:title>\n                  <jats:p>Data source variability is a potential contributor to bias in distributed research networks. We call for systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/jamia\/ocaa258","type":"journal-article","created":{"date-parts":[[2020,9,28]],"date-time":"2020-09-28T11:22:06Z","timestamp":1601292126000},"page":"360-364","source":"Crossref","is-referenced-by-count":67,"title":["Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset"],"prefix":"10.1093","volume":"28","author":[{"given":"Carlos","family":"S\u00e1ez","sequence":"first","affiliation":[{"name":"Biomedical Data Science Lab, Instituto Universitario de Tecnolog\u00edas de la Informaci\u00f3n y Comunicaciones, Universitat Polit\u00e8cnica de Val\u00e8ncia, Camino de Vera s\/n, Valencia 46022, Espa\u00f1a"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nekane","family":"Romero","sequence":"additional","affiliation":[{"name":"Biomedical Data Science Lab, Instituto Universitario de Tecnolog\u00edas de la Informaci\u00f3n y Comunicaciones, Universitat Polit\u00e8cnica de Val\u00e8ncia, Camino de Vera s\/n, Valencia 46022, Espa\u00f1a"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"J Alberto","family":"Conejero","sequence":"additional","affiliation":[{"name":"Instituto Universitario de Matem\u00e1tica Pura y Aplicada, Universitat Polit\u00e9cnica de Val\u00e8ncia, Valencia, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Juan M","family":"Garc\u00eda-G\u00f3mez","sequence":"additional","affiliation":[{"name":"Biomedical Data Science Lab, Instituto Universitario de Tecnolog\u00edas de la Informaci\u00f3n y Comunicaciones, Universitat Polit\u00e8cnica de Val\u00e8ncia, Camino de Vera s\/n, Valencia 46022, Espa\u00f1a"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2020,10,7]]},"reference":[{"key":"2021021519083129000_ocaa258-B1","doi-asserted-by":"crossref","first-page":"m1464","DOI":"10.1136\/bmj.m1464","article-title":"Prediction models for diagnosis and prognosis in COVID-19","volume":"369","author":"Sperrin","year":"2020","journal-title":"BMJ"},{"key":"2021021519083129000_ocaa258-B2","doi-asserted-by":"crossref","first-page":"m1328","DOI":"10.1136\/bmj.m1328","article-title":"Prediction models for diagnosis and prognosis of COVID-19 infection: systematic review and critical appraisal","volume":"369","author":"Wynants","year":"2020","journal-title":"BMJ"},{"issue":"1","key":"2021021519083129000_ocaa258-B3","doi-asserted-by":"crossref","first-page":"106","DOI":"10.1038\/s41597-020-0448-0","article-title":"Epidemiological data from the COVID-19 outbreak, real-time case information","volume":"7","author":"Xu","year":"2020","journal-title":"Sci Data"},{"key":"2021021519083129000_ocaa258-B4","doi-asserted-by":"crossref","first-page":"433","DOI":"10.1002\/wics.101","article-title":"Principal component analysis","volume":"2","author":"Herv\u00e9","year":"2010","journal-title":"WIREs Comput Stat"},{"key":"2021021519083129000_ocaa258-B5","author":"Husson","year":"2017","edition":"2nd ed"},{"key":"2021021519083129000_ocaa258-B6","year":"2020"},{"key":"2021021519083129000_ocaa258-B7","article-title":"Accessed May 25,\u00a02020","year":"2020"},{"key":"2021021519083129000_ocaa258-B8","volume":"25,\u00a02020","year":"2020"},{"issue":"1","key":"2021021519083129000_ocaa258-B9","doi-asserted-by":"crossref","first-page":"521","DOI":"10.1016\/j.patcog.2011.06.019","article-title":"A unifying view on dataset shift in classification","volume":"45","author":"Moreno-Torres","year":"2012","journal-title":"Pattern Recognit"},{"issue":"14","key":"2021021519083129000_ocaa258-B10","doi-asserted-by":"crossref","first-page":"1347","DOI":"10.1056\/NEJMra1814259","article-title":"Machine learning in medicine","volume":"380","author":"Rajkomar","year":"2019","journal-title":"N Engl J Med"},{"issue":"1","key":"2021021519083129000_ocaa258-B11","doi-asserted-by":"crossref","first-page":"312","DOI":"10.1177\/0962280214545122","article-title":"Stability metrics for multi-source biomedical data based on simplicial projections from probability distribution distances","volume":"26","author":"S\u00e1ez","year":"2017","journal-title":"Stat Methods Med Res"},{"issue":"4","key":"2021021519083129000_ocaa258-B12","doi-asserted-by":"crossref","first-page":"1408","DOI":"10.1093\/ije\/dyu192","article-title":"Understanding variation in disease risk: the elusive concept of frailty","volume":"44","author":"Aalen","year":"2015","journal-title":"Int J Epidemiol"},{"issue":"11","key":"2021021519083129000_ocaa258-B13","doi-asserted-by":"crossref","first-page":"1544","DOI":"10.1001\/jamainternmed.2018.3763","article-title":"Potential biases in machine learning algorithms using electronic health record data","volume":"178","author":"Gianfrancesco","year":"2018","journal-title":"JAMA Intern Med"},{"key":"2021021519083129000_ocaa258-B14","doi-asserted-by":"crossref","DOI":"10.1016\/j.diagmicrobio.2020.115070","article-title":"Accelerating the global response against the exponentially growing COVID-19 outbreak through decent data sharing","author":"Galvin","year":"2020","journal-title":"Diagn Microbiol Infect Dis"},{"issue":"1","key":"2021021519083129000_ocaa258-B15","doi-asserted-by":"crossref","first-page":"109","DOI":"10.1038\/s41746-020-00308-0","article-title":"International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium","volume":"3","author":"Brat","year":"2020","journal-title":"NPJ Digit Med"},{"issue":"1","key":"2021021519083129000_ocaa258-B16","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1007\/s10334-008-0146-y","article-title":"Multiproject-multicenter evaluation of automatic brain tumor classification by magnetic resonance spectroscopy","volume":"22","author":"Garc\u00eda-G\u00f3mez","year":"2009","journal-title":"MAGMA"},{"issue":"6","key":"2021021519083129000_ocaa258-B17","doi-asserted-by":"crossref","first-page":"1085","DOI":"10.1093\/jamia\/ocw010","article-title":"Applying probabilistic temporal and multisite data quality control methods to a public health mortality registry in Spain: a systematic approach to quality control of repositories","volume":"23","author":"S\u00e1ez","year":"2016","journal-title":"J Am Med Inform Assoc"},{"issue":"6","key":"2021021519083129000_ocaa258-B18","doi-asserted-by":"crossref","first-page":"517","DOI":"10.1001\/jama.2017.7797","article-title":"Unintended consequences of machine learning in medicine","volume":"318","author":"Cabitza","year":"2017","journal-title":"JAMA"},{"issue":"2","key":"2021021519083129000_ocaa258-B19","doi-asserted-by":"crossref","first-page":"e034396","DOI":"10.1136\/bmjopen-2019-034396","article-title":"Data-driven discovery of changes in clinical code usage over time: a case-study on changes in cardiovascular disease recording in two English electronic health records databases (2001\u20132015)","volume":"10","author":"Rockenschaub","year":"2020","journal-title":"BMJ Open"},{"issue":"8","key":"2021021519083129000_ocaa258-B20","doi-asserted-by":"crossref","DOI":"10.1093\/gigascience\/giaa079","article-title":"EHRtemporalVariability: delineating temporal data-set shifts in electronic health records","volume":"9","author":"S\u00e1ez","year":"2020","journal-title":"GigaScience"},{"issue":"4","key":"2021021519083129000_ocaa258-B21","doi-asserted-by":"crossref","first-page":"950","DOI":"10.1007\/s10618-014-0378-6","article-title":"Probabilistic change detection and visualization methods for the assessment of temporal stability in biomedical data quality","volume":"29","author":"S\u00e1ez","year":"2015","journal-title":"Data Min Knowl Discov"},{"issue":"3","key":"2021021519083129000_ocaa258-B22","doi-asserted-by":"crossref","first-page":"148","DOI":"10.1002\/bjs.9736","article-title":"Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement","volume":"102","author":"Collins","year":"2015","journal-title":"Br J Surg"},{"key":"2021021519083129000_ocaa258-B23","doi-asserted-by":"crossref","first-page":"104954","DOI":"10.1016\/j.cmpb.2019.06.013","article-title":"Guest editorial: Special issue in biomedical data quality assessment methods","volume":"181","author":"S\u00e1ez","year":"2019","journal-title":"Comput Methods Programs Biomed"},{"key":"2021021519083129000_ocaa258-B24","first-page":"29","author":"Wirth","year":"2000"},{"issue":"11","key":"2021021519083129000_ocaa258-B25","doi-asserted-by":"crossref","first-page":"1043","DOI":"10.1001\/jama.2020.1039","article-title":"Randomized clinical trials of artificial intelligence","volume":"323","author":"Angus","year":"2020","journal-title":"JAMA"},{"key":"2021021519083129000_ocaa258-B26","volume-title":"Introducing MLOps","author":"Stenac","year":"2021"},{"issue":"3","key":"2021021519083129000_ocaa258-B27","doi-asserted-by":"crossref","first-page":"150","DOI":"10.2471\/BLT.20.251561","article-title":"Data sharing for novel coronavirus (COVID-19)","volume":"98","author":"Moorthy","year":"2020","journal-title":"Bull World Health Organ"},{"key":"2021021519083129000_ocaa258-B28","doi-asserted-by":"crossref","DOI":"10.1093\/jamia\/ocaa196","article-title":"The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment","author":"Haendel","year":"2020","journal-title":"J Am Med Inform Assoc"}],"container-title":["Journal of the American Medical Informatics Association"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/jamia\/advance-article-pdf\/doi\/10.1093\/jamia\/ocaa258\/34191904\/ocaa258.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/28\/2\/360\/36270266\/ocaa258.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/jamia\/article-pdf\/28\/2\/360\/36270266\/ocaa258.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,2,16]],"date-time":"2021-02-16T14:37:44Z","timestamp":1613486264000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/jamia\/article\/28\/2\/360\/5919075"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,7]]},"references-count":28,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2020,10,7]]},"published-print":{"date-parts":[[2021,2,15]]}},"URL":"https:\/\/doi.org\/10.1093\/jamia\/ocaa258","relation":{},"ISSN":["1067-5027","1527-974X"],"issn-type":[{"value":"1067-5027","type":"print"},{"value":"1527-974X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,2,1]]},"published":{"date-parts":[[2020,10,7]]}}}