{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,12]],"date-time":"2026-06-12T23:20:42Z","timestamp":1781306442774,"version":"3.54.1"},"reference-count":66,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2024,2,29]],"date-time":"2024-02-29T00:00:00Z","timestamp":1709164800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Big Data"],"abstract":"<jats:p>Combining training data from multiple sources increases sample size and reduces confounding, leading to more accurate and less biased machine learning models. In healthcare, however, direct pooling of data is often not allowed by data custodians who are accountable for minimizing the exposure of sensitive information. Federated learning offers a promising solution to this problem by training a model in a decentralized manner thus reducing the risks of data leakage. Although there is increasing utilization of federated learning on clinical data, its efficacy on individual-level genomic data has not been studied. This study lays the groundwork for the adoption of federated learning for genomic data by investigating its applicability in two scenarios: phenotype prediction on the UK Biobank data and ancestry prediction on the 1000 Genomes Project data. We show that federated models trained on data split into independent nodes achieve performance close to centralized models, even in the presence of significant inter-node heterogeneity. 
Additionally, we investigate how federated model accuracy is affected by communication frequency and suggest approaches to reduce computational complexity or communication costs.<\/jats:p>","DOI":"10.3389\/fdata.2024.1266031","type":"journal-article","created":{"date-parts":[[2024,2,29]],"date-time":"2024-02-29T00:35:42Z","timestamp":1709166942000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":27,"title":["Efficacy of federated learning on genomic data: a study on the UK Biobank and the 1000 Genomes Project"],"prefix":"10.3389","volume":"7","author":[{"given":"Dmitry","family":"Kolobkov","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Satyarth","family":"Mishra Sharma","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Aleksandr","family":"Medvedev","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Mikhail","family":"Lebedev","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Egor","family":"Kosaretskiy","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ruslan","family":"Vakhitov","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1965","published-online":{"date-parts":[[2024,2,29]]},"reference":[{"key":"B1","doi-asserted-by":"crossref","first-page":"308","DOI":"10.1145\/2976749.2978318","article-title":"\u201cDeep learning with differential privacy,\u201d","volume-title":"Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security","author":"Abadi","year":"2016"},{"key":"B2","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2111.04263","article-title":"Federated learning based on dynamic regularization","author":"Acar","year":"2021","journal-title":"arXiv preprint 
arXiv:2111.04263"},{"key":"B3","doi-asserted-by":"publisher","first-page":"e1008773","DOI":"10.1371\/journal.pgen.1008773","article-title":"Scalable probabilistic PCA for large-scale genetic variation data","volume":"16","author":"Agrawal","year":"2020","journal-title":"PLoS Genet."},{"key":"B4","doi-asserted-by":"publisher","first-page":"1045450","DOI":"10.3389\/fgene.2022.1045450","article-title":"Democratizing clinical-genomic data: how federated platforms can promote benefits sharing in genomics","volume":"13","author":"Alvarellos","year":"2023","journal-title":"Front. Genet."},{"key":"B5","doi-asserted-by":"publisher","first-page":"1346","DOI":"10.1038\/s41588-020-00740-8","article-title":"Improving the trans-ancestry portability of polygenic risk scores by prioritizing variants in predicted cell-type-specific regulatory elements","volume":"52","author":"Amariuta","year":"2020","journal-title":"Nat. Genet."},{"key":"B6","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2007.14390","article-title":"Flower: a friendly federated learning research framework","author":"Beutel","year":"2020","journal-title":"arXiv preprint arXiv:2007.14390"},{"key":"B7","doi-asserted-by":"publisher","first-page":"695","DOI":"10.1038\/ng.f.136","article-title":"Common and rare variants in multifactorial susceptibility to common diseases","volume":"40","author":"Bodmer","year":"2008","journal-title":"Nat. Genet."},{"key":"B8","doi-asserted-by":"publisher","first-page":"646","DOI":"10.1038\/s41588-020-0651-0","article-title":"Privacy challenges and research opportunities for genomic data sharing","volume":"52","author":"Bonomi","year":"2020","journal-title":"Nat. 
Genet."},{"key":"B9","doi-asserted-by":"publisher","first-page":"D1005","DOI":"10.1093\/nar\/gky1120","article-title":"The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019","volume":"47","author":"Buniello","year":"2019","journal-title":"Nucleic Acids Res."},{"key":"B10","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1186\/s13742-015-0047-8","article-title":"Second-generation PLINK: rising to the challenge of larger and richer datasets","volume":"4","author":"Chang","year":"2015","journal-title":"Gigascience"},{"key":"B11","doi-asserted-by":"publisher","first-page":"lsz016","DOI":"10.1093\/jlb\/lsz016","article-title":"Genetic discrimination: emerging ethical challenges in the context of advancing technology","volume":"7","author":"Chapman","year":"2020","journal-title":"J. Law Biosci."},{"key":"B12","doi-asserted-by":"publisher","first-page":"83","DOI":"10.1109\/MIS.2020.2988604","article-title":"Fedhealth: a federated transfer learning framework for wearable healthcare","volume":"35","author":"Chen","year":"2020","journal-title":"IEEE Intell. Syst."},{"key":"B13","first-page":"643","article-title":"\u201cMultiparty computation from somewhat homomorphic encryption,\u201d","volume-title":"Annual Cryptology Conference","author":"Damg\u00e5rd","year":"2012"},{"key":"B14","doi-asserted-by":"publisher","first-page":"620","DOI":"10.1016\/j.ajhg.2021.02.013","article-title":"Negative selection on complex traits limits phenotype prediction accuracy between populations","volume":"108","author":"Durvasula","year":"2021","journal-title":"Am. J. Hum. Genet."},{"key":"B15","doi-asserted-by":"publisher","first-page":"379","DOI":"10.1038\/nrg3472","article-title":"Meta-analysis methods for genome-wide association studies and beyond","volume":"14","author":"Evangelou","year":"2013","journal-title":"Nat. Rev. 
Genet."},{"key":"B16","author":"Falcon","year":"2020","journal-title":"PyTorchLightning\/Pytorch-Lightning: 0.7.6 Release"},{"key":"B17","doi-asserted-by":"publisher","first-page":"579","DOI":"10.1038\/s41588-019-0394-y","volume":"51","year":"2019","journal-title":"Nat. Genet."},{"key":"B18","doi-asserted-by":"publisher","first-page":"520","DOI":"10.1038\/s41576-019-0144-0","article-title":"Genomics of disease risk in globally diverse populations","volume":"20","author":"Gurdasani","year":"2019","journal-title":"Nat. Rev. Genet."},{"key":"B19","doi-asserted-by":"crossref","first-page":"1090","DOI":"10.1109\/ICDM51629.2021.00127","article-title":"\u201cFederated principal component analysis for genome-wide association studies,\u201d","volume-title":"2021 IEEE International Conference on Data Mining (ICDM)","author":"Hartebrodt","year":"2021"},{"key":"B20","doi-asserted-by":"publisher","first-page":"vbac026","DOI":"10.1093\/bioadv\/vbac026","article-title":"Federated horizontally partitioned principal component analysis for biomedical applications","volume":"2","author":"Hartebrodt","year":"2022","journal-title":"Bioinform. Adv."},{"key":"B21","doi-asserted-by":"crossref","DOI":"10.1201\/b18401","volume-title":"Statistical Learning With Sparsity: The Lasso and Generalizations","author":"Hastie","year":"2015"},{"key":"B22","doi-asserted-by":"publisher","first-page":"299","DOI":"10.1016\/j.tig.2017.02.002","article-title":"Comparative approaches to genetic discrimination: chasing shadows?","volume":"33","author":"Joly","year":"2017","journal-title":"Trends Genet."},{"key":"B23","doi-asserted-by":"publisher","first-page":"40","DOI":"10.1145\/3533708","article-title":"Federated learning for healthcare domain-pipeline, applications and challenges","volume":"3","author":"Joshi","year":"2022","journal-title":"ACM Trans. Comput. 
Healthcare"},{"key":"B24","doi-asserted-by":"crossref","DOI":"10.1007\/978-981-16-8193-6","volume-title":"Machine Learning","author":"Jung","year":"2022"},{"key":"B25","first-page":"5132","article-title":"\u201cScaffold: stochastic controlled averaging for federated learning,\u201d","volume-title":"International Conference on Machine Learning","author":"Karimireddy","year":"2020"},{"key":"B26","doi-asserted-by":"publisher","first-page":"e33720","DOI":"10.2196\/33720","article-title":"Next-generation capabilities in trusted research environments: interview study","volume":"24","author":"Kavianpour","year":"2022","journal-title":"J. Med. Internet Res."},{"key":"B27","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1007\/s10897-016-0014-2","article-title":"Ancestry testing and the practice of genetic counseling","volume":"26","author":"Kirkpatrick","year":"2017","journal-title":"J. Genet. Counsel."},{"key":"B28","doi-asserted-by":"crossref","first-page":"794","DOI":"10.1109\/WorldS450073.2020.9210355","article-title":"\u201cSurvey of personalization techniques for federated learning,\u201d","volume-title":"2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4)","author":"Kulkarni","year":"2020"},{"key":"B29","doi-asserted-by":"publisher","first-page":"420","DOI":"10.1038\/s41588-021-00783-5","article-title":"The polygenic score catalog as an open database for reproducibility and systematic evaluation","volume":"53","author":"Lambert","year":"2021","journal-title":"Nat. Genet."},{"key":"B30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41598-019-51258-x","article-title":"Genomic prediction of 16 complex disease risks including heart attack, diabetes, breast and prostate cancer","volume":"9","author":"Lello","year":"2019","journal-title":"Sci. 
Rep."},{"key":"B31","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13073-020-00742-5","article-title":"Polygenic risk scores: from research tools to clinical instruments","volume":"12","author":"Lewis","year":"2020","journal-title":"Genome Med."},{"key":"B32","doi-asserted-by":"publisher","first-page":"101765","DOI":"10.1016\/j.media.2020.101765","article-title":"Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: abide results","volume":"65","author":"Li","year":"2020","journal-title":"Med. Image Anal."},{"key":"B33","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1907.02189","article-title":"On the convergence of fedavg on non-iid data","author":"Li","year":"2019","journal-title":"arXiv preprint arXiv:1907.02189"},{"key":"B34","doi-asserted-by":"crossref","first-page":"3602","DOI":"10.1109\/ICIP42928.2021.9506589","article-title":"\u201cFrom gradient leakage to adversarial attacks in federated learning,\u201d","author":"Lim","year":"2021","journal-title":"2021 IEEE International Conference on Image Processing (ICIP)"},{"key":"B35","doi-asserted-by":"publisher","first-page":"2867","DOI":"10.1093\/bioinformatics\/btq559","article-title":"Robust relationship inference in genome-wide association studies","volume":"26","author":"Manichaikul","year":"2010","journal-title":"Bioinformatics"},{"key":"B36","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2111.05628","article-title":"Machine learning models disclosure from trusted research environments (TRE), challenges and opportunities","author":"Mansouri-Benssassi","year":"2021","journal-title":"arXiv preprint arXiv:2111.05628"},{"key":"B37","doi-asserted-by":"publisher","first-page":"584","DOI":"10.1038\/s41588-019-0379-x","article-title":"Clinical use of current polygenic risk scores may exacerbate health disparities","volume":"51","author":"Martin","year":"2019","journal-title":"Nat. 
Genet."},{"key":"B38","doi-asserted-by":"crossref","first-page":"584","DOI":"10.1038\/s41588-019-0379-x","article-title":"Current clinical use of polygenic scores will risk exacerbating health disparities","volume":"51","author":"Martin","year":"2019","journal-title":"Nat. Genet."},{"key":"B39","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1146\/annurev-biodatasci-122120-104825","article-title":"The all of us data and research center: Creating a secure, scalable, and sustainable ecosystem for biomedical research","volume":"6","author":"Mayo","year":"2023","journal-title":"Annu. Rev. Biomed. Data Sci."},{"key":"B40","first-page":"1273","article-title":"\u201cCommunication-efficient learning of deep networks from decentralized data,\u201d","volume-title":"Artificial Intelligence and Statistics","author":"McMahan","year":"2017"},{"key":"B41","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1145\/3458864.3466628","article-title":"\u201cPPFL: privacy-preserving federated learning with trusted execution environments,\u201d","volume-title":"MobiSys '21: Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services","author":"Mo","year":"2021"},{"key":"B42","doi-asserted-by":"publisher","first-page":"619","DOI":"10.1016\/j.future.2020.10.007","article-title":"A survey on security and privacy of federated learning","volume":"115","author":"Mothukuri","year":"2021","journal-title":"Future Gen. Comput. 
Syst."},{"key":"B43","doi-asserted-by":"publisher","DOI":"10.1101\/2020.06.05.136382","article-title":"sPLINK: a federated, privacy-preserving tool as a robust alternative to meta-analysis in genome-wide association studies","author":"Nasirigerdeh","year":"2020","journal-title":"bioRxiv"},{"key":"B44","author":"Nik-Zainal","year":"2022","journal-title":"Multi-party Trusted Research Environment Federation: Establishing Infrastructure for Secure Analysis Across Different Clinical-Genomic Datasets"},{"key":"B45","first-page":"8024","article-title":"\u201cPyTorch: an imperative style, high-performance deep learning library,\u201d","volume-title":"Advances in Neural Information Processing Systems 32","author":"Paszke","year":"2019"},{"key":"B46","doi-asserted-by":"publisher","first-page":"e190","DOI":"10.1371\/journal.pgen.0020190","article-title":"Population structure and eigenanalysis","volume":"2","author":"Patterson","year":"2006","journal-title":"PLoS Genet."},{"key":"B47","doi-asserted-by":"publisher","first-page":"904","DOI":"10.1038\/ng1847","article-title":"Principal components analysis corrects for stratification in genome-wide association studies","volume":"38","author":"Price","year":"2006","journal-title":"Nat. Genet."},{"key":"B48","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1016\/j.ajhg.2021.11.008","article-title":"Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort","volume":"109","author":"Priv\u00e9","year":"2022","journal-title":"Am. J. Hum. 
Genet."},{"key":"B49","doi-asserted-by":"publisher","first-page":"2781","DOI":"10.1093\/bioinformatics\/bty185","article-title":"Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr","volume":"34","author":"Priv\u00e9","year":"2018","journal-title":"Bioinformatics"},{"key":"B50","doi-asserted-by":"publisher","first-page":"4449","DOI":"10.1093\/bioinformatics\/btaa520","article-title":"Efficient toolkit implementing best practices for principal component analysis of population genetic data","volume":"36","author":"Priv\u00e9","year":"2020","journal-title":"Bioinformatics"},{"key":"B51","doi-asserted-by":"publisher","first-page":"e1009141","DOI":"10.1371\/journal.pgen.1009141","article-title":"A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank","volume":"16","author":"Qian","year":"2020","journal-title":"PLoS Genet."},{"key":"B52","doi-asserted-by":"publisher","first-page":"134","DOI":"10.1002\/gepi.22105","article-title":"Methods for meta-analysis of multiple traits using GWAS summary statistics","volume":"42","author":"Ray","year":"2018","journal-title":"Genet. Epidemiol."},{"key":"B53","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41746-020-00323-1","article-title":"The future of digital health with federated learning","volume":"3","author":"Rieke","year":"2020","journal-title":"NPJ Digit. Med."},{"key":"B54","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1109\/TCBB.2018.2829760","article-title":"Safety: secure GWAS in federated environment through a hybrid solution","volume":"16","author":"Sadat","year":"2018","journal-title":"IEEE\/ACM Trans. Comput. Biol. 
Bioinform."},{"key":"B55","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41467-021-21286-1","article-title":"Population-specific causal disease effect sizes in functionally important regions impacted by selection","volume":"12","author":"Shi","year":"2021","journal-title":"Nat. Commun."},{"key":"B56","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1038\/s41588-020-00757-z","article-title":"Genetics of 35 blood and urine biomarkers in the UK Biobank","volume":"53","author":"Sinnott-Armstrong","year":"2021","journal-title":"Nat. Genet."},{"key":"B57","doi-asserted-by":"publisher","first-page":"68","DOI":"10.1038\/nature15393","article-title":"A global reference for human genetic variation","volume":"526","year":"2015","journal-title":"Nature"},{"key":"B58","doi-asserted-by":"publisher","first-page":"1278","DOI":"10.1126\/science.aaf6162","article-title":"A federated ecosystem for sharing genomic, clinical data","volume":"352","year":"2016","journal-title":"Science"},{"key":"B59","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1111\/j.2517-6161.1996.tb02080.x","article-title":"Regression shrinkage and selection via the lasso","volume":"58","author":"Tibshirani","year":"1996","journal-title":"J. R. Stat. Soc. Ser. B Stat. Methodol."},{"key":"B60","year":"2021","journal-title":"Building Trusted Research Environments - Principles and Best Practices; Towards TRE ecosystems"},{"key":"B61","doi-asserted-by":"publisher","first-page":"e24207","DOI":"10.2196\/24207","article-title":"Federated learning of electronic health records to improve mortality prediction in hospitalized patients with covid-19: machine learning approach","volume":"9","author":"Vaid","year":"2021","journal-title":"JMIR Med. 
Inform."},{"key":"B62","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1186\/1472-6939-11-21","article-title":"Caught you: threats to confidentiality due to the public release of large-scale genetic data sets","volume":"11","author":"Wjst","year":"2010","journal-title":"BMC Med. Ethics"},{"key":"B63","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s41666-020-00082-4","article-title":"Federated learning for healthcare informatics","volume":"5","author":"Xu","year":"2021","journal-title":"J. Healthcare Inform. Res."},{"key":"B64","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s42003-021-01681-6","article-title":"Genetic ancestry plays a central role in population pharmacogenomics","volume":"4","author":"Yang","year":"2021","journal-title":"Commun. Biol."},{"key":"B65","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3298981","article-title":"Federated machine learning: concept and applications","volume":"10","author":"Yang","year":"2019","journal-title":"ACM Trans. Intell. Syst. 
Technol."},{"key":"B66","doi-asserted-by":"publisher","first-page":"512","DOI":"10.1515\/eng-2019-0059","article-title":"\u2018Zhores'\u2014Petaflops supercomputer for data-driven modeling, machine learning and artificial intelligence installed in Skolkovo Institute of Science and Technology","volume":"9","author":"Zacharov","year":"2019","journal-title":"Open Eng."}],"container-title":["Frontiers in Big Data"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fdata.2024.1266031\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,13]],"date-time":"2024-11-13T02:34:20Z","timestamp":1731465260000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/fdata.2024.1266031\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,29]]},"references-count":66,"alternative-id":["10.3389\/fdata.2024.1266031"],"URL":"https:\/\/doi.org\/10.3389\/fdata.2024.1266031","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2023.01.24.23284898","asserted-by":"object"}]},"ISSN":["2624-909X"],"issn-type":[{"value":"2624-909X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,29]]},"article-number":"1266031"}}