{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,15]],"date-time":"2026-06-15T10:26:40Z","timestamp":1781519200509,"version":"3.54.1"},"reference-count":46,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2021,9,4]],"date-time":"2021-09-04T00:00:00Z","timestamp":1630713600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"AI chair of excellence HUMANIA","award":["ANR project ANR-19-CHIA-0022-01"],"award-info":[{"award-number":["ANR project ANR-19-CHIA-0022-01"]}]},{"DOI":"10.13039\/100017574","name":"United Health Foundation","doi-asserted-by":"publisher","award":["1990"],"award-info":[{"award-number":["1990"]}],"id":[{"id":"10.13039\/100017574","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100004316","name":"IBM","doi-asserted-by":"publisher","award":["AI Horizons Network"],"award-info":[{"award-number":["AI Horizons Network"]}],"id":[{"id":"10.13039\/100004316","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically-generated healthcare data solve this problem by preserving privacy and enabling researchers and policymakers to drive decisions and methods based on realistic data. Healthcare data can include information about multiple in- and out- patient visits of patients, making it a time-series dataset which is often influenced by protected attributes like age, gender, race etc. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. 
To combat these inequities, synthetic data must \u201cfairly\u201d represent diverse minority subgroups such that the conclusions drawn on synthetic data are correct and the results can be generalized to real data. In this article, we develop two fairness metrics for synthetic data, and analyze all subgroups defined by protected attributes to analyze the bias in three published synthetic research datasets. These covariate-level disparity metrics revealed that synthetic data may not be representative at the univariate and multivariate subgroup-levels and thus, fairness should be addressed when developing data generation methods. We discuss the need for measuring fairness in synthetic healthcare data to enable the development of robust machine learning models to create more equitable synthetic healthcare datasets.<\/jats:p>","DOI":"10.3390\/e23091165","type":"journal-article","created":{"date-parts":[[2021,9,6]],"date-time":"2021-09-06T13:15:56Z","timestamp":1630934156000},"page":"1165","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":88,"title":["The Problem of Fairness in Synthetic Healthcare Data"],"prefix":"10.3390","volume":"23","author":[{"given":"Karan","family":"Bhanot","sequence":"first","affiliation":[{"name":"Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA"},{"name":"OptumLabs, Eden Prairie, MN 55344, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Miao","family":"Qi","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3078-4566","authenticated-orcid":false,"given":"John S.","family":"Erickson","sequence":"additional","affiliation":[{"name":"Rensselaer Institute for Data Exploration and Applications, Troy, NY 12180, 
USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9266-1783","authenticated-orcid":false,"given":"Isabelle","family":"Guyon","sequence":"additional","affiliation":[{"name":"LISN, CNRS\/INRIA, Universit\u00e9 Paris-Saclay, 91190 Gif-sur-Yvette, France"},{"name":"ChaLearn, San Francisco, CA 94115, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kristin P.","family":"Bennett","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA"},{"name":"Rensselaer Institute for Data Exploration and Applications, Troy, NY 12180, USA"},{"name":"Department of Mathematics, Rensselaer Polytechnic Institute, Troy, NY 12180, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2021,9,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s00392-016-1025-6","article-title":"Electronic health records to facilitate clinical research","volume":"106","author":"Cowie","year":"2017","journal-title":"Clin. Res. Cardiol."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"e10076","DOI":"10.1002\/lrh2.10076","article-title":"Use of EHRs data for clinical research: Historical progress and current applications","volume":"3","author":"Nordo","year":"2019","journal-title":"Learn. Health. Syst."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Chakraborty, P., and Farooq, F. (2019, January 4\u20138). A Robust Framework for Accelerated Outcome-Driven Risk Factor Identification from EHR. 
Proceedings of the KDD\u201919: 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA.","DOI":"10.1145\/3292500.3330718"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1146\/annurev-publhealth-040617-014208","article-title":"Big Data in Public Health: Terminology, Machine Learning, and Privacy","volume":"39","author":"Mooney","year":"2018","journal-title":"Annu. Rev. Public Health"},{"key":"ref_5","unstructured":"(2021, June 24). Health Insurance Portability and Accountability Act of 1996 (HIPAA), Available online: https:\/\/www.cdc.gov\/phlp\/publications\/topic\/hipaa.html."},{"key":"ref_6","unstructured":"(2021, June 24). US Department of Health and Human Services. Your Rights Under HIPAA, Available online: https:\/\/www.hhs.gov\/hipaa\/for-individuals\/guidance-materials-for-consumers\/index.html."},{"key":"ref_7","unstructured":"European Parliament and of the Council (27 April 2016) (2016). Regulation on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95\/46\/EC (General Data Protection Regulation). L119, European Council."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"2000178","DOI":"10.1002\/bies.202000178","article-title":"Superposition of COVID-19 waves, anticipating a sustained wave, and lessons for the future","volume":"42","author":"Lai","year":"2020","journal-title":"BioEssays"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"2002324","DOI":"10.1002\/advs.202002324","article-title":"Relieving Cost of Epidemic by Parrondo\u2019s Paradox: A COVID-19 Case Study","volume":"7","author":"Cheong","year":"2020","journal-title":"Adv. Sci."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"El Emam, K., Mosquera, L., Jonker, E., and Sood, H. (2021). Evaluating the utility of synthetic COVID-19 case data.
JAMIA Open, 4.","DOI":"10.1093\/jamiaopen\/ooab012"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"703","DOI":"10.1093\/cid\/ciaa815","article-title":"The disproportionate impact of COVID-19 on racial and ethnic minorities in the United States","volume":"72","author":"Tai","year":"2021","journal-title":"Clin. Infect. Dis."},{"key":"ref_12","unstructured":"(2021, June 24). NIH Office of Extramural Research; U.S. Department of Health and Human Services. Ethics in Clinical Research, Available online: https:\/\/clinicalcenter.nih.gov\/recruit\/ethics.html."},{"key":"ref_13","unstructured":"(2021, June 27). NIH Clinical Center; U.S. Department of Health and Human Services. Notice of NIH\u2019s Interest in Diversity, Available online: https:\/\/grants.nih.gov\/grants\/guide\/notice-files\/NOT-OD-20-031.html."},{"key":"ref_14","unstructured":"(2021, June 24). U.S. Bureau of Labor Statistics. ATUS News Releases, Available online: https:\/\/www.bls.gov\/tus\/."},{"key":"ref_15","unstructured":"Dash, S., Dutta, R., Guyon, I., Pavao, A., Yale, A., and Bennett, K.P. (2019). Synthetic Event Time Series Health Data Generation. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Yale, A.J. (2020). Privacy Preserving Synthetic Health Data Generation and Evaluation. [Ph.D. Thesis, Rensselaer Polytechnic Institute].","DOI":"10.1016\/j.neucom.2019.12.136"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Bhanot, K., Dash, S., Pedersen, J., Guyon, I., and Bennett, K.P. (2021, January 6\u20138). Quantifying Resemblance of Synthetic Medical Time-Series. Proceedings of the 29th European Symposium on Artificial Neural Networks ESANN, Online.","DOI":"10.14428\/esann\/2021.ES2021-108"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Yeom, S., Giacomelli, I., Fredrikson, M., and Jha, S. (2018). Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting. 
arXiv.","DOI":"10.1109\/CSF.2018.00027"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"244","DOI":"10.1016\/j.neucom.2019.12.136","article-title":"Generation and evaluation of privacy preserving synthetic health data","volume":"416","author":"Yale","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Abramowicz, W., and Klein, G. (2020). Synthesizing Quality Open Data Assets from Private Health Research Studies. Business Information Systems Workshops, Springer International Publishing.","DOI":"10.1007\/978-3-030-53337-3"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"539","DOI":"10.1007\/s40615-016-0256-6","article-title":"Use of Electronic Health Record Data to Evaluate the Impact of Race on 30-Day Mortality in Patients Admitted to the Intensive Care Unit","volume":"4","author":"Mundkur","year":"2016","journal-title":"J. Racial Ethnic Health Disparities"},{"key":"ref_22","unstructured":"Gajane, P. (2017). On formalizing fairness in prediction with machine learning. arXiv."},{"key":"ref_23","unstructured":"Kleinberg, J., Mullainathan, S., and Raghavan, M. (2016). Inherent Trade-Offs in the Fair Determination of Risk Scores. arXiv."},{"key":"ref_24","unstructured":"Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2021, June 24). Generative Adversarial Nets. Available online: https:\/\/papers.nips.cc\/paper\/2014\/file\/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf."},{"key":"ref_25","unstructured":"Choi, E., Biswal, S., Malin, B., Duke, J., Stewart, W.F., and Sun, J. (2017, January 18\u201319). Generating Multi-label Discrete Patient Records using Generative Adversarial Networks. Proceedings of the 2nd Machine Learning for Healthcare Conference, Boston, MA, USA."},{"key":"ref_26","unstructured":"Buolamwini, J., and Gebru, T. (2018, January 23\u201324). 
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of the 1st Conference on Fairness, Accountability and Transparency, New York, NY, USA."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Cheng, V., Suriyakumar, V.M., Dullerud, N., Joshi, S., and Ghassemi, M. (2021, January 3\u201310). Can You Fake It Until You Make It? Impacts of Differentially Private Synthetic Data on Downstream Classification Fairness. Proceedings of the FAccT \u201921: 2021 ACM Conference on Fairness, Accountability, and Transparency, Association for Computing Machinery, New York, NY, USA.","DOI":"10.1145\/3442188.3445879"},{"key":"ref_28","unstructured":"Gupta, A., Bhatt, D., and Pandey, A. (2021). Transitioning from Real to Synthetic data: Quantifying the bias in model. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"3:1","DOI":"10.1147\/JRD.2019.2945519","article-title":"Fairness GAN: Generating datasets with fairness properties using a generative adversarial network","volume":"63","author":"Sattigeri","year":"2019","journal-title":"IBM J. Res. Dev."},{"key":"ref_30","unstructured":"Jagielski, M., Kearns, M.J., Mao, J., Oprea, A., Roth, A., Sharifi-Malvajerdi, S., and Ullman, J.R. (2018). Differentially Private Fair Learning. 
arXiv."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"2720","DOI":"10.1001\/jama.291.22.2720","article-title":"Participation in cancer clinical trials race-, sex-, and age-based disparities","volume":"291","author":"Murthy","year":"2004","journal-title":"JAMA"},{"key":"ref_32","first-page":"239","article-title":"The representativeness of eligible patients in type 2 diabetes trials: A case study using GIST 2.0","volume":"25","author":"Sen","year":"2018","journal-title":"JAMIA"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"325","DOI":"10.1016\/j.jbi.2016.09.003","article-title":"GIST 2.0: A scalable multi-trait metric for quantifying population representativeness of individual clinical studies","volume":"63","author":"Sen","year":"2016","journal-title":"J. Biomed. Inform."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Qi, M., Cahan, O., Foreman, M.A., Gruen, D.M., Das, A.K., and Bennett, K.P. (2021). Quantifying representativeness in randomized clinical trials using machine learning fairness metrics. medRxiv, preprint.","DOI":"10.1101\/2021.06.23.21259272"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Garg, P., Villasenor, J.D., and Foggo, V. (2020). Fairness Metrics: A Comparative Analysis. arXiv.","DOI":"10.1109\/BigData50022.2020.9378025"},{"key":"ref_36","unstructured":"Hinnefeld, J.H., Cooman, P., Mammo, N., and Deese, R. (2018). Evaluating Fairness Metrics in the Presence of Dataset Bias. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Hutchinson, B., and Mitchell, M. (2018). 50 Years of Test (Un)fairness: Lessons for Machine Learning. arXiv.","DOI":"10.1145\/3287560.3287600"},{"key":"ref_38","unstructured":"Corbett-Davies, S., and Goel, S. (2018). The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning. 
arXiv."},{"key":"ref_39","first-page":"1","article-title":"AI fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias","volume":"4","author":"Bellamy","year":"2018","journal-title":"IBM J. Res. Dev."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Feldman, M., Friedler, S., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. (2014). Certifying and removing disparate impact. arXiv.","DOI":"10.1145\/2783258.2783311"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"160035","DOI":"10.1038\/sdata.2016.35","article-title":"MIMIC-III, a freely accessible critical care database","volume":"3","author":"Johnson","year":"2016","journal-title":"Sci. Data"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"1272","DOI":"10.1002\/aur.2128","article-title":"Clustering of co-occurring conditions in autism spectrum disorder during early childhood: A retrospective analysis of medical claims data","volume":"12","author":"Vargason","year":"2019","journal-title":"Autism Res."},{"key":"ref_43","unstructured":"Mirza, M., and Osindero, S. (2014). Conditional Generative Adversarial Nets. arXiv."},{"key":"ref_44","unstructured":"(2021, June 24). Karan Bhanot and Andrew Yale. Synthetic_Data. Available online: https:\/\/github.com\/TheRensselaerIDEA\/synthetic_data."},{"key":"ref_45","unstructured":"Kumar, G., Jain, S., and Singh, U.P. (2020). Stock Market Forecasting Using Computational Intelligence: A Survey. Arch. Comput. 
Methods Eng., 1\u201333."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"1857","DOI":"10.1016\/j.patcog.2005.01.025","article-title":"Clustering of time series data\u2014A survey","volume":"38","author":"Liao","year":"2005","journal-title":"Pattern Recognit."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/23\/9\/1165\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:56:32Z","timestamp":1760165792000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/23\/9\/1165"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,4]]},"references-count":46,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2021,9]]}},"alternative-id":["e23091165"],"URL":"https:\/\/doi.org\/10.3390\/e23091165","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,9,4]]}}}