{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,16]],"date-time":"2026-03-16T23:05:00Z","timestamp":1773702300227,"version":"3.50.1"},"reference-count":28,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2025,11,14]],"date-time":"2025-11-14T00:00:00Z","timestamp":1763078400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"FCT \u2013 Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["UID\/05105\/2025"],"award-info":[{"award-number":["UID\/05105\/2025"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>The growing demand for data-driven solutions in healthcare is often hindered by limited access to high-quality datasets due to privacy concerns, data imbalance, and regulatory constraints. Synthetic data generation has emerged as a promising strategy to address these challenges by creating artificial yet statistically valid datasets that preserve the underlying patterns of real data without compromising patient confidentiality. This study explores methodologies for generating synthetic data tailored to binary and multi-class classification problems within the health domain. We employ advanced techniques such as probabilistic modelling, generative adversarial networks, and data augmentation strategies to replicate realistic feature distributions and class relationships. A comprehensive evaluation is conducted using benchmark healthcare datasets, measuring fidelity, diversity, and utility of the synthetic data in downstream predictive modelling tasks. The original dataset consisted of 2125 imbalanced cases, both in the binary and multi-class classification scenarios. Experimental results demonstrate that models trained on synthetic datasets achieve performance levels comparable to those trained on real data, particularly in scenarios with severe class imbalance. The findings underscore the potential of synthetic data as a privacy-preserving enabler for robust machine learning applications in healthcare, facilitating innovation while adhering to strict data protection regulations.<\/jats:p>","DOI":"10.3390\/info16110986","type":"journal-article","created":{"date-parts":[[2025,11,14]],"date-time":"2025-11-14T17:33:21Z","timestamp":1763141601000},"page":"986","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Synthetic Data Generation for Binary and Multi-Class Classification in the Health Domain"],"prefix":"10.3390","volume":"16","author":[{"given":"Camila","family":"Guerreiro","sequence":"first","affiliation":[{"name":"Research on Economics, Management and Information Technologies, REMIT, Portucalense University, 4200-072 Porto, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4418-2590","authenticated-orcid":false,"given":"F\u00e1tima","family":"Leal","sequence":"additional","affiliation":[{"name":"Research on Economics, Management and Information Technologies, REMIT, Portucalense University, 4200-072 Porto, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2021-9141","authenticated-orcid":false,"given":"Micaela","family":"Pinho","sequence":"additional","affiliation":[{"name":"Research on Economics, Management and Information Technologies, REMIT, Portucalense University, 4200-072 Porto, Portugal"},{"name":"Instituto Jur\u00eddico Portucalense, IJP, Portucalense University, 4200-072 Porto, Portugal"},{"name":"Research Unit in Governance, Competitiveness and Public Policy, GOVCOPP, Aveiro University, 3810-193 Aveiro, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,14]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"2892","DOI":"10.1016\/j.csbj.2024.07.005","article-title":"Synthetic data generation methods in healthcare: A review on open-source tools and methods","volume":"23","author":"Pezoulas","year":"2024","journal-title":"Comput. Struct. Biotechnol. J."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1146\/annurev-biodatasci-103123-094844","article-title":"Conditional Generative Models for Synthetic Tabular Data: Applications for Precision Medicine and Diverse Representations","volume":"8","author":"Liu","year":"2025","journal-title":"Annu. Rev. Biomed. Data Sci."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Nasimov, R., Nasimova, N., Mirzakhalilov, S., Tokdemir, G., Rizwan, M., Abdusalomov, A., and Cho, Y.I. (2024). GAN-Based Novel Approach for Generating Synthetic Medical Tabular Data. Bioengineering, 11.","DOI":"10.3390\/bioengineering11121288"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Endres, M., Mannarapotta Venugopal, A., and Tran, T.S. (2022, January 22\u201324). Synthetic Data Generation: A Comparative Study. Proceedings of the IDEAS \u201922: Proceedings of the 26th International Database Engineered Applications Symposium, New York, NY, USA.","DOI":"10.1145\/3548785.3548793"},{"key":"ref_5","unstructured":"Xu, L., Skoularidou, M., Cuesta-Infante, A., and Veeramachaneni, K. (2019, January 8\u201314). Modeling tabular data using conditional GAN. Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"104830","DOI":"10.1016\/j.trc.2024.104830","article-title":"Copula-based transferable models for synthetic population generation","volume":"169","author":"Yang","year":"2024","journal-title":"Transp. Res. Part C Emerg. Technol."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Hahn, W., Sch\u00fctte, K., Schultz, K., Wolkenhauer, O., Sedlmayr, M., Schuler, U., Eichler, M., Bej, S., and Wolfien, M. (2022). Contribution of Synthetic Data Generation towards an Improved Patient Stratification in Palliative Care. J. Pers. Med., 12.","DOI":"10.3390\/jpm12081278"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Abedi, M., Hempel, L., Sadeghi, S., and Kirsten, T. (2022). GAN-Based Approaches for Generating Structured Data in the Medical Domain. Appl. Sci., 12.","DOI":"10.3390\/app12147075"},{"key":"ref_9","first-page":"398","article-title":"How to fairly allocate scarce medical resources? Controversial preferences of healthcare professionals with different personal characteristics","volume":"17","author":"Pinho","year":"2022","journal-title":"Health Econ. Policy Law"},{"key":"ref_10","unstructured":"Sun, Y., Cuesta-Infante, A., and Veeramachaneni, K. (February, January 27). Learning vine copula models for synthetic data generation. Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA. AAAI\u201919\/IAAI\u201919\/EAAI\u201919."},{"key":"ref_11","first-page":"235","article-title":"Synthetic data generation for tabular health records: A systematic review","volume":"510","author":"Epelde","year":"2022","journal-title":"Neurocomputing"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"2863","DOI":"10.21105\/joss.02863","article-title":"Synthia: Multidimensional synthetic data generation in Python","volume":"6","author":"Meyer","year":"2021","journal-title":"J. Open Source Softw."},{"key":"ref_13","unstructured":"\u00c1lvaro, R., Adeli, H., Dzemyda, G., Moreira, F., and Colla, V. (2024). Balancing Plug-In for Stream-Based Classification. Information Systems and Technologies: WorldCIST 2023, Volume 1, Springer."},{"key":"ref_14","unstructured":"Nelsen, R.B. (2006). An Introduction to Copulas, Springer. [2nd ed.]."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"107654","DOI":"10.1016\/j.csda.2022.107654","article-title":"Efficient and feasible inference for high-dimensional normal copula regression models","volume":"179","author":"Nikoloulopoulos","year":"2023","journal-title":"Comput. Stat. Data Anal."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Patki, N., Wedge, R., and Veeramachaneni, K. (2016, January 17\u201319). The Synthetic Data Vault. Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada.","DOI":"10.1109\/DSAA.2016.49"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Jadon, A., and Kumar, S. (2023, January 25\u201327). Leveraging Generative AI Models for Synthetic Data Generation in Healthcare: Balancing Research and Privacy. Proceedings of the 2023 International Conference on Smart Applications, Communications and Networking (SmartNets), Istanbul, Turkey.","DOI":"10.1109\/SmartNets58706.2023.10215825"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"5965","DOI":"10.1007\/s10115-025-02394-6","article-title":"A Synthetic Over-sampling method with Minority and Majority classes for imbalance problems","volume":"67","author":"Khorshidi","year":"2025","journal-title":"Knowl. Inf. Syst."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"485","DOI":"10.1016\/j.ins.2021.12.018","article-title":"Differentially private synthetic medical data generation using convolutional GANs","volume":"586","author":"Torfi","year":"2022","journal-title":"Inf. Sci."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Gurcan, F., and Soylu, A. (2024). Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis. Cancers, 16.","DOI":"10.3390\/cancers16193417"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1007\/s10462-024-10884-2","article-title":"Handling imbalanced medical datasets: Review of a decade of research","volume":"57","author":"Salmi","year":"2024","journal-title":"Artif. Intell. Rev."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"110415","DOI":"10.1016\/j.asoc.2023.110415","article-title":"A broad review on class imbalance learning techniques","volume":"143","author":"Rezvani","year":"2023","journal-title":"Appl. Soft Comput."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Abdulsadig, R.S., and Villegas, E. (2024). A comparative study in class imbalance mitigation when working with healthcare data. Front. Digit. Health, 6.","DOI":"10.3389\/fdgth.2024.1377165"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"e52615","DOI":"10.2196\/52615","article-title":"Generating Synthetic Electronic Health Record Data Using Generative Adversarial Networks: Tutorial","volume":"3","author":"Yan","year":"2024","journal-title":"JMIR AI"},{"key":"ref_25","first-page":"427","article-title":"A systematic analysis of performance measures for classification tasks","volume":"45","author":"Sokolova","year":"2009","journal-title":"Inf. Process. Manag. Int. J."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Ayyanar, M., Jegananathan, S., Parthasarathy, S., Jayaraman, V., and Lakshminarayanan, A.R. (2022, January 25\u201327). Predicting the Cardiac Diseases using SelectKBest Method Equipped Light Gradient Boosting Machine. Proceedings of the 2022 6th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India.","DOI":"10.1109\/ICOEI53556.2022.9777224"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"1194","DOI":"10.1111\/j.1524-4733.2008.00321.x","article-title":"Citizen\u2019s preferences regarding principles to guide health care allocation decisions in Thailand","volume":"11","author":"Kasemsup","year":"2008","journal-title":"Value Health"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1017\/S1744133118000403","article-title":"Attitudes of health professionals concerning bedside rationing criteria: A survey from Portugal","volume":"15","author":"Pinho","year":"2020","journal-title":"Health Econ. Policy Law"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/11\/986\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T05:19:43Z","timestamp":1763443183000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/11\/986"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,14]]},"references-count":28,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2025,11]]}},"alternative-id":["info16110986"],"URL":"https:\/\/doi.org\/10.3390\/info16110986","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,14]]}}}