{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T17:33:37Z","timestamp":1772300017386,"version":"3.50.1"},"reference-count":41,"publisher":"MDPI AG","issue":"15","license":[{"start":{"date-parts":[[2023,7,26]],"date-time":"2023-07-26T00:00:00Z","timestamp":1690329600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Mathematics"],"abstract":"<jats:p>Class imbalance is a common issue while developing classification models. In order to tackle this problem, synthetic data have recently been developed to enhance the minority class. These artificially generated samples aim to bolster the representation of the minority class. However, evaluating the suitability of such generated data is crucial to ensure their alignment with the original data distribution. Utility measures come into play here to quantify how similar the distribution of the generated data is to the original one. For tabular data, there are various evaluation methods that assess different characteristics of the generated data. In this study, we collected utility measures and categorized them based on the type of analysis they performed. We then applied these measures to synthetic data generated from two well-known datasets, Adults Income, and Liar+. We also used five well-known generative models, Borderline SMOTE, DataSynthesizer, CTGAN, CopulaGAN, and REaLTabFormer, to generate the synthetic data and evaluated its quality using the utility measures. 
The measurements have proven to be informative, indicating that if one synthetic dataset is superior to another in terms of utility measures, it will be more effective as an augmentation for the minority class when performing classification tasks.<\/jats:p>","DOI":"10.3390\/math11153278","type":"journal-article","created":{"date-parts":[[2023,7,27]],"date-time":"2023-07-27T02:07:17Z","timestamp":1690423637000},"page":"3278","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":23,"title":["On the Quality of Synthetic Generated Tabular Data"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-8663-6750","authenticated-orcid":false,"given":"Erica","family":"Espinosa","sequence":"first","affiliation":[{"name":"Department of Mathematics Engineering, Politecnico di Milano, 20133 Milan, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0507-7504","authenticated-orcid":false,"given":"Alvaro","family":"Figueira","sequence":"additional","affiliation":[{"name":"Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal"},{"name":"INESCTEC, 4200-465 Porto, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2023,7,26]]},"reference":[{"key":"ref_1","first-page":"1","article-title":"A systematic review on imbalanced data challenges in machine learning: Applications and solutions","volume":"52","author":"Kaur","year":"2019","journal-title":"ACM Comput. Surv."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"221","DOI":"10.1007\/s13748-016-0094-0","article-title":"Learning from imbalanced data: Open challenges and future directions","volume":"5","author":"Krawczyk","year":"2016","journal-title":"Prog. Artif. Intell."},{"key":"ref_3","unstructured":"Weng, W.H., Deaton, J., Natarajan, V., Elsayed, G.F., and Liu, Y. (2020, January 7\u20138). Addressing the real-world class imbalance problem in dermatology. 
Proceedings of the Machine Learning for Health, PMLR, Durham, NC, USA."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.1109\/TR.2021.3118026","article-title":"A comparative study of class rebalancing methods for security bug report classification","volume":"70","author":"Zheng","year":"2021","journal-title":"IEEE Trans. Reliab."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Rivera, G., Florencia, R., Garc\u00eda, V., Ruiz, A., and S\u00e1nchez-Sol\u00eds, J.P. (2020). News classification for identifying traffic incident points in a Spanish-speaking country: A real-world case study of class imbalance learning. Appl. Sci., 10.","DOI":"10.3390\/app10186253"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Isangediok, M., and Gajamannage, K. (2022). Fraud Detection Using Optimized Machine Learning Tools Under Imbalance Classes. arXiv.","DOI":"10.1109\/BigData55660.2022.10020723"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Varmedja, D., Karanovic, M., Sladojevic, S., Arsenovic, M., and Anderla, A. (2019, January 20\u201322). Credit card fraud detection-machine learning methods. Proceedings of the 2019 18th International Symposium INFOTEH-JAHORINA (INFOTEH), East Sarajevo, Bosnia and Herzegovina.","DOI":"10.1109\/INFOTEH.2019.8717766"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Salah, I., Jouini, K., and Korbaa, O. (2023). On the use of text augmentation for stance and fake news detection. J. Inf. 
Telecommun., 1\u201317.","DOI":"10.1080\/24751839.2023.2198820"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"316","DOI":"10.1007\/978-3-031-04829-6_28","article-title":"On Creation of Synthetic Samples from GANs for Fake News Identification Algorithms","volume":"Volume 3","author":"Vaz","year":"2022","journal-title":"Information Systems and Technologies: WorldCIST 2022"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Frid-Adar, M., Klang, E., Amitai, M., Goldberger, J., and Greenspan, H. (2018, January 4\u20137). Synthetic data augmentation using GAN for improved liver lesion classification. Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA.","DOI":"10.1109\/ISBI.2018.8363576"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"1007","DOI":"10.1007\/s10845-020-01710-x","article-title":"Synthetic data augmentation for surface defect detection and classification using deep learning","volume":"33","author":"Jain","year":"2022","journal-title":"J. Intell. Manuf."},{"key":"ref_12","unstructured":"Fawaz, H.I., Forestier, G., Weber, J., Idoumghar, L., and Muller, P.A. (2018). Data augmentation using synthetic data for time series classification with deep residual networks. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1016\/j.neucom.2022.04.053","article-title":"Synthetic data generation for tabular health records: A systematic review","volume":"493","author":"Hernandez","year":"2022","journal-title":"Neurocomputing"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Assefa, S.A., Dervovic, D., Mahfouz, M., Tillman, R.E., Reddy, P., and Veloso, M. (2020, January 15\u201316). Generating synthetic data in finance: Opportunities, challenges and pitfalls. 
Proceedings of the First ACM International Conference on AI in Finance, New York, NY, USA.","DOI":"10.1145\/3383455.3422554"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Shafique, R., Rustam, F., Choi, G.S., D\u00edez, I.d.l.T., Mahmood, A., Lipari, V., Velasco, C.L.R., and Ashraf, I. (2023). Breast cancer prediction using fine needle aspiration features and upsampling with supervised machine learning. Cancers, 15.","DOI":"10.3390\/cancers15030681"},{"key":"ref_16","first-page":"50","article-title":"An improved capsule network (WaferCaps) for wafer bin map classification based on DCGAN data upsampling","volume":"35","author":"Danishvar","year":"2021","journal-title":"IEEE Trans. Semicond. Manuf."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"172","DOI":"10.3390\/ai4010008","article-title":"Improving Classification Performance in Credit Card Fraud Detection by Using New Data Augmentation","volume":"4","author":"Strelcenia","year":"2023","journal-title":"AI"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1613\/jair.953","article-title":"SMOTE: Synthetic minority over-sampling technique","volume":"16","author":"Chawla","year":"2002","journal-title":"J. Artif. Intell. Res."},{"key":"ref_19","unstructured":"Arjovsky, M., Chintala, S., and Bottou, L. (2017, January 6\u201311). Wasserstein generative adversarial networks. Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia."},{"key":"ref_20","unstructured":"Doersch, C. (2016). Tutorial on variational autoencoders. arXiv."},{"key":"ref_21","unstructured":"Pardo, L. (2005). Statistical Inference Based on Divergence Measures, Chapman & Hall\/CRC Press."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1214\/aoms\/1177729694","article-title":"On information and sufficiency","volume":"22","author":"Kullback","year":"1951","journal-title":"Ann. Math. 
Stat."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1109\/18.61115","article-title":"Divergence measures based on the Shannon entropy","volume":"37","author":"Lin","year":"1991","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Goncalves, A., Ray, P., Soper, B., Stevens, J., Coyle, L., and Sales, A.P. (2020). Generation and evaluation of synthetic patient data. BMC Med. Res. Methodol., 20.","DOI":"10.1186\/s12874-020-00977-1"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Golub, G.H., and Van Loan, C.F. (2013). Matrix Computations, Johns Hopkins University Press.","DOI":"10.56021\/9781421407944"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1093\/mnras\/225.1.155","article-title":"A multidimensional version of the Kolmogorov\u2013Smirnov test","volume":"225","author":"Fasano","year":"1987","journal-title":"Mon. Not. R. Astron. Soc."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"663","DOI":"10.1111\/rssa.12358","article-title":"General and specific utility measures for synthetic data","volume":"181","author":"Snoke","year":"2018","journal-title":"J. R. Stat. Soc. Ser. A"},{"key":"ref_28","unstructured":"Becker, B., and Kohavi, R. (1996). UCI Machine Learning Repository, Department of Information and Computer Science, University of California."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Wang, W.Y. (2017). \u201cLiar, Liar pants on fire\u201d: A new benchmark dataset for fake news detection. arXiv.","DOI":"10.18653\/v1\/P17-2067"},{"key":"ref_30","unstructured":"Agrawal, R., Srikant, R., and Thomas, D. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, MD, USA, 14\u201316 June 2005."},{"key":"ref_31","unstructured":"Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. (2013, January 17\u201319). Learning fair representations. 
Proceedings of the International Conference on Machine Learning, PMLR, Atlanta, GA, USA."},{"key":"ref_32","first-page":"6478","article-title":"Retiring adult: New datasets for fair machine learning","volume":"34","author":"Ding","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"102025","DOI":"10.1016\/j.ipm.2019.03.004","article-title":"An overview of online fake news: Characterization, detection, and discussion","volume":"57","author":"Zhang","year":"2020","journal-title":"Inf. Process. Manag."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"11765","DOI":"10.1007\/s11042-020-10183-2","article-title":"FakeBERT: Fake news detection in social media with a BERT-based deep learning approach","volume":"80","author":"Kaliyar","year":"2021","journal-title":"Multimed. Tools Appl."},{"key":"ref_35","first-page":"100007","article-title":"Fake news detection: A hybrid CNN-RNN based deep learning approach","volume":"1","author":"Nasir","year":"2021","journal-title":"Int. J. Inf. Manag. Data Insights"},{"key":"ref_36","unstructured":"Vaz, B.G. (2022). Using GANs to Create Synthetic Datasets for Fake News Detection Models. [Master\u2019s Thesis, Universidade do Porto]."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Han, H., Wang, W.Y., and Mao, B.H. (2005, January 23\u201326). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. Proceedings of the Advances in Intelligent Computing: International Conference on Intelligent Computing, ICIC 2005, Hefei, China.","DOI":"10.1007\/11538059_91"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Ping, H., Stoyanovich, J., and Howe, B. (2017, January 27\u201329). Datasynthesizer: Privacy-preserving synthetic datasets. 
Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, IL, USA.","DOI":"10.1145\/3085504.3091117"},{"key":"ref_39","first-page":"7335","article-title":"Modeling tabular data using conditional gan","volume":"32","author":"Xu","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_40","unstructured":"(2023, March 17). Copula GAN Synthesizer. Available online: https:\/\/docs.sdv.dev\/sdv\/single-table-data\/modeling\/synthesizers\/copulagansynthesizer."},{"key":"ref_41","unstructured":"Solatorio, A.V., and Dupriez, O. (2023). REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers. arXiv."}],"container-title":["Mathematics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2227-7390\/11\/15\/3278\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:19:11Z","timestamp":1760127551000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2227-7390\/11\/15\/3278"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,26]]},"references-count":41,"journal-issue":{"issue":"15","published-online":{"date-parts":[[2023,8]]}},"alternative-id":["math11153278"],"URL":"https:\/\/doi.org\/10.3390\/math11153278","relation":{},"ISSN":["2227-7390"],"issn-type":[{"value":"2227-7390","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,26]]}}}