{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,21]],"date-time":"2025-10-21T00:20:56Z","timestamp":1761006056897,"version":"build-2065373602"},"reference-count":29,"publisher":"PeerJ","license":[{"start":{"date-parts":[[2025,10,20]],"date-time":"2025-10-20T00:00:00Z","timestamp":1760918400000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"abstract":"<jats:p>Data augmentation is a critical technique for enhancing model performance in scenarios with limited, sparse, or imbalanced datasets. While existing methods often focus on homogeneous data types (<jats:italic>e.g<\/jats:italic>., continuous-only or categorical-only), real-world datasets frequently contain mixed data types (continuous, integer, and categorical), posing significant challenges for synthetic data generation. This article introduces a novel empirical copula-based framework for generating synthetic data that preserves both marginal and joint probability distributions and dependencies of mixed-type features. Our method addresses missing values, handles heterogeneous data through type-specific transformations, and introduces controlled noise to enhance diversity while maintaining statistical fidelity. We demonstrate the efficacy of this approach using synthetic and experimental benchmark datasets such as the Census Income and the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, demonstrating its ability to generate realistic synthetic samples that retain the statistical properties of the original data. The proposed method is implemented in an open-source Python class, ensuring reproducibility and scalability.<\/jats:p>","DOI":"10.7717\/peerj-cs.3228","type":"journal-article","created":{"date-parts":[[2025,10,20]],"date-time":"2025-10-20T08:13:15Z","timestamp":1760947995000},"page":"e3228","source":"Crossref","is-referenced-by-count":0,"title":["Empirical copula-based data augmentation for mixed-type datasets: a robust approach for synthetic data generation"],"prefix":"10.7717","volume":"11","author":[{"given":"Mohsen","family":"Ben Hassine","sequence":"first","affiliation":[{"name":"Computer Sciences, Universit\u00e9 Tunis Carthage, Ariana, Ariana, Tunisia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lamine","family":"Mili","sequence":"additional","affiliation":[{"name":"Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University (Virginia Tech), Virginia, Virginia, United States"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"4443","published-online":{"date-parts":[[2025,10,20]]},"reference":[{"key":"10.7717\/peerj-cs.3228\/ref-1","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1007\/s13042-022-01553-3","article-title":"Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers","volume":"14","author":"Bayer","year":"2023","journal-title":"International Journal of Machine Learning and Cybernetics"},{"key":"10.7717\/peerj-cs.3228\/ref-2","first-page":"51","article-title":"MTCopula: Synthetic complex data generation using copula","author":"Benali","year":"2021"},{"key":"10.7717\/peerj-cs.3228\/ref-3","first-page":"184","article-title":"Data augmentation with variational autoencoders and manifold sampling","author":"Chadebec","year":"2021"},{"key":"10.7717\/peerj-cs.3228\/ref-4","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","article-title":"SMOTE: synthetic minority over-sampling technique","volume":"16","author":"Chawla","year":"2002","journal-title":"Journal of Artificial Intelligence Research"},{"key":"10.7717\/peerj-cs.3228\/ref-5","first-page":"113","article-title":"Autoaugment: learning augmentation strategies from data","author":"Cubuk","year":"2019"},{"key":"10.7717\/peerj-cs.3228\/ref-6","first-page":"1528","article-title":"A kernel theory of modern data augmentation","author":"Dao","year":"2019"},{"key":"10.7717\/peerj-cs.3228\/ref-7","first-page":"94","article-title":"Synthetic data generation: a comparative study","author":"Endres","year":"2022"},{"key":"10.7717\/peerj-cs.3228\/ref-8","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2008.09202","article-title":"Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning","author":"Engelmann","year":"2020"},{"issue":"8","key":"10.7717\/peerj-cs.3228\/ref-9","doi-asserted-by":"publisher","first-page":"7422","DOI":"10.1609\/aaai.v35i8.16910","article-title":"Learning to augment for data-scarce domain BERT knowledge distillation","volume":"35","author":"Feng","year":"2021","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"10.7717\/peerj-cs.3228\/ref-10","first-page":"2672","article-title":"Generative adversarial nets","volume-title":"Advances in Neural Information Processing Systems (NeurIPS 27)","author":"Goodfellow","year":"2014"},{"issue":"17","key":"10.7717\/peerj-cs.3228\/ref-11","doi-asserted-by":"publisher","first-page":"3509","DOI":"10.3390\/electronics13173509","article-title":"A systematic review of synthetic data generation techniques using generative AI","volume":"13","author":"Goyal","year":"2024","journal-title":"Electronics"},{"issue":"7","key":"10.7717\/peerj-cs.3228\/ref-12","doi-asserted-by":"publisher","first-page":"144","DOI":"10.14569\/IJACSA.2017.080720","article-title":"A copula statistic for measuring nonlinear dependence with application to feature selection in machine learning","volume":"8","author":"Hassine","year":"2017","journal-title":"International Journal of Advanced Computer Science and Application"},{"key":"10.7717\/peerj-cs.3228\/ref-13","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2203.17250","article-title":"Generation and simulation of synthetic datasets with copulas","author":"Houssou","year":"2022"},{"key":"10.7717\/peerj-cs.3228\/ref-14","doi-asserted-by":"publisher","first-page":"10123","DOI":"10.1007\/s00521-023-08459-3","article-title":"Data Augmentation techniques in time series domain: a survey and taxonomy","volume":"35","author":"Iglesias","year":"2023","journal-title":"Neural Computing and Applications"},{"issue":"1","key":"10.7717\/peerj-cs.3228\/ref-15","doi-asserted-by":"publisher","first-page":"101171","DOI":"10.1016\/j.imu.2023.101171","article-title":"Data augmentation guided breast cancer diagnosis and prognosis using an integrated deep-generative framework based on breast tumor\u2019s morphological information","volume":"37","author":"Inan","year":"2023","journal-title":"Informatics in Medicine Unlocked"},{"key":"10.7717\/peerj-cs.3228\/ref-16","article-title":"Generating multi-type temporal sequences to mitigate class-imbalanced problem","volume-title":"Machine Learning and Knowledge Discovery in Databases","author":"Jiang","year":"2021"},{"key":"10.7717\/peerj-cs.3228\/ref-17","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2101.00598","article-title":"Copula flows for synthetic data generation","author":"Kamthe","year":"2021"},{"key":"10.7717\/peerj-cs.3228\/ref-18","first-page":"17564","article-title":"TabDDPM: modelling tabular data with diffusion models (2023)","volume":"202","author":"Kotelnikov","year":"2023"},{"issue":"5","key":"10.7717\/peerj-cs.3228\/ref-19","doi-asserted-by":"publisher","first-page":"831","DOI":"10.1007\/s11633-022-1411-7","article-title":"A survey of synthetic data augmentation methods in machine vision","volume":"21","author":"Mumuni","year":"2024","journal-title":"Machine Intelligence Research"},{"key":"10.7717\/peerj-cs.3228\/ref-20","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1806.03384","article-title":"Data synthesis based on generative adversarial networks","author":"Park","year":"2018"},{"key":"10.7717\/peerj-cs.3228\/ref-21","first-page":"399","article-title":"The synthetic data vault","author":"Patki","year":"2016"},{"issue":"7","key":"10.7717\/peerj-cs.3228\/ref-22","doi-asserted-by":"publisher","first-page":"1601","DOI":"10.3390\/electronics12071601","article-title":"Nonparametric generation of synthetic data using copulas","volume":"12","author":"Restrepo","year":"2023","journal-title":"Electronics"},{"key":"10.7717\/peerj-cs.3228\/ref-23","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1007\/978-3-031-78192-6_6","article-title":"PostAugment: adversarial data augmentation with hard sample suppression by incorrect class likelihood","volume-title":"Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science","volume":"15310","author":"Sawada","year":"2025"},{"key":"10.7717\/peerj-cs.3228\/ref-24","doi-asserted-by":"publisher","first-page":"60","DOI":"10.1186\/s40537-019-0197-0","article-title":"A survey on image data augmentation for deep learning","volume":"6","author":"Shorten","year":"2019","journal-title":"Journal of Big Data"},{"key":"10.7717\/peerj-cs.3228\/ref-25","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1401.7645","article-title":"Comment on detecting novel associations in large data sets","author":"Simon","year":"2014"},{"key":"10.7717\/peerj-cs.3228\/ref-26","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2405.09591","article-title":"A comprehensive survey on data augmentation","author":"Wang","year":"2025"},{"key":"10.7717\/peerj-cs.3228\/ref-27","first-page":"7333","article-title":"Modeling tabular data using conditional gan","volume-title":"Advances in Neural Information Processing Systems (NeurIPS 32)","author":"Xu","year":"2019"},{"key":"10.7717\/peerj-cs.3228\/ref-28","doi-asserted-by":"publisher","first-page":"110204","DOI":"10.1016\/j.patcog.2023.110204","article-title":"Investigating the effectiveness of data augmentation from similarity and diversity: an empirical study","volume":"148","author":"Yang","year":"2024","journal-title":"Pattern Recognition"},{"key":"10.7717\/peerj-cs.3228\/ref-29","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2409.06290","article-title":"Entropy-driven adaptive data augmentation framework for image classification","author":"Yang","year":"2024"}],"container-title":["PeerJ Computer Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/peerj.com\/articles\/cs-3228.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/peerj.com\/articles\/cs-3228.xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/peerj.com\/articles\/cs-3228.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/peerj.com\/articles\/cs-3228.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,20]],"date-time":"2025-10-20T08:13:20Z","timestamp":1760948000000},"score":1,"resource":{"primary":{"URL":"https:\/\/peerj.com\/articles\/cs-3228"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,20]]},"references-count":29,"alternative-id":["10.7717\/peerj-cs.3228"],"URL":"https:\/\/doi.org\/10.7717\/peerj-cs.3228","archive":["CLOCKSS","LOCKSS","Portico"],"relation":{},"ISSN":["2376-5992"],"issn-type":[{"value":"2376-5992","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,10,20]]},"article-number":"e3228"}}