{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T16:59:57Z","timestamp":1781629197844,"version":"3.54.5"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2026,1,7]],"date-time":"2026-01-07T00:00:00Z","timestamp":1767744000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,1,7]],"date-time":"2026-01-07T00:00:00Z","timestamp":1767744000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Information Technology Lab of National Institute of Standards and Technology","award":["70NANB21H092"],"award-info":[{"award-number":["70NANB21H092"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["SN COMPUT. SCI."],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Datasets used in machine learning often contain sensitive information, including personally identifiable health and financial details. A common challenge faced by organizations and researchers is the risk of privacy breaches when using real-world data. Synthetic data can be used as an alternative to the real-world data. In existing synthetic data generation techniques, an encoder processes the real-world data to map it into a lower-dimensional latent space. Random sampling is then performed in this latent space. Subsequently, a decoder network is utilized to generate synthetic data from these sampled points in the latent space. Such approaches typically require generating a large number of synthetic samples to approximate the performance of real-world data, subsequently slowing down downstream machine learning tasks. 
Addressing this, we introduce a combinatorial approach to sampling the latent space, motivated by our empirical findings within this study that most model predictions are largely influenced by interactions between a few features. In some cases, just using a small number of features produces accuracy better than using entire features. Through this approach, we generate samples that utilize t-way interactions among the t latent dimensions out of n. Our experimental results indicate that our approach requires fewer samples than traditional random sampling to achieve comparable model performance for real-world data sets. We also show that when integrated with a differentially private mechanism, our approach incurs a smaller decline in model performance than existing random sampling approach.<\/jats:p>","DOI":"10.1007\/s42979-025-04540-x","type":"journal-article","created":{"date-parts":[[2026,1,7]],"date-time":"2026-01-07T11:08:27Z","timestamp":1767784107000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["A Combinatorial Approach to Synthetic Data Generation for Machine Learning"],"prefix":"10.1007","volume":"7","author":[{"given":"Krishna","family":"Khadka","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jaganmohan","family":"Chandrasekaran","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yu","family":"Lei","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Raghu","family":"Kacker","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"D. 
Richard","family":"Kuhn","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2026,1,7]]},"reference":[{"key":"4540_CR1","unstructured":"Lu Y, Shen M, Wang H, Wang X, van Rechem C, Wei W. Machine learning for synthetic data generation: a review. arXiv preprint arXiv:2302.04062. 2023."},{"key":"4540_CR2","unstructured":"De Cristofaro E. An overview of privacy in machine learning. arXiv preprint arXiv:2005.08679, 2020."},{"key":"4540_CR3","unstructured":"Kingma DP, Welling M. Auto-encoding variational Bayes. 2013."},{"key":"4540_CR4","unstructured":"An J, Cho S. Variational autoencoder based anomaly detection using reconstruction probability. Special lecture on IE. 2015;2(1):1\u201318."},{"key":"4540_CR5","doi-asserted-by":"crossref","unstructured":"Pol AA, Berger V, Germain C, Cerminara G, Pierini M. Anomaly detection with conditional variational autoencoders. In 2019 18th IEEE international conference on machine learning and applications (ICMLA) 2019; (pp. 1651\u20131657). IEEE.","DOI":"10.1109\/ICMLA.2019.00270"},{"key":"4540_CR6","unstructured":"Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair, S., & Bengio Y. Generative adversarial nets. Advances in neural information processing systems, 2014; 27"},{"key":"4540_CR7","doi-asserted-by":"crossref","unstructured":"Wang Y, Yu B, Wang L, Zu C, Lalush DS, Lin W, et al. 3D conditional generative adversarial networks for high-quality PET image estimation at low dose. Neuroimage. 2018;174:550\u201362.","DOI":"10.1016\/j.neuroimage.2018.03.045"},{"key":"4540_CR8","doi-asserted-by":"crossref","unstructured":"Kameoka H, Kaneko T, Tanaka K, Hojo N. Stargan-vc: non-parallel many-to-many voice conversion using star generative adversarial networks. In: 2018 IEEE Spoken Language Technology Workshop (SLT) 2018; (pp. 266\u2013273). 
IEEE.","DOI":"10.1109\/SLT.2018.8639535"},{"key":"4540_CR9","unstructured":"Humayun AI, Balestriero R, Baraniuk R. . Magnet: uniform sampling from deep generative network manifolds without retraining. In: International conference on learning representations. 2021."},{"key":"4540_CR10","doi-asserted-by":"crossref","unstructured":"Bousmalis K, Irpan A, Wohlhart P, Bai Y, Kelcey M, Kalakrishnan M, ... & Vanhoucke, V. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE international conference on robotics and automation (ICRA) 2018; (pp. 4243-4250). IEEE.","DOI":"10.1109\/ICRA.2018.8460875"},{"key":"4540_CR11","doi-asserted-by":"crossref","unstructured":"Lei Y, Kacker R, Kuhn DR, Okun V, Lawrence J. IPOG: a general strategy for t-way software testing. In 14th Annual IEEE International Conference and Workshops on the Engineering of Computer-Based Systems (ECBS\u201907) 2007; (pp. 549-556). IEEE.","DOI":"10.1109\/ECBS.2007.47"},{"issue":"2","key":"4540_CR12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/1883612.1883618","volume":"43","author":"C Nie","year":"2011","unstructured":"Nie C, Leung H. A survey of combinatorial testing. ACM Comput Surv. 2011;43(2):1\u201329.","journal-title":"ACM Comput Surv"},{"key":"4540_CR13","unstructured":"Sohn K, Lee H, Yan X. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 2015; 28."},{"key":"4540_CR14","unstructured":"Tejashvi. Tour & Travels Customer Churn Prediction. Kaggle. Available:https:\/\/www.kaggle.com\/datasets\/tejashvi14\/tour-travels-customer-churn-prediction, Accessed 23 Jan 2022. 2021."},{"key":"4540_CR15","unstructured":"Oliabev, A. HELOC. Kaggle. Available: https:\/\/www.kaggle.com\/datasets\/averkiyoliabev\/home-equity-line-of-creditheloc. 2021."},{"key":"4540_CR16","unstructured":"Dua D, Graff C. UCI machine learning repository. 
Available: http:\/\/archive.ics.uci.edu\/ml. 2017."},{"key":"4540_CR17","unstructured":"Yeh, Lien. UCI machine learning repository. Available: https:\/\/archive.ics.uci.edu\/ml\/datasets\/default+of+credit+card+clients. 2016."},{"key":"4540_CR18","doi-asserted-by":"crossref","unstructured":"Khadka K, Chandrasekaran J, Lei Y, Kacker RN, Kuhn DR. Synthetic Data Generation Using Combinatorial Testing and Variational Autoencoder. In: 2023 IEEE International conference on software testing, verification and validation workshops (ICSTW) 2023; (pp. 228\u2013236). IEEE.","DOI":"10.1109\/ICSTW58534.2023.00048"},{"key":"4540_CR19","doi-asserted-by":"crossref","unstructured":"Dwork, C. . Differential privacy. In: International colloquium on automata, languages, and programming (pp. 1-12). Berlin, Heidelberg: Springer 2006.","DOI":"10.1007\/11787006_1"},{"key":"4540_CR20","doi-asserted-by":"crossref","unstructured":"Dwork C, Roth A. The algorithmic foundations of differential privacy. Foundations and Trends$$\\text{\\textregistered} $$ in Theoretical Computer Science. 2014;9(3\u20134):211\u2013407.","DOI":"10.1561\/0400000042"},{"key":"4540_CR21","unstructured":"Zhong H, Bu K. Privacy-utility trade-off. arXiv preprint 2022; arXiv:2204.12057."},{"issue":"4\u20135","key":"4540_CR22","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1016\/0925-2312(93)90006-O","volume":"5","author":"SI Amari","year":"1993","unstructured":"Amari SI. Backpropagation and stochastic gradient descent method. Neurocomputing. 1993;5(4\u20135):185\u201396.","journal-title":"Neurocomputing"},{"key":"4540_CR23","doi-asserted-by":"crossref","unstructured":"Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, Zhang L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security 2016; (pp. 
308\u2013318).","DOI":"10.1145\/2976749.2978318"},{"key":"4540_CR24","doi-asserted-by":"crossref","unstructured":"Ribeiro MT, Singh S, Guestrin C. Why should i trust you? Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining 2016; (pp. 1135-1144).","DOI":"10.1145\/2939672.2939778"},{"key":"4540_CR25","doi-asserted-by":"crossref","unstructured":"Inan A, Kantarcioglu M, Ghinita G, Bertino E. Private record matching using differential privacy. In Proceedings of the 13th International Conference on Extending Database Technology 2010; (pp. 123\u2013134).","DOI":"10.1145\/1739041.1739059"},{"key":"4540_CR26","doi-asserted-by":"crossref","unstructured":"McSherry F, Mironov I. Differentially private recommender systems: building privacy into the netflix prize contenders. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining 2009; (pp. 627\u2013636).","DOI":"10.1145\/1557019.1557090"},{"key":"4540_CR27","doi-asserted-by":"publisher","unstructured":"Patki N, Wedge R, Veeramachaneni K. The synthetic data vault.  2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2016; pp. 399\u2013410. https:\/\/doi.org\/10.1109\/DSAA.2016.49.","DOI":"10.1109\/DSAA.2016.49"},{"key":"4540_CR28","unstructured":"Xu L, Skoularidou M, Cuesta-Infante A, Veeramachaneni K. Modeling tabular data using conditional gan. Advances in neural information processing systems, 2019; 32."},{"key":"4540_CR29","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825\u201330."},{"key":"4540_CR30","unstructured":"Sdmetrics, Synthetic Data Metrics, DataCebo, Inc., 2023, 4, Version 0.9.3, https:\/\/docs.sdv.dev\/sdmetrics\/"},{"key":"4540_CR31","doi-asserted-by":"crossref","unstructured":"Borazjany MN, Yu L, Lei Y, Kacker R, Kuhn, R. 
Combinatorial testing of ACTS: A case study. In 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation 2012; (pp. 591\u2013600). IEEE.","DOI":"10.1109\/ICST.2012.146"},{"key":"4540_CR32","doi-asserted-by":"crossref","unstructured":"McKnight PE, Najab J. Mann\u2013Whitney U Test. The Corsini encyclopedia of psychology, 2010;\u2019 1\u20131.","DOI":"10.1002\/9780470479216.corpsy0524"},{"issue":"1","key":"4540_CR33","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-019-0197-0","volume":"6","author":"C Shorten","year":"2019","unstructured":"Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):1\u201348.","journal-title":"J Big Data"},{"key":"4540_CR34","doi-asserted-by":"crossref","unstructured":"Antoniou A, Storkey A, Edwards H. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340 2017.","DOI":"10.1007\/978-3-030-01424-7_58"},{"issue":"8","key":"4540_CR35","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1109\/MC.2018.3191268","volume":"51","author":"J Isaak","year":"2018","unstructured":"Isaak J, Hanna MJ. User data privacy: Facebook, Cambridge Analytica, and privacy protection. Computer. 2018;51(8):56\u20139.","journal-title":"Computer"},{"issue":"2","key":"4540_CR36","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3436755","volume":"54","author":"B Liu","year":"2021","unstructured":"Liu B, Ding M, Shaham S, Rahayu W, Farokhi F, Lin Z. When machine learning meets privacy: a survey and outlook. ACM Comput Surv. 2021;54(2):1\u201336.","journal-title":"ACM Comput Surv"},{"key":"4540_CR37","unstructured":"Huang Y, Song Z, Li K, Aror, S. Instahide: instance-hiding schemes for private distributed learning.  Proceedings of the International Conference on Machine Learning (ICML). 
2020"},{"issue":"2","key":"4540_CR38","doi-asserted-by":"publisher","first-page":"223","DOI":"10.3390\/biomedicines10020223","volume":"10","author":"B Ahmad","year":"2022","unstructured":"Ahmad B, et al. Brain tumor classification using a combination of variational autoencoders and generative adversarial networks. Biomedicines. 2022;10(2):223.","journal-title":"Biomedicines"},{"issue":"1","key":"4540_CR39","doi-asserted-by":"publisher","first-page":"15","DOI":"10.1007\/s42421-021-00035-2","volume":"3","author":"Z Islam","year":"2021","unstructured":"Islam Z, Abdel-Aty M. Sensor-based transportation mode recognition using variational autoencoder. J Big Data Anal Transport. 2021;3(1):15\u201326.","journal-title":"J Big Data Anal Transport"},{"key":"4540_CR40","unstructured":"Kim Y et al. Semi-amortized variational autoencoders.  International Conference on Machine Learning. PMLR. 2018."},{"key":"4540_CR41","unstructured":"Vardhan LVH, Kok S. Generating privacy-preserving synthetic tabular data using oblivious variational autoencoders. Proceedings of the Workshop on Economics of Privacy and Data Labor at the 37th International Conference on Machine Learning (ICML). 2020."},{"key":"4540_CR42","unstructured":"Darabi S, Elor Y. Synthesizing multi-modal minority samples for tabular data. arXiv preprint arXiv:2105.08204. 2021."},{"key":"4540_CR43","unstructured":"Borisov V, et al. Deep neural networks and tabular data: a survey. IEEE Trans Neural Netw Learn Syst. 2022."},{"key":"4540_CR44","unstructured":"Choi E. et al. Generating multi-label discrete patient records using generative adversarial networks.  Machine Learning for Healthcare Conference. PMLR. 2017."},{"key":"4540_CR45","unstructured":"Youngmin P. et al. Data synthesis based on generative adversarial networks. arXiv preprint arXiv:1806.03384, 2018."},{"key":"4540_CR46","unstructured":"Kerber R. Chimerge: discretization of numeric attributes. 
Proceedings of the Tenth National Conference on Artificial Intelligence. 1992"}],"container-title":["SN Computer Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42979-025-04540-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s42979-025-04540-x","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s42979-025-04540-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T16:15:27Z","timestamp":1781626527000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s42979-025-04540-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,7]]},"references-count":46,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,1]]}},"alternative-id":["4540"],"URL":"https:\/\/doi.org\/10.1007\/s42979-025-04540-x","relation":{},"ISSN":["2661-8907"],"issn-type":[{"value":"2661-8907","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,7]]},"assertion":[{"value":"8 December 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 November 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 January 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not 
applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Research Involving Human and\/or Animals"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Informed Consent"}}],"article-number":"59"}}