{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,12]],"date-time":"2026-04-12T00:16:09Z","timestamp":1775952969252,"version":"3.50.1"},"reference-count":37,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,11,29]],"date-time":"2021-11-29T00:00:00Z","timestamp":1638144000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,11,29]],"date-time":"2021-11-29T00:00:00Z","timestamp":1638144000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"LOEWE","award":["Diffusible Signals"],"award-info":[{"award-number":["Diffusible Signals"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BioData Mining"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Clinical data sets have very special properties and suffer from many caveats in machine learning. They typically show a high-class imbalance, have a small number of samples and a large number of parameters, and have missing values. While feature selection approaches and imputation techniques address the former problems, the class imbalance is typically addressed using augmentation techniques. However, these techniques have been developed for big data analytics, and their suitability for clinical data sets is unclear.<\/jats:p><jats:p>This study analyzed different augmentation techniques for use in clinical data sets and subsequent employment of machine learning-based classification. It turns out that Gaussian Noise Up-Sampling (GNUS) is not always but generally, is as good as SMOTE and ADASYN and even outperform those on some datasets. 
However, it has also been shown that augmentation does not improve classification at all in some cases.<\/jats:p>","DOI":"10.1186\/s13040-021-00283-6","type":"journal-article","created":{"date-parts":[[2021,11,29]],"date-time":"2021-11-29T07:02:32Z","timestamp":1638169352000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":71,"title":["Gaussian noise up-sampling is better suited than SMOTE and ADASYN for clinical decision making"],"prefix":"10.1186","volume":"14","author":[{"given":"Jacqueline","family":"Beinecke","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3108-8311","authenticated-orcid":false,"given":"Dominik","family":"Heider","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,11,29]]},"reference":[{"issue":"1","key":"283_CR1","doi-asserted-by":"publisher","first-page":"110","DOI":"10.1016\/j.canlet.2016.05.033","volume":"382","author":"J-E Bibault","year":"2016","unstructured":"Bibault J-E, Giraud P, Burgun A. Big data and machine learning in radiation oncology: state of the art and future prospects. Cancer Lett. 2016;382(1):110\u20137. https:\/\/doi.org\/10.1016\/j.canlet.2016.05.033.","journal-title":"Cancer Lett"},{"key":"283_CR2","doi-asserted-by":"publisher","first-page":"170","DOI":"10.1016\/j.media.2016.06.037","volume":"33","author":"A Madabhushi","year":"2016","unstructured":"Madabhushi A, Lee G. Image analysis and machine learning in digital pathology: challenges and opportunities. Med Image Anal. 2016;33:170\u20135. https:\/\/doi.org\/10.1016\/j.media.2016.06.037.","journal-title":"Med Image Anal"},{"key":"283_CR3","doi-asserted-by":"publisher","first-page":"203","DOI":"10.1007\/s10549-016-4035-1","volume":"161","author":"A Yala","year":"2017","unstructured":"Yala A, Barzilay R, Salama L, Griffin M, Sollender G, Bardia A, et al. Using machine learning to parse breast pathology reports. 
Breast Cancer Research Treat. 2017;161:203\u201311. https:\/\/doi.org\/10.1007\/s10549-016-4035-1.","journal-title":"Breast Cancer Research Treat"},{"key":"283_CR4","doi-asserted-by":"publisher","first-page":"1559","DOI":"10.1038\/s41591-018-0177-5","volume":"24","author":"N Coudray","year":"2018","unstructured":"Coudray N, Ocampo PS, Sakellaropoulos T, Narula N, Snuderl M, Feny\u00f6 D, et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat Med. 2018;24:1559\u201367. https:\/\/doi.org\/10.1038\/s41591-018-0177-5.","journal-title":"Nat Med"},{"issue":"1","key":"283_CR5","doi-asserted-by":"publisher","first-page":"109","DOI":"10.1186\/s12859-018-2090-9","volume":"19","author":"P Chen","year":"2018","unstructured":"Chen P, Pan C. Diabetes classification model based on boosting algorithms. BMC Bioinformatics. 2018;19(1):109. https:\/\/doi.org\/10.1186\/s12859-018-2090-9.","journal-title":"BMC Bioinformatics"},{"key":"283_CR6","doi-asserted-by":"publisher","first-page":"101706","DOI":"10.1016\/j.artmed.2019.101706","volume":"100","author":"S Sp\u00e4nig","year":"2019","unstructured":"Sp\u00e4nig S, Emberger-Klein A, Sowa J-P, Canbay A, Menrad K, Heider D. The virtual doctor: an interactive clinical-decision-support system based on deep learning for non-invasive prediction of diabetes. Artif Intell Med. 2019;100:101706. https:\/\/doi.org\/10.1016\/j.artmed.2019.101706.","journal-title":"Artif Intell Med"},{"issue":"6","key":"283_CR7","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1038\/nrg3920","volume":"16","author":"MW Libbrecht","year":"2015","unstructured":"Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet. 2015;16(6):321\u201332. 
https:\/\/doi.org\/10.1038\/nrg3920.","journal-title":"Nat Rev Genet"},{"issue":"10","key":"283_CR8","doi-asserted-by":"publisher","first-page":"790","DOI":"10.1038\/nrmicro1477","volume":"4","author":"T Lengauer","year":"2006","unstructured":"Lengauer T, Sing T. Bioinformatics-assisted anti-HIV therapy. Nat Rev Microb. 2006;4(10):790\u20137. https:\/\/doi.org\/10.1038\/nrmicro1477.","journal-title":"Nat Rev Microb"},{"key":"283_CR9","doi-asserted-by":"publisher","unstructured":"Heider D, Dybowski JN, Wilms C, Hoffmann D. A simple structure-based model for the prediction of HIV-1 co-receptor tropism. BioData Min. 2014;7. https:\/\/doi.org\/10.1186\/1756-0381-7-14.","DOI":"10.1186\/1756-0381-7-14"},{"issue":"1","key":"283_CR10","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1186\/s13040-019-0196-x","volume":"12","author":"S Sp\u00e4nig","year":"2019","unstructured":"Sp\u00e4nig S, Heider D. Encodings and models for antimicrobial peptide classification for multi-resistant pathogens. BioData Min. 2019;12(1):7. https:\/\/doi.org\/10.1186\/s13040-019-0196-x.","journal-title":"BioData Min"},{"issue":"14","key":"283_CR11","doi-asserted-by":"publisher","first-page":"2458","DOI":"10.1093\/bioinformatics\/bty984","volume":"35","author":"J Schwarz","year":"2019","unstructured":"Schwarz J, Heider D. Guess: projecting machine learning scores to well-calibrated probability estimates for clinical decision making. Bioinformatics. 2019;35(14):2458\u201365. https:\/\/doi.org\/10.1093\/bioinformatics\/bty984.","journal-title":"Bioinformatics"},{"issue":"1","key":"283_CR12","doi-asserted-by":"publisher","first-page":"36","DOI":"10.1186\/s13040-016-0114-4","volume":"9","author":"U Neumann","year":"2016","unstructured":"Neumann U, Riemenschneider M, Sowa J-P, Baars T, K\u00e4lsch J, Canbay A, et al. Compensation of feature selection biases accompanied with improved predictive performance for binary classification by using a novel ensemble feature selection approach. 
BioData Min. 2016;9(1):36. https:\/\/doi.org\/10.1186\/s13040-016-0114-4.","journal-title":"BioData Min"},{"issue":"1","key":"283_CR13","doi-asserted-by":"publisher","first-page":"112","DOI":"10.1093\/bioinformatics\/btr597.1105.0828","volume":"28","author":"DJ Stekhoven","year":"2012","unstructured":"Stekhoven DJ, B\u00fchlmann P. Missforest - nonparametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112\u20138. https:\/\/doi.org\/10.1093\/bioinformatics\/btr597.1105.0828.","journal-title":"Bioinformatics"},{"issue":"4","key":"283_CR14","doi-asserted-by":"publisher","first-page":"221","DOI":"10.1007\/s13748-016-0094-0","volume":"5","author":"B Krawczyk","year":"2016","unstructured":"Krawczyk B. Learning from imbalanced data: open challenges and future directions. Prog Artif Intell. 2016;5(4):221\u201332. https:\/\/doi.org\/10.1007\/s13748-016-0094-0.","journal-title":"Prog Artif Intell"},{"key":"283_CR15","unstructured":"Dua D, Graff C. UCI machine learning repository. 2017. http:\/\/archive.ics.uci.edu\/ml. Accessed 1 Feb 2021."},{"issue":"23","key":"283_CR16","doi-asserted-by":"publisher","first-page":"9193","DOI":"10.1073\/pnas.87.23.9193","volume":"87","author":"WH Wolberg","year":"1990","unstructured":"Wolberg WH, Mangasarian OL. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci U S A. 1990;87(23):9193\u20136. https:\/\/doi.org\/10.1073\/pnas.87.23.9193.","journal-title":"Proc Natl Acad Sci U S A"},{"key":"283_CR17","unstructured":"Haberman SJ. Generalized Residuals for Log-linear Models. In: Proceedings of the 9th International Biometrics Conference. Boston; 1976. p. 104\u201322."},{"key":"283_CR18","unstructured":"Kelwin F, J.F. Jaime S. Cardoso: transfer learning with partial observability applied to cervical Cancer screening. In: Iberian Conference on Pattern Recognition and Image Analysis. 
Faro: Springer; 2017."},{"issue":"10","key":"283_CR19","doi-asserted-by":"publisher","first-page":"3120","DOI":"10.1166\/asl.2016.7980","volume":"22","author":"MR Sobar","year":"2016","unstructured":"Sobar MR, Wijaya A. Behavior determinant based cervical cancer early detection with machine learning algorithm. Adv Sci Lett. 2016;22(10):3120\u20133. https:\/\/doi.org\/10.1166\/asl.2016.7980.","journal-title":"Adv Sci Lett"},{"issue":"16","key":"283_CR20","doi-asserted-by":"publisher","first-page":"12564","DOI":"10.1016\/j.eswa.2012.05.028","volume":"39","author":"D Gil","year":"2012","unstructured":"Gil D, Girela JL, Juan JD, Gomez-Torres MJ, Johnsson M. Predicting seminal quality with artificial intelligence methods. Expert Syst Appl. 2012;39(16):12564\u201373. https:\/\/doi.org\/10.1016\/j.eswa.2012.05.028.","journal-title":"Expert Syst Appl"},{"key":"283_CR21","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-10442-9","volume-title":"Personality traits and drug consumption","author":"E Fehrman","year":"2019","unstructured":"Fehrman E, Egan V, Gorban AN, Levesley J, Mirkes EM, Muhammad AK. Personality traits and drug consumption: Springer; 2019. https:\/\/doi.org\/10.1007\/978-3-030-10442-9."},{"issue":"2","key":"283_CR22","doi-asserted-by":"publisher","first-page":"236","DOI":"10.1016\/j.jhep.2013.03.016","volume":"59","author":"R Lichtinghagen","year":"2013","unstructured":"Lichtinghagen R, Pietsch D, Bantel H, Manns MP, Brand K, Bahr MJ. The enhanced liver fibrosis (elf) score: normal values, influence factors and proposed cut-off values. J Hepatol. 2013;59(2):236\u201342. https:\/\/doi.org\/10.1016\/j.jhep.2013.03.016.","journal-title":"J Hepatol"},{"issue":"7","key":"283_CR23","doi-asserted-by":"publisher","first-page":"101444","DOI":"10.1371\/journal.pone.0101444","volume":"9","author":"J-P Sowa","year":"2013","unstructured":"Sowa J-P, Atmaca O, Kahraman A, Schlattjan M, Lindner M, Sydor S, et al. 
Non-invasive separation of alcoholic and non-alcoholic liver disease with predictive modeling. PLoS One. 2013;9(7):101444. https:\/\/doi.org\/10.1371\/journal.pone.0101444.","journal-title":"PLoS One"},{"key":"283_CR24","doi-asserted-by":"publisher","first-page":"311","DOI":"10.3109\/14767050009053454","volume":"5","author":"D Ayres de Campos","year":"2000","unstructured":"Ayres de Campos D, Bernardes J, Garrido A, Marques-de-sa J, Pereira-leite L. Sisporto 2.0: A program for automated analysis of cardiotocograms. J Matern Fetal Med. 2000;5:311\u20138. https:\/\/doi.org\/10.3109\/14767050009053454.","journal-title":"J Matern Fetal Med"},{"key":"283_CR25","doi-asserted-by":"publisher","unstructured":"Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002:16, 321\u2013357. https:\/\/doi.org\/10.1613\/jair.953.","DOI":"10.1613\/jair.953"},{"key":"283_CR26","doi-asserted-by":"publisher","first-page":"1322","DOI":"10.1109\/IJCNN.2008.4633969","volume-title":"Proceedings of the International Joint Conference on Neural Networks, IJCNN 2008, Part of the IEEE World Congress on Computational Intelligence, WCCI 2008, Hong Kong, China, June 1\u20136, 2008","author":"H He","year":"2008","unstructured":"He H, Bai Y, Garcia EA, Li S. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the International Joint Conference on Neural Networks, IJCNN 2008, Part of the IEEE World Congress on Computational Intelligence, WCCI 2008, Hong Kong, China, June 1\u20136, 2008: IEEE; 2008. p. 1322\u20138. https:\/\/doi.org\/10.1109\/IJCNN.2008.4633969."},{"issue":"2","key":"283_CR27","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2907070","volume":"49","author":"P Branco","year":"2016","unstructured":"Branco P, Torgo L, Ribeiro RP. A survey of predictive modeling on imbalanced domains. ACM Comput Surv. 2016;49(2):1\u201350. 
https:\/\/doi.org\/10.1145\/2907070.","journal-title":"ACM Comput Surv"},{"issue":"9","key":"283_CR28","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v011.i09","volume":"11","author":"A Karatzoglou","year":"2004","unstructured":"Karatzoglou A, Smola A, Hornik K, Zeileis A. kernlab an S4 package for kernel methods in R. J Stat Softw. 2004;11(9):1\u201320.","journal-title":"J Stat Softw"},{"issue":"3","key":"283_CR29","doi-asserted-by":"publisher","first-page":"503","DOI":"10.1093\/biomet\/76.3.503","volume":"76","author":"P Burman","year":"1989","unstructured":"Burman P. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika. 1989;76(3):503\u201314. https:\/\/doi.org\/10.1093\/biomet\/76.3.503.","journal-title":"Biometrika"},{"issue":"20","key":"283_CR30","doi-asserted-by":"publisher","first-page":"3940","DOI":"10.1093\/bioinformatics\/bti623","volume":"21","author":"T Sing","year":"2005","unstructured":"Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinform. 2005;21(20):3940\u20131. https:\/\/doi.org\/10.1093\/bioinformatics\/bti623.","journal-title":"Bioinform"},{"issue":"2","key":"283_CR31","doi-asserted-by":"publisher","first-page":"112","DOI":"10.1002\/cem.858","volume":"18","author":"YD Qing-Song Xu","year":"2004","unstructured":"Qing-Song Xu YD. Yi-Zeng Liang: Monte Carlo cross-validation for selecting a model and estimating the prediction error in multivariate calibration. J Chemom. 2004;18(2):112\u201320. https:\/\/doi.org\/10.1002\/cem.858.","journal-title":"J Chemom"},{"issue":"1","key":"283_CR32","doi-asserted-by":"publisher","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","volume":"57","author":"Y Benjamini","year":"1995","unstructured":"Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57(1):289\u2013300. 
https:\/\/doi.org\/10.1111\/j.2517-6161.1995.tb02031.x.","journal-title":"J R Stat Soc Ser B"},{"issue":"1","key":"283_CR33","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1186\/s12864-019-6413-7","volume":"21","author":"D Chicco","year":"2020","unstructured":"Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. https:\/\/doi.org\/10.1186\/s12864-019-6413-7.","journal-title":"BMC Genomics"},{"key":"283_CR34","unstructured":"Taneja S, Suri B, Kothari C. Application of Balancing Techniques with Ensemble Approach for Credit Card Fraud Detection. In: International Conference on Computing, Power and Communication Technologies (GUCON), New Delhi, India; 2019. p. 753\u20138."},{"issue":"4","key":"283_CR35","doi-asserted-by":"publisher","first-page":"275","DOI":"10.3390\/educsci9040275","volume":"9","author":"TM Barros","year":"2019","unstructured":"Barros TM, Souza Neto PA, Silva I, Guedes LA. Predictive models for imbalanced data: a school dropout perspective. Educ Sci. 2019;9(4):275\u201392. https:\/\/doi.org\/10.3390\/educsci9040275.","journal-title":"Educ Sci"},{"issue":"9","key":"283_CR36","doi-asserted-by":"publisher","first-page":"3307","DOI":"10.3390\/app10093307","volume":"10","author":"K Davagdorj","year":"2020","unstructured":"Davagdorj K, Lee JS, Pham VH, Ryu KH. A comparative analysis of machine learning methods for class imbalance in a smoking cessation intervention. Appl Sci. 2020;10(9):3307\u201327. https:\/\/doi.org\/10.3390\/app10093307.","journal-title":"Appl Sci"},{"key":"283_CR37","doi-asserted-by":"publisher","first-page":"7940","DOI":"10.1109\/ACCESS.2016.2619719","volume":"4","author":"A Amin","year":"2016","unstructured":"Amin A, Anwar S, Adnan A, Nawaz M, Howard N, Qadir J, et al. Comparing oversampling techniques to handle the class imbalance problem: a customer churn prediction case study. IEEE Access. 
2016;4:7940\u201357. https:\/\/doi.org\/10.1109\/ACCESS.2016.2619719.","journal-title":"IEEE Access"}],"container-title":["BioData Mining"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-021-00283-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13040-021-00283-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-021-00283-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,11,29]],"date-time":"2021-11-29T19:02:58Z","timestamp":1638212578000},"score":1,"resource":{"primary":{"URL":"https:\/\/biodatamining.biomedcentral.com\/articles\/10.1186\/s13040-021-00283-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,11,29]]},"references-count":37,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["283"],"URL":"https:\/\/doi.org\/10.1186\/s13040-021-00283-6","relation":{},"ISSN":["1756-0381"],"issn-type":[{"value":"1756-0381","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,11,29]]},"assertion":[{"value":"17 May 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 November 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 November 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to 
participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"49"}}