{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T01:36:50Z","timestamp":1773193010334,"version":"3.50.1"},"reference-count":66,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,11,30]],"date-time":"2023-11-30T00:00:00Z","timestamp":1701302400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,11,30]],"date-time":"2023-11-30T00:00:00Z","timestamp":1701302400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Machine learning (ML) methods for uncovering single nucleotide polymorphisms (SNPs) in genome-wide association study (GWAS) data that can be used to predict disease outcomes are becoming increasingly used in genetic research. Two issues with the use of ML models are finding the correct method for dealing with imbalanced data and data training. This article compares three ML models to identify SNPs that predict type 2 diabetes (T2D) status using the Support vector machine SMOTE (SVM SMOTE), The Adaptive Synthetic Sampling Approach (ADASYN), Random under sampling (RUS) on GWAS data from elderly male participants (165 cases and 951 controls) from the Uppsala Longitudinal Study of Adult Men (ULSAM). It was also applied to SNPs selected by the SMOTE, SVM SMOTE, ADASYN, and RUS clumping method. The analysis was performed using three different ML models: (i) support vector machine (SVM), (ii) multilayer perceptron (MLP) and (iii) random forests (RF). The accuracy of the case\u2013control classification was compared between these three methods. The best classification algorithm was a combination of MLP and SMOTE (97% accuracy). Both RF and SVM achieved good accuracy results of over 90%. Overall, methods used against unbalanced data, all three ML algorithms were found to improve prediction accuracy.<\/jats:p>","DOI":"10.1186\/s40537-023-00853-x","type":"journal-article","created":{"date-parts":[[2023,11,30]],"date-time":"2023-11-30T05:01:53Z","timestamp":1701320513000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["The use of class imbalanced learning methods on ULSAM data to predict the case\u2013control status in genome-wide association studies"],"prefix":"10.1186","volume":"10","author":[{"given":"R. Onur","family":"\u00d6ztornaci","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hamzah","family":"Syed","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andrew P.","family":"Morris","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bahar","family":"Ta\u015fdelen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2023,11,30]]},"reference":[{"key":"853_CR1","doi-asserted-by":"publisher","first-page":"1202","DOI":"10.1038\/ejhg.2015.269","volume":"24","author":"J Fadista","year":"2016","unstructured":"Fadista J, Manning AK, Florez JC, Groop L. The (in) famous GWAS P-value threshold revisited and updated for low-frequency variants. Eur J Hum Genet. 2016;24:1202\u20135.","journal-title":"Eur J Hum Genet"},{"key":"853_CR2","doi-asserted-by":"publisher","first-page":"S51","DOI":"10.1002\/gepi.20473","volume":"33","author":"S Szymczak","year":"2009","unstructured":"Szymczak S, Biernacka JM, Cordell HJ, Gonz\u00e1lez-Recio O, K\u00f6nig IR, Zhang H, Sun YV. Machine learning in genome-wide association studies. Genet Epidemiol. 2009;33:S51\u20137.","journal-title":"Genet Epidemiol"},{"key":"853_CR3","doi-asserted-by":"publisher","first-page":"1384","DOI":"10.1093\/bioinformatics\/btr159","volume":"27","author":"E Cosgun","year":"2011","unstructured":"Cosgun E, Limdi NA, Duarte CW. High-dimensional pharmacogenetic prediction of a continuous trait using machine learning techniques with application to warfarin dose prediction in African Americans. Bioinformatics. 2011;27:1384\u20139.","journal-title":"Bioinformatics"},{"key":"853_CR4","first-page":"281","volume":"39","author":"Y Tang","year":"2008","unstructured":"Tang Y, Zhang Y-Q, Chawla NV, Krasser S. SVMs modeling for highly imbalanced classification, IEEE transactions on systems, man, and cybernetics. Part B. 2008;39:281\u20138.","journal-title":"Part B"},{"key":"853_CR5","doi-asserted-by":"publisher","first-page":"736","DOI":"10.3390\/genes12050736","volume":"12","author":"X Dai","year":"2021","unstructured":"Dai X, Fu G, Zhao S, Zeng Y. Statistical learning methods applicable to genome-wide association studies on unbalanced case-control disease data. Genes. 2021;12:736.","journal-title":"Genes"},{"key":"853_CR6","doi-asserted-by":"publisher","first-page":"1335","DOI":"10.1038\/s41588-018-0184-y","volume":"50","author":"W Zhou","year":"2018","unstructured":"Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, LeFaive J, VandeHaar P, Gagliano SA, Gifford A. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet. 2018;50:1335\u201341.","journal-title":"Nat Genet"},{"key":"853_CR7","doi-asserted-by":"publisher","first-page":"284","DOI":"10.1016\/j.jpsychires.2021.04.014","volume":"138","author":"Z Bao","year":"2021","unstructured":"Bao Z, Zhao X, Li J, Zhang G, Wu H, Ning Y, Li MD, Yang Z. Prediction of repeated-dose intravenous ketamine response in major depressive disorder using the GWAS-based machine learning approach. J Psychiatr Res. 2021;138:284\u201390.","journal-title":"J Psychiatr Res"},{"key":"853_CR8","doi-asserted-by":"publisher","first-page":"559","DOI":"10.1086\/519795","volume":"81","author":"S Purcell","year":"2007","unstructured":"Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Human Genet. 2007;81:559\u201375.","journal-title":"Am J Human Genet"},{"key":"853_CR9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41398-021-01281-2","volume":"11","author":"S Kinreich","year":"2021","unstructured":"Kinreich S, McCutcheon VV, Aliev F, Meyers JL, Kamarajan C, Pandey AK, Chorlian DB, Zhang J, Kuang W, Pandey G. Predicting alcohol use disorder remission: a longitudinal multimodal multi-featured machine learning approach. Transl Psychiatry. 2021;11:1\u201310.","journal-title":"Transl Psychiatry"},{"key":"853_CR10","doi-asserted-by":"publisher","first-page":"412","DOI":"10.3390\/ijms18020412","volume":"18","author":"KY He","year":"2017","unstructured":"He KY, Ge D, He MM. Big data analytics for genomic medicine. Int J Mol Sci. 2017;18:412.","journal-title":"Int J Mol Sci"},{"key":"853_CR11","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1097\/YPG.0b013e32834dc40d","volume":"22","author":"M Pirooznia","year":"2012","unstructured":"Pirooznia M, Fayaz Seifuddin JJ, Mahon PB, Potash JB, Zandi PP, B.G.S. Consortium. Data mining approaches for genome-wide association of mood disorders. Psychiatr Genet. 2012;22:55.","journal-title":"Psychiatr Genet"},{"key":"853_CR12","doi-asserted-by":"publisher","first-page":"531","DOI":"10.1111\/rssb.12001","volume":"75","author":"Y Fan","year":"2013","unstructured":"Fan Y, Tang CY. Tuning parameter selection in high dimensional penalized likelihood. J Royal Stat Soc Series B. 2013;75:531\u201352.","journal-title":"J Royal Stat Soc Series B"},{"key":"853_CR13","first-page":"4237","volume-title":"Statistical challenges of high-dimensional data","author":"IM Johnstone","year":"2009","unstructured":"Johnstone IM, Titterington DM. Statistical challenges of high-dimensional data. London: The Royal Society Publishing; 2009. p. 4237\u201353."},{"key":"853_CR14","volume-title":"The elements of statistical learning: data mining, inference, and prediction, by trevor hastie, robert tibshirani, jerome friedman","author":"K Nordhausen","year":"2009","unstructured":"Nordhausen K. The elements of statistical learning: data mining, inference, and prediction, by trevor hastie, robert tibshirani, jerome friedman. New York: Wiley Online Library; 2009."},{"key":"853_CR15","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/ncomms8208","volume":"6","author":"HH Draisma","year":"2015","unstructured":"Draisma HH, Pool R, Kobl M, Jansen R, Petersen A-K, Vaarhorst AA, Yet I, Haller T, Demirkan A, Esko T. Genome-wide association study identifies novel genetic variants contributing to variation in blood metabolite levels. Nat Commun. 2015;6:1\u20139.","journal-title":"Nat Commun"},{"key":"853_CR16","first-page":"30","volume":"2","author":"H Shi","year":"2011","unstructured":"Shi H, Medway C, Brown K, Kalsheker N, Morgan K. Using Fisher\u2019s method with PLINK \u2018LD clumped\u2019output to compare SNP effects across genome-wide association study (GWAS) datasets. Int J Mol Epidemiol Genet. 2011;2:30.","journal-title":"Int J Mol Epidemiol Genet"},{"key":"853_CR17","doi-asserted-by":"publisher","first-page":"368","DOI":"10.1109\/TEVC.2012.2199119","volume":"17","author":"U Bhowan","year":"2012","unstructured":"Bhowan U, Johnston M, Zhang M, Yao X. Evolving diverse ensembles using genetic programming for classification with unbalanced data. IEEE Trans Evol Comput. 2012;17:368\u201386.","journal-title":"IEEE Trans Evol Comput"},{"key":"853_CR18","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","volume":"16","author":"NV Chawla","year":"2002","unstructured":"Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321\u201357.","journal-title":"J Artif Intell Res"},{"key":"853_CR19","doi-asserted-by":"publisher","first-page":"429","DOI":"10.3233\/IDA-2002-6504","volume":"6","author":"N Japkowicz","year":"2002","unstructured":"Japkowicz N, Stephen S. The class imbalance problem: a systematic study. Intelligent Data Anal. 2002;6:429\u201349.","journal-title":"Intelligent Data Anal"},{"key":"853_CR20","first-page":"1","volume":"14","author":"L Lusa","year":"2013","unstructured":"Lusa L. Improved shrunken centroid classifiers for high-dimensional class-imbalanced data. BMC Bioinf. 2013;14:1\u201313.","journal-title":"BMC Bioinf"},{"key":"853_CR21","doi-asserted-by":"crossref","unstructured":"Turhan S, \u00d6zkan Y, Y\u00fcrekli BS, Suner A, Do\u011fu E. S\u0131n\u0131f Dengesizli\u011fi Varl\u0131\u011f\u0131nda Hastal\u0131k Tan\u0131s\u0131 i\u00e7in Kolektif \u00d6\u011frenme Y\u00f6ntemlerinin Kar\u015f\u0131la\u015ft\u0131r\u0131lmas\u0131: Diyabet Tan\u0131s\u0131 \u00d6rne\u011fi, Turkiye Klinikleri Journal of Biostatistics. 2020; 12.","DOI":"10.5336\/biostatic.2019-66816"},{"key":"853_CR22","doi-asserted-by":"publisher","first-page":"1729569","DOI":"10.1080\/23322039.2020.1729569","volume":"8","author":"S Shrivastava","year":"2020","unstructured":"Shrivastava S, Jeyanthi PM, Singh S. Failure prediction of Indian Banks using SMOTE, Lasso regression, bagging and boosting. Cogent Econom Finance. 2020;8:1729569.","journal-title":"Cogent Econom Finance"},{"key":"853_CR23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1155\/2018\/9704672","volume":"2018","author":"J-H Seo","year":"2018","unstructured":"Seo J-H, Kim Y-H. Machine-learning approach to optimize smote ratio in class imbalance dataset for intrusion detection. Computational \u0130ntell Neurosci. 2018;2018:1.","journal-title":"Computational \u0130ntell Neurosci"},{"key":"853_CR24","doi-asserted-by":"crossref","unstructured":"Hu F, Li H, A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE, Mathematical Problems in Engineering, 2013.","DOI":"10.1155\/2013\/694809"},{"key":"853_CR25","first-page":"1017","volume":"34","author":"Z Zheng","year":"2015","unstructured":"Zheng Z, Cai Y, Li Y. Oversampling method for imbalanced classification. Computing and Informatics. 2015;34:1017\u201337.","journal-title":"Computing and Informatics"},{"key":"853_CR26","doi-asserted-by":"crossref","unstructured":"Wang Q, Luo Z, Huang J, Feng Y, Liu Z, A novel ensemble method for imbalanced data learning: bagging of extrapolation-SMOTE SVM, Computational intelligence and neuroscience, 2017 (2017).","DOI":"10.1155\/2017\/1827016"},{"key":"853_CR27","doi-asserted-by":"crossref","unstructured":"Wang H-Y, Combination approach of SMOTE and biased-SVM for imbalanced datasets, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, 2008, pp. 228\u2013231.","DOI":"10.1109\/IJCNN.2008.4633794"},{"key":"853_CR28","first-page":"1322","volume":"2008","author":"H He","year":"2008","unstructured":"He H, Bai Y, Garcia EA, Li S, ADASYN: Adaptive synthetic sampling approach for imbalanced learning,. IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE. 2008;2008:1322\u20138.","journal-title":"IEEE"},{"key":"853_CR29","doi-asserted-by":"publisher","DOI":"10.7717\/peerj-cs.523","volume":"7","author":"A Alhudhaif","year":"2021","unstructured":"Alhudhaif A. A novel multi-class imbalanced EEG signals classification based on the adaptive synthetic sampling (ADASYN) approach. PeerJ Computer Science. 2021;7: e523.","journal-title":"PeerJ Computer Science"},{"key":"853_CR30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-021-00460-8","volume":"8","author":"R Zuech","year":"2021","unstructured":"Zuech R, Hancock J, Khoshgoftaar TM. Detecting web attacks using random undersampling and ensemble learners. J Big Data. 2021;8:1\u201320.","journal-title":"J Big Data"},{"key":"853_CR31","first-page":"1988","volume":"33","author":"R Razavi-Far","year":"2019","unstructured":"Razavi-Far R, Farajzadeh-Zanajni M, Wang B, Saif M, Chakrabarti S. Imputation-based ensemble techniques for class imbalance learning. IEEE Trans Knowl Data Eng. 2019;33:1988\u20132001.","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"853_CR32","unstructured":"Han J, Pei J, Kamber M, Data mining: concepts and techniques, Elsevier 2011."},{"key":"853_CR33","doi-asserted-by":"crossref","unstructured":"Alpaydin E, Introduction to machine learning, MIT press2020.","DOI":"10.7551\/mitpress\/13811.001.0001"},{"key":"853_CR34","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman L. Random forests. Mach Learn. 2001;45:5\u201332.","journal-title":"Mach Learn"},{"key":"853_CR35","doi-asserted-by":"publisher","first-page":"323","DOI":"10.1016\/j.ygeno.2012.04.003","volume":"99","author":"X Chen","year":"2012","unstructured":"Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics. 2012;99:323\u20139.","journal-title":"Genomics"},{"key":"853_CR36","doi-asserted-by":"publisher","first-page":"217","DOI":"10.1080\/01431160412331269698","volume":"26","author":"M Pal","year":"2005","unstructured":"Pal M. Random forest classifier for remote sensing classification. Int J Remote Sens. 2005;26:217\u201322.","journal-title":"Int J Remote Sens"},{"key":"853_CR37","unstructured":"Strobl C, Zeileis A, Danger: High power!\u2013exploring the statistical properties of a test for random forest variable importance, 2008."},{"key":"853_CR38","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1023\/A:1012487302797","volume":"46","author":"I Guyon","year":"2002","unstructured":"Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46:389\u2013422.","journal-title":"Mach Learn"},{"key":"853_CR39","doi-asserted-by":"publisher","first-page":"283","DOI":"10.1002\/wics.49","volume":"1","author":"A Mammone","year":"2009","unstructured":"Mammone A, Turchi M, Cristianini N. Support vector machines. Wiley Interdiscip Rev Comput Stat. 2009;1:283\u20139.","journal-title":"Wiley Interdiscip Rev Comput Stat"},{"key":"853_CR40","doi-asserted-by":"publisher","first-page":"906","DOI":"10.1093\/bioinformatics\/16.10.906","volume":"16","author":"TS Furey","year":"2000","unstructured":"Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000;16:906\u201314.","journal-title":"Bioinformatics"},{"key":"853_CR41","unstructured":"I. Nitze, U. Schulthess, H. Asche, Comparison of machine learning algorithms random forest, artificial neural network and support vector machine to maximum likelihood for supervised crop type classification, Proceedings of the 4th GEOBIA, Rio de Janeiro, Brazil, 2012; 79: 3540."},{"key":"853_CR42","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/srep36671","volume":"6","author":"B Mieth","year":"2016","unstructured":"Mieth B, Kloft M, Rodr\u00edguez JA, Sonnenburg S, Vobruba R, Morcillo-Su\u00e1rez C, Farr\u00e9 X, Marigorta UM, Fehr E, Dickhaus T. Combining multiple hypothesis testing with machine learning increases the statistical power of genome-wide association studies. Sci Rep. 2016;6:1\u201314.","journal-title":"Sci Rep"},{"key":"853_CR43","doi-asserted-by":"publisher","first-page":"1321","DOI":"10.1093\/bioinformatics\/btm026","volume":"23","author":"KLS Ng","year":"2007","unstructured":"Ng KLS, Mishra SK. De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics. 2007;23:1321\u201330.","journal-title":"Bioinformatics"},{"key":"853_CR44","doi-asserted-by":"publisher","first-page":"631","DOI":"10.1093\/bioinformatics\/bti033","volume":"21","author":"A Statnikov","year":"2005","unstructured":"Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005;21:631\u201343.","journal-title":"Bioinformatics"},{"key":"853_CR45","first-page":"4624","volume":"10","author":"F Deng","year":"2020","unstructured":"Deng F, Shen L, Wang H, Zhang L. Classify multicategory outcome in patients with lung adenocarcinoma using clinical, transcriptomic and clinico-transcriptomic data: machine learning versus multinomial models. Am J Cancer Res. 2020;10:4624.","journal-title":"Am J Cancer Res"},{"key":"853_CR46","doi-asserted-by":"publisher","DOI":"10.1109\/72.159058","author":"SK Pal","year":"1992","unstructured":"Pal SK, Mitra S. Multilayer perceptron, fuzzy sets, and classification. IEEE Trans Neural Netw. 1992. https:\/\/doi.org\/10.1109\/72.159058.","journal-title":"IEEE Trans Neural Netw"},{"key":"853_CR47","doi-asserted-by":"crossref","first-page":"668","DOI":"10.1109\/TCBB.2018.2868667","volume":"17","author":"P Fergus","year":"2018","unstructured":"Fergus P, Montanez CC, Abdulaimma B, Lisboa P, Chalmers C, Pineles B. Utilizing deep learning and genome wide association studies for epistatic-driven preterm birth classification in African-American women. IEEE\/ACM Trans Comput Biol Bioinf. 2018;17:668\u201378.","journal-title":"IEEE\/ACM Trans Comput Biol Bioinf"},{"key":"853_CR48","unstructured":"\u00c7. Elmas, Y.Z. Uygulamalar\u0131, Yapay Sinir A\u011flar\u0131, Bulan\u0131k Mant\u0131k, Genetik Algoritmalar, 1, Bas\u0131m, Ankara: Se\u00e7kin Yay\u0131nc\u0131l\u0131k, (2007)."},{"issue":"2011","key":"853_CR49","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12(2011):2825\u201330.","journal-title":"J Mach Learn Res"},{"key":"853_CR50","doi-asserted-by":"publisher","first-page":"854","DOI":"10.1038\/ejhg.2017.78","volume":"25","author":"JR Staley","year":"2017","unstructured":"Staley JR, Jones E, Kaptoge S, Butterworth AS, Sweeting MJ, Wood AM, Howson JM. A comparison of Cox and logistic regression for use in genome-wide association studies of cohort and case-cohort design. Eur J Hum Genet. 2017;25:854\u201362.","journal-title":"Eur J Hum Genet"},{"key":"853_CR51","doi-asserted-by":"publisher","first-page":"79","DOI":"10.1002\/gepi.20359","volume":"33","author":"J Wakefield","year":"2009","unstructured":"Wakefield J. Bayes factors for genome-wide association studies: comparison with P-values. Genet Epidemiol. 2009;33:79\u201386.","journal-title":"Genet Epidemiol"},{"key":"853_CR52","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12864-019-6413-7","volume":"21","author":"D Chicco","year":"2020","unstructured":"Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020;21:1\u201313.","journal-title":"BMC Genom"},{"key":"853_CR53","doi-asserted-by":"publisher","first-page":"4180","DOI":"10.1021\/acs.jcim.9b01162","volume":"60","author":"S Korkmaz","year":"2020","unstructured":"Korkmaz S. Deep learning-based imbalanced data classification for drug discovery. J Chem Inf Model. 2020;60:4180\u201390.","journal-title":"J Chem Inf Model"},{"key":"853_CR54","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1517\/03009734000000060","volume":"105","author":"H Lithell","year":"2000","unstructured":"Lithell H, Sundstr\u00f6m J, \u00c4rnl\u00f6v J, Bj\u00f6rklund K, H\u00e4nni A, Hedman A, Zethelius B, Byberg L, Kilander L, Reneland R. Epidemiological and clinical studies on insulin resistance and diabetes. Upsala J Med Sci. 2000;105:135\u201350.","journal-title":"Upsala J Med Sci"},{"key":"853_CR55","unstructured":"N. Lavesson, P. Davidsson, Quantifying the impact of learning algorithm parameter tuning, AAAI, 2006, pp. 395\u2013400."},{"key":"853_CR56","unstructured":"G. Van Rossum, Python Programming language, USENIX annual technical conference, 2007, pp. 1\u201336."},{"key":"853_CR57","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12859-019-3158-x","volume":"20","author":"J De Velasco Oriol","year":"2019","unstructured":"De Velasco Oriol J, Vallejo EE, Estrada K, Tam\u00e9z Pe\u00f1a JG, Initiative DN. Benchmarking machine learning models for late-onset alzheimer\u2019s disease prediction from genomic data. BMC Bioinf. 2019;20:1\u201317.","journal-title":"BMC Bioinf"},{"key":"853_CR58","doi-asserted-by":"publisher","first-page":"1213","DOI":"10.1016\/j.ajhg.2019.11.001","volume":"105","author":"F Priv\u00e9","year":"2019","unstructured":"Priv\u00e9 F, Vilhj\u00e1lmsson BJ, Aschard H, Blum MG. Making the most of clumping and thresholding for polygenic scores. Am J Human Genet. 2019;105:1213\u201321.","journal-title":"Am J Human Genet"},{"key":"853_CR59","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41598-017-03011-5","volume":"7","author":"M Schubach","year":"2017","unstructured":"Schubach M, Re M, Robinson PN, Valentini G. Imbalance-aware machine learning for predicting rare and common disease-associated non-coding variants. Sci Rep. 2017;7:1\u201312.","journal-title":"Sci Rep"},{"key":"853_CR60","doi-asserted-by":"publisher","first-page":"1102","DOI":"10.1166\/jmihi.2016.1807","volume":"6","author":"J Li","year":"2016","unstructured":"Li J, Fong S, Mohammed S, Fiaidhi J, Chen Q, Tan Z. Solving the under-fitting problem for decision tree algorithms by incremental swarm optimization in rare-event healthcare classification. J Med Imaging Health Inf. 2016;6:1102\u201310.","journal-title":"J Med Imaging Health Inf"},{"key":"853_CR61","doi-asserted-by":"publisher","first-page":"120","DOI":"10.1016\/j.ijmedinf.2016.09.014","volume":"97","author":"T Zheng","year":"2017","unstructured":"Zheng T, Xie W, Xu L, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Informatics. 2017;97:120\u20137.","journal-title":"Int J Med Informatics"},{"key":"853_CR62","doi-asserted-by":"publisher","DOI":"10.1038\/nbt.4235","author":"R Poplin","year":"2018","unstructured":"Poplin R, Chang P-C, Alexander D, Schwartz S, Colthurst T, Ku A, Newburger D, Dijamco J, Nguyen N, Afshar PT. Creating a universal SNP and small indel variant caller with deep neural networks. Biorxiv. 2018. https:\/\/doi.org\/10.1038\/nbt.4235.","journal-title":"Biorxiv"},{"issue":"1","key":"853_CR63","doi-asserted-by":"publisher","first-page":"36","DOI":"10.1186\/s12911-022-01775-z","volume":"22","author":"S Sadeghi","year":"2022","unstructured":"Sadeghi S, Khalili D, Ramezankhani A, Mansournia MA, Parsaeian M. Diabetes mellitus risk prediction in the presence of class imbalance using flexible machine learning methods. BMC Med Inform Decis Mak. 2022;22(1):36.","journal-title":"BMC Med Inform Decis Mak"},{"key":"853_CR64","volume":"9","author":"M Temraz","year":"2022","unstructured":"Temraz M, Keane MT. Solving the class imbalance problem using a counterfactual method for data augmentation. Mach Learn Appl. 2022;9: 100375.","journal-title":"Mach Learn Appl"},{"issue":"18","key":"853_CR65","doi-asserted-by":"publisher","first-page":"459","DOI":"10.1007\/s12665-022-10578-4","volume":"81","author":"S Demir","year":"2022","unstructured":"Demir S, \u015eahin EK. Liquefaction prediction with robust machine learning algorithms (SVM, RF, and XGBoost) supported by genetic algorithm-based feature selection and parameter optimization from the perspective of data processing. Environ Earth Sci. 2022;81(18):459.","journal-title":"Environ Earth Sci"},{"key":"853_CR66","doi-asserted-by":"publisher","first-page":"30","DOI":"10.1186\/1472-6947-13-30","volume":"13","author":"Z Afzal","year":"2013","unstructured":"Afzal Z, Schuemie MJ, van Blijderveen JC, Sen EF, Sturkenboom MC, Kors JA. Improving sensitivity of machine learning methods for automated case identification from free-text electronic medical records. BMC Med Inform Decis Mak. 2013;13:30.","journal-title":"BMC Med Inform Decis Mak"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-023-00853-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-023-00853-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-023-00853-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,4]],"date-time":"2024-11-04T07:14:22Z","timestamp":1730704462000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-023-00853-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,30]]},"references-count":66,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["853"],"URL":"https:\/\/doi.org\/10.1186\/s40537-023-00853-x","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2023.01.05.522884","asserted-by":"object"}]},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11,30]]},"assertion":[{"value":"27 March 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"21 November 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 November 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"174"}}