{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,16]],"date-time":"2026-07-16T12:00:22Z","timestamp":1784203222631,"version":"3.55.0"},"reference-count":31,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2023,4,25]],"date-time":"2023-04-25T00:00:00Z","timestamp":1682380800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,4,25]],"date-time":"2023-04-25T00:00:00Z","timestamp":1682380800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000025","name":"National Institute of Mental Health","doi-asserted-by":"publisher","award":["R01MH121394"],"award-info":[{"award-number":["R01MH121394"]}],"id":[{"id":"10.13039\/100000025","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000025","name":"National Institute of Mental Health","doi-asserted-by":"publisher","award":["R01MH121394"],"award-info":[{"award-number":["R01MH121394"]}],"id":[{"id":"10.13039\/100000025","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000025","name":"National Institute of Mental Health","doi-asserted-by":"publisher","award":["R01MH121394"],"award-info":[{"award-number":["R01MH121394"]}],"id":[{"id":"10.13039\/100000025","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BioData Mining"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>In many healthcare applications, datasets for classification may be highly imbalanced due to the rare occurrence of target events such as disease onset. 
The SMOTE (Synthetic Minority Over-sampling Technique) algorithm has been developed as an effective resampling method for imbalanced data classification by oversampling samples from the minority class. However, samples generated by SMOTE may be ambiguous, low-quality, and not separable from the majority class. To enhance the quality of generated samples, we propose a novel self-inspected adaptive SMOTE (SASMOTE) model that leverages an adaptive nearest neighborhood selection algorithm to identify the \u201cvisible\u201d nearest neighbors, which are used to generate samples likely to fall into the minority class. To further enhance the quality of the generated samples, an uncertainty elimination step based on self-inspection is introduced in the proposed SASMOTE model. Its objective is to filter out the generated samples that are highly uncertain and inseparable from the majority class. The effectiveness of the proposed algorithm is compared with existing SMOTE-based algorithms and demonstrated through two real-world case studies in healthcare, including risk gene discovery and fatal congenital heart disease prediction. 
By generating the higher quality synthetic samples, the proposed algorithm is able to help achieve better prediction performance (in terms of F1 score) on average compared to the other methods, which is promising to enhance the usability of machine learning models on highly imbalanced healthcare data.<\/jats:p>","DOI":"10.1186\/s13040-023-00330-4","type":"journal-article","created":{"date-parts":[[2023,4,25]],"date-time":"2023-04-25T10:03:55Z","timestamp":1682417035000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":77,"title":["A self-inspected adaptive SMOTE algorithm (SASMOTE) for highly imbalanced data classification in healthcare"],"prefix":"10.1186","volume":"16","author":[{"given":"Tanapol","family":"Kosolwattana","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chenang","family":"Liu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Renjie","family":"Hu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Shizhong","family":"Han","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Hua","family":"Chen","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ying","family":"Lin","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2023,4,25]]},"reference":[{"issue":"04","key":"330_CR1","doi-asserted-by":"publisher","first-page":"687","DOI":"10.1142\/S0218001409007326","volume":"23","author":"Y Sun","year":"2009","unstructured":"Sun Y, Wong AK, Kamel MS. Classification of imbalanced data: A review. Int J Pattern Recognit Artif Intell. 
2009;23(04):687\u2013719.","journal-title":"Int J Pattern Recognit Artif Intell."},{"key":"330_CR2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1155\/2018\/6275435","volume":"2018","author":"Y Zhao","year":"2018","unstructured":"Zhao Y, Wong ZSY, Tsui KL. A framework of rebalancing imbalanced healthcare data for rare events\u2019 classification: a case of look-alike sound-alike mix-up incident detection. J Healthc Eng. 2018;2018:1\u201311. https:\/\/doi.org\/10.1155\/2018\/6275435.","journal-title":"J Healthc Eng"},{"issue":"1","key":"330_CR3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1756-0381-6-16","volume":"6","author":"M Nakamura","year":"2013","unstructured":"Nakamura M, Kajiwara Y, Otsuka A, Kimura H. Lvq-smote-learning vector quantization based synthetic minority over-sampling technique for biomedical data. BioData Min. 2013;6(1):1\u201310.","journal-title":"BioData Min."},{"issue":"1","key":"330_CR4","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13040-016-0117-1","volume":"9","author":"J Li","year":"2016","unstructured":"Li J, Fong S, Sung Y, Cho K, Wong R, Wong KK. Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification. BioData Min. 2016;9(1):1\u201315.","journal-title":"BioData Min."},{"key":"330_CR5","doi-asserted-by":"publisher","DOI":"10.3389\/fgene.2020.500064","volume":"11","author":"Y Lin","year":"2020","unstructured":"Lin Y, Afshar S, Rajadhyaksha AM, Potash JB, Han S. A machine learning approach to predicting autism risk genes: Validation of known genes and discovery of new candidates. Front Genet. 2020;11: 500064.","journal-title":"Front Genet."},{"key":"330_CR6","doi-asserted-by":"publisher","unstructured":"Li Y, Shi Z, Liu C, Tian W, Kong Z, Williams CB. 
Augmented Time Regularized Generative Adversarial Network (ATR-GAN) for Data Augmentation in Online Process Anomaly Detection. IEEE Trans Autom Sci Eng. 2021:1\u201318. https:\/\/doi.org\/10.1109\/TASE.2021.3118635.","DOI":"10.1109\/TASE.2021.3118635"},{"key":"330_CR7","unstructured":"Weiss GM, McCarthy K, Zabar B. Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs? In: Stahlbock R, Crone SF, Lessmann S, editors. Proceedings of the 2007 International Conference on Data Mining, DMIN 2007. Las Vegas: CSREA Press; 2007. p. 35\u201341."},{"issue":"1","key":"330_CR8","doi-asserted-by":"publisher","first-page":"137","DOI":"10.1186\/1471-2288-14-137","volume":"14","author":"T van der Ploeg","year":"2014","unstructured":"van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol. 2014;14(1):137.","journal-title":"BMC Med Res Methodol."},{"key":"330_CR9","doi-asserted-by":"publisher","unstructured":"Bellinger C, Sharma S, Japkowicz N. One-Class versus Binary Classification: Which and When? In: 2012 11th International Conference on Machine Learning and Applications, vol\u00a02. 2012. p. 102\u2013106. https:\/\/doi.org\/10.1109\/ICMLA.2012.212.","DOI":"10.1109\/ICMLA.2012.212"},{"key":"330_CR10","doi-asserted-by":"publisher","first-page":"117","DOI":"10.1016\/j.aca.2013.10.050","volume":"806","author":"M Hao","year":"2014","unstructured":"Hao M, Wang Y, Bryant SH. An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data. Anal Chim Acta. 2014;806:117\u201327.","journal-title":"Anal Chim Acta."},{"key":"330_CR11","doi-asserted-by":"publisher","unstructured":"Salzberg SL. C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Mach Learn. 1994;16(3):235\u2013240. 
https:\/\/doi.org\/10.1007\/BF00993309.","DOI":"10.1007\/BF00993309"},{"key":"330_CR12","doi-asserted-by":"publisher","unstructured":"Estabrooks A, Jo T, Japkowicz N. A Multiple Resampling Method for Learning from Imbalanced Data Sets. Comput Intell. 2004;20(1):18\u201336. https:\/\/doi.org\/10.1111\/j.0824-7935.2004.t01-1-00228.x.","DOI":"10.1111\/j.0824-7935.2004.t01-1-00228.x"},{"key":"330_CR13","unstructured":"Branco P, Torgo L, Ribeiro RP. A Survey of Predictive Modelling under Imbalanced Distributions. CoRR. 2015. arXiv:abs\/1505.01658. 1505.01658."},{"key":"330_CR14","doi-asserted-by":"publisher","unstructured":"Chawla NV, Japkowicz N, Kotcz A. Editorial: Special Issue on Learning from Imbalanced Data Sets. SIGKDD Explor Newsl. 2004;6(1):1\u20136. https:\/\/doi.org\/10.1145\/1007730.1007733.","DOI":"10.1145\/1007730.1007733"},{"key":"330_CR15","doi-asserted-by":"crossref","unstructured":"Dubey R, Zhou J, Wang Y, Thompson PM, Ye J, Alzheimer\u2019s Disease Neuroimaging Initiative. Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study. Neuroimage. 2014;87:220\u2013241.","DOI":"10.1016\/j.neuroimage.2013.10.005"},{"issue":"1","key":"330_CR16","first-page":"863","volume":"61","author":"A Fern\u00e1ndez","year":"2018","unstructured":"Fern\u00e1ndez A, Garc\u00eda S, Herrera F, Chawla NV. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-Year Anniversary. J Artif Int Res. 2018;61(1):863\u2013905.","journal-title":"J Artif Int Res."},{"issue":"2","key":"330_CR17","first-page":"90","volume":"6","author":"S He","year":"2019","unstructured":"He S. BSMOTE with LDA for high-dimensional and class imbalanced ovarian cancer data. Int J Sci. 2019;6(2):90\u2013101.","journal-title":"Int J Sci"},{"key":"330_CR18","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","volume":"16","author":"NV Chawla","year":"2002","unstructured":"Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. 
SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res. 2002;16:321\u201357. https:\/\/doi.org\/10.1613\/jair.953.","journal-title":"J Artif Intell Res."},{"key":"330_CR19","doi-asserted-by":"publisher","unstructured":"Verbiest N, Ramentol E, Cornelis C, Herrera F. Improving SMOTE with Fuzzy Rough Prototype Selection to Detect Noise in Imbalanced Classification Data, vol 7637. 2012. https:\/\/doi.org\/10.1007\/978-3-642-34654-5_18.","DOI":"10.1007\/978-3-642-34654-5_18"},{"issue":"11","key":"330_CR20","doi-asserted-by":"publisher","first-page":"1546","DOI":"10.3844\/jcssp.2020.1546.1557","volume":"16","author":"KM Hasib","year":"2020","unstructured":"Hasib KM, Iqbal MS, Shah FM, Al Mahmud J, Popel MH, Showrov MIH, et al. A Survey of Methods for Managing the Classification and Solution of Data Imbalance Problem. J Comput Sci. 2020;16(11):1546\u201357. https:\/\/doi.org\/10.3844\/jcssp.2020.1546.1557.","journal-title":"J Comput Sci."},{"key":"330_CR21","doi-asserted-by":"publisher","unstructured":"Batista GEAPA, Prati RC, Monard MC. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. SIGKDD Explor Newsl. 2004;6(1):20\u201329. https:\/\/doi.org\/10.1145\/1007730.1007735.","DOI":"10.1145\/1007730.1007735"},{"issue":"5","key":"330_CR22","doi-asserted-by":"publisher","first-page":"796","DOI":"10.1109\/TPAMI.2007.70735","volume":"30","author":"T Lin","year":"2008","unstructured":"Lin T, Zha H. Riemannian Manifold Learning. IEEE Trans Pattern Anal Mach Intell. 2008;30(5):796\u2013809. https:\/\/doi.org\/10.1109\/TPAMI.2007.70735.","journal-title":"IEEE Trans Pattern Anal Mach Intell."},{"key":"330_CR23","unstructured":"Raghu M, Blumer K, Sayres R, Obermeyer Z, Kleinberg B, Mullainathan S, et al. Direct uncertainty prediction for medical second opinions. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97. 
PMLR; 2019.\u00a0p. 5281\u201390."},{"key":"330_CR24","unstructured":"Kendall A, Gal Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? CoRR. 2017. arXiv:abs\/1703.04977. 1703.04977."},{"issue":"6","key":"330_CR25","doi-asserted-by":"publisher","first-page":"1164","DOI":"10.1016\/j.jbi.2012.07.011","volume":"45","author":"CL Chi","year":"2012","unstructured":"Chi CL, Nick Street W, Robinson JG, Crawford MA. Individualized patient-centered lifestyle recommendations: an expert system for communicating patient specific cardiovascular risk information and prioritizing lifestyle options. J Biomed Inform. 2012;45(6):1164\u201374.","journal-title":"J Biomed Inform."},{"issue":"6","key":"330_CR26","doi-asserted-by":"publisher","first-page":"483","DOI":"10.1093\/oxfordjournals.aje.a009302","volume":"146","author":"LE Chambless","year":"1997","unstructured":"Chambless LE, Heiss G, Folsom AR, Rosamond W, Szklo M, Sharrett AR, et al. Association of coronary heart disease incidence with carotid arterial wall thickness and major risk factors: the Atherosclerosis Risk in Communities (ARIC) Study, 1987\u20131993. Am J Epidemiol. 1997;146(6):483\u201394.","journal-title":"Am J Epidemiol."},{"key":"330_CR27","doi-asserted-by":"publisher","unstructured":"Dogan A, Li Y, Peter Odo C, Sonawane K, Lin Y, Liu C. A utility-based machine learning-driven personalized lifestyle recommendation for cardiovascular disease prevention.\u00a0J Biomed Inform. 2023:104342. https:\/\/doi.org\/10.1016\/j.jbi.2023.104342.","DOI":"10.1016\/j.jbi.2023.104342"},{"issue":"1","key":"330_CR28","doi-asserted-by":"publisher","first-page":"2959","DOI":"10.1038\/s41598-017-03011-5","volume":"7","author":"M Schubach","year":"2017","unstructured":"Schubach M, Re M, Robinson PN, Valentini G. Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants. Sci Rep. 2017;7(1):2959. 
https:\/\/doi.org\/10.1038\/s41598-017-03011-5.","journal-title":"Sci Rep."},{"key":"330_CR29","doi-asserted-by":"publisher","first-page":"39707","DOI":"10.1109\/ACCESS.2021.3064084","volume":"9","author":"A Ishaq","year":"2021","unstructured":"Ishaq A, Sadiq S, Umer M, Ullah S, Mirjalili S, Rupapara V, et al. Improving the Prediction of Heart Failure Patients\u2019 Survival Using SMOTE and Effective Data Mining Techniques. IEEE Access. 2021;9:39707\u201316. https:\/\/doi.org\/10.1109\/ACCESS.2021.3064084.","journal-title":"IEEE Access."},{"issue":"5","key":"330_CR30","doi-asserted-by":"publisher","first-page":"92","DOI":"10.1007\/s10916-018-0940-7","volume":"42","author":"M Maniruzzaman","year":"2018","unstructured":"Maniruzzaman M, Rahman MJ, Al-MehediHasan M, Suri HS, Abedin MM, El-Baz A, et al. Accurate Diabetes Risk Stratification Using Machine Learning: Role of Missing Value and Outliers. J Med Syst. 2018;42(5):92. https:\/\/doi.org\/10.1007\/s10916-018-0940-7.","journal-title":"J Med Syst."},{"key":"330_CR31","doi-asserted-by":"publisher","first-page":"102232","DOI":"10.1109\/ACCESS.2019.2929866","volume":"7","author":"Q Wang","year":"2019","unstructured":"Wang Q, Cao W, Guo J, Ren J, Cheng Y, Davis DN. DMP_MI: An Effective Diabetes Mellitus Classification Algorithm on Imbalanced Data With Missing Values. IEEE Access. 2019;7:102232\u20138. 
https:\/\/doi.org\/10.1109\/ACCESS.2019.2929866.","journal-title":"IEEE Access."}],"container-title":["BioData Mining"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-023-00330-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13040-023-00330-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-023-00330-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,4,25]],"date-time":"2023-04-25T10:05:08Z","timestamp":1682417108000},"score":1,"resource":{"primary":{"URL":"https:\/\/biodatamining.biomedcentral.com\/articles\/10.1186\/s13040-023-00330-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,4,25]]},"references-count":31,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,12]]}},"alternative-id":["330"],"URL":"https:\/\/doi.org\/10.1186\/s13040-023-00330-4","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-1647776\/v1","asserted-by":"object"}]},"ISSN":["1756-0381"],"issn-type":[{"value":"1756-0381","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,4,25]]},"assertion":[{"value":"12 May 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 March 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 April 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not 
applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"15"}}