{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,18]],"date-time":"2025-10-18T10:53:36Z","timestamp":1760784816002},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,3,16]],"date-time":"2020-03-16T00:00:00Z","timestamp":1584316800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,3,16]],"date-time":"2020-03-16T00:00:00Z","timestamp":1584316800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In <jats:italic>Machine Learning<\/jats:italic>, if one class has a significantly larger number of instances (majority) than the other (minority), this condition is defined as class imbalance. With regard to datasets, class imbalance can bias the predictive capabilities of <jats:italic>Machine Learning<\/jats:italic> algorithms towards the majority (negative) class, and in situations where false negatives incur a greater penalty than false positives, this imbalance may lead to adverse consequences. Our paper incorporates two case studies, each utilizing a unique approach of three learners (gradient-boosted trees, logistic regression, random forest) and three performance metrics (<jats:italic>Area Under the Receiver Operating Characteristic Curve<\/jats:italic>, <jats:italic>Area Under the Precision-Recall Curve<\/jats:italic>, <jats:italic>Geometric Mean<\/jats:italic>) to investigate class rarity in big data. 
Class rarity, a notably extreme degree of class imbalance, was effected in our experiments by randomly removing minority (positive) instances to artificially generate eight subsets of gradually decreasing positive class instances. All model evaluations were performed through Cross-Validation. In the first case study, which uses a Medicare Part B dataset, performance scores for the learners generally improve with the <jats:italic>Area Under the Receiver Operating Characteristic Curve<\/jats:italic> metric as the rarity level decreases, while corresponding scores with the <jats:italic>Area Under the Precision-Recall Curve<\/jats:italic> and <jats:italic>Geometric Mean<\/jats:italic> metrics show no improvement. In the second case study, which uses a dataset built from Distributed Denial of Service attack data (POSTSlowloris Combined), the <jats:italic>Area Under the Receiver Operating Characteristic Curve<\/jats:italic> metric produces very high-performance scores for the learners, with all subsets of positive class instances. For the second study, scores for the learners generally improve with the <jats:italic>Area Under the Precision-Recall Curve<\/jats:italic> and <jats:italic>Geometric Mean<\/jats:italic> metrics as the rarity level decreases. 
Overall, with regard to both case studies, the <jats:italic>Gradient-Boosted Trees<\/jats:italic> (GBT) learner performs the best.<\/jats:p>","DOI":"10.1186\/s40537-020-00301-0","type":"journal-article","created":{"date-parts":[[2020,3,16]],"date-time":"2020-03-16T13:07:07Z","timestamp":1584364027000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["Investigating class rarity in big data"],"prefix":"10.1186","volume":"7","author":[{"given":"Tawfiq","family":"Hasanin","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Taghi M.","family":"Khoshgoftaar","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Joffrey L.","family":"Leevy","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Richard A.","family":"Bauder","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,3,16]]},"reference":[{"key":"301_CR1","doi-asserted-by":"crossref","unstructured":"Katal A, Wazid M, Goudar R. Big data: issues, challenges, tools and good practices. In: 2013 Sixth International Conference on contemporary computing (IC3). NewYork: IEEE; 2013. p. 404\u2013409.","DOI":"10.1109\/IC3.2013.6612229"},{"issue":"1","key":"301_CR2","doi-asserted-by":"publisher","first-page":"42","DOI":"10.1186\/s40537-018-0151-6","volume":"5","author":"JL Leevy","year":"2018","unstructured":"Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data. 2018;5(1):42.","journal-title":"J Big Data"},{"key":"301_CR3","first-page":"194","volume":"2","author":"RC Soltysik","year":"2013","unstructured":"Soltysik RC, Yarnold PR. Megaoda large sample and big data time trials: separating the chaff. Optim Data Anal. 
2013;2:194\u20137.","journal-title":"Optim Data Anal"},{"issue":"2","key":"301_CR4","doi-asserted-by":"publisher","first-page":"423","DOI":"10.2308\/acch-51068","volume":"29","author":"M Cao","year":"2015","unstructured":"Cao M, Chychyla R, Stewart T. Big data analytics in financial statement audits. Account Horizons. 2015;29(2):423\u20139.","journal-title":"Account Horizons"},{"key":"301_CR5","doi-asserted-by":"crossref","unstructured":"Bauder RA, Khoshgoftaar TM, Hasanin, T. An empirical study on class rarity in big data. In: 2018 17th IEEE International Conference on machine learning and applications (ICMLA). Newyork: IEEE ; 2018. p. 785\u2013790. IEEE","DOI":"10.1109\/ICMLA.2018.00125"},{"key":"301_CR6","doi-asserted-by":"crossref","unstructured":"Bauder R, Khoshgoftaar T. Medicare fraud detection using random forest with class imbalanced big data. In: 2018 IEEE International Conference on information reuse and integration (IRI). Newyork: IEEE; 2018. p. 80\u201387.","DOI":"10.1109\/IRI.2018.00019"},{"key":"301_CR7","volume-title":"Data mining: practical machine learning tools and techniques","author":"IH Witten","year":"2016","unstructured":"Witten IH, Frank E, Hall MA, Pal CJ. Data mining: practical machine learning tools and techniques. Amsterdam: Morgan Kaufmann; 2016."},{"issue":"2","key":"301_CR8","doi-asserted-by":"publisher","first-page":"171","DOI":"10.1086\/587826","volume":"83","author":"JD Olden","year":"2008","unstructured":"Olden JD, Lawler JJ, Poff NL. Machine learning methods without tears: a primer for ecologists. Q Rev Biol. 2008;83(2):171\u201393.","journal-title":"Q Rev Biol"},{"issue":"1","key":"301_CR9","doi-asserted-by":"publisher","first-page":"107","DOI":"10.1023\/A:1008699112516","volume":"15","author":"J Galindo","year":"2000","unstructured":"Galindo J, Tamayo P. Credit risk assessment using statistical and machine learning: basic methodology and risk modeling applications. Comput Econ. 
2000;15(1):107\u201343.","journal-title":"Comput Econ"},{"key":"301_CR10","doi-asserted-by":"crossref","unstructured":"Seliya N, Khoshgoftaar TM, Van\u00a0Hulse J. A study on the relationships of classifier performance metrics. In: 2009 21st IEEE International Conference on tools with artificial intelligence. Newyork: IEEE; 2009. p. 59\u201366.","DOI":"10.1109\/ICTAI.2009.25"},{"key":"301_CR11","doi-asserted-by":"crossref","unstructured":"Triguero I, Galar M, Merino D, Maillo J, Bustince H. Herrera, F. Evolutionary undersampling for extremely imbalanced big data classification under apache spark. In: Evolutionary Computation (CEC), 2016 IEEE Congress on; Newyork: IEE; 2016. p. 640\u2013647.","DOI":"10.1109\/CEC.2016.7743853"},{"key":"301_CR12","unstructured":"Apache Hadoop. http:\/\/hadoop.apache.org\/"},{"key":"301_CR13","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4302-1943-9","volume-title":"Pro Hadoop","author":"J Venner","year":"2009","unstructured":"Venner J. Pro Hadoop. Berkeley: Apress; 2009."},{"key":"301_CR14","volume-title":"Hadoop: the definitive guide","author":"T White","year":"2012","unstructured":"White T. Hadoop: the definitive guide. Sebastopol: O\u2019Reilly Media Inc; 2012."},{"key":"301_CR15","doi-asserted-by":"crossref","unstructured":"Bauder RA, Khoshgoftaar TM, Hasanin T. Data sampling approaches with severely imbalanced big data for medicare fraud detection. In: 2018 IEEE 30th International Conference on tools with artificial intelligence (ICTAI). Newyork: IEEE; 2018. p. 137\u2013142.","DOI":"10.1109\/ICTAI.2018.00030"},{"issue":"1","key":"301_CR16","doi-asserted-by":"publisher","first-page":"63","DOI":"10.1186\/s40537-019-0225-0","volume":"6","author":"JM Johnson","year":"2019","unstructured":"Johnson JM, Khoshgoftaar TM. Medicare fraud detection using neural networks. J Big Data. 2019;6(1):63.","journal-title":"J Big Data"},{"key":"301_CR17","unstructured":"Calvert C, Khoshgoftaar TM, Kemp C, Najafabadi MM. 
Detecting slow http post dos attacks using netflow features. In: The Thirty-second International FLAIRS Conference (2019)."},{"key":"301_CR18","unstructured":"Calvert C, Khoshgoftaar TM, Kemp C, Najafabadi MM. Detection of slowloris attacks using netflow traffic. In: 24th ISSAT International Conference on reliability and quality in design. 2018; p. 191\u2013196."},{"issue":"3","key":"301_CR19","doi-asserted-by":"publisher","first-page":"275","DOI":"10.1162\/evco.2009.17.3.275","volume":"17","author":"S Garc\u00eda","year":"2009","unstructured":"Garc\u00eda S, Herrera F. Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol Comput. 2009;17(3):275\u2013306.","journal-title":"Evol Comput"},{"key":"301_CR20","doi-asserted-by":"publisher","first-page":"112","DOI":"10.1016\/j.ins.2014.03.043","volume":"285","author":"S Del R\u00edo","year":"2014","unstructured":"Del R\u00edo S, L\u00f3pez V, Ben\u00edtez JM, Herrera F. On the use of mapreduce for imbalanced big data using random forest. Inf Sci. 2014;285:112\u201337.","journal-title":"Inf Sci"},{"issue":"1","key":"301_CR21","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1109\/TCIAIG.2013.2285651","volume":"6","author":"AK Baughman","year":"2013","unstructured":"Baughman AK, Chuang W, Dixon KR, Benz Z, Basilico J. Deepqa jeopardy! gamification: a machine-learning perspective. IEEE Trans Comput Intell AI Games. 2013;6(1):55\u201366.","journal-title":"IEEE Trans Comput Intell AI Games"},{"issue":"3","key":"301_CR22","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1609\/aimag.v31i3.2303","volume":"31","author":"D Ferrucci","year":"2010","unstructured":"Ferrucci D, Brown E, Chu-Carroll J, Fan J, Gondek D, Kalyanpur AA, Lally A, Murdock JW, Nyberg E, Prager J, et al. Building watson: an overview of the deepqa project. AI Mag. 
2010;31(3):59\u201379.","journal-title":"AI Mag"},{"key":"301_CR23","unstructured":"LEIE: Medicare provider utilization and payment data: physician and other supplier. https:\/\/oig.hhs.gov\/exclusions\/index.asp"},{"key":"301_CR24","doi-asserted-by":"crossref","unstructured":"Liu Y-h, Zhang H-q, Yang Y-j. A dos attack situation assessment method based on qos. In: Proceedings of 2011 International Conference on computer science and network technology. Newyork: IEEE; 2011. p. 1041\u20131045.","DOI":"10.1109\/ICCSNT.2011.6182139"},{"key":"301_CR25","doi-asserted-by":"crossref","unstructured":"Yevsieieva O, Helalat SM. Analysis of the impact of the slow http dos and ddos attacks on the cloud environment. In: 2017 4th International scientific-practical Conference problems of infocommunications. Science and technology (PIC S&T). Newyork: IEEE; 2017. p. 519\u2013523.","DOI":"10.1109\/INFOCOMMST.2017.8246453"},{"key":"301_CR26","doi-asserted-by":"crossref","unstructured":"Hirakaw T, Ogura K, Bista BB, Takata T. A defense method against distributed slow http dos attack. In: 2016 19th International Conference on network-based information systems (NBiS)). Newyork: IEEE; 2016. p. 519\u2013523.","DOI":"10.1109\/NBiS.2016.58"},{"key":"301_CR27","unstructured":"Slowloris.py. https:\/\/github.com\/gkbrk\/slowloris"},{"key":"301_CR28","unstructured":"Apache Spark MLlib. https:\/\/spark.apache.org\/mllib\/"},{"key":"301_CR29","first-page":"95","volume":"10","author":"M Zaharia","year":"2010","unstructured":"Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. HotCloud. 2010;10:95.","journal-title":"HotCloud"},{"issue":"34","key":"301_CR30","first-page":"1","volume":"17","author":"X Meng","year":"2016","unstructured":"Meng X, Bradley J, Yuvaz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S, et al. Mllib: Machine learning in apache spark. JMLR. 
2016;17(34):1\u20137.","journal-title":"JMLR"},{"key":"301_CR31","doi-asserted-by":"crossref","unstructured":"Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, et\u00a0al. Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on cloud computing. Newyork: ACM; 2013. p. 5.","DOI":"10.1145\/2523616.2523633"},{"issue":"1","key":"301_CR32","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1186\/s40537-018-0138-3","volume":"5","author":"M Herland","year":"2018","unstructured":"Herland M, Khoshgoftaar TM, Bauder RA. Big data fraud detection using multiple medicare data sources. J Big Data. 2018;5(1):29.","journal-title":"J Big Data"},{"key":"301_CR33","doi-asserted-by":"crossref","unstructured":"Seiffert C, Khoshgoftaar TM, Van\u00a0Hulse J, Napolitano A. Mining data with rare events: a case study. In: 19th IEEE International Conference on tools with artificial intelligence (ICTAI 2007). Newyork: IEEE; 2007. vol 2, p. 132\u2013139. IEEE.","DOI":"10.1109\/ICTAI.2007.71"},{"issue":"1","key":"301_CR34","doi-asserted-by":"publisher","first-page":"2","DOI":"10.1186\/2196-1115-1-2","volume":"1","author":"M Herland","year":"2014","unstructured":"Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health informatics. J Big data. 2014;1(1):2.","journal-title":"J Big data"},{"key":"301_CR35","first-page":"0118432","volume":"10","author":"T Saito","year":"2015","unstructured":"Saito T, Rehmsmeier M. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 2015;10:0118432.","journal-title":"PLoS ONE"},{"key":"301_CR36","unstructured":"Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on artificial intelligence. Burlington: Morgan Kaufmann Publishers Inc; 1995. Vol 2, p. 
1137\u20131143."},{"key":"301_CR37","doi-asserted-by":"crossref","unstructured":"Van\u00a0Hulse J, Khoshgoftaar TM, Napolitano A. An empirical comparison of repetitive undersampling techniques. In: 2009 IEEE International Conference on information reuse & integration. Newyork: IEEE; 2009. p. 29\u201334.","DOI":"10.1109\/IRI.2009.5211614"},{"issue":"1","key":"301_CR38","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1214\/009053604000001048","volume":"33","author":"A Gelman","year":"2005","unstructured":"Gelman A. Analysis of variance-why it is more important than ever1. Ann Stat. 2005;33(1):1\u201353.","journal-title":"Ann Stat"},{"key":"301_CR39","doi-asserted-by":"publisher","first-page":"99","DOI":"10.2307\/3001913","volume":"1","author":"JW Tukey","year":"1949","unstructured":"Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;1:99\u2013114.","journal-title":"Biometrics"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-020-00301-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s40537-020-00301-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-020-00301-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,3,16]],"date-time":"2021-03-16T00:52:21Z","timestamp":1615855941000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-020-00301-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,16]]},"references-count":39,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["301"],"URL":"https:\/\/d
oi.org\/10.1186\/s40537-020-00301-0","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,3,16]]},"assertion":[{"value":"9 December 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 March 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 March 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Not applicable.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"23"}}