{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T07:19:40Z","timestamp":1740122380176,"version":"3.37.3"},"reference-count":49,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100006831","name":"u.s. air force","doi-asserted-by":"publisher","award":["FA9550-17-1-010"],"award-info":[{"award-number":["FA9550-17-1-010"]}],"id":[{"id":"10.13039\/100006831","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Data Min Knowl Disc"],"published-print":{"date-parts":[[2022,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The presence of imbalanced classes is more and more common in practical applications and it is known to heavily compromise the learning process. In this paper we propose a new method aimed at addressing this issue in binary supervised classification. Re-balancing the class sizes has turned out to be a fruitful strategy to overcome this problem. Our proposal performs re-balancing through matrix sketching. Matrix sketching is a recently developed data compression technique that is characterized by the property of preserving most of the linear information that is present in the data. Such property is guaranteed by the Johnson-Lindenstrauss\u2019 Lemma (1984) and allows to embed an <jats:italic>n<\/jats:italic>-dimensional space into a reduced one without distorting, within an <jats:inline-formula><jats:alternatives><jats:tex-math>$$\\epsilon $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mi>\u03f5<\/mml:mi>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula>-size interval, the distances between any pair of points. We propose to use matrix sketching as an alternative to the standard re-balancing strategies that are based on random under-sampling the majority class or random over-sampling the minority one. We assess the properties of our method when combined with linear discriminant analysis (LDA), classification trees (C4.5) and Support Vector Machines (SVM) on simulated and real data. Results show that sketching can represent a sound alternative to the most widely used rebalancing methods.<\/jats:p>","DOI":"10.1007\/s10618-021-00791-3","type":"journal-article","created":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T09:07:27Z","timestamp":1634461647000},"page":"174-208","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Matrix sketching for supervised classification with imbalanced classes"],"prefix":"10.1007","volume":"36","author":[{"given":"Roberta","family":"Falcone","sequence":"first","affiliation":[]},{"given":"Laura","family":"Anderlucci","sequence":"additional","affiliation":[]},{"given":"Angela","family":"Montanari","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,10,17]]},"reference":[{"issue":"2","key":"791_CR1","doi-asserted-by":"publisher","first-page":"283","DOI":"10.1093\/biomet\/asaa062","volume":"108","author":"DC Ahfock","year":"2021","unstructured":"Ahfock DC, Astle WJ, Richardson S (2021) Statistical properties of sketching algorithms. Biometrika 108(2):283\u2013297","journal-title":"Biometrika"},{"issue":"1","key":"791_CR2","doi-asserted-by":"publisher","first-page":"302","DOI":"10.1137\/060673096","volume":"39","author":"N Ailon","year":"2009","unstructured":"Ailon N, Chazelle B (2009) The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J Comput 39(1):302\u2013322","journal-title":"SIAM J Comput"},{"key":"791_CR3","doi-asserted-by":"crossref","unstructured":"Almogahed BA, Kakadiaris IA (2014) Empowering imbalanced data in supervised learning: a semi-supervised learning approach. In: International Conference on Artificial Neural Networks, Springer, pp 523\u2013530","DOI":"10.1007\/978-3-319-11179-7_66"},{"key":"791_CR4","volume-title":"An introduction to multivariate statistical analysis","author":"TW Anderson","year":"1962","unstructured":"Anderson TW (1962) An introduction to multivariate statistical analysis. Wiley, New York"},{"key":"791_CR5","doi-asserted-by":"crossref","unstructured":"Batista GEDAPA, Silva DF, Prati RC (2012) An experimental design to evaluate class imbalance treatment methods. In: 2012 11th International Conference on Machine Learning and Applications, vol\u00a02, pp 95\u2013101, 10.1109\/ICMLA.2012.162","DOI":"10.1109\/ICMLA.2012.162"},{"issue":"3","key":"791_CR6","doi-asserted-by":"publisher","first-page":"605","DOI":"10.1007\/s10994-017-5670-4","volume":"107","author":"C Bellinger","year":"2018","unstructured":"Bellinger C, Drummond C, Japkowicz N (2018) Manifold-based synthetic oversampling with manifold conformance estimation. Mach Learn 107(3):605\u2013637","journal-title":"Mach Learn"},{"issue":"3\/4","key":"791_CR7","doi-asserted-by":"publisher","first-page":"317","DOI":"10.2307\/2332671","volume":"36","author":"GE Box","year":"1949","unstructured":"Box GE (1949) A general distribution theory for a class of likelihood criteria. Biometrika 36(3\/4):317\u2013346","journal-title":"Biometrika"},{"issue":"2","key":"791_CR8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2907070","volume":"49","author":"P Branco","year":"2016","unstructured":"Branco P, Torgo L, Ribeiro RP (2016) A survey of predictive modeling on imbalanced domains. ACM Comput Surv (CSUR) 49(2):1\u201350","journal-title":"ACM Comput Surv (CSUR)"},{"key":"791_CR9","doi-asserted-by":"publisher","first-page":"321","DOI":"10.1613\/jair.953","volume":"16","author":"NV Chawla","year":"2002","unstructured":"Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res 16:321\u2013357","journal-title":"J Artif Intell Res"},{"issue":"1","key":"791_CR10","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/1007730.1007733","volume":"6","author":"NV Chawla","year":"2004","unstructured":"Chawla NV, Japkowicz N, Kotcz A (2004) Editorial of the special issue on learning from imbalanced data sets. ACM Sigkdd Explor Newsl 6(1):1\u20136","journal-title":"ACM Sigkdd Explor Newsl"},{"issue":"6","key":"791_CR11","doi-asserted-by":"publisher","first-page":"54","DOI":"10.1145\/3019134","volume":"63","author":"KL Clarkson","year":"2017","unstructured":"Clarkson KL, Woodruff DP (2017) Low-rank approximation and regression in input sparsity time. J ACM (JACM) 63(6):54","journal-title":"J ACM (JACM)"},{"issue":"3","key":"791_CR12","first-page":"273","volume":"20","author":"C Cortes","year":"1995","unstructured":"Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273\u2013297","journal-title":"Mach Learn"},{"key":"791_CR13","unstructured":"Dobriban E, Liu S (2018) A new theory for sketching in linear regression. arXiv:1810.06089, Short version at NeurIPS 2019"},{"key":"791_CR14","doi-asserted-by":"crossref","unstructured":"Domingos P (1999) Metacost: A general method for making classifiers cost-sensitive. In: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pp 155\u2013164","DOI":"10.1145\/312129.312220"},{"key":"791_CR15","unstructured":"Dua D, Graff C (2019) UCI machine learning repository. http:\/\/archive.ics.uci.edu\/ml"},{"key":"791_CR16","unstructured":"Falcone R (2019) Supervised classification with matrix sketching. PhD thesis, University of Bologna"},{"issue":"2","key":"791_CR17","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1007\/s40747-017-0037-9","volume":"3","author":"A Fern\u00e1ndez","year":"2017","unstructured":"Fern\u00e1ndez A, del R\u00edo S, Chawla NV, Herrera F (2017) An insight into imbalanced big data classification: outcomes and challenges. Complex Intell Syst 3(2):105\u2013120","journal-title":"Complex Intell Syst"},{"issue":"2","key":"791_CR18","doi-asserted-by":"publisher","first-page":"179","DOI":"10.1111\/j.1469-1809.1936.tb02137.x","volume":"7","author":"RA Fisher","year":"1936","unstructured":"Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7(2):179\u2013188","journal-title":"Ann Eugen"},{"issue":"5","key":"791_CR19","doi-asserted-by":"publisher","first-page":"1693","DOI":"10.1214\/14-AOS1220","volume":"42","author":"W Fithian","year":"2014","unstructured":"Fithian W, Hastie T (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Statist 42(5):1693","journal-title":"Ann Statist"},{"key":"791_CR20","doi-asserted-by":"publisher","first-page":"147","DOI":"10.2307\/1968346","volume":"34","author":"A Haar","year":"1933","unstructured":"Haar A (1933) Der massbegriff in der theorie der kontinuierlichen gruppen. Ann Math 34:147\u2013169","journal-title":"Ann Math"},{"key":"791_CR21","doi-asserted-by":"publisher","first-page":"220","DOI":"10.1016\/j.eswa.2016.12.035","volume":"73","author":"G Haixiang","year":"2017","unstructured":"Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G (2017) Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 73:220\u2013239","journal-title":"Expert Syst Appl"},{"key":"791_CR22","unstructured":"He H, Bai Y, Garcia EA, Li S (2008) Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), IEEE, pp 1322\u20131328"},{"issue":"10","key":"791_CR23","doi-asserted-by":"publisher","first-page":"3595","DOI":"10.1080\/03610929008830400","volume":"19","author":"N Henze","year":"1990","unstructured":"Henze N, Zirkler B (1990) A class of invariant consistent tests for multivariate normality. Commun Statist - Theor Methods 19(10):3595\u20133617","journal-title":"Commun Statist - Theor Methods"},{"key":"791_CR24","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781139020411","volume-title":"Matrix Anal","author":"RA Horn","year":"2012","unstructured":"Horn RA, Johnson CR (2012) Matrix Anal. Cambridge University Press, Cambridge"},{"key":"791_CR25","unstructured":"Hu XS, Zhang RJ (2013) Clustering-based subset ensemble learning method for imbalanced data. In: 2013 International Conference on Machine Learning and Cybernetics, IEEE, vol\u00a01, pp 35\u201339"},{"issue":"1","key":"791_CR26","doi-asserted-by":"publisher","first-page":"40","DOI":"10.1145\/1007730.1007737","volume":"6","author":"T Jo","year":"2004","unstructured":"Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM Sigkdd Explor Newsl 6(1):40\u201349","journal-title":"ACM Sigkdd Explor Newsl"},{"key":"791_CR27","doi-asserted-by":"publisher","first-page":"2177","DOI":"10.1016\/j.jmva.2005.05.010","volume":"97","author":"H Joe","year":"2006","unstructured":"Joe H (2006) Generating random correlation matrices based on partial correlations. J Multivar Anal 97:2177\u20132189","journal-title":"J Multivar Anal"},{"issue":"1","key":"791_CR28","doi-asserted-by":"publisher","first-page":"189","DOI":"10.1090\/conm\/026\/737400","volume":"26","author":"WB Johnson","year":"1984","unstructured":"Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemp Mathe 26(1):189\u2013206","journal-title":"Contemp Mathe"},{"issue":"4","key":"791_CR29","doi-asserted-by":"publisher","first-page":"221","DOI":"10.1007\/s13748-016-0094-0","volume":"5","author":"B Krawczyk","year":"2016","unstructured":"Krawczyk B (2016) Learning from imbalanced data: open challenges and future directions. Progress Artif Intell 5(4):221\u2013232","journal-title":"Progress Artif Intell"},{"key":"791_CR30","doi-asserted-by":"crossref","unstructured":"Liu XY, Zhou ZH (2013) Ensemble methods for class imbalance learning. Imbalanced Learning: Foundations, Algorithms and Applications pp 61\u201382","DOI":"10.1002\/9781118646106.ch4"},{"key":"791_CR31","doi-asserted-by":"crossref","unstructured":"Lunardon N, Menardi G, Torelli N (2014) ROSE: A package for binary imbalanced learning. R journal 6(1)","DOI":"10.32614\/RJ-2014-008"},{"issue":"1","key":"791_CR32","doi-asserted-by":"publisher","first-page":"168","DOI":"10.1016\/j.csda.2010.06.014","volume":"55","author":"M Maalouf","year":"2011","unstructured":"Maalouf M, Trafalis TB (2011) Robust weighted kernel logistic regression in imbalanced and rare events data. Comput Statist Data Anal 55(1):168\u2013183","journal-title":"Comput Statist Data Anal"},{"issue":"6","key":"791_CR33","doi-asserted-by":"publisher","first-page":"777","DOI":"10.3844\/jcssp.2018.777.792","volume":"14","author":"S Maheshwari","year":"2018","unstructured":"Maheshwari S, Jain R, Jadon R (2018) An insight into rare class problem: analysis and potential solutions. J Comput Sci 14(6):777\u2013792","journal-title":"J Comput Sci"},{"key":"791_CR34","unstructured":"Mani I, Zhang I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of workshop on learning from imbalanced datasets, vol 126"},{"issue":"3","key":"791_CR35","doi-asserted-by":"publisher","first-page":"519","DOI":"10.1093\/biomet\/57.3.519","volume":"57","author":"KV Mardia","year":"1970","unstructured":"Mardia KV (1970) Measures of multivariate skewness and kurtosis with applications. Biometrika 57(3):519\u2013530","journal-title":"Biometrika"},{"key":"791_CR36","volume-title":"Discriminant analysis and statistical pattern recognition","author":"G McLachlan","year":"2004","unstructured":"McLachlan G (2004) Discriminant analysis and statistical pattern recognition. Wiley, Hoboken"},{"issue":"1","key":"791_CR37","doi-asserted-by":"publisher","first-page":"92","DOI":"10.1007\/s10618-012-0295-5","volume":"28","author":"G Menardi","year":"2014","unstructured":"Menardi G, Torelli N (2014) Training and assessing classification rules with imbalanced data. Data Mining Knowl Discov 28(1):92\u2013122","journal-title":"Data Mining Knowl Discov"},{"key":"791_CR38","doi-asserted-by":"crossref","unstructured":"Mullick SS, Datta S, Das S (2019) Generative adversarial minority oversampling. In: Proceedings of the IEEE\/CVF International Conference on Computer Vision, pp 1695\u20131704","DOI":"10.1109\/ICCV.2019.00178"},{"issue":"4","key":"791_CR39","doi-asserted-by":"publisher","first-page":"354","DOI":"10.1016\/j.inffus.2008.04.001","volume":"10","author":"S Panigrahi","year":"2009","unstructured":"Panigrahi S, Kundu A, Sural S, Majumdar AK (2009) Credit card fraud detection: a fusion approach using Dempster-Shafer theory and Bayesian learning. Inform Fusion 10(4):354\u2013363","journal-title":"Inform Fusion"},{"key":"791_CR40","volume-title":"C4.5: Programs for machine learning","author":"JR Quinlan","year":"1993","unstructured":"Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann Publishers, USA"},{"issue":"3","key":"791_CR41","first-page":"506","volume":"9","author":"BV Ramana","year":"2012","unstructured":"Ramana BV, Babu MSP, Venkateswarlu N (2012) A critical comparative study of liver patients from USA and India: an exploratory analysis. Int J Comput Sci Issues (IJCSI) 9(3):506","journal-title":"Int J Comput Sci Issues (IJCSI)"},{"key":"791_CR42","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-28699-5","volume-title":"Emerging paradigms in machine learning","author":"S Ramanna","year":"2013","unstructured":"Ramanna S, Jain LC, Howlett RJ (2013) Emerging paradigms in machine learning. Springer, Berlin"},{"key":"791_CR43","doi-asserted-by":"crossref","unstructured":"Rodriguez D, Herraiz I, Harrison R, Dolado J, Riquelme JC (2014) Preliminary comparison of techniques for dealing with imbalance in software defect prediction. In: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, pp 1\u201310","DOI":"10.1145\/2601248.2601294"},{"issue":"1\u20132","key":"791_CR44","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1561\/0400000060","volume":"10","author":"DP Woodruff","year":"2014","unstructured":"Woodruff DP (2014) Sketching as a tool for numerical linear algebra. Found Trends Theor Comput Sci 10(1\u20132):1\u2013157","journal-title":"Found Trends Theor Comput Sci"},{"issue":"06","key":"791_CR45","doi-asserted-by":"publisher","first-page":"1417","DOI":"10.1142\/S0218001493000698","volume":"07","author":"KS Woods","year":"1993","unstructured":"Woods KS, Doss CC, Bowyer KW, Solka JL, Priebe CE, Kegelmeyer WP (1993) comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography. Int J Pattern Recognit Artif Intell 07(06):1417\u20131436. https:\/\/doi.org\/10.1142\/S0218001493000698","journal-title":"Int J Pattern Recognit Artif Intell"},{"issue":"2","key":"791_CR46","doi-asserted-by":"publisher","first-page":"557","DOI":"10.1016\/j.patcog.2006.01.009","volume":"40","author":"J Xie","year":"2007","unstructured":"Xie J, Qiu Z (2007) The effect of imbalanced data sets on LDA: a theoretical and empirical analysis. Pattern Recognit 40(2):557\u2013562","journal-title":"Pattern Recognit"},{"issue":"5","key":"791_CR47","first-page":"1109","volume":"37","author":"JH Xue","year":"2014","unstructured":"Xue JH, Hall P (2014) Why does rebalancing class-unbalanced data improve AUC for linear discriminant analysis? IEEE Trans Pattern Anal Mach Intell 37(5):1109\u20131112","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"5","key":"791_CR48","doi-asserted-by":"publisher","first-page":"1558","DOI":"10.1016\/j.patcog.2007.11.008","volume":"41","author":"JH Xue","year":"2008","unstructured":"Xue JH, Titterington DM (2008) Do unbalanced data have a negative effect on LDA? Pattern Recognit 41(5):1558\u20131571","journal-title":"Pattern Recognit"},{"issue":"6","key":"791_CR49","doi-asserted-by":"publisher","first-page":"666","DOI":"10.1109\/TST.2012.6374368","volume":"17","author":"H Yu","year":"2012","unstructured":"Yu H, Ni J, Dan Y, Xu S (2012) Mining and integrating reliable decision rules for imbalanced cancer gene expression data sets. Tsinghua Sci Technol 17(6):666\u2013673","journal-title":"Tsinghua Sci Technol"}],"container-title":["Data Mining and Knowledge Discovery"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10618-021-00791-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10618-021-00791-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10618-021-00791-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,1,22]],"date-time":"2022-01-22T06:23:29Z","timestamp":1642832609000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10618-021-00791-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,17]]},"references-count":49,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,1]]}},"alternative-id":["791"],"URL":"https:\/\/doi.org\/10.1007\/s10618-021-00791-3","relation":{},"ISSN":["1384-5810","1573-756X"],"issn-type":[{"type":"print","value":"1384-5810"},{"type":"electronic","value":"1573-756X"}],"subject":[],"published":{"date-parts":[[2021,10,17]]},"assertion":[{"value":"28 July 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 August 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 October 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}