{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T18:44:30Z","timestamp":1773254670016,"version":"3.50.1"},"reference-count":28,"publisher":"Oxford University Press (OUP)","issue":"14","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2011,7,15]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: Classification and feature selection of genomics or transcriptomics data is often hampered by the large number of features as compared with the small number of samples available. Moreover, features represented by probes that either have similar molecular functions (gene expression analysis) or genomic locations (DNA copy number analysis) are highly correlated. Classical model selection methods such as penalized logistic regression or random forest become unstable in the presence of high feature correlations. Sophisticated penalties such as group Lasso or fused Lasso can force the models to assign similar weights to correlated features and thus improve model stability and interpretability. In this article, we show that the measures of feature relevance corresponding to the above-mentioned methods are biased such that the weights of the features belonging to groups of correlated features decrease as the sizes of the groups increase, which leads to incorrect model interpretation and misleading feature ranking.<\/jats:p><jats:p>Results: With simulation experiments, we demonstrate that Lasso logistic regression, fused support vector machine, group Lasso and random forest models suffer from correlation bias. Using simulations, we show that two related methods for group selection based on feature clustering can be used for correcting the correlation bias. These techniques also improve the stability and the accuracy of the baseline models. 
We apply all methods investigated to a breast cancer and a bladder cancer arrayCGH dataset in order to identify copy number aberrations predictive of tumor phenotype.<\/jats:p><jats:p>Availability: R code can be found at: http:\/\/www.mpi-inf.mpg.de\/~laura\/Clustering.r.<\/jats:p><jats:p>Contact: \u00a0laura.tolosi@mpi-inf.mpg.de<\/jats:p><jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btr300","type":"journal-article","created":{"date-parts":[[2011,5,17]],"date-time":"2011-05-17T03:25:21Z","timestamp":1305602721000},"page":"1986-1994","source":"Crossref","is-referenced-by-count":355,"title":["Classification with correlated features: unreliability of feature ranking and solutions"],"prefix":"10.1093","volume":"27","author":[{"given":"Laura","family":"Tolo\u015fi","sequence":"first","affiliation":[]},{"given":"Thomas","family":"Lengauer","sequence":"additional","affiliation":[]}],"member":"286","published-online":{"date-parts":[[2011,5,16]]},"reference":[{"issue":"19 Part 1","key":"2023012712450272700_B1","doi-asserted-by":"crossref","first-page":"7012","DOI":"10.1158\/1078-0432.CCR-05-0177","article-title":"Bladder cancer stage and outcome defined by array based comparative genomic hybridization","volume":"11","author":"Blaveri","year":"2005","journal-title":"Clin. Cancer Res."},{"key":"2023012712450272700_B2","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. 
Learn."},{"key":"2023012712450272700_B3","doi-asserted-by":"crossref","first-page":"818","DOI":"10.1158\/0008-5472.CAN-06-3307","article-title":"Deletion of chromosome 11q predicts response to anthracycline-based chemotherapy in early breast cancer","volume":"67","author":"Climent","year":"2007","journal-title":"Cancer Res."},{"key":"2023012712450272700_B4","doi-asserted-by":"crossref","first-page":"106","DOI":"10.1016\/j.jmva.2004.02.012","article-title":"Finding predictive gene groups from microarray data","volume":"90","author":"Dettling","year":"2004","journal-title":"J. Multivar. Anal."},{"key":"2023012712450272700_B5","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1186\/1471-2105-7-3","article-title":"Gene selection and classification of microarray data using random forest","volume":"7","author":"D\u00edaz-Uriarte","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2023012712450272700_B6","doi-asserted-by":"crossref","first-page":"2076","DOI":"10.1016\/S0959-8049(98)00241-X","article-title":"Mapping loss of heterozygozity at chromosome 13q: loss at 13q12-q13 is associated with breast tumor progression and poor prognosis","volume":"34","author":"Eiriksdottir","year":"1998","journal-title":"Eur. J. Cancer"},{"key":"2023012712450272700_B7","doi-asserted-by":"crossref","first-page":"849","DOI":"10.1111\/j.1467-9868.2008.00674.x","article-title":"Sure independence screening for ultrahigh dimensional feature space","volume":"70","author":"Fan","year":"2008","journal-title":"J. R. Stat. Soc. Ser. B Stat. Methodol."},{"key":"2023012712450272700_B8","first-page":"1","article-title":"Regularization paths for generalized linear models via coordinate descent","volume":"33","author":"Friedman","year":"2010","journal-title":"J. Stat. 
Softwr."},{"key":"2023012712450272700_B9","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1007\/978-1-84800-155-8_7","article-title":"Graph implementations for nonsmooth convex programs","volume-title":"Recent Advances in Learning and Control","author":"Grant","year":"2008"},{"key":"2023012712450272700_B10","doi-asserted-by":"crossref","DOI":"10.1007\/978-0-387-21606-5","volume-title":"The Elements of Statistical Learning.","author":"Hastie","year":"2001"},{"key":"2023012712450272700_B11","doi-asserted-by":"crossref","first-page":"1465","DOI":"10.1101\/gr.5460106","article-title":"Novel patterns of genome rearrangement and their association with survival in breast cancer","volume":"16","author":"Hicks","year":"2006","journal-title":"Genome Res."},{"key":"2023012712450272700_B12","doi-asserted-by":"crossref","first-page":"1590","DOI":"10.1016\/S0140-6736(03)13308-9","article-title":"Gene expression predictors of breast cancer outcomes","volume":"361","author":"Huang","year":"2003","journal-title":"Lancet"},{"key":"2023012712450272700_B13","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1038\/ng1167","article-title":"Gene expression phenotypic models that predict the activity of oncogenic pathways","volume":"34","author":"Huang","year":"2003","journal-title":"Nat. Genet."},{"key":"2023012712450272700_B14","first-page":"53","article-title":"Improved gene selection for classification of microarrays","volume":"8","author":"J\u00e4ger","year":"2003","journal-title":"Pac. Sympos. 
Biocomput."},{"key":"2023012712450272700_B15","first-page":"218","article-title":"Stability of feature selection algorithms","author":"Kalousis","year":"2005","journal-title":"ICDM '05 Proceedings"},{"key":"2023012712450272700_B16","first-page":"18","article-title":"Classification and regression by randomForest","volume":"2","author":"Liaw","year":"2002","journal-title":"R News"},{"key":"2023012712450272700_B17","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1186\/1471-2105-8-60","article-title":"Supervised group Lasso with applications to microarray data analysis","volume":"8","author":"Ma","year":"2007","journal-title":"BMC Bioinformatics"},{"key":"2023012712450272700_B18","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1111\/j.1467-9868.2007.00627.x","article-title":"The group lasso for logistic regression","volume":"70","author":"Meier","year":"2008","journal-title":"J. R. Stat. Soc. B"},{"key":"2023012712450272700_B19","doi-asserted-by":"crossref","first-page":"368","DOI":"10.2353\/jmoldx.2007.060167","article-title":"Optimization of quantitative MGMT promoter methylation analysis using pyrosequencing and combined bisulfite restriction analysis","volume":"9","author":"Mikeska","year":"2007","journal-title":"J. Mol. 
Diagn."},{"key":"2023012712450272700_B20","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1186\/1471-2105-9-87","article-title":"Building pathway clusters from Random Forests classification using class votes","volume":"9","author":"Pang","year":"2008","journal-title":"BMC Bioinformatics"},{"key":"2023012712450272700_B21","doi-asserted-by":"crossref","first-page":"212","DOI":"10.1093\/biostatistics\/kxl002","article-title":"Averaged gene expression for regression","volume":"8","author":"Park","year":"2007","journal-title":"Biostatistics"},{"key":"2023012712450272700_B22","doi-asserted-by":"crossref","first-page":"i375","DOI":"10.1093\/bioinformatics\/btn188","article-title":"Classification of arrayCGH data using fused SVM","volume":"24","author":"Rapaport","year":"2008","journal-title":"Bioinformatics"},{"key":"2023012712450272700_B23","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1016\/0377-0427(87)90125-7","article-title":"Silhouettes: a graphical aid to the interpretation and validation of cluster analysis","volume":"20","author":"Rousseeuw","year":"1987","journal-title":"J. Comput. Appl. Math."},{"key":"2023012712450272700_B24","doi-asserted-by":"crossref","first-page":"307","DOI":"10.1186\/1471-2105-9-307","article-title":"Conditional variable importance for random forests","volume":"9","author":"Strobl","year":"2008","journal-title":"BMC Bioinformatics"},{"key":"2023012712450272700_B25","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1111\/j.2517-6161.1996.tb02080.x","article-title":"Regression shrinkage and selection via the Lasso","volume":"58","author":"Tibshirani","year":"1996","journal-title":"J. R. Stat. Soc. 
B"},{"key":"2023012712450272700_B26","doi-asserted-by":"crossref","first-page":"530","DOI":"10.1038\/415530a","article-title":"Gene expression profiling predicts clinical outcome of breast cancer","volume":"415","author":"van't","year":"2001","journal-title":"Nature"},{"key":"2023012712450272700_B27","doi-asserted-by":"crossref","DOI":"10.1145\/1401890.1401986","article-title":"Stable feature selection via dense feature groups","author":"Yu","year":"2008","journal-title":"Proceedings of the 14th ACM KDD'08."},{"key":"2023012712450272700_B28","first-page":"1509","article-title":"One-step sparse estimates in nonconcave penalized likelihood models","volume":"36","author":"Zou","year":"2008","journal-title":"Ann. Stat."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/27\/14\/1986\/48933244\/bioinformatics_27_14_1986.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/27\/14\/1986\/48933244\/bioinformatics_27_14_1986.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,5]],"date-time":"2025-03-05T15:17:40Z","timestamp":1741187860000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/27\/14\/1986\/194387"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2011,5,16]]},"references-count":28,"journal-issue":{"issue":"14","published-print":{"date-parts":[[2011,7,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btr300","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2011,7]]},"published":{"date-parts":[[2011,5,16]]}}}