{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,15]],"date-time":"2026-04-15T20:09:22Z","timestamp":1776283762106,"version":"3.50.1"},"reference-count":36,"publisher":"Oxford University Press (OUP)","issue":"21","license":[{"start":{"date-parts":[[2018,5,10]],"date-time":"2018-05-10T00:00:00Z","timestamp":1525910400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001659","name":"Deutsche Forschungsgemeinschaft","doi-asserted-by":"publisher","award":["CRU303 Z2"],"award-info":[{"award-number":["CRU303 Z2"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001659","name":"Deutsche Forschungsgemeinschaft","doi-asserted-by":"publisher","award":["FOR2488 P7"],"award-info":[{"award-number":["FOR2488 P7"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001659","name":"Deutsche Forschungsgemeinschaft","doi-asserted-by":"publisher","award":["KO2250\/5-1"],"award-info":[{"award-number":["KO2250\/5-1"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,11,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms are variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The procedure is included in the ranger package, available at https:\/\/cran.r-project.org\/package=ranger and https:\/\/github.com\/imbs-hl\/ranger.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/bty373","type":"journal-article","created":{"date-parts":[[2018,5,8]],"date-time":"2018-05-08T11:11:04Z","timestamp":1525777864000},"page":"3711-3718","source":"Crossref","is-referenced-by-count":666,"title":["The revival of the Gini importance?"],"prefix":"10.1093","volume":"34","author":[{"given":"Stefano","family":"Nembrini","sequence":"first","affiliation":[{"name":"Department of Epidemiology, College of Public Health and Health Professions & College of Medicine, University of Florida, Gainesville, FL, USA"}]},{"given":"Inke R","family":"K\u00f6nig","sequence":"additional","affiliation":[{"name":"Institut f\u00fcr Medizinische Biometrie und Statistik, Universit\u00e4t zu L\u00fcbeck, Universit\u00e4tsklinikum Schleswig-Holstein, Campus L\u00fcbeck, L\u00fcbeck, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8542-6291","authenticated-orcid":false,"given":"Marvin N","family":"Wright","sequence":"additional","affiliation":[{"name":"Institut f\u00fcr Medizinische Biometrie und Statistik, Universit\u00e4t zu L\u00fcbeck, Universit\u00e4tsklinikum Schleswig-Holstein, Campus L\u00fcbeck, L\u00fcbeck, Germany"},{"name":"Leibniz Institute for Prevention Research and Epidemiology \u2013 BIPS, Bremen, Germany"}]}],"member":"286","published-online":{"date-parts":[[2018,5,10]]},"reference":[{"key":"2023012712351525700_bty373-B1","doi-asserted-by":"crossref","first-page":"1340","DOI":"10.1093\/bioinformatics\/btq134","article-title":"Permutation importance: a corrected feature importance measure","volume":"26","author":"Altmann","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012712351525700_bty373-B2","volume-title":"Combinatorics of Permutations","author":"B\u00f3na","year":"2012"},{"key":"2023012712351525700_bty373-B3","doi-asserted-by":"crossref","first-page":"292","DOI":"10.1093\/bib\/bbr053","article-title":"Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations","volume":"13","author":"Boulesteix","year":"2012","journal-title":"Brief Bioinform"},{"key":"2023012712351525700_bty373-B4","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1007\/BF00058655","article-title":"Bagging predictors","volume":"24","author":"Breiman","year":"1996","journal-title":"Mach. Learn"},{"key":"2023012712351525700_bty373-B5","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn"},{"key":"2023012712351525700_bty373-B6","volume-title":"Classification and Regression Trees","author":"Breiman","year":"1984"},{"key":"2023012712351525700_bty373-B7","doi-asserted-by":"crossref","first-page":"86","DOI":"10.1093\/bib\/bbq011","article-title":"Letter to the editor: stability of random forest importance measures","volume":"12","author":"Calle","year":"2011","journal-title":"Brief Bioinform"},{"key":"2023012712351525700_bty373-B8","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1186\/1471-2105-7-3","article-title":"Gene selection and classification of microarray data using random forest","volume":"7","author":"D\u00edaz-Uriarte","year":"2006","journal-title":"BMC Bioinform"},{"key":"2023012712351525700_bty373-B9","author":"Degenhardt","year":"2017"},{"key":"2023012712351525700_bty373-B10","author":"Deng","year":"2013"},{"key":"2023012712351525700_bty373-B11","doi-asserted-by":"crossref","DOI":"10.2202\/1544-6115.1691","article-title":"Random forests for genetic association studies","volume":"10","author":"Goldstein","year":"2011","journal-title":"Stat. Appl. Genet. Mol. Biol"},{"key":"2023012712351525700_bty373-B12","doi-asserted-by":"crossref","first-page":"531","DOI":"10.1126\/science.286.5439.531","article-title":"Molecular classification of cancer: class discovery and class prediction by gene expression monitoring","volume":"286","author":"Golub","year":"1999","journal-title":"Science"},{"key":"2023012712351525700_bty373-B13","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1016\/j.csda.2012.09.020","article-title":"A new variable selection approach using random forests","volume":"60","author":"Hapfelmeier","year":"2013","journal-title":"Comput. Stat. Data Anal"},{"key":"2023012712351525700_bty373-B14","doi-asserted-by":"crossref","first-page":"651","DOI":"10.1198\/106186006X133933","article-title":"Unbiased recursive partitioning: a conditional inference framework","volume":"15","author":"Hothorn","year":"2006","journal-title":"J. Comput. Graph Stat"},{"key":"2023012712351525700_bty373-B15","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1007\/s10994-014-5451-2","article-title":"The effect of splitting on random forests","volume":"99","author":"Ishwaran","year":"2015","journal-title":"Mach. Learn"},{"key":"2023012712351525700_bty373-B16","doi-asserted-by":"crossref","first-page":"841","DOI":"10.1214\/08-AOAS169","article-title":"Random survival forests","volume":"2","author":"Ishwaran","year":"2008","journal-title":"Ann. Appl. Stat"},{"key":"2023012712351525700_bty373-B17","doi-asserted-by":"crossref","DOI":"10.1007\/s11634-016-0276-4","article-title":"A computationally fast variable importance test for random forests for high-dimensional data","author":"Janitza","year":"2016","journal-title":"Adv. Data Anal. Classif"},{"key":"2023012712351525700_bty373-B18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v036.i11","article-title":"Feature selection with the boruta package","volume":"36","author":"Kursa","year":"2010","journal-title":"J. Stat. Softw"},{"key":"2023012712351525700_bty373-B19","doi-asserted-by":"crossref","first-page":"369","DOI":"10.1093\/bib\/bbr016","article-title":"Letter to the editor: on the stability and ranking of predictors from random forest variable importance measures","volume":"12","author":"Nicodemus","year":"2011","journal-title":"Brief Bioinform"},{"key":"2023012712351525700_bty373-B20","doi-asserted-by":"crossref","first-page":"1884","DOI":"10.1093\/bioinformatics\/btp331","article-title":"Predictor correlation impacts machine learning algorithms: implications for genomic studies","volume":"25","author":"Nicodemus","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012712351525700_bty373-B21","doi-asserted-by":"crossref","first-page":"110.","DOI":"10.1186\/1471-2105-11-110","article-title":"The behaviour of random forest permutation-based variable importance measures under predictor correlation","volume":"11","author":"Nicodemus","year":"2010","journal-title":"BMC Bioinform"},{"key":"2023012712351525700_bty373-B22","first-page":"530","volume-title":"Advances in Neural Information Processing Systems","author":"Noordewier","year":"1991"},{"key":"2023012712351525700_bty373-B23","doi-asserted-by":"crossref","first-page":"557","DOI":"10.1007\/11908029_58","volume-title":"International Conference on Rough Sets and Current Trends in Computing","author":"Rudnicki","year":"2006"},{"key":"2023012712351525700_bty373-B24","doi-asserted-by":"crossref","first-page":"611","DOI":"10.1198\/106186008X344522","article-title":"A bias correction algorithm for the Gini variable importance measure in classification trees","volume":"17","author":"Sandri","year":"2008","journal-title":"J. Comput. Graph Stat"},{"key":"2023012712351525700_bty373-B25","doi-asserted-by":"crossref","first-page":"450","DOI":"10.1016\/j.eswa.2016.07.018","article-title":"On the use of Harrell\u2019s C for clinical risk prediction via random survival forests","volume":"63","author":"Schmid","year":"2016","journal-title":"Expert Syst. Appl"},{"key":"2023012712351525700_bty373-B26","doi-asserted-by":"crossref","first-page":"25.","DOI":"10.1186\/1471-2105-8-25","article-title":"Bias in random forest variable importance measures: illustrations, sources and a solution","volume":"8","author":"Strobl","year":"2007","journal-title":"BMC Bioinform"},{"key":"2023012712351525700_bty373-B27","doi-asserted-by":"crossref","first-page":"307.","DOI":"10.1186\/1471-2105-9-307","article-title":"Conditional variable importance for random forests","volume":"9","author":"Strobl","year":"2008","journal-title":"BMC Bioinform"},{"key":"2023012712351525700_bty373-B28","doi-asserted-by":"crossref","first-page":"7.","DOI":"10.1186\/s13040-016-0087-3","article-title":"r2vim: a new variable selection method for random forests in genome-wide association studies","volume":"9","author":"Szymczak","year":"2016","journal-title":"BioData Min"},{"key":"2023012712351525700_bty373-B29","first-page":"2181","volume-title":"IJCNN\u201906. International Joint Conference on Neural Networks","author":"Tuv","year":"2006"},{"key":"2023012712351525700_bty373-B30","doi-asserted-by":"crossref","first-page":"530","DOI":"10.1038\/415530a","article-title":"Gene expression profiling predicts clinical outcome of breast cancer","volume":"415","author":"van\u2019T Veer","year":"2002","journal-title":"Nature"},{"key":"2023012712351525700_bty373-B31","doi-asserted-by":"crossref","first-page":"2615","DOI":"10.1093\/bioinformatics\/bts483","article-title":"An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data","volume":"28","author":"Walters","year":"2012","journal-title":"Bioinformatics"},{"key":"2023012712351525700_bty373-B32","doi-asserted-by":"crossref","first-page":"445","DOI":"10.1016\/j.ajhg.2009.03.011","article-title":"Genetic control of human brain transcript expression in Alzheimer disease","volume":"84","author":"Webster","year":"2009","journal-title":"Am. J. Hum. Genet"},{"key":"2023012712351525700_bty373-B33","doi-asserted-by":"crossref","DOI":"10.18637\/jss.v077.i01","article-title":"ranger: a fast implementation of random forests for high dimensional data in C++ and R","volume":"77","author":"Wright","year":"2017","journal-title":"J. Stat. Softw"},{"key":"2023012712351525700_bty373-B34","doi-asserted-by":"crossref","first-page":"1272","DOI":"10.1002\/sim.7212","article-title":"Unbiased split variable selection for random survival forests using maximally selected rank statistics","volume":"36","author":"Wright","year":"2017","journal-title":"Stat. Med"},{"key":"2023012712351525700_bty373-B35","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1198\/016214506000000843","article-title":"Controlling variable selection by the addition of pseudovariables","volume":"102","author":"Wu","year":"2007","journal-title":"J. Am. Stat. Assoc"},{"key":"2023012712351525700_bty373-B36","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1002\/widm.1114","article-title":"Mining data with random forests: current options for real-world applications","volume":"4","author":"Ziegler","year":"2014","journal-title":"Wires Data Min. Knowl"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/21\/3711\/48920785\/bioinformatics_34_21_3711.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/21\/3711\/48920785\/bioinformatics_34_21_3711.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T13:25:27Z","timestamp":1674825927000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/34\/21\/3711\/4994791"}},"subtitle":[],"editor":[{"given":"Alfonso","family":"Valencia","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2018,5,10]]},"references-count":36,"journal-issue":{"issue":"21","published-print":{"date-parts":[[2018,11,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bty373","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2018,11,1]]},"published":{"date-parts":[[2018,5,10]]}}}