{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,17]],"date-time":"2026-04-17T04:44:31Z","timestamp":1776401071192,"version":"3.51.2"},"reference-count":32,"publisher":"Oxford University Press (OUP)","issue":"2","license":[{"start":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T00:00:00Z","timestamp":1741564800000},"content-version":"vor","delay-in-days":9,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,3,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Random forest (RF) regression is a popular machine learning method to develop prediction models for continuous outcomes. Variable selection, also known as feature selection or reduction, involves selecting a subset of predictor variables for modeling. Potential benefits of variable selection are methodologic (i.e. improving prediction accuracy and computational efficiency) and practical (i.e. reducing the burden of data collection and improving efficiency). Several variable selection methods leveraging RFs have been proposed, but there is limited evidence to guide decisions on which methods may be preferable for different types of datasets with continuous outcomes. Using 59 publicly available datasets in a benchmarking study, we evaluated the implementation of 13 RF variable selection methods. Performance of variable selection was measured via out-of-sample R2 of a RF that used the variables selected for each method. Simplicity of variable selection was measured via the percent reduction in the number of variables selected out of the number of variables available. Efficiency was measured via computational time required to complete the variable selection. 
Based on our benchmarking study, variable selection methods implemented in the Boruta and aorsf R packages selected the best subset of variables for axis-based RF models, whereas methods implemented in the aorsf R package selected the best subset of variables for oblique RF models. A significant contribution of this study is the ability to assess different variable selection methods in the setting of RF regression for continuous outcomes to identify preferable methods using an open science approach.<\/jats:p>","DOI":"10.1093\/bib\/bbaf096","type":"journal-article","created":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T10:54:56Z","timestamp":1741604096000},"source":"Crossref","is-referenced-by-count":7,"title":["A comparison of random forest variable selection methods for regression modeling of continuous outcomes"],"prefix":"10.1093","volume":"26","author":[{"given":"Nathaniel S","family":"O\u2019Connell","sequence":"first","affiliation":[{"name":"Department of Biostatistics and Data Science, Wake Forest University School of Medicine , Medical Center Boulevard, Winston-Salem, NC 27157 ,","place":["United States"]}]},{"given":"Byron C","family":"Jaeger","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Data Science, Wake Forest University School of Medicine , Medical Center Boulevard, Winston-Salem, NC 27157 ,","place":["United States"]}]},{"given":"Garrett S","family":"Bullock","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Data Science, Wake Forest University School of Medicine , Medical Center Boulevard, Winston-Salem, NC 27157 ,","place":["United States"]},{"name":"Department of Orthopedic Surgery, Wake Forest University School of Medicine , Medical Center Boulevard, Winston-Salem, NC 27157 ,","place":["United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0679-8730","authenticated-orcid":false,"given":"Jaime 
Lynn","family":"Speiser","sequence":"additional","affiliation":[{"name":"Department of Biostatistics and Data Science, Wake Forest University School of Medicine , Medical Center Boulevard, Winston-Salem, NC 27157 ,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2025,3,10]]},"reference":[{"key":"2025031010544639800_ref1","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach Learn"},{"key":"2025031010544639800_ref2","volume-title":"Classification and Regression Trees","author":"Breiman","year":"1984"},{"key":"2025031010544639800_ref3","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1016\/j.neunet.2018.12.010","article-title":"An extensive experimental survey of regression methods","volume":"111","author":"Fern\u00e1ndez-Delgado","year":"2019","journal-title":"Neural Netw"},{"key":"2025031010544639800_ref4","doi-asserted-by":"publisher","first-page":"887","DOI":"10.1002\/sim.6351","article-title":"Random forest classification of etiologies for an orphan disease","volume":"34","author":"Speiser","year":"2015","journal-title":"Stat Med"},{"key":"2025031010544639800_ref5","doi-asserted-by":"publisher","first-page":"e000262","DOI":"10.1136\/fmch-2019-000262","article-title":"Variable selection strategies and its importance in clinical prediction modelling","volume":"8","author":"Chowdhury","year":"2020","journal-title":"Fam Med Community Health"},{"key":"2025031010544639800_ref6","doi-asserted-by":"publisher","first-page":"431","DOI":"10.1002\/bimj.201700067","article-title":"Variable selection\u2013a review and recommendations for the practicing statistician","volume":"60","author":"Heinze","year":"2018","journal-title":"Biom J"},{"key":"2025031010544639800_ref7","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1016\/j.ijmedinf.2018.05.006","article-title":"Comparison of variable selection 
methods for clinical predictive modeling","volume":"116","author":"Sanchez-Pinto","year":"2018","journal-title":"Int J Med Inform"},{"key":"2025031010544639800_ref8","doi-asserted-by":"publisher","first-page":"492","DOI":"10.1093\/bib\/bbx124","article-title":"Evaluation of variable selection methods for random forests and omics data sets","volume":"20","author":"Degenhardt","year":"2017","journal-title":"Brief Bioinform"},{"key":"2025031010544639800_ref9","doi-asserted-by":"publisher","first-page":"6241","DOI":"10.1016\/j.eswa.2013.05.051","article-title":"Feature subset selection filter\u2013wrapper based on low quality data","volume":"40","author":"Cadenas","year":"2013","journal-title":"Expert Syst Appl"},{"key":"2025031010544639800_ref10","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1016\/j.csda.2012.09.020","article-title":"A new variable selection approach using random forests","volume":"60","author":"Hapfelmeier","year":"2013","journal-title":"Comput Stat Data Anal"},{"key":"2025031010544639800_ref11","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1016\/j.eswa.2019.05.028","article-title":"A comparison of random forest variable selection methods for classification prediction modeling","volume":"134","author":"Speiser","year":"2019","journal-title":"Expert Syst Appl"},{"key":"2025031010544639800_ref12","doi-asserted-by":"publisher","DOI":"10.1016\/j.csda.2022.107689","article-title":"Efficient permutation testing of variable importance measures by the example of random forests","volume":"181","author":"Hapfelmeier","year":"2023","journal-title":"Comput Stat Data Anal"},{"key":"2025031010544639800_ref13","doi-asserted-by":"publisher","first-page":"385","DOI":"10.1214\/aoms\/1177705900","article-title":"Simplified estimation from censored normal samples","volume":"31","author":"Dixon","year":"1960","journal-title":"Ann Math 
Stat"},{"key":"2025031010544639800_ref14","doi-asserted-by":"publisher","first-page":"2200212","DOI":"10.1002\/bimj.202200212","article-title":"On the role of benchmarking data sets and simulations in method comparison studies","volume":"66","author":"Friedrich","year":"2024","journal-title":"Biom J"},{"key":"2025031010544639800_ref15","first-page":"1","article-title":"Caret package","volume":"28","author":"Kuhn","year":"2008","journal-title":"J Stat Softw"},{"key":"2025031010544639800_ref16","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s11634-016-0270-x","article-title":"A computationally fast variable importance test for random forests for high-dimensional data","volume":"12","author":"Janitza","year":"2015","journal-title":"Adv Data Anal Classif"},{"key":"2025031010544639800_ref17","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18637\/jss.v036.i11","article-title":"Feature selection with the Boruta package","volume":"36","author":"Kursa","year":"2010","journal-title":"J Stat Softw"},{"key":"2025031010544639800_ref18","doi-asserted-by":"publisher","first-page":"3483","DOI":"10.1016\/j.patcog.2013.05.018","article-title":"Gene selection with guided regularized random forest","volume":"46","author":"Deng","year":"2013","journal-title":"Pattern Recognit"},{"key":"2025031010544639800_ref19","volume-title":"Random Forests for Survival, Regression and Classification (RF-SRC), R Package Version 1.6","author":"Ishwaran","year":"2014"},{"key":"2025031010544639800_ref20","doi-asserted-by":"publisher","first-page":"19","DOI":"10.32614\/RJ-2015-018","article-title":"VSURF: an R package for variable selection using random forests","volume":"7","author":"Genuer","year":"2015","journal-title":"R J"},{"key":"2025031010544639800_ref21","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-25966-4_33","volume-title":"International Workshop on Multiple Classifier 
Systems","author":"Svetnik","year":"2004"},{"key":"2025031010544639800_ref22","doi-asserted-by":"publisher","first-page":"81","DOI":"10.1186\/1471-2105-5-81","article-title":"Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes","volume":"5","author":"Jiang","year":"2004","journal-title":"BMC Bioinformatics"},{"key":"2025031010544639800_ref23","volume-title":"Party: A Laboratory for Recursive Partytioning","author":"Hothorn","year":"2010"},{"key":"2025031010544639800_ref24","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23783-6_29","volume-title":"Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5\u20139, 2011, Proceedings, Part II 22","author":"Menze","year":"2011"},{"key":"2025031010544639800_ref25","doi-asserted-by":"publisher","first-page":"192","DOI":"10.1080\/10618600.2023.2231048","article-title":"Accelerated and interpretable oblique random survival forests","volume":"33","author":"Jaeger","year":"2024","journal-title":"J Comput Graph Stat"},{"key":"2025031010544639800_ref26","doi-asserted-by":"publisher","first-page":"4705","DOI":"10.21105\/joss.04705","article-title":"Aorsf: an R package for supervised learning using the oblique random survival forest","volume":"7","author":"Jaeger","year":"2022","journal-title":"J Open Source Softw"},{"key":"2025031010544639800_ref27","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1186\/1471-2105-7-3","article-title":"Gene selection and classification of microarray data using random forest","volume":"7","author":"D\u00edaz-Uriarte","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2025031010544639800_ref28","first-page":"1","article-title":"AUCRF: variable selection with random forest and the area under the 
curve","author":"Urrea","year":"2012"},{"key":"2025031010544639800_ref29","doi-asserted-by":"publisher","first-page":"977","DOI":"10.1007\/s00180-017-0742-2","article-title":"OpenML: an R package to connect to the machine learning platform OpenML","volume":"34","author":"Casalicchio","year":"2017","journal-title":"Comput Stat"},{"key":"2025031010544639800_ref30","volume-title":"Modeldata: Data Sets Useful for Modeling Examples","author":"Kuhn","year":"2022"},{"key":"2025031010544639800_ref31","doi-asserted-by":"publisher","first-page":"151","DOI":"10.1016\/j.eswa.2016.12.008","article-title":"Automatic selection of molecular descriptors using random forest: application to drug discovery","volume":"72","author":"Cano","year":"2017","journal-title":"Expert Syst Appl"},{"key":"2025031010544639800_ref32","doi-asserted-by":"publisher","first-page":"1340","DOI":"10.1093\/bioinformatics\/btq134","article-title":"Permutation importance: a corrected feature importance measure","volume":"26","author":"Altmann","year":"2010","journal-title":"Bioinformatics"}],"container-title":["Briefings in 
Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/26\/2\/bbaf096\/62364727\/bbaf096.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bib\/article-pdf\/26\/2\/bbaf096\/62364727\/bbaf096.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,10]],"date-time":"2025-03-10T10:54:59Z","timestamp":1741604099000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/doi\/10.1093\/bib\/bbaf096\/8068235"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3]]},"references-count":32,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,3,4]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbaf096","relation":{},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025,3]]},"published":{"date-parts":[[2025,3]]},"article-number":"bbaf096"}}