{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T04:07:34Z","timestamp":1775275654628,"version":"3.50.1"},"reference-count":32,"publisher":"Oxford University Press (OUP)","issue":"13","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: Univariate statistical tests are widely used for biomarker discovery in bioinformatics. These procedures are simple, fast and their output is easily interpretable by biologists but they can only identify variables that provide a significant amount of information in isolation from the other variables. As biological processes are expected to involve complex interactions between variables, univariate methods thus potentially miss some informative biomarkers. Variable relevance scores provided by machine learning techniques, however, are potentially able to highlight multivariate interacting effects, but unlike the p-values returned by univariate tests, these relevance scores are usually not statistically interpretable. This lack of interpretability hampers the determination of a relevance threshold for extracting a feature subset from the rankings and also prevents the wide adoption of these methods by practicians.<\/jats:p><jats:p>Results: We evaluated several, existing and novel, procedures that extract relevant features from rankings derived from machine learning approaches. These procedures replace the relevance scores with measures that can be interpreted in a statistical way, such as p-values, false discovery rates, or family wise error rates, for which it is easier to determine a significance level. Experiments were performed on several artificial problems as well as on real microarray datasets. 
Although the methods differ in terms of computing times and the tradeoff they achieve in terms of false positives and false negatives, some of them greatly help in the extraction of truly relevant biomarkers and should thus be of great practical interest for biologists and physicians. As a side conclusion, our experiments also clearly highlight that using model performance as a criterion for feature selection is often counter-productive.<\/jats:p><jats:p>Availability and implementation: Python source codes of all tested methods, as well as the MATLAB scripts used for data simulation, can be found in the Supplementary Material.<\/jats:p><jats:p>Contact: \u00a0vahuynh@ulg.ac.be, or p.geurts@ulg.ac.be<\/jats:p><jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/bts238","type":"journal-article","created":{"date-parts":[[2012,4,27]],"date-time":"2012-04-27T01:38:05Z","timestamp":1335490685000},"page":"1766-1774","source":"Crossref","is-referenced-by-count":108,"title":["Statistical interpretation of machine learning-based feature importance scores for biomarker discovery"],"prefix":"10.1093","volume":"28","author":[{"given":"V\u00e2n Anh","family":"Huynh-Thu","sequence":"first","affiliation":[{"name":"1 Department of Electrical Engineering and Computer Science, Systems and Modeling and 2GIGA-Research, Bioinformatics and Modeling, University of Li\u00e8ge, 4000 Li\u00e8ge, Belgium and 3Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium"},{"name":"1 Department of Electrical Engineering and Computer Science, Systems and Modeling and 2GIGA-Research, Bioinformatics and Modeling, University of Li\u00e8ge, 4000 Li\u00e8ge, Belgium and 3Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium"}]},{"given":"Yvan","family":"Saeys","sequence":"additional","affiliation":[{"name":"1 Department of Electrical Engineering 
and Computer Science, Systems and Modeling and 2GIGA-Research, Bioinformatics and Modeling, University of Li\u00e8ge, 4000 Li\u00e8ge, Belgium and 3Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium"}]},{"given":"Louis","family":"Wehenkel","sequence":"additional","affiliation":[{"name":"1 Department of Electrical Engineering and Computer Science, Systems and Modeling and 2GIGA-Research, Bioinformatics and Modeling, University of Li\u00e8ge, 4000 Li\u00e8ge, Belgium and 3Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium"},{"name":"1 Department of Electrical Engineering and Computer Science, Systems and Modeling and 2GIGA-Research, Bioinformatics and Modeling, University of Li\u00e8ge, 4000 Li\u00e8ge, Belgium and 3Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium"}]},{"given":"Pierre","family":"Geurts","sequence":"additional","affiliation":[{"name":"1 Department of Electrical Engineering and Computer Science, Systems and Modeling and 2GIGA-Research, Bioinformatics and Modeling, University of Li\u00e8ge, 4000 Li\u00e8ge, Belgium and 3Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium"},{"name":"1 Department of Electrical Engineering and Computer Science, Systems and Modeling and 2GIGA-Research, Bioinformatics and Modeling, University of Li\u00e8ge, 4000 Li\u00e8ge, Belgium and 3Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Gent, Belgium"}]}],"member":"286","published-online":{"date-parts":[[2012,4,25]]},"reference":[{"key":"2023012512423510500_B1","doi-asserted-by":"crossref","first-page":"392","DOI":"10.1093\/bioinformatics\/btp630","article-title":"Robust biomarker identification for cancer diagnosis with ensemble feature selection 
methods","volume":"26","author":"Abeel","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012512423510500_B2","doi-asserted-by":"crossref","first-page":"503","DOI":"10.1038\/35000501","article-title":"Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling","volume":"403","author":"Alizadeh","year":"2000","journal-title":"Nature"},{"key":"2023012512423510500_B3","doi-asserted-by":"crossref","first-page":"1340","DOI":"10.1093\/bioinformatics\/btq134","article-title":"Permutation importance: a corrected feature importance measure","volume":"26","author":"Altmann","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012512423510500_B4","doi-asserted-by":"crossref","first-page":"6562","DOI":"10.1073\/pnas.102102699","article-title":"Selection bias in gene extraction on the basis of microarray gene-expression data","volume":"99","author":"Ambroise","year":"2002","journal-title":"Proc. Natl. Acad. Sci."},{"key":"2023012512423510500_B5","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","article-title":"Controlling the false discovery rate: A practical and powerful approach to multiple testing","volume":"57","author":"Benjamini","year":"1995","journal-title":"J. Roy. Stat. Soc., Ser. B (Methodol.)"},{"key":"2023012512423510500_B6","doi-asserted-by":"crossref","first-page":"144","DOI":"10.1145\/130385.130401","article-title":"A training algorithm for optimal margin classifiers","volume-title":"Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory","author":"Boser","year":"1992"},{"key":"2023012512423510500_B7","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. 
Learn."},{"key":"2023012512423510500_B8","doi-asserted-by":"crossref","first-page":"27:1","DOI":"10.1145\/1961189.1961199","article-title":"LIBSVM: a library for support vector machines","volume":"2","author":"Chang","year":"2011","journal-title":"ACM Trans. Intell. Syst. Technol."},{"key":"2023012512423510500_B9","doi-asserted-by":"crossref","first-page":"822","DOI":"10.1038\/35090585","article-title":"Delineation of prognostic biomarkers in prostate cancer","volume":"412","author":"Dhanasekaran","year":"2001","journal-title":"Nature"},{"key":"2023012512423510500_B10","doi-asserted-by":"crossref","first-page":"285","DOI":"10.1109\/TIT.1975.1055373","article-title":"k-Nearest-neighbor Bayes-risk estimation","volume":"21","author":"Fukunaga","year":"1975","journal-title":"IEEE Trans. Inform. Theory"},{"key":"2023012512423510500_B11","doi-asserted-by":"crossref","DOI":"10.1007\/BF02595811","volume-title":"Resampling-based multiple testing for microarray data analysis.","author":"Ge","year":"2003"},{"key":"2023012512423510500_B12","first-page":"881","article-title":"Some step-down procedures controlling the false discovery rate under dependence","volume":"18","author":"Ge","year":"2008","journal-title":"Stat. Sin."},{"key":"2023012512423510500_B13","doi-asserted-by":"crossref","first-page":"3138","DOI":"10.1093\/bioinformatics\/bti494","article-title":"Proteomic mass spectra classification using decision tree based ensemble methods","volume":"21","author":"Geurts","year":"2005","journal-title":"Bioinformatics"},{"key":"2023012512423510500_B14","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1007\/s10994-006-6226-1","article-title":"Extremely randomized trees","volume":"36","author":"Geurts","year":"2006","journal-title":"Mach. 
Learn."},{"key":"2023012512423510500_B15","doi-asserted-by":"crossref","first-page":"531","DOI":"10.1126\/science.286.5439.531","article-title":"Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring","volume":"286","author":"Golub","year":"1999","journal-title":"Science"},{"key":"2023012512423510500_B16","doi-asserted-by":"crossref","first-page":"389","DOI":"10.1023\/A:1012487302797","article-title":"Gene selection for cancer classification using support vector machines","volume":"46","author":"Guyon","year":"2002","journal-title":"Mach. Learn."},{"key":"2023012512423510500_B17","volume-title":"The Elements of Statistical Learning.","author":"Hastie","year":"2003"},{"key":"2023012512423510500_B18","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1016\/j.compbiolchem.2010.07.002","article-title":"Stable feature selection for biomarker discovery","volume":"34","author":"He","year":"2010","journal-title":"Computational Biology and Chemistry"},{"key":"2023012512423510500_B19","first-page":"60","article-title":"Exploiting tree-based variable importances to selectively identify relevant variables","volume-title":"JMLR: Workshop and Conference proceedings","author":"Huynh-Thu","year":"2008"},{"key":"2023012512423510500_B20","volume-title":"Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.","author":"Pearl","year":"1988"},{"key":"2023012512423510500_B21","first-page":"1357","article-title":"Variable selection using svm based criteria","volume":"3","author":"Rakotomamonjy","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"2023012512423510500_B22","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1023\/A:1025667309714","article-title":"Theoretical and empirical analysis of ReliefF and RReliefF","volume":"53","author":"Robnik-Sikonja","year":"2003","journal-title":"Mach. Learn. 
J."},{"key":"2023012512423510500_B23","doi-asserted-by":"crossref","first-page":"2507","DOI":"10.1093\/bioinformatics\/btm344","article-title":"A review of feature selection techniques in bioinformatics","volume":"23","author":"Saeys","year":"2007","journal-title":"Bioinformatics"},{"key":"2023012512423510500_B24","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1038\/nm0102-68","article-title":"Diffuse large b-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning","volume":"8","author":"Shipp","year":"2002","journal-title":"Nat. Med."},{"key":"2023012512423510500_B25","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1016\/S1535-6108(02)00030-2","article-title":"Gene expression correlates of clinical prostate cancer behavior","volume":"1","author":"Singh","year":"2002","journal-title":"Cancer Cell"},{"key":"2023012512423510500_B26","doi-asserted-by":"crossref","first-page":"440","DOI":"10.1093\/bioinformatics\/btp621","article-title":"Pitfalls of supervised feature selection","volume":"26","author":"Smialowski","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012512423510500_B27","first-page":"1399","article-title":"Ranking a random feature for variable and feature selection","volume":"3","author":"Stoppiglia","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"2023012512423510500_B28","doi-asserted-by":"crossref","first-page":"9440","DOI":"10.1073\/pnas.1530509100","article-title":"Statistical significance for genomewide studies","volume":"100","author":"Storey","year":"2003","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023012512423510500_B29","first-page":"1341","article-title":"Feature selection with ensembles, artificial variables, and redundancy elimination","volume":"10","author":"Tuv","year":"2009","journal-title":"J. Mach. Learn. 
Res."},{"key":"2023012512423510500_B30","doi-asserted-by":"crossref","first-page":"671","DOI":"10.1016\/S0140-6736(05)17947-1","article-title":"Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer","volume":"365","author":"Wang","year":"2005","journal-title":"Lancet"},{"key":"2023012512423510500_B31","volume-title":"Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment.","author":"Westfall","year":"1993"},{"key":"2023012512423510500_B32","first-page":"1","article-title":"Significance of gene ranking for classification of microarray samples","volume":"3","author":"Zhang","year":"2006","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/13\/1766\/48867719\/bioinformatics_28_13_1766.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/13\/1766\/48867719\/bioinformatics_28_13_1766.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,4,23]],"date-time":"2024-04-23T23:25:24Z","timestamp":1713914724000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/28\/13\/1766\/234473"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,4,25]]},"references-count":32,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2012,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bts238","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2012,7,1]]},"published":{"date-parts":[[2012,4,25]]}}}