{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,26]],"date-time":"2026-03-26T14:13:48Z","timestamp":1774534428298,"version":"3.50.1"},"reference-count":39,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2023,11,3]],"date-time":"2023-11-03T00:00:00Z","timestamp":1698969600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,11,3]],"date-time":"2023-11-03T00:00:00Z","timestamp":1698969600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100014597","name":"Universidade da Coru\u00f1a","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100014597","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Intell Inf Syst"],"published-print":{"date-parts":[[2024,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The growth of Big Data has resulted in an overwhelming increase in the volume of data available, including the number of features. Feature selection, the process of selecting relevant features and discarding irrelevant ones, has been successfully used to reduce the dimensionality of datasets. However, with numerous feature selection approaches in the literature, determining the best strategy for a specific problem is not straightforward. In this study, we compare the performance of various feature selection approaches to a random selection to identify the most effective strategy for a given type of problem. We use a large number of datasets to cover a broad range of real-world challenges. We evaluate the performance of seven popular feature selection approaches and five classifiers. 
Our findings show that feature selection is a valuable tool in machine learning and that correlation-based feature selection is the most effective strategy regardless of the scenario. Additionally, we found that using improper thresholds with ranker approaches produces results as poor as randomly selecting a subset of features.<\/jats:p>","DOI":"10.1007\/s10844-023-00823-y","type":"journal-article","created":{"date-parts":[[2023,11,3]],"date-time":"2023-11-03T09:02:59Z","timestamp":1699002179000},"page":"459-483","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Finding a needle in a haystack: insights on feature selection for classification tasks"],"prefix":"10.1007","volume":"62","author":[{"given":"Laura","family":"Mor\u00e1n-Fern\u00e1ndez","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ver\u00f3nica","family":"Bol\u00f3n-Canedo","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2023,11,3]]},"reference":[{"key":"823_CR1","unstructured":"Bache, K., & Lichman, M. (2013). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. [Online; accessed December 2022]. http:\/\/archive.ics.uci.edu\/ml\/"},{"issue":"1","key":"823_CR2","first-page":"2653","volume":"18","author":"A Benavoli","year":"2017","unstructured":"Benavoli, A., Corani, G., Dem\u0161ar, J., et al. (2017). Time for a change: a tutorial for comparing multiple classifiers through bayesian analysis. 
The Journal of Machine Learning Research, 18(1), 2653\u20132688.","journal-title":"The Journal of Machine Learning Research"},{"issue":"5","key":"823_CR3","doi-asserted-by":"publisher","first-page":"5947","DOI":"10.1016\/j.eswa.2010.11.028","volume":"38","author":"V Bol\u00f3n-Canedo","year":"2011","unstructured":"Bol\u00f3n-Canedo, V., S\u00e1nchez-Maro\u00f1o, N., & Alonso-Betanzos, A. (2011). Feature selection and classification in multiple class datasets: An application to kdd cup 99 dataset. Expert Systems with Applications, 38(5), 5947\u20135957. https:\/\/doi.org\/10.1016\/j.eswa.2010.11.028","journal-title":"Expert Systems with Applications"},{"issue":"3","key":"823_CR4","doi-asserted-by":"publisher","first-page":"483","DOI":"10.1007\/s10115-012-0487-8","volume":"34","author":"V Bol\u00f3n-Canedo","year":"2013","unstructured":"Bol\u00f3n-Canedo, V., S\u00e1nchez-Maro\u00f1o, N., & Alonso-Betanzos, A. (2013). A review of feature selection methods on synthetic data. Knowledge and Information Systems, 34(3), 483\u2013519. https:\/\/doi.org\/10.1007\/s10115-012-0487-8","journal-title":"Knowledge and Information Systems"},{"key":"823_CR5","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1016\/j.ins.2014.05.042","volume":"282","author":"V Bol\u00f3n-Canedo","year":"2014","unstructured":"Bol\u00f3n-Canedo, V., S\u00e1nchez-Maro\u00f1o, N., Alonso-Betanzos, A., et al. (2014). A review of microarray datasets and applied feature selection methods. Information Sciences, 282, 111\u2013135. https:\/\/doi.org\/10.1016\/j.ins.2014.05.042","journal-title":"Information Sciences"},{"key":"823_CR6","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1016\/j.knosys.2015.05.014","volume":"86","author":"V Bol\u00f3n-Canedo","year":"2015","unstructured":"Bol\u00f3n-Canedo, V., S\u00e1nchez-Maro\u00f1o, N., & Alonso-Betanzos, A. (2015). Recent advances and emerging challenges of feature selection in the context of big data. Knowledge-Based Systems, 86, 33\u201345. 
https:\/\/doi.org\/10.1016\/j.knosys.2015.05.014","journal-title":"Knowledge-Based Systems"},{"issue":"9","key":"823_CR7","doi-asserted-by":"publisher","first-page":"843","DOI":"10.1080\/088395101753210773","volume":"15","author":"A Chouchoulas","year":"2001","unstructured":"Chouchoulas, A., & Shen, Q. (2001). Rough set-aided keyword reduction for text categorization. Applied Artificial Intelligence, 15(9), 843\u2013873. https:\/\/doi.org\/10.1080\/088395101753210773","journal-title":"Applied Artificial Intelligence"},{"issue":"14","key":"823_CR8","doi-asserted-by":"publisher","first-page":"i427","DOI":"10.1093\/bioinformatics\/btz333","volume":"35","author":"H Climente-Gonz\u00e1lez","year":"2019","unstructured":"Climente-Gonz\u00e1lez, H., Azencott, C. A., Kaski, S., et al. (2019). Block hsic lasso: model-free biomarker detection for ultra-high dimensional data. Bioinformatics, 35(14), i427\u2013i435. https:\/\/doi.org\/10.1093\/bioinformatics\/btz333","journal-title":"Bioinformatics"},{"key":"823_CR9","unstructured":"Dem\u0161ar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan), 1\u201330"},{"issue":"2000","key":"823_CR10","first-page":"32","volume":"1","author":"DL Donoho","year":"2000","unstructured":"Donoho, D. L., et al. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, 1(2000), 32.","journal-title":"AMS Math Challenges Lecture"},{"issue":"1","key":"823_CR11","first-page":"3133","volume":"15","author":"M Fern\u00e1ndez-Delgado","year":"2014","unstructured":"Fern\u00e1ndez-Delgado, M., Cernadas, E., Barro, S., et al. (2014). Do we need hundreds of classifiers to solve real world classification problems? 
The Journal of Machine Learning Research, 15(1), 3133\u20133181.","journal-title":"The Journal of Machine Learning Research"},{"key":"823_CR12","doi-asserted-by":"publisher","unstructured":"Furxhi, I., Murphy, F., Mullins, M., et al. (2020). Nanotoxicology data for in silico tools: a literature review. Nanotoxicology, 1\u201326. https:\/\/doi.org\/10.1080\/17435390.2020.1729439","DOI":"10.1080\/17435390.2020.1729439"},{"key":"823_CR13","doi-asserted-by":"publisher","unstructured":"Grgic-Hlaca, N., Zafar, M. B., Gummadi, K. P., et al. (2018). Beyond distributive fairness in algorithmic decision making: Feature selection for procedurally fair learning. In: AAAI, (pp. 51\u201360). https:\/\/doi.org\/10.1609\/aaai.v32i1.11296","DOI":"10.1609\/aaai.v32i1.11296"},{"key":"823_CR14","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-35488-8","author":"I Guyon","year":"2008","unstructured":"Guyon, I., Gunn, S., Nikravesh, M., et al. (2008). Feature extraction: foundations and applications, vol 207. Springer, New York. https:\/\/doi.org\/10.1007\/978-3-540-35488-8","journal-title":"Springer, New York."},{"key":"823_CR15","unstructured":"Hall, M. A. (1999). Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato"},{"key":"823_CR16","unstructured":"Hall, M. A., & Smith, L. A. (1998). Practical feature subset selection for machine learning. C McDonald (Ed), Computer Science\u201998 Proceedings of the 21st Australasian Computer Science Conference ACSC\u201998"},{"issue":"1","key":"823_CR17","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1109\/TIT.1968.1054102","volume":"14","author":"G Hughes","year":"1968","unstructured":"Hughes, G. (1968). On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14(1), 55\u201363. 
https:\/\/doi.org\/10.1109\/TIT.1968.1054102","journal-title":"IEEE Transactions on Information Theory"},{"key":"823_CR18","doi-asserted-by":"publisher","DOI":"10.1016\/j.cose.2020.101752","volume":"92","author":"SM Kasongo","year":"2020","unstructured":"Kasongo, S. M., & Sun, Y. (2020). A deep learning method with wrapper based feature extraction for wireless intrusion detection system. Computers & Security, 92, 101752. https:\/\/doi.org\/10.1016\/j.cose.2020.101752","journal-title":"Computers & Security"},{"key":"823_CR19","doi-asserted-by":"publisher","unstructured":"Kononenko, I. (1994). Estimating attributes: analysis and extensions of relief. In: European conference on machine learning, Springer, 171\u2013182. https:\/\/doi.org\/10.1007\/3-540-57868-4_57","DOI":"10.1007\/3-540-57868-4_57"},{"issue":"3","key":"823_CR20","doi-asserted-by":"publisher","first-page":"779","DOI":"10.1007\/s10844-022-00725-5","volume":"59","author":"M Kopczynski","year":"2022","unstructured":"Kopczynski, M., & Grzes, T. (2022). Fpga supported rough set reduct calculation for big datasets. Journal of Intelligent Information Systems, 59(3), 779\u2013799. https:\/\/doi.org\/10.1007\/s10844-022-00725-5","journal-title":"Journal of Intelligent Information Systems"},{"key":"823_CR21","unstructured":"Kuncheva, L. I. (2020). Bayesian-analysis-for-comparing-classifiers. https:\/\/github.com\/LucyKuncheva\/Bayesian-Analysis-for-Comparing-Classifiers"},{"key":"823_CR22","unstructured":"LeCun, Y., Cortes, C., Burges, C. (1998). Mnist database of handwritten digits. [Online; accessed December 2022]. http:\/\/yann.lecun.com\/exdb\/mnist\/"},{"key":"823_CR23","doi-asserted-by":"publisher","unstructured":"Lewis, D. D. (1992). Feature selection and feature extraction for text categorization. In: Proceedings of the workshop on Speech and Natural Language, Association for Computational Linguistics, 212\u2013217. 
https:\/\/doi.org\/10.3115\/1075527.1075574","DOI":"10.3115\/1075527.1075574"},{"key":"823_CR24","doi-asserted-by":"publisher","DOI":"10.1201\/9781420035933","volume-title":"Subset selection in regression","author":"A Miller","year":"2002","unstructured":"Miller, A. (2002). Subset selection in regression. New York: CRC Press."},{"key":"823_CR25","doi-asserted-by":"publisher","unstructured":"Mor\u00e1n-Fern\u00e1ndez, L., Bol\u00f3n-Canedo, V. (2021). Dimensionality reduction: Is feature selection more effective than random selection? In: International Work-Conference on Artificial Neural Networks, Springer, 113\u2013125. https:\/\/doi.org\/10.1007\/978-3-030-85030-2_10","DOI":"10.1007\/978-3-030-85030-2_10"},{"issue":"3","key":"823_CR26","doi-asserted-by":"publisher","first-page":"1067","DOI":"10.1007\/s10115-016-1003-3","volume":"51","author":"L Mor\u00e1n-Fern\u00e1ndez","year":"2017","unstructured":"Mor\u00e1n-Fern\u00e1ndez, L., Bol\u00f3n-Canedo, V., & Alonso-Betanzos, A. (2017). Can classification performance be predicted by complexity measures? a study using microarray data. Knowledge and Information Systems, 51(3), 1067\u20131090. https:\/\/doi.org\/10.1007\/s10115-016-1003-3","journal-title":"Knowledge and Information Systems"},{"key":"823_CR27","unstructured":"Mor\u00e1n-Fern\u00e1ndez, L., Bol\u00f3n-Canedo, V., & Alonso-Betanzos, A. (2020). Do we need hundreds of classifiers or a good feature selection? In: European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 399\u2013404"},{"key":"823_CR28","unstructured":"Navarro, F. F. G. (2011). Feature selection in cancer research: microarray gene expression and in vivo 1h-mrs domains. PhD thesis, Universitat Polit\u00e8cnica de Catalunya (UPC)"},{"key":"823_CR29","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-011-3534-4","author":"Z Pawlak","year":"1991","unstructured":"Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data, vol 9. 
Springer Science & Business Media. https:\/\/doi.org\/10.1007\/978-94-011-3534-4","journal-title":"Springer Science & Business Media"},{"issue":"8","key":"823_CR30","doi-asserted-by":"publisher","first-page":"1226","DOI":"10.1109\/TPAMI.2005.159","volume":"27","author":"H Peng","year":"2005","unstructured":"Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226\u20131238. https:\/\/doi.org\/10.1109\/TPAMI.2005.159","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"823_CR31","doi-asserted-by":"publisher","DOI":"10.1016\/j.compbiomed.2019.103375","volume":"112","author":"B Remeseiro","year":"2019","unstructured":"Remeseiro, B., & Bol\u00f3n-Canedo, V. (2019). A review of feature selection methods in medical applications. Computers in Biology and Medicine, 112, 103375. https:\/\/doi.org\/10.1016\/j.compbiomed.2019.103375","journal-title":"Computers in Biology and Medicine"},{"key":"823_CR32","doi-asserted-by":"publisher","unstructured":"Salau, A. O., & Jain, S. (2019). Feature extraction: a survey of the types, techniques, applications. In: 2019 International Conference on Signal Processing and Communication (ICSC), IEEE, 158\u2013164. https:\/\/doi.org\/10.1109\/ICSC45622.2019.8938371","DOI":"10.1109\/ICSC45622.2019.8938371"},{"key":"#cr-split#-823_CR33.1","unstructured":"Scully, P. M. D., & Jensen, R. K. (2011). Investigating rough set feature selection for gene expression analysis (BSc Computer Science dissertation). [Online"},{"key":"#cr-split#-823_CR33.2","unstructured":"accessed July 2023]. https:\/\/petescully.co.uk\/2015\/08\/28\/weka-package-rsarsubseteval\/"},{"key":"823_CR34","doi-asserted-by":"publisher","unstructured":"Shahrjooihaghighi, A., & Frigui, H. (2021). Local feature selection for multiple instance learning. 
Journal of Intelligent Information Systems, 1\u201325. https:\/\/doi.org\/10.1007\/s10844-021-00680-7","DOI":"10.1007\/s10844-021-00680-7"},{"issue":"3","key":"823_CR35","doi-asserted-by":"publisher","first-page":"263","DOI":"10.1016\/S0952-1976(00)00010-5","volume":"13","author":"Q Shen","year":"2000","unstructured":"Shen, Q., & Chouchoulas, A. (2000). A modular approach to generating fuzzy rules with reduced attributes for the monitoring of complex systems. Engineering Applications of Artificial Intelligence, 13(3), 263\u2013278. https:\/\/doi.org\/10.1016\/S0952-1976(00)00010-5","journal-title":"Engineering Applications of Artificial Intelligence"},{"issue":"7","key":"823_CR36","doi-asserted-by":"publisher","first-page":"1341","DOI":"10.1162\/neco.1996.8.7.1341","volume":"8","author":"DH Wolpert","year":"1996","unstructured":"Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341\u20131390. https:\/\/doi.org\/10.1162\/neco.1996.8.7.1341","journal-title":"Neural Computation"},{"key":"823_CR37","unstructured":"Yang, H. H., & Moody, J. (2000). Data visualization and feature selection: New algorithms for nongaussian data. In: Advances in Neural Information Processing Systems, pp 687\u2013693"},{"issue":"2","key":"823_CR38","doi-asserted-by":"publisher","first-page":"207","DOI":"10.3233\/IDA-2009-0364","volume":"13","author":"Z Zhao","year":"2009","unstructured":"Zhao, Z., & Liu, H. (2009). Searching for interacting features in subset selection. Intelligent Data Analysis, 13(2), 207\u2013228. 
https:\/\/doi.org\/10.3233\/IDA-2009-0364","journal-title":"Intelligent Data Analysis"}],"container-title":["Journal of Intelligent Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10844-023-00823-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10844-023-00823-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10844-023-00823-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,2]],"date-time":"2024-05-02T14:18:59Z","timestamp":1714659539000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10844-023-00823-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,3]]},"references-count":39,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,4]]}},"alternative-id":["823"],"URL":"https:\/\/doi.org\/10.1007\/s10844-023-00823-y","relation":{},"ISSN":["0925-9902","1573-7675"],"issn-type":[{"value":"0925-9902","type":"print"},{"value":"1573-7675","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,11,3]]},"assertion":[{"value":"8 May 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 October 2023","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 October 2023","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 November 2023","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}},{"value":"The authors have no competing interests to declare that are relevant to the content of this article.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflicts of interest"}}]}}