{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T00:31:11Z","timestamp":1778113871396,"version":"3.51.4"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"S3","license":[{"start":{"date-parts":[[2022,3,1]],"date-time":"2022-03-01T00:00:00Z","timestamp":1646092800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,4,28]],"date-time":"2022-04-28T00:00:00Z","timestamp":1651104000000},"content-version":"vor","delay-in-days":58,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100004917","name":"Cancer Prevention and Research Institute of Texas","doi-asserted-by":"publisher","award":["CPRIT RP170668, RP180734 and RP210045"],"award-info":[{"award-number":["CPRIT RP170668, RP180734 and RP210045"]}],"id":[{"id":"10.13039\/100004917","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000009","name":"Foundation for the National Institutes of Health","doi-asserted-by":"publisher","award":["R01LM012806"],"award-info":[{"award-number":["R01LM012806"]}],"id":[{"id":"10.13039\/100000009","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100004917","name":"Cancer Prevention and Research Institute of Texas","doi-asserted-by":"publisher","award":["CPRIT 180734"],"award-info":[{"award-number":["CPRIT 180734"]}],"id":[{"id":"10.13039\/100004917","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2022,3]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>As many complex omics data have been generated during the last two decades, dimensionality reduction problem has been a challenging issue in better mining such data. 
Omics data typically consist of many features. Accordingly, many feature selection algorithms have been developed. The performance of those feature selection methods often varies with the dataset, making the discovery and interpretation of results challenging.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods and results<\/jats:title><jats:p>In this study, we performed a comprehensive comparative study of five widely used supervised feature selection methods (mRMR, INMIFS, DFS, SVM-RFE-CBR and VWMRmR) for multi-omics datasets. Specifically, we used five representative datasets: gene expression (Exp), exon expression (ExpExon), DNA methylation (hMethyl27), copy number variation (Gistic2), and pathway activity (Paradigm IPLs) from a multi-omics study of acute myeloid leukemia (LAML) from The Cancer Genome Atlas (TCGA). The feature subsets selected by these five feature selection algorithms were assessed using three evaluation criteria: (1) classification accuracy (Acc), (2) representation entropy (RE) and (3) redundancy rate (RR). Four different classifiers, viz., C4.5, NaiveBayes, KNN, and AdaBoost, were used to measure the classification accuracy (Acc) for each selected feature subset. The VWMRmR algorithm obtains the best Acc for three datasets (ExpExon, hMethyl27 and Paradigm IPLs). The VWMRmR algorithm offers the best RR (obtained using normalized mutual information) for three datasets (Exp, Gistic2 and Paradigm IPLs), while it gives the best RR (obtained using Pearson correlation coefficient) for two datasets (Gistic2 and Paradigm IPLs). It also obtains the best RE for three datasets (Exp, Gistic2 and Paradigm IPLs). Overall, the VWMRmR algorithm yields the best performance on all three evaluation criteria for the majority of the datasets. In addition, we identified signature genes using supervised learning collected from the overlapping top features among the five feature selection methods. 
We obtained a 7-gene signature (<jats:italic>ZMIZ1, ENG, FGFR1, PAWR, KRT17, MPO<\/jats:italic> and <jats:italic>LAT2<\/jats:italic>) for Exp, a 9-gene signature for ExpExon, a 7-gene signature for hMethyl27, one single-gene signature (<jats:italic>PIK3CG<\/jats:italic>) for Gistic2 and a 3-gene signature for Paradigm IPLs.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>We performed a comprehensive comparison of the performance evaluation of five well-known feature selection methods for mining features from various high-dimensional datasets. We identified signature genes using supervised learning for the specific omic data for the disease. The study will help incorporate higher order dependencies among features.<\/jats:p><\/jats:sec>","DOI":"10.1186\/s12859-022-04678-y","type":"journal-article","created":{"date-parts":[[2022,4,28]],"date-time":"2022-04-28T16:04:12Z","timestamp":1651161852000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":29,"title":["Comparison of five supervised feature selection algorithms leading to top features and gene signatures from multi-omics data in 
cancer"],"prefix":"10.1186","volume":"23","author":[{"given":"Tapas","family":"Bhadra","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4107-6784","authenticated-orcid":false,"given":"Saurav","family":"Mallik","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Neaj","family":"Hasan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3477-0914","authenticated-orcid":false,"given":"Zhongming","family":"Zhao","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,4,28]]},"reference":[{"key":"4678_CR1","doi-asserted-by":"publisher","DOI":"10.1002\/9780470872352","volume-title":"Computational intelligence and pattern analysis in biological informatics","author":"U Maulik","year":"2010","unstructured":"Maulik U, Bandyopadhyay S, Wang JTL. Computational intelligence and pattern analysis in biological informatics. Singapore: Wiley; 2010."},{"key":"4678_CR2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3402\/jev.v3.23129","volume":"3","author":"M Aqil","year":"2014","unstructured":"Aqil M, Naqvi AR, Mallik S, et al. The HIV NEF protein modulates cellular and exosomal mirna profiles in human monocytic cells. J Extracell Vesicles. 2014;3:1\u201312.","journal-title":"J Extracell Vesicles"},{"key":"4678_CR3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41598-019-56847-4","volume":"10","author":"G Qin","year":"2020","unstructured":"Qin G, Mallik S, Mitra R, et al. Microrna and transcription factor co-regulatory networks and subtype classification of seminoma and non-seminoma in testicular germ cell tumors. Nat Sci Rep. 
2020;10:1\u201314.","journal-title":"Nat Sci Rep"},{"key":"4678_CR4","doi-asserted-by":"publisher","DOI":"10.1109\/TCBB.2020.3020537","author":"K Mallick","year":"2020","unstructured":"Mallick K, Mallik S, Bandyopadhyay S, Chakraborty S. A novel graph topology based go-similarity measure for signature detection from multi-omics data and its application to other problems. IEEE\/ACM Trans Comput Biol Bioinform. 2020. https:\/\/doi.org\/10.1109\/TCBB.2020.3020537.","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"issue":"1\u20132","key":"4678_CR5","doi-asserted-by":"publisher","first-page":"245","DOI":"10.1016\/S0004-3702(97)00063-5","volume":"97","author":"AL Blum","year":"1997","unstructured":"Blum AL, Langley P. Selection of relevant features and examples in machine learning. Artif Intell. 1997;97(1\u20132):245\u201371.","journal-title":"Artif Intell"},{"key":"4678_CR6","doi-asserted-by":"publisher","first-page":"368","DOI":"10.1093\/bib\/bby120","volume":"21","author":"S Mallik","year":"2018","unstructured":"Mallik S, Zhao Z. Graph- and rule-based learning algorithms: a comprehensive review of their applications for cancer type classification and prognosis using genomic data. Brief Bioinform. 2018;21:368\u201394.","journal-title":"Brief Bioinform"},{"key":"4678_CR7","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1186\/s12918-018-0650-2","volume":"12","author":"S Mallik","year":"2018","unstructured":"Mallik S, Zhao Z. Identification of gene signatures from RNA-SEQ data using pareto-optimal cluster algorithm. BMC Syst Biol. 2018;12:21\u20139.","journal-title":"BMC Syst Biol"},{"key":"4678_CR8","doi-asserted-by":"publisher","DOI":"10.1201\/9780203998076","volume-title":"Pattern recognition algorithms for data mining","author":"SK Pal","year":"2004","unstructured":"Pal SK, Mitra P. Pattern recognition algorithms for data mining. 
Boca Raton: CRC Press; 2004."},{"key":"4678_CR9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3390\/genes9010007","volume":"9","author":"S Mallik","year":"2017","unstructured":"Mallik S, Zhao Z. Congems: condensed gene co-expression module discovery through rule-based learning and its application to lung squamous cell carcinoma. Genes. 2017;9:1\u201325.","journal-title":"Genes"},{"key":"4678_CR10","doi-asserted-by":"publisher","first-page":"673","DOI":"10.1109\/TCBB.2016.2636207","volume":"15","author":"S Bandyopadhyay","year":"2018","unstructured":"Bandyopadhyay S, Mallik S. Integrating multiple data sources for combinatorial marker discovery: a study in tumorigenesis. IEEE\/ACM Trans Comput Biol Bioinform. 2018;15:673\u201387.","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"key":"4678_CR11","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-16615-0","volume-title":"Multiobjective genetic algorithms for clustering: applications in data mining and bioinformatics","author":"U Maulik","year":"2011","unstructured":"Maulik U, Bandyopadhyay S, Mukhopadhyay A. Multiobjective genetic algorithms for clustering: applications in data mining and bioinformatics. New York: Springer; 2011."},{"key":"4678_CR12","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1109\/TNB.2017.2650217","volume":"16","author":"S Mallik","year":"2017","unstructured":"Mallik S, Bhadra T, Maulik U. Identifying epigenetic biomarkers using maximal relevance and minimal redundancy based feature selection for multi-omics data. IEEE Trans Nanobiosci. 2017;16:3\u201310.","journal-title":"IEEE Trans Nanobiosci"},{"key":"4678_CR13","doi-asserted-by":"publisher","first-page":"931","DOI":"10.3390\/genes11080931","volume":"11","author":"S Mallik","year":"2020","unstructured":"Mallik S, Seth S, Bhadra T, Bandyopadhyay S. A linear regression and deep learning approach for detecting reliable genetic alterations in cancer using DNA methylation and gene expression data. Genes. 
2020;11:931.","journal-title":"Genes"},{"issue":"1","key":"4678_CR14","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1109\/34.824819","volume":"22","author":"AK Jain","year":"2000","unstructured":"Jain AK, Duin RPW, Mao J. Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell. 2000;22(1):4\u201337.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"4678_CR15","unstructured":"Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics"},{"issue":"1","key":"4678_CR16","doi-asserted-by":"publisher","first-page":"1","DOI":"10.2202\/1544-6115.1743","volume":"11","author":"M Bhattacharyya","year":"2012","unstructured":"Bhattacharyya M, Feuerbach L, Bhadra T, Lengauer T, Bandyopadhyay S. Microrna transcription start site prediction with multi-objective feature selection. Stat Appl Genet Mol Biol. 2012;11(1):1\u201325.","journal-title":"Stat Appl Genet Mol Biol"},{"issue":"6","key":"4678_CR17","doi-asserted-by":"publisher","first-page":"66722","DOI":"10.1371\/journal.pone.0066722","volume":"8","author":"T Bhadra","year":"2013","unstructured":"Bhadra T, Bhattacharyya M, Feuerbach L, Lengauer T, Bandyopadhyay S. Dna methylation patterns facilitate the identification of microrna transcription start sites: a brain-specific study. PLoS ONE. 2013;8(6):66722.","journal-title":"PLoS ONE"},{"key":"4678_CR18","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.ins.2021.02.034","volume":"566","author":"T Bhadra","year":"2021","unstructured":"Bhadra T, Bandyopadhyay S. Supervised feature selection using integration of densest subgraph finding with floating forward-backward search. Inf Sci. 2021;566:1\u201318.","journal-title":"Inf Sci"},{"key":"4678_CR19","doi-asserted-by":"publisher","first-page":"104","DOI":"10.1016\/j.patrec.2013.12.008","volume":"40","author":"S Bandyopadhyay","year":"2014","unstructured":"Bandyopadhyay S, Bhadra T, Maulik U, Mitra P. 
Integration of dense subgraph finding with feature clustering for unsupervised feature selection. Pattern Recognit Lett. 2014;40:104\u201312.","journal-title":"Pattern Recognit Lett"},{"issue":"8","key":"4678_CR20","doi-asserted-by":"publisher","first-page":"4042","DOI":"10.1016\/j.eswa.2014.12.010","volume":"42","author":"T Bhadra","year":"2015","unstructured":"Bhadra T, Bandyopadhyay S. Unsupervised feature selection using an improved version of differential evolution. Expert Syst Appl. 2015;42(8):4042\u201353.","journal-title":"Expert Syst Appl"},{"key":"4678_CR21","first-page":"1157","volume":"3","author":"I Guyon","year":"2003","unstructured":"Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157\u201382.","journal-title":"J Mach Learn Res"},{"issue":"4","key":"4678_CR22","doi-asserted-by":"publisher","first-page":"537","DOI":"10.1109\/72.298224","volume":"5","author":"R Battiti","year":"1994","unstructured":"Battiti R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw. 1994;5(4):537\u201350.","journal-title":"IEEE Trans Neural Netw"},{"issue":"12","key":"4678_CR23","doi-asserted-by":"publisher","first-page":"1667","DOI":"10.1109\/TPAMI.2002.1114861","volume":"24","author":"N Kwak","year":"2002","unstructured":"Kwak N, Choi CH. Input feature selection by mutual information based on Parzen window. IEEE Trans Pattern Anal Mach Intell. 2002;24(12):1667\u201371.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"8","key":"4678_CR24","doi-asserted-by":"publisher","first-page":"1226","DOI":"10.1109\/TPAMI.2005.159","volume":"27","author":"H Peng","year":"2005","unstructured":"Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 
2005;27(8):1226\u201338.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"2","key":"4678_CR25","doi-asserted-by":"publisher","first-page":"189","DOI":"10.1109\/TNN.2008.2005601","volume":"20","author":"PA Estevez","year":"2009","unstructured":"Estevez PA, Tesmer M, Perez CA, Zurada JM. Normalized mutual information feature selection. IEEE Trans Neural Netw. 2009;20(2):189\u2013201.","journal-title":"IEEE Trans Neural Netw"},{"key":"4678_CR26","doi-asserted-by":"crossref","unstructured":"Vinh LT, Thang ND, Lee YK. An improved maximum relevance and minimum redundancy feature selection algorithm based on normalized mutual information. In: Tenth IEEE\/IPSJ international symposium on applications and the internet (SAINT), 2010. p. 395\u201398.","DOI":"10.1109\/SAINT.2010.50"},{"key":"4678_CR27","first-page":"189","volume":"25","author":"S Bandyopadhyay","year":"2015","unstructured":"Bandyopadhyay S, Bhadra T, Maulik U. Variable weighted maximal relevance minimal redundancy criterion for feature selection using normalized mutual information. J Mult-valued Log S. 2015;25:189.","journal-title":"J Mult-valued Log S"},{"key":"4678_CR28","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1023\/A:1012487302797","volume":"46","author":"I Guyon","year":"2002","unstructured":"Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46:389\u2013422.","journal-title":"Mach Learn"},{"key":"4678_CR29","doi-asserted-by":"publisher","first-page":"353","DOI":"10.1016\/j.snb.2015.02.025","volume":"212","author":"K Yan","year":"2015","unstructured":"Yan K, Zhang D. Feature selection and analysis on correlated gas sensor data with recursive feature elimination. Sens Actuators B Chem. 
2015;212:353\u201363.","journal-title":"Sens Actuators B Chem"},{"issue":"4","key":"4678_CR30","doi-asserted-by":"publisher","first-page":"796","DOI":"10.1109\/TNNLS.2015.2424721","volume":"27","author":"H Tao","year":"2016","unstructured":"Tao H, Hou C, Nie F, Jiao Y, Yi D. Effective discriminative feature selection with non-trivial solutions. IEEE Trans Neural Netw Learn Syst. 2016;27(4):796\u2013808.","journal-title":"IEEE Trans Neural Netw Learn Syst"},{"key":"4678_CR31","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41587-020-0546-8","volume":"38","author":"MJ Goldman","year":"2020","unstructured":"Goldman MJ, Craft B, Hastie M, et al. Visualizing and interpreting cancer genomics data via the xena platform. Nat Biotechnol. 2020;38:1\u20134.","journal-title":"Nat Biotechnol"},{"key":"4678_CR32","unstructured":"The cancer genome atlas (TCGA) acute myeloid leukemia (LAML) dataset. https:\/\/xenabrowser.net\/datapages\/?cohort=TCGA%20Acute%20Myeloid%20Leukemia%20(LAML)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443. Accessed 25 July 2019."},{"issue":"1","key":"4678_CR33","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1109\/TCBB.2013.147","volume":"11","author":"S Bandyopadhyay","year":"2014","unstructured":"Bandyopadhyay S, Mallik S, Mukhopadhyay A. A survey and comparative study of statistical tests for identifying differential expression from microarray data. IEEE\/ACM Trans Comput Biol Bioinform. 2014;11(1):95\u2013115.","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"issue":"6","key":"4678_CR34","doi-asserted-by":"publisher","first-page":"1119","DOI":"10.1109\/TSMC.2017.2726553","volume":"49","author":"T Bhadra","year":"2019","unstructured":"Bhadra T, Mallik S, Bandyopadhyay S. Identification of multiview gene modules using mutual information-based hypograph mining. IEEE Trans Syst Man Cybern. 
2019;49(6):1119\u201330.","journal-title":"IEEE Trans Syst Man Cybern"},{"key":"4678_CR35","doi-asserted-by":"publisher","first-page":"1","DOI":"10.2202\/1544-6115.1027","volume":"3","author":"GK Smyth","year":"2004","unstructured":"Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:1.","journal-title":"Stat Appl Genet Mol Biol"},{"issue":"1","key":"4678_CR36","doi-asserted-by":"publisher","first-page":"10","DOI":"10.1145\/1656274.1656278","volume":"11","author":"M Hall","year":"2009","unstructured":"Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explor. 2009;11(1):10\u20138.","journal-title":"ACM SIGKDD Explor"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-022-04678-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-022-04678-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-022-04678-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,3]],"date-time":"2023-02-03T22:25:06Z","timestamp":1675463106000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-022-04678-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3]]},"references-count":36,"journal-issue":{"issue":"S3","published-print":{"date-parts":[[2022,3]]}},"alternative-id":["4678"],"URL":"https:\/\/doi.org\/10.1186\/s12859-022-04678-y","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":
[],"published":{"date-parts":[[2022,3]]},"assertion":[{"value":"11 April 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 April 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 April 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"153"}}