{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,21]],"date-time":"2026-03-21T04:05:24Z","timestamp":1774065924756,"version":"3.50.1"},"reference-count":51,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,5,31]],"date-time":"2024-05-31T00:00:00Z","timestamp":1717113600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,5,31]],"date-time":"2024-05-31T00:00:00Z","timestamp":1717113600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This paper introduces a novel graph-based filter method for automatic feature selection (abbreviated as GB-AFS) for multi-class classification tasks. The method determines the minimum combination of features required to sustain prediction performance while maintaining complementary discriminating abilities between different classes. It does not require any user-defined parameters such as the number of features to select. The minimum number of features is selected using our newly developed Mean Simplified Silhouette (abbreviated as MSS) index, designed to evaluate the clustering results for the feature selection task. To illustrate the effectiveness and generality of the method, we applied the GB-AFS method using various combinations of statistical measures and dimensionality reduction techniques. The experimental results demonstrate the superior performance of the proposed GB-AFS over other filter-based techniques and automatic feature selection approaches, and demonstrate that the GB-AFS method is independent of the statistical measure or the dimensionality reduction technique chosen by the user. Moreover, the proposed method maintained the accuracy achieved when utilizing all features while using only 7\u2013<jats:inline-formula><jats:alternatives><jats:tex-math>$$30\\%$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>30<\/mml:mn>\n                    <mml:mo>%<\/mml:mo>\n                  <\/mml:mrow>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> of the original features. This resulted in an average time saving ranging from <jats:inline-formula><jats:alternatives><jats:tex-math>$$15\\%$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>15<\/mml:mn>\n                    <mml:mo>%<\/mml:mo>\n                  <\/mml:mrow>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> for the smallest dataset to <jats:inline-formula><jats:alternatives><jats:tex-math>$$70\\%$$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>70<\/mml:mn>\n                    <mml:mo>%<\/mml:mo>\n                  <\/mml:mrow>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> for the largest. Our code is available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/davidlevinwork\/gbfs\/\">https:\/\/github.com\/davidlevinwork\/gbfs\/<\/jats:ext-link>.<\/jats:p>","DOI":"10.1186\/s40537-024-00934-5","type":"journal-article","created":{"date-parts":[[2024,5,31]],"date-time":"2024-05-31T11:09:42Z","timestamp":1717153782000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["GB-AFS: graph-based automatic feature selection for multi-class classification via Mean Simplified Silhouette"],"prefix":"10.1186","volume":"11","author":[{"given":"David","family":"Levin","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Gonen","family":"Singer","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,5,31]]},"reference":[{"issue":"19","key":"934_CR1","doi-asserted-by":"publisher","first-page":"2507","DOI":"10.1093\/bioinformatics\/btm344","volume":"23","author":"Y Saeys","year":"2007","unstructured":"Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507\u201317.","journal-title":"Bioinformatics"},{"key":"934_CR2","unstructured":"Liu H, Motoda H. Feature Selection for Knowledge Discovery and Data Mining. vol. 454. Springer, 2012."},{"issue":"1","key":"934_CR3","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-019-0241-0","volume":"6","author":"K Tadist","year":"2019","unstructured":"Tadist K, Najah S, Nikolov NS, Mrabti F, Zahi A. Feature selection methods and genomic big data: a systematic review. J Big Data. 2019;6(1):1\u201324.","journal-title":"J Big Data"},{"issue":"1","key":"934_CR4","doi-asserted-by":"publisher","first-page":"52","DOI":"10.1186\/s40537-020-00327-4","volume":"7","author":"R-C Chen","year":"2020","unstructured":"Chen R-C, Dewi C, Huang S-W, Caraka RE. Selecting critical features for data classification based on machine learning methods. J Big Data. 2020;7(1):52.","journal-title":"J Big Data"},{"issue":"2","key":"934_CR5","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1109\/MIS.2017.38","volume":"32","author":"J Li","year":"2017","unstructured":"Li J, Liu H. Challenges of feature selection for big data analytics. IEEE Intell Syst. 2017;32(2):9\u201315.","journal-title":"IEEE Intell Syst"},{"issue":"1","key":"934_CR6","doi-asserted-by":"publisher","first-page":"16","DOI":"10.1016\/j.compeleceng.2013.11.024","volume":"40","author":"G Chandrashekar","year":"2014","unstructured":"Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014;40(1):16\u201328.","journal-title":"Comput Electr Eng"},{"key":"934_CR7","doi-asserted-by":"publisher","first-page":"919","DOI":"10.1016\/j.procs.2016.07.111","volume":"91","author":"J Miao","year":"2016","unstructured":"Miao J, Niu L. A survey on feature selection. Procedia Comput Sci. 2016;91:919\u201326.","journal-title":"Procedia Comput Sci"},{"key":"934_CR8","doi-asserted-by":"publisher","first-page":"57","DOI":"10.1007\/s10462-016-9516-4","volume":"49","author":"RB Pereira","year":"2018","unstructured":"Pereira RB, Plastino A, Zadrozny B, Merschmann LH. Categorizing feature selection methods for multi-label classification. Artif Intell Rev. 2018;49:57\u201378.","journal-title":"Artif Intell Rev"},{"issue":"6","key":"934_CR9","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3136625","volume":"50","author":"J Li","year":"2017","unstructured":"Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H. Feature selection: a data perspective. ACM Comput Surveys (CSUR). 2017;50(6):1\u201345.","journal-title":"ACM Comput Surveys (CSUR)"},{"key":"934_CR10","doi-asserted-by":"crossref","unstructured":"Jovi\u0107 A, Brki\u0107 K, Bogunovi\u0107 N. A review of feature selection methods with applications. In: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2015;1200\u20131205. IEEE","DOI":"10.1109\/MIPRO.2015.7160458"},{"issue":"1","key":"934_CR11","first-page":"3","volume":"19","author":"B Venkatesh","year":"2019","unstructured":"Venkatesh B, Anuradha J. A review of feature selection and its methods. Cybern Inf Technol. 2019;19(1):3\u201326.","journal-title":"Cybern Inf Technol"},{"key":"934_CR12","unstructured":"Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. In: Icml, 1997;97: 35. Citeseer"},{"key":"934_CR13","doi-asserted-by":"publisher","DOI":"10.3389\/fbinf.2022.927312","volume":"2","author":"N Pudjihartono","year":"2022","unstructured":"Pudjihartono N, Fadason T, Kempa-Liehr AW, O\u2019Sullivan JM. A review of feature selection methods for machine learning-based disease risk prediction. Front Bioinformatics. 2022;2: 927312.","journal-title":"Front Bioinformatics"},{"issue":"13","key":"934_CR14","doi-asserted-by":"publisher","first-page":"1898","DOI":"10.1016\/j.ins.2005.07.015","volume":"176","author":"ER Hruschka","year":"2006","unstructured":"Hruschka ER, Campello RJ, De Castro LN. Evolving clusters in gene-expression data. Inf Sci. 2006;176(13):1898\u2013927.","journal-title":"Inf Sci"},{"key":"934_CR15","first-page":"5812","volume":"33","author":"Y You","year":"2020","unstructured":"You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y. Graph contrastive learning with augmentations. Adv Neural Inf Process Syst. 2020;33:5812\u201323.","journal-title":"Adv Neural Inf Process Syst"},{"issue":"8","key":"934_CR16","first-page":"1548","volume":"33","author":"D Cai","year":"2010","unstructured":"Cai D, He X, Han J, Huang TS. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans Pattern Anal Mach Intell. 2010;33(8):1548\u201360.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"934_CR17","unstructured":"Briola A, Aste T. Topological feature selection: a graph-based filter feature selection approach. arXiv preprint arXiv:2302.09543 2023."},{"key":"934_CR18","unstructured":"Friedman S, Singer G, Rabin N. Graph-based extreme feature selection for multi-class classification tasks. arXiv preprint arXiv:2303.01792 2023."},{"issue":"1","key":"934_CR19","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1016\/j.acha.2006.04.006","volume":"21","author":"RR Coifman","year":"2006","unstructured":"Coifman RR, Lafon S. Diffusion maps. Appl Comput Harmon Anal. 2006;21(1):5\u201330.","journal-title":"Appl Comput Harmon Anal"},{"key":"934_CR20","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2019.113024","volume":"142","author":"A Hashemi","year":"2020","unstructured":"Hashemi A, Dowlatshahi MB, Nezamabadi-Pour H. Mgfs: a multi-label graph-based feature selection algorithm via pagerank centrality. Expert Syst Appl. 2020;142: 113024.","journal-title":"Expert Syst Appl"},{"key":"934_CR21","doi-asserted-by":"crossref","unstructured":"Xing W, Ghorbani A. Weighted pagerank algorithm. In: Proceedings. Second Annual Conference on Communication Networks and Services Research, 2004., 2004;305\u2013314. IEEE","DOI":"10.1109\/DNSR.2004.1344743"},{"issue":"1","key":"934_CR22","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1177\/0165551521991037","volume":"49","author":"B Parlak","year":"2023","unstructured":"Parlak B, Uysal AK. A novel filter feature selection method for text classification: extensive feature selector. J Inf Sci. 2023;49(1):59\u201378.","journal-title":"J Inf Sci"},{"issue":"12","key":"934_CR23","doi-asserted-by":"publisher","first-page":"4396","DOI":"10.1109\/TPAMI.2020.3002843","volume":"43","author":"G Roffo","year":"2020","unstructured":"Roffo G, Melzi S, Castellani U, Vinciarelli A, Cristani M. Infinite feature selection: a graph-based feature filtering approach. IEEE Trans Pattern Anal Mach Intell. 2020;43(12):4396\u2013410.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"18","key":"934_CR24","doi-asserted-by":"publisher","first-page":"3766","DOI":"10.1016\/j.ins.2011.04.050","volume":"181","author":"TF Cov\u00f5es","year":"2011","unstructured":"Cov\u00f5es TF, Hruschka ER. Towards improving cluster-based feature selection with a simplified silhouette filter. Inf Sci. 2011;181(18):3766\u201382.","journal-title":"Inf Sci"},{"key":"934_CR25","doi-asserted-by":"crossref","unstructured":"Wang F, Franco-Penya H-H, Kelleher JD, Pugh J, Ross R. An analysis of the application of simplified silhouette to the evaluation of k-means clustering validity. In: Machine Learning and Data Mining in Pattern Recognition: 13th International Conference, MLDM 2017, New York, NY, USA, July 15-20, 2017, Proceedings 13, 2017;291\u2013305. Springer.","DOI":"10.1007\/978-3-319-62416-7_21"},{"issue":"8","key":"934_CR26","doi-asserted-by":"publisher","first-page":"1193","DOI":"10.3390\/rs10081193","volume":"10","author":"Y Wang","year":"2018","unstructured":"Wang Y, Qi Q, Liu Y. Unsupervised segmentation evaluation using area-weighted variance and Jeffries-Matusita distance for remote sensing images. Remote Sens. 2018;10(8):1193.","journal-title":"Remote Sens"},{"issue":"9","key":"934_CR27","doi-asserted-by":"publisher","first-page":"3283","DOI":"10.1109\/TGRS.2009.2019126","volume":"47","author":"VA Tolpekin","year":"2009","unstructured":"Tolpekin VA, Stein A. Quantification of the effects of land-cover-class spectral separability on the accuracy of Markov-random-field-based superresolution mapping. IEEE Trans Geosci Remote Sens. 2009;47(9):3283\u201397.","journal-title":"IEEE Trans Geosci Remote Sens"},{"key":"934_CR28","unstructured":"Maaten L, Hinton G. Visualizing data using t-sne. J Mach Learning Res. 2008;9(11)."},{"key":"934_CR29","unstructured":"Hinton GE, Roweis S. Stochastic neighbor embedding. Adv Neural Inf Proc Syst. 2002;15."},{"key":"934_CR30","unstructured":"Van Der\u00a0Maaten L. Learning a parametric embedding by preserving local structure. In: Artificial Intelligence and Statistics, 2009;384\u2013391. PMLR."},{"key":"934_CR31","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1016\/0377-0427(87)90125-7","volume":"20","author":"PJ Rousseeuw","year":"1987","unstructured":"Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53\u201365.","journal-title":"J Comput Appl Math"},{"key":"934_CR32","doi-asserted-by":"crossref","unstructured":"Satopaa V, Albrecht J, Irwin D, Raghavan B. Finding a \u201cKneedle\u201d in a haystack: Detecting knee points in system behavior. In: 2011 31st International Conference on Distributed Computing Systems Workshops, 2011;166\u2013171. IEEE.","DOI":"10.1109\/ICDCSW.2011.20"},{"key":"934_CR33","unstructured":"Microsoft: Microsoft Malware Prediction. Kaggle 2019. https:\/\/www.kaggle.com\/c\/microsoft-malware-prediction\/data."},{"key":"934_CR34","unstructured":"Kaufman L, Rousseeuw PJ. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 2009."},{"key":"934_CR35","unstructured":"Arthur D, Vassilvitskii S. K-means++ the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007;1027\u20131035."},{"key":"934_CR36","unstructured":"Hruschka ER, Covoes TF. Feature selection for cluster analysis: an approach based on the simplified silhouette criterion. In: International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC\u201906), 2005;1: 32\u201338. IEEE."},{"key":"934_CR37","doi-asserted-by":"publisher","unstructured":"Cole R, Fanty M. ISOLET. UCI Machine Learning Repository. 1994. https:\/\/doi.org\/10.24432\/C51G69.","DOI":"10.24432\/C51G69"},{"key":"934_CR38","doi-asserted-by":"publisher","unstructured":"Campos D, Bernardes J. Cardiotocography. UCI Machine Learning Repository. 2010. https:\/\/doi.org\/10.24432\/C51S4N.","DOI":"10.24432\/C51S4N"},{"key":"934_CR39","doi-asserted-by":"publisher","unstructured":"Higuera C, Gardiner K, Cios K. Mice Protein Expression. UCI Machine Learning Repository. 2015. https:\/\/doi.org\/10.24432\/C50S3Z.","DOI":"10.24432\/C50S3Z"},{"key":"934_CR40","unstructured":"Olteanu A. GTZAN Dataset\u2014Music Genre Classification. Kaggle 2020. https:\/\/www.kaggle.com\/datasets\/andradaolteanu\/gtzan-dataset-music-genre-classification."},{"key":"934_CR41","doi-asserted-by":"publisher","first-page":"23","DOI":"10.1023\/A:1025667309714","volume":"53","author":"M Robnik-\u0160ikonja","year":"2003","unstructured":"Robnik-\u0160ikonja M, Kononenko I. Theoretical and empirical analysis of relieff and rrelieff. Mach Learn. 2003;53:23\u201369.","journal-title":"Mach Learn"},{"issue":"1","key":"934_CR42","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s12859-016-1423-9","volume":"18","author":"M Radovic","year":"2017","unstructured":"Radovic M, Ghalwash M, Filipovic N, Obradovic Z. Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinformatics. 2017;18(1):1\u201314.","journal-title":"BMC Bioinformatics"},{"key":"934_CR43","unstructured":"Hall MA. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato 1999."},{"key":"934_CR44","doi-asserted-by":"crossref","unstructured":"Arlot S, Celisse A. A survey of cross-validation procedures for model selection. 2010.","DOI":"10.1214\/09-SS054"},{"issue":"3","key":"934_CR45","first-page":"184","volume":"29","author":"X Manfei","year":"2017","unstructured":"Manfei X, Fralick D, Zheng JZ, Wang B, Changyong F, et al. The differences and similarities between two-sample t-test and paired t-test. Shanghai Arch Psychiatry. 2017;29(3):184.","journal-title":"Shanghai Arch Psychiatry"},{"key":"934_CR46","doi-asserted-by":"crossref","unstructured":"McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction. 2018. arXiv preprint arXiv:1802.03426.","DOI":"10.21105\/joss.00861"},{"issue":"1","key":"934_CR47","doi-asserted-by":"publisher","first-page":"117","DOI":"10.1007\/BF00532240","volume":"70","author":"L R\u00fcschendorf","year":"1985","unstructured":"R\u00fcschendorf L. The Wasserstein distance and approximation theorems. Probab Theory Relat Fields. 1985;70(1):117\u201329.","journal-title":"Probab Theory Relat Fields"},{"key":"934_CR48","doi-asserted-by":"crossref","unstructured":"Beran R. Minimum Hellinger distance estimates for parametric models. Ann Stat. 1977;445\u2013463.","DOI":"10.1214\/aos\/1176343842"},{"key":"934_CR49","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2023.119799","volume":"223","author":"R Haba","year":"2023","unstructured":"Haba R, Singer G, Naftali S, Kramer MR, Ratnovsky A. A remote and personalised novel approach for monitoring asthma severity levels from EEG signals utilizing classification algorithms. Expert Syst Appl. 2023;223: 119799.","journal-title":"Expert Syst Appl"},{"key":"934_CR50","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2024.107914","volume":"132","author":"L Rabkin","year":"2024","unstructured":"Rabkin L, Cohen I, Singer G. Resource allocation in ordinal classification problems: a prescriptive framework utilizing machine learning and mathematical programming. Eng Appl Artif Intell. 2024;132: 107914.","journal-title":"Eng Appl Artif Intell"},{"key":"934_CR51","doi-asserted-by":"publisher","DOI":"10.1016\/j.engappai.2022.105741","volume":"119","author":"DA Shifman","year":"2023","unstructured":"Shifman DA, Cohen I, Huang K, Xian X, Singer G. An adaptive machine learning algorithm for the resource-constrained classification problem. Eng Appl Artif Intell. 2023;119: 105741.","journal-title":"Eng Appl Artif Intell"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-024-00934-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-024-00934-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-024-00934-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,31]],"date-time":"2024-05-31T11:11:57Z","timestamp":1717153917000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-024-00934-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,31]]},"references-count":51,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["934"],"URL":"https:\/\/doi.org\/10.1186\/s40537-024-00934-5","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,31]]},"assertion":[{"value":"23 November 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 May 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"31 May 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"The authors give the Publisher permission to publish the work.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"No conflict of interest regarding the paper.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"79"}}