{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T18:10:24Z","timestamp":1770142224394,"version":"3.49.0"},"reference-count":66,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2023,8,1]],"date-time":"2023-08-01T00:00:00Z","timestamp":1690848000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"FCT\u2014Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["SFRH\/BD\/145472\/2019"],"award-info":[{"award-number":["SFRH\/BD\/145472\/2019"]}]},{"name":"FCT\u2014Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["UIDB\/50008\/2020"],"award-info":[{"award-number":["UIDB\/50008\/2020"]}]},{"name":"FCT\u2014Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["C645008882-00000055"],"award-info":[{"award-number":["C645008882-00000055"]}]},{"name":"Instituto de Telecomunica\u00e7\u00f5es; Portuguese Recovery and Resilience Plan","award":["SFRH\/BD\/145472\/2019"],"award-info":[{"award-number":["SFRH\/BD\/145472\/2019"]}]},{"name":"Instituto de Telecomunica\u00e7\u00f5es; Portuguese Recovery and Resilience Plan","award":["UIDB\/50008\/2020"],"award-info":[{"award-number":["UIDB\/50008\/2020"]}]},{"name":"Instituto de Telecomunica\u00e7\u00f5es; Portuguese Recovery and Resilience Plan","award":["C645008882-00000055"],"award-info":[{"award-number":["C645008882-00000055"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BioMedInformatics"],"abstract":"<jats:p>Early disease detection using microarray data is vital for prompt and efficient treatment. However, the intricate nature of these data and the ongoing need for more precise interpretation techniques make it a persistently active research field. 
Numerous gene expression datasets are publicly available, containing microarray data that reflect the activation status of thousands of genes in patients who may have a specific disease. These datasets encompass a vast number of genes, resulting in high-dimensional feature vectors that present significant challenges for human analysis. Consequently, pinpointing the genes frequently associated with a particular disease becomes a crucial task. In this paper, we present a method capable of determining the frequency with which a gene (feature) is selected for the classification of a specific disease, by incorporating feature discretization and selection techniques into a machine learning pipeline. The experimental results demonstrate high accuracy and a low false negative rate, while significantly reducing the data\u2019s dimensionality in the process. The resulting subsets of genes are manageable for clinical experts, enabling them to verify the presence of a given disease.<\/jats:p>","DOI":"10.3390\/biomedinformatics3030040","type":"journal-article","created":{"date-parts":[[2023,8,1]],"date-time":"2023-08-01T09:32:35Z","timestamp":1690882355000},"page":"585-604","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["A Machine Learning Pipeline for Cancer Detection on Microarray Data: The Role of Feature Discretization and Feature Selection"],"prefix":"10.3390","volume":"3","author":[{"given":"Adara","family":"Nogueira","sequence":"first","affiliation":[{"name":"ISEL\u2014Instituto Superior de Engenharia de Lisboa, Instituto Polit\u00e9cnico de Lisboa, 1959-007 Lisboa, Portugal"},{"name":"IST\u2014Instituto Superior T\u00e9cnico, Universidade de Lisboa, 1049-001 Lisboa, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6508-0932","authenticated-orcid":false,"given":"Artur","family":"Ferreira","sequence":"additional","affiliation":[{"name":"ISEL\u2014Instituto Superior de Engenharia de Lisboa, 
Instituto Polit\u00e9cnico de Lisboa, 1959-007 Lisboa, Portugal"},{"name":"Instituto de Telecomunica\u00e7\u00f5es, 1049-001 Lisboa, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0970-7745","authenticated-orcid":false,"given":"M\u00e1rio","family":"Figueiredo","sequence":"additional","affiliation":[{"name":"IST\u2014Instituto Superior T\u00e9cnico, Universidade de Lisboa, 1049-001 Lisboa, Portugal"},{"name":"Instituto de Telecomunica\u00e7\u00f5es, 1049-001 Lisboa, Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2023,8,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1007\/978-1-4939-9442-7_4","article-title":"A Review of Microarray Datasets: Where to Find Them and Specific Characteristics","volume":"1986","year":"2019","journal-title":"Methods Mol. Biol."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Bishop, C. (1995). Neural Networks for Pattern Recognition, Oxford University.","DOI":"10.1201\/9781420050646.ptb6"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1109\/TIT.1968.1054102","article-title":"On the mean accuracy of statistical pattern recognizers","volume":"14","author":"Hughes","year":"1968","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Nogueira, A., Ferreira, A., and Figueiredo, M. (2022, January 3\u20135). A Step Towards the Explainability of Microarray Data for Cancer Diagnosis with Machine Learning Techniques. Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM), Online.","DOI":"10.5220\/0010980100003122"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"734","DOI":"10.1109\/TKDE.2012.35","article-title":"A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning","volume":"25","author":"Garcia","year":"2013","journal-title":"IEEE Trans. Knowl. 
Data Eng."},{"key":"ref_6","unstructured":"Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification, John Wiley & Sons. [2nd ed.]."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Escolano, F., Suau, P., and Bonev, B. (2009). Information Theory in Computer Vision and Pattern Recognition, Springer.","DOI":"10.1007\/978-1-84882-297-9"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, Springer. [2nd ed.].","DOI":"10.1007\/978-0-387-84858-7"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L. (2006). Feature Extraction: Foundations and Applications, Springer.","DOI":"10.1007\/978-3-540-35488-8"},{"key":"ref_10","unstructured":"Simon, R., Korn, E., McShane, L., Radmacher, M., Wright, G., and Zhao, Y. (2003). Design and Analysis of DNA Microarray Investigations, Springer."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Ferreira, A., and Figueiredo, M. (2015, January 17\u201319). Exploiting the bin-class histograms for feature selection on discrete data. Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, Santiago de Compostela, Spain.","DOI":"10.1007\/978-3-319-19390-8_39"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1373","DOI":"10.1162\/089976603321780317","article-title":"Laplacian eigenmaps for dimensionality reduction and data representation","volume":"15","author":"Belkin","year":"2003","journal-title":"Neural Comput."},{"key":"ref_13","unstructured":"Dougherty, J., Kohavi, R., and Sahami, M. (1995). Machine Learning Proceedings 1995, Elsevier."},{"key":"ref_14","unstructured":"Fayyad, U., and Irani, K. (1993, January 9\u201311). Multi-interval discretization of continuous-valued attributes for classification learning. 
Proceedings of the International Joint Conference on Uncertainty in AI, Washington, DC, USA."},{"key":"ref_15","unstructured":"Alpaydin, E. (2014). Introduction to Machine Learning, The MIT Press. [3rd ed.]."},{"key":"ref_16","unstructured":"He, X., Cai, D., and Niyogi, P. (2005, January 5\u20138). Laplacian score for feature selection. Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Zhao, Z., and Liu, H. (2007, January 20\u201324). Spectral feature selection for supervised and unsupervised learning. Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA.","DOI":"10.1145\/1273496.1273641"},{"key":"ref_18","unstructured":"Liu, L., Kang, J., Yu, J., and Wang, Z. (November, January 30). A comparative study on unsupervised feature selection methods for text clustering. Proceedings of the 2005 International Conference on Natural Language Processing and Knowledge Engineering, Wuhan, China."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1111\/j.1469-1809.1936.tb02137.x","article-title":"The use of multiple measurements in taxonomic problems","volume":"7","author":"Fisher","year":"1936","journal-title":"Ann. Eugen."},{"key":"ref_20","unstructured":"Yu, L., and Liu, H. (2003, January 21\u201324). Feature selection for high-dimensional data: A fast correlation-based filter solution. Proceedings of the International Conference on Machine Learning (ICML), Washington, DC, USA."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1226","DOI":"10.1109\/TPAMI.2005.159","article-title":"Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy","volume":"27","author":"Peng","year":"2005","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell. (PAMI)"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Kononenko, I. (1994, January 6\u20138). 
Estimating attributes: Analysis and extensions of RELIEF. Proceedings of the European Conference on Machine Learning, Catania, Italy.","DOI":"10.1007\/3-540-57868-4_57"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1794","DOI":"10.1016\/j.patrec.2012.05.019","article-title":"Efficient feature selection filters for high-dimensional data","volume":"33","author":"Ferreira","year":"2012","journal-title":"Pattern Recognit. Lett."},{"key":"ref_24","unstructured":"Zhao, Z., Morstatter, F., Sharma, S., Alelyani, S., Anand, A., and Liu, H. (2010). Advancing Feature Selection Research\u2014ASU Feature Selection Repository, Computer Science & Engineering, Arizona State University. Technical Report."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"906","DOI":"10.1093\/bioinformatics\/16.10.906","article-title":"Support vector machine classification and validation of cancer tissue samples using microarray expression data","volume":"16","author":"Furey","year":"2000","journal-title":"Bioinformatics"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"103375","DOI":"10.1016\/j.compbiomed.2019.103375","article-title":"A review of feature selection methods in medical applications","volume":"112","author":"Remeseiro","year":"2019","journal-title":"Comput. Biol. Med."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"927312","DOI":"10.3389\/fbinf.2022.927312","article-title":"A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction","volume":"2","author":"Pudjihartono","year":"2022","journal-title":"Front. Bioinform."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"4543","DOI":"10.1007\/s10489-021-02550-9","article-title":"A comprehensive survey on feature selection in the various fields of machine learning","volume":"52","author":"Dhal","year":"2022","journal-title":"Appl. 
Intell."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"1106","DOI":"10.1109\/TCBB.2012.33","article-title":"A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray Analysis","volume":"9","author":"Lazar","year":"2012","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform."},{"key":"ref_30","unstructured":"Manikandan, G., and Abirami, S. (2018). Knowledge Computing and its Applications: Knowledge Computing in Specific Domains: Volume II, Springer."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"78533","DOI":"10.1109\/ACCESS.2019.2922987","article-title":"A Survey on Hybrid Feature Selection Methods in Microarray Gene Expression Data for Cancer Classification","volume":"7","author":"Almugren","year":"2019","journal-title":"IEEE Access"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1186\/s40537-021-00441-x","article-title":"A survey of dimension reduction and classification methods for RNA-Seq data on malaria vector","volume":"8","author":"Arowolo","year":"2021","journal-title":"J. Big Data"},{"key":"ref_33","unstructured":"Alpaydin, E. (2010). Introduction to Machine Learning, The MIT Press. [2nd ed.]."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Boser, B., Guyon, I., and Vapnik, V. (1992, January 27\u201329). A training algorithm for optimal margin classifiers. Proceedings of the Annual ACM Workshop on Computational Learning Theory, Pittsburgh, PA, USA.","DOI":"10.1145\/130385.130401"},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1023\/A:1009715923555","article-title":"A tutorial on support vector machines for pattern recognition","volume":"2","author":"Burges","year":"1998","journal-title":"Data Min. Knowl. Discov."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Vapnik, V. (1999). 
The Nature of Statistical Learning Theory, Springer.","DOI":"10.1007\/978-1-4757-3264-1"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"415","DOI":"10.1109\/72.991427","article-title":"A comparison of methods for multi-class support vector machines","volume":"13","author":"Hsu","year":"2002","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_38","unstructured":"Weston, J., and Watkins, C. (1998). Multi-Class Support Vector Machines, Department of Computer Science, Royal Holloway, University of London. Technical Report."},{"key":"ref_39","unstructured":"Breiman, L. (1984). Classification and Regression Trees, Chapman & Hall\/CRC. [1st ed.]."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1007\/BF00116251","article-title":"Induction of decision trees","volume":"1","author":"Quinlan","year":"1986","journal-title":"Mach. Learn."},{"key":"ref_41","unstructured":"Quinlan, J. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann."},{"key":"ref_42","unstructured":"Quinlan, J. (1996, January 4\u20138). Bagging, boosting, and C4.5. Proceedings of the National Conference on Artificial Intelligence, Portland, OR, USA."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"476","DOI":"10.1109\/TSMCC.2004.843247","article-title":"Top-down induction of decision trees classifiers\u2014A survey","volume":"35","author":"Rokach","year":"2005","journal-title":"IEEE Trans. Syst. Man, Cybern. Part C Appl. Rev."},{"key":"ref_44","unstructured":"Yip, W., Amin, S., and Li, C. (2011). Handbook of Statistical Bioinformatics, Springer."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1016\/j.ijmedinf.2005.05.002","article-title":"GEMS: A system for automated cancer diagnosis and biomarker discovery from microarray gene expression data","volume":"74","author":"Statnikov","year":"2005","journal-title":"Int. J. Med. Inform."},{"key":"ref_46","unstructured":"Witten, I., Frank, E., Hall, M., and Pal, C. (2016). 
Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann. [4th ed.]."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"261","DOI":"10.1109\/JSTSP.2008.923858","article-title":"Information-theoretic feature selection in microarray data using variable complementarity","volume":"2","author":"Meyer","year":"2008","journal-title":"IEEE J. Sel. Top. Signal Process."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"631","DOI":"10.1093\/bioinformatics\/bti033","article-title":"A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis","volume":"21","author":"Statnikov","year":"2005","journal-title":"Bioinformatics"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Diaz-Uriarte, R., and Andres, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinform., 7.","DOI":"10.1186\/1471-2105-7-3"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Li, Z., Xie, W., and Liu, T. (2018). Efficient feature selection and classification for microarray data. PLoS ONE, 13.","DOI":"10.1371\/journal.pone.0202167"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Consiglio, A., Casalino, G., Castellano, G., Grillo, G., Perlino, E., Vessio, G., and Licciulli, F. (2021). Explaining Ovarian Cancer Gene Expression Profiles with Fuzzy Rules and Genetic Algorithms. 
Electronics, 10.","DOI":"10.3390\/electronics10040375"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"2507","DOI":"10.1093\/bioinformatics\/btm344","article-title":"A review of feature selection techniques in bioinformatics","volume":"23","author":"Saeys","year":"2007","journal-title":"Bioinformatics"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"AbdElNabi, M.L.R., Wajeeh Jasim, M., El-Bakry, H.M., Hamed, N., Taha, M., and Khalifa, N.E.M. (2020). Breast and Colon Cancer Classification from Gene Expression Profiles Using Data Mining Techniques. Symmetry, 12.","DOI":"10.3390\/sym12030408"},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"7270","DOI":"10.1016\/j.eswa.2012.01.096","article-title":"Microarray gene expression classification with few genes: Criteria to combine attribute selection and classification methods","volume":"39","year":"2012","journal-title":"Expert Syst. Appl."},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Jirapech-Umpai, T., and Aitken, S. (2005). Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. 
BMC Bioinform., 6.","DOI":"10.1186\/1471-2105-6-148"},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"3236","DOI":"10.1016\/j.patcog.2007.02.007","article-title":"Markov blanket-embedded genetic algorithm for gene selection","volume":"40","author":"Zhu","year":"2007","journal-title":"Pattern Recognit."},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"530","DOI":"10.1038\/415530a","article-title":"Gene expression profiling predicts clinical outcome of breast cancer","volume":"415","author":"Dai","year":"2002","journal-title":"Nature"},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1038\/415436a","article-title":"Prediction of central nervous system embryonal tumour outcome based on gene expression","volume":"415","author":"Pomeroy","year":"2002","journal-title":"Nature"},{"key":"ref_60","doi-asserted-by":"crossref","first-page":"6745","DOI":"10.1073\/pnas.96.12.6745","article-title":"Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays","volume":"96","author":"Alon","year":"1999","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"531","DOI":"10.1126\/science.286.5439.531","article-title":"Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring","volume":"286","author":"Golub","year":"1999","journal-title":"Science"},{"key":"ref_62","doi-asserted-by":"crossref","first-page":"13790","DOI":"10.1073\/pnas.191502998","article-title":"Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses","volume":"98","author":"Bhattacharjee","year":"2001","journal-title":"Natl. Acad. Sci. 
USA"},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"503","DOI":"10.1038\/35000501","article-title":"Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling","volume":"403","author":"Alizadeh","year":"2000","journal-title":"Nature"},{"key":"ref_64","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1038\/ng765","article-title":"MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia","volume":"30","author":"Armstrong","year":"2002","journal-title":"Nat. Genet."},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Basegmez, H., Sezer, E., and Erol, C. (2021). Optimization for Gene Selection and Cancer Classification. Proceedings, 74.","DOI":"10.3390\/proceedings2021074021"},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"673","DOI":"10.1038\/89044","article-title":"Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks","volume":"7","author":"Khan","year":"2001","journal-title":"Nat. Med."}],"container-title":["BioMedInformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2673-7426\/3\/3\/40\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:23:43Z","timestamp":1760127823000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2673-7426\/3\/3\/40"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,1]]},"references-count":66,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2023,9]]}},"alternative-id":["biomedinformatics3030040"],"URL":"https:\/\/doi.org\/10.3390\/biomedinformatics3030040","relation":{},"ISSN":["2673-7426"],"issn-type":[{"value":"2673-7426","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,8,1]]}}}