{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:29:45Z","timestamp":1750307385372},"reference-count":67,"publisher":"Oxford University Press (OUP)","issue":"3","license":[{"start":{"date-parts":[[2019,3,20]],"date-time":"2019-03-20T00:00:00Z","timestamp":1553040000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"Leverhulme Trust Research Grant","award":["RPG-2016-015"],"award-info":[{"award-number":["RPG-2016-015"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,5,21]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Biologists very often use enrichment methods based on statistical hypothesis tests to identify gene properties that are significantly over-represented in a given set of genes of interest, by comparison with a \u2018background\u2019 set of genes. These enrichment methods, although based on rigorous statistical foundations, are not always the best single option to identify patterns in biological data. In many cases, one can also use classification algorithms from the machine-learning field. Unlike enrichment methods, classification algorithms are designed to maximize measures of predictive performance and are capable of analysing combinations of gene properties, instead of one property at a time. In practice, however, the majority of studies use either enrichment or classification methods (rather than both), and there is a lack of literature discussing the pros and cons of both types of method. The goal of this paper is to compare and contrast enrichment and classification methods, offering two contributions. First, we discuss the (to some extent complementary) advantages and disadvantages of both types of methods for identifying gene properties that discriminate between gene classes. Second, we provide a set of high-level recommendations for using enrichment and classification methods. Overall, by highlighting the strengths and the weaknesses of both types of methods we argue that both should be used in bioinformatics analyses.<\/jats:p>","DOI":"10.1093\/bib\/bbz028","type":"journal-article","created":{"date-parts":[[2019,2,22]],"date-time":"2019-02-22T20:12:34Z","timestamp":1550866354000},"page":"803-814","source":"Crossref","is-referenced-by-count":19,"title":["Comparing enrichment analysis and machine learning for identifying gene properties that discriminate between gene classes"],"prefix":"10.1093","volume":"21","author":[{"given":"Fabio","family":"Fabris","sequence":"first","affiliation":[{"name":"School of Computing, University of Kent, Kent, CT2 7NF, UK"}]},{"given":"Daniel","family":"Palmer","sequence":"first","affiliation":[{"name":"Integrative Genomics of Ageing Group, Institute of Ageing and Chronic Disease, University of Liverpool, Liverpool, UK"}]},{"given":"Jo\u00e3o Pedro","family":"de Magalh\u00e3es","sequence":"first","affiliation":[{"name":"Integrative Genomics of Ageing Group, Institute of Ageing and Chronic Disease, University of Liverpool, Liverpool, UK"}]},{"given":"Alex A","family":"Freitas","sequence":"first","affiliation":[{"name":"School of Computing, University of Kent, Kent, CT2 7NF, UK"}]}],"member":"286","published-online":{"date-parts":[[2019,3,20]]},"reference":[{"issue":"3","key":"2020051819280779300_ref1","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1053\/j.seminhematol.2008.04.003","article-title":"A dirty dozen: twelve P-Value misconceptions","volume":"45","author":"Goodman","year":"2008","journal-title":"Semin Hematol"},{"issue":"1","key":"2020051819280779300_ref2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/nar\/gkn923","article-title":"Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists","volume":"37","author":"Huang","year":"2009","journal-title":"Nucleic Acids Res"},{"key":"2020051819280779300_ref3","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1007\/978-1-4939-3743-1_14","article-title":"Gene ontology: pitfalls, biases, and remedies","volume-title":"The Gene Ontology Handbook","author":"Gaudet","year":"2017"},{"issue":"2","key":"2020051819280779300_ref4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1371\/journal.pcbi.1002375","article-title":"Ten years of pathway analysis: current approaches and outstanding challenges","volume":"8","author":"Khatri","year":"2012","journal-title":"PLoS Comput Biol"},{"issue":"3","key":"2020051819280779300_ref5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1371\/journal.pbio.1002106","article-title":"The extent and consequences of p-hacking in science","volume":"13","author":"Head","year":"2015","journal-title":"PLoS Biol"},{"issue":"1","key":"2020051819280779300_ref6","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1177\/0956797613504966","article-title":"The new statistics: why and how","volume":"25","author":"Cumming","year":"2014","journal-title":"Psychol Sci"},{"issue":"7","key":"2020051819280779300_ref7","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.cell.2018.05.015","article-title":"Next-generation machine learning for biological networks","volume":"173","author":"Camacho","year":"2018","journal-title":"Cell"},{"issue":"6","key":"2020051819280779300_ref8","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1038\/nrg3920","article-title":"Machine learning in genetics and genomics","volume":"16","author":"Libbrecht","year":"2017","journal-title":"Nat Rev Gen"},{"issue":"3","key":"2020051819280779300_ref9","doi-asserted-by":"crossref","first-page":"435","DOI":"10.2174\/1568026613666131204105110","article-title":"Bioinformatics tools for the functional interpretation of quantitative proteomics results","volume":"14","author":"Villavicencio-Diaz","year":"2014","journal-title":"Curr Top Med Chem"},{"issue":"6","key":"2020051819280779300_ref10","first-page":"1370","article-title":"Network approaches to systems biology analysis of complex disease: integrative methods for multi-omics data","volume":"19","author":"Yan","year":"2017","journal-title":"Brief Bioinform"},{"issue":"43","key":"2020051819280779300_ref11","doi-asserted-by":"crossref","first-page":"15545","DOI":"10.1073\/pnas.0506580102","article-title":"Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles","volume":"102","author":"Subramanian","year":"2005","journal-title":"Proc Natl Acad Sci U S A"},{"issue":"1","key":"2020051819280779300_ref12","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1093\/nar\/28.1.27","article-title":"KEGG: Kyoto encyclopedia of genes and genomes","volume":"28","author":"Kanehisa","year":"2000","journal-title":"Nucleic Acids Res"},{"issue":"D1","key":"2020051819280779300_ref13","doi-asserted-by":"crossref","first-page":"D649","DOI":"10.1093\/nar\/gkx1132","article-title":"The reactome pathway knowledgebase","volume":"46","author":"Fabregat","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2020051819280779300_ref14","doi-asserted-by":"crossref","first-page":"D133","DOI":"10.1093\/nar\/gkv1156","article-title":"Regulondb version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond","volume":"44","author":"Gama-Castro","year":"2016","journal-title":"Nucleic Acids Res"},{"issue":"3","key":"2020051819280779300_ref15","doi-asserted-by":"crossref","first-page":"414","DOI":"10.1093\/bioinformatics\/btw623","article-title":"Combining multiple tools outperforms individual methods in gene set enrichment analyses","volume":"33","author":"Alhamdoosh","year":"2017","journal-title":"Bioinformatics"},{"issue":"1","key":"2020051819280779300_ref16","doi-asserted-by":"crossref","first-page":"334","DOI":"10.1186\/s12859-015-0751-5","article-title":"Comparative study on gene set and pathway topology-based enrichment methods","volume":"16","author":"Bayerlov\u00e1","year":"2015","journal-title":"BMC Bioinformatics"},{"issue":"3","key":"2020051819280779300_ref17","doi-asserted-by":"crossref","first-page":"478","DOI":"10.1002\/ana.20736","article-title":"Mitochondrial dysfunction as a cause of axonal degeneration in multiple sclerosis patients","volume":"59","author":"Dutta","year":"2006","journal-title":"Ann Neurol"},{"issue":"7518","key":"2020051819280779300_ref18","doi-asserted-by":"crossref","first-page":"382","DOI":"10.1038\/nature13438","article-title":"Proteogenomic characterization of human colon and rectal cancer","volume":"513","author":"Zhang","year":"2014","journal-title":"Nature"},{"issue":"7117","key":"2020051819280779300_ref19","doi-asserted-by":"crossref","first-page":"337","DOI":"10.1038\/nature05354","article-title":"Resveratrol improves health and survival of mice on a high-calorie diet","volume":"444","author":"Baur","year":"2006","journal-title":"Nature"},{"issue":"6","key":"2020051819280779300_ref20","doi-asserted-by":"crossref","first-page":"1109","DOI":"10.1016\/j.cell.2006.11.013","article-title":"Resveratrol improves mitochondrial function and protects against metabolic disease by activating sirt1 and pgc-1$\\alpha$","volume":"127","author":"Lagouge","year":"2006","journal-title":"Cell"},{"issue":"3","key":"2020051819280779300_ref21","doi-asserted-by":"crossref","first-page":"282","DOI":"10.1016\/j.cels.2018.03.003","article-title":"Pan-cancer alterations of the myc oncogene and its proximal network across the cancer genome atlas","volume":"6","author":"Schaub","year":"2018","journal-title":"Cell Syst"},{"issue":"1","key":"2020051819280779300_ref22","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1186\/s13046-017-0495-3","article-title":"Cancer cells increase endothelial cell tube formation and survival by activating the pi3k\/akt signalling pathway","volume":"36","author":"Cheng","year":"2017","journal-title":"J Exp Clin Cancer Res"},{"issue":"2","key":"2020051819280779300_ref23","doi-asserted-by":"crossref","first-page":"171","DOI":"10.1007\/s10522-017-9683-y","article-title":"A review of supervised machine learning applied to ageing research","volume":"18","author":"Fabris","year":"2017","journal-title":"Biogerontology"},{"issue":"184","key":"2020051819280779300_ref24","first-page":"1","article-title":"An expanded evaluation of protein function prediction methods shows an improvement in accuracy","volume":"17","author":"Jiang","year":"2016","journal-title":"Genome Biol"},{"issue":"1","key":"2020051819280779300_ref25","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene ontology: tool for the unification of biology","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat Genet"},{"issue":"D1","key":"2020051819280779300_ref26","doi-asserted-by":"crossref","first-page":"D369","DOI":"10.1093\/nar\/gkw1102","article-title":"The BioGRID interaction database: 2017 update","volume":"45","author":"Chatr-Aryamontri","year":"2017","journal-title":"Nucleic Acids Res"},{"issue":"D1","key":"2020051819280779300_ref27","doi-asserted-by":"crossref","first-page":"D362","DOI":"10.1093\/nar\/gkw937","article-title":"The STRING database in 2017: quality-controlled protein\u2013protein association networks, made broadly accessible","volume":"45","author":"Szklarczyk","year":"2017","journal-title":"Nucleic Acids Res"},{"issue":"14","key":"2020051819280779300_ref28","doi-asserted-by":"crossref","first-page":"2449","DOI":"10.1093\/bioinformatics\/bty087","article-title":"A new approach for interpreting random forest models and its application to the biology of ageing","volume":"34","author":"Fabris","year":"2018","journal-title":"Bioinformatics"},{"issue":"6","key":"2020051819280779300_ref29","doi-asserted-by":"crossref","first-page":"979","DOI":"10.3233\/IDA-2011-0505","article-title":"Selecting different protein representations and classification algorithms in hierarchical protein function prediction","volume":"15","author":"Silla","year":"2011","journal-title":"Intelligent Data Analysis"},{"issue":"D1","key":"2020051819280779300_ref30","doi-asserted-by":"crossref","first-page":"D1124","DOI":"10.1093\/nar\/gku1042","article-title":"GeneFriends: a human RNA-seq\u2013based gene and transcript co-expression database","volume":"43","author":"van Dam","year":"2014","journal-title":"Nucleic Acids Res"},{"issue":"4","key":"2020051819280779300_ref31","first-page":"575","article-title":"Gene co-expression analysis for functional classification and gene\u2013disease predictions","volume":"19","author":"van Dam","year":"2018","journal-title":"Brief Bioinform"},{"issue":"5","key":"2020051819280779300_ref32","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1089\/bio.2015.0032","article-title":"A novel approach to high-quality postmortem tissue procurement: the GTEx project","volume":"13","author":"Carithers","year":"2015","journal-title":"Biopreserv Biobank"},{"key":"2020051819280779300_ref33","volume-title":"Probabilistic Graphical Models: Principles and Techniques","author":"Koller","year":"2009"},{"key":"2020051819280779300_ref34","first-page":"2017","article-title":"Methods for interpreting and understanding deep neural networks","volume":"73","author":"Montavon","journal-title":"Digit Signal Process"},{"issue":"21","key":"2020051819280779300_ref35","first-page":"4804","article-title":"Systematic analysis of the gerontome reveals links between aging and age-related diseases","volume":"25","author":"Fernandes","year":"2016","journal-title":"Hum Mol Genet"},{"issue":"27","key":"2020051819280779300_ref36","first-page":"11","article-title":"A data mining approach for classifying DNA repair genes into ageing-related or non-ageing-related","volume":"12","author":"Freitas","year":"2011","journal-title":"BMC Genomics"},{"issue":"5","key":"2020051819280779300_ref37","first-page":"851","article-title":"Deep learning in bioinformatics","volume":"18","author":"Min","year":"2017","journal-title":"Brief Bioinform"},{"key":"2020051819280779300_ref38","volume-title":"Deep Learning","author":"Goodfellow","year":"2016"},{"key":"2020051819280779300_ref39","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1016\/j.ymeth.2017.06.034","article-title":"Near perfect protein multi-label classification with deep neural networks","volume":"132","author":"Szalkai","year":"2018","journal-title":"Methods"},{"issue":"1\u20132","key":"2020051819280779300_ref40","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1007\/s10618-010-0175-9","article-title":"A survey of hierarchical classification across different application domains","volume":"44","author":"Silla","year":"2011","journal-title":"Data Min Knowl Discov"},{"issue":"4094","key":"2020051819280779300_ref41","first-page":"13","article-title":"Prediction and characterization of human ageing-related proteins by using machine learning","volume":"8","author":"Kerepesi","year":"2018","journal-title":"Sci Rep"},{"issue":"1","key":"2020051819280779300_ref42","doi-asserted-by":"crossref","first-page":"172","DOI":"10.1109\/TCBB.2008.47","article-title":"On the importance of comprehensible classification models for protein function prediction","volume":"7","author":"Freitas","year":"2010","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"key":"2020051819280779300_ref43","volume-title":"Data Mining: Practical Machine Learning Tools and Techniques","author":"Witten","year":"2016"},{"key":"2020051819280779300_ref44","doi-asserted-by":"crossref","first-page":"220","DOI":"10.1016\/j.eswa.2016.12.035","article-title":"Learning from class-imbalanced data: review of methods and applications","volume":"73","author":"Haixiang","year":"2017","journal-title":"Expert Syst Appl"},{"issue":"2","key":"2020051819280779300_ref45","doi-asserted-by":"crossref","first-page":"262","DOI":"10.1109\/TCBB.2014.2355218","article-title":"Predicting the pro-longevity or anti-longevity effect of model organism genes with new hierarchical feature selection methods","volume":"12","author":"Wan","year":"2015","journal-title":"IEEE\/ACM Trans Comput Biol Bioinform"},{"issue":"1","key":"2020051819280779300_ref46","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1038\/nprot.2008.211","article-title":"Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources","volume":"4","author":"Huang","year":"2009","journal-title":"Nat Protoc"},{"issue":"86","key":"2020051819280779300_ref47","first-page":"1","article-title":"BayGO: Bayesian analysis of ontology term enrichment in microarray data","volume":"7","author":"V\u00eancio","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2020051819280779300_ref48","doi-asserted-by":"crossref","first-page":"905","DOI":"10.1093\/bioinformatics\/btq059","article-title":"Go-bayes: Gene Ontology\u2013based overrepresentation analysis using a Bayesian approach","volume":"26","author":"Zhang","year":"2010","journal-title":"Bioinformatics"},{"key":"2020051819280779300_ref49","doi-asserted-by":"crossref","first-page":"3523","DOI":"10.1093\/nar\/gkq045","article-title":"GOing Bayesian: model-based gene set analysis of genome-scale data","volume":"38","author":"Bauer","year":"2010","journal-title":"Nucleic Acids Res"},{"issue":"21","key":"2020051819280779300_ref50","doi-asserted-by":"crossref","first-page":"9622","DOI":"10.1093\/nar\/gkt752","article-title":"A modular framework for gene set analysis integrating multilevel omics data","volume":"41","author":"Sass","year":"2013","journal-title":"Nucleic Acids Res"},{"issue":"2","key":"2020051819280779300_ref51","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1080\/00031305.2016.1154108","article-title":"The ASA\u2019s statement on p-values: context, process, and purpose","volume":"70","author":"Wasserstein","year":"2016","journal-title":"Am Stat"},{"issue":"8","key":"2020051819280779300_ref52","doi-asserted-by":"crossref","first-page":"980","DOI":"10.1093\/bioinformatics\/btm051","article-title":"Analyzing gene expression data in terms of gene sets: methodological issues","volume":"23","author":"Goeman","year":"2007","journal-title":"Bioinformatics"},{"issue":"2","key":"2020051819280779300_ref53","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1093\/bib\/bbl019","article-title":"Enrichment analysis in high-throughput genomics-accounting for dependency in the NULL","volume":"8","author":"Gold","year":"2006","journal-title":"Brief Bioinform"},{"issue":"8","key":"2020051819280779300_ref54","doi-asserted-by":"crossref","first-page":"1184","DOI":"10.1038\/nprot.2009.97","article-title":"Mapping identifiers for the integration of genomic datasets with the R\/Bioconductor package biomaRt","volume":"4","author":"Durinck","year":"2009","journal-title":"Nat Protoc"},{"key":"2020051819280779300_ref55","doi-asserted-by":"crossref","DOI":"10.1201\/b12207","volume-title":"Ensemble Methods: Foundations and Algorithms","author":"Zhou","year":"2012"},{"issue":"4","key":"2020051819280779300_ref56","doi-asserted-by":"crossref","first-page":"296","DOI":"10.2174\/157489310794072508","article-title":"A review of ensemble methods in bioinformatics","volume":"5","author":"Yang","year":"2010","journal-title":"Curr Bioinform"},{"key":"2020051819280779300_ref57","doi-asserted-by":"crossref","first-page":"128","DOI":"10.1016\/j.eswa.2017.04.003","article-title":"An up-to-date comparison of state-of-the-art classification algorithms","volume":"82","author":"Zhang","year":"2017","journal-title":"Expert Syst Appl"},{"issue":"1","key":"2020051819280779300_ref58","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1145\/1656274.1656278","article-title":"The WEKA data mining software: an update","volume":"11","author":"Hall","year":"2009","journal-title":"SIGKDD Explor"},{"key":"2020051819280779300_ref59","first-page":"2962","article-title":"Efficient and robust automated machine learning","author":"Feurer","year":"2015","journal-title":"Advances in Neural Information Processing Systems 28"},{"issue":"25","key":"2020051819280779300_ref60","first-page":"1","article-title":"Auto-WEKA 2.0: automatic model selection and hyperparameter optimization in WEKA","volume":"18","author":"Kotthoff","year":"2017","journal-title":"J Mach Learn Res"},{"key":"2020051819280779300_ref61","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511921803","volume-title":"Evaluating Learning Algorithms: A Classification Perspective","author":"Japkowicz","year":"2011"},{"issue":"1","key":"2020051819280779300_ref62","doi-asserted-by":"crossref","first-page":"437","DOI":"10.1186\/s12864-017-3786-3","article-title":"Co-expressed pathways database for tomato: a database to predict pathways relevant to a query gene","volume":"18","author":"Narise","year":"2017","journal-title":"BMC Genomics"},{"issue":"1","key":"2020051819280779300_ref63","doi-asserted-by":"crossref","DOI":"10.1186\/1471-2105-10-47","article-title":"A general modular framework for gene set enrichment analysis","volume":"10","author":"Ackermann","year":"2009","journal-title":"BMC Bioinformatics"},{"issue":"25","key":"2020051819280779300_ref64","doi-asserted-by":"crossref","first-page":"8961","DOI":"10.1073\/pnas.0502674102","article-title":"Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays","volume":"102","author":"Pan","year":"2005","journal-title":"Proc Natl Acad Sci U S A"},{"key":"2020051819280779300_ref65","first-page":"4","article-title":"Comprehensive comparison of gene set analysis tools","volume-title":"Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP)","author":"Liu","year":"2011"},{"issue":"1","key":"2020051819280779300_ref66","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1186\/1471-2105-3-17","article-title":"The limit fold change model: a practical approach for selecting differentially expressed genes from microarray data","volume":"3","author":"Mutch","year":"2002","journal-title":"BMC Bioinformatics"},{"issue":"D1","key":"2020051819280779300_ref67","doi-asserted-by":"crossref","first-page":"D1124","DOI":"10.1093\/nar\/gku1042","article-title":"Genefriends: a human RNA-seq-based gene and transcript co-expression database","volume":"43","author":"van Dam","year":"2014","journal-title":"Nucleic Acids Res"}],"container-title":["Briefings in Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/3\/803\/33227296\/bbz028.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/3\/803\/33227296\/bbz028.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,15]],"date-time":"2024-07-15T01:52:25Z","timestamp":1721008345000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/21\/3\/803\/5380425"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,3,20]]},"references-count":67,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2019,3,20]]},"published-print":{"date-parts":[[2020,5,21]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bbz028","relation":{},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,5]]},"published":{"date-parts":[[2019,3,20]]}}}