{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T14:15:57Z","timestamp":1772720157366,"version":"3.50.1"},"reference-count":27,"publisher":"Oxford University Press (OUP)","issue":"2","license":[{"start":{"date-parts":[[2019,1,9]],"date-time":"2019-01-09T00:00:00Z","timestamp":1546992000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,3,23]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>An important problem in bioinformatics consists of identifying the most important features (or predictors) among a large number of features in a given classification dataset. This problem is often addressed by using a machine learning\u2013based feature ranking method to identify a small set of top-ranked predictors (i.e. the most relevant features for classification). The large number of studies in this area has, however, an important limitation: they ignore the possibility that the top-ranked predictors occur in an instance of Simpson\u2019s paradox, where the positive or negative association between a predictor and a class variable reverses sign upon conditioning on each of the values of a third (confounder) variable. In this work, we review and investigate the role of Simpson\u2019s paradox in the analysis of top-ranked predictors in high-dimensional bioinformatics datasets, in order to avoid the potential danger of misinterpreting an association between a predictor and the class variable. We perform computational experiments using four well-known feature ranking methods from the machine learning field and five high-dimensional datasets of ageing-related genes, where the predictors are Gene Ontology terms. 
The results show that occurrences of Simpson\u2019s paradox involving top-ranked predictors are much more common for one of the feature ranking methods.<\/jats:p>","DOI":"10.1093\/bib\/bby126","type":"journal-article","created":{"date-parts":[[2018,12,7]],"date-time":"2018-12-07T20:14:09Z","timestamp":1544213649000},"page":"421-428","source":"Crossref","is-referenced-by-count":12,"title":["Investigating the role of Simpson\u2019s paradox in the analysis of top-ranked features in high-dimensional bioinformatics datasets"],"prefix":"10.1093","volume":"21","author":[{"given":"Alex A","family":"Freitas","sequence":"first","affiliation":[{"name":"University of Kent, Kent, UK"}]}],"member":"286","published-online":{"date-parts":[[2019,1,9]]},"reference":[{"key":"2020080709361988400_ref1","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1038\/nrg3920","article-title":"Machine learning applications in genetics and genomics","volume":"16","author":"Libbrecht","year":"2015","journal-title":"Nat Rev Genet"},{"key":"2020080709361988400_ref2","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1016\/j.cell.2018.05.015","article-title":"Next-generation machine learning for biological networks","volume":"173","author":"Camacho","year":"2018","journal-title":"Cell"},{"issue":"6","key":"2020080709361988400_ref3","first-page":"45","article-title":"Feature Selection: a data perspective","volume":"50","author":"Li","year":"2017","journal-title":"ACM Comput Surv"},{"key":"2020080709361988400_ref4","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1016\/j.ymeth.2016.08.014","article-title":"Feature selection methods for big data bioinformatics: a survey from the search perspective","volume":"111","author":"Wang","year":"2016","journal-title":"Methods"},{"key":"2020080709361988400_ref5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/978-3-540-35488-8","volume-title":"Feature Extraction: Foundations and 
Applications","author":"Guyon","year":"2006"},{"issue":"19","key":"2020080709361988400_ref6","doi-asserted-by":"crossref","first-page":"2507","DOI":"10.1093\/bioinformatics\/btm344","article-title":"A review of feature selection techniques in bioinformatics","volume":"23","author":"Saeys","year":"2007","journal-title":"Bioinformatics"},{"key":"2020080709361988400_ref7","first-page":"13","article-title":"A review of feature selection and feature extraction methods applied on microarray data","author":"Hira","year":"2015","journal-title":"Adv Bioinformatics"},{"key":"2020080709361988400_ref8","volume-title":"Causality: Models, Reasoning and Inference","author":"Pearl","year":"2000"},{"issue":"1","key":"2020080709361988400_ref9","doi-asserted-by":"crossref","first-page":"8","DOI":"10.1080\/00031305.2014.876829","article-title":"Comment: understanding Simpson\u2019s paradox","volume":"68","author":"Pearl","year":"2014","journal-title":"Am Stat"},{"issue":"4","key":"2020080709361988400_ref10","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1111\/j.1740-9713.2015.00844.x","article-title":"Simpson\u2019s paradox \u2026 and how to avoid it","volume":"12","author":"Norton","year":"2015","journal-title":"Significance"},{"key":"2020080709361988400_ref11","first-page":"1021","article-title":"Bias in OLAP queries: detection, explanation and removal","volume-title":"ACM Press.","author":"Salimi"},{"key":"2020080709361988400_ref12","doi-asserted-by":"crossref","first-page":"14","DOI":"10.3389\/fpsyg.2013.00513","article-title":"Simpson\u2019s paradox in psychological science: a practical guide","volume":"4","author":"Kievit","year":"2013","journal-title":"Front Psychol"},{"key":"2020080709361988400_ref13","doi-asserted-by":"crossref","first-page":"1238","DOI":"10.1091\/mbc.E14-06-1078","article-title":"A statistical anomaly indicates symbiotic origins of eukaryotic membranes","volume":"26","author":"Bansal","year":"2015","journal-title":"Mol Biol 
Cell"},{"key":"2020080709361988400_ref14","doi-asserted-by":"crossref","first-page":"1","DOI":"10.2147\/OAMS.S52288","article-title":"Genomic aggregation effects and Simpson\u2019s paradox","volume":"4","author":"Brimacombe","year":"2014","journal-title":"Open Access Med Stat"},{"issue":"17","key":"2020080709361988400_ref15","doi-asserted-by":"crossref","first-page":"2836","DOI":"10.1093\/bioinformatics\/btv215","article-title":"Addressing false discoveries in network inference","volume":"31","author":"Petri","year":"2015","journal-title":"Bioinformatics"},{"key":"2020080709361988400_ref16","first-page":"148","article-title":"Discovering surprising patterns by detecting occurrences of Simpson\u2019s paradox","volume-title":"Springer.","author":"Fabris"},{"key":"2020080709361988400_ref17","first-page":"186","article-title":"Robust text classification in the presence of confounding bias","author":"Landeiro"},{"key":"2020080709361988400_ref18","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1023\/A:1008280620621","article-title":"Overcoming the myopia of inductive learning algorithms with ReliefF","volume":"7","author":"Kononenko","year":"1997","journal-title":"Appl Intell"},{"key":"2020080709361988400_ref19","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1023\/A:1025667309714","article-title":"Theoretical and empirical analysis of ReliefF and RReliefF","volume":"53","author":"Robnik-Sikonja","year":"2003","journal-title":"Mach Learn"},{"key":"2020080709361988400_ref20","volume-title":"C4.5: Programs for Machine Learning","author":"Quinlan","year":"1993"},{"issue":"2","key":"2020080709361988400_ref21","doi-asserted-by":"crossref","first-page":"262","DOI":"10.1109\/TCBB.2014.2355218","article-title":"Predicting the pro-longevity or anti-longevity effect of model organism genes with new hierarchical feature selection methods","volume":"12","author":"Wan","year":"2015","journal-title":"IEEE\/ACM Trans Comput Biol 
Bioinform"},{"key":"2020080709361988400_ref22","doi-asserted-by":"crossref","first-page":"4094","DOI":"10.1038\/s41598-018-22240-w","article-title":"Prediction and characterization of human ageing-related proteins by using machine learning","volume":"8","author":"Kerepesi","year":"2018","journal-title":"Scientific Reports"},{"key":"2020080709361988400_ref23","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1007\/978-1-4939-3743-1_14","article-title":"Gene Ontology: pitfalls, biases and remedies","volume-title":"The Gene Ontology Handbook","author":"Gaudet","year":"2017"},{"issue":"2","key":"2020080709361988400_ref24","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1007\/s10462-017-9541-y","article-title":"An empirical evaluation of hierarchical feature selection methods for classification in bioinformatics datasets with gene ontology-based features","volume":"50","author":"Wan","year":"2017","journal-title":"Artif Intell Rev"},{"key":"2020080709361988400_ref25","first-page":"738","article-title":"A novel genetic algorithm for feature selection in hierarchical feature spaces","volume-title":"SIAM.","author":"Silva","year":"2018"},{"key":"2020080709361988400_ref26","volume-title":"Probability and Statistics","author":"DeGroot","year":"2002","edition":"3rd"},{"key":"2020080709361988400_ref27","doi-asserted-by":"crossref","first-page":"700","DOI":"10.1016\/j.mad.2010.10.001","article-title":"Systematic analysis and prediction of longevity genes in Caenorhabditis elegans","volume":"131","author":"Li","year":"2010","journal-title":"Mech Ageing Dev"}],"container-title":["Briefings in 
Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/2\/421\/33585393\/bby126.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/bib\/article-pdf\/21\/2\/421\/33585393\/bby126.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,8,7]],"date-time":"2020-08-07T13:46:49Z","timestamp":1596808009000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bib\/article\/21\/2\/421\/5280899"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,1,9]]},"references-count":27,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2019,1,9]]},"published-print":{"date-parts":[[2020,3,23]]}},"URL":"https:\/\/doi.org\/10.1093\/bib\/bby126","relation":{},"ISSN":["1467-5463","1477-4054"],"issn-type":[{"value":"1467-5463","type":"print"},{"value":"1477-4054","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,3]]},"published":{"date-parts":[[2019,1,9]]}}}