{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T02:12:47Z","timestamp":1772503967252,"version":"3.50.1"},"reference-count":55,"publisher":"Oxford University Press (OUP)","issue":"15","license":[{"start":{"date-parts":[[2020,5,12]],"date-time":"2020-05-12T00:00:00Z","timestamp":1589241600000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"name":"German Federal Ministry of Education and Research"},{"DOI":"10.13039\/501100002347","name":"BMBF","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100002347","id-type":"DOI","asserted-by":"publisher"}]},{"name":"e:Med Programme on systems medicine","award":["01ZX1510"],"award-info":[{"award-number":["01ZX1510"]}]},{"name":"e:Med Programme on systems medicine","award":["01ZX1708E"],"award-info":[{"award-number":["01ZX1708E"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,8,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>An R package providing functions for data analysis and simulation is available at GitHub (https:\/\/github.com\/szymczak-lab\/PathwayGuidedRF). An accompanying R data package (https:\/\/github.com\/szymczak-lab\/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO).<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa483","type":"journal-article","created":{"date-parts":[[2020,5,5]],"date-time":"2020-05-05T11:12:03Z","timestamp":1588677123000},"page":"4301-4308","source":"Crossref","is-referenced-by-count":14,"title":["Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study"],"prefix":"10.1093","volume":"36","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2567-5728","authenticated-orcid":false,"given":"Stephan","family":"Seifert","sequence":"first","affiliation":[{"name":"Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein , Kiel 24105, Germany"}]},{"given":"Sven","family":"Gundlach","sequence":"additional","affiliation":[{"name":"Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein , Kiel 24105, Germany"}]},{"given":"Olaf","family":"Junge","sequence":"additional","affiliation":[{"name":"Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein , Kiel 24105, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8897-9035","authenticated-orcid":false,"given":"Silke","family":"Szymczak","sequence":"additional","affiliation":[{"name":"Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein , Kiel 24105, Germany"}]}],"member":"286","published-online":{"date-parts":[[2020,5,12]]},"reference":[{"key":"2023062312041114400_btaa483-B1","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1186\/1471-2105-10-47","article-title":"A general modular framework for gene set enrichment analysis","volume":"10","author":"Ackermann","year":"2009","journal-title":"BMC Bioinformatics"},{"key":"2023062312041114400_btaa483-B2","doi-asserted-by":"crossref","first-page":"88","DOI":"10.1016\/j.ymeth.2015.04.006","article-title":"Network-constrained forest for regularized classification of omics data","volume":"83","author":"And\u011bl","year":"2015","journal-title":"Methods"},{"key":"2023062312041114400_btaa483-B3","doi-asserted-by":"crossref","first-page":"D504","DOI":"10.1093\/nar\/gkj126","article-title":"Pathguide: a pathway resource list","volume":"34","author":"Bader","year":"2006","journal-title":"Nucleic Acids Res"},{"key":"2023062312041114400_btaa483-B4","doi-asserted-by":"crossref","first-page":"D991","DOI":"10.1093\/nar\/gks1193","article-title":"NCBI GEO: archive for functional genomics data sets-update","volume":"41","author":"Barrett","year":"2012","journal-title":"Nucleic Acids Res"},{"key":"2023062312041114400_btaa483-B5","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","article-title":"Controlling the false discovery rate: a practical and powerful approach to multiple testing","volume":"57","author":"Benjamini","year":"1995","journal-title":"J. R. Stat. Soc. Ser. B Stat. Methodol"},{"key":"2023062312041114400_btaa483-B6","doi-asserted-by":"crossref","first-page":"523","DOI":"10.1186\/1471-2105-11-523","article-title":"Class prediction for high-dimensional class-imbalanced data","volume":"11","author":"Blagus","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023062312041114400_btaa483-B7","doi-asserted-by":"crossref","first-page":"138","DOI":"10.1186\/s12874-017-0417-2","article-title":"Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies","volume":"17","author":"Boulesteix","year":"2017","journal-title":"BMC Med. Res. Methodol"},{"key":"2023062312041114400_btaa483-B8","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn"},{"key":"2023062312041114400_btaa483-B9","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1093\/bioinformatics\/bts643","article-title":"Pathway hunting by random survival forests","volume":"29","author":"Chen","year":"2013","journal-title":"Bioinformatics"},{"key":"2023062312041114400_btaa483-B10","doi-asserted-by":"crossref","first-page":"14","DOI":"10.1186\/1755-8794-5-14","article-title":"Identification of genes with a correlation between copy number and expression in gastric cancer","volume":"5","author":"Cheng","year":"2012","journal-title":"BMC Med. Genomics"},{"key":"2023062312041114400_btaa483-B11","doi-asserted-by":"crossref","first-page":"843","DOI":"10.1038\/s41592-019-0509-5","article-title":"Assessment of network module identification across complex diseases","volume":"16","author":"Choobdar","year":"2019","journal-title":"Nat. Methods"},{"key":"2023062312041114400_btaa483-B12","doi-asserted-by":"crossref","first-page":"D472","DOI":"10.1093\/nar\/gkt1102","article-title":"The reactome pathway knowledgebase","volume":"42","author":"Croft","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2023062312041114400_btaa483-B13","doi-asserted-by":"crossref","first-page":"1846","DOI":"10.1093\/bioinformatics\/btm254","article-title":"GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor","volume":"23","author":"Davis","year":"2007","journal-title":"Bioinformatics"},{"key":"2023062312041114400_btaa483-B14","doi-asserted-by":"crossref","first-page":"492","DOI":"10.1093\/bib\/bbx124","article-title":"Evaluation of variable selection methods for random forests and omics data sets","volume":"20","author":"Degenhardt","year":"2019","journal-title":"Brief. Bioinf"},{"key":"2023062312041114400_btaa483-B15","doi-asserted-by":"crossref","first-page":"e17795","DOI":"10.1371\/journal.pone.0017795","article-title":"Do two machine-learning based prognostic signatures for breast cancer capture the same biological processes?","volume":"6","author":"Drier","year":"2011","journal-title":"PLoS One"},{"key":"2023062312041114400_btaa483-B16","doi-asserted-by":"crossref","first-page":"R187","DOI":"10.1186\/gb-2007-8-9-r187","article-title":"The LeFE algorithm: embracing the complexity of gene expression in the interpretation of microarray data","volume":"8","author":"Eichler","year":"2007","journal-title":"Genome Biol"},{"key":"2023062312041114400_btaa483-B17","doi-asserted-by":"crossref","first-page":"171","DOI":"10.1093\/bioinformatics\/bth469","article-title":"Outcome signature genes in breast cancer: is there a unique set?","volume":"21","author":"Ein-Dor","year":"2005","journal-title":"Bioinformatics"},{"key":"2023062312041114400_btaa483-B18","doi-asserted-by":"crossref","first-page":"948","DOI":"10.1681\/ASN.2011090887","article-title":"Molecular phenotypes of acute kidney injury in kidney transplants","volume":"23","author":"Famulski","year":"2012","journal-title":"J. Am. Soc. Nephrol"},{"key":"2023062312041114400_btaa483-B19","author":"Genuer","year":"2008"},{"key":"2023062312041114400_btaa483-B20","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1016\/j.compbiolchem.2010.07.002","article-title":"Stable feature selection for biomarker discovery","volume":"34","author":"He","year":"2010","journal-title":"Comput. Biol. Chem"},{"key":"2023062312041114400_btaa483-B21","author":"Hediger","year":"2019"},{"key":"2023062312041114400_btaa483-B22","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1002\/sam.10103","article-title":"Random survival forests for high-dimensional data","volume":"4","author":"Ishwaran","year":"2011","journal-title":"Stat. Anal. Data Min"},{"key":"2023062312041114400_btaa483-B23","doi-asserted-by":"crossref","first-page":"885","DOI":"10.1007\/s11634-016-0276-4","article-title":"A computationally fast variable importance test for random forests for high-dimensional data","volume":"12","author":"Janitza","year":"2018","journal-title":"Adv. Data Anal. Classif"},{"key":"2023062312041114400_btaa483-B24","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1186\/1471-2164-15-33","article-title":"Sex differences in the human peripheral blood transcriptome","volume":"15","author":"Jansen","year":"2014","journal-title":"BMC Genomics"},{"key":"2023062312041114400_btaa483-B25","doi-asserted-by":"crossref","first-page":"D793","DOI":"10.1093\/nar\/gks1055","article-title":"The ConsensusPathDB interaction database: 2013 update","volume":"41","author":"Kamburov","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2023062312041114400_btaa483-B26","doi-asserted-by":"crossref","first-page":"27","DOI":"10.1093\/nar\/28.1.27","article-title":"Kegg: Kyoto Encyclopedia of Genes and Genomes","volume":"28","author":"Kanehisa","year":"2000","journal-title":"Nucleic Acids Res"},{"key":"2023062312041114400_btaa483-B27","doi-asserted-by":"crossref","first-page":"e1002375","DOI":"10.1371\/journal.pcbi.1002375","article-title":"Ten years of pathway analysis: current approaches and outstanding challenges","volume":"8","author":"Khatri","year":"2012","journal-title":"PLoS Comput. Biol"},{"key":"2023062312041114400_btaa483-B28","doi-asserted-by":"crossref","first-page":"309","DOI":"10.3389\/fimmu.2013.00309","article-title":"Activation of the interferon pathway is dependent upon autoantibodies in African-American SLE patients, but not in European-American SLE patients","volume":"4","author":"Ko","year":"2013","journal-title":"Front. Immunol"},{"key":"2023062312041114400_btaa483-B29","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v036.i11","article-title":"Feature selection with the Boruta package","volume":"36","author":"Kursa","year":"2010","journal-title":"J. Stat. Softw"},{"key":"2023062312041114400_btaa483-B30","doi-asserted-by":"crossref","first-page":"417","DOI":"10.1016\/j.cels.2015.12.004","article-title":"The Molecular Signatures Database hallmark gene set collection","volume":"1","author":"Liberzon","year":"2015","journal-title":"Cell Syst"},{"key":"2023062312041114400_btaa483-B31","doi-asserted-by":"crossref","first-page":"33","DOI":"10.3389\/fnins.2013.00033","article-title":"Peripheral blood RNA gene expression profiling in patients with bacterial meningitis","volume":"7","author":"Lill","year":"2013","journal-title":"Front. Neurosci"},{"key":"2023062312041114400_btaa483-B32","doi-asserted-by":"crossref","first-page":"546","DOI":"10.1186\/s12859-019-3146-1","article-title":"A comparative study of topology-based pathway enrichment analysis methods","volume":"20","author":"Ma","year":"2019","journal-title":"BMC Bioinformatics"},{"key":"2023062312041114400_btaa483-B33","doi-asserted-by":"crossref","first-page":"2063","DOI":"10.1093\/bioinformatics\/btm289","article-title":"Statistical assessment of functional categories of genes deregulated in pathological conditions by using microarray data","volume":"23","author":"Maglietta","year":"2007","journal-title":"Bioinformatics"},{"key":"2023062312041114400_btaa483-B34","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1186\/2043-9113-2-6","article-title":"Gene expression profiling of peripheral blood mononuclear cells in the setting of peripheral arterial disease","volume":"2","author":"Masud","year":"2012","journal-title":"J. Clin. Bioinf"},{"key":"2023062312041114400_btaa483-B35","doi-asserted-by":"crossref","first-page":"8","DOI":"10.1186\/s13040-018-0166-8","article-title":"Gene set analysis methods: a systematic comparison","volume":"11","author":"Mathur","year":"2018","journal-title":"BioData Min"},{"key":"2023062312041114400_btaa483-B36","doi-asserted-by":"crossref","first-page":"1364","DOI":"10.1038\/ki.2011.245","article-title":"Progressive histological damage in renal allografts is associated with expression of innate and adaptive immunity genes","volume":"80","author":"Naesens","year":"2011","journal-title":"Kidney Int"},{"key":"2023062312041114400_btaa483-B37","doi-asserted-by":"crossref","first-page":"3711","DOI":"10.1093\/bioinformatics\/bty373","article-title":"The revival of the Gini importance?","volume":"34","author":"Nembrini","year":"2018","journal-title":"Bioinformatics"},{"key":"2023062312041114400_btaa483-B38","doi-asserted-by":"crossref","first-page":"203","DOI":"10.1186\/s13059-019-1790-4","article-title":"Identifying significantly impacted pathways: a comprehensive review and assessment","volume":"20","author":"Nguyen","year":"2019","journal-title":"Genome Biol"},{"key":"2023062312041114400_btaa483-B39","doi-asserted-by":"crossref","first-page":"369","DOI":"10.1093\/bib\/bbr016","article-title":"Letter to the editor: on the stability and ranking of predictors from random forest variable importance measures","volume":"12","author":"Nicodemus","year":"2011","journal-title":"Brief. Bioinf"},{"key":"2023062312041114400_btaa483-B40","first-page":"104","author":"Pan","year":"2013"},{"key":"2023062312041114400_btaa483-B41","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1002\/gepi.21794","article-title":"A system-level pathway\u2013phenotype association analysis using synthetic feature random forest","volume":"38","author":"Pan","year":"2014","journal-title":"Genet. Epidemiol"},{"key":"2023062312041114400_btaa483-B42","doi-asserted-by":"crossref","first-page":"2028","DOI":"10.1093\/bioinformatics\/btl344","article-title":"Pathway analysis using random forests classification and regression","volume":"22","author":"Pang","year":"2006","journal-title":"Bioinformatics"},{"key":"2023062312041114400_btaa483-B43","doi-asserted-by":"crossref","first-page":"459","DOI":"10.1186\/1471-2105-12-459","article-title":"Integrative set enrichment testing for multiple omics platforms","volume":"12","author":"Poisson","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023062312041114400_btaa483-B44","doi-asserted-by":"crossref","first-page":"1263","DOI":"10.1158\/1541-7786.MCR-07-0267","article-title":"Transcriptome profile of human colorectal adenomas","volume":"5","author":"Sabates-Bellver","year":"2007","journal-title":"Mol. Cancer Res"},{"key":"2023062312041114400_btaa483-B45","doi-asserted-by":"crossref","first-page":"3663","DOI":"10.1093\/bioinformatics\/btz149","article-title":"Surrogate minimal depth as an importance measure for variables in random forests","volume":"35","author":"Seifert","year":"2019","journal-title":"Bioinformatics"},{"key":"2023062312041114400_btaa483-B46","doi-asserted-by":"publisher","author":"Sergushichev","year":"2016","DOI":"10.1101\/060012"},{"key":"2023062312041114400_btaa483-B47","doi-asserted-by":"crossref","first-page":"877","DOI":"10.1007\/s00018-010-0500-x","article-title":"Cigarette smoking reprograms apical junctional complex molecular architecture in the human airway epithelium in vivo","volume":"68","author":"Shaykhiev","year":"2011","journal-title":"Cell Mol. Life Sci"},{"key":"2023062312041114400_btaa483-B48","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1186\/1471-2105-8-25","article-title":"Bias in random forest variable importance measures: illustrations, sources and a solution","volume":"8","author":"Strobl","year":"2007","journal-title":"BMC Bioinformatics"},{"key":"2023062312041114400_btaa483-B49","doi-asserted-by":"crossref","first-page":"15545","DOI":"10.1073\/pnas.0506580102","article-title":"Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles","volume":"102","author":"Subramanian","year":"2005","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023062312041114400_btaa483-B50","doi-asserted-by":"crossref","first-page":"e79217","DOI":"10.1371\/journal.pone.0079217","article-title":"A comparison of gene set analysis methods in terms of sensitivity. Prioritization and specificity","volume":"8","author":"Tarca","year":"2013","journal-title":"PLoS One"},{"key":"2023062312041114400_btaa483-B51","doi-asserted-by":"crossref","first-page":"13544","DOI":"10.1073\/pnas.0506577102","article-title":"Discovering statistically significant pathways in expression profiling studies","volume":"102","author":"Tian","year":"2005","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023062312041114400_btaa483-B52","doi-asserted-by":"crossref","first-page":"2103","DOI":"10.12688\/f1000research.9471.1","article-title":"Whose sample is it anyway? Widespread misannotation of samples in transcriptomics studies","volume":"5","author":"Toker","year":"2016","journal-title":"F1000Res"},{"key":"2023062312041114400_btaa483-B53","doi-asserted-by":"crossref","first-page":"1640","DOI":"10.1111\/j.1349-7006.2012.02367.x","article-title":"Protein arginine methyltransferase 5 is a potential oncoprotein that upregulates G 1 cyclins\/cyclin-dependent kinases and the phosphoinositide 3-kinase\/AKT signaling cascade","volume":"103","author":"Wei","year":"2012","journal-title":"Cancer Sci"},{"key":"2023062312041114400_btaa483-B54","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v077.i01","article-title":"ranger: a fast implementation of random forests for high dimensional data in C++ and R","volume":"77","author":"Wright","year":"2017","journal-title":"J. Stat. Softw"},{"key":"2023062312041114400_btaa483-B55","first-page":"44","article-title":"Simulating gene expression data to estimate sample size for class and biomarker discovery","volume":"4","author":"Zhang","year":"2012","journal-title":"Int. J. Adv. Life Sci"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaa483\/33509858\/btaa483.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/15\/4301\/50671500\/bioinformatics_36_15_4301.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/15\/4301\/50671500\/bioinformatics_36_15_4301.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,5]],"date-time":"2024-08-05T08:18:04Z","timestamp":1722845884000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/36\/15\/4301\/5836498"}},"subtitle":[],"editor":[{"given":"Luigi","family":"Martelli","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2020,5,12]]},"references-count":55,"journal-issue":{"issue":"15","published-print":{"date-parts":[[2020,8,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa483","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,8,1]]},"published":{"date-parts":[[2020,5,12]]}}}