{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T08:40:55Z","timestamp":1775896855800,"version":"3.50.1"},"reference-count":40,"publisher":"Oxford University Press (OUP)","issue":"1","license":[{"start":{"date-parts":[[2019,6,5]],"date-time":"2019-06-05T00:00:00Z","timestamp":1559692800000},"content-version":"vor","delay-in-days":1,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["LM010098"],"award-info":[{"award-number":["LM010098"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["LM012601"],"award-info":[{"award-number":["LM012601"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["AI116794"],"award-info":[{"award-number":["AI116794"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,1,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming (GP) to recommend an optimized analysis pipeline for the data scientist\u2019s prediction problem. 
However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We introduce two new features implemented in TPOT that help increase the system\u2019s scalability: Feature Set Selector (FSS) and Template. FSS provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT\u2019s efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing GP to select the best subset in the final pipeline. Template enforces type constraints with strongly typed GP and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-FSS significantly outperforms a tuned XGBoost model and standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independent of the previous study that identified significant association with depression severity of two modules, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Detailed simulation and analysis code needed to reproduce the results in this study is available at https:\/\/github.com\/lelaboratoire\/tpot-fss. 
Implementation of the new TPOT operators is available at https:\/\/github.com\/EpistasisLab\/tpot.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btz470","type":"journal-article","created":{"date-parts":[[2019,6,2]],"date-time":"2019-06-02T15:07:08Z","timestamp":1559488028000},"page":"250-256","source":"Crossref","is-referenced-by-count":398,"title":["Scaling tree-based automated machine learning to biomedical big data with a feature set selector"],"prefix":"10.1093","volume":"36","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3737-6565","authenticated-orcid":false,"given":"Trang T","family":"Le","sequence":"first","affiliation":[{"name":"Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania , Philadelphia, PA 19104, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6434-5468","authenticated-orcid":false,"given":"Weixuan","family":"Fu","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania , Philadelphia, PA 19104, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5015-1099","authenticated-orcid":false,"given":"Jason H","family":"Moore","sequence":"additional","affiliation":[{"name":"Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania , Philadelphia, PA 19104, USA"}]}],"member":"286","published-online":{"date-parts":[[2019,6,4]]},"reference":[{"key":"2023013109502938300_btz470-B1","author":"Banzhaf","year":"1998"},{"key":"2023013109502938300_btz470-B2","first-page":"281","article-title":"Random search for hyper-parameter 
optimization","volume":"13","author":"Bergstra","year":"2012","journal-title":"J. Mach. Learn. Res"},{"key":"2023013109502938300_btz470-B3","doi-asserted-by":"crossref","first-page":"1319","DOI":"10.1038\/ng1479","article-title":"Polymorphisms in FKBP5 are associated with increased recurrence of depressive episodes and rapid response to antidepressant treatment","volume":"36","author":"Binder","year":"2004","journal-title":"Nat. Genet"},{"key":"2023013109502938300_btz470-B4","author":"Chen","year":"2018"},{"key":"2023013109502938300_btz470-B5","author":"Chen","year":"2016"},{"key":"2023013109502938300_btz470-B6","author":"Thornton","year":"2013"},{"key":"2023013109502938300_btz470-B7","doi-asserted-by":"crossref","first-page":"182","DOI":"10.1109\/4235.996017","article-title":"A fast and elitist multiobjective genetic algorithm: NSGA-II","volume":"6","author":"Deb","year":"2002","journal-title":"IEEE Trans. Evol. Comput"},{"key":"2023013109502938300_btz470-B8","first-page":"246","volume-title":"Lecture Notes in Computer Science","author":"de S\u00e1","year":"2017"},{"key":"2023013109502938300_btz470-B9","volume-title":"Introduction to Evolutionary Computing 1. ed., Corr. 2. Printing, Softcover Version of Original Hardcover ed. 2003","author":"Eiben","year":"2010"},{"key":"2023013109502938300_btz470-B10","doi-asserted-by":"crossref","first-page":"533","DOI":"10.4049\/jimmunol.163.1.533","article-title":"Increased apoptosis in patients with major depression: a preliminary study","volume":"163","author":"Eilat","year":"1999","journal-title":"J. Immunol"},{"key":"2023013109502938300_btz470-B11","author":"Brochu","year":"2010"},{"key":"2023013109502938300_btz470-B12","first-page":"2962","volume-title":"Advances in Neural Information Processing Systems 28","author":"Feurer","year":"2015"},{"key":"2023013109502938300_btz470-B13","first-page":"2171","article-title":"DEAP: evolutionary algorithms made easy","volume":"13","author":"Fortin","year":"2012","journal-title":"J. 
Mach. Learn. Res"},{"key":"2023013109502938300_btz470-B14","doi-asserted-by":"crossref","first-page":"1132","DOI":"10.21105\/joss.01132","article-title":"GAMA: genetic automated machine learning assistant","volume":"4","author":"Gijsbers","year":"2019","journal-title":"J. Open Source Softw"},{"key":"2023013109502938300_btz470-B15","doi-asserted-by":"crossref","DOI":"10.1007\/978-0-387-84858-7","volume-title":"The Elements of Statistical Learning: Data Mining, Inference, and Prediction","author":"Hastie","year":"2009","edition":"2nd edn"},{"key":"2023013109502938300_btz470-B16","author":"Himmelstein","year":"2019"},{"key":"2023013109502938300_btz470-B17","author":"Dewancker","year":"2016"},{"key":"2023013109502938300_btz470-B18","doi-asserted-by":"crossref","DOI":"10.1186\/s13041-018-0407-2","article-title":"Distribution of Caskin1 protein and phenotypic characterization of its knockout mice using a comprehensive behavioral test battery","volume":"11","author":"Katano","year":"2018","journal-title":"Mol. Brain"},{"key":"2023013109502938300_btz470-B19","first-page":"1","article-title":"Auto-WEKA 2.0: automatic model selection and hyperparameter optimization in WEKA","volume":"18","author":"Kotthoff","year":"2017","journal-title":"J. Mach. Learn. 
Res"},{"key":"2023013109502938300_btz470-B20","doi-asserted-by":"crossref","first-page":"244","DOI":"10.1038\/gene.2016.15","article-title":"An interaction quantitative trait loci tool implicates epistatic functional variants in an apoptosis pathway in smallpox vaccine eQTL data","volume":"17","author":"Lareau","year":"2016","journal-title":"Genes Immun"},{"key":"2023013109502938300_btz470-B21","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1186\/s13040-015-0040-x","article-title":"Differential co-expression network centrality and machine learning feature selection for identifying susceptibility hubs in networks with scale-free structure","volume":"8","author":"Lareau","year":"2015","journal-title":"BioData Min"},{"key":"2023013109502938300_btz470-B22","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1016\/j.jad.2010.02.113","article-title":"Variations in FKBP5 and BDNF genes are suggestively associated with depression in a Swedish population-based cohort","volume":"125","author":"Lavebratt","year":"2010","journal-title":"J. Affect. Disord"},{"key":"2023013109502938300_btz470-B23","first-page":"1358","article-title":"Integrated machine learning pipeline for aberrant biomarker enrichment (i-mAB): characterizing clusters of differentiation within a compendium of systemic lupus erythematosus patients","volume":"2018","author":"Le","year":"2018","journal-title":"AMIA Annu. Symp. Proc"},{"key":"2023013109502938300_btz470-B24","doi-asserted-by":"crossref","DOI":"10.1038\/s41398-018-0234-3","article-title":"Identification and replication of RNA-Seq gene network modules associated with depression severity","volume":"8","author":"Le","year":"2018","journal-title":"Transl. 
Psychiatry"},{"key":"2023013109502938300_btz470-B25","doi-asserted-by":"crossref","first-page":"1358","DOI":"10.1093\/bioinformatics\/bty788","article-title":"STatistical Inference Relief (STIR) feature selection","volume":"35","author":"Le","year":"2019","journal-title":"Bioinformatics"},{"key":"2023013109502938300_btz470-B26","doi-asserted-by":"crossref","first-page":"510","DOI":"10.1016\/j.biopsych.2014.07.029","article-title":"Genetic studies of major depressive disorder: why are there no genome-wide association study findings and what can we do about it?","volume":"76","author":"Levinson","year":"2014","journal-title":"Biol. Psychiatry"},{"key":"2023013109502938300_btz470-B27","first-page":"41","article-title":"A meta-analysis examining clinical predictors of hippocampal volume in patients with major depressive disorder","volume":"34","author":"McKinnon","year":"2009","journal-title":"J. Psychiatry Neurosci"},{"key":"2023013109502938300_btz470-B28","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1162\/evco.1995.3.2.199","article-title":"Strongly typed genetic programming","volume":"3","author":"Montana","year":"1995","journal-title":"Evol. Comput"},{"key":"2023013109502938300_btz470-B29","doi-asserted-by":"crossref","first-page":"1267","DOI":"10.1038\/mp.2013.161","article-title":"Type I interferon signaling genes in recurrent major depression: increased expression detected by whole-blood RNA sequencing","volume":"19","author":"Mostafavi","year":"2014","journal-title":"Mol. Psychiatry"},{"key":"2023013109502938300_btz470-B30","author":"Olson","year":"2016"},{"key":"2023013109502938300_btz470-B31","first-page":"192","article-title":"Data-driven advice for applying machine learning to bioinformatics problems","volume":"23","author":"Olson","year":"2018","journal-title":"Pac. Symp. 
Biocomput"},{"key":"2023013109502938300_btz470-B32","doi-asserted-by":"crossref","DOI":"10.1186\/s13040-017-0154-4","article-title":"PMLB: a large benchmark suite for machine learning evaluation and comparison","volume":"10","author":"Olson","year":"2017","journal-title":"BioData Mining"},{"key":"2023013109502938300_btz470-B33","article-title":"Scikit-learn: machine Learning in Python","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"2023013109502938300_btz470-B34","author":"Olson","year":"2016"},{"key":"2023013109502938300_btz470-B35","doi-asserted-by":"crossref","first-page":"378","DOI":"10.1007\/978-3-319-64185-0_28","volume-title":"Digital Forensics and Watermarking","author":"Ren","year":"2017"},{"key":"2023013109502938300_btz470-B36","doi-asserted-by":"crossref","first-page":"1011","DOI":"10.1176\/appi.ajp.2009.08121760","article-title":"A molecular signature of depression in the amygdala","volume":"166","author":"Sibille","year":"2009","journal-title":"Am. J. Psychiatry"},{"key":"2023013109502938300_btz470-B37","author":"Sohn","year":"2017"},{"key":"2023013109502938300_btz470-B38","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.brainres.2009.06.036","article-title":"Modulation of glucocorticoid receptor nuclear translocation in neurons by immunophilins FKBP51 and FKBP52: implications for major depressive disorder","volume":"1286","author":"Tatro","year":"2009","journal-title":"Brain Res"},{"key":"2023013109502938300_btz470-B39","doi-asserted-by":"crossref","DOI":"10.1038\/s41598-017-06522-3","article-title":"High-coverage whole-exome sequencing identifies candidate genes for suicide in victims with major depressive disorder","volume":"7","author":"Tomb\u00e1cz","year":"2017","journal-title":"Sci. 
Rep"},{"key":"2023013109502938300_btz470-B40","doi-asserted-by":"crossref","first-page":"1168.","DOI":"10.3390\/en10081168","article-title":"Short-term load forecasting using EMD-LSTM neural networks with a Xgboost algorithm for feature importance evaluation","volume":"10","author":"Zheng","year":"2017","journal-title":"Energies"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btz470\/28862658\/btz470.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/1\/250\/48981536\/bioinformatics_36_1_250.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/1\/250\/48981536\/bioinformatics_36_1_250.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,31]],"date-time":"2023-01-31T13:32:49Z","timestamp":1675171969000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/36\/1\/250\/5511404"}},"subtitle":[],"editor":[{"given":"Janet","family":"Kelso","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2019,6,4]]},"references-count":40,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,1,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btz470","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/502484","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,1,1]]},"published":{"date-parts":[[2019,6,4]]}}}