{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T18:29:07Z","timestamp":1778696947116,"version":"3.51.4"},"reference-count":20,"publisher":"Oxford University Press (OUP)","issue":"18","license":[{"start":{"date-parts":[[2019,1,30]],"date-time":"2019-01-30T00:00:00Z","timestamp":1548806400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health Grants","doi-asserted-by":"crossref","award":["U41 HG006620"],"award-info":[{"award-number":["U41 HG006620"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health Grants","doi-asserted-by":"crossref","award":["R01 AI134384-01"],"award-info":[{"award-number":["R01 AI134384-01"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation Grant","doi-asserted-by":"crossref","award":["1661497"],"award-info":[{"award-number":["1661497"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Penn State College of Engineering Multidisciplinary Seed Grant Program"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2019,9,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>One of the many technical challenges that arises when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to an inefficient use of computational infrastructure. 
Over-allocation locks resources that could otherwise be used for other analyses. Under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>Here we introduce the Galaxy job run dataset and test popular machine learning models on the task of resource usage prediction. We include three popular forest models: the extra trees regressor, the gradient boosting regressor and the random forest regressor, and find that random forests perform best in the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions, and grant the ability to improve performance by changing the confidence of the estimates. However, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals with an accuracy that is comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which, as far as we know, has not been done before. 
Such estimation can be highly beneficial for accurate resource allocation.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>Source code available at https:\/\/github.com\/atyryshkina\/algorithm-performance-analysis, implemented in Python.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btz054","type":"journal-article","created":{"date-parts":[[2019,1,26]],"date-time":"2019-01-26T05:54:08Z","timestamp":1548482048000},"page":"3453-3460","source":"Crossref","is-referenced-by-count":18,"title":["Predicting runtimes of bioinformatics tools based on historical data: five years of Galaxy usage"],"prefix":"10.1093","volume":"35","author":[{"given":"Anastasia","family":"Tyryshkina","sequence":"first","affiliation":[{"name":"Huck Institute of Life Sciences, Neuroscience Program, The Pennsylvania State University, University Park , USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nate","family":"Coraor","sequence":"additional","affiliation":[{"name":"Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park , USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Anton","family":"Nekrutenko","sequence":"additional","affiliation":[{"name":"Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park , USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2019,1,30]]},"reference":[{"key":"2023013108053924100_btz054-B1","doi-asserted-by":"crossref","first-page":"W3","DOI":"10.1093\/nar\/gkw343","article-title":"The Galaxy platform for accessible, 
reproducible and collaborative biomedical analyses: 2016 update","volume":"44","author":"Afgan","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023013108053924100_btz054-B2","first-page":"1","volume-title":"2013 26th IEEE Canadian Conference on Electrical and Computer Engineering (CCECE)","author":"Bankole","year":"2013"},{"key":"2023013108053924100_btz054-B3","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1002\/0471142727.mb1910s89","article-title":"Galaxy, a web-based genome analysis tool for experimentalists","volume":"89","author":"Blankenberg","year":"2010","journal-title":"Curr. Protoc. Mol. Biol"},{"key":"2023013108053924100_btz054-B4","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random Forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn"},{"key":"2023013108053924100_btz054-B5","first-page":"339","author":"Duan","year":"2009"},{"key":"2023013108053924100_btz054-B6","doi-asserted-by":"crossref","first-page":"1189","DOI":"10.1214\/aos\/1013203451","article-title":"Greedy function approximation: a gradient boosting machine","volume":"29","author":"Friedman","year":"2001","journal-title":"Ann. Stat"},{"key":"2023013108053924100_btz054-B7","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1007\/s10994-006-6226-1","article-title":"Extremely randomized trees","volume":"63","author":"Geurts","year":"2006","journal-title":"Mach. 
Learn"},{"key":"2023013108053924100_btz054-B8","doi-asserted-by":"crossref","first-page":"R86.","DOI":"10.1186\/gb-2010-11-8-r86","article-title":"Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences","volume":"11","author":"Goecks","year":"2010","journal-title":"Genome Biol"},{"key":"2023013108053924100_btz054-B9","first-page":"9","author":"Gong","year":"2010"},{"key":"2023013108053924100_btz054-B10","first-page":"13","author":"Gupta","year":"2008"},{"key":"2023013108053924100_btz054-B11","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1007\/11889205_17","volume-title":"Principles and Practice of Constraint Programming - CP 2006","author":"Hutter","year":"2006"},{"key":"2023013108053924100_btz054-B12","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1016\/j.artint.2013.10.003","article-title":"Algorithm runtime prediction: methods & evaluation","volume":"206","author":"Hutter","year":"2014","journal-title":"Artif. Intell"},{"key":"2023013108053924100_btz054-B13","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1016\/j.future.2011.05.027","article-title":"Empirical prediction models for adaptive resource provisioning in the cloud","volume":"28","author":"Islam","year":"2012","journal-title":"Future Gener. Comput. Syst"},{"key":"2023013108053924100_btz054-B14","first-page":"495","author":"Matsunaga","year":"2010"},{"key":"2023013108053924100_btz054-B15","first-page":"983","article-title":"Quantile Regression Forests","volume":"7","author":"Meinshausen","year":"2006","journal-title":"J. Mach. Learn. Res"},{"key":"2023013108053924100_btz054-B16","first-page":"316","author":"Nadeem","year":"2009"},{"key":"2023013108053924100_btz054-B17","first-page":"2825","article-title":"Scikit-learn: machine Learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. 
Res"},{"key":"2023013108053924100_btz054-B18","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1007\/11508380_24","volume-title":"Advances in Grid Computing - EGC 2005","author":"Phinjaroenphan","year":"2005"},{"key":"2023013108053924100_btz054-B19","first-page":"111","author":"Sonmez","year":"2009"},{"key":"2023013108053924100_btz054-B20","first-page":"413","author":"Ting","year":"2008"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/18\/3453\/48975362\/bioinformatics_35_18_3453.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/18\/3453\/48975362\/bioinformatics_35_18_3453.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,31]],"date-time":"2023-01-31T13:43:32Z","timestamp":1675172612000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/35\/18\/3453\/5304359"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2019,1,30]]},"references-count":20,"journal-issue":{"issue":"18","published-print":{"date-parts":[[2019,9,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btz054","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2019,9,15]]},"published":{"date-parts":[[2019,1,30]]}}}