{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,12]],"date-time":"2026-01-12T22:23:10Z","timestamp":1768256590693,"version":"3.49.0"},"reference-count":20,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2019,11,12]],"date-time":"2019-11-12T00:00:00Z","timestamp":1573516800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2019,11,12]],"date-time":"2019-11-12T00:00:00Z","timestamp":1573516800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2019,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to lack of accessible tools that scale to hundreds of millions of data points.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake in order to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used with simple command line calls entirely making it accessible to a broad user base. PyBDA is available at<jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/pybda.rtfd.io\">https:\/\/pybda.rtfd.io<\/jats:ext-link>.<\/jats:p><\/jats:sec>","DOI":"10.1186\/s12859-019-3087-8","type":"journal-article","created":{"date-parts":[[2019,11,12]],"date-time":"2019-11-12T08:03:18Z","timestamp":1573545798000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["PyBDA: a command line tool for automated analysis of big biological data sets"],"prefix":"10.1186","volume":"20","author":[{"given":"Simon","family":"Dirmeier","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mario","family":"Emmenlauer","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Christoph","family":"Dehio","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Niko","family":"Beerenwinkel","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2019,11,12]]},"reference":[{"key":"3087_CR1","doi-asserted-by":"publisher","first-page":"37","DOI":"10.1016\/j.spl.2018.02.016","volume":"136","author":"P B\u00fchlmann","year":"2018","unstructured":"B\u00fchlmann P, van de Geer S. Statistics for big data: A perspective. Stat Probab Lett. 2018; 136:37\u201341.","journal-title":"Stat Probab Lett"},{"key":"3087_CR2","doi-asserted-by":"publisher","unstructured":"Katal A, Wazid M, Goudar RH. Big Data: Issues, Challenges, Tools and Good Practices. In: 2013 Sixth International Conference on Contemporary Computing (IC3). IEEE: 2013. p. 404\u20139. https:\/\/doi.org\/10.1109\/IC3.2013.661222.","DOI":"10.1109\/IC3.2013.661222"},{"key":"3087_CR3","doi-asserted-by":"crossref","unstructured":"Marx V. The big challenges of big data. Nature 498. 2013.","DOI":"10.1038\/498255a"},{"issue":"1","key":"3087_CR4","doi-asserted-by":"publisher","first-page":"15","DOI":"10.1186\/s13059-017-1382-0","volume":"19","author":"FA Wolf","year":"2018","unstructured":"Wolf FA, Angerer P, Theis FJ. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol. 2018; 19(1):15.","journal-title":"Genome Biol"},{"issue":"8","key":"3087_CR5","first-page":"098","volume":"7","author":"R Guo","year":"2018","unstructured":"Guo R, Zhao Y, Zou Q, Fang X, Peng S. Bioinformatics applications on Apache Spark. GigaScience. 2018; 7(8):098.","journal-title":"GigaScience"},{"issue":"1","key":"3087_CR6","first-page":"1235","volume":"17","author":"X Meng","year":"2016","unstructured":"Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S, et al.Mllib: Machine learning in Apache Spark. J Mach Learn Res. 2016; 17(1):1235\u201341.","journal-title":"J Mach Learn Res"},{"issue":"11","key":"3087_CR7","doi-asserted-by":"publisher","first-page":"56","DOI":"10.1145\/2934664","volume":"59","author":"M Zaharia","year":"2016","unstructured":"Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, et al.Apache Spark: A unified engine for big data processing. Commun ACM. 2016; 59(11):56\u201365.","journal-title":"Commun ACM"},{"issue":"Oct","key":"3087_CR8","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al.Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011; 12(Oct):2825\u201330.","journal-title":"J Mach Learn Res"},{"issue":"1","key":"3087_CR9","first-page":"5938","volume":"17","author":"B Bischl","year":"2016","unstructured":"Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones ZM. mlr: Machine Learning in R. J Mach Learn Res. 2016; 17(1):5938\u201342.","journal-title":"J Mach Learn Res"},{"issue":"19","key":"3087_CR10","doi-asserted-by":"publisher","first-page":"2520","DOI":"10.1093\/bioinformatics\/bts480","volume":"28","author":"J K\u00f6ster","year":"2012","unstructured":"K\u00f6ster J, Rahmann S. Snakemake\u2014a scalable bioinformatics workflow engine. Bioinformatics. 2012; 28(19):2520\u20132.","journal-title":"Bioinformatics"},{"key":"3087_CR11","unstructured":"Abadi M, Agarwal A, Barham P, Brevdo E, Citro C, Corrado GS, Davis A, Dean J, Devin M, et al.TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467. 2016."},{"key":"3087_CR12","unstructured":"H, 2O.ai. Python Interface for H2O. 2019. Python module version 3.26.0.2. https:\/\/github.com\/h2oai\/h2o-3."},{"key":"3087_CR13","unstructured":"Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A. Automatic Differentiation in PyTorch. In: NIPS Autodiff Workshop: 2017."},{"key":"3087_CR14","unstructured":"Chollet F, et al.Keras. 2015. https:\/\/keras.io."},{"key":"3087_CR15","unstructured":"Tran D, Kucukelbir A, Dieng AB, Rudolph M, Liang D, Blei DM. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787. 2016."},{"key":"3087_CR16","doi-asserted-by":"publisher","first-page":"55","DOI":"10.7717\/peerj-cs.55","volume":"2","author":"J Salvatier","year":"2016","unstructured":"Salvatier J, Wiecki TV, Fonnesbeck C. Probabilistic programming in Python using PyMC3. PeerJ Comput Sci. 2016; 2:55.","journal-title":"PeerJ Comput Sci"},{"issue":"1","key":"3087_CR17","first-page":"1299","volume":"18","author":"DG Matthews","year":"2017","unstructured":"Matthews DG, Alexander G, Van Der Wilk M, Nickson T, Fujii K, Boukouvalas A, Le\u00f3n-Villagr\u00e1 P, Ghahramani Z, Hensman J. GPflow: A Gaussian Process Library using TensorFlow. J Mach Learn Res. 2017; 18(1):1299\u2013304.","journal-title":"J Mach Learn Res"},{"key":"3087_CR18","doi-asserted-by":"crossref","unstructured":"Golding N. Greta: Simple and Scalable Statistical Modelling in R. 2018. R package version 0.3.0. https:\/\/CRAN.R-project.org\/package=greta.","DOI":"10.32614\/CRAN.package.greta"},{"key":"3087_CR19","unstructured":"Pafka S. benchm-ml. GitHub. 2019. https:\/\/github.com\/szilard\/benchm-ml\/tree\/941dfd4ebab3854b3a49fd70c192ecf21e483267."},{"issue":"1","key":"3087_CR20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s41044-016-0020-2","volume":"2","author":"D Garc\u00eda-Gil","year":"2017","unstructured":"Garc\u00eda-Gil D, Ram\u00edrez-Gallego S, Garc\u00eda S, Herrera F. A comparison on scalability for batch big data processing on Apache Spark and Apache Flink. Big Data Analytics. 2017; 2(1):1.","journal-title":"Big Data Analytics"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-019-3087-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/s12859-019-3087-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-019-3087-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,26]],"date-time":"2024-07-26T19:15:52Z","timestamp":1722021352000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-019-3087-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,11,12]]},"references-count":20,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,12]]}},"alternative-id":["3087"],"URL":"https:\/\/doi.org\/10.1186\/s12859-019-3087-8","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,11,12]]},"assertion":[{"value":"29 April 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 September 2019","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 November 2019","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Not applicable.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"564"}}