{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,4,26]],"date-time":"2024-04-26T06:29:18Z","timestamp":1714112958591},"reference-count":15,"publisher":"F1000 Research Ltd","license":[{"start":{"date-parts":[[2013,11,15]],"date-time":"2013-11-15T00:00:00Z","timestamp":1384473600000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/3.0\/"}],"content-domain":{"domain":["f1000research.com"],"crossmark-restriction":false},"short-container-title":["F1000Res"],"abstract":"<ns4:p>Modern sequencing platforms generate enormous quantities of data in ever-decreasing amounts of time. Additionally, techniques such as multiplex sequencing allow one run to contain hundreds of different samples. With such data comes a significant challenge to understand its quality and to understand how the quality and yield are changing across instruments and over time. As well as the desire to understand historical data, sequencing centres often have a duty to provide clear summaries of individual run performance to collaborators or customers. We present StatsDB, an open-source software package for storage and analysis of next generation sequencing run metrics. The system has been designed for incorporation into a primary analysis pipeline, either at the programmatic level or via integration into existing user interfaces. Statistics are stored in an SQL database and APIs provide the ability to store and access the data while abstracting the underlying database design. This abstraction allows simpler, wider querying across multiple fields than is possible by the manual steps and calculation required to dissect individual reports, e.g. \u201dprovide metrics about nucleotide bias in libraries using adaptor barcode X, across all runs on sequencer A, within the last month\u201d. 
The software is supplied with modules for storage of statistics from FastQC, a commonly used tool for analysis of sequence reads, but the open nature of the database schema means it can be easily adapted to other tools. Currently at The Genome Analysis Centre (TGAC), reports are accessed through our LIMS system or through a standalone GUI tool, but the API and supplied examples make it easy to develop custom reports and to interface with other packages.<\/ns4:p>","DOI":"10.12688\/f1000research.2-248.v1","type":"journal-article","created":{"date-parts":[[2013,11,15]],"date-time":"2013-11-15T13:47:15Z","timestamp":1384523235000},"page":"248","update-policy":"http:\/\/dx.doi.org\/10.12688\/f1000research.crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["StatsDB: platform-agnostic storage and understanding of next generation sequencing run metrics"],"prefix":"10.12688","volume":"2","author":[{"given":"Ricardo H.","family":"Ramirez-Gonzalez","sequence":"first","affiliation":[]},{"given":"Richard M.","family":"Leggett","sequence":"additional","affiliation":[]},{"given":"Darren","family":"Waite","sequence":"additional","affiliation":[]},{"given":"Anil","family":"Thanki","sequence":"additional","affiliation":[]},{"given":"Nizar","family":"Drou","sequence":"additional","affiliation":[]},{"given":"Mario","family":"Caccamo","sequence":"additional","affiliation":[]},{"ORCID":"http:\/\/orcid.org\/0000-0002-5589-7754","authenticated-orcid":false,"given":"Robert","family":"Davey","sequence":"additional","affiliation":[]}],"member":"2560","published-online":{"date-parts":[[2013,11,15]]},"reference":[{"key":"ref-1","doi-asserted-by":"publisher","first-page":"240-248","DOI":"10.1101\/gr.5681207","article-title":"Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers.","volume":"17","author":"M Miller","year":"2007","journal-title":"Genome 
Res."},{"key":"ref-2","doi-asserted-by":"publisher","first-page":"e3376","DOI":"10.1371\/journal.pone.0003376","article-title":"Rapid SNP discovery and genetic mapping using sequenced RAD markers.","volume":"3","author":"N Baird","year":"2008","journal-title":"PLoS One."},{"key":"ref-3","article-title":"FastQC: A quality control tool for high throughput sequence data","author":"S Andrews"},{"key":"ref-4","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1186\/1471-2105-14-33","article-title":"Htqc: a fast quality control toolkit for Illumina sequencing data.","volume":"14","author":"X Yang","year":"2013","journal-title":"BMC Bioinformatics."},{"key":"ref-5","doi-asserted-by":"publisher","first-page":"863-864","DOI":"10.1093\/bioinformatics\/btr026","article-title":"Quality control and preprocessing of metagenomic datasets.","volume":"27","author":"R Schmieder","year":"2011","journal-title":"Bioinformatics."},{"key":"ref-6","doi-asserted-by":"publisher","first-page":"S7","DOI":"10.1186\/1471-2164-11-S4-S7","article-title":"Ngsqc: cross-platform quality analysis pipeline for deep sequencing data.","volume":"11","author":"M Dai","year":"2010","journal-title":"BMC Genomics."},{"key":"ref-7","article-title":"QRQC - quick read quality control","author":"V Buffalo"},{"key":"ref-8","doi-asserted-by":"publisher","first-page":"130-131","DOI":"10.1093\/bioinformatics\/btq614","article-title":"Samstat: monitoring biases in next generation sequencing data.","volume":"27","author":"T Lassmann","year":"2011","journal-title":"Bioinformatics."},{"key":"ref-9","article-title":"stsPlots","author":"M Ashby"},{"key":"ref-10","article-title":"PacBio Exploratory Data Analysis","author":"T Skelly"},{"key":"ref-11","article-title":"MISO: An open-source LIMS for small-to-large scale sequencing centres","author":"R Davey"},{"key":"ref-12","article-title":"Perl DBI. 
Perl DBI"},{"key":"ref-13","article-title":"The Apache Software Foundation"},{"key":"ref-14","article-title":"The Genome Analysis Centre"},{"key":"ref-15","article-title":"D3.js - Data-Driven Documents","year":"2012"}],"container-title":["F1000Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/f1000research.com\/articles\/2-248\/v1\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/f1000research.com\/articles\/2-248\/v1\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/f1000research.com\/articles\/2-248\/v1\/iparadigms","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2018,12,3]],"date-time":"2018-12-03T13:26:07Z","timestamp":1543843567000},"score":1,"resource":{"primary":{"URL":"https:\/\/f1000research.com\/articles\/2-248\/v1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,11,15]]},"references-count":15,"URL":"http:\/\/dx.doi.org\/10.12688\/f1000research.2-248.v1","relation":{"has-review":[{"id-type":"doi","id":"10.5256\/f1000research.2894.r2467","asserted-by":"subject"},{"id-type":"doi","id":"10.5256\/f1000research.2894.r2794","asserted-by":"subject"},{"id-type":"doi","id":"10.5256\/f1000research.2894.r2565","asserted-by":"subject"}]},"ISSN":["2046-1402"],"issn-type":[{"value":"2046-1402","type":"electronic"}],"subject":[],"published":{"date-parts":[[2013,11,15]]},"assertion":[{"value":"Indexed","URL":"https:\/\/f1000research.com\/articles\/2-248\/v1#article-reports","order":0,"name":"referee-status","label":"Referee status","group":{"name":"current-referee-status","label":"Current Referee Status"}},{"value":"10.5256\/f1000research.2894.r2467, Mick Watson, ARK-Genomics, University of Edinburgh, Edinburgh, UK, 10 Dec 2013, version 1, 
indexed","URL":"https:\/\/f1000research.com\/articles\/2-248\/v1#referee-response-2467","order":0,"name":"referee-response-2467","label":"Referee Report","group":{"name":"article-reports","label":"Article Reports"}},{"value":"<b>Robert Davey<\/b>; \n<i>Posted: 24 Jan 2014<\/i>; We would like to thank Mick for taking the time to read and comment on our manuscript. Admittedly, there is much interplay when considering terminology such as &quot;run&quot; and &quot;analysis&quot;, but throughout the paper we use the term &quot;run&quot; to represent a sequencing run, and similarly, the term &quot;analysis&quot; to represent a QC process. We feel this is adequate given the focus of the paper, but we have clarified one potential misuse (changed &quot;run&quot; to &quot;carried out&quot;, where appropriate). Yes, the field refers to these terms regularly, given that &quot;long read&quot; sequencers are available and distinct from their &quot;short-read&quot; counterparts, i.e. we consider the new 2x300bp and upcoming 2x400bp Illumina techniques to still be &quot;short-read&quot;. As long as the data produced from non-Illumina machines is in the FASTQ format (as we state in the existing Use Case text), then there are no differences from the method outlined in the paper, e.g. FASTQC output parsed and stored in StatsDB. Where differences may exist, e.g. STS files from PacBio, different parsers can be written to accommodate this, and we are working on producing such a parser. 
By no means is StatsDB inherently tied to the FASTQ format, but we feel this reflects the most common QC methods available currently.","URL":"https:\/\/f1000research.com\/articles\/2-248\/v1#referee-comment-679","order":1,"name":"referee-comment-679","label":"Referee Comment","group":{"name":"article-reports","label":"Article Reports"}},{"value":"10.5256\/f1000research.2894.r2565, Cyriac Kandoth, The Genome Institute, Washington University, St Louis, MO, USA, 16 Dec 2013, version 1, indexed","URL":"https:\/\/f1000research.com\/articles\/2-248\/v1#referee-response-2565","order":2,"name":"referee-response-2565","label":"Referee Report","group":{"name":"article-reports","label":"Article Reports"}},{"value":"<b>Robert Davey<\/b>; \n<i>Posted: 24 Jan 2014<\/i>; We would like to thank Cyriac for taking the time to read and comment on our manuscript. In terms of the Use Case expansion, we believe the usefulness is inherent in the ability to store, retrieve and therefore compare historical run metrics, based on a variety of user- or analysis-specific attributes. We envisage tools like StatsDB Reporter will be the simplest and more widely-used interface with StatsDB for the average lab technician or bioinformatician. Similarly, we believe performance metrics at this level would be unhelpful rather than beneficial, as these would be very dependent on infrastructure, hardware and DBMS used. We aim to publish incremental updates to the API and surrounding tools, e.g. a full StatsDB Reporter release, and a Python API. The StatsDB schema was designed to allow the same analysis to be stored multiple times, with potentially differing or identical parameters. We foresee that if analysis uniqueness is required, this would be down to an API implementation to check for previous analyses with the same tool and parameters supplied, rather than at the database level. 
We have added a description to the Database Design section (analysis and latest_run table outlines) of the manuscript. The pipeline that utilises StatsDB has been covered in detail in our recent open-access FrontiersIn publication, so we have added a reference to this paper in the Use Case main text. The GPLv3 aims to give free software developers an advantage over proprietary developers. A parser that supports a proprietary format does not fit within the scope of our vision - we aim to continue the trend in bioinformatics software whereby fully open-source licences are preferred to maximise the reusability of a given tool or library. Yes, you are correct in using value_type for this kind of attribute. There are API calls that let you pull out analyses by a value_type value, so this should be supported out of the box. The accidental whitespace has been introduced in the HTML view, from which the PDF download is generated. We shall contact the editorial office to ensure these are corrected. Thank you for spotting those!","URL":"https:\/\/f1000research.com\/articles\/2-248\/v1#referee-comment-678","order":3,"name":"referee-comment-678","label":"Referee Comment","group":{"name":"article-reports","label":"Article Reports"}},{"value":"10.5256\/f1000research.2894.r2794, Anuj Kumar, Department of Molecular, Cellular & Developmental Biology, University of Michigan, Ann Arbor, MI, USA, 27 Dec 2013, version 1, indexed","URL":"https:\/\/f1000research.com\/articles\/2-248\/v1#referee-response-2794","order":4,"name":"referee-response-2794","label":"Referee Report","group":{"name":"article-reports","label":"Article Reports"}},{"value":"<b>Robert Davey<\/b>; \n<i>Posted: 20 Jan 2014<\/i>; We would like to thank Anuj for taking the time to read and comment on our manuscript. We have added a sentence into the Introduction, outlining briefly the data formats outputted by FASTQC, i.e. 
a set of HTML files, and a single plain-text flat file from which we parse the data to be loaded into StatsDB. We envisage no differences in scaling terms between a small-scale lab with one or two sequencers and a large multi-platform centre. StatsDB has applications in the smaller centre where quick but potentially more sporadic access to historical run data would be investigated through the StatsDB Reporter tool rather than the more &quot;heavyweight&quot; integration with a LIMS or via a web server. As such, we have added a sentence describing its relevance in this context. We will be publishing the StatsDB Reporter application separately, or as a software update to this publication, in due course. We have added a short overview of how the output of QC tools might be kept in different centres, highlighting the usefulness of StatsDB.","URL":"https:\/\/f1000research.com\/articles\/2-248\/v1#referee-comment-676","order":5,"name":"referee-comment-676","label":"Referee Comment","group":{"name":"article-reports","label":"Article Reports"}},{"value":"The development of StatsDB has been funded by a Biotechnology and Biological Sciences Research Council (BBSRC) National Capability Grant at TGAC.","order":6,"name":"grant-information","label":"Grant Information"},{"value":"This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated with the article are available under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original data is properly cited.","order":0,"name":"copyright-info","label":"Copyright"}]}}