{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T05:08:55Z","timestamp":1772168935601,"version":"3.50.1"},"reference-count":11,"publisher":"F1000 Research Ltd","license":[{"start":{"date-parts":[[2020,5,19]],"date-time":"2020-05-19T00:00:00Z","timestamp":1589846400000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000092","name":"National Library of Medicine, National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100014989","name":"Chan Zuckerberg Initiative","doi-asserted-by":"publisher","award":["2018-182626"],"award-info":[{"award-number":["2018-182626"]}],"id":[{"id":"10.13039\/100014989","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["f1000research.com"],"crossmark-restriction":false},"short-container-title":["F1000Res"],"abstract":"<ns4:p>\n                    The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations.\u00a0 Despite its promise, reuse and re-analysis of SRA data has been challenged by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that utilize the MetaSRA for building structured datasets from the SRA in order to facilitate secondary analyses of the SRA\u2019s human RNA-seq data. 
The first tool, called the\n                    <ns4:italic>Case-Control Finder<\/ns4:italic>\n                    , finds suitable case and control samples for a given disease or condition where the cases and controls are matched by tissue or cell type.\u00a0 The second tool, called the\n                    <ns4:italic>Series Finder<\/ns4:italic>\n                    , finds ordered sets of samples for the purpose of addressing biological questions pertaining to changes over a numerical property such as time. These tools were the result of a three-day-long NCBI Codeathon in March 2019 held at the University of North Carolina at Chapel Hill.\n                  <\/ns4:p>","DOI":"10.12688\/f1000research.23180.1","type":"journal-article","created":{"date-parts":[[2020,5,19]],"date-time":"2020-05-19T05:55:13Z","timestamp":1589867713000},"page":"376","update-policy":"https:\/\/doi.org\/10.12688\/f1000research.crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive"],"prefix":"10.12688","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1810-5252","authenticated-orcid":false,"given":"Matthew N.","family":"Bernstein","sequence":"first","affiliation":[]},{"given":"Ariella","family":"Gladstein","sequence":"additional","affiliation":[]},{"given":"Khun Zaw","family":"Latt","sequence":"additional","affiliation":[]},{"given":"Emily","family":"Clough","sequence":"additional","affiliation":[]},{"given":"Ben","family":"Busby","sequence":"additional","affiliation":[]},{"given":"Allissa","family":"Dillman","sequence":"additional","affiliation":[]}],"member":"2560","published-online":{"date-parts":[[2020,5,19]]},"reference":[{"key":"ref-1","doi-asserted-by":"publisher","first-page":"25-38","DOI":"10.7171\/jbt.18-2902-002","article-title":"The Cellosaurus, a Cell-Line Knowledge Resource.","volume":"29","author":"A Bairoch","year":"2018","journal-title":"J Biomol 
Tech."},{"key":"ref-2","doi-asserted-by":"publisher","first-page":"R21","DOI":"10.1186\/gb-2005-6-2-r21","article-title":"An ontology for cell types.","volume":"6","author":"J Bard","year":"2005","journal-title":"Genome Biol."},{"key":"ref-3","article-title":"mbernste\/hypothesis-driven-SRA-queries: First release (Version v1.0.0).","author":"M Bernstein","year":"2020","journal-title":"Zenodo."},{"key":"ref-4","doi-asserted-by":"publisher","first-page":"2914-2923","DOI":"10.1093\/bioinformatics\/btx334","article-title":"MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive.","volume":"33","author":"M Bernstein","year":"2017","journal-title":"Bioinformatics."},{"key":"ref-5","doi-asserted-by":"publisher","first-page":"190021","DOI":"10.1038\/sdata.2019.21","article-title":"The variable quality of metadata about biological samples used in biomedical experiments.","volume":"6","author":"R Gon\u00e7alves","year":"2019","journal-title":"Scientific Data."},{"key":"ref-6","doi-asserted-by":"publisher","first-page":"90-95","DOI":"10.1109\/MCSE.2007.55","article-title":"Matplotlib: A 2D graphics environment.","volume":"9","author":"J Hunter","year":"2007","journal-title":"Comput Sci Eng."},{"key":"ref-7","doi-asserted-by":"publisher","first-page":"D19-21","DOI":"10.1093\/nar\/gkq1019","article-title":"The Sequence Read Archive.","volume":"39","author":"R Leinonen","year":"2011","journal-title":"Nucleic Acids Res."},{"key":"ref-8","doi-asserted-by":"publisher","first-page":"1112-1118","DOI":"10.1093\/bioinformatics\/btq099","article-title":"Modeling sample variables with an Experimental Factor Ontology.","volume":"26","author":"J Malone","year":"2010","journal-title":"Bioinformatics."},{"key":"ref-9","article-title":"pandas: a foundational Python library for data analysis and statistics.","volume":"14","author":"W McKinney","year":"2011","journal-title":"Python for High Performance and Scientific 
Computing."},{"key":"ref-10","doi-asserted-by":"publisher","first-page":"R5","DOI":"10.1186\/gb-2012-13-1-r5","article-title":"Uberon, an integrative multi-species anatomy ontology.","volume":"13","author":"C Mungall","year":"2012","journal-title":"Genome Biol."},{"key":"ref-11","doi-asserted-by":"publisher","first-page":"D955-D962","DOI":"10.1093\/nar\/gky1032","article-title":"Human Disease Ontology 2018 update: classification, content and workflow expansion.","volume":"47","author":"L Schriml","year":"2019","journal-title":"Nucleic Acids Res."}],"updated-by":[{"DOI":"10.12688\/f1000research.23180.2","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2020,8,4]],"date-time":"2020-08-04T00:00:00Z","timestamp":1596499200000}}],"container-title":["F1000Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/f1000research.com\/articles\/9-376\/v1\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/f1000research.com\/articles\/9-376\/v1\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/f1000research.com\/articles\/9-376\/v1\/iparadigms","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2020,8,4]],"date-time":"2020-08-04T07:55:14Z","timestamp":1596527714000},"score":1,"resource":{"primary":{"URL":"https:\/\/f1000research.com\/articles\/9-376\/v1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,19]]},"references-count":11,"URL":"https:\/\/doi.org\/10.12688\/f1000research.23180.1","relation":{"has-review":[{"id-type":"doi","id":"10.5256\/f1000research.25586.r63613","asserted-by":"subject"},{"id-type":"doi","id":"10.5256\/f1000research.25586.r63614","asserted-by":"subject"},{"id-type":"doi","id":"10.5256\/f1000research.25586.r63613","asserted-by":"object"},{"id-type":"doi","id":"10.5256\/f1
000research.25586.r63614","asserted-by":"object"}]},"ISSN":["2046-1402"],"issn-type":[{"value":"2046-1402","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,5,19]]},"assertion":[{"value":"Approved with reservations, Approved with reservations","URL":"https:\/\/f1000research.com\/articles\/9-376\/v1#article-reports","order":0,"name":"referee-status","label":"Referee status","group":{"name":"current-referee-status","label":"Current Referee Status"}},{"value":"10.5256\/f1000research.25586.r63613, Zichen Wang, Sema4, Stamford, CT, USA, 01 Jun 2020, version 1, 2 approved with reservations","URL":"https:\/\/f1000research.com\/articles\/9-376\/v1#referee-response-63613","order":0,"name":"referee-response-63613","label":"Referee Report","group":{"name":"article-reports","label":"Article Reports"}},{"value":"<b>Matthew Bernstein<\/b>; \n<i>Posted: 23 Jul 2020<\/i>; We greatly appreciate the reviewer's valuable feedback. Please find our responses to each point below:1. Within the abstract we now point the reader to the tools\u2019 Github repository, which describes how the tools can be executed either locally or in the cloud via Google Colab.2.&nbsp;We have set up Google Colab notebooks to run these tools in the cloud. Links to the notebooks are found within the README in the Github repository.3.&nbsp;We thank you for this suggestion. We have updated the tools to now accept both ontology term names (i.e. free text) as well as ontology term ID\u2019s.4.&nbsp;We agree that using the MetaSRA\u2019s API would be a great idea; however, the API restricts queries that return too many results. Specifically, for queries that return too many results, the API returns an error message that the search results are too large. 
This severely restricts our ability to use the API for these tools.&nbsp; We note that the MetaSRA is released in discrete chunks and does not track every ongoing change to the SRA; thus, whenever the MetaSRA version changes, we will update the static version of the MetaSRA packaged with these tools. We have added text to this manuscript detailing our commitment to performing these updates. Lastly, we added text to the README that makes it more explicit to the user which version of the MetaSRA these tools are utilizing.5.&nbsp;Within the instructions (within Section 1 of the Series Finder), we&nbsp; now provide the user example properties (such as \u201cpassage number\u201d and \u201ctime\u201d) as well as example units (such as \u201chour\u201d and \u201cday\u201d). We also point the user to the Units Ontology for a full set of available units that are utilized by the underlying MetaSRA annotations.6.&nbsp;We note that the accuracy of the results is dependent on the accuracy of the MetaSRA annotations, which have been thoroughly evaluated in&nbsp; the original MetaSRA publication by Bernstein et al. (2017). 
Therefore, we added text to the \u201cConclusion and future work\u201d section that points readers to this analysis.&nbsp; We have also added text to this section that clarifies that these tools are for selecting an initial \n<i>candidate<\/i> set of samples from the SRA; however, given that the annotations are not error-free, we encourage the user to further validate the datasets returned by these tools before performing downstream analysis.7.&nbsp;The SRA stores sequencing data for both bulk and single-cell data; however, this information is not encoded in the metadata in a standardized way nor is it captured by the MetaSRA.&nbsp; Therefore, one limitation of the tools presented in this work is that they may return datasets that comprise both bulk and single-cell samples.&nbsp; We describe this limitation in the Conclusion section and again encourage users to validate the results returned by these tools before performing downstream analyses.8.&nbsp;In the Conclusion section, we now point the reader to databases of pre-processed SRA data including recount2, ARCHS4, and refine.bio.&nbsp; From these resources, users can download pre-processed expression data for the samples returned by the tools presented in this work.","URL":"https:\/\/f1000research.com\/articles\/9-376\/v1#referee-comment-5762","order":1,"name":"referee-comment-5762","label":"Referee Comment","group":{"name":"article-reports","label":"Article Reports"}},{"value":"10.5256\/f1000research.25586.r63614, Shannon Ellis, Department of Biostatistics, Johns Hopkins University School of Public Health, Baltimore, MD, USA, 05 Jun 2020, version 1, 2 approved with reservations","URL":"https:\/\/f1000research.com\/articles\/9-376\/v1#referee-response-63614","order":2,"name":"referee-response-63614","label":"Referee Report","group":{"name":"article-reports","label":"Article Reports"}},{"value":"<b>Matthew Bernstein<\/b>; \n<i>Posted: 23 Jul 2020<\/i>; We greatly appreciate the reviewer's valuable suggestions and 
feedback. Please see our responses below:1.&nbsp;We agree that using the MetaSRA\u2019s API would be a great idea; however, the API restricts queries that return too many results. Specifically, for queries that return too many results, the API returns an error message that the search results are too large. This severely restricts our ability to use the API for these tools. We note that the MetaSRA is released in discrete chunks and does not track every ongoing change to the SRA; thus, whenever the MetaSRA version changes, we will update the static version of the MetaSRA packaged with these tools. We have added text to this manuscript detailing our commitment to performing these updates. Lastly, we added text to the README that makes it more explicit to the user which version of the MetaSRA these tools are utilizing.2.&nbsp;-&nbsp;We tested the query \u201cheart\u201d and it now should return results. We also provide more thorough input validation for cases in which the query does not return results.-&nbsp;We have updated the code so that the tools retrieve samples that are annotated as an ancestral term to the query term (e.g. a sample labelled as \u201cbrain glioma\u201d should be retrieved when the user inputs the query \u201cbrain cancer\u201d). Now the query \u201cbrain cancer\u201d will retrieve many more samples than before. We do note a few issues with the particular query \u201cbrain cancer\u201d (which maps to term DOID:1319 in the Disease Ontology).&nbsp; Specifically, we found that the MetaSRA failed to label many samples as \u201cbrain cancer\u201d due to the fact that many of the subterms (e.g. \u201cbrain glioma\u201d) are missing important synonyms that would have led the MetaSRA to pick them up. 
For example, the term \u201cbrain glioma\u201d (DOID:0060108) is not associated with the simple synonym \u201cglioma\u201d and thus, unless a sample for a given glioma sample was described using the string \u201cbrain glioma\u201d, which appears to be rare, the MetaSRA failed to annotate this sample as a \u201cbrain glioma\u201d.&nbsp; Instead, the MetaSRA labels glioma samples using an alternative \u201cglioma\u201d term from the Experimental Factors Ontology (EFO:0005543), which does not have \u201cbrain cancer\u201d as an ancestor term, but instead has \u201cbrain neoplasm\u201d as an ancestor (EFO:0003833). This case points to the fact that there is still work to be done in both standardizing the metadata in the SRA and in constructing comprehensive ontologies. Unfortunately, these issues remain out of the scope for this work; however, we now include new text in the Conclusion section that discusses how the original MetaSRA annotations contain some errors and that these errors may propagate to the output of these tools.&nbsp;-&nbsp;Thank you for this suggestion. We have added more detailed instructions for each input parameter. We also perform more thorough input-validation on the user\u2019s input. Lastly, we have added more documentation to each function in utils to help a user who wishes to dive further into the code.Responses to minor issues:1.&nbsp;We apologize for this password issue. Given how few dependencies these notebooks utilize, we decided that Docker is probably overkill for this project and therefore we removed this option altogether. We instead uploaded these notebooks to Google Colab to run in the cloud.&nbsp; If a user would like to run the notebooks locally, we now detail all of the dependencies in the file \u201crequirements.txt\u201d within the repository and offer guidance on installing these dependencies in the README.&nbsp;&nbsp;2.&nbsp;Thank you for these suggestions. 
We flipped the bar charts 90 degrees and also use a different color palette for each pie chart. We note that the same samples are used to construct each of the four pie charts.3.&nbsp;We added text to this sentence highlighting another example of a temporal property: time in which cells have spent differentiating in vitro. To this end, we have also added another parameter to the query that enables users to select only in vitro differentiating cells in order to answer possible biological questions pertaining to differentiation.4.&nbsp;This is definitely an important feature, thank you for suggesting it. We now enable the user to match by age and sex in the notebook (see Section \u201c3. Set filtering parameters\u201d). Specifically, in the notebook, if the user sets the variable \u201cMATCH_BY_SEX\u201d to True, we only consider samples that are annotated by sex in the MetaSRA and then match accordingly.&nbsp; Similarly, if the user sets \u201cMATCH_BY_AGE\u201d to True, we only consider samples that are annotated with age and then match accordingly.","URL":"https:\/\/f1000research.com\/articles\/9-376\/v1#referee-comment-5763","order":3,"name":"referee-comment-5763","label":"Referee Comment","group":{"name":"article-reports","label":"Article Reports"}},{"value":"This work was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.  Matthew Bernstein acknowledges support from grant 2018-182626 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. 
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.","order":4,"name":"grant-information","label":"Grant Information"},{"value":"This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.","order":0,"name":"copyright-info","label":"Copyright"}]}}