{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,18]],"date-time":"2026-06-18T03:16:13Z","timestamp":1781752573886,"version":"3.54.5"},"reference-count":30,"publisher":"Oxford University Press (OUP)","issue":"17","license":[{"start":{"date-parts":[[2022,7,8]],"date-time":"2022-07-08T00:00:00Z","timestamp":1657238400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,9,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>The ever-growing size of sequencing data is a major bottleneck in bioinformatics as the advances of hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in &lt;2\u2009h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. 
This is then used to quantify the query.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>https:\/\/github.com\/seqan\/needle.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac492","type":"journal-article","created":{"date-parts":[[2022,7,8]],"date-time":"2022-07-08T13:27:58Z","timestamp":1657286878000},"page":"4100-4108","source":"Crossref","is-referenced-by-count":12,"title":["Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0643-5123","authenticated-orcid":false,"given":"Mitra","family":"Darvish","sequence":"first","affiliation":[{"name":"Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics , Berlin, Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Enrico","family":"Seiler","sequence":"additional","affiliation":[{"name":"Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics , Berlin, Germany"},{"name":"Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin , 14195 Berlin, Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Svenja","family":"Mehringer","sequence":"additional","affiliation":[{"name":"Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin , 14195 Berlin, Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Ren\u00e9","family":"Rahn","sequence":"additional","affiliation":[{"name":"Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics , Berlin, Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Knut","family":"Reinert","sequence":"additional","affiliation":[{"name":"Efficient Algorithms for 
Omics Data, Max Planck Institute for Molecular Genetics , Berlin, Germany"},{"name":"Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin , 14195 Berlin, Germany"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"286","published-online":{"date-parts":[[2022,7,8]]},"reference":[{"key":"2023041408423445000_","first-page":"285","author":"Bingmann","year":"2019"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1145\/362686.362692","article-title":"Space\/time trade-offs in hash coding with allowable errors","volume":"13","author":"Bloom","year":"1970","journal-title":"Commun. ACM"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"525","DOI":"10.1038\/nbt.3519","article-title":"Near-optimal probabilistic RNA-seq quantification","volume":"34","author":"Bray","year":"2016","journal-title":"Nat. Biotechnol"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"103592","DOI":"10.1016\/j.ebiom.2021.103592","article-title":"Kidney damage causally affects the brain cortical structure: a mendelian randomization study","volume":"72","author":"Chen","year":"2021","journal-title":"eBioMedicine"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"i766","DOI":"10.1093\/bioinformatics\/bty567","article-title":"DREAM-Yara: an exact read mapper for very large databases with short update time","volume":"34","author":"Dadi","year":"2018","journal-title":"Bioinformatics"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1093\/bioinformatics\/bts635","article-title":"STAR: ultrafast universal RNA-seq aligner","volume":"29","author":"Dobin","year":"2013","journal-title":"Bioinformatics"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"2778","DOI":"10.1093\/bioinformatics\/btv272","article-title":"Polyester: simulating RNA-seq datasets with differential transcript 
expression","volume":"31","author":"Frazee","year":"2015","journal-title":"Bioinformatics"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"2628","DOI":"10.1093\/bioinformatics\/btz931","article-title":"ShinyGO: a graphical gene-set enrichment tool for animals and plants","volume":"36","author":"Ge","year":"2019","journal-title":"Bioinformatics"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"721","DOI":"10.1093\/bioinformatics\/btz662","article-title":"Improved representation of sequence bloom trees","volume":"36","author":"Harris","year":"2020","journal-title":"Bioinformatics"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1038\/eye.1999.9","article-title":"Genetic predisposition to ocular melanoma","volume":"13","author":"Houlston","year":"1999","journal-title":"Eye (London)"},{"key":"2023041408423445000_","first-page":"12:1","volume-title":"21st International Workshop on Algorithms in Bioinformatics (WABI 2021), Volume 201 of Leibniz International Proceedings in Informatics (LIPIcs)","author":"Kitaya","year":"2021"},{"key":"2023041408423445000_","article-title":"kmtricks: Efficient construction of bloom filters for large sequencing data collections","author":"Lemane","year":"2021","journal-title":"bioRxiv"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3230636","article-title":"Fast random integer generation in an interval","volume":"29","author":"Lemire","year":"2019","journal-title":"ACM Trans. Model. Comput. Simul"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","DOI":"10.1186\/s42047-018-0027-2","article-title":"Columnar cell lesions of the breast: a practical review for the pathologist","volume":"2","author":"Logullo","year":"2019","journal-title":"Surg. Exp. 
Pathol"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"550","DOI":"10.1186\/s13059-014-0550-8","article-title":"Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2","volume":"15","author":"Love","year":"2014","journal-title":"Genome Biol"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"i110","DOI":"10.1093\/bioinformatics\/btx235","article-title":"Improving the performance of minimizers and winnowing schemes","volume":"33","author":"Mar\u00e7ais","year":"2017","journal-title":"Bioinformatics"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"i177","DOI":"10.1093\/bioinformatics\/btaa487","article-title":"REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets","volume":"36","author":"Marchet","year":"2020","journal-title":"Bioinformatics"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1016\/j.cels.2018.05.021","article-title":"Mantis: a fast, small, and exact large-scale sequence-search index","volume":"7","author":"Pandey","year":"2018","journal-title":"Cell Systems"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"462","DOI":"10.1038\/nbt.2862","article-title":"Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms","volume":"32","author":"Patro","year":"2014","journal-title":"Nat. Biotechnol"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"417","DOI":"10.1038\/nmeth.4197","article-title":"Salmon provides fast and bias-aware quantification of transcript expression","volume":"14","author":"Patro","year":"2017","journal-title":"Nat. 
Methods"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2023041408423445000_","first-page":"76","author":"Schleimer","year":"2003"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"102782","DOI":"10.1016\/j.isci.2021.102782","article-title":"Raptor: a fast and space-efficient pre-filter for querying very large collections of nucleotide sequences","volume":"24","author":"Seiler","year":"2021","journal-title":"iScience"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"903","DOI":"10.1038\/nbt.2957","article-title":"A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the sequencing quality control consortium","volume":"32","author":"SEQC\/MAQC-III Consortium","year":"2014","journal-title":"Nat. Biotechnol"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"300","DOI":"10.1038\/nbt.3442","article-title":"Fast search of thousands of short-read sequencing experiments","volume":"34","author":"Solomon","year":"2016","journal-title":"Nat. Biotechnol"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1089\/cmb.2017.0265","article-title":"Improved search of large transcriptomic sequencing databases using split sequence bloom trees","volume":"25","author":"Solomon","year":"2018","journal-title":"J. Comput. Biol"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"467","DOI":"10.1089\/cmb.2017.0258","article-title":"AllSome sequence bloom trees","volume":"25","author":"Sun","year":"2018","journal-title":"J. Comput. 
Biol"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1186\/s13059-018-1535-9","article-title":"SeqOthello: querying RNA-seq experiments at scale","volume":"19","author":"Yu","year":"2018","journal-title":"Genome Biol"},{"key":"2023041408423445000_","first-page":"285","author":"Zhang","year":"2021"},{"key":"2023041408423445000_","doi-asserted-by":"crossref","first-page":"269","DOI":"10.1186\/s12967-021-02936-w","article-title":"TPM, FPKM, or normalized counts? A comparative study of quantification measures for the analysis of RNA-seq data from the NCI patient-derived models repository","volume":"19","author":"Zhao","year":"2021","journal-title":"J. Transl. Med"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac492\/45026034\/btac492.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/17\/4100\/49889793\/btac492.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/17\/4100\/49889793\/btac492.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,24]],"date-time":"2023-11-24T10:50:56Z","timestamp":1700823056000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/17\/4100\/6633930"}},"subtitle":[],"editor":[{"given":"Yann","family":"Ponty","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"editor"}]}],"short-title":[],"issued":{"date-parts":[[2022,7,8]]},"references-count":30,"journal-issue":{"issue":"17","published-print":{"date-parts":[[2022,9,2]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/
btac492","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,9,1]]},"published":{"date-parts":[[2022,7,8]]}}}