{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,1]],"date-time":"2026-07-01T15:56:25Z","timestamp":1782921385830,"version":"3.54.5"},"reference-count":32,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2023,6,30]],"date-time":"2023-06-30T00:00:00Z","timestamp":1688083200000},"content-version":"vor","delay-in-days":29,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001665","name":"Agence Nationale de la recherche","doi-asserted-by":"publisher","award":["ANR-22-CE45-0007"],"award-info":[{"award-number":["ANR-22-CE45-0007"]}],"id":[{"id":"10.13039\/501100001665","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making searchable immense public resources accessible is beyond the reach of alignment-based strategies. In recent years, abundant literature tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants while being scalable to collections up to 10\u00a0000 eukaryotic samples. Results. Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3\u20136 fold improvement in construction time compared to other compressed methods for comparable index size. A PAC query can need single random access and be performed in constant time in favorable instances. Using limited computation resources, we built PAC for very large collections. They include 32\u00a0000 human RNA-seq samples in 5 days, the entire GenBank bacterial genome collection in a single day for an index size of 3.5\u00a0TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed that PAC\u2019s ability to query 500\u00a0000 transcript sequences in less than an hour.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>PAC\u2019s open-source software is available at https:\/\/github.com\/Malfoy\/PAC.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btad225","type":"journal-article","created":{"date-parts":[[2023,5,25]],"date-time":"2023-05-25T11:29:38Z","timestamp":1685014178000},"page":"i252-i259","source":"Crossref","is-referenced-by-count":23,"title":["Scalable sequence database search using partitioned aggregated Bloom comb trees"],"prefix":"10.1093","volume":"39","author":[{"given":"Camille","family":"Marchet","sequence":"first","affiliation":[{"name":"University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL , F-59000 Lille, France"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Antoine","family":"Limasset","sequence":"additional","affiliation":[{"name":"University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL , F-59000 Lille, France"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"286","published-online":{"date-parts":[[2023,6,30]]},"reference":[{"key":"2023063008160118200_btad225-B1","doi-asserted-by":"crossref","first-page":"1946","DOI":"10.1093\/bioinformatics\/btaa546","article-title":"Succinct dynamic de Bruijn graphs","volume":"37","author":"Alipanahi","year":"2021","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B2","doi-asserted-by":"crossref","first-page":"i169","DOI":"10.1093\/bioinformatics\/bty292","article-title":"A space and time-efficient index for the compacted colored de Bruijn graph","volume":"34","author":"Almodaresi","year":"2018","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B3","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J Mol Biol"},{"key":"2023063008160118200_btad225-B4","doi-asserted-by":"crossref","first-page":"1279","DOI":"10.1142\/S0129054118430037","article-title":"Bidirectional variable-order de Bruijn graphs","volume":"29","author":"Belazzougui","year":"2018","journal-title":"Int J Found Comput Sci"},{"key":"2023063008160118200_btad225-B5","author":"Bingmann","year":"2019"},{"key":"2023063008160118200_btad225-B6","doi-asserted-by":"crossref","first-page":"e3001421","DOI":"10.1371\/journal.pbio.3001421","article-title":"Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences","volume":"19","author":"Blackwell","year":"2021","journal-title":"PLoS Biol"},{"key":"2023063008160118200_btad225-B7","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1145\/362686.362692","article-title":"Space\/time trade-offs in hash coding with allowable errors","volume":"13","author":"Bloom","year":"1970","journal-title":"Commun ACM"},{"key":"2023063008160118200_btad225-B8","doi-asserted-by":"crossref","first-page":"152","DOI":"10.1038\/s41587-018-0010-1","article-title":"Ultrafast search of all deposited bacterial and viral genomic data","volume":"37","author":"Bradley","year":"2019","journal-title":"Nat Biotechnol"},{"key":"2023063008160118200_btad225-B9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/1471-2105-10-421","article-title":"BLAST+: architecture and applications","volume":"10","author":"Camacho","year":"2009","journal-title":"BMC Bioinformatics"},{"key":"2023063008160118200_btad225-B10","doi-asserted-by":"crossref","first-page":"336","DOI":"10.1089\/cmb.2014.0160","article-title":"On the representation of de Bruijn graphs","volume":"22","author":"Chikhi","year":"2015","journal-title":"J Comput Biol"},{"key":"2023063008160118200_btad225-B11","doi-asserted-by":"crossref","first-page":"i201","DOI":"10.1093\/bioinformatics\/btw279","article-title":"Compacting de Bruijn graphs from sequencing data quickly and in low memory","volume":"32","author":"Chikhi","year":"2016","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B12","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/1471-2105-14-160","article-title":"Disk-based k-mer counting on a PC","volume":"14","author":"Deorowicz","year":"2013","journal-title":"BMC Bioinformatics"},{"key":"2023063008160118200_btad225-B13","doi-asserted-by":"crossref","first-page":"300","DOI":"10.1101\/gr.211748.116","article-title":"Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes","volume":"27","author":"Dolle","year":"2017","journal-title":"Genome Res"},{"key":"2023063008160118200_btad225-B14","author":"European Nucleotide Archive"},{"key":"2023063008160118200_btad225-B15","doi-asserted-by":"crossref","first-page":"721","DOI":"10.1093\/bioinformatics\/btz662","article-title":"Improved representation of sequence bloom trees","volume":"36","author":"Harris","year":"2020","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-020-02135-8","article-title":"Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs","volume":"21","author":"Holley","year":"2020","journal-title":"Genome Biol"},{"key":"2023063008160118200_btad225-B17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13015-016-0066-8","article-title":"Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage","volume":"11","author":"Holley","year":"2016","journal-title":"Algorithms Mol Biol"},{"key":"2023063008160118200_btad225-B18","doi-asserted-by":"crossref","first-page":"2796","DOI":"10.1093\/bioinformatics\/btu387","article-title":"BEETL-fastq: a searchable compressed archive for DNA reads","volume":"30","author":"Janin","year":"2014","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B19","doi-asserted-by":"crossref","first-page":"vbac029","DOI":"10.1093\/bioadv\/vbac029","article-title":"kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections","volume":"2","author":"Lemane","year":"2022","journal-title":"Bioinform Adv"},{"key":"2023063008160118200_btad225-B20","doi-asserted-by":"crossref","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with Burrows\u2013Wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B21","doi-asserted-by":"crossref","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","article-title":"A fast, lock-free approach for efficient parallel counting of occurrences of k-mers","volume":"27","author":"Mar\u00e7ais","year":"2011","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B22","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1101\/gr.260604.119","article-title":"Data structures based on k-mers for querying large collections of sequencing data sets","volume":"31","author":"Marchet","year":"2021","journal-title":"Genome Res"},{"key":"2023063008160118200_btad225-B23","doi-asserted-by":"crossref","first-page":"2858","DOI":"10.1093\/bioinformatics\/btab217","article-title":"BLight: efficient exact associative structure for k-mers","volume":"37","author":"Marchet","year":"2021","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B24","doi-asserted-by":"crossref","first-page":"1324","DOI":"10.1093\/bioinformatics\/btw832","article-title":"ntCard: a streaming algorithm for cardinality estimation in genomics data","volume":"33","author":"Mohamadi","year":"2017","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B25","doi-asserted-by":"crossref","first-page":"i51","DOI":"10.1093\/bioinformatics\/btz350","article-title":"Building large updatable colored de Bruijn graphs via merging","volume":"35","author":"Muggli","year":"2019","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B26","doi-asserted-by":"crossref","first-page":"3181","DOI":"10.1093\/bioinformatics\/btx067","article-title":"Succinct colored de Bruijn graphs","volume":"33","author":"Muggli","year":"2017","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B27","doi-asserted-by":"crossref","first-page":"i185","DOI":"10.1093\/bioinformatics\/btac245","article-title":"Sparse and skew hashing of k-mers","volume":"38","author":"Pibiri","year":"2022","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B28","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2023063008160118200_btad225-B29","doi-asserted-by":"crossref","first-page":"e0163962","DOI":"10.1371\/journal.pone.0163962","article-title":"SeqKit: a cross-platform and ultrafast toolkit for FASTA\/Q file manipulation","volume":"11","author":"Shen","year":"2016","journal-title":"PLoS One"},{"key":"2023063008160118200_btad225-B30","doi-asserted-by":"crossref","first-page":"300","DOI":"10.1038\/nbt.3442","article-title":"Fast search of thousands of short-read sequencing experiments","volume":"34","author":"Solomon","year":"2016","journal-title":"Nat Biotechnol"},{"key":"2023063008160118200_btad225-B31","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1089\/cmb.2017.0265","article-title":"Improved search of large transcriptomic sequencing databases using split sequence bloom trees","volume":"25","author":"Solomon","year":"2018","journal-title":"J Comput Biol"},{"key":"2023063008160118200_btad225-B32","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1186\/s13059-018-1535-9","article-title":"SeqOthello: querying RNA-seq experiments at scale","volume":"19","author":"Yu","year":"2018","journal-title":"Genome Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/Supplement_1\/i252\/50741675\/btad225.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/Supplement_1\/i252\/50741675\/btad225.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,30]],"date-time":"2023-06-30T04:19:08Z","timestamp":1688098748000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/39\/Supplement_1\/i252\/7210475"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,1]]},"references-count":32,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2023,6,30]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btad225","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2022.02.11.480089","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023,6,1]]},"published":{"date-parts":[[2023,6,1]]}}}