{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:48Z","timestamp":1772138088866,"version":"3.50.1"},"reference-count":39,"publisher":"Oxford University Press (OUP)","issue":"13","license":[{"start":{"date-parts":[[2018,6,27]],"date-time":"2018-06-27T00:00:00Z","timestamp":1530057600000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["BBSRC-NSF\/BIO-1564917"],"award-info":[{"award-number":["BBSRC-NSF\/BIO-1564917"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Indexing reference sequences for search\u2014both individual genomes and collections of genomes\u2014is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences.<\/jats:p>\n                    <jats:p>Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>pufferfish is written in C++11, is open source, and is available at https:\/\/github.com\/COMBINE-lab\/pufferfish.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/bty292","type":"journal-article","created":{"date-parts":[[2018,4,16]],"date-time":"2018-04-16T15:11:48Z","timestamp":1523891508000},"page":"i169-i177","source":"Crossref","is-referenced-by-count":80,"title":["A space and time-efficient index for the compacted colored de Bruijn graph"],"prefix":"10.1093","volume":"34","author":[{"given":"Fatemeh","family":"Almodaresi","sequence":"first","affiliation":[{"name":"Department of Computer Science, Stony Brook University, Stonybrook, NY, USA"}]},{"given":"Hirak","family":"Sarkar","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Stony Brook University, Stonybrook, NY, USA"}]},{"given":"Avi","family":"Srivastava","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Stony Brook University, Stonybrook, NY, USA"}]},{"given":"Rob","family":"Patro","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Stony Brook University, Stonybrook, NY, USA"}]}],"member":"286","published-online":{"date-parts":[[2018,6,27]]},"reference":[{"key":"2023051604245547800_bty292-B1","author":"Almodaresi","year":"2017"},{"key":"2023051604245547800_bty292-B2","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1007\/978-3-319-46049-9_14","volume-title":"International Symposium on String Processing and Information Retrieval","author":"Belazzougui","year":"2016"},{"key":"2023051604245547800_bty292-B3","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1186\/s13015-016-0083-7","article-title":"A representation of a compressed de Bruijn graph for pan-genome analysis that enables search","volume":"11","author":"Beller","year":"2016","journal-title":"Algorithms Mol. Biol"},{"key":"2023051604245547800_bty292-B4","author":"Bowe","year":"2012"},{"key":"2023051604245547800_bty292-B5","doi-asserted-by":"crossref","first-page":"525","DOI":"10.1038\/nbt.3519","article-title":"Near-optimal probabilistic RNA-seq quantification","volume":"34","author":"Bray","year":"2016","journal-title":"Nat. Biotechnol"},{"key":"2023051604245547800_bty292-B6","author":"Chikhi","year":"2014"},{"key":"2023051604245547800_bty292-B7","doi-asserted-by":"crossref","first-page":"i201","DOI":"10.1093\/bioinformatics\/btw279","article-title":"Compacting de Bruijn graphs from sequencing data quickly and in low memory","volume":"32","author":"Chikhi","year":"2016","journal-title":"Bioinformatics"},{"key":"2023051604245547800_bty292-B8","doi-asserted-by":"crossref","first-page":"15","DOI":"10.1093\/bioinformatics\/bts635","article-title":"STAR: ultrafast universal RNA-seq aligner","volume":"29","author":"Dobin","year":"2013","journal-title":"Bioinformatics"},{"key":"2023051604245547800_bty292-B9","author":"Ferragina","year":"2001"},{"key":"2023051604245547800_bty292-B10","doi-asserted-by":"crossref","first-page":"644","DOI":"10.1038\/nbt.1883","article-title":"Full-length transcriptome assembly from RNA-Seq data without a reference genome","volume":"29","author":"Grabherr","year":"2011","journal-title":"Nat. Biotechnol"},{"key":"2023051604245547800_bty292-B11","doi-asserted-by":"crossref","first-page":"1494","DOI":"10.1038\/nprot.2013.084","article-title":"De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with trinity","volume":"8","author":"Haas","year":"2013","journal-title":"Nat. Protoc"},{"key":"2023051604245547800_bty292-B12","doi-asserted-by":"crossref","first-page":"576","DOI":"10.1038\/nmeth0810-576","article-title":"mrsFAST: a cache-oblivious algorithm for short-read mapping","volume":"7","author":"Hach","year":"2010","journal-title":"Nat. Methods"},{"key":"2023051604245547800_bty292-B13","doi-asserted-by":"crossref","first-page":"3.","DOI":"10.1186\/s13015-016-0066-8","article-title":"Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage","volume":"11","author":"Holley","year":"2016","journal-title":"Algorithms Mol. Biol"},{"key":"2023051604245547800_bty292-B14","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1038\/ng.1028","article-title":"De novo assembly and genotyping of variants using colored de Bruijn graphs","volume":"44","author":"Iqbal","year":"2012","journal-title":"Nat. Genet"},{"key":"2023051604245547800_bty292-B15","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/nmeth.3317","article-title":"HISAT: a fast spliced aligner with low memory requirements","volume":"12","author":"Kim","year":"2015","journal-title":"Nat. Methods"},{"key":"2023051604245547800_bty292-B16","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/nmeth.1923","article-title":"Fast gapped-read alignment with Bowtie 2","volume":"9","author":"Langmead","year":"2012","journal-title":"Nat. Methods"},{"key":"2023051604245547800_bty292-B17","doi-asserted-by":"crossref","first-page":"R25.","DOI":"10.1186\/gb-2009-10-3-r25","article-title":"Ultrafast and memory-efficient alignment of short DNA sequences to the human genome","volume":"10","author":"Langmead","year":"2009","journal-title":"Genome Biol"},{"key":"2023051604245547800_bty292-B18","author":"Li","year":"2013"},{"key":"2023051604245547800_bty292-B19","doi-asserted-by":"crossref","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with burrows\u2013wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023051604245547800_bty292-B20","doi-asserted-by":"crossref","first-page":"1851","DOI":"10.1101\/gr.078212.108","article-title":"Mapping short DNA sequencing reads and calling variants using mapping quality scores","volume":"18","author":"Li","year":"2008","journal-title":"Genome Res"},{"key":"2023051604245547800_bty292-B21","doi-asserted-by":"crossref","first-page":"e108","DOI":"10.1093\/nar\/gkt214","article-title":"The subread aligner: fast, accurate and scalable read mapping by seed-and-vote","volume":"41","author":"Liao","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2023051604245547800_bty292-B22","doi-asserted-by":"crossref","first-page":"237.","DOI":"10.1186\/s12859-016-1103-9","article-title":"Read mapping on de Bruijn graphs","volume":"17","author":"Limasset","year":"2016","journal-title":"BMC Bioinformatics"},{"key":"2023051604245547800_bty292-B23","author":"Limasset","year":"2017"},{"key":"2023051604245547800_bty292-B24","doi-asserted-by":"crossref","first-page":"3224","DOI":"10.1093\/bioinformatics\/btw371","article-title":"deBGA: read alignment with de Bruijn graph-based seed and extension","volume":"32","author":"Liu","year":"2016","journal-title":"Bioinformatics"},{"key":"2023051604245547800_bty292-B25","doi-asserted-by":"crossref","first-page":"222","DOI":"10.1007\/978-3-319-43681-4_18","volume-title":"International Workshop on Algorithms in Bioinformatics","author":"Maciuca","year":"2016"},{"key":"2023051604245547800_bty292-B26","doi-asserted-by":"crossref","first-page":"182","DOI":"10.1186\/s13059-017-1299-7","article-title":"Comprehensive benchmarking and ensemble approaches for metagenomic classifiers","volume":"18","author":"McIntyre","year":"2017","journal-title":"Genome Biol"},{"key":"2023051604245547800_bty292-B27","doi-asserted-by":"crossref","first-page":"215","DOI":"10.1007\/978-3-642-40453-5_17","volume-title":"International Workshop on Algorithms in Bioinformatics","author":"Minkin","year":"2013"},{"key":"2023051604245547800_bty292-B28","first-page":"4024","article-title":"TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes","volume":"15","author":"Minkin","year":"2016","journal-title":"Bioinformatics"},{"key":"2023051604245547800_bty292-B29","author":"Movahedi","year":"2012"},{"key":"2023051604245547800_bty292-B30","doi-asserted-by":"crossref","first-page":"3181","DOI":"10.1093\/bioinformatics\/btx067","article-title":"Succinct colored de Bruijn graphs","volume":"33","author":"Muggli","year":"2017","journal-title":"Bioinformatics"},{"key":"2023051604245547800_bty292-B31","doi-asserted-by":"crossref","first-page":"236","DOI":"10.1186\/s12864-015-1419-2","article-title":"CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers","volume":"16","author":"Ounit","year":"2015","journal-title":"BMC Genomics"},{"key":"2023051604245547800_bty292-B32","doi-asserted-by":"crossref","first-page":"665","DOI":"10.1101\/gr.214155.116","article-title":"Genome graphs and the evolution of genome inference","volume":"27","author":"Paten","year":"2017","journal-title":"Genome Res"},{"key":"2023051604245547800_bty292-B33","doi-asserted-by":"crossref","first-page":"9748","DOI":"10.1073\/pnas.171285098","article-title":"An Eulerian path approach to DNA fragment assembly","volume":"98","author":"Pevzner","year":"2001","journal-title":"Proc. Natl .Acad. Sci. USA"},{"key":"2023051604245547800_bty292-B34","author":"Raman","year":"2002"},{"key":"2023051604245547800_bty292-B35","author":"Sarkar","year":"2017"},{"key":"2023051604245547800_bty292-B36","doi-asserted-by":"crossref","first-page":"R98.","DOI":"10.1186\/gb-2009-10-9-r98","article-title":"Simultaneous alignment of short reads against multiple genomes","volume":"10","author":"Schneeberger","year":"2009","journal-title":"Genome Biol"},{"key":"2023051604245547800_bty292-B37","author":"Sir\u00e9n","year":"2017"},{"key":"2023051604245547800_bty292-B38","doi-asserted-by":"crossref","first-page":"R46.","DOI":"10.1186\/gb-2014-15-3-r46","article-title":"Kraken: ultrafast metagenomic sequence classification using exact alignments","volume":"15","author":"Wood","year":"2014","journal-title":"Genome Biol"},{"key":"2023051604245547800_bty292-B39","doi-asserted-by":"crossref","first-page":"374.","DOI":"10.1038\/nbt.3511","article-title":"Compressive mapping for next-generation sequencing","volume":"34","author":"Yorukoglu","year":"2016","journal-title":"Nat. Biotechnol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/13\/i169\/50316242\/bioinformatics_34_13_i169.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/13\/i169\/50316242\/bioinformatics_34_13_i169.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,16]],"date-time":"2023-05-16T00:28:07Z","timestamp":1684196887000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/34\/13\/i169\/5045749"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,6,27]]},"references-count":39,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2018,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bty292","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/191874","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2018,7,1]]},"published":{"date-parts":[[2018,6,27]]}}}