{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,9]],"date-time":"2026-04-09T10:33:09Z","timestamp":1775730789081,"version":"3.50.1"},"reference-count":52,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2021,7,12]],"date-time":"2021-07-12T00:00:00Z","timestamp":1626048000000},"content-version":"vor","delay-in-days":11,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"NIH","doi-asserted-by":"publisher","award":["R01 HG009937"],"award-info":[{"award-number":["R01 HG009937"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["CCF-1750472"],"award-info":[{"award-number":["CCF-1750472"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["CNS-1763680"],"award-info":[{"award-number":["CNS-1763680"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,8,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata\u2019s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9\u2009h, using \u223c29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9\u2009h, using \u223c84 GB of memory. The only other tool completing these tasks on the hardware took over 23\u2009h using \u223c126 GB of memory, and over 16\u2009h using \u223c289 GB of memory, respectively.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Cuttlefish is implemented in C++14, and is available under an open source license at https:\/\/github.com\/COMBINE-lab\/cuttlefish.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab309","type":"journal-article","created":{"date-parts":[[2021,4,30]],"date-time":"2021-04-30T15:18:09Z","timestamp":1619795889000},"page":"i177-i186","source":"Crossref","is-referenced-by-count":33,"title":["Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections"],"prefix":"10.1093","volume":"37","author":[{"given":"Jamshed","family":"Khan","sequence":"first","affiliation":[{"name":"Department of Computer Science, University of Maryland, College Park , MD 20742, USA"},{"name":"Center for Bioinformatics and Computational Biology, University of Maryland, College Park , MD 20742, USA"}]},{"given":"Rob","family":"Patro","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Maryland, College Park , MD 20742, USA"},{"name":"Center for Bioinformatics and Computational Biology, University of Maryland, College Park , MD 20742, USA"}]}],"member":"286","published-online":{"date-parts":[[2021,7,12]]},"reference":[{"key":"2023062410291160600_btab309-B1","doi-asserted-by":"crossref","first-page":"i169","DOI":"10.1093\/bioinformatics\/bty292","article-title":"A space and time-efficient index for the compacted colored de Bruijn graph","volume":"34","author":"Almodaresi","year":"2018","journal-title":"Bioinformatics"},{"key":"2023062410291160600_btab309-B2","first-page":"1","volume-title":"Research in Computational Molecular Biology.","author":"Almodaresi","year":"2019"},{"key":"2023062410291160600_btab309-B3","author":"Almodaresi","year":"2020"},{"key":"2023062410291160600_btab309-B4","doi-asserted-by":"crossref","first-page":"497","DOI":"10.1093\/bioinformatics\/btv603","article-title":"Graphical pan-genome analysis with compressed suffix trees and the Burrows\u2013Wheeler transform","volume":"32","author":"Baier","year":"2015","journal-title":"Bioinformatics"},{"key":"2023062410291160600_btab309-B5","doi-asserted-by":"crossref","first-page":"455","DOI":"10.1089\/cmb.2012.0021","article-title":"SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing","volume":"19","author":"Bankevich","year":"2012","journal-title":"J. Comput. Biol"},{"key":"2023062410291160600_btab309-B6","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1145\/362686.362692","article-title":"Space\/time trade-offs in hash coding with allowable errors","volume":"13","author":"Bloom","year":"1970","journal-title":"Commun. ACM"},{"key":"2023062410291160600_btab309-B7","doi-asserted-by":"crossref","first-page":"225","DOI":"10.1007\/978-3-642-33122-0_18","volume-title":"Algorithms in Bioinformatics","author":"Bowe","year":"2012"},{"key":"2023062410291160600_btab309-B8","author":"Burrows","year":"1994"},{"key":"2023062410291160600_btab309-B9","doi-asserted-by":"crossref","first-page":"810","DOI":"10.1101\/gr.7337908","article-title":"ALLPATHS: de novo assembly of whole-genome shotgun microreads","volume":"18","author":"Butler","year":"2008","journal-title":"Genome Res"},{"key":"2023062410291160600_btab309-B10","doi-asserted-by":"crossref","first-page":"324","DOI":"10.1101\/gr.7088808","article-title":"Short read fragment assembly of bacterial genomes","volume":"18","author":"Chaisson","year":"2008","journal-title":"Genome Res"},{"key":"2023062410291160600_btab309-B11","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1186\/1748-7188-8-22","article-title":"Space-efficient and exact de Bruijn graph representation based on a bloom filter","volume":"8","author":"Chikhi","year":"2013","journal-title":"Algorithms Mol. Biol"},{"key":"2023062410291160600_btab309-B12","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1007\/978-3-319-05269-4_4","volume-title":"Research in Computational Molecular Biology","author":"Chikhi","year":"2014"},{"key":"2023062410291160600_btab309-B13","doi-asserted-by":"crossref","first-page":"i201","DOI":"10.1093\/bioinformatics\/btw279","article-title":"Compacting de Bruijn graphs from sequencing data quickly and in low memory","volume":"32","author":"Chikhi","year":"2016","journal-title":"Bioinformatics"},{"key":"2023062410291160600_btab309-B14","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3445967","article-title":"Data structures to represent a set of k-long DNA sequences","volume":"54","author":"Chikhi","year":"2021","journal-title":"ACM Comput. Surv"},{"key":"2023062410291160600_btab309-B15","volume-title":"Introduction to Algorithms","author":"Cormen","year":"2009","edition":"3rd edn"},{"key":"2023062410291160600_btab309-B16","doi-asserted-by":"crossref","first-page":"2529","DOI":"10.1038\/nprot.2016.150","article-title":"Indel variant analysis of short-read sequencing data with scalpel","volume":"11","author":"Fang","year":"2016","journal-title":"Nat. Prot"},{"key":"2023062410291160600_btab309-B17","first-page":"1","author":"Guo","year":"2019"},{"key":"2023062410291160600_btab309-B18","article-title":"Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs","volume":"21, 249","author":"Holley","year":"2020","journal-title":"Genome Biol"},{"key":"2023062410291160600_btab309-B19","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1186\/s13015-016-0066-8","article-title":"Bloom filter trie: an alignment-free and reference-free data structure for pan-genome storage","volume":"11","author":"Holley","year":"2016","journal-title":"Algorithms Mol. Biol"},{"key":"2023062410291160600_btab309-B20","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1038\/ng.1028","article-title":"De novo assembly and genotyping of variants using colored de Bruijn graphs","volume":"44","author":"Iqbal","year":"2012","journal-title":"Nat. Genet"},{"key":"2023062410291160600_btab309-B21","author":"Karasikov","year":"2020"},{"key":"2023062410291160600_btab309-B22","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1186\/1471-2105-15-149","article-title":"FIGG: simulating populations of whole genome sequences for heterogeneous data analyses","volume":"15","author":"Killcoyne","year":"2014","journal-title":"BMC Bioinformatics"},{"key":"2023062410291160600_btab309-B23","doi-asserted-by":"crossref","first-page":"2759","DOI":"10.1093\/bioinformatics\/btx304","article-title":"KMC 3: counting and manipulating k-mer statistics","volume":"33","author":"Kokot","year":"2017","journal-title":"Bioinformatics"},{"key":"2023062410291160600_btab309-B24","doi-asserted-by":"crossref","first-page":"560","DOI":"10.1186\/1471-2105-11-560","article-title":"Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs","volume":"11","author":"Kundeti","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023062410291160600_btab309-B25","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1101\/gr.097261.109","article-title":"De novo assembly of human genomes with massively parallel short read sequencing","volume":"20","author":"Li","year":"2010","journal-title":"Genome Research"},{"key":"2023062410291160600_btab309-B26","first-page":"1","author":"Limasset","year":"2017"},{"key":"2023062410291160600_btab309-B27","doi-asserted-by":"crossref","first-page":"3224","DOI":"10.1093\/bioinformatics\/btw371","article-title":"deBGA: read alignment with de Bruijn graph-based seed and extension","volume":"32","author":"Liu","year":"2016","journal-title":"Bioinformatics"},{"key":"2023062410291160600_btab309-B28","doi-asserted-by":"crossref","first-page":"274","DOI":"10.1186\/s13059-019-1895-9","article-title":"deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index","volume":"20","author":"Liu","year":"2019","journal-title":"Genome Biol"},{"key":"2023062410291160600_btab309-B29","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1186\/s13742-015-0069-2","article-title":"SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler","volume":"4","author":"Luo","year":"2015","journal-title":"GigaSci"},{"key":"2023062410291160600_btab309-B30","doi-asserted-by":"crossref","first-page":"R103","DOI":"10.1186\/gb-2009-10-10-r103","article-title":"ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads","volume":"10","author":"MacCallum","year":"2009","journal-title":"Genome Biol"},{"key":"2023062410291160600_btab309-B31","author":"Mar\u00e7ais","year":"2020"},{"key":"2023062410291160600_btab309-B32","doi-asserted-by":"crossref","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","article-title":"A fast, lock-free approach for efficient parallel counting of occurrences of k-mers","volume":"27","author":"Mar\u00e7ais","year":"2011","journal-title":"Bioinformatics"},{"key":"2023062410291160600_btab309-B33","author":"Marchet","year":"2019"},{"key":"2023062410291160600_btab309-B34","doi-asserted-by":"crossref","first-page":"3476","DOI":"10.1093\/bioinformatics\/btu756","article-title":"SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips","volume":"30","author":"Marcus","year":"2014","journal-title":"Bioinformatics"},{"key":"2023062410291160600_btab309-B35","doi-asserted-by":"crossref","first-page":"101224","DOI":"10.1016\/j.isci.2020.101224","article-title":"Scalable pairwise whole-genome homology mapping of long genomes with BubbZ","volume":"23","author":"Minkin","year":"2020","journal-title":"IScience"},{"key":"2023062410291160600_btab309-B36","doi-asserted-by":"crossref","first-page":"4024","DOI":"10.1093\/bioinformatics\/btw609","article-title":"TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes","volume":"33","author":"Minkin","year":"2016","journal-title":"Bioinformatics"},{"key":"2023062410291160600_btab309-B37","doi-asserted-by":"crossref","first-page":"3181","DOI":"10.1093\/bioinformatics\/btx067","article-title":"Succinct colored de Bruijn graphs","volume":"33","author":"Muggli","year":"2017","journal-title":"Bioinformatics"},{"key":"2023062410291160600_btab309-B38","doi-asserted-by":"crossref","first-page":"i51","DOI":"10.1093\/bioinformatics\/btz350","article-title":"Building large updatable colored de Bruijn graphs via merging","volume":"35","author":"Muggli","year":"2019","journal-title":"Bioinformatics"},{"key":"2023062410291160600_btab309-B39","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1038\/nature25458","article-title":"The axolotl genome and the evolution of key tissue formation regulators","volume":"554","author":"Nowoshilow","year":"2018","journal-title":"Nature"},{"key":"2023062410291160600_btab309-B40","author":"Pan","year":"2018"},{"key":"2023062410291160600_btab309-B41","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1016\/j.cels.2018.05.021","article-title":"Mantis: a fast, small, and exact large-scale sequence-search index","volume":"7","author":"Pandey","year":"2018","journal-title":"Cell Syst"},{"key":"2023062410291160600_btab309-B42","doi-asserted-by":"crossref","first-page":"13272","DOI":"10.1073\/pnas.1121464109","article-title":"Scaling metagenome sequence assembly with probabilistic de Bruijn graphs","volume":"109","author":"Pell","year":"2012","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023062410291160600_btab309-B43","doi-asserted-by":"crossref","first-page":"9748","DOI":"10.1073\/pnas.171285098","article-title":"An Eulerian path approach to DNA fragment assembly","volume":"98","author":"Pevzner","year":"2001","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023062410291160600_btab309-B44","doi-asserted-by":"crossref","first-page":"909","DOI":"10.1038\/nmeth.1517","article-title":"De novo assembly and analysis of RNA-seq data","volume":"7","author":"Robertson","year":"2010","journal-title":"Nat. Methods"},{"key":"2023062410291160600_btab309-B45","doi-asserted-by":"crossref","first-page":"313","DOI":"10.1186\/1471-2105-14-313","article-title":"Compact representation of k-mer de Bruijn graphs for genome read assembly","volume":"14","author":"R\u00f8dland","year":"2013","journal-title":"BMC Bioinformatics"},{"key":"2023062410291160600_btab309-B46","doi-asserted-by":"crossref","first-page":"D94","DOI":"10.1093\/nar\/gky989","article-title":"GenBank","volume":"47","author":"Sayers","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2023062410291160600_btab309-B47","doi-asserted-by":"crossref","first-page":"1117","DOI":"10.1101\/gr.089532.108","article-title":"ABySS: a parallel assembler for short read sequence data","volume":"19","author":"Simpson","year":"2009","journal-title":"Genome Res"},{"key":"2023062410291160600_btab309-B48","doi-asserted-by":"crossref","first-page":"3673","DOI":"10.1093\/nar\/8.16.3673","article-title":"A new computer method for the storage and manipulation of DNA gel reading data","volume":"8","author":"Staden","year":"1980","journal-title":"Nucleic Acids Res"},{"key":"2023062410291160600_btab309-B49","doi-asserted-by":"crossref","first-page":"1613","DOI":"10.1534\/genetics.116.193227","article-title":"Sequence of the sugar pine megagenome","volume":"204","author":"Stevens","year":"2016","journal-title":"Genetics"},{"key":"2023062410291160600_btab309-B50","doi-asserted-by":"crossref","first-page":"e11","DOI":"10.1093\/nar\/gku1187","article-title":"Reference-free detection of isolated SNPs","volume":"43","author":"Uricaru","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2023062410291160600_btab309-B51","doi-asserted-by":"crossref","first-page":"821","DOI":"10.1101\/gr.074492.107","article-title":"Velvet: algorithms for de novo short read assembly using de Bruijn graphs","volume":"18","author":"Zerbino","year":"2008","journal-title":"Genome Res"},{"key":"2023062410291160600_btab309-B52","doi-asserted-by":"crossref","first-page":"e8407","DOI":"10.1371\/journal.pone.0008407","article-title":"Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler","volume":"4","author":"Zerbino","year":"2009","journal-title":"PLoS One"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/Supplement_1\/i177\/50694391\/btab309.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/Supplement_1\/i177\/50694391\/btab309.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,24]],"date-time":"2023-06-24T20:21:21Z","timestamp":1687638081000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/Supplement_1\/i177\/6319696"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,1]]},"references-count":52,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2021,8,4]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab309","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.10.21.349605","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,7,1]]},"published":{"date-parts":[[2021,7,1]]}}}