{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,30]],"date-time":"2026-01-30T23:45:04Z","timestamp":1769816704684,"version":"3.49.0"},"reference-count":37,"publisher":"Proceedings of the National Academy of Sciences","issue":"33","content-domain":{"domain":["www.pnas.org"],"crossmark-restriction":true},"short-container-title":["Proc. Natl. Acad. Sci. U.S.A."],"published-print":{"date-parts":[[2012,8,14]]},"abstract":"<jats:p>\n            Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the\n            <jats:italic>k<\/jats:italic>\n            -mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per\n            <jats:italic>k<\/jats:italic>\n            -mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.\n          <\/jats:p>","DOI":"10.1073\/pnas.1121464109","type":"journal-article","created":{"date-parts":[[2012,7,31]],"date-time":"2012-07-31T08:08:52Z","timestamp":1343722132000},"page":"13272-13277","update-policy":"https:\/\/doi.org\/10.1073\/pnas.cm10313","source":"Crossref","is-referenced-by-count":195,"title":["Scaling metagenome sequence assembly with probabilistic de Bruijn graphs"],"prefix":"10.1073","volume":"109","author":[{"given":"Jason","family":"Pell","sequence":"first","affiliation":[{"name":"Computer Science and Engineering, Michigan State University, East Lansing, MI 48824;"}]},{"given":"Arend","family":"Hintze","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, Michigan State University, East Lansing, MI 48824;"}]},{"given":"Rosangela","family":"Canino-Koning","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, Michigan State University, East Lansing, MI 48824;"}]},{"given":"Adina","family":"Howe","sequence":"additional","affiliation":[{"name":"Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI 48824; and"}]},{"given":"James M.","family":"Tiedje","sequence":"additional","affiliation":[{"name":"Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI 48824; and"},{"name":"Crop and Soil Sciences, Michigan State University, East Lansing, MI 48824"}]},{"given":"C. Titus","family":"Brown","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, Michigan State University, East Lansing, MI 48824;"},{"name":"Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI 48824; and"}]}],"member":"341","published-online":{"date-parts":[[2012,7,30]]},"reference":[{"key":"e_1_3_4_1_2","doi-asserted-by":"crossref","first-page":"354","DOI":"10.1093\/bib\/bbp026","article-title":"Genome assembly reborn: Recent computational challenges","volume":"10","author":"Pop M","year":"2009","unstructured":"M Pop, Genome assembly reborn: Recent computational challenges. Brief Bioinform 10, 354\u2013366 (2009).","journal-title":"Brief Bioinform"},{"key":"e_1_3_4_2_2","doi-asserted-by":"crossref","first-page":"557","DOI":"10.1101\/gr.131383.111","article-title":"GAGE: A critical evaluation of genome assemblies and assembly algorithms","volume":"22","author":"Salzberg S","year":"2012","unstructured":"S Salzberg, et al., GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res 22, 557\u2013567 (2012).","journal-title":"Genome Res"},{"key":"e_1_3_4_3_2","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1038\/nature08821","article-title":"A human gut microbial gene catalogue established by metagenomic sequencing","volume":"464","author":"Qin J","year":"2010","unstructured":"J Qin, et al., A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59\u201365 (2010).","journal-title":"Nature"},{"key":"e_1_3_4_4_2","doi-asserted-by":"crossref","first-page":"463","DOI":"10.1126\/science.1200387","article-title":"Metagenomic discovery of biomass-degrading genes and genomes from cow rumen","volume":"331","author":"Hess M","year":"2011","unstructured":"M Hess, et al., Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331, 463\u2013467 (2011).","journal-title":"Science"},{"key":"e_1_3_4_5_2","doi-asserted-by":"crossref","first-page":"e1000667","DOI":"10.1371\/journal.pcbi.1000667","article-title":"A primer on metagenomics","volume":"6","author":"Wooley J","year":"2010","unstructured":"J Wooley, A Godzik, I Friedberg, A primer on metagenomics. PLoS Comput Biol 6, e1000667 (2010).","journal-title":"PLoS Comput Biol"},{"key":"e_1_3_4_6_2","doi-asserted-by":"crossref","first-page":"1387","DOI":"10.1126\/science.1112665","article-title":"Computational improvements reveal great bacterial diversity and high metal toxicity in soil","volume":"309","author":"Gans J","year":"2005","unstructured":"J Gans, M Wolinsky, J Dunbar, Computational improvements reveal great bacterial diversity and high metal toxicity in soil. Science 309, 1387\u20131390 (2005).","journal-title":"Science"},{"key":"e_1_3_4_7_2","volume-title":"The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet","year":"2007","unstructured":"The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet (National Research Council (US), National Academy Press, Washington, DC, 2007)."},{"key":"e_1_3_4_8_2","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1126\/science.1093857","article-title":"Environmental genome shotgun sequencing of the Sargasso Sea","volume":"304","author":"Venter J","year":"2004","unstructured":"J Venter, et al., Environmental genome shotgun sequencing of the Sargasso Sea. Science 304, 66\u201374 (2004).","journal-title":"Science"},{"key":"e_1_3_4_9_2","doi-asserted-by":"crossref","first-page":"368","DOI":"10.1038\/nature10576","article-title":"Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw","volume":"480","author":"Mackelprang R","year":"2011","unstructured":"R Mackelprang, et al., Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature 480, 368\u2013371 (2011).","journal-title":"Nature"},{"key":"e_1_3_4_10_2","doi-asserted-by":"crossref","first-page":"9748","DOI":"10.1073\/pnas.171285098","article-title":"An Eulerian path approach to DNA fragment assembly","volume":"98","author":"Pevzner P","year":"2001","unstructured":"P Pevzner, H Tang, M Waterman, An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 98, 9748\u20139753 (2001).","journal-title":"Proc Natl Acad Sci USA"},{"key":"e_1_3_4_11_2","doi-asserted-by":"crossref","first-page":"315","DOI":"10.1016\/j.ygeno.2010.03.001","article-title":"Assembly algorithms for next-generation sequencing data","volume":"95","author":"Miller J","year":"2010","unstructured":"J Miller, S Koren, G Sutton, Assembly algorithms for next-generation sequencing data. Genomics 95, 315\u2013327 (2010).","journal-title":"Genomics"},{"key":"e_1_3_4_12_2","doi-asserted-by":"crossref","first-page":"987","DOI":"10.1038\/nbt.2023","article-title":"How to apply de Bruijn graphs to genome assembly","volume":"29","author":"Compeau P","year":"2011","unstructured":"P Compeau, P Pevzner, G Tesler, How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 29, 987\u2013991 (2011).","journal-title":"Nat Biotechnol"},{"key":"e_1_3_4_13_2","doi-asserted-by":"crossref","first-page":"479","DOI":"10.1093\/bioinformatics\/btq697","article-title":"Succinct data structures for assembling large genomes","volume":"27","author":"Conway TC","year":"2011","unstructured":"TC Conway, AJ Bromage, Succinct data structures for assembling large genomes. Bioinformatics 27, 479\u2013486 (2011).","journal-title":"Bioinformatics"},{"key":"e_1_3_4_14_2","doi-asserted-by":"crossref","first-page":"1513","DOI":"10.1073\/pnas.1017351108","article-title":"High-quality draft assemblies of mammalian genomes from massively parallel sequence data","volume":"108","author":"Gnerre S","year":"2011","unstructured":"S Gnerre, et al., High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA 108, 1513\u20131518 (2011).","journal-title":"Proc Natl Acad Sci USA"},{"key":"e_1_3_4_15_2","doi-asserted-by":"crossref","first-page":"R116","DOI":"10.1186\/gb-2010-11-11-r116","article-title":"Quake: Quality-aware detection and correction of sequencing errors","volume":"11","author":"Kelley D","year":"2010","unstructured":"D Kelley, M Schatz, S Salzberg, Quake: Quality-aware detection and correction of sequencing errors. Genome Biol 11, R116 (2010).","journal-title":"Genome Biol"},{"key":"e_1_3_4_16_2","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1145\/362686.362692","article-title":"Space\/time tradeoffs in hash coding with allowable errors","volume":"13","author":"Bloom B","year":"1970","unstructured":"B Bloom, Space\/time tradeoffs in hash coding with allowable errors. CACM 13, 422\u2013426 (1970).","journal-title":"CACM"},{"key":"e_1_3_4_17_2","doi-asserted-by":"crossref","first-page":"603","DOI":"10.1089\/cmb.2009.0062","article-title":"A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware","volume":"17","author":"Shi H","year":"2010","unstructured":"H Shi, A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware. J Comput Biol 17, 603\u2013615 (2010).","journal-title":"J Comput Biol"},{"key":"e_1_3_4_18_2","doi-asserted-by":"crossref","first-page":"1595","DOI":"10.1093\/bioinformatics\/btq230","article-title":"Classification of DNA sequences using Bloom filters","volume":"26","author":"Stranneheim H","year":"2010","unstructured":"H Stranneheim, Classification of DNA sequences using Bloom filters. Bioinformatics 26, 1595\u20131600 (2010).","journal-title":"Bioinformatics"},{"key":"e_1_3_4_19_2","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1186\/1471-2105-12-333","article-title":"Efficient counting of k-mers in DNA sequences using a bloom filter","volume":"12","author":"Malsted P","year":"2011","unstructured":"P Malsted, Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12, 333 (2011).","journal-title":"BMC Bioinformatics"},{"key":"e_1_3_4_20_2","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1186\/1471-2105-12-85","article-title":"DecGPU: Distributed error correction on massively parallel graphics processing units using CUDA and MPI","volume":"12","author":"Liu Y","year":"2011","unstructured":"Y Liu, DecGPU: Distributed error correction on massively parallel graphics processing units using CUDA and MPI. BMC Bioinformatics 12, 85 (2011).","journal-title":"BMC Bioinformatics"},{"key":"e_1_3_4_21_2","doi-asserted-by":"crossref","first-page":"821","DOI":"10.1101\/gr.074492.107","article-title":"Velvet: Algorithms for de novo short read assembly using de Bruijn graphs","volume":"18","author":"Zerbino DR","year":"2008","unstructured":"DR Zerbino, E Birney, Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18, 821\u2013829 (2008).","journal-title":"Genome Res"},{"key":"e_1_3_4_22_2","doi-asserted-by":"crossref","first-page":"1117","DOI":"10.1101\/gr.089532.108","article-title":"ABySS: A parallel assembler for short read sequence data","volume":"19","author":"Simpson JT","year":"2009","unstructured":"JT Simpson, et al., ABySS: A parallel assembler for short read sequence data. Genome Res 19, 1117\u20131123 (2009).","journal-title":"Genome Res"},{"key":"e_1_3_4_23_2","volume-title":"ACM Conference on Bioinformatics, Computational Biology and Biomedicine","author":"Namiki T","year":"2011","unstructured":"T Namiki, T Hachiya, H Tanaka, Y Sakakibara, MetaVelvet: An extension of Velvet assembler to de novo metagenome assembly from short sequence reads. ACM Conference on Bioinformatics, Computational Biology and Biomedicine, 2011)."},{"key":"e_1_3_4_24_2","doi-asserted-by":"crossref","first-page":"i94","DOI":"10.1093\/bioinformatics\/btr216","article-title":"Meta-IDBA: A de Novo assembler for metagenomic data","volume":"27","author":"Peng Y","year":"2011","unstructured":"Y Peng, H Leung, S Yiu, F Chin, Meta-IDBA: A de Novo assembler for metagenomic data. Bioinformatics 27, i94\u2013i101 (2011).","journal-title":"Bioinformatics"},{"key":"e_1_3_4_25_2","doi-asserted-by":"crossref","first-page":"644","DOI":"10.1038\/nbt.1883","article-title":"Full-length transcriptome assembly from RNA-Seq data without a reference genome","volume":"29","author":"Grabherr M","year":"2011","unstructured":"M Grabherr, et al., Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644\u2013652 (2011).","journal-title":"Nat Biotechnol"},{"key":"e_1_3_4_26_2","volume-title":"Introduction to Percolation Theory","author":"Stauffer D","year":"1994","unstructured":"D Stauffer, A Aharony Introduction to Percolation Theory (Taylor and Frances, London, 1994)."},{"key":"e_1_3_4_27_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/0370-1573(79)90060-7","article-title":"Scaling theory of percolation clusters","volume":"54","author":"Stauffer D","year":"1979","unstructured":"D Stauffer, Scaling theory of percolation clusters. Phys Rep 54, 1\u201374 (1979).","journal-title":"Phys Rep"},{"key":"e_1_3_4_28_2","author":"Bondy J","year":"2006","unstructured":"J Bondy, U Murty Graph Theory. Graduate Texts in Mathematics (Springer, New York, 2006).","journal-title":"Graph Theory. Graduate Texts in Mathematics"},{"key":"e_1_3_4_29_2","unstructured":"DR Zerbino Genome assembly and comparison using de Bruijn graphs.  (Univ of Cambridge Cambridge UK PhD thesis. (2009)."},{"key":"e_1_3_4_30_2","doi-asserted-by":"crossref","first-page":"243","DOI":"10.4056\/sigs.1433550","article-title":"Meeting report: The terabase metagenomics workshop and the vision of an earth microbiome project","volume":"3","author":"Gilbert J","year":"2010","unstructured":"J Gilbert, et al., Meeting report: The terabase metagenomics workshop and the vision of an earth microbiome project. Stand Genomic Sci 3, 243\u2013248 (2010).","journal-title":"Stand Genomic Sci"},{"key":"e_1_3_4_31_2","doi-asserted-by":"crossref","first-page":"249","DOI":"10.4056\/aigs.1443528","article-title":"The Earth microbiome project: Meeting report of the \u201c1 EMP meeting on sample selection and acquisition\u201d at Argonne National Laboratory October 6 2010","volume":"3","author":"Gilbert J","year":"2010","unstructured":"J Gilbert, et al., The Earth microbiome project: Meeting report of the \u201c1 EMP meeting on sample selection and acquisition\u201d at Argonne National Laboratory October 6 2010. Stand Genomic Sci 3, 249\u2013253 (2010).","journal-title":"Stand Genomic Sci"},{"key":"e_1_3_4_32_2","first-page":"205","volume-title":"Cold Spring Harbor Symposia on Quantitative Biology","author":"Zhang Y","year":"2003","unstructured":"Y Zhang, M Waterman, DNA sequence assembly and multiple sequence alignment by an Eulerian path approach. Cold Spring Harbor Symposia on Quantitative Biology (Cold Spring Harbor Lab Press, Cold Spring Harbor, NY) Vol 68, 205\u2013212 (2003)."},{"key":"e_1_3_4_33_2","doi-asserted-by":"crossref","first-page":"i351","DOI":"10.1093\/bioinformatics\/bti1018","article-title":"De novo identification of repeat families in large genomes","volume":"21","author":"Price A","year":"2005","unstructured":"A Price, N Jones, P Pevzner, De novo identification of repeat families in large genomes. Bioinformatics 21, i351\u2013i358 (2005).","journal-title":"Bioinformatics"},{"key":"e_1_3_4_34_2","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1038\/ng.1028","article-title":"De novo assembly and genotyping of variants using colored de Bruijn graphs","volume":"44","author":"Iqbal Z","year":"2012","unstructured":"Z Iqbal, M Caccamo, I Turner, P Flicek, G McVean, De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet 44, 226\u2013232 (2012).","journal-title":"Nat Genet"},{"key":"e_1_3_4_35_2","doi-asserted-by":"crossref","first-page":"485","DOI":"10.1080\/15427951.2004.10129096","article-title":"Network applications of bloom filters: A survey","volume":"1","author":"Broder A","year":"2004","unstructured":"A Broder, M Mitzenmacher, Network applications of bloom filters: A survey. Internet Math 1, 485\u2013509 (2004).","journal-title":"Internet Math"},{"key":"e_1_3_4_36_2","doi-asserted-by":"crossref","first-page":"011907","DOI":"10.1103\/PhysRevE.66.011907","article-title":"Critical and near-critical branching processes","volume":"66","author":"Adami C","year":"2002","unstructured":"C Adami, J Chu, Critical and near-critical branching processes. Phys Rev E 66, 011907 (2002).","journal-title":"Phys Rev E"},{"key":"e_1_3_4_37_2","doi-asserted-by":"crossref","first-page":"426","DOI":"10.1090\/S0002-9947-1943-0012401-3","article-title":"Tests of statistical hypotheses concerning several parameters when the number of observations is large","volume":"54","author":"Wald A","year":"1943","unstructured":"A Wald, Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans Am Math Soc 54, 426\u2013482 (1943).","journal-title":"Trans Am Math Soc"}],"container-title":["Proceedings of the National Academy of Sciences"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/pnas.org\/doi\/pdf\/10.1073\/pnas.1121464109","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,6,7]],"date-time":"2022-06-07T08:53:15Z","timestamp":1654591995000},"score":1,"resource":{"primary":{"URL":"https:\/\/pnas.org\/doi\/full\/10.1073\/pnas.1121464109"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,7,30]]},"references-count":37,"journal-issue":{"issue":"33","published-print":{"date-parts":[[2012,8,14]]}},"alternative-id":["10.1073\/pnas.1121464109"],"URL":"https:\/\/doi.org\/10.1073\/pnas.1121464109","relation":{},"ISSN":["0027-8424","1091-6490"],"issn-type":[{"value":"0027-8424","type":"print"},{"value":"1091-6490","type":"electronic"}],"subject":[],"published":{"date-parts":[[2012,7,30]]},"assertion":[{"value":"2012-07-30","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}