{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,17]],"date-time":"2026-03-17T00:06:17Z","timestamp":1773705977606,"version":"3.50.1"},"reference-count":29,"publisher":"Oxford University Press (OUP)","issue":"24","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2014,12,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Motivation: Several applications in bioinformatics, such as genome assemblers and error corrections methods, rely on counting and keeping track of k -mers (substrings of length k ). Histograms of k -mer frequencies can give valuable insight into the underlying distribution and indicate the error rate and genome size sampled in the sequencing experiment.<\/jats:p>\n                  <jats:p>Results: We present KmerStream, a streaming algorithm for estimating the number of distinct k -mers present in high-throughput sequencing data. The algorithm runs in time linear in the size of the input and the space requirement are logarithmic in the size of the input. We derive a simple model that allows us to estimate the error rate of the sequencing experiment, as well as the genome size, using only the aggregate statistics reported by KmerStream.<\/jats:p>\n                  <jats:p>As an application we show how KmerStream can be used to compute the error rate of a DNA sequencing experiment. We run KmerStream on a set of 2656 whole genome sequenced individuals and compare the error rate to quality values reported by the sequencing equipment. We discover that while the quality values alone are largely reliable as a predictor of error rate, there is considerable variability in the error rates between sequencing runs, even when accounting for reported quality values.<\/jats:p>\n                  <jats:p>Availability and implementation: The tool KmerStream is written in C++ and is released under a GPL license. It is freely available at https:\/\/github.com\/pmelsted\/KmerStream<\/jats:p>\n                  <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <jats:p>Contact: \u00a0pmelsted@hi.is or Bjarni.Halldorsson@decode.is .<\/jats:p>","DOI":"10.1093\/bioinformatics\/btu713","type":"journal-article","created":{"date-parts":[[2014,10,30]],"date-time":"2014-10-30T01:01:31Z","timestamp":1414630891000},"page":"3541-3547","source":"Crossref","is-referenced-by-count":58,"title":["KmerStream: streaming algorithms for\n                    <i>k<\/i>\n                    -mer abundance estimation"],"prefix":"10.1093","volume":"30","author":[{"given":"P\u00e1ll","family":"Melsted","sequence":"first","affiliation":[{"name":"1 Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjav\u00edk, Iceland, 2 deCODE Genetics\/Amgen, Reykjav\u00edk, Iceland and 3 School of Science and Engineering, Reykjav\u00edk University, Reykjav\u00edk, Iceland"},{"name":"1 Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjav\u00edk, Iceland, 2 deCODE Genetics\/Amgen, Reykjav\u00edk, Iceland and 3 School of Science and Engineering, Reykjav\u00edk University, Reykjav\u00edk, Iceland"}]},{"given":"Bjarni V.","family":"Halld\u00f3rsson","sequence":"additional","affiliation":[{"name":"1 Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjav\u00edk, Iceland, 2 deCODE Genetics\/Amgen, Reykjav\u00edk, Iceland and 3 School of Science and Engineering, Reykjav\u00edk University, Reykjav\u00edk, Iceland"},{"name":"1 Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjav\u00edk, Iceland, 2 deCODE Genetics\/Amgen, Reykjav\u00edk, Iceland and 3 School of Science and Engineering, Reykjav\u00edk University, Reykjav\u00edk, Iceland"}]}],"member":"286","published-online":{"date-parts":[[2014,10,28]]},"reference":[{"key":"2023012712050682100_btu713-B1","doi-asserted-by":"crossref","first-page":"20","DOI":"10.1145\/237814.237823","article-title":"The space complexity of approximating the frequency moments","volume-title":"Proceedings of the twenty-eighth annual ACM symposium on Theory of computing","author":"Alon","year":"1996"},{"key":"2023012712050682100_btu713-B2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/3-540-45726-7_1","article-title":"Counting distinct elements in a data stream","volume-title":"Randomization and Approximation Techniques in Computer Science","author":"Bar-Yossef","year":"2002"},{"key":"2023012712050682100_btu713-B3","doi-asserted-by":"crossref","first-page":"1146","DOI":"10.1038\/nbt.1495","article-title":"The potential and challenges of nanopore sequencing","volume":"26","author":"Branton","year":"2008","journal-title":"Nature biotechnology"},{"key":"2023012712050682100_btu713-B4","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1093\/bioinformatics\/btt310","article-title":"Informed and automated k-mer size selection for genome assembly","volume":"30","author":"Chikhi","year":"2014","journal-title":"Bioinformatics"},{"key":"2023012712050682100_btu713-B5","doi-asserted-by":"crossref","first-page":"236","DOI":"10.1007\/978-3-642-33122-0_19","article-title":"Space-efficient and exact de Bruijn graph representation based on a Bloom filter","volume-title":"Algorithms in Bioinformatics","author":"Chikhi","year":"2012"},{"key":"2023012712050682100_btu713-B6","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1038\/nnano.2009.12","article-title":"Continuous base identification for single-molecule nanopore DNA sequencing","volume":"4","author":"Clarke","year":"2009","journal-title":"Nat. Nanotechnol."},{"key":"2023012712050682100_btu713-B7","doi-asserted-by":"crossref","first-page":"479","DOI":"10.1093\/bioinformatics\/btq697","article-title":"Succinct data structures for assembling large genomes","volume":"27","author":"Conway","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012712050682100_btu713-B8","doi-asserted-by":"crossref","first-page":"160","DOI":"10.1186\/1471-2105-14-160","article-title":"Disk-based k-mer counting on a pc","volume":"14","author":"Deorowicz","year":"2013","journal-title":"BMC bioinformatics"},{"key":"2023012712050682100_btu713-B9","doi-asserted-by":"crossref","first-page":"491","DOI":"10.1038\/ng.806","article-title":"A framework for variation discovery and genotyping using next-generation DNA sequencing data","volume":"43","author":"DePristo","year":"2011","journal-title":"Nat. Genetics"},{"key":"2023012712050682100_btu713-B10","doi-asserted-by":"crossref","first-page":"182","DOI":"10.1016\/0022-0000(85)90041-8","article-title":"Probabilistic counting algorithms for data base applications","volume":"31","author":"Flajolet","year":"1985","journal-title":"J. Comput. Syst. Sci."},{"key":"2023012712050682100_btu713-B11","doi-asserted-by":"crossref","first-page":"1513","DOI":"10.1073\/pnas.1017351108","article-title":"High-quality draft assemblies of mammalian genomes from massively parallel sequence data","volume":"108","author":"Gnerre","year":"2011","journal-title":"Proc Natl Acad Sci"},{"key":"2023012712050682100_btu713-B29","article-title":"Large-scale whole-genome sequencing of the Icelandic population","author":"Gudbjartsson","year":"2014"},{"key":"2023012712050682100_btu713-B12","doi-asserted-by":"crossref","first-page":"R116","DOI":"10.1186\/gb-2010-11-11-r116","article-title":"Quake: quality-aware detection and correction of sequencing errors","volume":"11","author":"Kelley","year":"2010","journal-title":"Genome Biol."},{"key":"2023012712050682100_btu713-B13","doi-asserted-by":"crossref","first-page":"1068","DOI":"10.1038\/ng.216","article-title":"Detection of sharing by descent, long-range phasing and haplotype imputation","volume":"40","author":"Kong","year":"2008","journal-title":"Nat. Genetics"},{"key":"2023012712050682100_btu713-B14","doi-asserted-by":"crossref","first-page":"517","DOI":"10.1186\/1471-2164-9-517","article-title":"A new method to compute k-mer frequencies and its application to annotate large repetitive plant genomes","volume":"9","author":"Kurtz","year":"2008","journal-title":"BMC Genomics"},{"key":"2023012712050682100_btu713-B15","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1101\/gr.097261.109","article-title":"De novo assembly of human genomes with massively parallel short read sequencing","volume":"20","author":"Li","year":"2010","journal-title":"Genome Res."},{"key":"2023012712050682100_btu713-B16","doi-asserted-by":"crossref","first-page":"308","DOI":"10.1093\/bioinformatics\/bts690","article-title":"Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data","volume":"29","author":"Liu","year":"2013","journal-title":"Bioinformatics"},{"key":"2023012712050682100_btu713-B17","doi-asserted-by":"crossref","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","article-title":"A fast, lock-free approach for efficient parallel counting of occurrences of k-mers","volume":"27","author":"Mar\u00e7ais","year":"2011","journal-title":"Bioinformatics"},{"key":"2023012712050682100_btu713-B18","doi-asserted-by":"crossref","first-page":"1297","DOI":"10.1101\/gr.107524.110","article-title":"The Genome Analysis Toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data","volume":"20","author":"McKenna","year":"2010","journal-title":"Genome Res."},{"key":"2023012712050682100_btu713-B19","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/1471-2105-12-451","article-title":"Identification and correction of systematic error in high-throughput sequence data","volume":"12","author":"Meacham","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023012712050682100_btu713-B20","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1186\/1471-2105-12-333","article-title":"Efficient counting of k-mers in DNA sequences using a Bloom filter","volume":"12","author":"Melsted","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023012712050682100_btu713-B21","doi-asserted-by":"crossref","first-page":"R112","DOI":"10.1186\/gb-2011-12-11-r112","article-title":"Evaluation of genomic high-throughput sequencing data generated on illumina hiseq and genome analyzer systems","volume":"12","author":"Minoche","year":"2011","journal-title":"Genome Biol"},{"key":"2023012712050682100_btu713-B22","doi-asserted-by":"crossref","first-page":"13272","DOI":"10.1073\/pnas.1121464109","article-title":"Scaling metagenome sequence assembly with probabilistic de Bruijn graphs","volume":"109","author":"Pell","year":"2012","journal-title":"Proc. Natl Acad. Sci."},{"key":"2023012712050682100_btu713-B23","doi-asserted-by":"crossref","first-page":"74","DOI":"10.1186\/gm290","article-title":"RNA-Seq and find: entering the RNA deep field","volume":"3","author":"Roberts","year":"2011","journal-title":"Genome Med."},{"key":"2023012712050682100_btu713-B24","doi-asserted-by":"crossref","first-page":"1950","DOI":"10.1093\/bioinformatics\/btu132","article-title":"Turtle: Identifying frequent k-mers with cache-efficient algorithms","volume":"30","author":"Roy","year":"2014","journal-title":"Bioinformatics"},{"key":"2023012712050682100_btu713-B25","doi-asserted-by":"crossref","first-page":"557","DOI":"10.1101\/gr.131383.111","article-title":"GAGE: A critical evaluation of genome assemblies and assembly algorithms","volume":"22","author":"Salzberg","year":"2012","journal-title":"Genome Res."},{"key":"2023012712050682100_btu713-B26","doi-asserted-by":"crossref","first-page":"2157","DOI":"10.1093\/bioinformatics\/btp379","article-title":"SHREC: a short-read error correction method","volume":"25","author":"Schr\u00f6der","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012712050682100_btu713-B27","doi-asserted-by":"crossref","first-page":"517","DOI":"10.1038\/nature12124","article-title":"Nonsense mutation in the LGR4 gene is associated with several human diseases and other traits","volume":"497","author":"Styrkarsdottir","year":"2013","journal-title":"Nature"},{"key":"2023012712050682100_btu713-B28","doi-asserted-by":"crossref","first-page":"821","DOI":"10.1101\/gr.074492.107","article-title":"Velvet: algorithms for de novo short read assembly using de Bruijn graphs","volume":"18","author":"Zerbino","year":"2008","journal-title":"Genome Res."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/30\/24\/3541\/48931408\/bioinformatics_30_24_3541.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/30\/24\/3541\/48931408\/bioinformatics_30_24_3541.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,27]],"date-time":"2023-01-27T08:03:50Z","timestamp":1674806630000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/30\/24\/3541\/2422237"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,10,28]]},"references-count":29,"journal-issue":{"issue":"24","published-print":{"date-parts":[[2014,12,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btu713","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/003962","asserted-by":"object"}]},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2014,12,15]]},"published":{"date-parts":[[2014,10,28]]}}}