{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T09:04:17Z","timestamp":1778144657587,"version":"3.51.4"},"reference-count":31,"publisher":"Oxford University Press (OUP)","issue":"9","license":[{"start":{"date-parts":[[2017,1,5]],"date-time":"2017-01-05T00:00:00Z","timestamp":1483574400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01HG007182"],"award-info":[{"award-number":["R01HG007182"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2017,5,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>Here, we present ntCard, a streaming algorithm for estimating the frequencies of k-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480\u2009GB, 500\u2009GB and 2.4 TB in size, where the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates k-mer coverage frequencies &amp;gt;15\u00d7\u2009faster than the state-of-the-art algorithms, using similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and Implementation<\/jats:title>\n                  <jats:p>ntCard is written in C\u2009++\u2009and is released under the GPL license. It is freely available at https:\/\/github.com\/bcgsc\/ntCard.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btw832","type":"journal-article","created":{"date-parts":[[2017,1,6]],"date-time":"2017-01-06T01:35:34Z","timestamp":1483666534000},"page":"1324-1330","source":"Crossref","is-referenced-by-count":66,"title":["ntCard: a streaming algorithm for cardinality estimation in genomics data"],"prefix":"10.1093","volume":"33","author":[{"given":"Hamid","family":"Mohamadi","sequence":"first","affiliation":[{"name":"Canada\u2019s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada"},{"name":"Faculty of Science, University of British Columbia, Vancouver, BC, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hamza","family":"Khan","sequence":"additional","affiliation":[{"name":"Canada\u2019s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada"},{"name":"Faculty of Science, University of British Columbia, Vancouver, BC, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Inanc","family":"Birol","sequence":"additional","affiliation":[{"name":"Canada\u2019s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada"},{"name":"Faculty of Science, University of British Columbia, Vancouver, BC, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2017,1,5]]},"reference":[{"key":"2023020205024304500_btw832-B1","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1006\/jcss.1997.1545","article-title":"The space complexity of approximating the frequency moments","volume":"58","author":"Alon","year":"1999","journal-title":"J. Comput. Syst. Sci"},{"key":"2023020205024304500_btw832-B2","first-page":"1","author":"Bar-Yossef","year":"2002"},{"key":"2023020205024304500_btw832-B3","doi-asserted-by":"crossref","first-page":"810","DOI":"10.1101\/gr.7337908","article-title":"ALLPATHS: de novo assembly of whole-genome shotgun microreads","volume":"18","author":"Butler","year":"2008","journal-title":"Gen. Res"},{"key":"2023020205024304500_btw832-B4","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1093\/bioinformatics\/btt310","article-title":"Informed and automated k-mer size selection for genome assembly","volume":"30","author":"Chikhi","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B5","doi-asserted-by":"crossref","first-page":"3402","DOI":"10.1093\/bioinformatics\/btu558","article-title":"BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters","volume":"30","author":"Chu","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B6","doi-asserted-by":"crossref","first-page":"479","DOI":"10.1093\/bioinformatics\/btq697","article-title":"Succinct data structures for assembling large genomes","volume":"27","author":"Conway","year":"2011","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B7","author":"Cormode","year":"2005"},{"key":"2023020205024304500_btw832-B8","doi-asserted-by":"crossref","first-page":"58","DOI":"10.1016\/j.jalgor.2003.12.001","article-title":"An improved data stream summary: the count-min sketch and its applications","volume":"55","author":"Cormode","year":"2005","journal-title":"J. Algorithms"},{"key":"2023020205024304500_btw832-B9","doi-asserted-by":"crossref","first-page":"1569","DOI":"10.1093\/bioinformatics\/btv022","article-title":"KMC 2: fast and resource-frugal k-mer counting","volume":"31","author":"Deorowicz","year":"2015","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B10","doi-asserted-by":"crossref","first-page":"1792","DOI":"10.1093\/nar\/gkh340","article-title":"MUSCLE: multiple sequence alignment with high accuracy and high throughput","volume":"32","author":"Edgar","year":"2004","journal-title":"Nucl. Acids Res"},{"key":"2023020205024304500_btw832-B11","doi-asserted-by":"crossref","first-page":"182","DOI":"10.1016\/0022-0000(85)90041-8","article-title":"Probabilistic counting algorithms for data base applications","volume":"31","author":"Flajolet","year":"1985","journal-title":"J. Comput. Syst. Sci"},{"key":"2023020205024304500_btw832-B12","doi-asserted-by":"crossref","first-page":"1354","DOI":"10.1093\/bioinformatics\/btu030","article-title":"BLESS: bloom filter-based error correction solution for high-throughput sequencing reads","volume":"30","author":"Heo","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B13","author":"Indyk","year":"2005"},{"key":"2023020205024304500_btw832-B14","first-page":"1","article-title":"Efficient cardinality estimation for k-mers in large DNA sequencing data sets","author":"Irber Junior","year":"2016","journal-title":"bioRxiv"},{"key":"2023020205024304500_btw832-B15","first-page":"1","article-title":"ABySS 2.0: resource-efficient assembly of large genomes using a bloom filter","author":"Jackman","year":"2016","journal-title":"bioRxiv"},{"key":"2023020205024304500_btw832-B16","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1101\/gr.097261.109","article-title":"De novo assembly of human genomes with massively parallel short read sequencing","volume":"20","author":"Li","year":"2010","journal-title":"Gen. Res"},{"key":"2023020205024304500_btw832-B17","doi-asserted-by":"crossref","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","article-title":"A fast, lock-free approach for efficient parallel counting of occurrences of k-mers","volume":"27","author":"Mar\u00e7ais","year":"2011","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B18","doi-asserted-by":"crossref","first-page":"i137","DOI":"10.1093\/bioinformatics\/btr208","article-title":"Error correction of high-throughput sequencing datasets with non-uniform coverage","volume":"27","author":"Medvedev","year":"2011","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B19","doi-asserted-by":"crossref","first-page":"3541","DOI":"10.1093\/bioinformatics\/btu713","article-title":"KmerStream: streaming algorithms for k-mer abundance estimation","volume":"30","author":"Melsted","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B20","doi-asserted-by":"crossref","first-page":"333.","DOI":"10.1186\/1471-2105-12-333","article-title":"Efficient counting of k-mers in DNA sequences using a bloom filter","volume":"12","author":"Melsted","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023020205024304500_btw832-B21","doi-asserted-by":"crossref","first-page":"3492","DOI":"10.1093\/bioinformatics\/btw397","article-title":"ntHash: recursive nucleotide hashing","volume":"32","author":"Mohamadi","year":"2016","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B22","doi-asserted-by":"crossref","first-page":"3021","DOI":"10.1093\/bioinformatics\/btw369","article-title":"Assemblytics: a web analytics tool for the detection of variants from an assembly","volume":"32","author":"Nattestad","year":"2016","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B23","doi-asserted-by":"crossref","first-page":"462","DOI":"10.1038\/nbt.2862","article-title":"Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms","volume":"32","author":"Patro","year":"2014","journal-title":"Nat. Biotech"},{"key":"2023020205024304500_btw832-B24","doi-asserted-by":"crossref","first-page":"652","DOI":"10.1093\/bioinformatics\/btt020","article-title":"DSK: k-mer counting with very low memory usage","volume":"29","author":"Rizk","year":"2013","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B25","doi-asserted-by":"crossref","first-page":"557","DOI":"10.1101\/gr.131383.111","article-title":"GAGE: a critical evaluation of genome assemblies and assembly algorithms","volume":"22","author":"Salzberg","year":"2012","journal-title":"Gen. Res"},{"key":"2023020205024304500_btw832-B26","doi-asserted-by":"crossref","first-page":"i538","DOI":"10.1093\/bioinformatics\/btw460","article-title":"Fast genotyping of known SNPs through approximate k-mer matching","volume":"32","author":"Shajii","year":"2016","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B27","doi-asserted-by":"crossref","first-page":"1228","DOI":"10.1093\/bioinformatics\/btu023","article-title":"Exploring genome characteristics and sequence quality without a reference","volume":"30","author":"Simpson","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020205024304500_btw832-B28","doi-asserted-by":"crossref","first-page":"1117","DOI":"10.1101\/gr.089532.108","article-title":"ABySS: a parallel assembler for short read sequence data","volume":"19","author":"Simpson","year":"2009","journal-title":"Gen. Res"},{"key":"2023020205024304500_btw832-B29","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1111\/tpj.12886","article-title":"Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism","volume":"83","author":"Warren","year":"2015","journal-title":"Plant J"},{"key":"2023020205024304500_btw832-B30","doi-asserted-by":"crossref","first-page":"821","DOI":"10.1101\/gr.074492.107","article-title":"Velvet: algorithms for de novo short read assembly using de Bruijn graphs","volume":"18","author":"Zerbino","year":"2008","journal-title":"Gen. Res"},{"key":"2023020205024304500_btw832-B31","doi-asserted-by":"crossref","first-page":"160025","DOI":"10.1038\/sdata.2016.25","article-title":"Extensive sequencing of seven human genomes to characterize benchmark reference materials","volume":"3","author":"Zook","year":"2016","journal-title":"Sci. Data"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/33\/9\/1324\/49038754\/bioinformatics_33_9_1324.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/33\/9\/1324\/49038754\/bioinformatics_33_9_1324.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,2]],"date-time":"2023-02-02T05:06:52Z","timestamp":1675314412000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/33\/9\/1324\/2832780"}},"subtitle":[],"editor":[{"given":"Bonnie","family":"Berger","sequence":"additional","affiliation":[],"role":[{"role":"editor","vocabulary":"crossref"}]}],"short-title":[],"issued":{"date-parts":[[2017,1,5]]},"references-count":31,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2017,5,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btw832","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2017,5,1]]},"published":{"date-parts":[[2017,1,5]]}}}