{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T08:36:42Z","timestamp":1780475802119,"version":"3.54.1"},"update-to":[{"DOI":"10.1371\/journal.pcbi.1014158","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T00:00:00Z","timestamp":1775520000000}}],"reference-count":37,"publisher":"Public Library of Science (PLoS)","issue":"4","license":[{"start":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T00:00:00Z","timestamp":1775001600000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000024","name":"Canadian Institutes of Health Research","doi-asserted-by":"publisher","award":["PJT-183608"],"award-info":[{"award-number":["PJT-183608"]}],"id":[{"id":"10.13039\/501100000024","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000038","name":"Natural Sciences and Engineering Research Council of Canada","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000038","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["www.ploscompbiol.org"],"crossmark-restriction":false},"short-container-title":["PLoS Comput Biol"],"abstract":"<jats:p>\n                    <jats:italic>K<\/jats:italic>\n                    -mer counts are fundamental in many genomic data analysis tasks, providing valuable information for genome assembly, error correction, and variant detection. State-of-the-art\n                    <jats:italic>k<\/jats:italic>\n                    -mer counting tools employ various techniques, such as parallelism, probabilistic data structures, and disk utilization, to efficiently extract\n                    <jats:italic>k<\/jats:italic>\n                    -mer frequencies from large datasets. The distribution of\n                    <jats:italic>k<\/jats:italic>\n                    -mer counts in raw sequencing reads reveals key genomic characteristics such as genome size, heterozygosity, and basecalling quality. The number of reads containing a\n                    <jats:italic>k<\/jats:italic>\n                    -mer has also shown application in genome assembly and sequence analysis. We present ntStat, a toolkit that employs succinct Bloom filter data structures to track both\n                    <jats:italic>k<\/jats:italic>\n                    -mer count and depth information and use in downstream applications. ntStat models the\n                    <jats:italic>k<\/jats:italic>\n                    -mer count histogram using evolutionary computation, and infers valuable insights about the genome, sequencing data, and individual\n                    <jats:italic>k<\/jats:italic>\n                    -mers,\n                    <jats:italic>de novo<\/jats:italic>\n                    . ntStat consistently ran faster than DSK, BFCounter, hackgap, and Squeakr in all of our tests. Jellyfish performed faster than ntStat for human data with k\u2009=\u200925 but fell behind with k\u2009=\u200964. KMC3 was faster overall but at a high disk usage and memory cost. ntStat also used less memory than other non-disk-based k-mer counters and typically, 99.5-99.9% of the k-mers processed by ntStat are counted correctly. ntStat\u2019s histogram analysis module detected heterozygosity percentages and k-mer coverage for long-read datasets simulated from a diploid human genome with less than 1% and 0.5-fold difference to the ground truth. The analysis of simulated long read datasets showed an average error of just 2% in k-mer robustness estimates.\n                  <\/jats:p>","DOI":"10.1371\/journal.pcbi.1014158","type":"journal-article","created":{"date-parts":[[2026,4,1]],"date-time":"2026-04-01T17:30:13Z","timestamp":1775064613000},"page":"e1014158","update-policy":"https:\/\/doi.org\/10.1371\/journal.pcbi.corrections_policy","source":"Crossref","is-referenced-by-count":1,"title":["ntStat: k-mer characterization using occurrence statistics in raw sequencing data"],"prefix":"10.1371","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2126-5644","authenticated-orcid":true,"given":"Parham","family":"Kazemi","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lauren","family":"Coombe","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9890-2293","authenticated-orcid":true,"given":"Ren\u00e9 L.","family":"Warren","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0950-7839","authenticated-orcid":true,"given":"Inanc","family":"Birol","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"340","published-online":{"date-parts":[[2026,4,1]]},"reference":[{"key":"pcbi.1014158.ref001","doi-asserted-by":"crossref","first-page":"2289","DOI":"10.1016\/j.csbj.2024.05.025","article-title":"A survey of k-mer methods and applications in bioinformatics","volume":"23","author":"C Moeckel","year":"2024","journal-title":"Comput Struct Biotechnol J"},{"issue":"7","key":"pcbi.1014158.ref002","first-page":"758","article-title":"A combinatorial problem","volume":"49","author":"NG De Bruijn","year":"1946","journal-title":"Proceedings of the Section of Sciences of the Koninklijke Nederlandse Akademie van Wetenschappen te Amsterdam"},{"issue":"1","key":"pcbi.1014158.ref003","doi-asserted-by":"crossref","first-page":"17765","DOI":"10.1038\/s41598-023-44636-z","article-title":"GeneToCN: an alignment-free method for gene copy number estimation directly from next-generation sequencing reads","volume":"13","author":"F-D Pajuste","year":"2023","journal-title":"Sci Rep"},{"issue":"1","key":"pcbi.1014158.ref004","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1186\/s13059-022-02826-4","article-title":"STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci","volume":"23","author":"H Dashnow","year":"2022","journal-title":"Genome Biol"},{"issue":"3","key":"pcbi.1014158.ref005","doi-asserted-by":"crossref","first-page":"440","DOI":"10.1093\/bioinformatics\/18.3.440","article-title":"PatternHunter: faster and more sensitive homology search","volume":"18","author":"B Ma","year":"2002","journal-title":"Bioinformatics"},{"issue":"17","key":"pcbi.1014158.ref006","doi-asserted-by":"crossref","first-page":"2759","DOI":"10.1093\/bioinformatics\/btx304","article-title":"KMC 3: counting and manipulating k -mer statistics","volume":"33","author":"M Kokot","year":"2017","journal-title":"Bioinformatics"},{"key":"pcbi.1014158.ref007","unstructured":"Mohamadi H, Chu J, Coombe L, Warren R, Birol I. ntHits: de novo repeat identification of genomics data using a streaming approach [Internet]. 2020 [cited 2024 Oct 25]. Available from: http:\/\/biorxiv.org\/lookup\/doi\/10.1101\/2020.11.02"},{"key":"pcbi.1014158.ref008","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1186\/1471-2105-12-333","article-title":"Efficient counting of k-mers in DNA sequences using a bloom filter","volume":"12","author":"P Melsted","year":"2011","journal-title":"BMC Bioinformatics"},{"issue":"4","key":"pcbi.1014158.ref009","doi-asserted-by":"crossref","first-page":"568","DOI":"10.1093\/bioinformatics\/btx636","article-title":"Squeakr: an exact and approximate k -mer counting system","volume":"34","author":"P Pandey","year":"2018","journal-title":"Bioinformatics"},{"issue":"7","key":"pcbi.1014158.ref010","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1145\/362686.362692","article-title":"Space\/time trade-offs in hash coding with allowable errors","volume":"13","author":"BH Bloom","year":"1970","journal-title":"Commun ACM"},{"key":"pcbi.1014158.ref011","first-page":"775","article-title":"A General-Purpose Counting Filter: Making Every Bit Count.","volume-title":"Proceedings of the 2017 ACM International Conference on Management of Data [Internet]","author":"P Pandey","year":"2017"},{"issue":"6","key":"pcbi.1014158.ref012","doi-asserted-by":"crossref","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","article-title":"A fast, lock-free approach for efficient parallel counting of occurrences of k-mers","volume":"27","author":"G Mar\u00e7ais","year":"2011","journal-title":"Bioinformatics"},{"issue":"21","key":"pcbi.1014158.ref013","doi-asserted-by":"crossref","first-page":"4430","DOI":"10.1093\/bioinformatics\/btz400","article-title":"ntEdit: scalable genome sequence polishing","volume":"35","author":"RL Warren","year":"2019","journal-title":"Bioinformatics"},{"issue":"3","key":"pcbi.1014158.ref014","article-title":"JASPER: A fast genome polishing tool that improves accuracy of genome assemblies","volume":"19","author":"A Guo","year":"2023","journal-title":"PLoS Comput Biol"},{"key":"pcbi.1014158.ref015","volume":"242","author":"J Zentgraf","year":"2022","journal-title":"Fast Gapped k-mer Counting with Subdivided Multi-Way Bucketed Cuckoo Hash Tables"},{"key":"pcbi.1014158.ref016","first-page":"29","article-title":"Using tf-idf to determine word relevance in document queries.","volume-title":"Proceedings of the first instructional conference on machine learning","author":"J Ramos","year":"2003"},{"issue":"5","key":"pcbi.1014158.ref017","doi-asserted-by":"crossref","first-page":"722","DOI":"10.1101\/gr.215087.116","article-title":"Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation","volume":"27","author":"S Koren","year":"2017","journal-title":"Genome Res"},{"issue":"2","key":"pcbi.1014158.ref018","doi-asserted-by":"crossref","first-page":"344","DOI":"10.1093\/bioinformatics\/btab672","article-title":"Tiara: deep learning-based classification system for eukaryotic sequences","volume":"38","author":"M Karlicki","year":"2022","journal-title":"Bioinformatics"},{"issue":"9","key":"pcbi.1014158.ref019","doi-asserted-by":"crossref","first-page":"1324","DOI":"10.1093\/bioinformatics\/btw832","article-title":"ntCard: a streaming algorithm for cardinality estimation in genomics data","volume":"33","author":"H Mohamadi","year":"2017","journal-title":"Bioinformatics"},{"key":"pcbi.1014158.ref020","first-page":"438","article-title":"KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage.","volume-title":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics [Internet]","author":"S Behera","year":"2018"},{"issue":"14","key":"pcbi.1014158.ref021","doi-asserted-by":"crossref","first-page":"2202","DOI":"10.1093\/bioinformatics\/btx153","article-title":"GenomeScope: fast reference-free genome profiling from short reads","volume":"33","author":"GW Vurture","year":"2017","journal-title":"Bioinformatics"},{"issue":"1","key":"pcbi.1014158.ref022","doi-asserted-by":"crossref","first-page":"1432","DOI":"10.1038\/s41467-020-14998-3","article-title":"GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes","volume":"11","author":"TR Ranallo-Benavidez","year":"2020","journal-title":"Nat Commun"},{"issue":"79","key":"pcbi.1014158.ref023","doi-asserted-by":"crossref","first-page":"4720","DOI":"10.21105\/joss.04720","article-title":"btllib: A C library with Python interface for efficient genomic sequence processing","volume":"7","author":"V Nikoli\u0107","year":"2022","journal-title":"JOSS"},{"issue":"11","key":"pcbi.1014158.ref024","doi-asserted-by":"crossref","DOI":"10.1186\/gb-2010-11-11-r116","article-title":"Quake: quality-aware detection and correction of sequencing errors","volume":"11","author":"DR Kelley","year":"2010","journal-title":"Genome Biol"},{"issue":"4","key":"pcbi.1014158.ref025","doi-asserted-by":"crossref","first-page":"574","DOI":"10.1093\/bioinformatics\/btw663","article-title":"KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies","volume":"33","author":"D Mapleson","year":"2017","journal-title":"Bioinformatics"},{"issue":"4","key":"pcbi.1014158.ref026","doi-asserted-by":"crossref","first-page":"550","DOI":"10.1093\/bioinformatics\/btx637","article-title":"findGSE: estimating genome size variation within human and Arabidopsis using k -mer frequencies.","volume":"34","author":"H Sun","year":"2018","journal-title":"Bioinformatics"},{"issue":"4","key":"pcbi.1014158.ref027","doi-asserted-by":"crossref","first-page":"341","DOI":"10.1023\/A:1008202821328","article-title":"Differential evolution \u2013 A simple and efficient heuristic for global optimization over continuous spaces","volume":"11","author":"R Storn","year":"1997","journal-title":"Journal of Global Optimization"},{"issue":"1","key":"pcbi.1014158.ref028","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1093\/imamat\/6.1.76","article-title":"The convergence of a class of double-rank minimization algorithms 1. General considerations","volume":"6","author":"CG Broyden","year":"1970","journal-title":"IMA J Appl Math"},{"issue":"3","key":"pcbi.1014158.ref029","doi-asserted-by":"crossref","first-page":"317","DOI":"10.1093\/comjnl\/13.3.317","article-title":"A new approach to variable metric algorithms","volume":"13","author":"R Fletcher","year":"1970","journal-title":"The Computer Journal"},{"issue":"109","key":"pcbi.1014158.ref030","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1090\/S0025-5718-1970-0258249-6","article-title":"A family of variable-metric methods derived by variational means","volume":"24","author":"D Goldfarb","year":"1970","journal-title":"Math Comp"},{"issue":"111","key":"pcbi.1014158.ref031","doi-asserted-by":"crossref","first-page":"647","DOI":"10.1090\/S0025-5718-1970-0274029-X","article-title":"Conditioning of quasi-Newton methods for function minimization","volume":"24","author":"DF Shanno","year":"1970","journal-title":"Math Comp"},{"issue":"5331","key":"pcbi.1014158.ref032","doi-asserted-by":"crossref","first-page":"1453","DOI":"10.1126\/science.277.5331.1453","article-title":"The Complete Genome Sequence of Escherichia coli K-12","volume":"277","author":"FR Blattner","year":"1997","journal-title":"Science"},{"issue":"11","key":"pcbi.1014158.ref033","doi-asserted-by":"crossref","first-page":"1533","DOI":"10.1093\/bioinformatics\/bts187","article-title":"pIRS: Profile-based Illumina pair-end reads simulator","volume":"28","author":"X Hu","year":"2012","journal-title":"Bioinformatics"},{"issue":"4","key":"pcbi.1014158.ref034","doi-asserted-by":"crossref","DOI":"10.1093\/gigascience\/gix010","article-title":"NanoSim: nanopore sequence read simulator based on statistical characterization","volume":"6","author":"C Yang","year":"2017","journal-title":"GigaScience"},{"issue":"6588","key":"pcbi.1014158.ref035","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1126\/science.abj6987","article-title":"The complete sequence of a human genome","volume":"376","author":"S Nurk","year":"2022","journal-title":"Science"},{"issue":"10","key":"pcbi.1014158.ref036","doi-asserted-by":"crossref","first-page":"1474","DOI":"10.1038\/s41587-023-01662-6","article-title":"Telomere-to-telomere assembly of diploid chromosomes with Verkko","volume":"41","author":"M Rautiainen","year":"2023","journal-title":"Nat Biotechnol"},{"issue":"1","key":"pcbi.1014158.ref037","doi-asserted-by":"crossref","DOI":"10.1093\/bioadv\/vbaf287","article-title":"ntRoot: computational inference of human ancestry at scale from genomic data","volume":"5","author":"RL Warren","year":"2025","journal-title":"Bioinform Adv"}],"updated-by":[{"DOI":"10.1371\/journal.pcbi.1014158","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T00:00:00Z","timestamp":1775520000000}}],"container-title":["PLOS Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1014158","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T17:45:34Z","timestamp":1775583934000},"score":1,"resource":{"primary":{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1014158"}},"subtitle":[],"editor":[{"given":"Michael","family":"Domaratzki","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"editor"}]}],"short-title":[],"issued":{"date-parts":[[2026,4,1]]},"references-count":37,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2026,4,1]]}},"URL":"https:\/\/doi.org\/10.1371\/journal.pcbi.1014158","relation":{},"ISSN":["1553-7358"],"issn-type":[{"value":"1553-7358","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,4,1]]}}}