{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,4]],"date-time":"2025-09-04T13:25:02Z","timestamp":1756992302294,"version":"3.37.3"},"reference-count":27,"publisher":"Oxford University Press (OUP)","issue":"Supplement_2","license":[{"start":{"date-parts":[[2020,12,1]],"date-time":"2020-12-01T00:00:00Z","timestamp":1606780800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"CUHK Young Researcher Award"},{"name":"Council General Research Funds","award":["14145916","14170217"],"award-info":[{"award-number":["14145916","14170217"]}]},{"name":"Collaborative Research Funds","award":["C4054-16G","C4045-18WF","C4057-18EF"],"award-info":[{"award-number":["C4054-16G","C4045-18WF","C4057-18EF"]}]},{"name":"Theme-based Research Scheme","award":["T12C-714\/14-R"],"award-info":[{"award-number":["T12C-714\/14-R"]}]},{"name":"Hong Kong Epigenomics Project"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,12,30]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>In de novo sequence assembly, a standard pre-processing step is k-mer counting, which computes the number of occurrences of every length-k sub-sequence in the sequencing reads. Sequencing errors can produce many k-mers that do not appear in the genome, leading to the need for an excessive amount of memory during counting. This issue is particularly serious when the genome to be assembled is large, the sequencing depth is high, or when the memory available is limited.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>Here, we propose a fast near-exact k-mer counting method, CQF-deNoise, which has a module for dynamically removing noisy false k-mers. It automatically determines the suitable time and number of rounds of noise removal according to a user-specified wrong removal rate. We tested CQF-deNoise comprehensively using data generated from a diverse set of genomes with various data properties, and found that the memory consumed was almost constant regardless of the sequencing errors while the noise removal procedure\u00a0had minimal effects on counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consistently performed the best in terms of memory usage, consuming 49\u201376% less memory than the\u00a0second best method. When counting the k-mers from a human dataset with around 60\u00d7 coverage, the peak\u00a0memory usage of CQF-deNoise was only 10.9\u00a0GB (gigabytes) for k\u2009=\u200928 and 21.5\u00a0GB for k\u2009=\u200955. De novo assembly of 106\u00d7 human sequencing data using CQF-deNoise for k-mer counting required only 2.7\u2009h and 90\u00a0GB peak memory.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The source codes of CQF-deNoise and SH-assembly are available at https:\/\/github.com\/Christina-hshi\/CQF-deNoise.git and https:\/\/github.com\/Christina-hshi\/SH-assembly.git, respectively, both under the BSD 3-Clause license.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa890","type":"journal-article","created":{"date-parts":[[2020,10,1]],"date-time":"2020-10-01T03:36:16Z","timestamp":1601523376000},"page":"i625-i633","source":"Crossref","is-referenced-by-count":4,"title":["A general near-exact k-mer counting method with low memory consumption enables <i>de novo<\/i> assembly of 106\u00d7 human sequence data in 2.7 hours"],"prefix":"10.1093","volume":"36","author":[{"given":"Christina Huan","family":"Shi","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5516-9944","authenticated-orcid":false,"given":"Kevin Y.","family":"Yip","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering"},{"name":"Hong Kong Bioinformatics Centre"},{"name":"Hong Kong Institute of Diabetes and Obesity, The Chinese University of Hong Kong , Shatin, New Territories, Hong Kong SAR"}]}],"member":"286","published-online":{"date-parts":[[2020,12,29]]},"reference":[{"key":"2023062409330646900_btaa890-B1","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1145\/362686.362692","article-title":"Space\/time trade-offs in hash coding with allowable errors","volume":"13","author":"Bloom","year":"1970","journal-title":"Commun. ACM"},{"year":"2020","author":"Bushnell","key":"2023062409330646900_btaa890-B2"},{"first-page":"272","year":"2011","author":"Chapuis","key":"2023062409330646900_btaa890-B3"},{"key":"2023062409330646900_btaa890-B4","doi-asserted-by":"crossref","first-page":"i201","DOI":"10.1093\/bioinformatics\/btw279","article-title":"Compacting de Bruijn graphs from sequencing data quickly and in low memory","volume":"32","author":"Chikhi","year":"2016","journal-title":"Bioinformatics"},{"key":"2023062409330646900_btaa890-B5","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1186\/1748-7188-8-22","article-title":"Space-efficient and exact de Bruijn graph representation based on a Bloom filter","volume":"8","author":"Chikhi","year":"2013","journal-title":"Algorithms Mol. Biol"},{"key":"2023062409330646900_btaa890-B6","doi-asserted-by":"crossref","first-page":"281","DOI":"10.1109\/90.851975","article-title":"Summary cache: a scalable wide-area web cache sharing protocol","volume":"8","author":"Fan","year":"2000","journal-title":"IEEE\/ACM Trans. Netw"},{"key":"2023062409330646900_btaa890-B7","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1038\/nrg.2016.49","article-title":"Coming of age: ten years of next-generation sequencing technologies","volume":"17","author":"Goodwin","year":"2016","journal-title":"Nat. Rev. Genet"},{"key":"2023062409330646900_btaa890-B8","doi-asserted-by":"crossref","first-page":"1072","DOI":"10.1093\/bioinformatics\/btt086","article-title":"QUAST: quality assessment tool for genome assemblies","volume":"29","author":"Gurevich","year":"2013","journal-title":"Bioinformatics"},{"key":"2023062409330646900_btaa890-B9","doi-asserted-by":"crossref","first-page":"1354","DOI":"10.1093\/bioinformatics\/btu030","article-title":"BLESS: bloom filter-based error correction solution for high-throughput sequencing reads","volume":"30","author":"Heo","year":"2014","journal-title":"Bioinformatics"},{"key":"2023062409330646900_btaa890-B10","doi-asserted-by":"crossref","first-page":"593","DOI":"10.1093\/bioinformatics\/btr708","article-title":"ART: a next-generation sequencing read simulator","volume":"28","author":"Huang","year":"2012","journal-title":"Bioinformatics"},{"key":"2023062409330646900_btaa890-B11","doi-asserted-by":"crossref","first-page":"768","DOI":"10.1101\/gr.214346.116","article-title":"ABySS 2.0: resource-efficient assembly of large genomes using a bloom filter","volume":"27","author":"Jackman","year":"2017","journal-title":"Genome Res"},{"key":"2023062409330646900_btaa890-B12","doi-asserted-by":"crossref","first-page":"2759","DOI":"10.1093\/bioinformatics\/btx304","article-title":"KMC 3: counting and manipulating k-mer statistics","volume":"33","author":"Kokot","year":"2017","journal-title":"Bioinformatics"},{"key":"2023062409330646900_btaa890-B13","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1038\/nature08696","article-title":"The sequence and de novo assembly of the giant panda genome","volume":"463","author":"Li","year":"2010","journal-title":"Nature"},{"key":"2023062409330646900_btaa890-B14","doi-asserted-by":"crossref","first-page":"1916","DOI":"10.1101\/gr.1251803","article-title":"Estimating the repeat structure and length of DNA sequences using l-tuples","volume":"13","author":"Li","year":"2003","journal-title":"Genome Res"},{"key":"2023062409330646900_btaa890-B15","doi-asserted-by":"crossref","first-page":"3264","DOI":"10.1093\/bioinformatics\/btu513","article-title":"Trowel: a fast and accurate error correction module for illumina sequencing reads","volume":"30","author":"Lim","year":"2014","journal-title":"Bioinformatics"},{"key":"2023062409330646900_btaa890-B16","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1186\/2047-217X-1-18","article-title":"SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler","volume":"1","author":"Luo","year":"2012","journal-title":"Gigascience"},{"key":"2023062409330646900_btaa890-B17","doi-asserted-by":"crossref","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","article-title":"A fast, lock-free approach for efficient parallel counting of occurrences of k-mers","volume":"27","author":"Mar\u00e7ais","year":"2011","journal-title":"Bioinformatics"},{"key":"2023062409330646900_btaa890-B18","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1186\/1471-2105-12-333","article-title":"Efficient counting of k-mers in DNA sequences using a bloom filter","volume":"12","author":"Melsted","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023062409330646900_btaa890-B19","doi-asserted-by":"crossref","first-page":"3492","DOI":"10.1093\/bioinformatics\/btw397","article-title":"ntHash: recursive nucleotide hashing","volume":"32","author":"Mohamadi","year":"2016","journal-title":"Bioinformatics"},{"key":"2023062409330646900_btaa890-B20","doi-asserted-by":"crossref","first-page":"1324","DOI":"10.1093\/bioinformatics\/btw832","article-title":"ntCard: a streaming algorithm for cardinality estimation in genomics data","volume":"33","author":"Mohamadi","year":"2017","journal-title":"Bioinformatics"},{"first-page":"775","year":"2017","author":"Pandey","key":"2023062409330646900_btaa890-B21"},{"key":"2023062409330646900_btaa890-B22","doi-asserted-by":"crossref","first-page":"568","DOI":"10.1093\/bioinformatics\/btx636","article-title":"Squeakr: an exact and approximate k-mer counting system","volume":"34","author":"Pandey","year":"2017","journal-title":"Bioinformatics"},{"key":"2023062409330646900_btaa890-B23","doi-asserted-by":"crossref","first-page":"586","DOI":"10.1016\/j.molcel.2015.05.004","article-title":"High-throughput sequencing technologies","volume":"58","author":"Reuter","year":"2015","journal-title":"Mol. Cell"},{"key":"2023062409330646900_btaa890-B24","doi-asserted-by":"crossref","first-page":"1950","DOI":"10.1093\/bioinformatics\/btu132","article-title":"Turtle: identifying frequent k-mers with cache-efficient algorithms","volume":"30","author":"Roy","year":"2014","journal-title":"Bioinformatics"},{"key":"2023062409330646900_btaa890-B25","doi-asserted-by":"crossref","first-page":"300","DOI":"10.1038\/nbt.3442","article-title":"Fast search of thousands of short-read sequencing experiments","volume":"34","author":"Solomon","year":"2016","journal-title":"Nat. Biotechnol"},{"key":"2023062409330646900_btaa890-B26","doi-asserted-by":"crossref","first-page":"153","DOI":"10.1186\/s13059-018-1540-z","article-title":"SKESA: strategic k-mer extension for scrupulous assemblies","volume":"19","author":"Souvorov","year":"2018","journal-title":"Genome Biol"},{"key":"2023062409330646900_btaa890-B27","doi-asserted-by":"crossref","first-page":"88","DOI":"10.1038\/s41592-018-0236-3","article-title":"Long-read sequence and assembly of segmental duplications","volume":"16","author":"Vollger","year":"2019","journal-title":"Nat. Methods"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/Supplement_2\/i625\/50693402\/btaa890.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/Supplement_2\/i625\/50693402\/btaa890.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,24]],"date-time":"2023-06-24T23:57:21Z","timestamp":1687651041000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/36\/Supplement_2\/i625\/6055931"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,12]]},"references-count":27,"journal-issue":{"issue":"Supplement_2","published-print":{"date-parts":[[2020,12,30]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa890","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2020,12]]},"published":{"date-parts":[[2020,12]]}}}