{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:03Z","timestamp":1772138043962,"version":"3.50.1"},"reference-count":23,"publisher":"Oxford University Press (OUP)","issue":"17","license":[{"start":{"date-parts":[[2021,3,8]],"date-time":"2021-03-08T00:00:00Z","timestamp":1615161600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,9,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers\u2014ordered m-mers where m\u2009&amp;lt;\u2009k\u2014are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing, as well as generally increase memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1\/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Discount is GPL licensed and available at https:\/\/github.com\/jtnystrom\/discount. The data underlying this article are available in the article and in its online supplementary material.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab156","type":"journal-article","created":{"date-parts":[[2021,3,3]],"date-time":"2021-03-03T15:31:34Z","timestamp":1614785494000},"page":"2563-2569","source":"Crossref","is-referenced-by-count":17,"title":["Compact and evenly distributed\n                    <i>k<\/i>\n                    -mer binning for genomic sequences"],"prefix":"10.1093","volume":"37","author":[{"given":"Johan","family":"Nystr\u00f6m-Persson","sequence":"first","affiliation":[{"name":"JNP Solutions , Yokoami, Sumida-ku, Tokyo 130-0015, Japan"},{"name":"Department of R&D, Lifematics Inc ., Kanda Jinbocho, Chiyoda-ku, Tokyo 101-0051, Japan"}]},{"given":"Gabriel","family":"Keeble-Gagn\u00e8re","sequence":"additional","affiliation":[{"name":"Department of Agriculture Victoria Research, AgriBio, Centre for AgriBioscience , Bundoora, VIC 3083, Australia"}]},{"given":"Niamat","family":"Zawad","sequence":"additional","affiliation":[{"name":"Department of R&D, Lifematics Inc ., Kanda Jinbocho, Chiyoda-ku, Tokyo 101-0051, Japan"}]}],"member":"286","published-online":{"date-parts":[[2021,3,8]]},"reference":[{"key":"2023051609214088500_btab156-B1","doi-asserted-by":"crossref","first-page":"1659","DOI":"10.1093\/bioinformatics\/btx753","article-title":"Mapping-free variant calling using haplotype reconstruction from k-mer frequencies","volume":"34","author":"Audano","year":"2018","journal-title":"Bioinformatics"},{"key":"2023051609214088500_btab156-B2","first-page":"35","author":"Chikhi","year":"2014"},{"key":"2023051609214088500_btab156-B3","first-page":"167","author":"DeBlasio","year":"2019"},{"key":"2023051609214088500_btab156-B4","doi-asserted-by":"crossref","first-page":"1569","DOI":"10.1093\/bioinformatics\/btv022","article-title":"KMC 2: fast and resource-frugal k-mer counting","volume":"31","author":"Deorowicz","year":"2015","journal-title":"Bioinformatics"},{"key":"2023051609214088500_btab156-B5","first-page":"146","author":"Efe","year":"2018"},{"key":"2023051609214088500_btab156-B6","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1007\/978-3-030-45257-5_3","volume-title":"Research in Computational Molecular Biology","author":"Ekim","year":"2020"},{"key":"2023051609214088500_btab156-B7","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13015-017-0097-9","article-title":"Gerbil: a fast and memory-efficient k-mer counter with GPU-support","volume":"12","author":"Erbert","year":"2017","journal-title":"Algorithms Mol. Biol"},{"key":"2023051609214088500_btab156-B8","doi-asserted-by":"crossref","first-page":"1575","DOI":"10.1093\/bioinformatics\/btx010","article-title":"Fastdoop: a versatile and efficient library for the input of fasta and fastq files for mapreduce hadoop bioinformatics applications","volume":"33","author":"Ferraro Petrillo","year":"2017","journal-title":"Bioinformatics"},{"key":"2023051609214088500_btab156-B9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12859-019-2694-8","article-title":"Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics","volume":"20","author":"Ferraro Petrillo","year":"2019","journal-title":"BMC Bioinformatics"},{"key":"2023051609214088500_btab156-B10","doi-asserted-by":"crossref","first-page":"463","DOI":"10.1126\/science.1200387","article-title":"Metagenomic discovery of biomass-degrading genes and genomes from cow rumen","volume":"331","author":"Hess","year":"2011","journal-title":"Science"},{"key":"2023051609214088500_btab156-B11","doi-asserted-by":"crossref","first-page":"i111","DOI":"10.1093\/bioinformatics\/btaa435","article-title":"Weighted minimizer sampling improves long read mapping","volume":"36","author":"Jain","year":"2020","journal-title":"Bioinformatics (Oxford, England)"},{"key":"2023051609214088500_btab156-B12","first-page":"2759","article-title":"KMC 3: counting and manipulating k-mer statistics","volume":"33","author":"Kokot","year":"2017","journal-title":"Bioinformatics (Oxford, England)"},{"key":"2023051609214088500_btab156-B13","doi-asserted-by":"crossref","first-page":"722","DOI":"10.1101\/gr.215087.116","article-title":"Canu: scalable and accurate long-read assembly via adaptive \u03ba-mer weighting and repeat separation","volume":"27","author":"Koren","year":"2017","journal-title":"Genome Res"},{"key":"2023051609214088500_btab156-B14","first-page":"1","article-title":"A benchmark study of k-mer counting methods for high-throughput sequencing","volume":"7","author":"Manekar","year":"2018","journal-title":"GigaScience"},{"key":"2023051609214088500_btab156-B15","doi-asserted-by":"crossref","first-page":"i110","DOI":"10.1093\/bioinformatics\/btx235","article-title":"Improving the performance of minimizers and winnowing schemes","volume":"33","author":"Mar\u00e7ais","year":"2017","journal-title":"Bioinformatics"},{"key":"2023051609214088500_btab156-B16","first-page":"257","volume-title":"Algorithms in Bioinformatics. WABI 2016. Lecture Notes in Computer Science","author":"Orenstein","year":"2016"},{"key":"2023051609214088500_btab156-B17","doi-asserted-by":"crossref","first-page":"e1005777-15","DOI":"10.1371\/journal.pcbi.1005777","article-title":"Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing","volume":"13","author":"Orenstein","year":"2017","journal-title":"PLoS Comput. Biol"},{"key":"2023051609214088500_btab156-B18","doi-asserted-by":"crossref","first-page":"652","DOI":"10.1093\/bioinformatics\/btt020","article-title":"DSK: k -mer counting with very low memory usage","volume":"29","author":"Rizk","year":"2013","journal-title":"Bioinformatics"},{"key":"2023051609214088500_btab156-B19","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2023051609214088500_btab156-B20","doi-asserted-by":"crossref","first-page":"1261359","DOI":"10.1126\/science.1261359","article-title":"Structure and function of the global ocean microbiome","volume":"348","author":"Sunagawa","year":"2015","journal-title":"Science"},{"key":"2023051609214088500_btab156-B21","year":"2020"},{"key":"2023051609214088500_btab156-B22","doi-asserted-by":"crossref","first-page":"R46","DOI":"10.1186\/gb-2014-15-3-r46","article-title":"Kraken: ultrafast metagenomic sequence classification using exact alignments","volume":"15","author":"Wood","year":"2014","journal-title":"Genome Biol"},{"key":"2023051609214088500_btab156-B23","doi-asserted-by":"crossref","first-page":"i119","DOI":"10.1093\/bioinformatics\/btaa472","article-title":"Improved design and analysis of practical minimizers","volume":"36","author":"Zheng","year":"2020","journal-title":"Bioinformatics (Oxford, England)"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab156\/38609377\/btab156.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/17\/2563\/50339356\/btab156.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/17\/2563\/50339356\/btab156.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,5,16]],"date-time":"2023-05-16T05:27:10Z","timestamp":1684214830000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/17\/2563\/6162158"}},"subtitle":[],"editor":[{"given":"Peter","family":"Robinson","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,3,8]]},"references-count":23,"journal-issue":{"issue":"17","published-print":{"date-parts":[[2021,9,9]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab156","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.10.12.335364","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,9,1]]},"published":{"date-parts":[[2021,3,8]]}}}