{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,12]],"date-time":"2025-12-12T13:07:24Z","timestamp":1765544844057,"version":"3.37.3"},"reference-count":65,"publisher":"Oxford University Press (OUP)","issue":"13","license":[{"start":{"date-parts":[[2022,5,18]],"date-time":"2022-05-18T00:00:00Z","timestamp":1652832000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"name":"Italian Ministry of Education, University and Research","award":["20174LF3T8"],"award-info":[{"award-number":["20174LF3T8"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,6,27]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful reads sampling scheme, which allows to extract a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https:\/\/github.com\/VandinLab\/SPRISS.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac180","type":"journal-article","created":{"date-parts":[[2022,5,18]],"date-time":"2022-05-18T12:22:45Z","timestamp":1652876565000},"page":"3343-3350","source":"Crossref","is-referenced-by-count":2,"title":["SPRISS: approximating frequent<i>k<\/i>-mers by sampling reads, and applications"],"prefix":"10.1093","volume":"38","author":[{"given":"Diego","family":"Santoro","sequence":"first","affiliation":[{"name":"Department of Information Engineering, University of Padova , 35131 Padova, Italy"}]},{"given":"Leonardo","family":"Pellegrina","sequence":"additional","affiliation":[{"name":"Department of Information Engineering, University of Padova , 35131 Padova, Italy"}]},{"given":"Matteo","family":"Comin","sequence":"additional","affiliation":[{"name":"Department of Information Engineering, University of Padova , 35131 Padova, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2244-2320","authenticated-orcid":false,"given":"Fabio","family":"Vandin","sequence":"additional","affiliation":[{"name":"Department of Information Engineering, University of Padova , 35131 Padova, Italy"}]}],"member":"286","published-online":{"date-parts":[[2022,5,18]]},"reference":[{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"i169","DOI":"10.1093\/bioinformatics\/bty292","article-title":"A space and time-efficient index for the compacted colored de Bruijn graph","volume":"34","author":"Almodaresi","year":"2018","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"2070","DOI":"10.1093\/bioinformatics\/btu152","article-title":"Kanalyze: a fast versatile pipelined k-mer toolkit","volume":"30","author":"Audano","year":"2014","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"243","DOI":"10.1186\/s13059-017-1372-2","article-title":"De-kupl: exhaustive capture of biological variation in RNA-seq data through k-mer decomposition","volume":"18","author":"Audoux","year":"2017","journal-title":"Genome Biol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"e94","DOI":"10.7717\/peerj-cs.94","article-title":"Multiple comparative metagenomics using multiset k-mer counting","volume":"2","author":"Benoit","year":"2016","journal-title":"PeerJ Comput. Sci"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"152","DOI":"10.1038\/s41587-018-0010-1","article-title":"Ultrafast search of all deposited bacterial and viral genomic data","volume":"37","author":"Bradley","year":"2019","journal-title":"Nat. Biotechnol"},{"year":"2012","author":"Brown","key":"2023041407595247100_"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1093\/bioinformatics\/btt310","article-title":"Informed and automated k-mer size selection for genome assembly","volume":"30","author":"Chikhi","year":"2014","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","first-page":"35","volume-title":"International Conference on Research in Computational Molecular Biology, RECOMB 2014","author":"Chikhi","year":"2014"},{"key":"2023041407595247100_","first-page":"17","article-title":"Data structures to represent a set of k-long DNA sequences","volume":"54","author":"Chikhi","year":"2021","journal-title":"ACM Comput. Surv"},{"first-page":"852889","year":"2019","author":"Coleman","key":"2023041407595247100_"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"i766","DOI":"10.1093\/bioinformatics\/bty567","article-title":"Dream-yara: an exact read mapper for very large databases with short update time","volume":"34","author":"Dadi","year":"2018","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"0144","DOI":"10.1038\/s41559-017-0144","article-title":"A submarine volcanic eruption leads to a novel microbial habitat","volume":"1","author":"Danovaro","year":"2017","journal-title":"Nat. Ecol. Evol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"e1700585","DOI":"10.1126\/sciadv.1700585","article-title":"Carryover effects of larval exposure to different environmental bacteria drive adult trait variation in a mosquito vector","volume":"3","author":"Dickson","year":"2017","journal-title":"Sci. Adv"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"5217","DOI":"10.1093\/nar\/gkaa265","article-title":"To petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics","volume":"48","author":"Elworth","year":"2020","journal-title":"Nucleic Acids Res"},{"first-page":"2157","year":"2021","author":"Guo","key":"2023041407595247100_"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"721","DOI":"10.1093\/bioinformatics\/btz662","article-title":"Improved representation of sequence bloom trees","volume":"36","author":"Harris","year":"2020","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1146\/annurev-biodatasci-072018-021229","article-title":"Genomic data compression","volume":"2","author":"Hernaez","year":"2019","journal-title":"Annu. Rev. Biomed. Data Sci"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-020-02135-8","article-title":"Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs","volume":"21","author":"Holley","year":"2020","journal-title":"Genome Biol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"56","DOI":"10.3390\/info7040056","article-title":"A survey on data compression methods for biological sequences","volume":"7","author":"Hosseini","year":"2016","journal-title":"Information"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"R116","DOI":"10.1186\/gb-2010-11-11-r116","article-title":"Quake: quality-aware detection and correction of sequencing errors","volume":"11","author":"Kelley","year":"2010","journal-title":"Genome Biol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"2759","DOI":"10.1093\/bioinformatics\/btx304","article-title":"Kmc 3: counting and manipulating k-mer statistics","volume":"33","author":"Kokot","year":"2017","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"517","DOI":"10.1186\/1471-2164-9-517","article-title":"A new method to compute k-mer frequencies and its application to annotate large repetitive plant genomes","volume":"9","author":"Kurtz","year":"2008","journal-title":"BMC Genomics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"2987","DOI":"10.1093\/bioinformatics\/btr509","article-title":"A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data","volume":"27","author":"Li","year":"2011","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"1754","DOI":"10.1093\/bioinformatics\/btp324","article-title":"Fast and accurate short read alignment with burrows-wheeler transform","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"1916","DOI":"10.1101\/gr.1251803","article-title":"Estimating the repeat structure and length of DNA sequences using \u2113-tuples","volume":"13","author":"Li","year":"2003","journal-title":"Genome Res"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"42444","DOI":"10.1038\/srep42444","article-title":"Unbiased k-mer analysis reveals changes in copy number of highly repetitive sequences during maize domestication and improvement","volume":"7","author":"Liu","year":"2017","journal-title":"Sci. Rep"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"764","DOI":"10.1093\/bioinformatics\/btr011","article-title":"A fast, lock-free approach for efficient parallel counting of occurrences of k-mers","volume":"27","author":"Mar\u00e7ais","year":"2011","journal-title":"Bioinformatics"},{"first-page":"546309","year":"2019","author":"Marchet","key":"2023041407595247100_"},{"first-page":"i177","year":"2020","author":"Marchet","key":"2023041407595247100_"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"92","DOI":"10.1016\/j.dam.2018.03.035","article-title":"A resource-frugal probabilistic dictionary and applications in bioinformatics","volume":"274","author":"Marchet","year":"2020","journal-title":"Discrete Appl. Math"},{"first-page":"1","year":"2021","author":"Marchet","key":"2023041407595247100_"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"3541","DOI":"10.1093\/bioinformatics\/btu713","article-title":"Kmerstream: streaming algorithms for k-mer abundance estimation","volume":"30","author":"Melsted","year":"2014","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1186\/1471-2105-12-333","article-title":"Efficient counting of k-mers in DNA sequences using a bloom filter","volume":"12","author":"Melsted","year":"2011","journal-title":"BMC Bioinformatics"},{"volume-title":"Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis","year":"2017","author":"Mitzenmacher","key":"2023041407595247100_"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"1324","DOI":"10.1093\/bioinformatics\/btw832","article-title":"ntcard: a streaming algorithm for cardinality estimation in genomics data","volume":"33","author":"Mohamadi","year":"2017","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"1005","DOI":"10.1038\/nmeth.4037","article-title":"Comparison of high-throughput sequencing data compression tools","volume":"13","author":"Numanagi\u0107","year":"2016","journal-title":"Nat. Methods"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1186\/s13059-016-0997-x","article-title":"Mash: fast genome and metagenome distance estimation using MinHash","volume":"17","author":"Ondov","year":"2016","journal-title":"Genome Biol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"236","DOI":"10.1186\/s12864-015-1419-2","article-title":"Clark: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers","volume":"16","author":"Ounit","year":"2015","journal-title":"BMC Genomics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"568","DOI":"10.1093\/bioinformatics\/btx636","article-title":"Squeakr: an exact and approximate k-mer counting system","volume":"34","author":"Pandey","year":"2017","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1016\/j.cels.2018.05.021","article-title":"Mantis: a fast, small, and exact large-scale sequence-search index","volume":"7","author":"Pandey","year":"2018","journal-title":"Cell Syst"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"462","DOI":"10.1038\/nbt.2862","article-title":"Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms","volume":"32","author":"Patro","year":"2014","journal-title":"Nat. Biotechnol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"534","DOI":"10.1089\/cmb.2019.0314","article-title":"Fast approximation of frequent k-mers and applications to metagenomics","volume":"27","author":"Pellegrina","year":"2020","journal-title":"J. Comput. Biol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","DOI":"10.1007\/978-1-4612-5254-2","volume-title":"Convergence of Stochastic Processes","author":"Pollard","year":"1984"},{"key":"2023041407595247100_","first-page":"152","volume-title":"International Conference on Research in Computational Molecular Biology, RECOMB 2020","author":"Rahman","year":"2020"},{"year":"2021","author":"Rahman","key":"2023041407595247100_"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"652","DOI":"10.1093\/bioinformatics\/btt020","article-title":"Dsk: k-mer counting with very low memory usage","volume":"29","author":"Rizk","year":"2013","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"1950","DOI":"10.1093\/bioinformatics\/btu132","article-title":"Turtle: identifying frequent k-mers with cache-efficient algorithms","volume":"30","author":"Roy","year":"2014","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1371\/journal.pbio.0050077","article-title":"The sorcerer II global ocean sampling expedition: northwest Atlantic through eastern tropical pacific","volume":"5","author":"Rusch","year":"2007","journal-title":"PLoS Biol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"114715","DOI":"10.1109\/ACCESS.2020.3003918","article-title":"Mining discriminative k-mers in DNA sequences using sketches and hardware acceleration","volume":"8","author":"Saavedra","year":"2020","journal-title":"IEEE Access"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"799","DOI":"10.1093\/bioinformatics\/btw321","article-title":"Accurate self-correction of errors in long reads using de Bruijn graphs","volume":"33","author":"Salmela","year":"2016","journal-title":"Bioinformatics"},{"year":"2021","author":"Santoro","key":"2023041407595247100_"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"308","DOI":"10.1093\/nar\/29.1.308","article-title":"dbSNP: the NCBI database of genetic variation","volume":"29","author":"Sherry","year":"2001","journal-title":"Nucleic Acids Res"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"2677","DOI":"10.1073\/pnas.0813249106","article-title":"Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions","volume":"106","author":"Sims","year":"2009","journal-title":"Proc. Natl. Acad. Sci. USA"},{"year":"2016","author":"Sivadasan","key":"2023041407595247100_"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"300","DOI":"10.1038\/nbt.3442","article-title":"Fast search of thousands of short-read sequencing experiments","volume":"34","author":"Solomon","year":"2016","journal-title":"Nat. Biotechnol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1089\/cmb.2017.0265","article-title":"Improved search of large transcriptomic sequencing databases using split sequence bloom trees","volume":"25","author":"Solomon","year":"2018","journal-title":"J. Comput. Biol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"415","DOI":"10.1093\/bioinformatics\/bty641","article-title":"Toward fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics","volume":"35","author":"Sun","year":"2018","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"467","DOI":"10.1089\/cmb.2017.0258","article-title":"Allsome sequence bloom trees","volume":"25","author":"Sun","year":"2018","journal-title":"J. Comput. Biol"},{"volume-title":"Statistical Learning Theory","year":"1998","author":"Vapnik","key":"2023041407595247100_"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"324","DOI":"10.1186\/s12859-017-1724-7","article-title":"An improved filtering algorithm for big read datasets and its application to single-cell assembly","volume":"18","author":"Wedemeyer","year":"2017","journal-title":"BMC Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"R46","DOI":"10.1186\/gb-2014-15-3-r46","article-title":"Kraken: ultrafast metagenomic sequence classification using exact alignments","volume":"15","author":"Wood","year":"2014","journal-title":"Genome Biol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1186\/s13059-018-1535-9","article-title":"Seqothello: querying RNA-seq experiments at scale","volume":"19","author":"Yu","year":"2018","journal-title":"Genome Biol"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"e101271","DOI":"10.1371\/journal.pone.0101271","article-title":"These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure","volume":"9","author":"Zhang","year":"2014","journal-title":"PLoS One"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"i283","DOI":"10.1093\/bioinformatics\/btu288","article-title":"RNA-skim: a rapid method for RNA-seq quantification at transcript level","volume":"30","author":"Zhang","year":"2014","journal-title":"Bioinformatics"},{"key":"2023041407595247100_","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1038\/nbt.2835","article-title":"Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls","volume":"32","author":"Zook","year":"2014","journal-title":"Nat. Biotechnol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btac180\/44136669\/btac180.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/13\/3343\/49883849\/btac180.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/13\/3343\/49883849\/btac180.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,21]],"date-time":"2023-11-21T19:56:03Z","timestamp":1700596563000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/13\/3343\/6588068"}},"subtitle":[],"editor":[{"given":"Can","family":"Alkan","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2022,5,18]]},"references-count":65,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2022,6,27]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac180","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"type":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2022,7,1]]},"published":{"date-parts":[[2022,5,18]]}}}