{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T20:08:17Z","timestamp":1774382897776,"version":"3.50.1"},"reference-count":36,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2022,6,27]],"date-time":"2022-06-27T00:00:00Z","timestamp":1656288000000},"content-version":"vor","delay-in-days":3,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"MobiDataLab","award":["101006879"],"award-info":[{"award-number":["101006879"]}]},{"name":"OK-INSAID","award":["MIUR-PON 2018"],"award-info":[{"award-number":["MIUR-PON 2018"]}]},{"name":"OK-INSAID","award":["ARS01_00917"],"award-info":[{"award-number":["ARS01_00917"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,6,24]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the hearth of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, in the size of billions of strings\u2014in such cases, the memory consumption and query efficiency of the data structure is a concrete challenge.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is: a data structure where strings are represented in compact form and each of them is associated to a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space\/time trade-off of the dictionary compared to the best-known solutions.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>https:\/\/github.com\/jermp\/sshash.<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information<\/jats:title><jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac245","type":"journal-article","created":{"date-parts":[[2022,4,14]],"date-time":"2022-04-14T11:10:15Z","timestamp":1649934615000},"page":"i185-i194","source":"Crossref","is-referenced-by-count":72,"title":["Sparse and skew hashing of K-mers"],"prefix":"10.1093","volume":"38","author":[{"given":"Giulio Ermanno","family":"Pibiri","sequence":"first","affiliation":[{"name":"ISTI-CNR , Pisa 56124, Italy"}]}],"member":"286","published-online":{"date-parts":[[2022,6,27]]},"reference":[{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"i169","DOI":"10.1093\/bioinformatics\/bty292","article-title":"A space and time-efficient index for the compacted colored de Bruijn graph","volume":"34","author":"Almodaresi","year":"2018","journal-title":"Bioinformatics"},{"key":"2023041407545851300_","first-page":"285","author":"Bingmann","year":"2019"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-021-02297-z","article-title":"Simplitigs as an efficient and scalable representation of de Bruijn graphs","volume":"22","author":"B\u0159inda","year":"2021","journal-title":"Genome Biol"},{"key":"2023041407545851300_","volume-title":"Digital SRC Research Report","author":"Burrows","year":"1994"},{"key":"2023041407545851300_","first-page":"35","author":"Chikhi","year":"2014"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"i201","DOI":"10.1093\/bioinformatics\/btw279","article-title":"Compacting de Bruijn graphs from sequencing data quickly and in low memory","volume":"32","author":"Chikhi","year":"2016","journal-title":"Bioinformatics"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3445967","article-title":"Data structures to represent a set of k-long DNA sequences","volume":"54","author":"Chikhi","year":"2021","journal-title":"ACM Comput. Surv"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1145\/321812.321820","article-title":"Efficient storage and retrieval by content and address of static files","volume":"21","author":"Elias","year":"1974","journal-title":"J. ACM"},{"key":"2023041407545851300_","author":"Fano","year":"1971"},{"key":"2023041407545851300_","first-page":"390","author":"Ferragina","year":"2000"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-020-02135-8","article-title":"Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs","volume":"21","author":"Holley","year":"2020","journal-title":"Genome Biol"},{"key":"2023041407545851300_","first-page":"1","author":"Italiano","year":"2021"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"768","DOI":"10.1101\/gr.214346.116","article-title":"ABySS 2.0: resource-efficient assembly of large genomes using a bloom filter","volume":"27","author":"Jackman","year":"2017","journal-title":"Genome Res"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"i111","DOI":"10.1093\/bioinformatics\/btaa435","article-title":"Weighted minimizer sampling improves long read mapping","volume":"36","author":"Jain","year":"2020","journal-title":"Bioinformatics"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"i177","DOI":"10.1093\/bioinformatics\/btab309","article-title":"Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections","volume":"37(Suppl_1","author":"Khan","year":"2021","journal-title":"Bioinformatics"},{"key":"2023041407545851300_","article-title":"Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with cuttlefish 2","author":"Khan","year":"2021","journal-title":"bioRxiv"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"169","DOI":"10.14778\/2535569.2448951","article-title":"Memory efficient minimum substring partitioning","volume":"6","author":"Li","year":"2013","journal-title":"Proc. VLDB Endow"},{"key":"2023041407545851300_","first-page":"1","author":"Loukides","year":"2021"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"2858","DOI":"10.1093\/bioinformatics\/btab217","article-title":"Blight: efficient exact associative structure for k-mers","volume":"37","author":"Marchet","year":"2021","journal-title":"Bioinformatics"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1016\/j.is.2015.08.008","article-title":"Practical compressed string dictionaries","volume":"56","author":"Mart\u00ednez-Prieto","year":"2016","journal-title":"Inf. Syst"},{"key":"2023041407545851300_","first-page":"170","author":"Mehlhorn","year":"1982"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"3187","DOI":"10.1109\/TKDE.2020.2966609","article-title":"Compressed indexes for fast search of semantic data","volume":"33","author":"Perego","year":"2021","journal-title":"IEEE Trans. Knowl. Data Eng"},{"key":"2023041407545851300_","author":"Pibiri","year":"2021"},{"key":"2023041407545851300_","author":"Pibiri","year":"2021"},{"key":"2023041407545851300_","first-page":"2:1","article-title":"Clustered Elias-Fano indexes","volume":"36","author":"Pibiri","year":"2017","journal-title":"ACM Trans. Inf. Syst"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3302913","article-title":"Handling massive N-gram datasets efficiently","volume":"37","author":"Pibiri","year":"2019","journal-title":"ACM Trans. Inf. Syst"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3415148","article-title":"Techniques for inverted index compression","volume":"53","author":"Pibiri","year":"2021","journal-title":"ACM Comput. Surv"},{"key":"2023041407545851300_","author":"Rahman","year":"2020"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1007\/978-3-030-86692-1_13","volume-title":"String Processing and Information Retrieval","author":"Robidou","year":"2021"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"2080","DOI":"10.1101\/gr.275648.121","article-title":"Effective sequence similarity detection with strobemers","volume":"31","author":"Sahlin","year":"2021","journal-title":"Genome Res"},{"key":"2023041407545851300_","first-page":"76","author":"Schleimer","year":"2003"},{"key":"2023041407545851300_","first-page":"8","author":"Shibuya","year":"2021"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"1117","DOI":"10.1101\/gr.089532.108","article-title":"Abyss: a parallel assembler for short read sequence data","volume":"19","author":"Simpson","year":"2009","journal-title":"Genome Res"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"300","DOI":"10.1038\/nbt.3442","article-title":"Fast search of thousands of short-read sequencing experiments","volume":"34","author":"Solomon","year":"2016","journal-title":"Nat. Biotechnol"},{"key":"2023041407545851300_","doi-asserted-by":"crossref","first-page":"i119","DOI":"10.1093\/bioinformatics\/btaa472","article-title":"Improved design and analysis of practical minimizers","volume":"36","author":"Zheng","year":"2020","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/Supplement_1\/i185\/49887045\/btac245.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/Supplement_1\/i185\/49887045\/btac245.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,22]],"date-time":"2024-09-22T06:55:56Z","timestamp":1726988156000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/Supplement_1\/i185\/6617506"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,24]]},"references-count":36,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2022,6,24]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac245","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,7,1]]},"published":{"date-parts":[[2022,6,24]]}}}