{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T03:46:17Z","timestamp":1775274377662,"version":"3.50.1"},"reference-count":28,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2021,7,12]],"date-time":"2021-07-12T00:00:00Z","timestamp":1626048000000},"content-version":"vor","delay-in-days":11,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000936","name":"Gordon and Betty Moore Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000936","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Data-Driven Discovery Initiative","award":["GBMF4554"],"award-info":[{"award-number":["GBMF4554"]}]},{"DOI":"10.13039\/100000002","name":"US National Institutes of Health","doi-asserted-by":"crossref","award":["R01GM122935"],"award-info":[{"award-number":["R01GM122935"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000001","name":"US National Science Foundation","doi-asserted-by":"crossref","award":["DBI-1937540"],"award-info":[{"award-number":["DBI-1937540"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100010319","name":"The Shurl and Kay Curci Foundation","doi-asserted-by":"publisher","award":["4100070287"],"award-info":[{"award-number":["4100070287"]}],"id":[{"id":"10.13039\/100010319","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100004897","name":"Pennsylvania Department of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100004897","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,8,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>A reference implementation and code for analyses under an open-source license are at https:\/\/github.com\/kingsford-group\/polarset.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab313","type":"journal-article","created":{"date-parts":[[2021,5,12]],"date-time":"2021-05-12T15:22:57Z","timestamp":1620832977000},"page":"i187-i195","source":"Crossref","is-referenced-by-count":25,"title":["Sequence-specific minimizers via polar sets"],"prefix":"10.1093","volume":"37","author":[{"given":"Hongyu","family":"Zheng","sequence":"first","affiliation":[{"name":"Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, PA 15213, USA"}]},{"given":"Carl","family":"Kingsford","sequence":"additional","affiliation":[{"name":"Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, PA 15213, USA"}]},{"given":"Guillaume","family":"Mar\u00e7ais","sequence":"additional","affiliation":[{"name":"Computational Biology Department, School of Computer Science, Carnegie Mellon University , Pittsburgh, PA 15213, USA"}]}],"member":"286","published-online":{"date-parts":[[2021,7,12]]},"reference":[{"key":"2023062410305007400_btab313-B1","doi-asserted-by":"crossref","first-page":"e0189960","DOI":"10.1371\/journal.pone.0189960","article-title":"Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches","volume":"13","author":"Almutairy","year":"2018","journal-title":"PLoS One"},{"key":"2023062410305007400_btab313-B2","doi-asserted-by":"crossref","first-page":"4890","DOI":"10.1109\/TIT.2015.2456634","article-title":"Non-overlapping codes","volume":"61","author":"Blackburn","year":"2015","journal-title":"IEEE Trans. Inf. Theory"},{"key":"2023062410305007400_btab313-B3","doi-asserted-by":"crossref","first-page":"i201","DOI":"10.1093\/bioinformatics\/btw279","article-title":"Compacting de Bruijn graphs from sequencing data quickly and in low memory","volume":"32","author":"Chikhi","year":"2016","journal-title":"Bioinformatics"},{"key":"2023062410305007400_btab313-B4","first-page":"167","author":"DeBlasio","year":"2019"},{"key":"2023062410305007400_btab313-B5","doi-asserted-by":"crossref","first-page":"1569","DOI":"10.1093\/bioinformatics\/btv022","article-title":"KMC 2: fast and resource-frugal k-mer counting","volume":"31","author":"Deorowicz","year":"2015","journal-title":"Bioinformatics"},{"key":"2023062410305007400_btab313-B6","doi-asserted-by":"publisher","DOI":"10.7717\/peerj.10805"},{"key":"2023062410305007400_btab313-B7","doi-asserted-by":"publisher","author":"Ekim","year":"2020","DOI":"10.1007\/978-3-030-45257-5_3"},{"key":"2023062410305007400_btab313-B8","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1186\/s13015-017-0097-9","article-title":"Gerbil: a fast and memory-efficient k-mer counter with GPU-support","volume":"12","author":"Erbert","year":"2017","journal-title":"Algorithms Mol. Biol"},{"key":"2023062410305007400_btab313-B9","author":"Frith","year":"2020"},{"key":"2023062410305007400_btab313-B10","author":"Jain","year":"2020"},{"key":"2023062410305007400_btab313-B11","doi-asserted-by":"crossref","first-page":"i111","DOI":"10.1093\/bioinformatics\/btaa435","article-title":"Weighted minimizer sampling improves long read mapping","volume":"36","author":"Jain","year":"2020","journal-title":"Bioinformatics"},{"key":"2023062410305007400_btab313-B12","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1093\/bioinformatics\/btu687","article-title":"E-mem: efficient computation of maximal exact matches for very large genomes","volume":"31","author":"Khiste","year":"2015","journal-title":"Bioinformatics"},{"key":"2023062410305007400_btab313-B13","first-page":"88","article-title":"Maximum number of words in codes without overlaps","volume":"6","author":"Levenshtein","year":"1970","journal-title":"Problemy Peredachi Informatsii"},{"key":"2023062410305007400_btab313-B14","doi-asserted-by":"crossref","first-page":"3094","DOI":"10.1093\/bioinformatics\/bty191","article-title":"Minimap2: pairwise alignment for nucleotide sequences","volume":"34","author":"Li","year":"2018","journal-title":"Bioinformatics"},{"key":"2023062410305007400_btab313-B15","doi-asserted-by":"crossref","first-page":"4560","DOI":"10.1093\/bioinformatics\/btz273","article-title":"Fast detection of maximal exact matches via fixed sampling of query k-mers and bloom filtering of index k-mers","volume":"35","author":"Liu","year":"2019","journal-title":"Bioinformatics"},{"key":"2023062410305007400_btab313-B16","doi-asserted-by":"crossref","first-page":"i110","DOI":"10.1093\/bioinformatics\/btx235","article-title":"Improving the performance of minimizers and winnowing schemes","volume":"33","author":"Mar\u00e7ais","year":"2017","journal-title":"Bioinformatics"},{"key":"2023062410305007400_btab313-B17","doi-asserted-by":"crossref","first-page":"i13","DOI":"10.1093\/bioinformatics\/bty258","article-title":"Asymptotically optimal minimizers schemes","volume":"34","author":"Mar\u00e7ais","year":"2018","journal-title":"Bioinformatics"},{"key":"2023062410305007400_btab313-B18","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1146\/annurev-biodatasci-072018-021156","article-title":"Sketching and sublinear data structures in genomics","volume":"2","author":"Mar\u00e7ais","year":"2019","journal-title":"Annu. Rev. Biomed. Data Sci"},{"key":"2023062410305007400_btab313-B19","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1038\/s41586-020-2547-7","article-title":"Telomere-to-telomere assembly of a complete human x chromosome","volume":"585","author":"Miga","year":"2020","journal-title":"Nature"},{"key":"2023062410305007400_btab313-B20","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1016\/0095-8956(72)90006-8","article-title":"A proof of Golomb\u2019s conjecture for the de Bruijn graph","volume":"13","author":"Mykkeltveit","year":"1972","journal-title":"J. Comb. Theory Ser. B"},{"key":"2023062410305007400_btab313-B21","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btab156.\u00a0"},{"key":"2023062410305007400_btab313-B22","author":"Orenstein Y, Pellow D, Mar\u00e7ais G, Shamir R, Kingsford C (2017) Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Comput Biol, 13: e1005777. 10.1371\/journal.pcbi.1005777"},{"key":"2023062410305007400_btab313-B23","doi-asserted-by":"crossref","first-page":"547","DOI":"10.1089\/cmb.2016.0155","article-title":"Improving bloom filter performance on sequence data using k-mer bloom filters","volume":"24","author":"Pellow","year":"2017","journal-title":"J. Comput. Biol"},{"key":"2023062410305007400_btab313-B24","doi-asserted-by":"crossref","first-page":"734","DOI":"10.1089\/cmb.2004.11.734","article-title":"A preprocessor for shotgun assembly of large genomes","volume":"11","author":"Roberts","year":"2004","journal-title":"J. Comput. Biol"},{"key":"2023062410305007400_btab313-B25","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2023062410305007400_btab313-B26","first-page":"76","author":"Schleimer","year":"2003"},{"key":"2023062410305007400_btab313-B27","doi-asserted-by":"crossref","first-page":"S1","DOI":"10.1186\/1471-2105-13-S6-S1","article-title":"Exploiting sparseness in de novo genome assembly","volume":"13","author":"Ye","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2023062410305007400_btab313-B28","doi-asserted-by":"crossref","first-page":"i119","DOI":"10.1093\/bioinformatics\/btaa472","article-title":"Improved design and analysis of practical minimizers","volume":"36","author":"Zheng","year":"2020","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/Supplement_1\/i187\/50694416\/btab313.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/37\/Supplement_1\/i187\/50694416\/btab313.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,24]],"date-time":"2023-06-24T20:23:18Z","timestamp":1687638198000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/37\/Supplement_1\/i187\/6319665"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,1]]},"references-count":28,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2021,8,4]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab313","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2021.02.01.429246","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,7,1]]},"published":{"date-parts":[[2021,7,1]]}}}