{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T13:33:35Z","timestamp":1773840815634,"version":"3.50.1"},"reference-count":35,"publisher":"Oxford University Press (OUP)","issue":"11","license":[{"start":{"date-parts":[[2024,10,21]],"date-time":"2024-10-21T00:00:00Z","timestamp":1729468800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"US National Science Foundation","doi-asserted-by":"crossref","award":["DBI-1937540"],"award-info":[{"award-number":["DBI-1937540"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000001","name":"US National Science Foundation","doi-asserted-by":"crossref","award":["III-2232121"],"award-info":[{"award-number":["III-2232121"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,11,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Sequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: The canonical representation (k-nonical space).<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate for them. In particular, we show that large stretches of the genome (\u201csketching deserts\u201d) are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate for these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>The code used in this analysis is available under a permissive license at https:\/\/github.com\/Kingsford-Group\/mdsscope.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae629","type":"journal-article","created":{"date-parts":[[2024,10,17]],"date-time":"2024-10-17T15:25:50Z","timestamp":1729178750000},"source":"Crossref","is-referenced-by-count":8,"title":["<i>k<\/i>\n                    -nonical space: sketching with reverse complements"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5083-5925","authenticated-orcid":false,"given":"Guillaume","family":"Mar\u00e7ais","sequence":"first","affiliation":[{"name":"Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University , Pittsburgh, PA 15213,","place":["United States"]}]},{"given":"C S","family":"Elder","sequence":"additional","affiliation":[{"name":"Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University , Pittsburgh, PA 15213,","place":["United States"]}]},{"given":"Carl","family":"Kingsford","sequence":"additional","affiliation":[{"name":"Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University , Pittsburgh, PA 15213,","place":["United States"]}]}],"member":"286","published-online":{"date-parts":[[2024,10,21]]},"reference":[{"key":"2024110904305559300_btae629-B1","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-319-11080-6","volume-title":"Linear Algebra Done Right","author":"Axler","year":"2015"},{"key":"2024110904305559300_btae629-B2","doi-asserted-by":"publisher","first-page":"241","DOI":"10.1142\/S0218196704001700","article-title":"Unavoidable sets of constant length","volume":"14","author":"Champarnaud","year":"2004","journal-title":"Int J Algebra Comput"},{"key":"2024110904305559300_btae629-B3","first-page":"167","author":"DeBlasio","year":"2019"},{"key":"2024110904305559300_btae629-B4","doi-asserted-by":"publisher","first-page":"e1010638","DOI":"10.1371\/journal.pcbi.1010638","article-title":"Parameterized syncmer schemes improve long-read mapping","volume":"18","author":"Dutta","year":"2022","journal-title":"PLoS Comput Biol"},{"key":"2024110904305559300_btae629-B5","doi-asserted-by":"publisher","first-page":"e10805","DOI":"10.7717\/peerj.10805","article-title":"Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences","volume":"9","author":"Edgar","year":"2021","journal-title":"PeerJ"},{"key":"2024110904305559300_btae629-B6","doi-asserted-by":"publisher","first-page":"37","DOI":"10.1007\/978-3-030-45257-5\\3","volume-title":"Research in Computational Molecular Biology, Lecture Notes in Computer Science","author":"Ekim","year":"2020"},{"key":"2024110904305559300_btae629-B7","doi-asserted-by":"publisher","first-page":"958","DOI":"10.1016\/j.cels.2021.08.009","article-title":"Minimizer-space de Bruijn graphs: whole-genome assembly of long reads in minutes on a personal computer","volume":"12","author":"Ekim","year":"2021","journal-title":"Cell Syst"},{"key":"2024110904305559300_btae629-B8","doi-asserted-by":"publisher","first-page":"2","DOI":"10.1089\/cmb.2023.0212","article-title":"Density and conservation optimization of the generalized masked-minimizer sketching scheme","volume-title":"J Comput Biol","author":"Hoang","year":"2024"},{"key":"2024110904305559300_btae629-B9","doi-asserted-by":"publisher","first-page":"705","DOI":"10.1038\/s41592-022-01457-8","article-title":"Long-read mapping to repetitive reference sequences using Winnowmap2","volume":"19","author":"Jain","year":"2022","journal-title":"Nat Methods"},{"key":"2024110904305559300_btae629-B10","first-page":"14644","volume-title":"Complexity of Computer Computations","author":"Karp","year":"1972"},{"key":"2024110904305559300_btae629-B11","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btad512","article-title":"Minmers are a generalization of minimizers that enable unbiased local jaccard estimation","volume":"39","author":"Kille","year":"2023","journal-title":"Bioinformatics"},{"key":"2024110904305559300_btae629-B12","doi-asserted-by":"publisher","first-page":"2103","DOI":"10.1093\/bioinformatics\/btw152","article-title":"Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences","volume":"32","author":"Li","year":"2016","journal-title":"Bioinformatics"},{"key":"2024110904305559300_btae629-B13","doi-asserted-by":"publisher","first-page":"3094","DOI":"10.1093\/bioinformatics\/bty191","article-title":"Minimap2: Pairwise alignment for nucleotide sequences","volume":"34","author":"Li","year":"2018","journal-title":"Bioinformatics"},{"key":"2024110904305559300_btae629-B14","doi-asserted-by":"publisher","first-page":"i110","DOI":"10.1093\/bioinformatics\/btx235","article-title":"Improving the performance of minimizers and winnowing schemes","volume":"33","author":"Mar\u00e7ais","year":"2017","journal-title":"Bioinformatics"},{"key":"2024110904305559300_btae629-B15","doi-asserted-by":"publisher","first-page":"I 13","DOI":"10.1093\/bioinformatics\/bty258","article-title":"Asymptotically optimal minimizers schemes","volume":"34","author":"Mar\u00e7ais","year":"2018","journal-title":"Bioinformatics"},{"key":"2024110904305559300_btae629-B16","doi-asserted-by":"publisher","first-page":"597","DOI":"10.1089\/cmb.2024.0544","article-title":"Sketching methods with small window guarantee using minimum decycling sets","volume":"31","author":"Mar\u00e7ais","year":"2024","journal-title":"J Comput Biol"},{"key":"2024110904305559300_btae629-B17","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1146\/annurev-biodatasci-072018-021156","article-title":"Sketching and sublinear data structures in genomics","volume":"2","author":"Mar\u00e7ais","year":"2019","journal-title":"Annu Rev Biomed Data Sci"},{"key":"2024110904305559300_btae629-B18","doi-asserted-by":"publisher","first-page":"421","DOI":"10.1016\/0166-218X(92)90149-5","article-title":"Asymptotically-tight bounds on the number of cycles in generalized de Bruijn-Good graphs","volume":"37-38","author":"Maurer","year":"1992","journal-title":"Discrete Applied Mathematics"},{"key":"2024110904305559300_btae629-B19","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1016\/0095-8956(72)90006-8","article-title":"A proof of golomb\u2019s conjecture for the de bruijn graph","volume":"13","author":"Mykkeltveit","year":"1972","journal-title":"Journal of Combinatorial Theory, Series B"},{"key":"2024110904305559300_btae629-B20","first-page":"257","volume-title":"Algorithms in bioinformatics, lecture notes in computer science","author":"Orenstein","year":"2016"},{"key":"2024110904305559300_btae629-B21","doi-asserted-by":"publisher","first-page":"E 1005777","DOI":"10.1371\/journal.pcbi.1005777","article-title":"Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing","volume":"13","author":"Orenstein","year":"2017","journal-title":"PLoS Comput Biol"},{"key":"2024110904305559300_btae629-B22","doi-asserted-by":"publisher","first-page":"1154","DOI":"10.1101\/gr.277644.123","article-title":"Efficient minimizer orders for large values of k using minimum decycling sets","volume":"33","author":"Pellow","year":"2023","journal-title":"Genome Res"},{"key":"2024110904305559300_btae629-B23","doi-asserted-by":"publisher","author":"Pissis","year":"2024","DOI":"10.4230\/LIPIcs.WABI.2024"},{"key":"2024110904305559300_btae629-B24","doi-asserted-by":"publisher","first-page":"1474","DOI":"10.1038\/s41587-023-01662-6","article-title":"Telomere-to-telomere assembly of diploid chromosomes with verkko","volume":"41","author":"Rautiainen","year":"2023","journal-title":"Nat Biotechnol"},{"key":"2024110904305559300_btae629-B25","doi-asserted-by":"publisher","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2024110904305559300_btae629-B26","doi-asserted-by":"publisher","first-page":"734","DOI":"10.1089\/cmb.2004.11.734","article-title":"A preprocessor for shotgun assembly of large genomes","volume":"11","author":"Roberts","year":"2004","journal-title":"J Comput Biol"},{"key":"2024110904305559300_btae629-B27","author":"Rouz\u00e9","year":"2023"},{"key":"2024110904305559300_btae629-B28","first-page":"76","author":"Schleimer","year":"2003"},{"key":"2024110904305559300_btae629-B29","doi-asserted-by":"publisher","first-page":"4659","DOI":"10.1093\/bioinformatics\/btab790","article-title":"Theory of local k-mer selection with applications to long-read alignment","volume":"38","author":"Shaw","year":"2022","journal-title":"Bioinformatics"},{"key":"2024110904305559300_btae629-B30","doi-asserted-by":"publisher","first-page":"1661","DOI":"10.1038\/s41592-023-02018-3","article-title":"Fast and robust metagenomic sequence comparison through sparse chaining with skani","volume":"20","author":"Shaw","year":"2023","journal-title":"Nat Methods"},{"key":"2024110904305559300_btae629-B31","doi-asserted-by":"publisher","DOI":"10.24072\/pcjournal.323","article-title":"General encoding of canonical k-mers","volume":"3","author":"Wittler","year":"2023","journal-title":"Peer Community Journal"},{"key":"2024110904305559300_btae629-B32","doi-asserted-by":"publisher","first-page":"r46","DOI":"10.1186\/gb-2014-15-3-r46","article-title":"Kraken: Ultrafast metagenomic sequence classification using exact alignments","volume":"15","author":"Wood","year":"2014","journal-title":"Genome Biol"},{"key":"2024110904305559300_btae629-B33","doi-asserted-by":"publisher","first-page":"i119","DOI":"10.1093\/bioinformatics\/btaa472","article-title":"Improved design and analysis of practical minimizers","volume":"36","author":"Zheng","year":"2020","journal-title":"Bioinformatics"},{"key":"2024110904305559300_btae629-B34","doi-asserted-by":"publisher","first-page":"i187","DOI":"10.1093\/bioinformatics\/btab313","article-title":"Sequence-specific minimizers via polar sets","volume":"37","author":"Zheng","year":"2021","journal-title":"Bioinformatics"},{"key":"2024110904305559300_btae629-B35","doi-asserted-by":"publisher","first-page":"1251","DOI":"10.1089\/cmb.2023.0094","article-title":"Creating and using minimizer sketches in computational genomics","volume":"30","author":"Zheng","year":"2023","journal-title":"J Comput Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae629\/59933652\/btae629.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/11\/btae629\/60549099\/btae629.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/11\/btae629\/60549099\/btae629.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T23:31:14Z","timestamp":1731108674000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae629\/7829143"}},"subtitle":[],"editor":[{"given":"Macha","family":"Nikolski","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,10,21]]},"references-count":35,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,11,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae629","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2024.01.25.577301","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,11]]},"published":{"date-parts":[[2024,10,21]]},"article-number":"btae629"}}