{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T22:05:37Z","timestamp":1774649137895,"version":"3.50.1"},"reference-count":36,"publisher":"Oxford University Press (OUP)","issue":"2","license":[{"start":{"date-parts":[[2023,1,25]],"date-time":"2023-01-25T00:00:00Z","timestamp":1674604800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100002241","name":"Japan Science and Technology Agency","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100002241","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000092","name":"National Library of Medicine","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000092","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,2,3]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Source code is freely available at https:\/\/gitlab.com\/mcfrith\/noverlap.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btad057","type":"journal-article","created":{"date-parts":[[2023,1,26]],"date-time":"2023-01-26T19:51:29Z","timestamp":1674762689000},"source":"Crossref","is-referenced-by-count":7,"title":["How to optimally sample a sequence for rapid analysis"],"prefix":"10.1093","volume":"39","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0998-2859","authenticated-orcid":false,"given":"Martin C","family":"Frith","sequence":"first","affiliation":[{"name":"Artificial Intelligence Research Center, AIST , Tokyo 135-0064, Japan"},{"name":"Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo , Chiba 277-8568, Japan"},{"name":"Computational Bio Big-Data Open Innovation Laboratory, AIST , Tokyo 169-8555, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6990-7829","authenticated-orcid":false,"given":"Jim","family":"Shaw","sequence":"additional","affiliation":[{"name":"Department of Mathematics, University of Toronto , Toronto, ON M5S 2E4, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6207-1419","authenticated-orcid":false,"given":"John L","family":"Spouge","sequence":"additional","affiliation":[{"name":"National Library of Medicine, National Institutes of Health , Bethesda, MD 20894, USA"}]}],"member":"286","published-online":{"date-parts":[[2023,1,25]]},"reference":[{"key":"2023020815020173600_btad057-B1","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J. Mol. Biol"},{"key":"2023020815020173600_btad057-B2","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res"},{"key":"2023020815020173600_btad057-B3","doi-asserted-by":"crossref","first-page":"282","DOI":"10.1007\/978-3-540-89097-3_27","volume-title":"International Symposium on String Processing and Information Retrieval, Melbourne, Australia","author":"Benson","year":"2008"},{"key":"2023020815020173600_btad057-B4","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.mbs.2007.10.001","article-title":"Solvable models of neighbor-dependent substitution processes","volume":"211","author":"B\u00e9rard","year":"2008","journal-title":"Math. Biosci"},{"key":"2023020815020173600_btad057-B5","doi-asserted-by":"crossref","first-page":"426","DOI":"10.1093\/bib\/bbx067","article-title":"Alignment-free inference of hierarchical and reticulate phylogenomic relationships","volume":"20","author":"Bernard","year":"2019","journal-title":"Brief. Bioinform"},{"key":"2023020815020173600_btad057-B6","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1007\/BF02173653","article-title":"Chemical specificity of nucleic acids and mechanism of their enzymatic degradation","volume":"6","author":"Chargaff","year":"1950","journal-title":"Experientia"},{"key":"2023020815020173600_btad057-B7","first-page":"35","volume-title":"International Conference on Research in Computational Molecular Biology, Pittsburgh, PA, USA","author":"Chikhi","year":"2014"},{"key":"2023020815020173600_btad057-B8","doi-asserted-by":"crossref","first-page":"373","DOI":"10.1007\/978-3-540-27801-6_28","volume-title":"Annual Symposium on Combinatorial Pattern Matching, Istanbul, Turkey","author":"Cs\u0171r\u00f6s","year":"2004"},{"key":"2023020815020173600_btad057-B9","doi-asserted-by":"crossref","first-page":"e1010638","DOI":"10.1371\/journal.pcbi.1010638","article-title":"Parameterized syncmer schemes improve long-read mapping","volume":"18","author":"Dutta","year":"2022","journal-title":"PLoS Comput. Biol"},{"key":"2023020815020173600_btad057-B10","doi-asserted-by":"crossref","first-page":"e10805","DOI":"10.7717\/peerj.10805","article-title":"Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences","volume":"9","author":"Edgar","year":"2021","journal-title":"PeerJ"},{"key":"2023020815020173600_btad057-B11","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-015-0670-9","article-title":"Split-alignment of genomes finds orthologies more accurately","volume":"16","author":"Frith","year":"2015","journal-title":"Genome Biol"},{"key":"2023020815020173600_btad057-B12","doi-asserted-by":"crossref","first-page":"5344","DOI":"10.1093\/bioinformatics\/btaa1054","article-title":"Minimally overlapping words for sequence similarity search","volume":"36","author":"Frith","year":"2020","journal-title":"Bioinformatics"},{"key":"2023020815020173600_btad057-B13","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41467-017-01982-7","article-title":"Centromere evolution and CpG methylation during vertebrate speciation","volume":"8","author":"Ichikawa","year":"2017","journal-title":"Nat. Commun"},{"key":"2023020815020173600_btad057-B14","first-page":"164","article-title":"PatternHunter II: highly sensitive and fast homology search","volume":"14","author":"Li","year":"2003","journal-title":"Genome Inform"},{"key":"2023020815020173600_btad057-B15","doi-asserted-by":"crossref","first-page":"3303","DOI":"10.1093\/bioinformatics\/btz068","article-title":"Rapid alignment-free phylogenetic identification of metagenomic sequences","volume":"35","author":"Linard","year":"2019","journal-title":"Bioinformatics"},{"key":"2023020815020173600_btad057-B16","doi-asserted-by":"crossref","first-page":"1039","DOI":"10.1101\/gr.214973.116","article-title":"Short template switch events explain mutation clusters in the human genome","volume":"27","author":"L\u00f6ytynoja","year":"2017","journal-title":"Genome Res"},{"key":"2023020815020173600_btad057-B17","doi-asserted-by":"crossref","first-page":"440","DOI":"10.1093\/bioinformatics\/18.3.440","article-title":"PatternHunter: faster and more sensitive homology search","volume":"18","author":"Ma","year":"2002","journal-title":"Bioinformatics"},{"key":"2023020815020173600_btad057-B18","first-page":"1","author":"Manber","year":"1994"},{"key":"2023020815020173600_btad057-B19","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1007\/978-1-0716-1036-7_8","volume-title":"Multiple Sequence Alignment. Methods in Molecular Biology","author":"Morgenstern","year":"2021"},{"key":"2023020815020173600_btad057-B20","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1186\/s13015-015-0032-x","article-title":"Estimating evolutionary distances between genomic sequences from spaced-word matches","volume":"10","author":"Morgenstern","year":"2015","journal-title":"Algorithms Mol. Biol"},{"key":"2023020815020173600_btad057-B21","doi-asserted-by":"crossref","first-page":"52","DOI":"10.1007\/978-3-662-44753-6_5","volume-title":"International Workshop on Algorithms in Bioinformatics, Wroclaw, Poland","author":"Myers","year":"2014"},{"key":"2023020815020173600_btad057-B22","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1186\/1471-2105-5-149","article-title":"Improved hit criteria for DNA local alignment","volume":"5","author":"No\u00e9","year":"2004","journal-title":"BMC Bioinformatics"},{"key":"2023020815020173600_btad057-B23","doi-asserted-by":"crossref","first-page":"947","DOI":"10.1089\/cmb.2014.0173","article-title":"A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances","volume":"21","author":"No\u00e9","year":"2014","journal-title":"J. Comput. Biol"},{"key":"2023020815020173600_btad057-B24","doi-asserted-by":"crossref","first-page":"2563","DOI":"10.1093\/bioinformatics\/btab156","article-title":"Compact and evenly distributed k-mer binning for genomic sequences","volume":"37","author":"Nystr\u00f6m-Persson","year":"2021","journal-title":"Bioinformatics"},{"key":"2023020815020173600_btad057-B25","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1007\/978-3-319-43681-4_21","volume-title":"International Workshop on Algorithms in Bioinformatics, Aarhus, Denmark","author":"Orenstein","year":"2016"},{"key":"2023020815020173600_btad057-B26","doi-asserted-by":"crossref","first-page":"3823","DOI":"10.1093\/bioinformatics\/btw542","article-title":"Higher classification sensitivity of short metagenomic reads with CLARK-S","volume":"32","author":"Ounit","year":"2016","journal-title":"Bioinformatics"},{"key":"2023020815020173600_btad057-B27","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2023020815020173600_btad057-B28","doi-asserted-by":"crossref","first-page":"2080","DOI":"10.1101\/gr.275648.121","article-title":"Effective sequence similarity detection with strobemers","volume":"31","author":"Sahlin","year":"2021","journal-title":"Genome Res"},{"key":"2023020815020173600_btad057-B29","first-page":"76","author":"Schleimer","year":"2003"},{"key":"2023020815020173600_btad057-B30","doi-asserted-by":"crossref","first-page":"4659","DOI":"10.1093\/bioinformatics\/btab790","article-title":"Theory of local k-mer selection with applications to long-read alignment","volume":"38","author":"Shaw","year":"2022","journal-title":"Bioinformatics"},{"key":"2023020815020173600_btad057-B31","author":"Shaw","year":"2022"},{"key":"2023020815020173600_btad057-B32","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1093\/bioinformatics\/btg005","article-title":"Alignment-free sequence comparison\u2014a review","volume":"19","author":"Vinga","year":"2003","journal-title":"Bioinformatics"},{"key":"2023020815020173600_btad057-B33","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-019-1891-0","article-title":"Improved metagenomic analysis with kraken 2","volume":"20","author":"Wood","year":"2019","journal-title":"Genome Biol"},{"key":"2023020815020173600_btad057-B34","doi-asserted-by":"crossref","first-page":"e75","DOI":"10.1093\/nar\/gkt003","article-title":"Co-phylog: an assembly-free phylogenomic approach for closely related organisms","volume":"41","author":"Yi","year":"2013","journal-title":"Nucleic Acids Res"},{"key":"2023020815020173600_btad057-B35","doi-asserted-by":"crossref","first-page":"i187","DOI":"10.1093\/bioinformatics\/btab313","article-title":"Sequence-specific minimizers via polar sets","volume":"37","author":"Zheng","year":"2021","journal-title":"Bioinformatics"},{"key":"2023020815020173600_btad057-B36","doi-asserted-by":"crossref","first-page":"19359","DOI":"10.1073\/pnas.1921719117","article-title":"DNA methylation enables transposable element-driven genome expansion","volume":"117","author":"Zhou","year":"2020","journal-title":"Proc. Natl. Acad. Sci. USA"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btad057\/48907444\/btad057.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/2\/btad057\/49124149\/btad057.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/2\/btad057\/49124149\/btad057.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,8]],"date-time":"2023-02-08T10:19:49Z","timestamp":1675851589000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btad057\/7005197"}},"subtitle":[],"editor":[{"given":"Janet","family":"Kelso","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2023,1,25]]},"references-count":36,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,2,3]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btad057","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2022.08.18.504476","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023,2,1]]},"published":{"date-parts":[[2023,1,25]]},"article-number":"btad057"}}