{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T00:36:20Z","timestamp":1774312580185,"version":"3.50.1"},"reference-count":28,"publisher":"Oxford University Press (OUP)","issue":"22-23","license":[{"start":{"date-parts":[[2020,12,21]],"date-time":"2020-12-21T00:00:00Z","timestamp":1608508800000},"content-version":"vor","delay-in-days":20,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100002261","name":"RFBR","doi-asserted-by":"publisher","award":["20-07-00652"],"award-info":[{"award-number":["20-07-00652"]}],"id":[{"id":"10.13039\/501100002261","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002261","name":"RFBR","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100002261","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001691","name":"JSPS","doi-asserted-by":"publisher","award":["20-51-50007"],"award-info":[{"award-number":["20-51-50007"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001665","name":"ANR","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100001665","id-type":"DOI","asserted-by":"publisher"}]},{"name":"ASTER","award":["ANR-16-CE23-0001"],"award-info":[{"award-number":["ANR-16-CE23-0001"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,4,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and\/or genomes. For big data, this is typically done via \u2018seeds\u2019: simple similarities (e.g. exact matches) that can be found quickly. 
For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>Here, we study a simple sparse-seeding method: using seeds at positions of certain \u2018words\u2019 (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed \u2018minimizer\u2019 sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Software to design and test minimally overlapping words is freely available at https:\/\/gitlab.com\/mcfrith\/noverlap.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa1054","type":"journal-article","created":{"date-parts":[[2020,12,9]],"date-time":"2020-12-09T00:04:36Z","timestamp":1607472276000},"page":"5344-5350","source":"Crossref","is-referenced-by-count":24,"title":["Minimally overlapping words for sequence similarity search"],"prefix":"10.1093","volume":"36","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0998-2859","authenticated-orcid":false,"given":"Martin C","family":"Frith","sequence":"first","affiliation":[{"name":"Artificial Intelligence Research Center , AIST, 
Tokyo, Japan"},{"name":"Graduate School of Frontier Sciences, University of Tokyo , Chiba, Japan"},{"name":"AIST-Waseda University CBBD-OIL , AIST, Tokyo, Japan"}]},{"given":"Laurent","family":"No\u00e9","sequence":"additional","affiliation":[{"name":"CRIStAL UMR9189, Universit\u00e9 de Lille, Villeneuve d\u2019Ascq , France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5899-5424","authenticated-orcid":false,"given":"Gregory","family":"Kucherov","sequence":"additional","affiliation":[{"name":"LIGM, CNRS, Universit\u00e9 Gustave Eiffel, Marne-la-Vall\u00e9e , France"},{"name":"Skolkovo Institute of Science and Technology , Moscow, Russia"}]}],"member":"286","published-online":{"date-parts":[[2020,12,21]]},"reference":[{"key":"2023062708411927500_btaa1054-B1","doi-asserted-by":"crossref","first-page":"e0189960","DOI":"10.1371\/journal.pone.0189960","article-title":"Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches","volume":"13","author":"Almutairy","year":"2018","journal-title":"PLoS One"},{"key":"2023062708411927500_btaa1054-B2","doi-asserted-by":"crossref","first-page":"4890","DOI":"10.1109\/TIT.2015.2456634","article-title":"Non-overlapping codes","volume":"61","author":"Blackburn","year":"2015","journal-title":"IEEE Trans. Inf. 
Theory"},{"key":"2023062708411927500_btaa1054-B3","first-page":"67","author":"Buhler","year":"2003"},{"key":"2023062708411927500_btaa1054-B4","first-page":"35","author":"Chikhi","year":"2014"},{"key":"2023062708411927500_btaa1054-B5","first-page":"373","author":"Cs\u0171r\u00f6s","year":"2004"},{"key":"2023062708411927500_btaa1054-B6","doi-asserted-by":"crossref","first-page":"1569","DOI":"10.1093\/bioinformatics\/btv022","article-title":"KMC 2: fast and resource-frugal k-mer counting","volume":"31","author":"Deorowicz","year":"2015","journal-title":"Bioinformatics"},{"key":"2023062708411927500_btaa1054-B7","doi-asserted-by":"crossref","first-page":"e59","DOI":"10.1093\/nar\/gku104","article-title":"Improved search heuristics find 20 000 new alignments between human and mouse genomes","volume":"42","author":"Frith","year":"2014","journal-title":"Nucleic Acids Res"},{"key":"2023062708411927500_btaa1054-B8","doi-asserted-by":"crossref","first-page":"e1005107","DOI":"10.1371\/journal.pcbi.1005107","article-title":"rasbhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison","volume":"12","author":"Hahn","year":"2016","journal-title":"PLoS Comput. 
Biol"},{"key":"2023062708411927500_btaa1054-B9","doi-asserted-by":"crossref","first-page":"2969","DOI":"10.1093\/bioinformatics\/btm422","article-title":"Multiple spaced seeds for homology search","volume":"23","author":"Ilie","year":"2007","journal-title":"Bioinformatics"},{"key":"2023062708411927500_btaa1054-B10","doi-asserted-by":"crossref","first-page":"i748","DOI":"10.1093\/bioinformatics\/bty597","article-title":"A fast adaptive algorithm for computing whole-genome homology maps","volume":"34","author":"Jain","year":"2018","journal-title":"Bioinformatics"},{"key":"2023062708411927500_btaa1054-B11","doi-asserted-by":"crossref","first-page":"487","DOI":"10.1101\/gr.113985.110","article-title":"Adaptive seeds tame genomic sequence comparison","volume":"21","author":"Kielbasa","year":"2011","journal-title":"Genome Res"},{"key":"2023062708411927500_btaa1054-B12","first-page":"569","article-title":"An improved branch and bound algorithm for the maximum clique problem","volume":"58","author":"Konc","year":"2007","journal-title":"MATCH Commun. Math. Comput. Chem"},{"key":"2023062708411927500_btaa1054-B13","doi-asserted-by":"crossref","first-page":"553","DOI":"10.1142\/S0219720006001977","article-title":"A unifying framework for seed sensitivity and its application to subset seeds","volume":"4","author":"Kucherov","year":"2006","journal-title":"J. Bioinform. Comput. 
Biol"},{"key":"2023062708411927500_btaa1054-B14","doi-asserted-by":"crossref","first-page":"3094","DOI":"10.1093\/bioinformatics\/bty191","article-title":"Minimap2: pairwise alignment for nucleotide sequences","volume":"34","author":"Li","year":"2018","journal-title":"Bioinformatics"},{"key":"2023062708411927500_btaa1054-B15","doi-asserted-by":"crossref","first-page":"169","DOI":"10.14778\/2535569.2448951","article-title":"Memory efficient minimum substring partitioning","volume":"6","author":"Li","year":"2013","journal-title":"Proceedings VLDB Endowment"},{"key":"2023062708411927500_btaa1054-B16","doi-asserted-by":"crossref","first-page":"440","DOI":"10.1093\/bioinformatics\/18.3.440","article-title":"PatternHunter: faster and more sensitive homology search","volume":"18","author":"Ma","year":"2002","journal-title":"Bioinformatics"},{"key":"2023062708411927500_btaa1054-B17","author":"Manber","year":"1994"},{"key":"2023062708411927500_btaa1054-B18","doi-asserted-by":"crossref","first-page":"i110","DOI":"10.1093\/bioinformatics\/btx235","article-title":"Improving the performance of minimizers and winnowing schemes","volume":"33","author":"Mar\u00e7ais","year":"2017","journal-title":"Bioinformatics"},{"key":"2023062708411927500_btaa1054-B19","doi-asserted-by":"crossref","first-page":"i13","DOI":"10.1093\/bioinformatics\/bty258","article-title":"Asymptotically optimal minimizers schemes","volume":"34","author":"Mar\u00e7ais","year":"2018","journal-title":"Bioinformatics"},{"key":"2023062708411927500_btaa1054-B20","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1186\/1471-2105-5-149","article-title":"Improved hit criteria for DNA local alignment","volume":"5","author":"No\u00e9","year":"2004","journal-title":"BMC Bioinformatics"},{"key":"2023062708411927500_btaa1054-B21","doi-asserted-by":"crossref","first-page":"e1005777","DOI":"10.1371\/journal.pcbi.1005777","article-title":"Designing small universal k-mer hitting sets for improved analysis of 
high-throughput sequencing","volume":"13","author":"Orenstein","year":"2017","journal-title":"PLoS Comput. Biol"},{"key":"2023062708411927500_btaa1054-B22","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2023062708411927500_btaa1054-B23","doi-asserted-by":"crossref","first-page":"483","DOI":"10.1109\/TCBB.2009.4","article-title":"On subset seeds for protein alignment","volume":"6","author":"Roytberg","year":"2009","journal-title":"IEEE\/ACM Trans. Comput. Biol. Bioinform"},{"key":"2023062708411927500_btaa1054-B24","first-page":"76","author":"Schleimer","year":"2003"},{"key":"2023062708411927500_btaa1054-B25","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-020-02023-1","article-title":"Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank","volume":"21","author":"Steinegger","year":"2020","journal-title":"Genome Biol"},{"key":"2023062708411927500_btaa1054-B26","doi-asserted-by":"crossref","first-page":"133","DOI":"10.1186\/1471-2105-7-133","article-title":"Choosing the best heuristic for seeded alignment of DNA sequences","volume":"7","author":"Sun","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2023062708411927500_btaa1054-B27","first-page":"678","article-title":"Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases","volume":"9","author":"Tamura","year":"1992","journal-title":"Mol. Biol. 
Evol"},{"key":"2023062708411927500_btaa1054-B28","doi-asserted-by":"crossref","first-page":"R46","DOI":"10.1186\/gb-2014-15-3-r46","article-title":"Kraken: ultrafast metagenomic sequence classification using exact alignments","volume":"15","author":"Wood","year":"2014","journal-title":"Genome Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btaa1054\/35480175\/btaa1054.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/22-23\/5344\/50716588\/btaa1054.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/22-23\/5344\/50716588\/btaa1054.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,27]],"date-time":"2023-06-27T04:42:13Z","timestamp":1687840933000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/36\/22-23\/5344\/6042707"}},"subtitle":[],"editor":[{"given":"Yann","family":"Ponty","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2020,12,1]]},"references-count":28,"journal-issue":{"issue":"22-23","published-print":{"date-parts":[[2021,4,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa1054","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.07.24.220616","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,12,1]]},"published":{"date-parts":[[2020,12,1]]}}}