{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T03:46:12Z","timestamp":1775274372784,"version":"3.50.1"},"reference-count":38,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","funder":[{"DOI":"10.13039\/100000051","name":"National Human Genome Research Institute","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000051","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they can be guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long read mapper, Minimap2. Our results demonstrate a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Winnowmap is built on top of the Minimap2 codebase and is available at https:\/\/github.com\/marbl\/winnowmap.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa435","type":"journal-article","created":{"date-parts":[[2020,5,4]],"date-time":"2020-05-04T08:13:01Z","timestamp":1588579981000},"page":"i111-i118","source":"Crossref","is-referenced-by-count":190,"title":["Weighted minimizer sampling improves long read mapping"],"prefix":"10.1093","volume":"36","author":[{"given":"Chirag","family":"Jain","sequence":"first","affiliation":[{"name":"National Human Genome Research Institute, National Institutes of Health , Bethesda, MD 20892, USA"}]},{"given":"Arang","family":"Rhie","sequence":"additional","affiliation":[{"name":"National Human Genome Research Institute, National Institutes of Health , Bethesda, MD 20892, USA"}]},{"given":"Haowen","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Computing, Georgia Institute of Technology , Atlanta, GA 30332, USA"}]},{"given":"Claudia","family":"Chu","sequence":"additional","affiliation":[{"name":"College of Computing, Georgia Institute of Technology , Atlanta, GA 30332, USA"}]},{"given":"Brian P","family":"Walenz","sequence":"additional","affiliation":[{"name":"National Human Genome Research Institute, National Institutes of Health , Bethesda, MD 20892, USA"}]},{"given":"Sergey","family":"Koren","sequence":"additional","affiliation":[{"name":"National Human Genome Research Institute, National Institutes of Health , Bethesda, MD 20892, USA"}]},{"given":"Adam M","family":"Phillippy","sequence":"additional","affiliation":[{"name":"National Human Genome Research Institute, National Institutes of Health , Bethesda, MD 20892, USA"}]}],"member":"286","published-online":{"date-parts":[[2020,7,13]]},"reference":[{"key":"2024021913324726800_btaa435-B1","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res"},{"key":"2024021913324726800_btaa435-B38"},{"key":"2024021913324726800_btaa435-B2","doi-asserted-by":"crossref","first-page":"623","DOI":"10.1038\/nbt.3238","article-title":"Assembling large genomes with single-molecule sequencing and locality-sensitive hashing","volume":"33","author":"Berlin","year":"2015","journal-title":"Nat. Biotechnol"},{"key":"2024021913324726800_btaa435-B3","first-page":"21","author":"Broder","year":"1997"},{"key":"2024021913324726800_btaa435-B4","doi-asserted-by":"crossref","first-page":"336","DOI":"10.1089\/cmb.2014.0160","article-title":"On the representation of de Bruijn graphs","volume":"22","author":"Chikhi","year":"2015","journal-title":"J. Comput. Biol"},{"key":"2024021913324726800_btaa435-B5","author":"Chin","year":"2019"},{"key":"2024021913324726800_btaa435-B6","first-page":"812","article-title":"Near duplicate image detection: min-Hash and tf-idf weighting","volume":"810","author":"Chum","year":"2008","journal-title":"BMVC"},{"key":"2024021913324726800_btaa435-B7","first-page":"167","author":"DeBlasio","year":"2019"},{"key":"2024021913324726800_btaa435-B8","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41467-019-10934-2","article-title":"Strain-level metagenomic assignment and compositional estimation for long reads with metamaps","volume":"10","author":"Dilthey","year":"2019","journal-title":"Nat. Commun"},{"key":"2024021913324726800_btaa435-B9","doi-asserted-by":"crossref","first-page":"e28819","DOI":"10.1371\/journal.pone.0028819","article-title":"Gentle masking of low-complexity sequences improves homology search","volume":"6","author":"Frith","year":"2011","journal-title":"PLoS One"},{"key":"2024021913324726800_btaa435-B10","doi-asserted-by":"crossref","first-page":"766","DOI":"10.1089\/cmb.2018.0036","article-title":"A fast approximate algorithm for mapping long reads to large reference databases","volume":"25","author":"Jain","year":"2018","journal-title":"J. Comput. Biol"},{"key":"2024021913324726800_btaa435-B11","doi-asserted-by":"crossref","first-page":"722","DOI":"10.1101\/gr.215087.116","article-title":"Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation","volume":"27","author":"Koren","year":"2017","journal-title":"Genome Res"},{"key":"2024021913324726800_btaa435-B12","author":"Kundu","year":"2019"},{"key":"2024021913324726800_btaa435-B13","doi-asserted-by":"crossref","first-page":"R12","DOI":"10.1186\/gb-2004-5-2-r12","article-title":"Versatile and open software for comparing large genomes","volume":"5","author":"Kurtz","year":"2004","journal-title":"Genome Biol"},{"key":"2024021913324726800_btaa435-B14","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/nmeth.1923","article-title":"Fast gapped-read alignment with bowtie 2","volume":"9","author":"Langmead","year":"2012","journal-title":"Nat. Methods"},{"key":"2024021913324726800_btaa435-B15","doi-asserted-by":"crossref","first-page":"2103","DOI":"10.1093\/bioinformatics\/btw152","article-title":"Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences","volume":"32","author":"Li","year":"2016","journal-title":"Bioinformatics"},{"key":"2024021913324726800_btaa435-B16","author":"Li","year":"2018"},{"key":"2024021913324726800_btaa435-B17","doi-asserted-by":"crossref","first-page":"3094","DOI":"10.1093\/bioinformatics\/bty191","article-title":"Minimap2: pairwise alignment for nucleotide sequences","volume":"34","author":"Li","year":"2018","journal-title":"Bioinformatics"},{"key":"2024021913324726800_btaa435-B18","doi-asserted-by":"crossref","first-page":"i110","DOI":"10.1093\/bioinformatics\/btx235","article-title":"Improving the performance of minimizers and winnowing schemes","volume":"33","author":"Mar\u00e7ais","year":"2017","journal-title":"Bioinformatics"},{"key":"2024021913324726800_btaa435-B19","doi-asserted-by":"crossref","first-page":"i13","DOI":"10.1093\/bioinformatics\/bty258","article-title":"Asymptotically optimal minimizers schemes","volume":"34","author":"Mar\u00e7ais","year":"2018","journal-title":"Bioinformatics"},{"key":"2024021913324726800_btaa435-B20","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1146\/annurev-biodatasci-072018-021156","article-title":"Sketching and sublinear data structures in genomics","volume":"2","author":"Mar\u00e7ais","year":"2019","journal-title":"Annu. Rev. Biomed. Data Sci"},{"key":"2024021913324726800_btaa435-B21","first-page":"735928","author":"Miga","year":"2019"},{"key":"2024021913324726800_btaa435-B22","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1186\/s13059-016-0997-x","article-title":"Mash: fast genome and metagenome distance estimation using minhash","volume":"17","author":"Ondov","year":"2016","journal-title":"Genome Biol"},{"key":"2024021913324726800_btaa435-B23","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1093\/bioinformatics\/bts649","article-title":"PBSIM: PacBio reads simulator-toward accurate genome assembly","volume":"29","author":"Ono","year":"2013","journal-title":"Bioinformatics"},{"key":"2024021913324726800_btaa435-B24","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1007\/978-3-319-43681-4_21","volume-title":"International Workshop on Algorithms in Bioinformatics","author":"Orenstein","year":"2016"},{"key":"2024021913324726800_btaa435-B25","doi-asserted-by":"crossref","first-page":"15311","DOI":"10.1038\/ncomms15311","article-title":"A hybrid cloud read aligner based on minhash and kmer voting that preserves privacy","volume":"8","author":"Popic","year":"2017","journal-title":"Nat. Commun"},{"key":"2024021913324726800_btaa435-B26","author":"Rhie","year":"2020"},{"key":"2024021913324726800_btaa435-B27","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2024021913324726800_btaa435-B28","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1186\/s13059-019-1809-x","article-title":"When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data","volume":"20","author":"Rowe","year":"2019","journal-title":"Genome Biol"},{"key":"2024021913324726800_btaa435-B29","first-page":"472","author":"Sahlin","year":"2020"},{"key":"2024021913324726800_btaa435-B30","author":"Sahlin","year":"2020"},{"key":"2024021913324726800_btaa435-B31","first-page":"76","author":"Schleimer","year":"2003"},{"key":"2024021913324726800_btaa435-B32","doi-asserted-by":"crossref","first-page":"849","DOI":"10.1101\/gr.213611.116","article-title":"Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly","volume":"27","author":"Schneider","year":"2017","journal-title":"Genome Res"},{"key":"2024021913324726800_btaa435-B33","author":"Shafin"},{"key":"2024021913324726800_btaa435-B34","author":"Smith","year":"2011"},{"key":"2024021913324726800_btaa435-B35","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","article-title":"Identification of common molecular subsequences","volume":"147","author":"Smith","year":"1981","journal-title":"J. Mol. Biol"},{"key":"2024021913324726800_btaa435-B36","author":"Xin","year":"2018"},{"key":"2024021913324726800_btaa435-B37","doi-asserted-by":"crossref","first-page":"130","DOI":"10.1016\/j.cels.2015.08.004","article-title":"Entropy-scaling search of massive biological data","volume":"1","author":"Yu","year":"2015","journal-title":"Cell Syst"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/Supplement_1\/i111\/56702288\/bioinformatics_36_supplement1_i111.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/Supplement_1\/i111\/56702288\/bioinformatics_36_supplement1_i111.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,19]],"date-time":"2024-02-19T08:38:33Z","timestamp":1708331913000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/36\/Supplement_1\/i111\/5870473"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,1]]},"references-count":38,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2020,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa435","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.02.11.943241","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,7]]},"published":{"date-parts":[[2020,7,1]]}}}