{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T07:28:04Z","timestamp":1772609284838,"version":"3.50.1"},"reference-count":37,"publisher":"Oxford University Press (OUP)","issue":"11","license":[{"start":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T00:00:00Z","timestamp":1698192000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,11,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how \u201clexicographically similar\u201d the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision\u2013recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm, and circumventing the seemingly fundamental O(n2) scaling associated with pairwise similarity search.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>LexicHash is available on GitHub at https:\/\/github.com\/gcgreenberg\/LexicHash.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btad652","type":"journal-article","created":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T19:02:09Z","timestamp":1698260529000},"source":"Crossref","is-referenced-by-count":8,"title":["LexicHash: sequence similarity estimation via lexicographic comparison of hashes"],"prefix":"10.1093","volume":"39","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0378-0471","authenticated-orcid":false,"given":"Grant","family":"Greenberg","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign , Urbana, IL, United States"}]},{"given":"Aditya Narayan","family":"Ravi","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign , Urbana, IL, United States"}]},{"given":"Ilan","family":"Shomorony","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign , Urbana, IL, United States"}]}],"member":"286","published-online":{"date-parts":[[2023,10,25]]},"reference":[{"key":"2023110707091319100_btad652-B1","doi-asserted-by":"crossref","first-page":"100081","DOI":"10.1016\/j.patter.2020.100081","article-title":"Spectral jaccard similarity: a new approach to estimating pairwise sequence alignments","volume":"1","author":"Baharav","year":"2020","journal-title":"Patterns (N Y)"},{"key":"2023110707091319100_btad652-B2","doi-asserted-by":"crossref","first-page":"623","DOI":"10.1038\/nbt.3238","article-title":"Assembling large genomes with single-molecule sequencing and locality-sensitive hashing","volume":"33","author":"Berlin","year":"2015","journal-title":"Nat Biotechnol"},{"key":"2023110707091319100_btad652-B3","first-page":"21","author":"Broder","year":"1997"},{"key":"2023110707091319100_btad652-B4","doi-asserted-by":"crossref","first-page":"27","DOI":"10.21105\/joss.00027","article-title":"sourmash: a library for minhash sketching of DNA","volume":"1","author":"Brown","year":"2016","journal-title":"JOSS"},{"key":"2023110707091319100_btad652-B5","doi-asserted-by":"crossref","first-page":"238","DOI":"10.1186\/1471-2105-13-238","article-title":"Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory","volume":"13","author":"Chaisson","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2023110707091319100_btad652-B6","author":"Chin","year":"2019"},{"key":"2023110707091319100_btad652-B7","first-page":"233","author":"Davis"},{"key":"2023110707091319100_btad652-B8","first-page":"167","author":"DeBlasio","year":"2019"},{"key":"2023110707091319100_btad652-B9","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1010638","article-title":"Parameterized syncmer schemes improve long-read mapping","volume-title":"PLOS Computational Biology","author":"Dutta"},{"key":"2023110707091319100_btad652-B10","doi-asserted-by":"crossref","first-page":"e10805","DOI":"10.7717\/peerj.10805","article-title":"Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences","volume":"9","author":"Edgar","year":"2021","journal-title":"PeerJ"},{"key":"2023110707091319100_btad652-B11","doi-asserted-by":"crossref","first-page":"958","DOI":"10.1016\/j.cels.2021.08.009","article-title":"Minimizer-space de bruijn graphs: whole-genome assembly of long reads in minutes on a personal computer","volume":"12","author":"Ekim","year":"2021","journal-title":"Cell Syst"},{"key":"2023110707091319100_btad652-B12","doi-asserted-by":"crossref","first-page":"lqad004","DOI":"10.1093\/nargab\/lqad004","article-title":"Blend: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis","volume":"5","author":"Firtina","year":"2023","journal-title":"NAR Genom Bioinform"},{"key":"2023110707091319100_btad652-B13","author":"Irber","year":"2022"},{"key":"2023110707091319100_btad652-B14","doi-asserted-by":"crossref","first-page":"766","DOI":"10.1089\/cmb.2018.0036","article-title":"A fast approximate algorithm for mapping long reads to large reference databases","volume":"25","author":"Jain","year":"2018","journal-title":"J Comput Biol"},{"key":"2023110707091319100_btad652-B15","doi-asserted-by":"crossref","first-page":"i111","DOI":"10.1093\/bioinformatics\/btaa435","article-title":"Weighted minimizer sampling improves long read mapping","volume":"36","author":"Jain","year":"2020","journal-title":"Bioinformatics"},{"key":"2023110707091319100_btad652-B16","author":"Joudaki","year":"2020"},{"key":"2023110707091319100_btad652-B17","doi-asserted-by":"crossref","first-page":"722","DOI":"10.1101\/gr.215087.116","article-title":"Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation","volume":"27","author":"Koren","year":"2017","journal-title":"Genome Res"},{"key":"2023110707091319100_btad652-B18","doi-asserted-by":"crossref","first-page":"2103","DOI":"10.1093\/bioinformatics\/btw152","article-title":"Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences","volume":"32","author":"Li","year":"2016","journal-title":"Bioinformatics"},{"key":"2023110707091319100_btad652-B19","doi-asserted-by":"crossref","first-page":"3094","DOI":"10.1093\/bioinformatics\/bty191","article-title":"Minimap2: pairwise alignment for nucleotide sequences","volume":"34","author":"Li","year":"2018","journal-title":"Bioinformatics"},{"key":"2023110707091319100_btad652-B20","volume-title":"Genome Research","author":"Maier"},{"key":"2023110707091319100_btad652-B21","doi-asserted-by":"crossref","first-page":"i13","DOI":"10.1093\/bioinformatics\/bty258","article-title":"Asymptotically optimal minimizers schemes","volume":"34","author":"Mar\u00e7ais","year":"2018","journal-title":"Bioinformatics"},{"key":"2023110707091319100_btad652-B22","first-page":"i127","volume-title":"Bioinformatics","author":"Mar\u00e7ais"},{"key":"2023110707091319100_btad652-B23","volume-title":"International Workshop on Algorithms in Bioinformatics"},{"key":"2023110707091319100_btad652-B24","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1186\/s13059-016-0997-x","article-title":"Mash: fast genome and metagenome distance estimation using minhash","volume":"17","author":"Ondov","year":"2016","journal-title":"Genome Biol"},{"key":"2023110707091319100_btad652-B25","doi-asserted-by":"crossref","first-page":"e1005777","DOI":"10.1371\/journal.pcbi.1005777","article-title":"Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing","volume":"13","author":"Orenstein","year":"2017","journal-title":"PLoS Comput Biol"},{"key":"2023110707091319100_btad652-B26","author":"Popic","year":"2016"},{"key":"2023110707091319100_btad652-B27","author":"Public Health England, Pacific Biosciences, and Wellcome Sanger Institute","year":"2014"},{"key":"2023110707091319100_btad652-B28","doi-asserted-by":"crossref","first-page":"3363","DOI":"10.1093\/bioinformatics\/bth408","article-title":"Reducing storage requirements for biological sequence comparison","volume":"20","author":"Roberts","year":"2004","journal-title":"Bioinformatics"},{"key":"2023110707091319100_btad652-B29","doi-asserted-by":"crossref","first-page":"2080","DOI":"10.1101\/gr.275648.121","article-title":"Effective sequence similarity detection with strobemers","volume":"31","author":"Sahlin","year":"2021","journal-title":"Genome Res"},{"key":"2023110707091319100_btad652-B30","author":"Sahlin","year":"2021"},{"key":"2023110707091319100_btad652-B31","first-page":"76","author":"Schleimer","year":"2003"},{"key":"2023110707091319100_btad652-B32","first-page":"4659","article-title":"Theory of local k-mer selection with applications to long-read alignment","author":"Shaw","year":"2021","journal-title":"Bioinformatics"},{"key":"2023110707091319100_btad652-B33","first-page":"3308","author":"Shomorony","year":"2021"},{"key":"2023110707091319100_btad652-B34","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","article-title":"Identification of common molecular subsequences","volume":"147","author":"Smith","year":"1981","journal-title":"J Mol Biol"},{"key":"2023110707091319100_btad652-B35","doi-asserted-by":"crossref","first-page":"jkab083","DOI":"10.1093\/g3journal\/jkab083","article-title":"Comparison of long-read sequencing technologies in interrogating bacteria and fly genomes","volume":"11","author":"Tvedte","year":"2021","journal-title":"G3"},{"key":"2023110707091319100_btad652-B36","doi-asserted-by":"crossref","first-page":"339","DOI":"10.1093\/dnares\/dsw022","article-title":"Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (&gt;11 kb), single molecule, real-time sequencing","volume":"23","author":"Vembar","year":"2016","journal-title":"DNA Res"},{"key":"2023110707091319100_btad652-B37","doi-asserted-by":"crossref","first-page":"i119","DOI":"10.1093\/bioinformatics\/btaa472","article-title":"Improved design and analysis of practical minimizers","volume":"36","author":"Zheng","year":"2020","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btad652\/52543854\/btad652.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/11\/btad652\/52771922\/btad652.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/11\/btad652\/52771922\/btad652.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,7]],"date-time":"2023-11-07T07:09:42Z","timestamp":1699340982000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btad652\/7329717"}},"subtitle":[],"editor":[{"given":"Can","family":"Alkan","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2023,10,25]]},"references-count":37,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2023,11,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btad652","relation":{},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023,11,1]]},"published":{"date-parts":[[2023,10,25]]},"article-number":"btad652"}}