{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T02:36:23Z","timestamp":1773887783395,"version":"3.50.1"},"reference-count":31,"publisher":"Oxford University Press (OUP)","issue":"14","license":[{"start":{"date-parts":[[2019,7,8]],"date-time":"2019-07-08T00:00:00Z","timestamp":1562544000000},"content-version":"vor","delay-in-days":7,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000936","name":"Gordon and Betty Moore Foundation","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100000936","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Data-Driven Discovery Initiative","award":["GBMF4554"],"award-info":[{"award-number":["GBMF4554"]}]},{"DOI":"10.13039\/100000002","name":"US National Institutes of Health","doi-asserted-by":"crossref","award":["R01GM122935"],"award-info":[{"award-number":["R01GM122935"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100010319","name":"The Shurl and Kay Curci Foundation","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100010319","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2019,7,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>The code to generate the results is available at http:\/\/github.com\/Kingsford-Group\/omhismb2019.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btz354","type":"journal-article","created":{"date-parts":[[2019,5,9]],"date-time":"2019-05-09T15:21:53Z","timestamp":1557415313000},"page":"i127-i135","source":"Crossref","is-referenced-by-count":56,"title":["Locality-sensitive hashing for the edit distance"],"prefix":"10.1093","volume":"35","author":[{"given":"Guillaume","family":"Mar\u00e7ais","sequence":"first","affiliation":[{"name":"Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"given":"Dan","family":"DeBlasio","sequence":"additional","affiliation":[{"name":"Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"given":"Prashant","family":"Pandey","sequence":"additional","affiliation":[{"name":"Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"given":"Carl","family":"Kingsford","sequence":"additional","affiliation":[{"name":"Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA"}]}],"member":"286","published-online":{"date-parts":[[2019,7,5]]},"reference":[{"key":"2023062712330746200_btz354-B1","doi-asserted-by":"crossref","first-page":"413","DOI":"10.1090\/S0273-0979-99-00796-X","article-title":"Longest increasing subsequences: from patience sorting to the Baik-Deift-Johansson theorem","volume":"36","author":"Aldous","year":"1999","journal-title":"Bull. Am. Math. Soc"},{"key":"2023062712330746200_btz354-B2","first-page":"203","volume-title":"Asia Information Retrieval Symposium","author":"Alonso","year":"2013"},{"key":"2023062712330746200_btz354-B3","doi-asserted-by":"crossref","first-page":"51","DOI":"10.1145\/2746539.2746612","volume-title":"Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC \u201915","author":"Backurs","year":"2015"},{"key":"2023062712330746200_btz354-B4","first-page":"550","author":"Bar-Yossef","year":"2004"},{"key":"2023062712330746200_btz354-B5","doi-asserted-by":"crossref","first-page":"623","DOI":"10.1038\/nbt.3238","article-title":"Assembling large genomes with single-molecule sequencing and locality-sensitive hashing","volume":"33","author":"Berlin","year":"2015","journal-title":"Nat. Biotechnol"},{"key":"2023062712330746200_btz354-B6","first-page":"21","author":"Broder","year":"1997"},{"key":"2023062712330746200_btz354-B7","first-page":"812","article-title":"Near duplicate image detection: min-Hash and tf-idf weighting","author":"Chum","year":"2008","journal-title":"BMVC"},{"key":"2023062712330746200_btz354-B8","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1186\/s13635-017-0055-6","article-title":"Polymorphic malware detection using sequence classification methods and ensembles","volume":"2017","author":"Drew","year":"2017","journal-title":"EURASIP J. Inf. Secur"},{"key":"2023062712330746200_btz354-B9","doi-asserted-by":"crossref","first-page":"29","DOI":"10.1016\/0012-365X(75)90103-X","article-title":"On computing the length of longest increasing subsequences","volume":"11","author":"Fredman","year":"1975","journal-title":"Discrete Math"},{"key":"2023062712330746200_btz354-B10","first-page":"475","author":"Gollapudi","year":"2006"},{"key":"2023062712330746200_btz354-B11","author":"Harris","year":"2007"},{"key":"2023062712330746200_btz354-B12","doi-asserted-by":"crossref","first-page":"350","DOI":"10.1145\/359581.359603","article-title":"A fast algorithm for computing longest common subsequences","volume":"20","author":"Hunt","year":"1977","journal-title":"Commun. ACM"},{"key":"2023062712330746200_btz354-B13","doi-asserted-by":"crossref","first-page":"604","DOI":"10.1145\/276698.276876","volume-title":"Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC \u201998","author":"Indyk","year":"1998"},{"key":"2023062712330746200_btz354-B14","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1101\/gr.828403","article-title":"Whole-genome sequence assembly for mammalian genomes: Arachne 2","volume":"13","author":"Jaffe","year":"2003","journal-title":"Genome Res"},{"key":"2023062712330746200_btz354-B15","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1007\/978-3-319-56970-3_5","volume-title":"Research in Computational Molecular Biology","author":"Jain","year":"2017"},{"key":"2023062712330746200_btz354-B16","doi-asserted-by":"crossref","first-page":"457","DOI":"10.1137\/S0097539798347177","article-title":"Efficient search for approximate nearest neighbor in high dimensional spaces","volume":"30","author":"Kushilevitz","year":"2000","journal-title":"SIAM J. Comput"},{"key":"2023062712330746200_btz354-B17","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1038\/nmeth.1923","article-title":"Fast gapped-read alignment with Bowtie 2","volume":"9","author":"Langmead","year":"2012","journal-title":"Nat. Methods"},{"key":"2023062712330746200_btz354-B18","first-page":"7109","article-title":"Insertion sequence-caused large-scale rearrangements in the genome of Escherichia coli","volume":"44","author":"Lee","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023062712330746200_btz354-B19","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions, and reversals","author":"Levenshtein","year":"1966","journal-title":"Soviet Physics Doklady"},{"key":"2023062712330746200_btz354-B20","doi-asserted-by":"crossref","first-page":"589","DOI":"10.1093\/bioinformatics\/btp698","article-title":"Fast and accurate long-read alignment with Burrows\u2013Wheeler transform","volume":"26","author":"Li","year":"2010","journal-title":"Bioinformatics"},{"key":"2023062712330746200_btz354-B21","first-page":"878","article-title":"SOAP3: ultra-fast GPU-based parallel alignment tool for short reads","volume":"28","author":"Liu","year":"2012","journal-title":"Bioinformatics (Oxford, England)"},{"key":"2023062712330746200_btz354-B22","first-page":"38","author":"Luo","year":"2017"},{"key":"2023062712330746200_btz354-B24","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1005944","article-title":"MUMmer4: a fast and versatile genome alignment system","volume":"14","author":"Mar\u00e7ais","year":"2018","journal-title":"PLOS Comput. Biol"},{"key":"2023062712330746200_btz354-B26","doi-asserted-by":"crossref","first-page":"2196","DOI":"10.1126\/science.287.5461.2196","article-title":"A whole-genome assembly of Drosophila","volume":"287","author":"Myers","year":"2000","journal-title":"Science"},{"key":"2023062712330746200_btz354-B27","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1186\/s13059-016-0997-x","article-title":"Mash: fast genome and metagenome distance estimation using MinHash","volume":"17","author":"Ondov","year":"2016","journal-title":"Genome Biol"},{"key":"2023062712330746200_btz354-B28","first-page":"218","article-title":"Low distortion embeddings for edit distance","volume":"54","author":"Ostrovsky","year":"2007","journal-title":"J.\u00a0ACM"},{"key":"2023062712330746200_btz354-B29","first-page":"111","author":"Raff","year":"2017"},{"key":"2023062712330746200_btz354-B30","first-page":"1498","author":"Shrivastava","year":"2016"},{"key":"2023062712330746200_btz354-B31","first-page":"203","article-title":"Circuits and trees in oriented linear graphs. Simon Stevin : Wis-en Natuurkundig Tijdschrift","volume":"28","year":"1951","journal-title":"Tschr"},{"key":"2023062712330746200_btz354-B32","author":"Wu","year":"2017"},{"key":"2023062712330746200_btz354-B33","doi-asserted-by":"crossref","first-page":"e82138","DOI":"10.1371\/journal.pone.0082138","article-title":"SSW Library: an SIMD Smith-Waterman C\/C++ library for use in genomic applications","volume":"8","author":"Zhao","year":"2013","journal-title":"PLoS One"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/14\/i127\/50720313\/bioinformatics_35_14_i127.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/14\/i127\/50720313\/bioinformatics_35_14_i127.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,27]],"date-time":"2023-06-27T08:34:10Z","timestamp":1687854850000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/35\/14\/i127\/5529166"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,7]]},"references-count":31,"journal-issue":{"issue":"14","published-print":{"date-parts":[[2019,7,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btz354","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/534446","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2019,7]]},"published":{"date-parts":[[2019,7]]}}}