{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,22]],"date-time":"2026-03-22T06:28:04Z","timestamp":1774160884894,"version":"3.50.1"},"reference-count":40,"publisher":"Oxford University Press (OUP)","issue":"7","license":[{"start":{"date-parts":[[2024,7,16]],"date-time":"2024-07-16T00:00:00Z","timestamp":1721088000000},"content-version":"vor","delay-in-days":15,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Center for Processing with Intelligent Storage and Memory","award":["2023-JU-3135"],"award-info":[{"award-number":["2023-JU-3135"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Genomic distance estimation is a critical workload since exact computation for whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient solution to estimate ANI similarity by distilling representative k-mers from the original sequences. In this work, we present HyperGen that improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages the emerging hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (Hypervector, HV) in high-dimensional space. HV is compact and can preserve more information, allowing for accurate ANI estimation while reducing required sketch sizes. In particular, the HV sketch representation in HyperGen allows efficient ANI estimation using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables the efficient sketching and ANI estimation for massive genome collections.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We evaluate HyperGen\u2019s sketching and database search performance using several genome datasets at various scales. HyperGen is able to achieve comparable or superior ANI estimation error and linearity compared to other sketch-based counterparts. The measurement results show that HyperGen is one of the fastest tools for both genome sketching and database search. Meanwhile, HyperGen produces memory-efficient sketch files while ensuring high ANI estimation accuracy.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https:\/\/github.com\/wh-xu\/Hyper-Gen. The scripts to reproduce the experimental results can be accessed at https:\/\/github.com\/wh-xu\/experiment-hyper-gen.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae452","type":"journal-article","created":{"date-parts":[[2024,7,12]],"date-time":"2024-07-12T10:29:59Z","timestamp":1720780199000},"source":"Crossref","is-referenced-by-count":11,"title":["HyperGen: compact and efficient genome sketching using hyperdimensional vectors"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3766-3353","authenticated-orcid":false,"given":"Weihong","family":"Xu","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, University of California San Diego , La Jolla, CA 92093, United States"}]},{"given":"Po-Kai","family":"Hsu","sequence":"additional","affiliation":[{"name":"School of Electrical and Computer Engineering, Georgia Institute of Technology , Atlanta, GA 30332, United States"}]},{"given":"Niema","family":"Moshiri","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, University of California San Diego , La Jolla, CA 92093, United States"}]},{"given":"Shimeng","family":"Yu","sequence":"additional","affiliation":[{"name":"School of Electrical and Computer Engineering, Georgia Institute of Technology , Atlanta, GA 30332, United States"}]},{"given":"Tajana","family":"Rosing","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, University of California San Diego , La Jolla, CA 92093, United States"}]}],"member":"286","published-online":{"date-parts":[[2024,7,16]]},"reference":[{"key":"2024072623111323200_btae452-B1","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1186\/s13059-019-1875-0","article-title":"Dashing: fast and accurate genomic distances with hyperloglog","volume":"20","author":"Baker","year":"2019","journal-title":"Genome Biol"},{"key":"2024072623111323200_btae452-B2","first-page":"1218","article-title":"Genomic sketching with multiplicities and locality-sensitive hashing using dashing 2","volume":"33","author":"Baker","year":"2023","journal-title":"Genome Res"},{"key":"2024072623111323200_btae452-B3","first-page":"21","author":"Broder","year":"1997"},{"key":"2024072623111323200_btae452-B4","doi-asserted-by":"crossref","first-page":"27","DOI":"10.21105\/joss.00027","article-title":"sourmash: a library for minhash sketching of DNA","volume":"1","author":"Brown","year":"2016","journal-title":"JOSS"},{"key":"2024072623111323200_btae452-B5","doi-asserted-by":"crossref","first-page":"5315","DOI":"10.1093\/bioinformatics\/btac672","article-title":"Gtdb-tk v2: memory friendly classification with the genome taxonomy database","volume":"38","author":"Chaumeil","year":"2022","journal-title":"Bioinformatics"},{"key":"2024072623111323200_btae452-B6","doi-asserted-by":"crossref","first-page":"2244","DOI":"10.14778\/3476249.3476276","article-title":"Setsketch: filling the gap between minhash and hyperloglog","volume":"14","author":"Ertl","year":"2021","journal-title":"Proc VLDB Endow"},{"key":"2024072623111323200_btae452-B7","doi-asserted-by":"crossref","first-page":"lqad004","DOI":"10.1093\/nargab\/lqad004","article-title":"Blend: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis","volume":"5","author":"Firtina","year":"2023","journal-title":"NAR Genom Bioinform"},{"key":"2024072623111323200_btae452-B8","first-page":"3887","author":"Guo","year":"2020"},{"key":"2024072623111323200_btae452-B9","first-page":"gr\u2013277651","article-title":"Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using fracminhash","author":"Hera","year":"2023","journal-title":"Genome Res"},{"key":"2024072623111323200_btae452-B10","doi-asserted-by":"crossref","first-page":"e0291492","DOI":"10.1371\/journal.pone.0291492","article-title":"Fast genome-based delimitation of enterobacterales species","volume":"18","author":"Hern\u00e1ndez-Salmer\u00f3n","year":"2023","journal-title":"PLoS One"},{"key":"2024072623111323200_btae452-B11","first-page":"2022","article-title":"Lightweight compositional analysis of metagenomes with fracminhash and minimum metagenome covers","author":"Irber","year":"2022","journal-title":"BioRxiv"},{"key":"2024072623111323200_btae452-B12","first-page":"66","author":"Jain","year":"2017"},{"key":"2024072623111323200_btae452-B13","doi-asserted-by":"crossref","first-page":"5114","DOI":"10.1038\/s41467-018-07641-9","article-title":"High throughput ani analysis of 90k prokaryotic genomes reveals clear species boundaries","volume":"9","author":"Jain","year":"2018","journal-title":"Nat Commun"},{"key":"2024072623111323200_btae452-B14","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1109\/TPAMI.2010.57","article-title":"Product quantization for nearest neighbor search","volume":"33","author":"J\u00e9gou","year":"2011","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"2024072623111323200_btae452-B15","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1007\/s12559-009-9009-8","article-title":"Hyperdimensional computing: an introduction to computing in distributed representation with high-dimensional random vectors","volume":"1","author":"Kanerva","year":"2009","journal-title":"Cogn Comput"},{"key":"2024072623111323200_btae452-B16","author":"Kanerva","year":"2000"},{"key":"2024072623111323200_btae452-B17","doi-asserted-by":"crossref","first-page":"btad404","DOI":"10.1093\/bioinformatics\/btad404","article-title":"Accelerating open modification spectral library searching on tensor core in high-dimensional space","volume":"39","author":"Kang","year":"2023","journal-title":"Bioinformatics"},{"key":"2024072623111323200_btae452-B18","first-page":"115","author":"Kim","year":"2020"},{"key":"2024072623111323200_btae452-B19","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/gb-2004-5-2-r12","article-title":"Versatile and open software for comparing large genomes","volume":"5","author":"Kurtz","year":"2004","journal-title":"Genome Biology"},{"key":"2024072623111323200_btae452-B20","first-page":"11523","author":"Lee","year":"2022"},{"key":"2024072623111323200_btae452-B21","doi-asserted-by":"crossref","first-page":"1100","DOI":"10.1099\/ijsem.0.000760","article-title":"Orthoani: an improved algorithm and software for calculating average nucleotide identity","volume":"66","author":"Lee","year":"2016","journal-title":"Int J Syst Evol Microbiol"},{"key":"2024072623111323200_btae452-B22","doi-asserted-by":"crossref","first-page":"i28","DOI":"10.1093\/bioinformatics\/btac237","article-title":"Cmash: fast, multi-resolution estimation of k-mer-based jaccard and containment indices","volume":"38","author":"Liu","year":"2022","journal-title":"Bioinformatics"},{"key":"2024072623111323200_btae452-B23","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1145\/2692956.2663188","article-title":"The rust language","volume":"34","author":"Matsakis","year":"2014","journal-title":"Ada Lett"},{"key":"2024072623111323200_btae452-B24","first-page":"1758","author":"Nunes","year":"2023"},{"key":"2024072623111323200_btae452-B25","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1186\/s13059-016-0997-x","article-title":"Mash: fast genome and metagenome distance estimation using minhash","volume":"17","author":"Ondov","year":"2016","journal-title":"Genome Biol"},{"key":"2024072623111323200_btae452-B26","doi-asserted-by":"crossref","first-page":"232","DOI":"10.1186\/s13059-019-1841-x","article-title":"Mash screen: high-throughput sequence containment estimation for genome discovery","volume":"20","author":"Ondov","year":"2019","journal-title":"Genome Biol"},{"key":"2024072623111323200_btae452-B27","doi-asserted-by":"crossref","first-page":"1533","DOI":"10.1038\/s41564-017-0012-7","article-title":"Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life","volume":"2","author":"Parks","year":"2017","journal-title":"Nat Microbiol"},{"key":"2024072623111323200_btae452-B28","doi-asserted-by":"crossref","first-page":"996","DOI":"10.1038\/nbt.4229","article-title":"A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life","volume":"36","author":"Parks","year":"2018","journal-title":"Nat Biotechnol"},{"key":"2024072623111323200_btae452-B29","doi-asserted-by":"crossref","first-page":"1079","DOI":"10.1038\/s41587-020-0501-8","article-title":"A complete domain-to-species taxonomy for bacteria and archaea","volume":"38","author":"Parks","year":"2020","journal-title":"Nat Biotechnol"},{"key":"2024072623111323200_btae452-B30","author":"Sahlgren","year":"2005"},{"key":"2024072623111323200_btae452-B31","doi-asserted-by":"crossref","first-page":"82493","DOI":"10.1109\/ACCESS.2022.3195878","article-title":"Demeter: a fast and energy-efficient food profiler using hyperdimensional computing in memory","volume":"10","author":"Shahroodi","year":"2022","journal-title":"IEEE Access"},{"key":"2024072623111323200_btae452-B32","doi-asserted-by":"crossref","first-page":"1661","DOI":"10.1038\/s41592-023-02018-3","article-title":"Fast and robust metagenomic sequence comparison through sparse chaining with skani","volume":"20","author":"Shaw","year":"2023","journal-title":"Nat Methods"},{"key":"2024072623111323200_btae452-B33","first-page":"3154","author":"Shrivastava","year":"2017"},{"key":"2024072623111323200_btae452-B34","doi-asserted-by":"crossref","first-page":"3210","DOI":"10.1093\/bioinformatics\/btv351","article-title":"Busco: assessing genome assembly and annotation completeness with single-copy orthologs","volume":"31","author":"Sim\u00e3o","year":"2015","journal-title":"Bioinformatics"},{"key":"2024072623111323200_btae452-B35","doi-asserted-by":"crossref","first-page":"24075","DOI":"10.1007\/s11042-020-09108-w","article-title":"Testu01 and practrand: tools for a randomness evaluation for famous multimedia ciphers","volume":"79","author":"Sleem","year":"2020","journal-title":"Multimed Tools Appl"},{"key":"2024072623111323200_btae452-B36","doi-asserted-by":"crossref","first-page":"640","DOI":"10.1038\/msb.2012.61","article-title":"High-throughput sequencing for biology and medicine","volume":"9","author":"Soon","year":"2013","journal-title":"Mol Syst Biol"},{"key":"2024072623111323200_btae452-B37","doi-asserted-by":"crossref","first-page":"e1002195","DOI":"10.1371\/journal.pbio.1002195","article-title":"Big data: astronomical or genomical?","volume":"13","author":"Stephens","year":"2015","journal-title":"PLoS Biol"},{"key":"2024072623111323200_btae452-B38","doi-asserted-by":"crossref","first-page":"1639","DOI":"10.1021\/acs.jproteome.2c00612","article-title":"Hyperspec: ultrafast mass spectra clustering in hyperdimensional space","volume":"22","author":"Xu","year":"2023","journal-title":"J Proteome Res"},{"key":"2024072623111323200_btae452-B39","doi-asserted-by":"crossref","first-page":"671","DOI":"10.1093\/bioinformatics\/bty651","article-title":"Bindash, software for fast genome distance estimation on a typical personal laptop","volume":"35","author":"Zhao","year":"2019","journal-title":"Bioinformatics"},{"key":"2024072623111323200_btae452-B40","first-page":"656","author":"Zou","year":"2022"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae452\/58558968\/btae452.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/7\/btae452\/58662658\/btae452.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/7\/btae452\/58662658\/btae452.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,26]],"date-time":"2024-07-26T23:49:56Z","timestamp":1722037796000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae452\/7714688"}},"subtitle":[],"editor":[{"given":"Can","family":"Alkan","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,7,1]]},"references-count":40,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2024,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae452","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2024.03.05.583605","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,7]]},"published":{"date-parts":[[2024,7,1]]},"article-number":"btae452"}}