{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:28Z","timestamp":1772138068920,"version":"3.50.1"},"reference-count":33,"publisher":"Oxford University Press (OUP)","issue":"5","license":[{"start":{"date-parts":[[2024,4,22]],"date-time":"2024-04-22T00:00:00Z","timestamp":1713744000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100000780","name":"European Union","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100000780","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,5,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>As data storage challenges grow and existing technologies approach their limits, synthetic DNA emerges as a promising storage solution due to its remarkable density and durability advantages. While cost remains a concern, emerging sequencing and synthetic technologies aim to mitigate it, yet introduce challenges such as errors in the storage and retrieval process. One crucial task in a DNA storage system is clustering numerous DNA reads into groups that represent the original input strands.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>In this paper, we review different methods for evaluating clustering algorithms and introduce a novel clustering algorithm for DNA storage systems, named Gradual Hash-based clustering (GradHC). The primary strength of GradHC lies in its capability to cluster with excellent accuracy various types of designs, including varying strand lengths, cluster sizes (including extremely small clusters), and different error ranges. Benchmark analysis demonstrates that GradHC is significantly more stable and robust than other clustering algorithms previously proposed for DNA storage, while also producing highly reliable clustering results.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>https:\/\/github.com\/bensdvir\/GradHC.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btae274","type":"journal-article","created":{"date-parts":[[2024,4,17]],"date-time":"2024-04-17T20:51:39Z","timestamp":1713387099000},"source":"Crossref","is-referenced-by-count":9,"title":["GradHC: highly reliable gradual hash-based clustering for DNA storage systems"],"prefix":"10.1093","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-2163-0727","authenticated-orcid":false,"given":"Dvir","family":"Ben Shabat","sequence":"first","affiliation":[{"name":"Department of Computer Science, Technion , Haifa 320003, Israel"}]},{"given":"Adar","family":"Hadad","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Technion , Haifa 320003, Israel"}]},{"given":"Avital","family":"Boruchovsky","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Technion , Haifa 320003, Israel"}]},{"given":"Eitan","family":"Yaakobi","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Technion , Haifa 320003, Israel"}]}],"member":"286","published-online":{"date-parts":[[2024,4,22]]},"reference":[{"key":"2024051108505466200_btae274-B1","doi-asserted-by":"crossref","first-page":"5345","DOI":"10.1038\/s41467-020-19148-3","article-title":"Low cost DNA data storage using photolithographic synthesis and advanced information reconstruction and error correction","volume":"11","author":"Antkowiak","year":"2020","journal-title":"Nat Commun"},{"key":"2024051108505466200_btae274-B2","doi-asserted-by":"crossref","first-page":"2502","DOI":"10.1093\/bioinformatics\/btr447","article-title":"SEED: efficient clustering of next-generation sequences","volume":"27","author":"Bao","year":"2011","journal-title":"Bioinformatics"},{"key":"2024051108505466200_btae274-B3","doi-asserted-by":"crossref","first-page":"54348","DOI":"10.1109\/ACCESS.2022.3176954","article-title":"A deep embedded clustering algorithm for the binning of metagenomic sequences","volume":"10","author":"Bao","year":"2022","journal-title":"IEEE Access"},{"key":"2024051108505466200_btae274-B4","first-page":"910","author":"Batu","year":"2004"},{"key":"2024051108505466200_btae274-B5","doi-asserted-by":"crossref","first-page":"419","DOI":"10.1093\/bioinformatics\/17.5.419","article-title":"Efficient large-scale sequence comparison by locality-sensitive hashing","volume":"17","author":"Buhler","year":"2001","journal-title":"Bioinformatics"},{"key":"2024051108505466200_btae274-B6","doi-asserted-by":"crossref","first-page":"8242","DOI":"10.1038\/s41598-020-64803-w","article-title":"Evaluating white matter lesion segmentations with refined s\u00f8rensen-dice analysis","volume":"10","author":"Carass","year":"2020","journal-title":"Sci Rep"},{"key":"2024051108505466200_btae274-B7","doi-asserted-by":"crossref","first-page":"456","DOI":"10.1038\/s41576-019-0125-3","article-title":"Molecular digital data storage using DNA","volume":"20","author":"Ceze","year":"2019","journal-title":"Nat Rev Genet"},{"key":"2024051108505466200_btae274-B8","author":"Chaykin","year":"2022"},{"key":"2024051108505466200_btae274-B9","doi-asserted-by":"crossref","first-page":"108","DOI":"10.1186\/s12859-022-04643-9","article-title":"Clustering biological sequences with dynamic sequence similarity threshold","volume":"23","author":"Chiu","year":"2022","journal-title":"BMC Bioinformatics"},{"key":"2024051108505466200_btae274-B10","doi-asserted-by":"crossref","first-page":"2460","DOI":"10.1093\/bioinformatics\/btq461","article-title":"Search and clustering orders of magnitude faster than BLAST","volume":"26","author":"Edgar","year":"2010","journal-title":"Bioinformatics"},{"key":"2024051108505466200_btae274-B11","doi-asserted-by":"crossref","first-page":"950","DOI":"10.1126\/science.aaj2038","article-title":"DNA fountain enables a robust and efficient storage architecture","volume":"355","author":"Erlich","year":"2017","journal-title":"Science"},{"key":"2024051108505466200_btae274-B12","first-page":"226","author":"Ester","year":"1996"},{"key":"2024051108505466200_btae274-B13","doi-asserted-by":"crossref","first-page":"3150","DOI":"10.1093\/bioinformatics\/bts565","article-title":"CD-HIT: accelerated for clustering the next-generation sequencing data","volume":"28","author":"Fu","year":"2012","journal-title":"Bioinformatics"},{"key":"2024051108505466200_btae274-B14","doi-asserted-by":"crossref","first-page":"271","DOI":"10.1186\/1471-2105-12-271","article-title":"Dnaclust: accurate and efficient clustering of phylogenetic marker genes","volume":"12","author":"Ghodsi","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2024051108505466200_btae274-B15","doi-asserted-by":"crossref","first-page":"423","DOI":"10.1186\/s12864-022-08619-0","article-title":"Meshclust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores","volume":"23","author":"Girgis","year":"2022","journal-title":"BMC Genomics"},{"key":"2024051108505466200_btae274-B16","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1038\/nature11875","article-title":"Towards practical, high-capacity, low-maintenance information storage in synthesized DNA","volume":"494","author":"Goldman","year":"2013","journal-title":"Nature"},{"key":"2024051108505466200_btae274-B17","doi-asserted-by":"crossref","first-page":"2552","DOI":"10.1002\/anie.201411378","article-title":"Robust chemical preservation of digital information on DNA in silica with error-correcting codes","volume":"54","author":"Grass","year":"2015","journal-title":"Angew Chem Int Ed Engl"},{"key":"2024051108505466200_btae274-B18","first-page":"604","author":"Indyk","year":"1998"},{"key":"2024051108505466200_btae274-B19","doi-asserted-by":"crossref","first-page":"e83","DOI":"10.1093\/nar\/gky315","article-title":"MeShClust: an intelligent tool for clustering DNA sequences","volume":"46","author":"James","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2024051108505466200_btae274-B20","doi-asserted-by":"crossref","first-page":"499","DOI":"10.1038\/nmeth.2918","article-title":"Large-scale de novo DNA synthesis: technologies and applications","volume":"11","author":"Kosuri","year":"2014","journal-title":"Nat Methods"},{"key":"2024051108505466200_btae274-B21","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9781139924801","volume-title":"Mining of Massive Datasets","author":"Leskovec","year":"2014","edition":"2nd edn"},{"key":"2024051108505466200_btae274-B22","doi-asserted-by":"crossref","first-page":"i127","DOI":"10.1093\/bioinformatics\/btz354","article-title":"Locality-sensitive hashing for the edit distance","volume":"35","author":"Mar\u00e7ais","year":"2019","journal-title":"Bioinformatics"},{"key":"2024051108505466200_btae274-B23","doi-asserted-by":"crossref","first-page":"242","DOI":"10.1038\/nbt.4079","article-title":"Random access in large-scale dna data storage","volume":"36","author":"Organick","year":"2018","journal-title":"Nat Biotechnol"},{"key":"2024051108505466200_btae274-B24","doi-asserted-by":"crossref","first-page":"bbac336","DOI":"10.1093\/bib\/bbac336","article-title":"Clover: tree structure-based efficient DNA clustering for DNA-based data storage","volume":"23","author":"Qu","year":"2022","journal-title":"Brief Bioinform"},{"key":"2024051108505466200_btae274-B25","volume-title":"Advances in Neural Information Processing Systems 30, Long Beach, CA, USA","author":"Rashtchian"},{"key":"2024051108505466200_btae274-B26","doi-asserted-by":"crossref","first-page":"720","DOI":"10.1093\/bioinformatics\/btaa740","article-title":"SOLQC: synthetic oligo library quality control tool","volume":"37","author":"Sabary","year":"2020","journal-title":"Bioinformatics"},{"key":"2024051108505466200_btae274-B27","doi-asserted-by":"crossref","first-page":"1951","DOI":"10.1038\/s41598-024-51730-3","article-title":"Reconstruction algorithms for DNA-storage systems","volume":"14","author":"Sabary","year":"2024","journal-title":"Sci Rep"},{"key":"2024051108505466200_btae274-B28","first-page":"USA: IEEE, 2022, 269","author":"Sankar","year":"2022"},{"key":"2024051108505466200_btae274-B29","doi-asserted-by":"crossref","first-page":"1560","DOI":"10.1109\/TIT.2021.3127174","article-title":"Clustering-correcting codes","volume":"68","author":"Shinkar","year":"2022","journal-title":"IEEE Trans Inform Theory"},{"key":"2024051108505466200_btae274-B30","author":"Srinivasavaradhan"},{"key":"2024051108505466200_btae274-B31","doi-asserted-by":"crossref","first-page":"2542","DOI":"10.1038\/s41467-018-04964-5","article-title":"Clustering huge protein sequence sets in linear time","volume":"9","author":"Steinegger","year":"2018","journal-title":"Nat Commun"},{"key":"2024051108505466200_btae274-B32","first-page":"399","author":"Viswanathan","year":"2008"},{"key":"2024051108505466200_btae274-B33","doi-asserted-by":"crossref","first-page":"1913","DOI":"10.1093\/bioinformatics\/btv053","article-title":"Starcode: sequence clustering based on all-pairs search","volume":"31","author":"Zorita","year":"2015","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btae274\/57300151\/btae274.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/5\/btae274\/57521113\/btae274.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/40\/5\/btae274\/57521113\/btae274.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,11]],"date-time":"2024-05-11T05:09:17Z","timestamp":1715404157000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btae274\/7655853"}},"subtitle":[],"editor":[{"given":"Can","family":"Alkan","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,4,22]]},"references-count":33,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2024,5,2]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btae274","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2023.10.05.561008","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,5,1]]},"published":{"date-parts":[[2024,4,22]]},"article-number":"btae274"}}