{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T15:00:02Z","timestamp":1776092402482,"version":"3.50.1"},"reference-count":26,"publisher":"Oxford University Press (OUP)","issue":"3","license":[{"start":{"date-parts":[[2023,3,9]],"date-time":"2023-03-09T00:00:00Z","timestamp":1678320000000},"content-version":"vor","delay-in-days":8,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100017223","name":"National Energy Research Scientific Computing Center","doi-asserted-by":"publisher","id":[{"id":"10.13039\/100017223","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,3,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Read-overlap-based graph data structures play a central role in computing de novo genome assembly. Most long-read assemblers use Myers\u2019s string graph model to sparsify overlap graphs. Graph sparsification improves assembly contiguity by removing spurious and redundant connections. However, a graph model must be coverage-preserving, i.e. it must ensure that there exist walks in the graph that spell all chromosomes, given sufficient sequencing coverage. This property becomes even more important for diploid genomes, polyploid genomes, and metagenomes where there is a risk of losing haplotype-specific information.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We develop a novel theoretical framework under which the coverage-preserving properties of a graph model can be analyzed. We first prove that de Bruijn graph and overlap graph models are guaranteed to be coverage-preserving. We next show that the standard string graph model lacks this guarantee. The latter result is consistent with prior work suggesting that removal of contained reads, i.e. the reads that are substrings of other reads, can lead to coverage gaps during string graph construction. Our experiments done using simulated long reads from HG002 human diploid genome show that 50 coverage gaps are introduced on average by ignoring contained reads from nanopore datasets. To remedy this, we propose practical heuristics that are well-supported by our theoretical results and are useful to decide which contained reads should be retained to avoid coverage gaps. Our method retains a small fraction of contained reads (1\u20132%) and closes majority of the coverage gaps.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>Source code is available through GitHub (https:\/\/github.com\/at-cg\/ContainX) and Zenodo with doi: 10.5281\/zenodo.7687543.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btad124","type":"journal-article","created":{"date-parts":[[2023,3,8]],"date-time":"2023-03-08T14:32:32Z","timestamp":1678285952000},"source":"Crossref","is-referenced-by-count":10,"title":["Coverage-preserving sparsification of overlap graphs for long-read assembly"],"prefix":"10.1093","volume":"39","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4300-0794","authenticated-orcid":false,"given":"Chirag","family":"Jain","sequence":"first","affiliation":[{"name":"Department of Computational and Data Sciences, Indian Institute of Science , Bengaluru, Karnataka 560012, India"}]}],"member":"286","published-online":{"date-parts":[[2023,3,9]]},"reference":[{"key":"2023042618305109500_btad124-B1","doi-asserted-by":"crossref","first-page":"1075","DOI":"10.1038\/s41587-022-01220-6","article-title":"Multiplex de bruijn graphs enable genome assembly from long, high-fidelity reads","volume":"40","year":"2022","journal-title":"Nat Biotechnol"},{"key":"2023042618305109500_btad124-B2","doi-asserted-by":"crossref","first-page":"170","DOI":"10.1038\/s41592-020-01056-5","article-title":"Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm","volume":"18","year":"2021","journal-title":"Nat Methods"},{"key":"2023042618305109500_btad124-B3","doi-asserted-by":"crossref","first-page":"1332","DOI":"10.1038\/s41587-022-01261-x","article-title":"Haplotype-resolved assembly of diploid genomes without parental data","volume":"40","author":"Cheng","year":"2022","journal-title":"Nat Biotechnol"},{"key":"2023042618305109500_btad124-B4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/1748-7188-8-22","article-title":"Space-efficient and exact de bruijn graph representation based on a bloom filter","volume":"8","author":"Chikhi","year":"2013","journal-title":"Algorithms Mol Biol"},{"key":"2023042618305109500_btad124-B5","doi-asserted-by":"crossref","first-page":"671","DOI":"10.1038\/s41592-022-01478-3","article-title":"Metagenome assembly of high-fidelity long reads with hifiasm-meta","volume":"19","author":"Feng","year":"2022","journal-title":"Nat Methods"},{"key":"2023042618305109500_btad124-B6","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13059-018-1605-z","article-title":"A comparative evaluation of hybrid error correction methods for error-prone long reads","volume":"20","author":"Fu","year":"2019","journal-title":"Genome Biol"},{"key":"2023042618305109500_btad124-B7","doi-asserted-by":"crossref","first-page":"i105","DOI":"10.1093\/bioinformatics\/bty279","article-title":"A graph-based approach to diploid genome assembly","volume":"34","author":"Garg","year":"2018","journal-title":"Bioinformatics"},{"key":"2023042618305109500_btad124-B8","first-page":"1018","author":"Hui","year":"2016"},{"key":"2023042618305109500_btad124-B9","doi-asserted-by":"crossref","first-page":"291","DOI":"10.1089\/cmb.1995.2.291","article-title":"A new algorithm for DNA sequence assembly","volume":"2","author":"Idury","year":"1995","journal-title":"J Comput Biol"},{"key":"2023042618305109500_btad124-B10","doi-asserted-by":"crossref","first-page":"519","DOI":"10.1038\/s41586-022-05325-5","article-title":"Semi-automated assembly of high-quality diploid human reference genomes","volume":"611","author":"Jarvis","year":"2022","journal-title":"Nature"},{"key":"2023042618305109500_btad124-B11","doi-asserted-by":"crossref","first-page":"231","DOI":"10.1016\/0888-7543(88)90007-9","article-title":"Genomic mapping by fingerprinting random clones: a mathematical analysis","volume":"2","author":"Lander","year":"1988","journal-title":"Genomics"},{"key":"2023042618305109500_btad124-B12","doi-asserted-by":"crossref","first-page":"3094","DOI":"10.1093\/bioinformatics\/bty191","article-title":"Minimap2: pairwise alignment for nucleotide sequences","volume":"34","author":"Li","year":"2018","journal-title":"Bioinformatics"},{"key":"2023042618305109500_btad124-B13","doi-asserted-by":"crossref","first-page":"595","DOI":"10.1038\/s41592-018-0054-7","article-title":"A synthetic-diploid benchmark for accurate variant-calling evaluation","volume":"15","author":"Li","year":"2018","journal-title":"Nat Methods"},{"key":"2023042618305109500_btad124-B14","author":"Liao","year":"2022"},{"key":"2023042618305109500_btad124-B15","doi-asserted-by":"crossref","first-page":"e1008928","DOI":"10.1371\/journal.pcbi.1008928","article-title":"What do Eulerian and Hamiltonian cycles have to do with genome assembly?","volume":"17","author":"Medvedev","year":"2021","journal-title":"PLoS Comput Biol"},{"key":"2023042618305109500_btad124-B16","doi-asserted-by":"crossref","first-page":"2818","DOI":"10.1093\/bioinformatics\/btn548","article-title":"Aggressive assembly of pyrosequencing reads with mates","volume":"24","author":"Miller","year":"2008","journal-title":"Bioinformatics"},{"key":"2023042618305109500_btad124-B17","doi-asserted-by":"crossref","first-page":"275","DOI":"10.1089\/cmb.1995.2.275","article-title":"Toward simplifying and accurately formulating fragment assembly","volume":"2","author":"Myers","year":"1995","journal-title":"J Comput Biol"},{"issue":"Suppl_2","key":"2023042618305109500_btad124-B18","doi-asserted-by":"crossref","first-page":"ii79","DOI":"10.1093\/bioinformatics\/bti1114","article-title":"The fragment assembly string graph","volume":"21","author":"Myers","year":"2005","journal-title":"Bioinformatics"},{"key":"2023042618305109500_btad124-B19","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1126\/science.abj6987","article-title":"The complete sequence of a human genome","volume":"376","author":"Nurk","year":"2022","journal-title":"Science"},{"key":"2023042618305109500_btad124-B20","doi-asserted-by":"crossref","first-page":"841","DOI":"10.1093\/bioinformatics\/btq033","article-title":"Bedtools: a flexible suite of utilities for comparing genomic features","volume":"26","author":"Quinlan","year":"2010","journal-title":"Bioinformatics"},{"key":"2023042618305109500_btad124-B21","first-page":"1","article-title":"Telomere-to-telomere assembly of diploid chromosomes with verkko","author":"Rautiainen","year":"2023","journal-title":"Nat Biotechnol"},{"key":"2023042618305109500_btad124-B22","doi-asserted-by":"crossref","first-page":"823","DOI":"10.1038\/s41592-022-01539-7","article-title":"Oxford nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing","volume":"19","author":"Sereika","year":"2022","journal-title":"Nat Methods"},{"key":"2023042618305109500_btad124-B23","doi-asserted-by":"crossref","first-page":"i494","DOI":"10.1093\/bioinformatics\/btw450","article-title":"Information-optimal genome assembly via sparse read-overlap graphs","volume":"32","author":"Shomorony","year":"2016","journal-title":"Bioinformatics"},{"key":"2023042618305109500_btad124-B24","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1089\/cmb.2016.0141","article-title":"Safe and complete contig assembly through omnitigs","volume":"24","author":"Tomescu","year":"2017","journal-title":"J Comput Biol"},{"key":"2023042618305109500_btad124-B25","doi-asserted-by":"crossref","first-page":"1155","DOI":"10.1038\/s41587-019-0217-9","article-title":"Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome","volume":"37","author":"Wenger","year":"2019","journal-title":"Nat Biotechnol"},{"key":"2023042618305109500_btad124-B26","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12864-020-07227-0","article-title":"A comprehensive evaluation of long read error correction methods","volume":"21","author":"Zhang","year":"2020","journal-title":"BMC Genomics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btad124\/49479723\/btad124.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/3\/btad124\/50100989\/btad124.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/39\/3\/btad124\/50100989\/btad124.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,8]],"date-time":"2023-12-08T10:08:25Z","timestamp":1702030105000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btad124\/7074174"}},"subtitle":[],"editor":[{"given":"Anthony","family":"Mathelier","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2023,3,1]]},"references-count":26,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,3,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btad124","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2022.03.17.484715","asserted-by":"object"}]},"ISSN":["1367-4811"],"issn-type":[{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023,3,1]]},"published":{"date-parts":[[2023,3,1]]},"article-number":"btad124"}}