{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T23:31:41Z","timestamp":1776123101570,"version":"3.50.1"},"reference-count":53,"publisher":"Oxford University Press (OUP)","issue":"17","license":[{"start":{"date-parts":[[2018,9,9]],"date-time":"2018-09-09T00:00:00Z","timestamp":1536451200000},"content-version":"vor","delay-in-days":8,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"name":"National Science and Engineering Research Council Discovery"},{"name":"EMBO Installation","award":["IG-2521"],"award-info":[{"award-number":["IG-2521"]}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["GM108348"],"award-info":[{"award-number":["GM108348"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,9,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Segmental duplications (SDs), or low-copy repeats, are segments of DNA\u2009&gt;\u20091 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution, a common cause of genomic structural variation and several are associated with diseases of genomic origin including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. 
In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, only one tool has been developed for this purpose, Whole-Genome Assembly Comparison (WGAC), whose primary goal is SD detection. WGAC comprises several steps that employ different tools and custom scripts, which makes this strategy difficult and time-consuming to use. Thus, there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user-friendly manner.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining a substantial speedup over WGAC that translates into practical run times of minutes instead of weeks. 
Notably, our algorithm captures up to 25% \u2018pairwise error\u2019 between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>SEDEF is available at https:\/\/github.com\/vpc-ccg\/sedef.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/bty586","type":"journal-article","created":{"date-parts":[[2018,7,7]],"date-time":"2018-07-07T05:42:12Z","timestamp":1530942132000},"page":"i706-i714","source":"Crossref","is-referenced-by-count":91,"title":["Fast characterization of segmental duplications in genome assemblies"],"prefix":"10.1093","volume":"34","author":[{"given":"Ibrahim","family":"Numanagi\u0107","sequence":"first","affiliation":[{"name":"Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA"},{"name":"Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alim S","family":"G\u00f6kkaya","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Bilkent University, Ankara, Turkey"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lillian","family":"Zhang","sequence":"additional","affiliation":[{"name":"Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bonnie","family":"Berger","sequence":"additional","affiliation":[{"name":"Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA"},{"name":"Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Can","family":"Alkan","sequence":"additional","affiliation":[{"name":"Department of Computer Engineering, Bilkent University, Ankara, 
Turkey"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Faraz","family":"Hach","sequence":"additional","affiliation":[{"name":"Vancouver Prostate Centre, Vancouver, Canada"},{"name":"Department of Urologic Sciences, University of British Columbia, Vancouver, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2018,9,8]]},"reference":[{"key":"2023061313503492400_bty586-B1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/978-3-540-39763-2_1","article-title":"A local chaining algorithm and its applications in comparative genomics","volume-title":"Algorithms in Bioinformatics","author":"Abouelhoda","year":"2003"},{"key":"2023061313503492400_bty586-B2","doi-asserted-by":"crossref","first-page":"1061","DOI":"10.1038\/ng.437","article-title":"Personalized copy number and segmental duplication maps using next-generation sequencing","volume":"41","author":"Alkan","year":"2009","journal-title":"Nat. Genet."},{"key":"2023061313503492400_bty586-B3","doi-asserted-by":"crossref","first-page":"363","DOI":"10.1038\/nrg2958","article-title":"Genome structural variation discovery and genotyping","volume":"12","author":"Alkan","year":"2011","journal-title":"Nat Rev. Genet."},{"key":"2023061313503492400_bty586-B4","doi-asserted-by":"crossref","first-page":"61","DOI":"10.1038\/nmeth.1527","article-title":"Limitations of next-generation genome sequence assembly","volume":"8","author":"Alkan","year":"2011","journal-title":"Nat. Methods"},{"key":"2023061313503492400_bty586-B5","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J. Mol. 
Biol."},{"key":"2023061313503492400_bty586-B6","first-page":"377","article-title":"Polylogarithmic approximation for edit distance and the asymmetric query complexity","volume-title":"Proceedings of the 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS '10","author":"Andoni","year":"2010"},{"key":"2023061313503492400_bty586-B7","doi-asserted-by":"crossref","DOI":"10.1145\/2746539.2746612","article-title":"Edit distance cannot be computed in strongly subquadratic time (unless SETH is false)","volume-title":"Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing","author":"Backurs","year":"2015"},{"key":"2023061313503492400_bty586-B8","doi-asserted-by":"crossref","first-page":"1005","DOI":"10.1101\/gr.187101","article-title":"Segmental duplications: organization and impact within the current human genome project assembly","volume":"11","author":"Bailey","year":"2001","journal-title":"Genome Res."},{"key":"2023061313503492400_bty586-B9","doi-asserted-by":"crossref","first-page":"1003","DOI":"10.1126\/science.1072047","article-title":"Recent segmental duplications in the human genome","volume":"297","author":"Bailey","year":"2002","journal-title":"Science"},{"key":"2023061313503492400_bty586-B10","first-page":"550","article-title":"Approximating edit distance efficiently","volume-title":"Proceedings of the 45th Annual IEEE Symp. Foundations of Computer Science","author":"Bar-Yossef","year":"2004"},{"key":"2023061313503492400_bty586-B11","first-page":"21","article-title":"On the resemblance and containment of documents","volume-title":"Proceedings of the Compression and Complexity of SEQUENCES 1997 (Cat. 
No.97TB100171)","author":"Broder","year":"1997"},{"key":"2023061313503492400_bty586-B12","doi-asserted-by":"crossref","first-page":"225","DOI":"10.1007\/3-540-45452-7_19","article-title":"One-gapped q-gram filters for Levenshtein distance","volume-title":"Annual Symposium on Combinatorial Pattern Matching","author":"Burkhardt","year":"2002"},{"key":"2023061313503492400_bty586-B13","unstructured":"Carruthers-Smith \u00a0K. (2013) Sliding window minimum implementations. https:\/\/people.cs.uct.ac.za\/\u223cksmith\/articles\/sliding_window_minimum.html (28 January 2018, date last accessed)."},{"key":"2023061313503492400_bty586-B14","doi-asserted-by":"crossref","first-page":"627","DOI":"10.1038\/nrg3933","article-title":"Genetic variation and the de novo assembly of human genomes","volume":"16","author":"Chaisson","year":"2015","journal-title":"Nat. Rev. Genet."},{"key":"2023061313503492400_bty586-B15","doi-asserted-by":"crossref","first-page":"667","DOI":"10.1186\/s12864-017-4083-x","article-title":"Gapless genome assembly of colletotrichum higginsianum reveals chromosome structure and association of transposable elements with secondary metabolite gene clusters","volume":"18","author":"Dallery","year":"2017","journal-title":"BMC Genomics"},{"key":"2023061313503492400_bty586-B16","doi-asserted-by":"crossref","first-page":"522","DOI":"10.1186\/s12864-015-1647-5","article-title":"An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data","volume":"16","author":"Fan","year":"2015","journal-title":"BMC Genomics"},{"key":"2023061313503492400_bty586-B17","doi-asserted-by":"crossref","first-page":"2243","DOI":"10.1093\/bioinformatics\/btw139","article-title":"On genomic repeats and reproducibility","volume":"32","author":"Firtina","year":"2016","journal-title":"Bioinformatics"},{"key":"2023061313503492400_bty586-B18","doi-asserted-by":"crossref","first-page":"1434","DOI":"10.1126\/science.1101160","article-title":"The influence of 
CCL3L1 gene-containing segmental duplications on HIV-1\/AIDS susceptibility","volume":"307","author":"Gonzalez","year":"2005","journal-title":"Science"},{"key":"2023061313503492400_bty586-B19","doi-asserted-by":"crossref","DOI":"10.1109\/GRC.2011.6122599","article-title":"A practical comparison of edit distance approximation algorithms","author":"Hanada","year":"2011","journal-title":"Proceedingss of 2011 IEEE International Conference on Granular Computing, GrC-2011"},{"key":"2023061313503492400_bty586-B20","unstructured":"Harris \u00a0R.S. (2007) Improved pairwise alignment of genomic DNA.PhD Thesis, Pennsylvania State University, University Park, PA, USA. AAI3299002."},{"key":"2023061313503492400_bty586-B21","doi-asserted-by":"crossref","DOI":"10.1093\/oso\/9780199535033.001.0001","volume-title":"The Timetree of Life","author":"Hedges","year":"2009"},{"key":"2023061313503492400_bty586-B22","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1038\/ng.2007.48","article-title":"Psoriasis is associated with increased beta-defensin genomic copy number","volume":"40","author":"Hollox","year":"2008","journal-title":"Nat. Genet."},{"key":"2023061313503492400_bty586-B23","first-page":"66","article-title":"A fast approximate algorithm for mapping long reads to large reference databases","volume-title":"Proceedings of 21st Annual International Conference on Research in Computational Molecular Biology (RECOMB 2017)","author":"Jain","year":"2017"},{"key":"2023061313503492400_bty586-B24","doi-asserted-by":"crossref","first-page":"338","DOI":"10.1038\/nbt.4060","article-title":"Nanopore sequencing and assembly of a human genome with ultra-long reads","volume":"36","author":"Jain","year":"2018","journal-title":"Nat. 
Biotechnol."},{"key":"2023061313503492400_bty586-B25","doi-asserted-by":"crossref","first-page":"1361","DOI":"10.1038\/ng.2007.9","article-title":"Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution","volume":"39","author":"Jiang","year":"2007","journal-title":"Nat. Genet."},{"key":"2023061313503492400_bty586-B26","doi-asserted-by":"crossref","first-page":"1362","DOI":"10.1101\/gr.078477.108","article-title":"Dupmasker: a tool for annotating primate segmental duplications","volume":"18","author":"Jiang","year":"2008","journal-title":"Genome Res."},{"key":"2023061313503492400_bty586-B27","first-page":"240","volume-title":"Two Algorithms for Approxmate String Matching in Static Texts","author":"Jokinen","year":"1991"},{"key":"2023061313503492400_bty586-B28","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1016\/B978-1-4832-3211-9.50009-7","article-title":"Evolution of protein molecules","volume-title":"Mammalian Protein Metabolism, III","author":"Jukes","year":"1969"},{"key":"2023061313503492400_bty586-B29","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1007\/BF01653945","article-title":"On the stochastic model for estimation of mutational distance between homologous proteins","volume":"2","author":"Kimura","year":"1972","journal-title":"J. Mol. Evol."},{"key":"2023061313503492400_bty586-B30","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions and reversals","volume":"10","author":"Levenshtein","year":"1966","journal-title":"Sov. Phys. 
Doklady"},{"key":"2023061313503492400_bty586-B31","article-title":"KSW2: global alignment and alignment extension","author":"Li","year":"2017"},{"key":"2023061313503492400_bty586-B32","doi-asserted-by":"crossref","DOI":"10.1093\/bioinformatics\/bty191","article-title":"Minimap2: fast pairwise alignment for long dna sequences","author":"Li","year":"2018","journal-title":"Bioinformatics"},{"key":"2023061313503492400_bty586-B33","doi-asserted-by":"crossref","first-page":"2078","DOI":"10.1093\/bioinformatics\/btp352","article-title":"The sequence alignment\/map format and samtools","volume":"25","author":"Li","year":"2009","journal-title":"Bioinformatics"},{"key":"2023061313503492400_bty586-B34","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1038\/nature08696","article-title":"The sequence and de novo assembly of the giant panda genome","volume":"463","author":"Li","year":"2010","journal-title":"Nature"},{"key":"2023061313503492400_bty586-B35","doi-asserted-by":"crossref","first-page":"e1005944","DOI":"10.1371\/journal.pcbi.1005944","article-title":"Mummer4: a fast and versatile genome alignment system","volume":"14","author":"Mar\u00e7ais","year":"2018","journal-title":"PLoS Comput. 
Biol."},{"key":"2023061313503492400_bty586-B36","doi-asserted-by":"crossref","first-page":"877","DOI":"10.1038\/nature07744","article-title":"A burst of segmental duplications in the genome of the African great ape ancestor","volume":"457","author":"Marques-Bonet","year":"2009","journal-title":"Nature"},{"key":"2023061313503492400_bty586-B37","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1038\/nature09708","article-title":"Mapping copy number variation by population-scale genome sequencing","volume":"470","author":"Mills","year":"2011","journal-title":"Nature"},{"key":"2023061313503492400_bty586-B38","doi-asserted-by":"crossref","first-page":"749","DOI":"10.1101\/gr.148718.112","article-title":"The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes","volume":"23","author":"Montgomery","year":"2013","journal-title":"Genome Res."},{"key":"2023061313503492400_bty586-B39","doi-asserted-by":"crossref","first-page":"587","DOI":"10.1038\/nmeth.3865","article-title":"A hybrid approach for de novo human genome sequence assembly and phasing","volume":"13","author":"Mostovoy","year":"2016","journal-title":"Nat Methods"},{"key":"2023061313503492400_bty586-B40","article-title":"Chaining multiple-alignment fragments in sub-quadratic time","volume-title":"Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms","author":"Myers","year":"1995"},{"key":"2023061313503492400_bty586-B41","doi-asserted-by":"crossref","first-page":"471","DOI":"10.1038\/nature12228","article-title":"Great ape genetic diversity and population history","volume":"499","author":"Prado-Martinez","year":"2013","journal-title":"Nature"},{"key":"2023061313503492400_bty586-B42","doi-asserted-by":"crossref","first-page":"901","DOI":"10.1101\/gr.228718.117","article-title":"Detection and analysis of ancient segmental duplications in mammalian genomes","volume":"28","author":"Pu","year":"2018","journal-title":"Genome 
Res."},{"key":"2023061313503492400_bty586-B43","doi-asserted-by":"crossref","first-page":"36","DOI":"10.1016\/j.gde.2016.07.008","article-title":"The mutation rate in human evolution and demographic inference","volume":"41","author":"Scally","year":"2016","journal-title":"Curr. Opin. Genet. Dev."},{"key":"2023061313503492400_bty586-B44","doi-asserted-by":"crossref","first-page":"76","DOI":"10.1145\/872757.872770","article-title":"Winnowing: local algorithms for document fingerprinting","volume-title":"Proceedings of the 2003 ACM SIGMOD international conference on Management of data","author":"Schleimer","year":"2003"},{"key":"2023061313503492400_bty586-B45","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1101\/gr.809403","article-title":"Human-mouse alignments with BLASTZ","volume":"13","author":"Schwartz","year":"2003","journal-title":"Genome Res."},{"key":"2023061313503492400_bty586-B46","doi-asserted-by":"crossref","first-page":"909","DOI":"10.1038\/ng.172","article-title":"Mouse segmental duplication and copy number variation","volume":"40","author":"She","year":"2008","journal-title":"Nat. Genet."},{"key":"2023061313503492400_bty586-B47","first-page":"422","article-title":"Building and improving reference genome assemblies","volume":"105","author":"Steinberg","year":"2017","journal-title":"Proc. 
IEEE"},{"key":"2023061313503492400_bty586-B48","doi-asserted-by":"crossref","first-page":"641","DOI":"10.1126\/science.1197005","article-title":"Diversity of human copy number variation and multicopy genes","volume":"330","author":"Sudmant","year":"2010","journal-title":"Science"},{"key":"2023061313503492400_bty586-B49","doi-asserted-by":"crossref","first-page":"1373","DOI":"10.1101\/gr.158543.113","article-title":"Evolution and diversity of copy number variation in the great ape lineage","volume":"23","author":"Sudmant","year":"2013","journal-title":"Genome Res"},{"key":"2023061313503492400_bty586-B50","first-page":"42","article-title":"Gnu parallel - the command-line power tool","volume":"36","author":"Tange","year":"2011","journal-title":"Login USENIX Magazine"},{"key":"2023061313503492400_bty586-B51","doi-asserted-by":"crossref","first-page":"36","DOI":"10.1038\/nrg3117","article-title":"Repetitive DNA and next-generation sequencing: computational challenges and solutions","volume":"13","author":"Treangen","year":"2011","journal-title":"Nat. Rev. Genet."},{"key":"2023061313503492400_bty586-B52","doi-asserted-by":"crossref","first-page":"1037","DOI":"10.1086\/518257","article-title":"Gene copy-number variation and associated polymorphisms of complement component C4 in human systemic lupus erythematosus (SLE): low copy number is a risk factor for and high copy number is a protective factor against SLE susceptibility in European Americans","volume":"80","author":"Yang","year":"2007","journal-title":"Am. J. Hum. 
Genet."},{"key":"2023061313503492400_bty586-B53","doi-asserted-by":"crossref","first-page":"725","DOI":"10.1093\/bioinformatics\/btx675","article-title":"ARCS: scaffolding genome drafts with linked reads","volume":"34","author":"Yeo","year":"2018","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/17\/i706\/50582423\/bioinformatics_34_17_i706.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/17\/i706\/50582423\/bioinformatics_34_17_i706.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,8]],"date-time":"2024-07-08T00:35:05Z","timestamp":1720398905000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/34\/17\/i706\/5093240"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,9,1]]},"references-count":53,"journal-issue":{"issue":"17","published-print":{"date-parts":[[2018,9,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bty586","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2018,9,1]]},"published":{"date-parts":[[2018,9,1]]}}}