{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,13]],"date-time":"2026-03-13T08:19:59Z","timestamp":1773389999481,"version":"3.50.1"},"reference-count":33,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2020,7,13]],"date-time":"2020-07-13T00:00:00Z","timestamp":1594598400000},"content-version":"vor","delay-in-days":12,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004285","name":"St. Petersburg State University","doi-asserted-by":"publisher","award":["PURE 51555639"],"award-info":[{"award-number":["PURE 51555639"]}],"id":[{"id":"10.13039\/501100004285","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of high-order repeats (HORs) formed by chromosome-specific monomers. Given a set of all human monomers, translating a read from a centromere into the monomer alphabet is modeled as the String Decomposition Problem. 
The accurate translation of reads into the monomer alphabet turns the notoriously difficult problem of assembling centromeres from reads (in the nucleotide alphabet) into a more tractable problem of assembling centromeres from translated reads.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We describe a StringDecomposer (SD) algorithm for solving this problem, benchmark it on the set of long error-prone Oxford Nanopore reads generated by the Telomere-to-Telomere consortium and identify a novel (rare) monomer that extends the set of known X-chromosome specific monomers. Our identification of a novel monomer emphasizes the importance of identification of all (even rare) monomers for future centromere assembly efforts and evolutionary studies. To further analyze novel monomers, we applied SD to the set of recently generated long accurate Pacific Biosciences HiFi reads. This analysis revealed that the set of known human monomers and HORs remains incomplete. 
SD opens a possibility to generate a complete set of human monomers and HORs for using in the ongoing efforts to generate the complete assembly of the human genome.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>StringDecomposer is publicly available on https:\/\/github.com\/ablab\/stringdecomposer.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btaa454","type":"journal-article","created":{"date-parts":[[2020,5,4]],"date-time":"2020-05-04T08:13:01Z","timestamp":1588579981000},"page":"i93-i101","source":"Crossref","is-referenced-by-count":45,"title":["The string decomposition problem and its applications to centromere analysis and assembly"],"prefix":"10.1093","volume":"36","author":[{"given":"Tatiana","family":"Dvorkina","sequence":"first","affiliation":[{"name":"Center for Algorithmic Biotechnology, Institute of Translational Biomedicine , Saint Petersburg State University, Saint Petersburg 199034, Russia"}]},{"given":"Andrey V","family":"Bzikadze","sequence":"additional","affiliation":[{"name":"Graduate Program in Bioinformatics and Systems Biology, University of California , San Diego, CA 92093, USA"}]},{"given":"Pavel A","family":"Pevzner","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, University of California , San Diego, CA 92093, USA"}]}],"member":"286","published-online":{"date-parts":[[2020,7,13]]},"reference":[{"key":"2024061816140094100_btaa454-B1","doi-asserted-by":"crossref","first-page":"e181","DOI":"10.1371\/journal.pcbi.0030181","article-title":"Organization and evolution of primate centromeric DNA from 
whole-genome shotgun sequence data","volume":"3","author":"Alkan","year":"2007","journal-title":"PLoS Comput. Biol"},{"key":"2024061816140094100_btaa454-B2","doi-asserted-by":"crossref","DOI":"10.1038\/s41467-018-06545-y","article-title":"The dark side of centromeres: types, causes and consequences of structural abnormalities implicating centromeric DNA","volume":"9","author":"Barra","year":"2018","journal-title":"Nat. Commun"},{"key":"2024061816140094100_btaa454-B3","doi-asserted-by":"crossref","first-page":"573","DOI":"10.1093\/nar\/27.2.573","article-title":"Tandem repeats finder: a program to analyze DNA sequences","volume":"27","author":"Benson","year":"1999","journal-title":"Nucleic Acids Res"},{"key":"2024061816140094100_btaa454-B4","doi-asserted-by":"crossref","first-page":"615","DOI":"10.3390\/genes9120615","article-title":"Repetitive fragile sites: centromere satellite DNA as a source of genome instability in human diseases","volume":"9","author":"Black","year":"2018","journal-title":"Genes"},{"key":"2024061816140094100_btaa454-B5","author":"Bzikadze","year":"2019"},{"key":"2024061816140094100_btaa454-B6","volume-title":"Bioinformatics Algorithms: An Active Learning Approach","author":"Compeau","year":"2018"},{"key":"2024061816140094100_btaa454-B7","doi-asserted-by":"crossref","first-page":"755","DOI":"10.1093\/bioinformatics\/14.9.755","article-title":"Profile hidden Markov models","volume":"14","author":"Eddy","year":"1998","journal-title":"Bioinformatics"},{"key":"2024061816140094100_btaa454-B8","doi-asserted-by":"crossref","first-page":"479","DOI":"10.1007\/s10577-015-9482-8","article-title":"Satellite non-coding RNAs: the emerging players in cells, cellular pathways and cancer","volume":"23","author":"Ferreira","year":"2015","journal-title":"Chromosome Res"},{"key":"2024061816140094100_btaa454-B9","doi-asserted-by":"crossref","first-page":"111","DOI":"10.1007\/3-540-56024-6_9","volume-title":"Combinatorial Pattern 
Matching","author":"Fischetti","year":"1992"},{"key":"2024061816140094100_btaa454-B10","doi-asserted-by":"crossref","first-page":"1928","DOI":"10.1073\/pnas.1615133114","article-title":"Integrity of the human centromere DNA repeats is protected by CENP-A, CENP-C, and CENP-T","volume":"114","author":"Giunta","year":"2017","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2024061816140094100_btaa454-B11","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511574931","volume-title":"Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology","author":"Gusfield","year":"1997"},{"key":"2024061816140094100_btaa454-B12","doi-asserted-by":"crossref","first-page":"4809","DOI":"10.1093\/bioinformatics\/btz484","article-title":"Noise-cancelling repeat finder: uncovering tandem repeats in error-prone long-read sequencing data","volume":"35","author":"Harris","year":"2019","journal-title":"Bioinformatics"},{"key":"2024061816140094100_btaa454-B13","doi-asserted-by":"crossref","first-page":"763","DOI":"10.1128\/MCB.01198-12","article-title":"Sequences associated with centromere competency in the human genome","volume":"33","author":"Hayden","year":"2013","journal-title":"Mol. Cell. Biol"},{"key":"2024061816140094100_btaa454-B14","doi-asserted-by":"crossref","first-page":"e1400234","DOI":"10.1126\/sciadv.1400234","article-title":"A unique chromatin complex occupies young \u03b1-satellite arrays of human centromeres","volume":"1","author":"Henikoff","year":"2015","journal-title":"Sci. Adv"},{"key":"2024061816140094100_btaa454-B15","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1038\/nbt.4109","article-title":"Linear assembly of a human centromere on the Y chromosome","volume":"36","author":"Jain","year":"2018","journal-title":"Nat. 
Biotechnol"},{"key":"2024061816140094100_btaa454-B16","doi-asserted-by":"crossref","first-page":"619","DOI":"10.1016\/S0888-7543(03)00182-4","article-title":"Interspersed repeats are found predominantly in the \u201cold\u201d \u03b1 satellite families","volume":"82","author":"Kazakov","year":"2003","journal-title":"Genomics"},{"key":"2024061816140094100_btaa454-B17","doi-asserted-by":"crossref","first-page":"540","DOI":"10.1038\/s41587-019-0072-8","article-title":"Assembly of long, error-prone reads using repeat graphs","volume":"37","author":"Kolmogorov","year":"2019","journal-title":"Nat. Biotechnol"},{"key":"2024061816140094100_btaa454-B18","doi-asserted-by":"crossref","first-page":"722","DOI":"10.1101\/gr.215087.116","article-title":"Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation","volume":"27","author":"Koren","year":"2017","journal-title":"Genome Res"},{"key":"2024061816140094100_btaa454-B19","doi-asserted-by":"crossref","first-page":"3094","DOI":"10.1093\/bioinformatics\/bty191","article-title":"Minimap2: pairwise alignment for nucleotide sequences","volume":"34","author":"Li","year":"2018","journal-title":"Bioinformatics"},{"key":"2024061816140094100_btaa454-B20","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1016\/j.gde.2018.03.003","article-title":"Satellite DNA evolution: old ideas, new approaches","volume":"49","author":"Lower","year":"2018","journal-title":"Curr. Opin. Genet. Dev"},{"key":"2024061816140094100_btaa454-B21","doi-asserted-by":"crossref","first-page":"1211","DOI":"10.1089\/cmb.2011.0101","article-title":"An algorithm to solve the motif alignment problem for approximate nested tandem repeats in biological sequences","volume":"18","author":"Matroud","year":"2011","journal-title":"J. Comput. 
Biol"},{"key":"2024061816140094100_btaa454-B22","doi-asserted-by":"crossref","first-page":"e17","DOI":"10.1093\/nar\/gkr1070","article-title":"NTRFinder: a software tool to find nested tandem repeats","volume":"40","author":"Matroud","year":"2012","journal-title":"Nucleic Acids Res"},{"key":"2024061816140094100_btaa454-B23","author":"Miga","year":"2019"},{"key":"2024061816140094100_btaa454-B24","doi-asserted-by":"crossref","DOI":"10.1093\/bioinformatics\/btaa440","article-title":"TandemTools: mapping long reads and assessing\/improving assembly quality in extra-long tandem repeats","author":"Mikheenko","year":"2020","journal-title":"Bioinformatics"},{"key":"2024061816140094100_btaa454-B25","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1038\/s41592-019-0669-3","article-title":"Fast and accurate long-read assembly with wtdbg2","volume":"17","author":"Ruan","year":"2020","journal-title":"Nat. Methods"},{"key":"2024061816140094100_btaa454-B26","doi-asserted-by":"crossref","first-page":"1921","DOI":"10.1093\/bioinformatics\/btw101","article-title":"Alpha-CENTAURI: assessing novel centromeric repeat sequence variation with long read sequencing","volume":"32","author":"Sevim","year":"2016","journal-title":"Bioinformatics"},{"key":"2024061816140094100_btaa454-B27","author":"Shafin","year":"2019"},{"key":"2024061816140094100_btaa454-B28","doi-asserted-by":"crossref","first-page":"e1000641","DOI":"10.1371\/journal.pgen.1000641","article-title":"The evolutionary origin of man can be traced in the layers of defunct ancestral alpha satellites flanking the active centromeres of human chromosomes","volume":"5","author":"Shepelev","year":"2009","journal-title":"PLoS Genet"},{"key":"2024061816140094100_btaa454-B29","doi-asserted-by":"crossref","DOI":"10.3389\/fgene.2018.00674","article-title":"Centromere and pericentromere transcription: roles and regulation\u2026 in sickness and in health","volume":"9","author":"Smurova","year":"2018","journal-title":"Front. 
Genet"},{"key":"2024061816140094100_btaa454-B30","author":"Suzuki","year":"2019"},{"key":"2024061816140094100_btaa454-B31","doi-asserted-by":"crossref","first-page":"103708","DOI":"10.1016\/j.dib.2019.103708","article-title":"Classification and monomer-by-monomer annotation dataset of suprachromosomal family 1 alpha satellite higher-order repeats in hg38 human genome assembly","volume":"24","author":"Uralsky","year":"2019","journal-title":"Data Brief"},{"key":"2024061816140094100_btaa454-B32","doi-asserted-by":"crossref","first-page":"2731","DOI":"10.1093\/nar\/13.8.2731","article-title":"Chromosome-specific alpha satellite DNA: nucleotide sequence analysis of the 2.0 kilobasepair repeat from the human X chromosome","volume":"13","author":"Waye","year":"1985","journal-title":"Nucleic Acids Res"},{"key":"2024061816140094100_btaa454-B33","doi-asserted-by":"crossref","first-page":"842","DOI":"10.1016\/j.molcel.2018.04.023","article-title":"Heterochromatin-encoded satellite RNAs induce breast cancer","volume":"70","author":"Zhu","year":"2018","journal-title":"Mol. 
Cell"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/Supplement_1\/i93\/58271674\/bioinformatics_36_supplement1_i93.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/Supplement_1\/i93\/58271674\/bioinformatics_36_supplement1_i93.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,18]],"date-time":"2024-06-18T13:44:02Z","timestamp":1718718242000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/36\/Supplement_1\/i93\/5870498"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,1]]},"references-count":33,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2020,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btaa454","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2019.12.26.888685","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,7]]},"published":{"date-parts":[[2020,7,1]]}}}