{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,20]],"date-time":"2026-03-20T11:53:07Z","timestamp":1774007587580,"version":"3.50.1"},"reference-count":21,"publisher":"Oxford University Press (OUP)","issue":"3","license":[{"start":{"date-parts":[[2026,3,14]],"date-time":"2026-03-14T00:00:00Z","timestamp":1773446400000},"content-version":"vor","delay-in-days":14,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"US National Institute of Health","award":["R01HG010040"],"award-info":[{"award-number":["R01HG010040"]}]},{"name":"US National Institute of Health","award":["R01HG014175"],"award-info":[{"award-number":["R01HG014175"]}]},{"name":"US National Institute of Health","award":["U01HG013748"],"award-info":[{"award-number":["U01HG013748"]}]},{"name":"US National Institute of Health","award":["U41HG010972"],"award-info":[{"award-number":["U41HG010972"]}]},{"name":"US National Institute of Health","award":["U24CA294203"],"award-info":[{"award-number":["U24CA294203"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Low-complexity (LC) DNA sequences are compositionally repetitive sequences that are often associated with spurious homologous matches and variant calling artifacts. 
While algorithms for identifying LC sequences exist, they either lack concise mathematical definition of complexity or are inefficient with long or variable context windows.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>Longdust is a new algorithm that efficiently identifies long LC sequences including centromeric satellite and tandem repeats with moderately long motifs. It defines string complexity by statistically modeling the k-mer count distribution with the parameters: the k-mer length, the context window size and a threshold on complexity. Longdust exhibits high performance on real data and high consistency with existing methods.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>https:\/\/github.com\/lh3\/longdust<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btag112","type":"journal-article","created":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T12:48:40Z","timestamp":1772887720000},"source":"Crossref","is-referenced-by-count":0,"title":["Finding low-complexity DNA sequences with longdust"],"prefix":"10.1093","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-4874-2874","authenticated-orcid":false,"given":"Heng","family":"Li","sequence":"first","affiliation":[{"name":"Department of Biomedical Informatics, Harvard Medical School , Boston, MA 02215,","place":["United States"]},{"name":"Department of Data Science, Dana-Farber Cancer Institute , Boston, MA 02215,","place":["United States"]},{"name":"Broad Institute of MIT and Harvard , Cambridge, MA 02142,","place":["United States"]}]},{"given":"Brian","family":"Li","sequence":"additional","affiliation":[{"name":"Commonwealth School , Boston, MA 02116,","place":["United 
States"]}]}],"member":"286","published-online":{"date-parts":[[2026,3,13]]},"reference":[{"key":"2026032005262843100_btag112-B1","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res"},{"key":"2026032005262843100_btag112-B2","doi-asserted-by":"crossref","first-page":"573","DOI":"10.1093\/nar\/27.2.573","article-title":"Tandem repeats finder: a program to analyze DNA sequences","volume":"27","author":"Benson","year":"1999","journal-title":"Nucleic Acids Res"},{"key":"2026032005262843100_btag112-B3","doi-asserted-by":"crossref","first-page":"275","DOI":"10.1016\/S0097-8485(99)00009-1","article-title":"Zones of low entropy in genomic sequences","volume":"23","author":"Crochemore","year":"1999","journal-title":"Comput Chem"},{"key":"2026032005262843100_btag112-B4","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1186\/s12859-025-06168-3","article-title":"Pytrf: a python package for finding tandem repeats from genomic sequences","volume":"26","author":"Du","year":"2025","journal-title":"BMC Bioinformatics"},{"key":"2026032005262843100_btag112-B5","doi-asserted-by":"crossref","DOI":"10.1126\/science.adw1931","article-title":"Multispecies pangenomes reveal a pervasive influence of population size on structural variation","volume":"390","author":"Edwards","year":"2025","journal-title":"Science"},{"key":"2026032005262843100_btag112-B6","doi-asserted-by":"crossref","first-page":"e23","DOI":"10.1093\/nar\/gkq1212","article-title":"A new repeat-masking method enables specific detection of homologous sequences","volume":"39","author":"Frith","year":"2011","journal-title":"Nucleic Acids Res"},{"key":"2026032005262843100_btag112-B7","doi-asserted-by":"crossref","first-page":"387","DOI":"10.1016\/S0304-3975(98)00075-9","article-title":"On tables of random numbers 
(reprinted from \"Sankhya: The Indian Journal of Statistics\", series a, vol. 25 part 4, 1963)","volume":"207","author":"Kolmogorov","year":"1998","journal-title":"Theor. Comput. Sci"},{"key":"2026032005262843100_btag112-B8","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1109\/TIT.1976.1055501","article-title":"On the complexity of finite sequences","volume":"22","author":"Lempel","year":"1976","journal-title":"IEEE Trans Inform Theory"},{"key":"2026032005262843100_btag112-B9","doi-asserted-by":"crossref","first-page":"2843","DOI":"10.1093\/bioinformatics\/btu356","article-title":"Toward better understanding of artifacts in variant calling from high-coverage samples","volume":"30","author":"Li","year":"2014","journal-title":"Bioinformatics"},{"key":"2026032005262843100_btag112-B10","doi-asserted-by":"crossref","first-page":"btae717","DOI":"10.1093\/bioinformatics\/btae717","article-title":"BWT construction and search at the terabase scale","volume":"40","author":"Li","year":"2024","journal-title":"Bioinformatics"},{"key":"2026032005262843100_btag112-B11","doi-asserted-by":"crossref","first-page":"giaf103","DOI":"10.1093\/gigascience\/giaf103","article-title":"Finding easy regions for short-read variant calling from pangenome data","volume":"14","author":"Li","year":"2025","journal-title":"Gigascience"},{"key":"2026032005262843100_btag112-B12","doi-asserted-by":"crossref","first-page":"1028","DOI":"10.1089\/cmb.2006.13.1028","article-title":"A fast and symmetric DUST implementation to mask low-complexity DNA sequences","volume":"13","author":"Morgulis","year":"2006","journal-title":"J Comput Biol"},{"key":"2026032005262843100_btag112-B13","doi-asserted-by":"crossref","first-page":"44","DOI":"10.1126\/science.abj6987","article-title":"The complete sequence of a human 
genome","volume":"376","author":"Nurk","year":"2022","journal-title":"Science"},{"key":"2026032005262843100_btag112-B14","doi-asserted-by":"crossref","first-page":"vbae149","DOI":"10.1093\/bioadv\/vbae149","article-title":"ULTRA\u2013effective labeling of tandem repeats in genomic sequence","volume":"4","author":"Olson","year":"2024","journal-title":"Bioinform Adv"},{"key":"2026032005262843100_btag112-B15","doi-asserted-by":"crossref","first-page":"W628","DOI":"10.1093\/nar\/gkh466","article-title":"Complexity: an internet resource for analysis of DNA sequence complexity","volume":"32","author":"Orlov","year":"2004","journal-title":"Nucleic Acids Res"},{"key":"2026032005262843100_btag112-B16","doi-asserted-by":"crossref","first-page":"giaf154","DOI":"10.1093\/gigascience\/giaf154","article-title":"Challenges in structural variant calling in low-complexity regions","volume":"14","author":"Qin","year":"2025","journal-title":"Gigascience"},{"key":"2026032005262843100_btag112-B17","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1002\/j.1538-7305.1948.tb01338.x","article-title":"A mathematical theory of communication","volume":"27","author":"Shannon","year":"1948","journal-title":"Bell Syst Tech J"},{"key":"2026032005262843100_btag112-B18","doi-asserted-by":"crossref","first-page":"giad101","DOI":"10.1093\/gigascience\/giad101","article-title":"Alcor: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data","volume":"12","author":"Silva","year":"2022","journal-title":"Gigascience"},{"key":"2026032005262843100_btag112-B19","doi-asserted-by":"crossref","first-page":"514","DOI":"10.1239\/jap\/1183667418","article-title":"Markov additive processes and repeats in sequences","volume":"44","author":"Spouge","year":"2007","journal-title":"J Appl Probab"},{"key":"2026032005262843100_btag112-B20","doi-asserted-by":"crossref","first-page":"679","DOI":"10.1093\/bioinformatics\/18.5.679","article-title":"Sequence complexity 
profiles of prokaryotic genomic sequences: a fast algorithm for calculating linguistic complexity","volume":"18","author":"Troyanskaya","year":"2002","journal-title":"Bioinformatics"},{"key":"2026032005262843100_btag112-B21","doi-asserted-by":"crossref","first-page":"401","DOI":"10.1038\/s41586-025-08816-3","article-title":"Complete sequencing of ape genomes","volume":"641","author":"Yoo","year":"2025","journal-title":"Nature"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btag112\/67346273\/btag112.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/3\/btag112\/67346273\/btag112.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/42\/3\/btag112\/67346273\/btag112.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,20]],"date-time":"2026-03-20T09:26:35Z","timestamp":1773998795000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/doi\/10.1093\/bioinformatics\/btag112\/8519623"}},"subtitle":[],"editor":[{"given":"Dr Can","family":"Alkan","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2026,2,28]]},"references-count":21,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,2,28]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btag112","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2026,3]]},"published":{"date-parts":[[2026,2,28]]},"article-number":"btag112"}}