{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,26]],"date-time":"2026-02-26T20:34:09Z","timestamp":1772138049238,"version":"3.50.1"},"reference-count":25,"publisher":"Oxford University Press (OUP)","issue":"3","license":[{"start":{"date-parts":[[2021,10,20]],"date-time":"2021-10-20T00:00:00Z","timestamp":1634688000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"Extreme Science and Engineering Discovery Environment (XSEDE) Bridges system at the Pittsburgh Supercomputing","award":["BIO180028"],"award-info":[{"award-number":["BIO180028"]}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01GM138634-01"],"award-info":[{"award-number":["R01GM138634-01"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,1,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Motivation<\/jats:title>\n                    <jats:p>Clustering is a fundamental task in the analysis of nucleotide sequences. Despite the exponential increase in the size of sequence databases of homologous genes, few methods exist to cluster divergent sequences. Traditional clustering methods have mostly focused on optimizing high speed clustering of highly similar sequences. We develop a phylogenetic clustering method which infers ancestral sequences for a set of initial clusters and then uses a greedy algorithm to cluster sequences.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>We describe a clustering program AncestralClust, which is developed for clustering divergent sequences. We compare this method with other state-of-the-art clustering methods using datasets of homologous sequences from different species. We show that, in divergent datasets, AncestralClust has higher accuracy and more even cluster sizes than current popular methods.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Availability and implementation<\/jats:title>\n                    <jats:p>AncestralClust is an Open Source program available at https:\/\/github.com\/lpipes\/ancestralclust.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Supplementary information<\/jats:title>\n                    <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btab723","type":"journal-article","created":{"date-parts":[[2021,10,16]],"date-time":"2021-10-16T12:06:15Z","timestamp":1634385975000},"page":"663-670","source":"Crossref","is-referenced-by-count":7,"title":["AncestralClust: clustering of divergent nucleotide sequences by ancestral sequence reconstruction using phylogenetic trees"],"prefix":"10.1093","volume":"38","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0056-8045","authenticated-orcid":false,"given":"Lenore","family":"Pipes","sequence":"first","affiliation":[{"name":"Department of Integrative Biology, University of California-Berkeley , Berkeley, CA 94707, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0513-6591","authenticated-orcid":false,"given":"Rasmus","family":"Nielsen","sequence":"additional","affiliation":[{"name":"Department of Integrative Biology, University of California-Berkeley , Berkeley, CA 94707, USA"},{"name":"Department of Statistics, University of California-Berkeley , Berkeley, CA 94707, USA"},{"name":"Globe Institute, University of Copenhagen , 1350 K\u00f8benhavn K, Copenhagen, Denmark"}]}],"member":"286","published-online":{"date-parts":[[2021,10,20]]},"reference":[{"key":"2023020108490522300_btab723-B1","doi-asserted-by":"crossref","first-page":"e0221068","DOI":"10.1371\/journal.pone.0221068","article-title":"Treecluster: clustering biological sequences using phylogenetic trees","volume":"14","author":"Balaban","year":"2019","journal-title":"PLoS One"},{"key":"2023020108490522300_btab723-B2","doi-asserted-by":"crossref","first-page":"2891","DOI":"10.1093\/bioinformatics\/bts552","article-title":"Comparing clustering and pre-processing in taxonomy analysis","volume":"28","author":"Bonder","year":"2012","journal-title":"Bioinformatics"},{"key":"2023020108490522300_btab723-B3","first-page":"1","article-title":"Comparative analysis of sequence clustering methods for deduplication of biological databases","volume":"9","author":"Chen","year":"2018","journal-title":"J. Data Inf. Qual"},{"key":"2023020108490522300_btab723-B4","doi-asserted-by":"crossref","first-page":"1469","DOI":"10.1111\/2041-210X.13214","article-title":"Anacapa toolkit: an environmental DNA toolkit for processing multilocus metabarcode datasets","volume":"10","author":"Curd","year":"2019","journal-title":"Methods Ecol. Evol"},{"key":"2023020108490522300_btab723-B5","doi-asserted-by":"crossref","first-page":"2460","DOI":"10.1093\/bioinformatics\/btq461","article-title":"Search and clustering orders of magnitude faster than blast","volume":"26","author":"Edgar","year":"2010","journal-title":"Bioinformatics"},{"key":"2023020108490522300_btab723-B6","doi-asserted-by":"crossref","first-page":"645","DOI":"10.1086\/282802","article-title":"Estimating phylogenetic trees from distance matrices","volume":"106","author":"Farris","year":"1972","journal-title":"Am. Nat"},{"key":"2023020108490522300_btab723-B7","doi-asserted-by":"crossref","first-page":"3150","DOI":"10.1093\/bioinformatics\/bts565","article-title":"Cd-hit: accelerated for clustering the next-generation sequencing data","volume":"28","author":"Fu","year":"2012","journal-title":"Bioinformatics"},{"key":"2023020108490522300_btab723-B8","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/1471-2105-12-271","article-title":"Dnaclust: accurate and efficient clustering of phylogenetic marker genes","volume":"12","author":"Ghodsi","year":"2011","journal-title":"BMC Bioinformatics"},{"key":"2023020108490522300_btab723-B9","doi-asserted-by":"crossref","first-page":"680","DOI":"10.1093\/bioinformatics\/btq003","article-title":"CD-HIT Suite: a web server for clustering and comparing biological sequences","volume":"26","author":"Huang","year":"2010","journal-title":"Bioinformatics"},{"key":"2023020108490522300_btab723-B10","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1016\/B978-1-4832-3211-9.50009-7","article-title":"Evolution of protein molecules","volume":"3","author":"Jukes","year":"1969","journal-title":"Mammalian Protein Metab"},{"key":"2023020108490522300_btab723-B11","author":"Lassmann","year":"2020"},{"key":"2023020108490522300_btab723-B12","doi-asserted-by":"crossref","first-page":"282","DOI":"10.1093\/bioinformatics\/17.3.282","article-title":"Clustering of highly homologous sequences to reduce the size of large protein databases","volume":"17","author":"Li","year":"2001","journal-title":"Bioinformatics"},{"key":"2023020108490522300_btab723-B14","first-page":"1","article-title":"Fast gap-affine pairwise alignment using the wavefront algorithm","author":"Marco-Sola","year":"2020","journal-title":"Bioinformatics"},{"key":"2023020108490522300_btab723-B15","doi-asserted-by":"crossref","first-page":"103439","DOI":"10.1016\/j.compbiomed.2019.103439","article-title":"Spclust: towards a fast and reliable clustering for potentially divergent biological sequences","volume":"114","author":"Matar","year":"2019","journal-title":"Comput. Biol. Med"},{"key":"2023020108490522300_btab723-B16","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","article-title":"A general method applicable to the search for similarities in the amino acid sequence of two proteins","volume":"48","author":"Needleman","year":"1970","journal-title":"J. Mol. Biol"},{"key":"2023020108490522300_btab723-B17","doi-asserted-by":"crossref","first-page":"e59","DOI":"10.1371\/journal.pone.0000059","article-title":"Taxonomic reliability of DNA sequences in public sequence databases: a fungal perspective","volume":"1","author":"Nilsson","year":"2006","journal-title":"PLoS One"},{"key":"2023020108490522300_btab723-B18","doi-asserted-by":"crossref","first-page":"355","DOI":"10.1111\/j.1471-8286.2007.01678.x","article-title":"Bold: the barcode of life data system (http:\/\/www. barcodinglife. org)","volume":"7","author":"Ratnasingham","year":"2007","journal-title":"Mol. Ecol. Notes"},{"key":"2023020108490522300_btab723-B19","doi-asserted-by":"crossref","first-page":"e77","DOI":"10.1371\/journal.pbio.0050077","article-title":"The sorcerer ii global ocean sampling expedition: northwest Atlantic through eastern tropical pacific","volume":"5","author":"Rusch","year":"2007","journal-title":"PLoS Biol"},{"key":"2023020108490522300_btab723-B20","first-page":"406","article-title":"The neighbor-joining method: a new method for reconstructing phylogenetic trees","volume":"4","author":"Saitou","year":"1987","journal-title":"Mol. Biol. Evol"},{"key":"2023020108490522300_btab723-B21","doi-asserted-by":"crossref","DOI":"10.1093\/database\/baaa062","article-title":"NCBI taxonomy: a comprehensive update on curation, resources and tools","volume":"2020","author":"Schoch","year":"2020","journal-title":"Database"},{"key":"2023020108490522300_btab723-B5437901","volume-title":"Introduction to Information Retrieval","author":"Sch\u00fctze","year":"2008"},{"key":"2023020108490522300_btab723-B22","doi-asserted-by":"crossref","first-page":"1312","DOI":"10.1093\/bioinformatics\/btu033","article-title":"Raxml version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies","volume":"30","author":"Stamatakis","year":"2014","journal-title":"Bioinformatics"},{"key":"2023020108490522300_btab723-B23","doi-asserted-by":"crossref","DOI":"10.1093\/acprof:oso\/9780199602605.001.0001","volume-title":"Molecular Evolution: A Statistical Approach","author":"Yang","year":"2014"},{"key":"2023020108490522300_btab723-B24","doi-asserted-by":"crossref","first-page":"380","DOI":"10.1093\/bioinformatics\/bty617","article-title":"A parallel computational framework for ultra-large-scale sequence clustering analysis","volume":"35","author":"Zheng","year":"2019","journal-title":"Bioinformatics"},{"key":"2023020108490522300_btab723-B25","first-page":"1","article-title":"Sequence clustering in bioinformatics: an empirical study","volume":"21","author":"Zou","year":"2018","journal-title":"Brief. Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btab723\/41104132\/btab723.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/3\/663\/49008375\/btab723.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/3\/663\/49008375\/btab723.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,1]],"date-time":"2023-02-01T15:07:50Z","timestamp":1675264070000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/3\/663\/6404580"}},"subtitle":[],"editor":[{"given":"Inanc","family":"Birol","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2021,10,20]]},"references-count":25,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2022,1,12]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btab723","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2021.01.08.426008","asserted-by":"object"}]},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,2,1]]},"published":{"date-parts":[[2021,10,20]]}}}