{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,6]],"date-time":"2026-06-06T16:57:43Z","timestamp":1780765063615,"version":"3.54.1"},"reference-count":33,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Algorithms Mol Biol"],"published-print":{"date-parts":[[2010,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to <jats:italic>N<\/jats:italic>\n              <jats:sup>2<\/jats:sup> for <jats:italic>N<\/jats:italic> sequences. When <jats:italic>N<\/jats:italic> grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusions<\/jats:title>\n            <jats:p>We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"http:\/\/www.clustal.org\/mbed.tgz\" ext-link-type=\"uri\">http:\/\/www.clustal.org\/mbed.tgz<\/jats:ext-link>.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1748-7188-5-21","type":"journal-article","created":{"date-parts":[[2010,5,14]],"date-time":"2010-05-14T18:15:21Z","timestamp":1273860921000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":102,"title":["Sequence embedding for fast construction of guide trees for multiple sequence alignment"],"prefix":"10.1186","volume":"5","author":[{"given":"Gordon","family":"Blackshields","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Fabian","family":"Sievers","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Weifeng","family":"Shi","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Andreas","family":"Wilm","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Desmond G","family":"Higgins","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2010,5,14]]},"reference":[{"issue":"2","key":"94_CR1","doi-asserted-by":"publisher","first-page":"175","DOI":"10.1007\/BF02257378","volume":"20","author":"P Hogeweg","year":"1984","unstructured":"Hogeweg P, Hesper B: The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J Mol Evol. 1984, 20 (2): 175-86. 10.1007\/BF02257378","journal-title":"J Mol Evol"},{"issue":"2","key":"94_CR2","first-page":"81","volume":"3","author":"WR Taylor","year":"1987","unstructured":"Taylor WR: Multiple sequence alignment by a pairwise algorithm. Comput Appl Biosci. 1987, 3 (2): 81-7.","journal-title":"Comput Appl Biosci"},{"issue":"4","key":"94_CR3","doi-asserted-by":"publisher","first-page":"351","DOI":"10.1007\/BF02603120","volume":"25","author":"DF Feng","year":"1987","unstructured":"Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987, 25 (4): 351-60. 10.1007\/BF02603120","journal-title":"J Mol Evol"},{"key":"94_CR4","doi-asserted-by":"publisher","first-page":"205","DOI":"10.1006\/jmbi.2000.4042","volume":"302","author":"C Notredame","year":"2000","unstructured":"Notredame C, Higgins DG, Heringa J: T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000, 302: 205-217. 10.1006\/jmbi.2000.4042","journal-title":"J Mol Biol"},{"issue":"22","key":"94_CR5","doi-asserted-by":"publisher","first-page":"4673","DOI":"10.1093\/nar\/22.22.4673","volume":"22","author":"JD Thompson","year":"1994","unstructured":"Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-80. 10.1093\/nar\/22.22.4673","journal-title":"Nucleic Acids Res"},{"issue":"5","key":"94_CR6","doi-asserted-by":"publisher","first-page":"1792","DOI":"10.1093\/nar\/gkh340","volume":"32","author":"RC Edgar","year":"2004","unstructured":"Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-7. 10.1093\/nar\/gkh340","journal-title":"Nucleic Acids Res"},{"issue":"14","key":"94_CR7","doi-asserted-by":"publisher","first-page":"3059","DOI":"10.1093\/nar\/gkf436","volume":"30","author":"K Katoh","year":"2002","unstructured":"Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002, 30 (14): 3059-66. 10.1093\/nar\/gkf436","journal-title":"Nucleic Acids Res"},{"issue":"2","key":"94_CR8","doi-asserted-by":"publisher","first-page":"330","DOI":"10.1101\/gr.2821705","volume":"15","author":"CB Do","year":"2005","unstructured":"Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005, 15 (2): 330-40. 10.1101\/gr.2821705","journal-title":"Genome Res"},{"issue":"7","key":"94_CR9","doi-asserted-by":"publisher","first-page":"802","DOI":"10.1093\/bioinformatics\/btm017","volume":"23","author":"J Pei","year":"2007","unstructured":"Pei J, Grishin NV: PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics. 2007, 23 (7): 802-8. 10.1093\/bioinformatics\/btm017","journal-title":"Bioinformatics"},{"issue":"suppl_1","key":"94_CR10","doi-asserted-by":"publisher","first-page":"D141","DOI":"10.1093\/nar\/gkn879","volume":"37","author":"JR Cole","year":"2009","unstructured":"Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucl Acids Res. 2009, 37 (suppl_1): D141-145. 10.1093\/nar\/gkn879","journal-title":"Nucl Acids Res"},{"issue":"4","key":"94_CR11","first-page":"406","volume":"4","author":"N Saitou","year":"1987","unstructured":"Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4 (4): 406-25.","journal-title":"Mol Biol Evol"},{"key":"94_CR12","volume-title":"Numerical Taxonomy","author":"P Sneath","year":"1973","unstructured":"Sneath P, Sokal R: Numerical Taxonomy. 1973, San Francisco, CA: WH Freeman"},{"issue":"3","key":"94_CR13","doi-asserted-by":"publisher","first-page":"372","DOI":"10.1093\/bioinformatics\/btl592","volume":"23","author":"K Katoh","year":"2007","unstructured":"Katoh K, Toh H: PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics. 2007, 23 (3): 372-4. 10.1093\/bioinformatics\/btl592","journal-title":"Bioinformatics"},{"issue":"4","key":"94_CR14","doi-asserted-by":"publisher","first-page":"286","DOI":"10.1093\/bib\/bbn013","volume":"9","author":"K Katoh","year":"2008","unstructured":"Katoh K, Toh H: Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 2008, 9 (4): 286-98. 10.1093\/bib\/bbn013","journal-title":"Brief Bioinform"},{"key":"94_CR15","doi-asserted-by":"crossref","unstructured":"Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res. 2006, D247-51. 34 Database","DOI":"10.1093\/nar\/gkj149"},{"key":"94_CR16","doi-asserted-by":"publisher","first-page":"215","DOI":"10.1007\/BF01200757","volume":"15","author":"N Linial","year":"1995","unstructured":"Linial N, London E, Rabinovich Y: The Geometry of Graphs and Some of Its Algorithmic Applications. Combinatorica. 1995, 15: 215-245. 10.1007\/BF01200757","journal-title":"Combinatorica"},{"key":"94_CR17","volume-title":"Tech rep","author":"G Hristescu","year":"1999","unstructured":"Hristescu G, Farach-Colton M: Cluster-Preserving Embedding of Proteins. Tech rep. 1999, [Technical report], Rutgers University, New Jersey"},{"issue":"3","key":"94_CR18","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","volume":"48","author":"SB Needleman","year":"1970","unstructured":"Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48 (3): 443-53. 10.1016\/0022-2836(70)90057-4","journal-title":"J Mol Biol"},{"issue":"2","key":"94_CR19","doi-asserted-by":"publisher","first-page":"539","DOI":"10.1006\/jmbi.1997.0948","volume":"268","author":"M Linial","year":"1997","unstructured":"Linial M, Linial N, Tishby N, Yona G: Global self-organization of all known protein sequences reveals inherent biological signatures. J Mol Biol. 1997, 268 (2): 539-56. 10.1006\/jmbi.1997.0948","journal-title":"J Mol Biol"},{"issue":"4","key":"94_CR20","doi-asserted-by":"publisher","first-page":"282","DOI":"10.1016\/j.compbiolchem.2008.03.005","volume":"32","author":"G Blackshields","year":"2008","unstructured":"Blackshields G, Larkin M, Wallace IM, Wilm A, Higgins DG: Fast embedding methods for clustering tens of thousands of sequences. Computational Biology and Chemistry. 2008, 32 (4): 282-286. 10.1016\/j.compbiolchem.2008.03.005","journal-title":"Computational Biology and Chemistry"},{"issue":"11","key":"94_CR21","doi-asserted-by":"publisher","first-page":"2469","DOI":"10.1002\/pro.5560071126","volume":"7","author":"K Mizuguchi","year":"1998","unstructured":"Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 1998, 7 (11): 2469-71. 10.1002\/pro.5560071126","journal-title":"Protein Sci"},{"key":"94_CR22","doi-asserted-by":"publisher","first-page":"127","DOI":"10.1002\/prot.20527","volume":"61","author":"JD Thompson","year":"2005","unstructured":"Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins. 2005, 61: 127-36. 10.1002\/prot.20527","journal-title":"Proteins"},{"key":"94_CR23","doi-asserted-by":"publisher","first-page":"325","DOI":"10.1093\/biomet\/53.3-4.325","volume":"53","author":"JC Gower","year":"1966","unstructured":"Gower JC: Some Distance Properties of Latent Root and Vector Methods Used in Multivariate Analysis. Biometrika. 1966, 53: 325-","journal-title":"Biometrika"},{"key":"94_CR24","first-page":"707","volume":"10","author":"V Levenshtein","year":"1966","unstructured":"Levenshtein V: Binary Codes Capable of Correcting Deletions, Inserstions and Reversals. Soviet Physics Doklady. 1966, 10: 707-710.","journal-title":"Soviet Physics Doklady"},{"issue":"3","key":"94_CR25","doi-asserted-by":"publisher","first-page":"726","DOI":"10.1073\/pnas.80.3.726","volume":"80","author":"WJ Wilbur","year":"1983","unstructured":"Wilbur WJ, Lipman DJ: Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci USA. 1983, 80 (3): 726-30. 10.1073\/pnas.80.3.726","journal-title":"Proc Natl Acad Sci USA"},{"key":"94_CR26","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/BF02289565","volume":"29","author":"J Kruskal","year":"1964","unstructured":"Kruskal J: Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis. Psychometrika. 1964, 29: 1-27. 10.1007\/BF02289565","journal-title":"Psychometrika"},{"issue":"1-2","key":"94_CR27","doi-asserted-by":"publisher","first-page":"131","DOI":"10.1016\/0025-5564(81)90043-2","volume":"53","author":"D Robinson","year":"1981","unstructured":"Robinson D, Foulds L: Comparison of phylogenetic trees. Mathematical Biosciences. 1981, 53 (1-2): 131-147. 10.1016\/0025-5564(81)90043-2","journal-title":"Mathematical Biosciences"},{"key":"94_CR28","first-page":"164","volume":"5","author":"J Felsenstein","year":"1989","unstructured":"Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989, 5: 164-166.","journal-title":"Cladistics"},{"key":"94_CR29","first-page":"87","volume":"15","author":"JD Thompson","year":"1999","unstructured":"Thompson JD, Plewniak F, Poch O: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Nucleic Acids Res. 1999, 15: 87-8.","journal-title":"Nucleic Acids Res"},{"issue":"3","key":"94_CR30","doi-asserted-by":"publisher","first-page":"405","DOI":"10.1002\/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L","volume":"28","author":"EL Sonnhammer","year":"1997","unstructured":"Sonnhammer EL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997, 28 (3): 405-20. 10.1002\/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L","journal-title":"Proteins"},{"key":"94_CR31","doi-asserted-by":"crossref","unstructured":"Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 2005, D121-4. 33 Database","DOI":"10.1093\/nar\/gki081"},{"issue":"13","key":"94_CR32","doi-asserted-by":"publisher","first-page":"i41","DOI":"10.1093\/bioinformatics\/btn174","volume":"24","author":"Y Loewenstein","year":"2008","unstructured":"Loewenstein Y, Portugaly E, Fromer M, Linial M: Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics. 2008, 24 (13): i41-i49. 10.1093\/bioinformatics\/btn174","journal-title":"Bioinformatics"},{"issue":"21","key":"94_CR33","doi-asserted-by":"publisher","first-page":"2947","DOI":"10.1093\/bioinformatics\/btm404","volume":"23","author":"MA Larkin","year":"2007","unstructured":"Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics (Oxford, England). 2007, 23 (21): 2947-2948. 10.1093\/bioinformatics\/btm404","journal-title":"Bioinformatics (Oxford, England)"}],"container-title":["Algorithms for Molecular Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1748-7188-5-21.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T04:39:59Z","timestamp":1630471199000},"score":1,"resource":{"primary":{"URL":"https:\/\/almob.biomedcentral.com\/articles\/10.1186\/1748-7188-5-21"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,5,14]]},"references-count":33,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2010,12]]}},"alternative-id":["94"],"URL":"https:\/\/doi.org\/10.1186\/1748-7188-5-21","relation":{},"ISSN":["1748-7188"],"issn-type":[{"value":"1748-7188","type":"electronic"}],"subject":[],"published":{"date-parts":[[2010,5,14]]},"assertion":[{"value":"12 February 2010","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 May 2010","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 May 2010","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"21"}}