{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T02:22:23Z","timestamp":1777342943362,"version":"3.51.4"},"reference-count":22,"publisher":"Oxford University Press (OUP)","issue":"13","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":3015,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/2.0\/uk\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2008,7,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets.<\/jats:p>\n               <jats:p>Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation.<\/jats:p>\n               <jats:p>Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families.<\/jats:p>\n               <jats:p>Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.<\/jats:p>\n               <jats:p>Contact: \u00a0lonshy@cs.huji.ac.il<\/jats:p>","DOI":"10.1093\/bioinformatics\/btn174","type":"journal-article","created":{"date-parts":[[2008,6,27]],"date-time":"2008-06-27T07:43:13Z","timestamp":1214552593000},"page":"i41-i49","source":"Crossref","is-referenced-by-count":104,"title":["Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space"],"prefix":"10.1093","volume":"24","author":[{"given":"Yaniv","family":"Loewenstein","sequence":"first","affiliation":[{"name":"1 School of Computer Science and Engineering and 2Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Elon","family":"Portugaly","sequence":"additional","affiliation":[{"name":"1 School of Computer Science and Engineering and 2Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Menachem","family":"Fromer","sequence":"additional","affiliation":[{"name":"1 School of Computer Science and Engineering and 2Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michal","family":"Linial","sequence":"additional","affiliation":[{"name":"1 School of Computer Science and Engineering and 2Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2008,7,1]]},"reference":[{"key":"2023020210390433600_B1","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped BLAST and PSI-BLAST: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res"},{"key":"2023020210390433600_B2","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene ontology: tool for the unification of biology. The Gene ontology consortium","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat. Genet"},{"key":"2023020210390433600_B3","doi-asserted-by":"crossref","first-page":"1499","DOI":"10.1038\/nbt1205-1499","article-title":"How does gene expression clustering work?","volume":"23","author":"D'haeseleer","year":"2005","journal-title":"Nat. Biotechnol"},{"key":"2023020210390433600_B4","volume-title":"Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids","author":"Durbin","year":"1999"},{"key":"2023020210390433600_B5","doi-asserted-by":"crossref","first-page":"D247","DOI":"10.1093\/nar\/gkj149","article-title":"Pfam: clans, web tools and services","volume":"34","author":"Finn","year":"2006","journal-title":"Nucleic Acids Res"},{"key":"2023020210390433600_B6","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1126\/science.155.3760.279","article-title":"Construction of phylogenetic trees","volume":"155","author":"Fitch","year":"1967","journal-title":"Science"},{"key":"2023020210390433600_B7","doi-asserted-by":"crossref","first-page":"972","DOI":"10.1126\/science.1136800","article-title":"Clustering by passing messages between data points","volume":"315","author":"Frey","year":"2007","journal-title":"Science"},{"key":"2023020210390433600_B8","doi-asserted-by":"crossref","first-page":"196","DOI":"10.1186\/1471-2105-5-196","article-title":"A functional hierarchical organization of the protein sequence space","volume":"5","author":"Kaplan","year":"2004","journal-title":"BMC Bioinformatics"},{"issue":"(Database issue)","key":"2023020210390433600_B9","first-page":"D216","article-title":"ProtoNet 4.0: a hierarchical classification of one million protein sequences","volume":"33","author":"Kaplan","year":"2005","journal-title":"Nucleic Acids Res"},{"key":"2023020210390433600_B10","doi-asserted-by":"crossref","first-page":"1020","DOI":"10.1093\/bioinformatics\/bti135","article-title":"Predicting fold novelty based on ProtoNet hierarchical classification","volume":"21","author":"Kifer","year":"2005","journal-title":"Bioinformatics"},{"key":"2023020210390433600_B11","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1186\/1471-2105-6-15","article-title":"Large scale hierarchical clustering of protein sequences","volume":"6","author":"Krause","year":"2005","journal-title":"BMC Bioinformatics"},{"key":"2023020210390433600_B12","doi-asserted-by":"crossref","first-page":"1876","DOI":"10.1093\/bioinformatics\/bti244","article-title":"On the quality of tree-based protein classification","volume":"21","author":"Lazareva-Ulitsky","year":"2005","journal-title":"Bioinformatics"},{"key":"2023020210390433600_B13","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1016\/S1367-5931(02)00003-0","article-title":". Domains, motifs and clusters in the protein universe","volume":"7","author":"Liu","year":"2003","journal-title":"Curr. Opin. Chem. Biol"},{"key":"2023020210390433600_B14","doi-asserted-by":"crossref","first-page":"D224","DOI":"10.1093\/nar\/gkl841","article-title":"New developments in the interpro database","volume":"35","author":"Mulder","year":"2007","journal-title":"Nucleic Acids Res"},{"key":"2023020210390433600_B15","doi-asserted-by":"crossref","first-page":"536","DOI":"10.1016\/S0022-2836(05)80134-2","article-title":"SCOP: a structural classification of proteins database for the investigation of sequences and structures","volume":"247","author":"Murzin","year":"1995","journal-title":"J. Mol. Biol"},{"key":"2023020210390433600_B16","doi-asserted-by":"crossref","first-page":"3604","DOI":"10.1093\/bioinformatics\/bti542","article-title":"The predictive power of the CluSTr database","volume":"21","author":"Petryszak","year":"2005","journal-title":"Bioinformatics"},{"key":"2023020210390433600_B17","doi-asserted-by":"crossref","first-page":"1557","DOI":"10.1110\/ps.062185706","article-title":"Functional annotation prediction: all for one and one for all","volume":"15","author":"Sasson","year":"2006","journal-title":"Protein Sci"},{"key":"2023020210390433600_B18","doi-asserted-by":"crossref","first-page":"531","DOI":"10.1002\/prot.20235","article-title":"A robust method to detect structural and functional remote homologues","volume":"57","author":"Shachar","year":"2004","journal-title":"Proteins"},{"key":"2023020210390433600_B19","first-page":"201","article-title":"The application of computers to taxonomy","volume":"17","author":"Sneath","year":"1957","journal-title":"J. Gen. Microbiol"},{"key":"2023020210390433600_B20","first-page":"1409","article-title":"A statistical method for evaluating systematic relationships","volume-title":"Univ. Kans. Sci. Bull","author":"Sokal","year":"1958"},{"key":"2023020210390433600_B21","doi-asserted-by":"crossref","first-page":"1282","DOI":"10.1093\/bioinformatics\/btm098","article-title":"UniRef: comprehensive and non-redundant UniProt reference clusters","volume":"23","author":"Suzek","year":"2007","journal-title":"Bioinformatics"},{"key":"2023020210390433600_B22","doi-asserted-by":"crossref","first-page":"631","DOI":"10.1126\/science.278.5338.631","article-title":"A genomic perspective on protein families","volume":"278","author":"Tatusov","year":"1997","journal-title":"Science"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/24\/13\/i41\/49050020\/bioinformatics_24_13_i41.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/24\/13\/i41\/49050020\/bioinformatics_24_13_i41.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,2,2]],"date-time":"2023-02-02T12:22:00Z","timestamp":1675340520000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/24\/13\/i41\/234385"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2008,7,1]]},"references-count":22,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2008,7,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btn174","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2008,7,1]]},"published":{"date-parts":[[2008,7,1]]}}}