{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T21:17:06Z","timestamp":1780694226119,"version":"3.54.1"},"reference-count":75,"publisher":"Oxford University Press (OUP)","issue":"15","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2005,8,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge\u2014whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. 
Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability:<\/jats:title><jats:p>The software used in the experiments is available at http:\/\/dbkgroup.org\/handl\/clustervalidation\/<\/jats:p><\/jats:sec><jats:sec><jats:title>Contact<\/jats:title><jats:p>J.Handl@postgrad.manchester.ac.uk<\/jats:p><\/jats:sec><jats:sec><jats:title>Supplementary information:<\/jats:title><jats:p>Enlarged colour plots are provided in the Supplementary Material, which is available at http:\/\/dbkgroup.org\/handl\/clustervalidation\/<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/bti517","type":"journal-article","created":{"date-parts":[[2005,5,25]],"date-time":"2005-05-25T02:38:44Z","timestamp":1116988724000},"page":"3201-3212","source":"Crossref","is-referenced-by-count":721,"title":["Computational cluster validation in post-genomic data analysis"],"prefix":"10.1093","volume":"21","author":[{"given":"Julia","family":"Handl","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Joshua","family":"Knowles","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Douglas B.","family":"Kell","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"286","published-online":{"date-parts":[[2005,8,1]]},"reference":[{"key":"2023051611591107400_B1","doi-asserted-by":"crossref","unstructured":"Ankerst M , Breunig M, Kriegel H-P, Sander J. OPTICS: ordering points to identify clustering structure. In: Proceedings of the 1999 International Conference on Management of Data\u2014Delis A, et al, eds. (1999) New York: ACM Press. 49\u201360.","DOI":"10.1145\/304182.304187"},{"key":"2023051611591107400_B2","doi-asserted-by":"crossref","unstructured":"Bandyopadhyay S , Manlik U. 
Nonparametric genetic clustering: comparison of validity indices. IEEE Trans. Syst. Man Cybernet (2001) 31:120\u2013125.","DOI":"10.1109\/5326.923275"},{"key":"2023051611591107400_B3","doi-asserted-by":"crossref","unstructured":"Ben-Dor A , Friedman M, Yakhini Z. Overabundance analysis and class discovery in gene expression data. In: Technical report (2002) Agilent Laboratories, Palo Alto.","DOI":"10.1145\/369133.369167"},{"key":"2023051611591107400_B4","unstructured":"Ben-Hur A , Elisseeff A, Guyon I. A stability based method for discovering structure in clustered data. In: Pacific Symposium on Biocomputing\u2014Altman RB, et al, eds. (2002) New Jersey: World Scientific Publishing Co."},{"key":"2023051611591107400_B5","doi-asserted-by":"crossref","unstructured":"Bezdek J , Pal N. Some new indexes of cluster validity. IEEE Trans. Syst. Man Cybernet. (1998) 28:301\u2013315.","DOI":"10.1109\/3477.678624"},{"key":"2023051611591107400_B6","doi-asserted-by":"crossref","unstructured":"Bilu Y , Linial M. The advantage of functional prediction based on clustering of yeast genes and its correlation with non-sequence based classification. J. Comput. Biol. (2002) 9:193\u2013210.","DOI":"10.1089\/10665270252935412"},{"key":"2023051611591107400_B7","doi-asserted-by":"crossref","unstructured":"Bittner M , et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature (2000) 406:536\u2013540.","DOI":"10.1038\/35020115"},{"key":"2023051611591107400_B8","doi-asserted-by":"crossref","unstructured":"Bolshakova N , Azuaje F. Cluster validation techniques for genome expression data. Signal Processing (2003) 83:825\u2013833.","DOI":"10.1016\/S0165-1684(02)00475-9"},{"key":"2023051611591107400_B9","doi-asserted-by":"crossref","unstructured":"Bolshakova N , et al. An integrated tool for microarray data clustering and cluster validity assessment. 
Bioinformatics (2005) 21:451\u2013455.","DOI":"10.1093\/bioinformatics\/bti190"},{"key":"2023051611591107400_B10","doi-asserted-by":"crossref","unstructured":"Breckenridge J . Replicating cluster analysis: method, consistency and validity. Multivar. Behav. Res. (1989) 24:147\u2013161.","DOI":"10.1207\/s15327906mbr2402_1"},{"key":"2023051611591107400_B11","doi-asserted-by":"crossref","unstructured":"Breckenridge J . Validating cluster analysis: consistent replication and symmetry. Multivar. Behav. Res. (2000) 35:261\u2013285.","DOI":"10.1207\/S15327906MBR3502_5"},{"key":"2023051611591107400_B12","doi-asserted-by":"crossref","unstructured":"Datta S , Datta S. Comparison and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics (2003) 19:459\u2013466.","DOI":"10.1093\/bioinformatics\/btg025"},{"key":"2023051611591107400_B13","doi-asserted-by":"crossref","unstructured":"Davies DL , Bouldin DW. A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell. (1979) 1:224\u2013227.","DOI":"10.1109\/TPAMI.1979.4766909"},{"key":"2023051611591107400_B14","doi-asserted-by":"crossref","unstructured":"Ding C , He C. K-nearest neighbor consistency in data clustering: incorporating local information into global optimization. In: Proceedings of the 2004 ACM Symposium on Applied Computing\u2014Haddad HM, et al, eds. (2004) New York: ACM Press. 584\u2013589.","DOI":"10.1145\/967900.968021"},{"key":"2023051611591107400_B15","doi-asserted-by":"crossref","unstructured":"Dubes R , Jain AK. Validity studies in clustering methodologies. Pattern Recog. Lett. (1979) 11:235\u2013254.","DOI":"10.1016\/0031-3203(79)90034-7"},{"key":"2023051611591107400_B16","unstructured":"Duda RO , Hart PE, Stork DG. Pattern Classification (2001) 2nd edn. John Wiley and Sons Ltd."},{"key":"2023051611591107400_B17","doi-asserted-by":"crossref","unstructured":"Dunn JC . Well separated clusters and fuzzy partitions. J. Cybernet. 
(1974) 4:95\u2013104.","DOI":"10.1080\/01969727408546059"},{"key":"2023051611591107400_B18","unstructured":"Edwards AL . The Correlation Coefficient (1967) W.H. Freeman. 33\u201346."},{"key":"2023051611591107400_B19","doi-asserted-by":"crossref","unstructured":"Efron B , Tibshirani RJ. An Introduction to the Bootstrap (1993) Chapman and Hall.","DOI":"10.1007\/978-1-4899-4541-9"},{"key":"2023051611591107400_B20","doi-asserted-by":"crossref","unstructured":"Eisen MB . Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA (1998) 95:14863\u201314868.","DOI":"10.1073\/pnas.95.25.14863"},{"key":"2023051611591107400_B21","unstructured":"Ester M , Kriegel HP, Sander J. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data-Mining\u2014Simoudis E, et al, eds. (1996) Menlo Park: AAAI Press."},{"key":"2023051611591107400_B22","doi-asserted-by":"crossref","unstructured":"Estivill-Castro V . Why so many clustering algorithms: a position paper. ACM SIGKDD Explor. Newslett. (2002) 4:65\u201375.","DOI":"10.1145\/568574.568575"},{"key":"2023051611591107400_B23","unstructured":"Everitt BS . Cluster Analysis (1993) Edward Arnold."},{"key":"2023051611591107400_B24","doi-asserted-by":"crossref","unstructured":"Fonseca CM , Fleming PJ. On the performance assessment and comparison of stochastic multiobjective optimizers. In: Proceedings of the Fourth International Conference on Parallel Problem Solving from Nature\u2014Voigt HM, et al, eds. (1996) Berlin: Springer-Verlag. 584\u2013593.","DOI":"10.1007\/3-540-61723-X_1022"},{"key":"2023051611591107400_B25","unstructured":"Fridlyand J , Dudoit S. Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method. 
In: Technical report (2001) Berkeley: Department of Statistics."},{"key":"2023051611591107400_B26","doi-asserted-by":"crossref","unstructured":"Gasch AP , Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol. (2002) 3:1\u201322.","DOI":"10.1186\/gb-2002-3-11-research0059"},{"key":"2023051611591107400_B27","doi-asserted-by":"crossref","unstructured":"Gat-Viks I , et al. Scoring clustering solutions by their biological relevance. Bioinformatics (2003) 19:2381\u20132389.","DOI":"10.1093\/bioinformatics\/btg330"},{"key":"2023051611591107400_B28","doi-asserted-by":"crossref","unstructured":"Golub TR , et al. Molecular classification of cancer: class discovery and class prediction by gene expression. Science (1999) 286:531\u2013537.","DOI":"10.1126\/science.286.5439.531"},{"key":"2023051611591107400_B29","doi-asserted-by":"crossref","unstructured":"Goodacre R , et al. Rapid identification of urinary tract infection bacteria using hyperspectral whole organism fingerprinting and artificial neural networks. Microbiology (1998) 144:1157\u20131170.","DOI":"10.1099\/00221287-144-5-1157"},{"key":"2023051611591107400_B30","doi-asserted-by":"crossref","unstructured":"Gordon AD . Classification (1999) 2nd edn. Chapman and Hall.","DOI":"10.1201\/9781584888536"},{"key":"2023051611591107400_B31","doi-asserted-by":"crossref","unstructured":"Halkidi M , et al. On clustering validation techniques. J. Intell. Inform. Syst. (2001) 17:107\u2013145.","DOI":"10.1023\/A:1012801612483"},{"key":"2023051611591107400_B32","doi-asserted-by":"crossref","unstructured":"Handl J , Knowles J. Exploiting the trade-off\u2014the benefits of multiple objectives in data clustering. In: Proceedings of the Third International Conference on Evolutionary Multicriterion Optimization\u2014Coello LA, et al, eds. (2005) Berlin: Springer-Verlag. 
547\u2013560.","DOI":"10.1007\/978-3-540-31880-4_38"},{"key":"2023051611591107400_B33","doi-asserted-by":"crossref","unstructured":"Hastie T , et al. Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. (2000) 1:1\u201321.","DOI":"10.1186\/gb-2000-1-2-research0003"},{"key":"2023051611591107400_B34","unstructured":"Hastie T , Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag."},{"key":"2023051611591107400_B35","doi-asserted-by":"crossref","unstructured":"Herrero J , et al. A hierarchical unsupervised growing neural network for clustering gene expression data. Bioinformatics (2001) 17:126\u2013136.","DOI":"10.1093\/bioinformatics\/17.2.126"},{"key":"2023051611591107400_B36","doi-asserted-by":"crossref","unstructured":"Hubert A . Comparing partitions. J. Classif. (1985) 2:193\u2013198.","DOI":"10.1007\/BF01908075"},{"key":"2023051611591107400_B37","unstructured":"Jaccard S . Nouvelles recherches sur la distribution florale. Bull. Soc. Vaud. Sci. Nat. (1908) 44:223\u2013270."},{"key":"2023051611591107400_B38","doi-asserted-by":"crossref","unstructured":"Jain AK , et al. Data clustering: a review. ACM Comput. Surv. (1999) 31:264\u2013323.","DOI":"10.1145\/331499.331504"},{"key":"2023051611591107400_B39","unstructured":"Jardine N , Sibson R. Mathematical Taxonomy (1971) John Wiley and Sons."},{"key":"2023051611591107400_B40","doi-asserted-by":"crossref","unstructured":"Kaplan N , et al. A functional hierarchical organization of the protein sequence space. BMC Bioinformatics (2004) 5.","DOI":"10.1186\/1471-2105-5-196"},{"key":"2023051611591107400_B41","doi-asserted-by":"crossref","unstructured":"Kell DB , Oliver SG. Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era. 
Bioessays (2004) 26:99\u2013105.","DOI":"10.1002\/bies.10385"},{"key":"2023051611591107400_B42","doi-asserted-by":"crossref","unstructured":"Kerr MK , Churchill GA. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl Acad. Sci. USA (2001) 98:8961\u20138965.","DOI":"10.1073\/pnas.161273698"},{"key":"2023051611591107400_B43","doi-asserted-by":"crossref","unstructured":"Kohonen T . Self-organizing maps. In: Springer Series in Information Sciences (2001) 30. Springer-Verlag.","DOI":"10.1007\/978-3-642-56927-2"},{"key":"2023051611591107400_B44","doi-asserted-by":"crossref","unstructured":"Krasnogor N , Pelta DA. Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics (2004) 20:1015\u20131021.","DOI":"10.1093\/bioinformatics\/bth031"},{"key":"2023051611591107400_B45","doi-asserted-by":"crossref","unstructured":"Krieger AM , Green P. A cautionary note on using internal crossvalidation. Psychometrika (1999) 64:341\u2013353.","DOI":"10.1007\/BF02294300"},{"key":"2023051611591107400_B46","doi-asserted-by":"crossref","unstructured":"Lange T , et al. Stability-based validation of clustering solutions. Neural comput. (2004) 16:1299\u20131323.","DOI":"10.1162\/089976604773717621"},{"key":"2023051611591107400_B47","unstructured":"Lehmann EL , D'Abrera HJM. Nonparametrics: Statistical Methods Based on Ranks (1998) Prentice-Hall."},{"key":"2023051611591107400_B48","doi-asserted-by":"crossref","unstructured":"Levine E , Domany E. Resampling method for unsupervised estimation of cluster validity. Neural Comput. (2001) 13:2573\u20132593.","DOI":"10.1162\/089976601753196030"},{"key":"2023051611591107400_B49","doi-asserted-by":"crossref","unstructured":"Li C , Wong WH. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol. 
(2001) 2:1\u201311.","DOI":"10.1186\/gb-2001-2-8-research0032"},{"key":"2023051611591107400_B50","unstructured":"MacQueen J . Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability\u2014Le Cam LM, et al, eds. (1967) Berkeley: University of California Press. 281\u2013297."},{"key":"2023051611591107400_B51","doi-asserted-by":"crossref","unstructured":"Madeira SC , Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE Trans. Comput. Biol. Bioinformatics (2004) 1:24\u201345.","DOI":"10.1109\/TCBB.2004.2"},{"key":"2023051611591107400_B52","unstructured":"McLachlan G , Krishnan T. The EM Algorithm and Extensions (1997) John Wiley and Sons Ltd."},{"key":"2023051611591107400_B53","doi-asserted-by":"crossref","unstructured":"McShane LM , et al. Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics (2002) 18:1462\u20131469.","DOI":"10.1093\/bioinformatics\/18.11.1462"},{"key":"2023051611591107400_B54","doi-asserted-by":"crossref","unstructured":"Mendes DJ , et al. Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics (2003) 19:122\u2013129.","DOI":"10.1093\/bioinformatics\/btg1069"},{"key":"2023051611591107400_B55","doi-asserted-by":"crossref","unstructured":"Michaud DJ , et al. eXPatGen: generating dynamic expression patterns for the systematic evaluation of analytical methods. Bioinformatics (2003) 19:1140\u20131146.","DOI":"10.1093\/bioinformatics\/btg132"},{"key":"2023051611591107400_B56","doi-asserted-by":"crossref","unstructured":"Milligan GW , Cooper MC. A study of the comparability of external criteria for hierarchical cluster analysis. Multivar. Behav. Res. (1986) 21:441\u2013458.","DOI":"10.1207\/s15327906mbr2104_5"},{"key":"2023051611591107400_B57","doi-asserted-by":"crossref","unstructured":"Pal NR , Bezdek JC. 
On cluster validity for the fuzzy c-means model. IEEE Trans. Fuzzy Syst. (1995) 3:370\u2013379.","DOI":"10.1109\/91.413225"},{"key":"2023051611591107400_B58","unstructured":"Pareto V . Manual of Political Economy, 1971 Translation of 1927 Edition (1971) Augustus M. Kelley."},{"key":"2023051611591107400_B59","doi-asserted-by":"crossref","unstructured":"Quackenbush J . Computational analysis of microarray data. Nat. Rev. Genet. (2001) 2:418\u2013427.","DOI":"10.1038\/35076576"},{"key":"2023051611591107400_B60","doi-asserted-by":"crossref","unstructured":"Rand W . Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. (1971) 66:846\u2013850.","DOI":"10.1080\/01621459.1971.10482356"},{"key":"2023051611591107400_B61","unstructured":"Rayward-Smith VJ , Osman IH, Reeves CR, Smith GD. Modern Heuristic Search Methods (1996) John Wiley and Sons Ltd."},{"key":"2023051611591107400_B62","unstructured":"Romesburg HC . Cluster Analysis for Researchers (1984) Belmont."},{"key":"2023051611591107400_B63","doi-asserted-by":"crossref","unstructured":"Rousseeuw PJ . Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987) 20:53\u201365.","DOI":"10.1016\/0377-0427(87)90125-7"},{"key":"2023051611591107400_B64","doi-asserted-by":"crossref","unstructured":"Shaw AD , et al. Discrimination of the variety and region of origin of extra virgin olive oils using C-13 NMR and multivariate calibration with variable reduction. Anal. Chim. Acta (1997) 384:357\u2013374.","DOI":"10.1016\/S0003-2670(97)00037-8"},{"key":"2023051611591107400_B65","doi-asserted-by":"crossref","unstructured":"Slonim DK . From patterns to pathways: gene expression data analysis comes of age. Nat. Genet. (2002) 32:502\u2013508.","DOI":"10.1038\/ng1033"},{"key":"2023051611591107400_B66","doi-asserted-by":"crossref","unstructured":"De Smet F , et al. Adaptive quality-based clustering of gene expression profiles. 
Bioinformatics (2002) 18:735\u2013746.","DOI":"10.1093\/bioinformatics\/18.5.735"},{"key":"2023051611591107400_B67","doi-asserted-by":"crossref","unstructured":"Tamayo P , et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA (1999) 96:2907\u20132912.","DOI":"10.1073\/pnas.96.6.2907"},{"key":"2023051611591107400_B68","doi-asserted-by":"crossref","unstructured":"Tavazoie S , et al. Systematic determination of genetic network architecture. Nat. Genet. (1999) 22:281\u2013285.","DOI":"10.1038\/10343"},{"key":"2023051611591107400_B69","unstructured":"Tibshirani R , Walther G, Botstein D, Brown P. Cluster validation by prediction strength. In: Technical report (2001) CA: Department of Statistics, Stanford University."},{"key":"2023051611591107400_B70","doi-asserted-by":"crossref","unstructured":"Tibshirani R , et al. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (2001) 63:411\u2013423.","DOI":"10.1111\/1467-9868.00293"},{"key":"2023051611591107400_B71","doi-asserted-by":"crossref","unstructured":"Toronen P . Selection of informative clusters from hierarchical cluster tree with gene classes. BMC Bioinformatics (2004) 5:34.","DOI":"10.1186\/1471-2105-5-32"},{"key":"2023051611591107400_B72","unstructured":"van Rijsbergen C . Information Retrieval (1979) 2nd edn. Butterworths."},{"key":"2023051611591107400_B73","unstructured":"Vorhees E . The effectiveness and efficiency of agglomerative hierarchical clustering in document retrieval. Department of Computer Science, Cornell University. PhD thesis."},{"key":"2023051611591107400_B74","doi-asserted-by":"crossref","unstructured":"Yeung KY , et al. Validating clustering for gene expression data. Bioinformatics (2001) 17:309\u2013318.","DOI":"10.1093\/bioinformatics\/17.4.309"},{"key":"2023051611591107400_B75","doi-asserted-by":"crossref","unstructured":"Yeung KY , et al. 
Model-based clustering and data transformation for gene expression data. Bioinformatics (2001) 17:977\u2013987.","DOI":"10.1093\/bioinformatics\/17.10.977"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/21\/15\/3201\/50340692\/bioinformatics_21_15_3201.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/21\/15\/3201\/50340692\/bioinformatics_21_15_3201.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,1]],"date-time":"2025-01-01T10:01:35Z","timestamp":1735725695000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/21\/15\/3201\/195678"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2005,8,1]]},"references-count":75,"journal-issue":{"issue":"15","published-print":{"date-parts":[[2005,8,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bti517","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2005,8]]},"published":{"date-parts":[[2005,8,1]]}}}