{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,2]],"date-time":"2025-12-02T15:21:49Z","timestamp":1764688909178},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"S1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2013,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. Following Handl <jats:italic>et al<\/jats:italic>., it can be summarized as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>A procedure is proposed for the assessment of the discriminative ability of a distance function. That is, the evaluation of the ability of a distance function to capture structure in a dataset. It is based on the introduction of a new external validation index, referred to as <jats:italic>Balanced Misclassification Index<\/jats:italic> (<jats:italic>BMI<\/jats:italic>, for short) and of a nontrivial modification of the well known Receiver Operating Curve (ROC, for short), which we refer to as Corrected ROC (CROC, for short). 
The main results are: (a) a quantitative and qualitative method to describe the intrinsic separation ability of a distance; (b) a quantitative method to assess the performance of a clustering algorithm in conjunction with the intrinsic separation ability of a distance function. The proposed procedure is more informative than the ones available in the literature due to the adopted tools. Indeed, the first one allows distances and clustering solutions to be mapped as graphical objects on a plane, and gives information about the bias of the clustering algorithm with respect to a distance. The second tool is a new external validity index which performs comparably to the state of the art, but with more flexibility, allowing for a broader spectrum of applications. In fact, it quantifies not only the merit of each clustering solution but also the agglomerative or divisive errors due to the algorithm.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusions<\/jats:title>\n            <jats:p>The new methodology has been used to experimentally study three popular distance functions, namely, Euclidean distance <jats:italic>d<\/jats:italic>\n              <jats:sub>2<\/jats:sub>, Pearson correlation <jats:italic>d<\/jats:italic>\n              <jats:sub>\n                <jats:italic>r<\/jats:italic>\n              <\/jats:sub> and mutual information <jats:italic>d<\/jats:italic>\n              <jats:sub>\n                <jats:italic>MI<\/jats:italic>\n              <\/jats:sub>. Based on the results of the experiments, we find that the Euclidean and Pearson correlation distances have a good intrinsic discrimination ability. Conversely, the mutual information distance does not seem to offer the same flexibility and versatility as the other two distances. Apparently, that is due to well known problems in its estimation, 
since it requires that a dataset must have a substantial number of features to be reliable. Nevertheless, taking this fact into account, together with results presented in Priness <jats:italic>et al<\/jats:italic>., there is an indication that <jats:italic>d<\/jats:italic>\n              <jats:sub>\n                <jats:italic>MI<\/jats:italic>\n              <\/jats:sub> may be superior to the other distances considered in this study only in conjunction with clustering algorithms specifically designed for its use. In addition, it turns out that K-means, Average Link, and Complete Link clustering algorithms are in most cases able to improve the discriminative ability of the distances considered in this study with respect to clustering. The methodology has a range of applicability that goes well beyond microarray data since it is independent of the nature of the input data. The only requirement is that the input data must have the format of a \"feature matrix\". In particular, it can be used to cluster ChIP-seq data.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-14-s1-s6","type":"journal-article","created":{"date-parts":[[2013,1,14]],"date-time":"2013-01-14T13:16:28Z","timestamp":1358169388000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis"],"prefix":"10.1186","volume":"14","author":[{"given":"Raffaele","family":"Giancarlo","sequence":"first","affiliation":[]},{"given":"Giosu\u00e9","family":"Lo 
Bosco","sequence":"additional","affiliation":[]},{"given":"Luca","family":"Pinello","sequence":"additional","affiliation":[]},{"given":"Filippo","family":"Utro","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2013,1,14]]},"reference":[{"key":"5580_CR1","unstructured":"Stanford Microarray DataBase. [http:\/\/smd.stanford.edu\/]"},{"key":"5580_CR2","doi-asserted-by":"publisher","first-page":"1499","DOI":"10.1038\/nbt1205-1499","volume":"23","author":"P D'haeseleer","year":"2005","unstructured":"D'haeseleer P: How does gene expression cluster work. Nat Biotechnol. 2005, 23: 1499-1501. 10.1038\/nbt1205-1499.","journal-title":"Nat Biotechnol"},{"key":"5580_CR3","doi-asserted-by":"publisher","DOI":"10.1201\/9780203011232","volume-title":"Statistical analysis of gene expression microarray data","author":"TP Speed","year":"2003","unstructured":"Speed TP: Statistical analysis of gene expression microarray data. 2003, Chapman & Hall\/CRC"},{"key":"5580_CR4","doi-asserted-by":"publisher","first-page":"3201","DOI":"10.1093\/bioinformatics\/bti517","volume":"21","author":"J Handl","year":"2005","unstructured":"Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005, 21: 3201-3212. 10.1093\/bioinformatics\/bti517.","journal-title":"Bioinformatics"},{"key":"5580_CR5","doi-asserted-by":"publisher","first-page":"943","DOI":"10.1038\/ng1422","volume":"36","author":"T Mehta","year":"2004","unstructured":"Mehta T, Tanik M, Allison D: Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nature genetics. 2004, 36: 943-947. 
10.1038\/ng1422.","journal-title":"Nature genetics"},{"key":"5580_CR6","doi-asserted-by":"publisher","first-page":"503","DOI":"10.1186\/1471-2105-11-503","volume":"11","author":"E Freyhult","year":"2010","unstructured":"Freyhult E, Landfors M, \u00d6nskog J, Hvidsten T, Ryd\u00e9n P: Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering. BMC Bioinformatics. 2010, 11: 503-10.1186\/1471-2105-11-503.","journal-title":"BMC Bioinformatics"},{"key":"5580_CR7","doi-asserted-by":"publisher","first-page":"462","DOI":"10.1186\/1471-2105-9-462","volume":"9","author":"R Giancarlo","year":"2008","unstructured":"Giancarlo R, Scaturro D, Utro F: Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer. BMC Bioinformatics. 2008, 9: 462-10.1186\/1471-2105-9-462.","journal-title":"BMC Bioinformatics"},{"key":"5580_CR8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1748-7188-6-1","volume":"6","author":"R Giancarlo","year":"2011","unstructured":"Giancarlo R, Utro F: Speeding up the Consensus Clustering methodology for microarray data analysis. Algorithms for Molecular Biology. 2011, 6: 1-10.1186\/1748-7188-6-1.","journal-title":"Algorithms for Molecular Biology"},{"key":"5580_CR9","volume-title":"Lecture Notes in Computer Science, Volume 6073","author":"R Giancarlo","year":"2010","unstructured":"Giancarlo R, Lo Bosco G, Pinello L: Distance functions, clustering algorithms and microarray data analysis. Lecture Notes in Computer Science, Volume 6073. 2010"},{"key":"5580_CR10","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/1471-2105-8-111","volume":"8","author":"I Priness","year":"2007","unstructured":"Priness I, Maimon O, Ben-Gal I: Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics. 2007, 8: 1-12. 
10.1186\/1471-2105-8-1.","journal-title":"BMC Bioinformatics"},{"key":"5580_CR11","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1148\/radiology.143.1.7063747","volume":"143","author":"BM JA Hanley","year":"1982","unstructured":"JA Hanley BM: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982, 143: 29-36. 10.1148\/radiology.143.1.7063747.","journal-title":"Radiology"},{"key":"5580_CR12","volume-title":"Algorithms for Clustering Data","author":"A Jain","year":"1988","unstructured":"Jain A, Dubes R: Algorithms for Clustering Data. Englewood Cliffs: Prentice-Hall 1988"},{"key":"5580_CR13","doi-asserted-by":"publisher","first-page":"RESEARCH0036","DOI":"10.1186\/gb-2002-3-7-research0036","volume":"3","author":"S Dudoit","year":"2002","unstructured":"Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology. 2002, 3: RESEARCH0036.","journal-title":"Genome Biology"},{"key":"5580_CR14","doi-asserted-by":"publisher","first-page":"289","DOI":"10.1186\/1471-2105-6-289","volume":"6","author":"V Di Ges\u00fa","year":"2005","unstructured":"Di Ges\u00fa V, Giancarlo R, Lo Bosco G, Raimondi A, Scaturro D: Genclust: a genetic algorithm for clustering gene expression data. BMC Bioinformatics. 2005, 6: 289-10.1186\/1471-2105-6-289.","journal-title":"BMC Bioinformatics"},{"key":"5580_CR15","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1023\/A:1023949509487","volume":"52","author":"S Monti","year":"2003","unstructured":"Monti S, Tamayo P, Mesirov J, Golub T: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning. 2003, 52: 91-118. 
10.1023\/A:1023949509487.","journal-title":"Machine Learning"},{"key":"5580_CR16","doi-asserted-by":"publisher","first-page":"334","DOI":"10.1073\/pnas.95.1.334","volume":"95","author":"X Wen","year":"1998","unstructured":"Wen X, Fuhrman S, Michaels GS, Carr GS, Smith DB, Barker JL, Somogyi R: Large scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci USA. 1998, 95: 334-339. 10.1073\/pnas.95.1.334.","journal-title":"Proc Natl Acad Sci USA"},{"key":"5580_CR17","doi-asserted-by":"publisher","first-page":"309","DOI":"10.1093\/bioinformatics\/17.4.309","volume":"17","author":"KY Yeung","year":"2001","unstructured":"Yeung KY, Haynor DR, Ruzzo WL: Validating clustering for gene expression data. Bioinformatics. 2001, 17: 309-318. 10.1093\/bioinformatics\/17.4.309.","journal-title":"Bioinformatics"},{"key":"5580_CR18","doi-asserted-by":"publisher","first-page":"531","DOI":"10.1126\/science.286.5439.531","volume":"286","author":"TR Golub","year":"1999","unstructured":"Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeeck M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126\/science.286.5439.531.","journal-title":"Science"},{"key":"5580_CR19","doi-asserted-by":"publisher","first-page":"4164","DOI":"10.1073\/pnas.0308531101","volume":"101","author":"JP Brunet","year":"2004","unstructured":"Brunet JP, Tamayo P, Golub T, Mesirov J: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA. 2004, 101: 4164-4169. 
10.1073\/pnas.0308531101.","journal-title":"Proc Natl Acad Sci USA"},{"key":"5580_CR20","doi-asserted-by":"publisher","first-page":"503","DOI":"10.1038\/35000501","volume":"403","author":"A Alizadeh","year":"2000","unstructured":"Alizadeh A, Eisen M, Davis R, Ma C, Lossos I, Rosenwald A, Boldrick J, Sabet H, Tran T, Yu X, Powell J, Yang L, Marti G, Moore T, Hudson JJ, Lu L, Lewis D, Tibshirani R, Sherlock G, Chan W, Greiner T, Weisenburger D, Armitage J, Warnke R, Levy R, Wilson W, Grever M, Byrd J, Botstein D, Brown P, Staudt L: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403: 503-511. 10.1038\/35000501.","journal-title":"Nature"},{"key":"5580_CR21","unstructured":"NCI 60 cancer microarray project. [http:\/\/genome-www.stanford.edu\/NCI60]"},{"key":"5580_CR22","doi-asserted-by":"publisher","first-page":"4465","DOI":"10.1073\/pnas.012025199","volume":"99","author":"A Su","year":"2002","unstructured":"Su A, Cooke M, Ching K, Hakak Y, Walker J, Wiltshire T, Orth A, Vega R, Sapinoso L, Moqrich A, Patapoutian A, Hampton G, Schultz P, Hogenesch J: Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci USA. 2002, 99: 4465-4470. 10.1073\/pnas.012025199.","journal-title":"Proc Natl Acad Sci USA"},{"key":"5580_CR23","doi-asserted-by":"publisher","first-page":"3273","DOI":"10.1091\/mbc.9.12.3273","volume":"9","author":"PT Spellman","year":"1998","unstructured":"Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9: 3273-3297. 10.1091\/mbc.9.12.3273.","journal-title":"Mol Biol Cell"},{"key":"5580_CR24","first-page":"120","volume-title":"Current Topics in Computational Biology","author":"R Shamir","year":"2003","unstructured":"Shamir R, Sharan R: Algorithmic approaches to clustering gene expression data. 
Current Topics in Computational Biology. Edited by: Jiang T, Smith T, Xu Y, Zhang MQ, Cambridge, Ma.: MIT Press. 2003, 120-161."},{"key":"5580_CR25","doi-asserted-by":"crossref","unstructured":"Cover TM, Thomas JA: Elements of Information Theory. New York City: Wiley-Interscience, 1991.","DOI":"10.1002\/0471200611"},{"key":"5580_CR26","doi-asserted-by":"publisher","first-page":"264","DOI":"10.1145\/331499.331504","volume":"31","author":"AK Jain","year":"1999","unstructured":"Jain AK, Murty MN, Flynn PJ: Data clustering: a review. ACM Computing Surveys. 1999, 31: 264-323. 10.1145\/331499.331504.","journal-title":"ACM Computing Surveys"},{"key":"5580_CR27","doi-asserted-by":"publisher","first-page":"655","DOI":"10.1007\/s11786-007-0025-3","volume":"1","author":"R Giancarlo","year":"2008","unstructured":"Giancarlo R, Scaturro D, Utro F: A tutorial on computational cluster analysis with applications to pattern discovery in microarray data. Mathematics in Computer Science. 2008, 1: 655-672. 10.1007\/s11786-007-0025-3.","journal-title":"Mathematics in Computer Science"},{"key":"5580_CR28","doi-asserted-by":"publisher","first-page":"58","DOI":"10.1016\/j.tcs.2012.01.024","volume":"428","author":"R Giancarlo","year":"2012","unstructured":"Giancarlo R, Utro F: Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis. Theoretical Computer Science. 2012, 428: 58-79.","journal-title":"Theoretical Computer Science"},{"key":"5580_CR29","doi-asserted-by":"publisher","first-page":"536","DOI":"10.1093\/bioinformatics\/18.4.536","volume":"18","author":"Y Xu","year":"2002","unstructured":"Xu Y, Olman V, Xu D: Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning tree. Bioinformatics. 2002, 18: 536-545. 
10.1093\/bioinformatics\/18.4.536.","journal-title":"Bioinformatics"},{"key":"5580_CR30","first-page":"13","volume-title":"Computational Intelligence Methods for Bioinformatics and Biostatistics, Volume 6685 of Lecture Notes in Computer Science","author":"R Giancarlo","year":"2011","unstructured":"Giancarlo R, Lo Bosco G, Pinello L, Utro F: The three steps of clustering in the post-genomic era: a synopsis. Computational Intelligence Methods for Bioinformatics and Biostatistics, Volume 6685 of Lecture Notes in Computer Science. Edited by: Rizzo R, Lisboa P. 2011, Springer Berlin\/Heidelberg, 13-30."},{"key":"5580_CR31","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4899-3324-9","volume-title":"Density Estimation for Statistics and Data Analysis (Chapman & Hall\/CRC Monographs on Statistics & Applied Probability)","author":"BWSilverman","year":"1986","unstructured":"BWSilverman: Density Estimation for Statistics and Data Analysis (Chapman & Hall\/CRC Monographs on Statistics & Applied Probability). 1986, Chapman and Hall\/CRC"},{"key":"5580_CR32","volume-title":"PhD thesis","author":"KY Yeung","year":"2001","unstructured":"Yeung KY: Cluster analysis of gene expression data. PhD thesis. 
2001, University of Washington"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-14-S1-S6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T21:11:35Z","timestamp":1630530695000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-14-S1-S6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,1]]},"references-count":32,"journal-issue":{"issue":"S1","published-print":{"date-parts":[[2013,1]]}},"alternative-id":["5580"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-14-s1-s6","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2013,1]]},"assertion":[{"value":"14 January 2013","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"S6"}}