The most commonly applied strategies for identifying genes with a common response profile are based on clustering algorithms. These methods have no explicit rule for defining the appropriate number of groups of genes. Usually, the number of clusters is decided on heuristic criteria or through the application of one of the methods proposed to assess the number of clusters in a dataset. The purpose of this paper is to compare the performance of seven such techniques, both traditional ones and some recently proposed. All of them underestimate the true number of clusters. Within this limitation, however, the gDGC algorithm appears to be the best: it is the only one that explicitly states a rule for cutting a dendrogram on the basis of a hypothesis-testing framework, allowing the user to calibrate its sensitivity by adjusting the significance level.

1. Introduction

One of the main purposes of microarray experiments is to discover genes with differential expression levels among a set of treatment conditions. Once the set of candidate genes has been obtained, the problem of identifying those with a common response profile across experimental conditions remains open [1–3]. There are several strategies for this task. One is the exploration of gene ontology; the others, which are more commonly applied, are based on unsupervised classification algorithms (cluster analysis).

The main purpose of clustering techniques is to arrange a set of instances into meaningful groups. Hierarchical clustering methods not only group genes but also trace the relationships among them. The outcome of a hierarchical method is displayed as a binary tree called a dendrogram. A key point in interpreting a dendrogram is deciding where to cut it; this decision is equivalent to determining the number of clusters in the dataset. The problem is to distinguish instances that belong to genuinely different groups from those that merely appear different as a result of sampling error.

Several general-purpose methods have been proposed to estimate the optimal number of clusters in a dataset. The most popular are those introduced by Calinski and Harabasz [4], Hartigan [5], Sarle [6], and Kaufman and Rousseeuw [7]. Tibshirani et al. [8] proposed the Gap statistic, which compares the log of the within-cluster sum of squares against its expected value under a suitable null reference distribution. The authors exemplified its application to the discovery of groups in a hierarchical clustering of genes from a microarray experiment.
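For reference, the Gap statistic of [8] is defined, for each candidate number of clusters $k$, as

\[ \mathrm{Gap}_n(k) = E_n^*\{\log W_k\} - \log W_k, \]

where $W_k$ is the pooled within-cluster sum of squares for $k$ clusters and $E_n^*$ denotes the expectation under a reference null distribution, estimated by Monte Carlo; the estimated number of clusters $\hat{k}$ is the smallest $k$ such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$, with $s_{k+1}$ the simulation standard error. The sketch below is not part of this paper; it merely illustrates the procedure in R on a simulated expression matrix, cutting an average-linkage dendrogram with hclust() and cutree() and applying the Gap statistic via clusGap() from the cluster package. The data, the linkage choice, and all settings are illustrative assumptions.

```r
## Illustrative sketch only (not the authors' code): estimating the number
## of clusters in a hierarchical clustering of gene profiles with the Gap
## statistic of Tibshirani et al. [8], as implemented by cluster::clusGap().
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)
## Hypothetical expression matrix: 60 genes (rows) x 6 conditions (columns),
## simulated as three groups with distinct mean response profiles.
x <- rbind(matrix(rnorm(20 * 6, mean = 0), ncol = 6),
           matrix(rnorm(20 * 6, mean = 2), ncol = 6),
           matrix(rnorm(20 * 6, mean = 4), ncol = 6))

## Wrapper so clusGap() can cut an average-linkage dendrogram into k groups;
## clusGap() expects a function(x, k) returning list(cluster = ...).
hclusCut <- function(x, k) {
  list(cluster = cutree(hclust(dist(x), method = "average"), k = k))
}

## Gap statistic: log(W_k) is compared with its average over B reference
## data sets drawn from a uniform null distribution.
gap <- clusGap(x, FUNcluster = hclusCut, K.max = 8, B = 50)

## Smallest k with Gap(k) >= Gap(k + 1) - s_{k+1}, the rule of [8].
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
               method = "Tibs2001SEmax")
k_hat  # should recover about 3 clusters for this simulated data
```

The "Tibs2001SEmax" rule in maxSE() implements exactly the one-standard-error criterion stated above; other rules (e.g., "firstSEmax") are available and can yield different estimates on the same data.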
References
[1] J. W. Lee, J. B. Lee, M. Park, and S. H. Song, “An extensive comparison of recent classification tools applied to microarray data,” Computational Statistics and Data Analysis, vol. 48, no. 4, pp. 869–885, 2005.
[2] K. S. Pollard and M. J. van der Laan, “Cluster analysis of genomic data,” in Bioinformatics and Computational Biology Solutions Using R and Bioconductor, R. Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit, Eds., pp. 209–229, Springer, New York, NY, USA, 2005.
[3] R. Gentleman and V. J. Carey, “Supervised machine learning,” in Bioconductor Case Studies, F. Hahne, W. Huber, R. Gentleman, and S. Falcon, Eds., pp. 121–136, Springer, New York, NY, USA, 2008.
[4] T. Calinski and J. Harabasz, “A dendrite method for cluster analysis,” Communications in Statistics, vol. 3, no. 1, pp. 1–27, 1974.
[5] J. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, NY, USA, 1975.
[6] W. S. Sarle, “The cubic clustering criterion,” SAS Technical Report A-108, SAS Institute, Cary, NC, USA, 1983.
[7] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, NY, USA, 1990.
[8] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of clusters in a data set via the gap statistic,” Journal of the Royal Statistical Society: Series B, vol. 63, no. 2, pp. 411–423, 2001.
[9] C. Fraley and A. E. Raftery, “Model-based clustering, discriminant analysis, and density estimation,” Journal of the American Statistical Association, vol. 97, no. 458, pp. 611–631, 2002.
[10] C. Fraley and A. E. Raftery, “MCLUST version 3 for R: normal mixture modeling and model-based clustering,” Tech. Rep. 504, Department of Statistics, University of Washington, Seattle, Wash, USA, 2006.
[11] S. G. Valdano and J. A. Di Rienzo, “Discovering meaningful groups in hierarchical cluster analysis. An extension to the multivariate case of a multiple comparison method based on cluster analysis,” 2007, http://interstat.statjournals.net/YEAR/2007/abstracts/0704002.php.
[12] J. A. Di Rienzo, A. W. Guzmán, and F. Casanoves, “A multiple-comparisons method based on the distribution of the root node distance of a binary tree,” Journal of Agricultural, Biological, and Environmental Statistics, vol. 7, no. 2, pp. 129–142, 2002.
[13] K. S. Pollard, M. J. van der Laan, and G. Wall, “hopach: hierarchical ordered partitioning and collapsing hybrid (HOPACH),” R package version 2.4.0, 2009, http://www.bioconductor.org/packages/release/bioc/html/hopach.html.
[14] E. Manduchi, L. M. Scearce, J. E. Brestelli, G. R. Grant, K. H. Kaestner, and C. J. Stoeckert Jr., “Comparison of different labeling methods for two-channel high-density microarray experiments,” Physiological Genomics, vol. 10, no. 3, pp. 169–179, 2002.
[15] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society: Series B, vol. 57, no. 1, pp. 289–300, 1995.