%0 Journal Article
%T R/BHC: fast Bayesian hierarchical clustering for microarray data
%A Richard S Savage
%A Katherine Heller
%A Yang Xu
%A Zoubin Ghahramani
%A William M Truman
%A Murray Grant
%A Katherine J Denby
%A David L Wild
%J BMC Bioinformatics
%D 2009
%I BioMed Central
%R 10.1186/1471-2105-10-242
%X We present an R/Bioconductor port of a fast novel algorithm for Bayesian agglomerative hierarchical clustering and demonstrate its use in clustering gene expression microarray data. The method performs bottom-up hierarchical clustering, using a Dirichlet Process (infinite mixture) to model uncertainty in the data and Bayesian model selection to decide at each step which clusters to merge. Biologically plausible results are presented from a well studied data set: expression profiles of A. thaliana subjected to a variety of biotic and abiotic stresses. Our method avoids several limitations of traditional methods, for example how many clusters there should be and how to choose a principled distance metric. Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis [1-3], little attention has been paid to uncertainty in the results obtained. In clustering, the patterns of expression of different genes across time, treatments, and tissues are grouped into distinct clusters (perhaps organized hierarchically), in which genes in the same cluster are assumed to be potentially functionally related or to be influenced by a common upstream factor. Such cluster structure is often used to aid the elucidation of regulatory networks. Agglomerative hierarchical clustering [1] is one of the most frequently used methods for clustering gene expression profiles. However, commonly used methods for agglomerative hierarchical clustering rely on the setting of some score threshold to distinguish members of a particular cluster from non-members, making the determination of the number of clusters arbitrary and subjective. The algorithm provides no guide to choosing the "correct" number of clusters or the level at which to prune the tree. It is often difficult to know which distance metric to choose, especially for structured data such as gene expression profiles.
Moreover, these approaches
%U http://www.biomedcentral.com/1471-2105/10/242