Background A central problem in systems biology research is the identification and extension of biological modules–groups of genes or proteins participating in a common cellular process or physical complex. As a result, there is a persistent need for practical, principled methods to infer the modular organization of genes from genome-scale data. Results We introduce a novel approach for the identification of modules based on the persistence of isolated gene groups within an evolving graph process. First, the underlying genomic data is summarized in the form of ranked gene–gene relationships, thereby accommodating studies that quantify the relevant biological relationship directly or indirectly. Then, the observed gene–gene relationship ranks are viewed as the outcome of a random graph process and candidate modules are given by the identifiable subgraphs that arise during this process. An isolation index is computed for each module, which quantifies the statistical significance of its survival time. Conclusions The Miso (module isolation) method predicts gene modules from genomic data and the associated isolation index provides a module-specific measure of confidence. Improving on existing alternative, such as graph clustering and the global pruning of dendrograms, this index offers two intuitively appealing features: (1) the score is module-specific; and (2) different choices of threshold correlate logically with the resulting performance, i.e. a stringent cutoff yields high quality predictions, but low sensitivity. Through the analysis of yeast phenotype data, the Miso method is shown to outperform existing alternatives, in terms of the specificity and sensitivity of its predictions.
References
[1]
Beyer A, Bandyopadhyay S, Ideker T (2007) Integrating physical and genetic maps: from genomes to interaction networks. Nat Rev Genet 8: 699–710.
[2]
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799–804.
[3]
Collins SR, Kemmeren P, Zhao XC, Greenblatt JF, Spencer F, et al. (2007) Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics 6: 439–450.
[4]
Kiemer L, Costa S, Ueffing M, Cesareni G (2007) WI-PHI: a weighted yeast interactome enriched for direct physical interactions. Proteomics 7: 932–943.
[5]
Rinner O, Mueller LN, Hubalek M, Muller M, Gstaiger M, et al. (2007) An integrated mass spectrometric and computational framework for the analysis of protein interaction networks. Nat Biotechnol 25: 345–352.
[6]
Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4: 2.
[7]
Hartuv E, Schmitt AO, Lange J, Meier-Ewert S, Lehrach H, et al. (2000) An algorithm for clustering cDNA fingerprints. Genomics 66: 249–256.
[8]
van Dongen S (2000) Graph clustering by flow simulation [PhD thesis]. Utrecht: University of Utrecht.
[9]
Ling RF (1973) A probability theory of cluster analysis. Journal of the American Statistical Association 68: 159–164.
[10]
Lee W, St Onge RP, Proctor M, Flaherty P, Jordan MI, et al. (2005) Genome-wide requirements for resistance to functionally distinct DNA-damaging agents. PLoS Genet 1: e24.
[11]
Schluter C, Lam KK, Brumm J, Wu BW, Saunders M, et al. (2008) Global Analysis of Yeast Endosomal Transport Identifies the Vps55/68 Sorting Complex. Mol Biol Cell 19: 1282–1294.
[12]
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13: 2498–2504.
[13]
R Development Core Team (2008) R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
[14]
Giaever G, Chu AM, Ni L, Connelly C, Riles L, et al. (2002) Functional profiling of the Saccharomyces cerevisiae genome. Nature 418: 387–391.
[15]
Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, et al. (1999) Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285: 901–906.
[16]
Brohee S, van Helden J (2006) Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7: 488.
[17]
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B 63: 411–423.
[18]
Milligan G, Cooper M (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50: 159–179.
[19]
Duda RO, Hart PE, Stork DG (2000) Pattern classification. New York; Chichester: John Wiley & Sons.
[20]
Huber W, Carey VJ, Long L, Falcon S, Gentleman R (2007) Graphs in molecular biology. BMC Bioinformatics 8: Suppl 6S8.
[21]
Stuetzle W (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification 20: 25–47.
[22]
Stuetzle W, Nugent R (2007) A generalized single linkage method for estimating the cluster tree of a density. University of Washington, Department of Statistics. Technical Report 514:
[23]
Wong MA, Lane T (1983) A kth Nearest Neighbour Clustering Procedure. Journal of the Royal Statistical Society Series B (Methodological) 45: 362–368.