|
BMC Bioinformatics 2005
The effects of normalization on the correlation structure of microarray data

Abstract: The paper presents a systematic study of correlation between the t-statistics associated with different genes. We report the effects of four different normalization methods using a large set of microarray data on childhood leukemia in addition to several sets of simulated data. Our findings help decipher the correlation structure of microarray data before and after the application of normalization procedures.

A long-range correlation in microarray data manifests itself in thousands of genes that are heavily correlated with a given gene in terms of the associated t-statistics. By using normalization methods it is possible to significantly reduce correlation between the t-statistics computed for different genes. Normalization procedures affect both the true correlation, stemming from gene interactions, and the spurious correlation induced by random noise. When analyzing real world biological data sets, normalization procedures are unable to completely remove correlation between the test statistics. The long-range correlation structure also persists in normalized data.

There are two major methodological problems that deal with the issue of stochastic dependence between gene expression signals in microarray data. The first arises naturally when adjustments for multiplicity of tests are made by pooling across genes (or tests) in an effort to find differentially expressed genes in two-sample comparisons. The empirical Bayes methodology in the nonparametric [1-3] and parametric formulations [4,5], and closely related methods exploiting a two-component mixture model [6-8] represent typical examples. 
The common feature of such methods is that a test statistic (measure of differential expression) is first calculated for each gene to account for biological variability and then all the statistics (or the associated p-values) are pooled together and treated as a sample from which to estimate the sampling distribution of this statistic, the false discovery rate (FDR), q-values, etc.
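The pooling step described above can be illustrated with a minimal Python sketch. It is not the empirical Bayes or mixture-model machinery of the cited methods: as stand-ins, it assumes a Welch t-statistic per gene, a null distribution built by pooling permutation statistics across all genes, and the Benjamini-Hochberg adjustment as the FDR-type summary. All function names and the data layout (one list of sample values per gene) are hypothetical. Treating the pooled statistics as a single sample is exactly the step that dependence between genes can undermine.

```python
import random
import statistics


def welch_t(x, y):
    """Two-sample Welch t-statistic for one gene."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    se = (vx / len(x) + vy / len(y)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se


def pooled_permutation_pvalues(group_a, group_b, n_perm=200, seed=0):
    """Per-gene p-values from a null pooled across genes (illustrative choice).

    group_a and group_b are lists with one entry per gene; each entry is the
    list of expression values for the samples in that group.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a[0]), len(group_b[0])
    observed = [welch_t(a, b) for a, b in zip(group_a, group_b)]

    pooled_null = []
    idx = list(range(n_a + n_b))
    for _ in range(n_perm):
        # One relabelling of samples is applied to every gene at once.
        rng.shuffle(idx)
        for a, b in zip(group_a, group_b):
            combined = a + b
            perm = [combined[i] for i in idx]
            pooled_null.append(abs(welch_t(perm[:n_a], perm[n_a:])))

    # All null statistics are treated as one sample, regardless of gene.
    total = len(pooled_null)
    return [
        (1 + sum(t0 >= abs(t) for t0 in pooled_null)) / (1 + total)
        for t in observed
    ]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Calling `benjamini_hochberg(pooled_permutation_pvalues(group_a, group_b))` yields one adjusted value per gene. The pooled null is only as good as the assumption that the per-gene statistics behave like a sample from a common distribution, which is the assumption at issue when the statistics are correlated.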
|