|
Genome Biology 2010
Differential expression analysis for sequence count dataDOI: 10.1186/gb-2010-11-10-r106 Abstract: High-throughput sequencing of DNA fragments is used in a range of quantitative assays. A common feature between these assays is that they sequence large amounts of DNA fragments that reflect, for example, a biological system's repertoire of RNA molecules (RNA-Seq [1,2]) or the DNA or RNA interaction regions of nucleotide binding molecules (ChIP-Seq [3], HITS-CLIP [4]). Typically, these reads are assigned to a class based on their mapping to a common region of the target genome, where each class represents a target transcript, in the case of RNA-Seq, or a binding region, in the case of ChIP-Seq. An important summary statistic is the number of reads in a class; for RNA-Seq, this read count has been found to be (to good approximation) linearly related to the abundance of the target transcript [2]. Interest lies in comparing read counts between different biological conditions. In the simplest case, the comparison is done separately, class by class. We will use the term gene synonymously to class, even though a class may also refer to, for example, a transcription factor binding site, or even a barcode [5].We would like to use statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation.If reads were independently sampled from a population with given, fixed fractions of genes, the read counts would follow a multinomial distribution, which can be approximated by the Poisson distribution.Consequently, the Poisson distribution has been used to test for differential expression [6,7]. The Poisson distribution has a single parameter, which is uniquely determined by its mean; its variance and all other properties follow from it; in particular, the variance is equal to the mean. However, it has been noted [1,8] that the assumption of Poisson distribution is too restrictive: it predicts smaller variations than what is seen in the data.
|