|
Genome Biology 2010
Gene ontology analysis for RNA-seq: accounting for selection biasAbstract: Next generation sequencing of RNA (RNA-seq) gives unprecedented detail about the transcriptional landscape of an organism. Not only is it possible to accurately measure expression levels of transcripts in a sample [1], but this new technology promises to deliver a range of additional benefits, such as the investigation of alternative splicing [2], allele specific expression [3] and RNA editing [4]. However, in order to accurately make use of the data, it is vital that analysis techniques are developed to take into account the technical features of RNA-seq output. As many of the specific technical properties of RNA-seq data are not present in previous technologies such as microarrays, naive application of the same analysis methodologies, developed for these older technologies, may lead to bias in the results.In RNA-seq experiments the expression level of a transcript is estimated from the number of reads that map to that transcript. In many applications, the expected read count for a transcript is proportional to the gene's expression level multiplied by its transcript length. Therefore, even when two transcripts are expressed at the same level, differences in length will yield differing numbers of total reads. One consequence of this is that longer transcripts give more statistical power for detecting differential expression between samples [5]. Similarly, more highly expressed transcripts have a greater number of reads and greater power to detect differential expression. Hence, long or highly expressed transcripts are more likely to be detected as differentially expressed compared with their short and/or lowly expressed counterparts. The fact that statistical power increases with the number of reads is an unavoidable property of count data, which cannot be removed by normalization or re-scaling. Consequently, it is unsurprising that this selection bias has been shown to exist in a range of different experiments performed using different analysis methods, experiment
|