Background. A number of algorithms exist for analysing RNA-sequencing data to infer profiles of differential gene expression. Building such algorithms around statistical models of overdispersed count data poses formidable problems, which frequently lead to non-uniform p-value distributions for null-hypothesis data and to inaccurate estimates of false discovery rates (FDRs). These shortcomings can in turn give an inaccurate measure of significance and a loss of power to detect differential expression.