Background: The Poisson and the Negative Binomial distributions are commonly used to
model count data. The Poisson is characterized by the equality of mean and variance, whereas the Negative Binomial
has a variance larger than the mean and is therefore appropriate
for modeling over-dispersed count data. Objectives: A new
two-parameter probability distribution called the Quasi-Negative Binomial
Distribution (QNBD), which generalizes the well-known negative binomial
distribution, is studied in this paper. This model
turns out to be quite flexible for analyzing count data. Our main objectives
are to estimate the parameters of the proposed distribution and to discuss its applicability to genetics data. As an
application, we demonstrate that the QNBD regression representation can be used to model genomics data sets. Results: The new distribution is shown to provide a good fit as judged by the
Akaike Information Criterion (AIC), a widely used measure of relative model
fit. The proposed distribution may serve as a viable alternative to other
distributions available in the literature for modeling count data exhibiting
overdispersion, arising in various fields of scientific investigation such as
genomics and biomedicine.
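
The mean-variance contrast and the AIC comparison described above can be illustrated with a minimal sketch. It uses the ordinary negative binomial, not the QNBD, whose probability function is not given in this abstract. The count vector is made up for illustration, and the grid search over the size parameter `r` is a crude stand-in for a proper maximum likelihood routine; all function and variable names are assumptions of this sketch.

```python
import math

def poisson_loglik(counts, mu):
    """Log-likelihood of i.i.d. Poisson(mu) counts."""
    return sum(k * math.log(mu) - mu - math.lgamma(k + 1) for k in counts)

def negbin_loglik(counts, mu, r):
    """Log-likelihood of negative binomial counts with mean mu and size r,
    so that the variance is mu + mu**2 / r (always larger than the mean)."""
    p = r / (r + mu)
    return sum(
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log(1.0 - p)
        for k in counts
    )

def aic(loglik, n_params):
    """Akaike Information Criterion: smaller values indicate a better trade-off."""
    return 2 * n_params - 2 * loglik

# Hypothetical over-dispersed counts (not data from the paper).
counts = [0, 0, 1, 1, 2, 3, 5, 8, 13, 21]
n = len(counts)
mean = sum(counts) / n
var = sum((k - mean) ** 2 for k in counts) / (n - 1)

# Poisson: the maximum likelihood estimate of mu is the sample mean.
aic_pois = aic(poisson_loglik(counts, mean), 1)

# Negative binomial: for any fixed r the sample mean maximizes the likelihood
# in mu, so only r is searched, here on a coarse grid (an assumption).
r_grid = [0.05 * i for i in range(1, 401)]
r_hat = max(r_grid, key=lambda r: negbin_loglik(counts, mean, r))
aic_nb = aic(negbin_loglik(counts, mean, r_hat), 2)

print(f"sample mean = {mean:.2f}, sample variance = {var:.2f}")
print(f"AIC Poisson = {aic_pois:.2f}, AIC negative binomial = {aic_nb:.2f}")
```

When the sample variance clearly exceeds the sample mean, as it does for these counts, the negative binomial attains the lower AIC despite its extra parameter. A candidate generalization such as the QNBD would be compared in the same way, with its own log-likelihood and parameter count.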