Optimal Subsampling for Large-Sample Gamma Regression
Abstract:
With the rapid development of the computer industry, human society is moving into the era of big data. Gamma regression models play an important role in analyzing large samples of right-skewed, heavy-tailed data, so estimating the parameters of interest in Gamma regression both quickly and accurately has become a question worth serious attention. In this paper, we propose two two-step algorithms that efficiently approximate the full-data maximum likelihood estimator of Gamma regression with Φ known and with Φ unknown, respectively, thereby addressing both the single-parameter and the two-parameter large-sample Gamma regression estimation problems. First, for the case where Φ is known, we show that, conditional on the full data, the general subsampling estimator is asymptotically normal, and we derive the optimal subsampling probabilities that minimize the asymptotic mean squared error of the estimator. To further reduce the computational cost, we also propose an alternative set of optimal subsampling probabilities. Because the optimal subsampling probabilities depend on the unknown parameters, we then develop a single-parameter two-step algorithm. Second, for the case where Φ is unknown, we build a two-parameter two-step algorithm on top of the single-parameter one. Finally, numerical simulations show that both algorithms are computationally efficient, and confirm that the estimates produced by the single-parameter two-step algorithm differ little from those produced by the two-parameter two-step algorithm.
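To make the two-step workflow concrete, the following is a minimal illustrative Python sketch, not the paper's exact algorithm. It assumes a log link E[y|x] = exp(x'β), treats Φ as the Gamma shape parameter, and in the second step uses probabilities proportional to |y_i/μ_i − 1|·‖x_i‖, an L-optimality-style surrogate for the computationally cheaper criterion mentioned in the abstract (the MSE-minimizing version would also involve the inverse Hessian). The function names gamma_mle and two_step_subsample, the pilot size r0, and all numerical settings are hypothetical choices made for illustration.

```python
# Minimal sketch (assumed details, not the paper's exact algorithm):
# two-step optimal subsampling for Gamma regression with log link
# E[y|x] = exp(x'beta) and known shape parameter phi.

import numpy as np

def gamma_mle(X, y, phi, p=None, iters=50):
    """Newton iterations for the (inverse-probability-weighted) Gamma
    regression MLE; p holds the sampling probabilities of the rows."""
    n, d = X.shape
    w = np.ones(n) if p is None else 1.0 / (n * p)        # IPW weights
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]   # crude start
    for _ in range(iters):
        mu = np.exp(X @ beta)
        grad = X.T @ (w * phi * (y / mu - 1.0))           # weighted score
        hess = -(X.T * (w * phi * y / mu)) @ X            # weighted Hessian
        beta -= np.linalg.solve(hess, grad)               # Newton step
    return beta

def two_step_subsample(X, y, phi, r0, r, rng):
    n = X.shape[0]
    # Step 1: uniform pilot subsample gives a rough estimate beta0.
    idx0 = rng.choice(n, size=r0, replace=True)
    beta0 = gamma_mle(X[idx0], y[idx0], phi, p=np.full(r0, 1.0 / n))
    # Step 2: nonuniform probabilities proportional to |y/mu - 1| * ||x||,
    # then refit on the second-stage subsample with IPW weighting.
    mu = np.exp(X @ beta0)
    score = np.abs(y / mu - 1.0) * np.linalg.norm(X, axis=1)
    prob = score / score.sum()
    idx = rng.choice(n, size=r, replace=True, p=prob)
    return gamma_mle(X[idx], y[idx], phi, p=prob[idx])

# Example usage on synthetic Gamma data with mean exp(X @ beta_true):
rng = np.random.default_rng(0)
n, d = 100_000, 5
X = np.column_stack([np.ones(n), rng.standard_normal((n, d - 1))])
beta_true, phi = np.full(d, 0.5), 2.0
y = rng.gamma(shape=phi, scale=np.exp(X @ beta_true) / phi)
beta_hat = two_step_subsample(X, y, phi, r0=500, r=5000, rng=rng)
```

With Φ known, Step 1 only needs a small uniform pilot to obtain a rough estimate, and Step 2 reuses it to compute the nonuniform probabilities in a single pass over the full data, which is what makes the approach cheap relative to the full-data MLE.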