OALib Journal (ISSN: 2333-9721)

Distributed Feature Screening via Inverse Regression

DOI: 10.12677/aam.2025.141034, PP. 344-351

Keywords: Ultrahigh Dimension, Gini Correlation Coefficient, Variable Screening, Feature Ranking


Abstract:

In this paper, we propose a distributed screening framework for big-data settings based on an inverse regression estimator. In the spirit of divide-and-conquer, the proposed framework expresses the dependence between the response and the features through an inverse regression model whose inverse conditional variance can be estimated in a distributed fashion. By aggregating the component estimates, we obtain a final inverse conditional variance estimator that can readily be used to screen features. The framework supports distributed storage and parallel computing and is therefore computationally attractive. Because the component parameters are estimated distributively and without bias, the final aggregated estimator achieves high accuracy and is insensitive to the number of data segments m. Under mild conditions, we show that the aggregated estimator is as efficient as the centralized estimator in terms of both the probability convergence bound and the mean squared error rate, and that the corresponding screening procedure enjoys the sure screening property for a wide range of correlation measures.
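The divide-and-conquer scheme the abstract describes — split the sample into m segments, compute an unbiased component statistic on each, average the m components, and rank features by the aggregated estimate — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: `component_utility` uses the squared marginal sample correlation as a hypothetical stand-in for the paper's inverse-regression conditional variance estimator, and all function names and parameters here are illustrative.

```python
import numpy as np

def component_utility(X, y):
    """Marginal utility of each feature on one data segment.

    Stand-in statistic: squared sample correlation with the response.
    (The paper's actual component is an inverse-regression conditional
    variance estimator; any unbiased component statistic can be
    aggregated the same way.)
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc / len(y)                  # per-feature covariance with y
    return (cov / (Xc.std(axis=0) * yc.std())) ** 2

def distributed_screening(X, y, m, d):
    """Divide-and-conquer screening: split the n rows into m segments,
    estimate the component utility on each segment (each segment could
    live on a separate machine), average the m estimates, and keep the
    d top-ranked features."""
    parts = zip(np.array_split(X, m), np.array_split(y, m))
    aggregated = np.mean([component_utility(Xs, ys) for Xs, ys in parts],
                         axis=0)              # final aggregated estimator
    return np.argsort(aggregated)[::-1][:d]   # indices of the top-d features

# Toy check: only features 0 and 1 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=2000)
top = distributed_screening(X, y, m=10, d=2)
print(sorted(top.tolist()))  # [0, 1]
```

Note that only the m low-dimensional component estimates travel between machines, not the raw data, which is what makes the aggregation insensitive to m and communication-cheap.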

