Data sets commonly contain both categorical and continuous covariates.
However, most feature screening methods for ultrahigh-dimensional classification assume that all covariates are continuous,
so very few screening methods are applicable to mixed-type data. To
handle this non-trivial situation, we propose a model-free feature screening procedure for ultrahigh-dimensional
multi-class classification with both categorical and continuous covariates. The
proposed procedure uses Gini impurity to evaluate
the predictive power of each covariate. Under certain regularity conditions, we
prove that the proposed screening procedure possesses the sure screening
and ranking consistency properties. We demonstrate the finite-sample
performance of the proposed procedure through simulation studies and illustrate
it with a real data analysis.
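
To make the idea of ranking covariates by Gini impurity concrete, the following Python sketch scores each covariate by the relative reduction in the Gini impurity of the class label after grouping on that covariate. It is only an illustration under stated assumptions, not the statistic defined in the paper: the function names (`gini_impurity`, `gini_screening_score`, `screen`), the quantile slicing of continuous covariates, the number of slices `n_slices`, and the normalization by the total impurity are all hypothetical choices made for this example.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_screening_score(x, y, categorical, n_slices=4):
    """Relative reduction in the Gini impurity of y after grouping on x.

    Illustrative index only; the paper's exact statistic may differ.
    """
    if categorical:
        groups = x
    else:
        # Assumption: discretize a continuous covariate at its empirical quantiles.
        cuts = np.quantile(x, np.linspace(0, 1, n_slices + 1)[1:-1])
        groups = np.digitize(x, cuts)
    total = gini_impurity(y)
    if total == 0.0:
        return 0.0  # a single class: no impurity left to reduce
    # Weighted average impurity of y within each group of x.
    conditional = sum(
        np.mean(groups == g) * gini_impurity(y[groups == g])
        for g in np.unique(groups)
    )
    return (total - conditional) / total

def screen(X, y, is_categorical, d):
    """Return the indices of the d highest-scoring covariates and all scores."""
    scores = np.array([
        gini_screening_score(X[:, j], y, is_categorical[j])
        for j in range(X.shape[1])
    ])
    return np.argsort(-scores)[:d], scores

# Toy data: three classes, one informative categorical covariate,
# one constant (uninformative) covariate, one informative continuous covariate.
y = np.repeat([0, 1, 2], 4)
X = np.column_stack([y.astype(float), np.zeros(12), np.arange(12.0)])
selected, scores = screen(X, y, is_categorical=[True, True, False], d=2)
```

In this toy example the constant covariate receives a score of zero and is screened out, while the two informative covariates are retained. In practice the number of retained covariates `d` and the slicing scheme would need to be chosen with care; the values here are placeholders.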