The classical Mahalanobis distance is widely used to detect outliers, but it is
itself affected by outliers because it relies on the sample mean and covariance matrix. Robust versions of the Mahalanobis distance have been proposed based on the fast MCD estimator. However, the bias of the MCD estimator
increases significantly as the dimension increases. In this paper, we propose an
improved Mahalanobis distance for high-dimensional data, based on the more
robust Rocke estimator. The results of numerical simulation and empirical
analysis show that, when the data contain outliers and the dimension is very
high, the proposed method detects outliers better than both the classical and
the MCD-based Mahalanobis distances.
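
To make the sensitivity of the classical distance concrete, the following is a minimal sketch and not the method of this paper. It assumes NumPy, a synthetic two-dimensional sample, and a hypothetical helper named mahalanobis_sq. The "reference" distances use the mean and covariance of the points known to be clean; this oracle only stands in for a robust estimate such as MCD or the Rocke estimator, neither of which is implemented here.

```python
import math

import numpy as np


def mahalanobis_sq(X, center, cov):
    """Squared Mahalanobis distance of each row of X from `center` under `cov`."""
    diff = X - center
    # Solve cov @ z = diff^T rather than forming the explicit inverse.
    z = np.linalg.solve(cov, diff.T).T
    return np.einsum("ij,ij->i", diff, z)


# Synthetic data (an assumption for illustration): 80 clean points
# and a tight cluster of 20 outliers near (10, 10).
rng = np.random.default_rng(0)
clean = rng.standard_normal((80, 2))
outliers = 10.0 + 0.1 * rng.standard_normal((20, 2))
X = np.vstack([clean, outliers])
is_outlier = np.arange(len(X)) >= len(clean)

# Classical distances: mean and covariance estimated from all points,
# so the outliers contaminate both estimates.
d2_classical = mahalanobis_sq(X, X.mean(axis=0), np.cov(X, rowvar=False))

# Oracle reference: estimates from the known clean points only.
# This is a stand-in for a robust estimator, not MCD or Rocke.
d2_reference = mahalanobis_sq(
    X, clean.mean(axis=0), np.cov(clean, rowvar=False)
)

# 0.975 quantile of the chi-square distribution with 2 degrees of freedom,
# which has the closed form -2 * ln(1 - 0.975).
cutoff = -2.0 * math.log(0.025)

n_out = int(is_outlier.sum())
print("flagged (classical):", int((d2_classical[is_outlier] > cutoff).sum()), "of", n_out)
print("flagged (reference):", int((d2_reference[is_outlier] > cutoff).sum()), "of", n_out)
```

Comparing the two printed counts shows how strongly the contaminated mean and covariance shrink the distances of the outlying cluster. The closed-form cutoff is valid only in two dimensions; in higher dimensions the chi-square quantile would have to come from a statistics library.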