This work studies a comprehensive framework for traditional outlier detection techniques based on simple and multiple linear regression models. Two data sets were used to illustrate and evaluate each class of outlier detection techniques (analytical and graphical methods). Outlier detection aims to identify anomalous observations so that the data can be analyzed more reliably and a suitable model can be built. The different methods were also compared to highlight the advantages, disadvantages and performance issues of each class of techniques. The results show that removing the influential points (or outliers) improved model adequacy (R² rose from 0.72 to 0.97). Jackknife residuals and Atkinson’s measure proved very useful in detecting outliers; hence, both methods are recommended for outlier detection.
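As a rough illustration of the two recommended diagnostics, the following sketch computes jackknife (externally studentized) residuals and Atkinson's measure for a simple linear regression. It uses the common textbook definitions, which are assumed here and may differ in detail from the forms used in the paper. The function name `simple_ols_diagnostics`, the synthetic data and the planted outlier are all hypothetical and are not taken from the paper's data sets.

```python
import math


def simple_ols_diagnostics(x, y):
    """Fit y = b0 + b1*x by least squares and return R^2 plus per-point diagnostics."""
    n, p = len(x), 2  # p = number of fitted parameters (intercept and slope)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar

    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1 - sse / sst

    # Leverage of each point (diagonal of the hat matrix) for simple regression.
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

    jackknife, atkinson = [], []
    for e, h in zip(resid, lev):
        # Residual variance estimated with point i deleted (assumed textbook form).
        s2_deleted = (sse - e * e / (1 - h)) / (n - p - 1)
        t = e / math.sqrt(s2_deleted * (1 - h))
        jackknife.append(t)
        # Atkinson's measure: jackknife residual scaled by a leverage term.
        atkinson.append(abs(t) * math.sqrt((n - p) / p * h / (1 - h)))
    return {"r2": r2, "jackknife": jackknife, "atkinson": atkinson}


# Synthetic data (assumption): a straight line with small fixed noise.
x = [float(i) for i in range(1, 13)]
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.05, -0.15, 0.1, -0.2, 0.15, 0.0]
y = [2 + 0.5 * xi + ni for xi, ni in zip(x, noise)]
y[9] += 8.0  # planted outlier at index 9

full = simple_ols_diagnostics(x, y)
# Flag the point with the largest absolute jackknife residual.
worst = max(range(len(x)), key=lambda i: abs(full["jackknife"][i]))

# Refit without the flagged point and compare model adequacy.
x_clean = x[:worst] + x[worst + 1:]
y_clean = y[:worst] + y[worst + 1:]
clean = simple_ols_diagnostics(x_clean, y_clean)

print("flagged index:", worst)
print("R^2 with outlier:   ", round(full["r2"], 3))
print("R^2 without outlier:", round(clean["r2"], 3))
```

In this toy setting, both diagnostics single out the planted point, and refitting without it raises R², which mirrors the qualitative effect reported above. The numbers themselves are not comparable to the paper's results.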
Cite this paper
Arimie, C. O., Biu, E. O. and Ijomah, M. A. (2020). Outlier Detection and Effects on Modeling. Open Access Library Journal, 7, e6619. doi: http://dx.doi.org/10.4236/oalib.1106619.