Multiple imputation compensates for missing data by generating several
completed datasets from regression models, and it is regarded as a remedy for
the long-standing limitations of univariate imputation. Univariate imputation
fills a missing cell using only the column in which that cell occurs.
Multivariate imputation instead works on all variables in all columns
simultaneously, whether their values are missing or observed. It has emerged as
a principal method for solving missing data problems. All incomplete datasets
analyzed before Multiple Imputation by Chained Equations (MICE) was introduced were misdiagnosed;
the results obtained were invalid and should not be relied on to draw reasonable
conclusions. This article highlights why multiple imputation is needed and how
MICE works, with a particular focus on a cyber-security dataset.

Removing
missing data from a dataset and replacing
it is imperative for analyzing the data and building prediction models.
A good imputation technique should therefore recover the missing values,
which involves extracting the informative features. However, the widely used
univariate imputation method does not impute missing values reasonably if the
values are too large, and may thus introduce bias. We therefore aim to propose an
alternative imputation method that is efficient and removes the potential bias
that remains once the missing values have been replaced.
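
The contrast between univariate and multivariate imputation described above can be illustrated with a minimal sketch. The toy columns `x` and `y` and the helper names `impute_univariate` and `impute_by_regression` are hypothetical and are not taken from this article or from any MICE implementation. The sketch fills a single missing cell once, from one fitted regression line. MICE goes further: it cycles through every incomplete variable, adds random variation to the predictions, and repeats the process to produce several completed datasets.

```python
from statistics import mean

# Hypothetical toy data: y depends on x, and None marks a missing cell.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, None]


def impute_univariate(column):
    """Fill missing cells with the mean of the observed cells in the same column."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def impute_by_regression(target, predictor):
    """Fill missing cells in `target` with predictions from a least-squares
    line fitted on the rows where both columns are observed."""
    pairs = [(p, t) for p, t in zip(predictor, target)
             if p is not None and t is not None]
    p_mean = mean(p for p, _ in pairs)
    t_mean = mean(t for _, t in pairs)
    sxx = sum((p - p_mean) ** 2 for p, _ in pairs)
    sxy = sum((p - p_mean) * (t - t_mean) for p, t in pairs)
    slope = sxy / sxx
    intercept = t_mean - slope * p_mean
    return [intercept + slope * p if t is None else t
            for p, t in zip(predictor, target)]


y_univariate = impute_univariate(y)        # uses only the y column
y_regression = impute_by_regression(y, x)  # also uses the x column

print("univariate fill:", y_univariate[4])
print("regression fill:", y_regression[4])
```

In this toy case the column mean pulls the imputed value toward the centre of the observed `y` values, whereas the regression fill follows the relationship between `x` and `y`. This is the kind of bias the univariate approach can introduce when the columns are related.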