The absence of some data values in an observed dataset is a real hindrance to
achieving valid results in statistical research. This paper addresses the
widespread missing data problem faced by analysts and statisticians in academia
and professional environments. Several data-driven methods were studied to
obtain accurate data. Projects that rely heavily on data face this problem, and
because machine learning models are only as good as the data used to train
them, missing data has a real impact on the solutions developed for real-world
problems. This dissertation therefore attempts to solve the problem using
different mechanisms. It tests the effectiveness of both traditional and modern
data imputation techniques by determining the loss of statistical power when
each approach is used to handle missing data. The aim of the dissertation is to
establish which methods perform best on the research problem. The recommended
approach for data that are missing at random (MAR) is Multivariate Imputation
by Chained Equations (MICE).
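
To make the comparison between traditional and model-based imputation concrete, the following is a minimal sketch and not the procedure used in this dissertation. It assumes a synthetic two-variable dataset in which y is missing with a probability that depends only on the fully observed x (a MAR mechanism), and it contrasts mean imputation with a single stochastic regression imputation, which is one conditional-model step of the kind that MICE repeats across variables and across multiple imputed datasets. All function names, the slope of 2.0, and the 60% missingness rate are illustrative assumptions.

```python
import random
import statistics


def simulate_mar(n=2000, seed=0):
    """Return x, complete y, and y with MAR gaps (None marks a missing value)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
    # MAR: the chance that y is missing depends only on the observed x.
    y_obs = [None if (xi > 0 and rng.random() < 0.6) else yi
             for xi, yi in zip(x, y)]
    return x, y, y_obs


def mean_impute(y_obs):
    """Traditional approach: fill every gap with the observed mean."""
    observed = [v for v in y_obs if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in y_obs]


def regression_impute(x, y_obs, seed=1):
    """Fit y ~ x on complete cases, then fill gaps with prediction plus noise."""
    rng = random.Random(seed)
    pairs = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
    mx = statistics.fmean([p[0] for p in pairs])
    my = statistics.fmean([p[1] for p in pairs])
    slope = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
             / sum((xi - mx) ** 2 for xi, _ in pairs))
    intercept = my - slope * mx
    # Residual spread is added back so imputed values are not artificially tidy.
    resid_sd = statistics.stdev(
        [yi - (intercept + slope * xi) for xi, yi in pairs])
    return [intercept + slope * xi + rng.gauss(0.0, resid_sd)
            if yi is None else yi
            for xi, yi in zip(x, y_obs)]


if __name__ == "__main__":
    x, y, y_obs = simulate_mar()
    print("complete-data mean of y:", round(statistics.fmean(y), 3))
    print("after mean imputation:", round(statistics.fmean(mean_impute(y_obs)), 3))
    print("after regression imputation:",
          round(statistics.fmean(regression_impute(x, y_obs)), 3))
```

Under this mechanism the observed y values over-represent cases with low x, so mean imputation carries that bias into the completed data, while the regression step uses x to correct for it. A full MICE analysis would additionally cycle through every incomplete variable and pool estimates over several imputed datasets.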