Study on the Missing Data Mechanisms and Imputation Methods

DOI: 10.4236/ojs.2021.114030, PP. 477-492

Keywords: Missing Data, Mechanisms, Imputation Techniques, Models

Abstract:

Missing values in an observed dataset are a real hindrance to obtaining valid results in statistical research. This paper addresses the widespread missing data problem faced by analysts and statisticians in academic and professional settings, and studies several data-driven methods for recovering accurate data. Any project that relies heavily on data faces this problem, and because machine learning models are only as good as the data used to train them, missing data directly affects the solutions developed for real-world problems. This study therefore attempts to address the problem by considering different missing data mechanisms. It tests the effectiveness of both traditional and modern imputation techniques by determining the loss of statistical power incurred when each approach is used to handle missing data. The aim is to establish which methods perform best for this problem. The recommended approach is Multivariate Imputation by Chained Equations (MICE) when the data are missing at random (MAR).
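As a rough illustration of why the missingness mechanism matters when comparing imputation techniques, the sketch below simulates MAR missingness and compares mean imputation with stochastic regression imputation. It is not the paper's experiment and not a full MICE implementation: with a single incomplete variable, a chained-equations procedure reduces to one conditional model, which is what is shown here. The data-generating model, the 70% missingness rate, and all function names are assumptions made for the example.

```python
import math
import random
import statistics


def simulate(n=2000, seed=0):
    """Return (x, y_true, y_obs); y_obs has MAR gaps driven by the observed x."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y_true = [2.0 + 1.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
    # MAR: the chance that y is missing depends only on x, which is always observed.
    y_obs = [
        None if (xi > 0 and rng.random() < 0.7) else yi
        for xi, yi in zip(x, y_true)
    ]
    return x, y_true, y_obs


def mean_impute(y_obs):
    """Traditional approach: fill every gap with the mean of the observed values."""
    observed_mean = statistics.fmean(v for v in y_obs if v is not None)
    return [observed_mean if v is None else v for v in y_obs]


def regression_impute(x, y_obs, seed=1):
    """Fill gaps with draws from a linear model of y on x fitted to complete cases."""
    rng = random.Random(seed)
    pairs = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
    mx = statistics.fmean(p[0] for p in pairs)
    my = statistics.fmean(p[1] for p in pairs)
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    slope = sum((xi - mx) * (yi - my) for xi, yi in pairs) / sxx
    intercept = my - slope * mx
    # Residual standard deviation, used to add noise so variance is not understated.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in pairs)
    resid_sd = math.sqrt(rss / (len(pairs) - 2))
    return [
        intercept + slope * xi + rng.gauss(0.0, resid_sd) if yi is None else yi
        for xi, yi in zip(x, y_obs)
    ]


x, y_true, y_obs = simulate()
y_mean_imp = mean_impute(y_obs)
y_reg_imp = regression_impute(x, y_obs)

# Bias of the estimated mean of y relative to the fully observed sample.
true_mean = statistics.fmean(y_true)
bias_mean = statistics.fmean(y_mean_imp) - true_mean
bias_reg = statistics.fmean(y_reg_imp) - true_mean

print(f"missing fraction:          {y_obs.count(None) / len(y_obs):.2f}")
print(f"bias, mean imputation:       {bias_mean:+.3f}")
print(f"bias, regression imputation: {bias_reg:+.3f}")
```

Because the gaps are concentrated where x is large, the observed values of y are systematically low, so mean imputation inherits that bias, while a model that conditions on x can correct for it. A real analysis would use a maintained implementation and multiple imputed datasets rather than this single-imputation sketch.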
