Open Journal of Statistics 2025

Evaluating Utility of Machine Learning-Based Imputation Methods to Account for Attrition in Multi-Stage Epilepsy Prevalence Surveys

DOI: 10.4236/ojs.2025.153018, PP. 337-360

Daniel M. Mwanga, Isaac C. Kipchirchir, George O. Muhua, Charles R. Newton, Damazo T. Kadengye

Keywords: Prevalence, Missing Data, Machine Learning, Multiple Imputation, Inverse Probability Weighting, Attrition, Epilepsy, Population-Based Studies


Abstract:

Attrition is a common challenge in the statistical analysis of longitudinal or multi-stage cross-sectional studies. While strategies to reduce attrition should ideally be implemented during the study design phase, attrition remains common in real-world research, necessitating statistical methods to address it. Traditional approaches like multiple imputation (MI) and inverse probability weighting (IPW) rely on the assumption that data are missing at random (MAR), which is not always plausible. Recent developments in machine learning (ML) based methods offer promising alternatives because of their ability to capture complex patterns in data and handle non-linear relationships more effectively. This study examines four ML-based imputation methods to account for attrition and compares them with conventional MI and IPW in a two-stage epilepsy population-based prevalence survey involving 56,425 participants. Simulated attrition levels from 5% to 50% were applied following the MAR mechanism to assess the performance of the different methods. The simulation was replicated 100 times using different random seeds. Results showed that bias increased with the attrition level. Complete case analysis had the largest bias in all scenarios. k-nearest neighbor (KNN) and sequential KNN (sKNN) performed similarly to MI under MAR but exhibited less bias than MI and IPW when data were missing not at random (MNAR). While IPW performed similarly to MI under MAR, it had greater bias under MNAR. Both missForest and the MI implemented using random forest were outperformed by sKNN and KNN. We have demonstrated that even a small attrition proportion of 5% can significantly bias estimates if not properly addressed. ML methods, particularly sKNN and KNN, demonstrated potential for addressing attrition when data are MNAR. Choosing the appropriate method to address missing data should be preceded by an evaluation of the available methods that could be suitable for the data being analysed. Future research should explore ML methods in various study designs and consider integrating ML into the MI framework to improve prediction accuracy for missing data due to attrition.
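To make the MAR simulation idea concrete, the following is a minimal sketch and not the authors' code. It assumes a single hypothetical binary covariate `x`, hypothetical outcome probabilities (10% and 2%), and a dropout probability that depends only on `x`. It then compares a complete case estimate of prevalence with a simple IPW estimate whose weights are the inverse of the observed response rate within each level of `x`. The function names and all numeric settings other than the sample size are illustrative assumptions.

```python
import random


def simulate(n=56425, attrition=0.3, seed=1):
    """Simulate a binary outcome with MAR attrition (all settings hypothetical)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Observed covariate, e.g. a risk group known for every participant.
        x = 1 if rng.random() < 0.5 else 0
        # Outcome is more common when x == 1 (assumed probabilities).
        y = 1 if rng.random() < (0.10 if x else 0.02) else 0
        # MAR: dropout depends only on the observed x, never on y itself.
        p_drop = min(1.0, attrition * (1.5 if x else 0.5))
        observed = rng.random() >= p_drop
        rows.append((x, y, observed))
    return rows


def complete_case_estimate(rows):
    """Prevalence among participants who were not lost to attrition."""
    ys = [y for _, y, obs in rows if obs]
    return sum(ys) / len(ys)


def ipw_estimate(rows):
    """Prevalence with each observed row weighted by 1 / P(observed | x)."""
    p_obs = {}
    for level in (0, 1):
        flags = [obs for x, _, obs in rows if x == level]
        p_obs[level] = sum(flags) / len(flags)
    num = sum(y / p_obs[x] for x, y, obs in rows if obs)
    den = sum(1 / p_obs[x] for x, _, obs in rows if obs)
    return num / den


if __name__ == "__main__":
    for level in (0.05, 0.25, 0.50):
        data = simulate(attrition=level)
        truth = sum(y for _, y, _ in data) / len(data)
        print(level, truth, complete_case_estimate(data), ipw_estimate(data))
```

Because dropout in this sketch is more likely in the higher-prevalence group, the complete case estimate tends to understate prevalence, while the weighted estimate stays close to the full-sample value. Repeating the run over many seeds, as the study does with 100 replications, would give an average bias per attrition level.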

