%0 Journal Article
%T Evaluating Utility of Machine Learning-Based Imputation Methods to Account for Attrition in Multi-Stage Epilepsy Prevalence Surveys
%A Daniel M. Mwanga
%A Isaac C. Kipchirchir
%A George O. Muhua
%A Charles R. Newton
%A Damazo T. Kadengye
%J Open Journal of Statistics
%P 337-360
%@ 2161-7198
%D 2025
%I Scientific Research Publishing
%R 10.4236/ojs.2025.153018
%X Attrition is a common challenge in the statistical analysis of longitudinal and multi-stage cross-sectional studies. While strategies to reduce attrition should ideally be implemented during the study design phase, attrition remains common in real-world research, necessitating statistical methods to address it. Traditional approaches such as multiple imputation (MI) and inverse probability weighting (IPW) rely on the assumption that data are missing at random (MAR), which is not always plausible. Recent machine learning (ML) based methods offer promising alternatives because they can capture complex patterns in data and handle non-linear relationships more effectively. This study examines four ML-based imputation methods to account for attrition and compares them with conventional MI and IPW in a two-stage population-based epilepsy prevalence survey involving 56,425 participants. Attrition levels from 5% to 50% were simulated following the MAR mechanism to assess the performance of the different methods, and each simulation was replicated 100 times using different random seeds. Bias increased as the attrition level increased. Complete case analysis had the largest bias in all scenarios. k-nearest neighbor (KNN) and sequential KNN (sKNN) performed similarly to MI under MAR but exhibited less bias than MI and IPW when data were missing not at random (MNAR). While IPW performed similarly to MI under MAR, it had greater bias under MNAR. Both missForest and MI implemented using random forest were outperformed by sKNN and KNN. We have demonstrated that even a small attrition proportion of 5% can significantly bias estimates if not properly addressed. While MI remains the preferred method for missing data assuming MAR, ML methods, particularly sKNN and KNN, demonstrated potential for addressing attrition when data are MNAR.
Choosing a method to address missing data should be preceded by an evaluation of the available methods that could suit the data being analysed. Future research should explore ML methods in various study designs and consider integrating ML into the robust MI framework to improve prediction accuracy for data missing due to attrition.
%K Prevalence
%K Missing Data
%K Machine Learning
%K Multiple Imputation
%K Inverse Probability Weighting
%K Attrition
%K Epilepsy
%K Population-Based Studies
%U http://www.scirp.org/journal/PaperInformation.aspx?PaperID=143749