Aims. The purpose of this study was to compare methods for handling missing data in analysis of the National Tuberculosis Surveillance System of the Centers for Disease Control and Prevention. Because of the high rate of missing human immunodeficiency virus (HIV) infection status in this dataset, we used multiple imputation methods to minimize the bias that may result from less sophisticated methods.

Methods. We compared analysis based on multiple imputation methods with analysis based on deleting subjects with missing covariate data from regression analysis (case exclusion), and determined whether the use of increasing numbers of imputed datasets would lead to changes in the estimated association between isoniazid resistance and death.

Results. Following multiple imputation, the odds ratio for initial isoniazid resistance and death was 2.07 (95% CI 1.30, 3.29); with case exclusion, this odds ratio decreased to 1.53 (95% CI 0.83, 2.83). The use of more than 5 imputed datasets did not substantively change the results.

Conclusions. Our experience with the National Tuberculosis Surveillance System dataset supports the use of multiple imputation methods in epidemiologic analysis, but also demonstrates that close attention should be paid to the potential impact of missing covariates at each step of the analysis.

1. Background

Missing data is a common problem in epidemiologic research. Analytic techniques used in multivariable analysis, such as regression models, rely on methods that exclude cases with missing covariate data from analysis. This missing data approach has important limitations. First, case exclusion will always lead to loss of statistical power. Second, case exclusion will introduce bias into the analysis if excluded subjects differ from included subjects in ways that are relevant for the parameter of interest. The potential for bias using case exclusion depends on the mechanism for missingness.
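The dependence of case-exclusion bias on the missingness mechanism can be illustrated with a small simulation. The sketch below is not drawn from the surveillance dataset: the linear model, the sample size, the true slope of 1.0, and both deletion rules are hypothetical choices made only for illustration. It generates a covariate x and an outcome y, then deletes x either for a random half of subjects (deletion that ignores all variables) or with a probability that depends only on the fully observed outcome y, and fits the regression on the remaining complete cases.

```python
import math
import random


def slope(pairs):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


def simulate(n=20000, true_slope=1.0, seed=0):
    """Return complete-case slopes under two hypothetical deletion rules."""
    rng = random.Random(seed)
    full = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = true_slope * x + rng.gauss(0.0, 1.0)
        full.append((x, y))

    # Rule 1 (assumed): x is deleted for a random half of subjects,
    # without regard to any measured or unmeasured variable.
    kept_random = [p for p in full if rng.random() < 0.5]

    # Rule 2 (assumed): the chance that x is recorded falls as the
    # fully observed outcome y rises, so deletion depends on observed data.
    kept_by_outcome = [
        (x, y) for x, y in full
        if rng.random() < 1.0 / (1.0 + math.exp(2.0 * y))
    ]

    return {
        "full": slope(full),
        "random_deletion": slope(kept_random),
        "outcome_dependent_deletion": slope(kept_by_outcome),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the complete-case slope after random deletion should stay close to the full-data slope, at the cost of a smaller sample, while the slope after outcome-dependent deletion should be noticeably attenuated. The exact values depend on the seed and on the arbitrary deletion probabilities chosen here.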
For missing-at-random (MAR) data, the missingness of a particular observation depends only on observed covariates; for missing-not-at-random (MNAR) data, missingness may depend on both observed and unobserved covariates. For either MAR or MNAR data, case exclusion can introduce bias, because subjects excluded from the analysis differ from subjects included in the analysis according to the measured or unmeasured covariates. In contrast, when data is missing-completely-at-random (MCAR), missingness can be considered a random deletion of observations without respect to measured or unmeasured covariates, and case exclusion does not lead to the