Use of name recognition software, census data and multiple imputation to predict missing data on ethnicity: application to cancer registry records

DOI: 10.1186/1472-6947-12-3

Routine records from cancer screening services, name recognition software (Nam Pehchan and Onomap), 2001 national Census data, and multiple imputation were used to predict the ethnicity of the 23% of cases whose ethnicity was still missing following linkage with self-reported ethnicity from inpatient hospital records.

The name recognition software packages were good predictors of ethnicity for South Asian cancer cases when compared with data on ethnicity derived from hospital inpatient records, especially when combined (sensitivity 90.5%; specificity 99.9%; PPV 93.3%). Onomap was a poor predictor of ethnicity for other minority ethnic groups (sensitivity 4.4% for Black cases and 0.0% for Chinese/Other ethnic groups). Area-based data derived from the national Census were also a poor predictor of non-White ethnicity (sensitivity: South Asian 7.4%; Black 2.3%; Chinese/Other 0.0%; Mixed 0.0%).

Currently, neither method for assigning individuals to an ethnic group (name recognition and ethnic distribution of area of residence) performs well across all ethnic groups. We recommend further development of name recognition applications and the identification of additional methods for predicting ethnicity to improve their precision and accuracy for comparisons of health outcomes. However, real improvements can only come from better recording of ethnicity by health services.

This paper presents a method for imputing missing data on the ethnicity of cancer patients, developed for a regional cancer registry in the UK. It implements existing approaches in a novel situation and evaluates their utility. It combines four different approaches to dealing with missing data of this type: the use of an additional source of self-reported ethnicity to replace the missing data; the use of name recognition software to predict the ethnicity of individuals; the use of Census data based on area of residence to predict the ethnicity of individuals; and finally, the use of multiple imputation (MI) to make an allowance for


