%0 Journal Article
%T Comparing Imputation Procedures for Affymetrix Gene Expression Datasets Using MAQC Datasets
%A Sreevidya Sadananda Sadasiva Rao
%A Lori A. Shepherd
%A Andrew E. Bruno
%A Song Liu
%A Jeffrey C. Miecznikowski
%J Advances in Bioinformatics
%D 2013
%I Hindawi Publishing Corporation
%R 10.1155/2013/790567
%X Introduction. The microarray datasets from the MicroArray Quality Control (MAQC) project have enabled the assessment of the precision, comparability of microarrays, and other various microarray analysis methods. However, to date no studies that we are aware of have reported the performance of missing value imputation schemes on the MAQC datasets. In this study, we use the MAQC Affymetrix datasets to evaluate several imputation procedures in Affymetrix microarrays. Results. We evaluated several cutting edge imputation procedures and compared them using different error measures. We randomly deleted 5% and 10% of the data and imputed the missing values using imputation tests. We performed 1000 simulations and averaged the results. The results for both 5% and 10% deletion are similar. Among the imputation methods, we observe the local least squares method with is most accurate under the error measures considered. The k-nearest neighbor method with has the highest error rate among imputation methods and error measures. Conclusions. We conclude for imputing missing values in Affymetrix microarray datasets, using the MAS 5.0 preprocessing scheme, the local least squares method with has the best overall performance and k-nearest neighbor method with has the worst overall performance. These results hold true for both 5% and 10% missing values. 1. Introduction In microarray experiments, randomly missing values may occur due to scratches on the chip, spotting errors, dust, or hybridization errors. Other nonrandom missing values may be biological in nature, for example, probes with low intensity values or intensity values that may exceed a readable threshold. These missing values will create incomplete gene expression matrices where the rows refer to genes and the columns refer to samples. These incomplete expression matrices will make it difficult for researchers to perform downstream analyses such as differential expression inference, clustering or dimension reduction methods (e.g., principal components analysis), or multidimensional scaling. Hence, it is critical to understand the nature of the missing values and to choose an accurate method to impute the missing values. There have been several methods put forth to impute missing data in microarray experiments. In one of the first papers related to microarrays, Troyanskaya et al. [1] examine several methods of imputing missing data and ultimately suggest a -nearest neighbors approach. Researchers also explored applying previously developed schemes for microarrays such as the nonlinear iterative partial least
%U http://www.hindawi.com/journals/abi/2013/790567/