Advances in Bioinformatics 2013

Comparing Imputation Procedures for Affymetrix Gene Expression Datasets Using MAQC Datasets

DOI: 10.1155/2013/790567

Sreevidya Sadananda Sadasiva Rao,Lori A. Shepherd,Andrew E. Bruno,Song Liu,Jeffrey C. Miecznikowski

Abstract:

Introduction. The microarray datasets from the MicroArray Quality Control (MAQC) project have enabled the assessment of the precision and comparability of microarrays, as well as of various microarray analysis methods. However, to date, no studies that we are aware of have reported the performance of missing value imputation schemes on the MAQC datasets. In this study, we use the MAQC Affymetrix datasets to evaluate several imputation procedures in Affymetrix microarrays.

Results. We evaluated several cutting-edge imputation procedures and compared them using different error measures. We randomly deleted 5% and 10% of the data and imputed the missing values using the imputation procedures under comparison. We performed 1000 simulations and averaged the results. The results for both 5% and 10% deletion are similar. Among the imputation methods, we observe that the local least squares method with [...] is most accurate under the error measures considered. The k-nearest neighbor method with [...] has the highest error rate among imputation methods and error measures.

Conclusions. We conclude that, for imputing missing values in Affymetrix microarray datasets preprocessed with the MAS 5.0 scheme, the local least squares method with [...] has the best overall performance and the k-nearest neighbor method with [...] has the worst overall performance. These results hold true for both 5% and 10% missing values.

1. Introduction

In microarray experiments, randomly missing values may occur due to scratches on the chip, spotting errors, dust, or hybridization errors. Other, nonrandom missing values may be biological in nature, for example, probes with low intensity values or intensity values that exceed a readable threshold. These missing values create incomplete gene expression matrices, where the rows refer to genes and the columns refer to samples. Incomplete expression matrices make it difficult for researchers to perform downstream analyses such as differential expression inference, clustering, dimension reduction (e.g., principal components analysis), or multidimensional scaling. Hence, it is critical to understand the nature of the missing values and to choose an accurate method to impute them.

Several methods have been put forth to impute missing data in microarray experiments. In one of the first papers related to microarrays, Troyanskaya et al. [1] examine several methods of imputing missing data and ultimately suggest a k-nearest neighbors approach. Researchers also explored applying previously developed schemes for microarrays such as the nonlinear iterative partial least
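
The evaluation design summarized in the abstract (delete a fixed fraction of known values at random, impute them, score the imputed values against the truth, and average over repeated simulations) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names (knn_impute, nrmse, simulate) are hypothetical, the normalized root mean squared error is an assumed error measure because the abstract does not name the measures used, the k-nearest neighbor step uses a plain unweighted average of neighbors, missing entries are encoded as None, and the demonstration data are synthetic rather than MAQC measurements.

```python
import math
import random


def knn_impute(rows, k=10):
    """Fill None entries in a genes-by-samples table from the k most similar genes."""
    filled = [list(r) for r in rows]
    for i, gene in enumerate(rows):
        for j, value in enumerate(gene):
            if value is not None:
                continue
            scored = []
            for other in rows:
                if other[j] is None:  # a neighbour must be observed in sample j
                    continue
                # Mean squared difference over samples observed in both genes.
                shared = [(a - b) ** 2 for a, b in zip(gene, other)
                          if a is not None and b is not None]
                dist = sum(shared) / len(shared) if shared else math.inf
                scored.append((dist, other[j]))
            scored.sort(key=lambda t: t[0])
            nearest = [v for _, v in scored[:k]]
            if nearest:  # entry stays None if no gene has sample j observed
                filled[i][j] = sum(nearest) / len(nearest)  # unweighted average
    return filled


def nrmse(true, imputed):
    """Root mean squared error scaled by the standard deviation of the true values."""
    mean_true = sum(true) / len(true)
    var_true = sum((t - mean_true) ** 2 for t in true) / len(true)
    mse = sum((p - t) ** 2 for t, p in zip(true, imputed)) / len(true)
    return math.sqrt(mse / var_true)


def simulate(full, frac=0.05, n_sims=100, k=10, seed=0):
    """Average NRMSE over n_sims rounds of random deletion and imputation."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(full)) for j in range(len(full[0]))]
    n_delete = round(frac * len(cells))
    errors = []
    for _ in range(n_sims):
        deleted = rng.sample(cells, n_delete)  # exact fraction, without replacement
        masked = [list(r) for r in full]
        for i, j in deleted:
            masked[i][j] = None
        imputed = knn_impute(masked, k=k)
        errors.append(nrmse([full[i][j] for i, j in deleted],
                            [imputed[i][j] for i, j in deleted]))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic stand-in for a log-scale expression matrix (not MAQC data):
    # each gene is a scaled copy of a shared sample profile plus noise.
    profile = [rng.gauss(0, 1) for _ in range(12)]
    data = []
    for _ in range(80):
        scale = rng.gauss(0, 1)
        data.append([scale * p + rng.gauss(0, 0.1) for p in profile])
    for frac in (0.05, 0.10):
        print(frac, simulate(data, frac=frac, n_sims=5, k=5))
```

The demonstration runs only 5 simulations per deletion rate to stay fast, whereas the study reports 1000. Comparing methods in this framework would amount to swapping knn_impute for another imputation function while keeping the deletion and scoring steps fixed.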

References

[1]	O. Troyanskaya, M. Cantor, G. Sherlock et al., “Missing value estimation methods for DNA microarrays,” Bioinformatics, vol. 17, no. 6, pp. 520–525, 2001.
[2]	H. Wold, “Path models with latent variables: the NIPALS approach,” in Quantitative Sociology: International Perspectives on Mathematical and Statistical Modeling, pp. 307–357, 1975.
[3]	S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara, and S. Ishii, “A Bayesian missing value estimation method for gene expression profile data,” Bioinformatics, vol. 19, no. 16, pp. 2088–2096, 2003.
[4]	T. H. Bø, B. Dysvik, and I. Jonassen, “LSimpute: accurate estimation of missing values in microarray data with least squares methods,” Nucleic Acids Research, vol. 32, no. 3, p. e34, 2004.
[5]	H. Kim, G. H. Golub, and H. Park, “Missing value estimation for DNA microarray gene expression data: local least squares imputation,” Bioinformatics, vol. 21, no. 2, pp. 187–198, 2005.
[6]	M. Ouyang, W. J. Welsh, and P. Georgopoulos, “Gaussian mixture clustering and imputation of microarray data,” Bioinformatics, vol. 20, no. 6, pp. 917–923, 2004.
[7]	J. C. Miecznikowski, S. Damodaran, K. F. Sellers, D. E. Coling, R. Salvi, and R. A. Rabin, “A comparison of imputation procedures and statistical tests for the analysis of two-dimensional electrophoresis data,” Proteome Science, vol. 9, p. 14, 2011.
[8]	G. N. Brock, J. R. Shaffer, R. E. Blakesley, M. J. Lotz, and G. C. Tseng, “Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes,” BMC Bioinformatics, vol. 9, no. 1, p. 12, 2008.
[9]	M. Celton, A. Malpertuy, G. Lelandais, and A. G. de Brevern, “Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments,” BMC Genomics, vol. 11, no. 1, p. 15, 2010.
[10]	S. Oh, D. D. Kang, G. N. Brock, and G. C. Tseng, “Biological impact of missing-value imputation on downstream analyses of gene expression profiles,” Bioinformatics, vol. 27, no. 1, Article ID btq613, pp. 78–86, 2011.
[11]	R. Mei, X. Di, T. B. Ryder et al., “Analysis of high density expression microarrays with signed-rank call algorithms,” Bioinformatics, vol. 18, no. 12, pp. 1593–1599, 2002.
[12]	R. Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit, “Bioinformatics and computational biology solutions using R and Bioconductor,” Statistics for Biology and Health, 2005.
[13]	L. Gautier, L. Cope, B. M. Bolstad, and R. A. Irizarry, “Affy-Analysis of Affymetrix GeneChip data at the probe level,” Bioinformatics, vol. 20, no. 3, pp. 307–315, 2004.
[14]	L. Shi, “The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements,” Nature Biotechnology, vol. 24, no. 9, pp. 1151–1161, 2006.
[15]	J. J. Chen, H. Hsueh, R. R. Delongchamp, C. Lin, and C. Tsai, “Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data,” BMC Bioinformatics, vol. 8, no. 1, p. 412, 2007.
[16]	L. Shi, W. D. Jones, R. V. Jensen et al., “The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies,” BMC Bioinformatics, vol. 9, supplement 9, p. S10, 2008.
[17]	S. E. Choe, M. Boutros, A. M. Michelson, G. M. Church, and M. S. Halfon, “Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset,” Genome Biology, vol. 6, no. 2, p. R16, 2005.
[18]	Q. Zhu, J. C. Miecznikowski, and M. S. Halfon, “Preferred analysis methods for Affymetrix GeneChips. II. An expanded, balanced, wholly-defined spike-in dataset,” BMC Bioinformatics, vol. 11, no. 1, p. 285, 2010.
[19]	Q. Zhu, J. C. Miecznikowski, and M. S. Halfon, “A wholly defined Agilent microarray spike-in dataset,” Bioinformatics, vol. 27, no. 9, Article ID btr135, pp. 1284–1289, 2011.
[20]	Affymetrix, Inc., “Statistical algorithms description document,” Technical Paper, 2002.
[21]	C. L. Wilson and C. J. Miller, “Simpleaffy: a BioConductor package for Affymetrix Quality Control and data analysis,” Bioinformatics, vol. 21, no. 18, pp. 3683–3685, 2005.
[22]	R. C. Gentleman, V. J. Carey, D. M. Bates et al., “Bioconductor: open software development for computational biology and bioinformatics,” Genome Biology, vol. 5, no. 10, p. R80, 2004.
[23]	T. Hastie, R. Tibshirani, B. Narasimhan, and G. Chu, Impute: Imputation for Microarray Data, 1999, R package version 1.10.0.
[24]	T. H. Bø, B. Dysvik, and I. Jonassen, “Lsimpute: Accurate estimation of missing values in microarray data with least squares methods,” 2005, http://www.ii.uib.no/~trondb/imputation/.
[25]	D. V. Nguyen, N. Wang, and R. J. Carroll, “Evaluation of missing value estimation for microarray data,” Journal of Data Science, vol. 2, no. 4, pp. 347–370, 2004.
[26]	W. Stacklies and H. Redestig, PcaMethods: A Collection of PCA Methods, 2007, R package version 1.18.0.
[27]	S. S. Sadasiva Rao, L. A. Shepherd, A. E. Bruno, S. Liu, and J. C. Miecznikowski, “A full analysis of imputation procedures for Affymetrix gene expression datasets,” Technical Report 1202, SUNY University at Buffalo-Department of Biostatistics, Buffalo, NY, USA, 2012.
[28]	T. A. Patterson, E. K. Lobenhofer, S. B. Fulmer-Smentek et al., “Performance comparison of one-color and two-color platforms within the MicroArray Quality Control (MAQC) project,” Nature Biotechnology, vol. 24, no. 9, pp. 1140–1150, 2006.
[29]	Z. Wen, C. Wang, Q. Shi et al., “Evaluation of gene expression data generated from expired Affymetrix GeneChip® microarrays using MAQC reference RNA samples,” BMC Bioinformatics, vol. 11, supplement 6, p. S10, 2010.
[30]	J. Luo, M. Schumacher, A. Scherer et al., “A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data,” Pharmacogenomics Journal, vol. 10, no. 4, pp. 278–291, 2010.
[31]	K. Kadota and K. Shimizu, “Evaluating methods for ranking differentially expressed genes applied to microArray quality control data,” BMC Bioinformatics, vol. 12, no. 1, p. 227, 2011.
[32]	T. Aittokallio, “Dealing with missing values in large-scale studies: microarray data imputation and beyond,” Briefings in Bioinformatics, vol. 11, no. 2, Article ID bbp059, pp. 253–264, 2009.
[33]	J. Tuikkala, L. L. Elo, O. S. Nevalainen, and T. Aittokallio, “Missing value imputation improves clustering and interpretation of gene expression microarray data,” BMC Bioinformatics, vol. 9, no. 1, p. 202, 2008.
[34]	A. Liew, N. Law, and H. Yan, “Missing value imputation for gene expression data: computational techniques to recover missing data from available information,” Briefings in Bioinformatics, vol. 12, no. 5, Article ID bbq080, pp. 498–513, 2011.
[35]	B. M. Bolstad, R. A. Irizarry, M. Åstrand, and T. P. Speed, “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias,” Bioinformatics, vol. 19, no. 2, pp. 185–193, 2003.
[36]	R. A. Irizarry, B. M. Bolstad, F. Collin, L. M. Cope, B. Hobbs, and T. P. Speed, “Summaries of Affymetrix GeneChip probe level data,” Nucleic Acids Research, vol. 31, no. 4, p. e15, 2003.
[37]	R. A. Irizarry, B. Hobbs, F. Collin et al., “Exploration, normalization, and summaries of high density oligonucleotide array probe level data,” Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.
[38]	Z. Wu, R. A. Irizarry, R. Gentleman, F. Martinez-Murillo, and F. Spencer, “A model-based background adjustment for oligonucleotide expression arrays,” Journal of the American Statistical Association, vol. 99, no. 468, pp. 909–917, 2004.
[39]	A. R. Dabney and J. D. Storey, “A reanalysis of a published Affymetrix GeneChip control dataset,” Genome Biology, vol. 7, no. 3, p. 401, 2006.
[40]	D. P. Gaile and J. C. Miecznikowski, “Putative null distributions corresponding to tests of differential expression in the Golden Spike dataset are intensity dependent,” BMC Genomics, vol. 8, no. 1, p. 105, 2007.
[41]	J. M. Perkel, “Six things you won't find in the MAQC,” The Scientist, vol. 20, no. 11, p. 68, 2007.
[42]	P. Liang, “MAQC papers over the cracks,” Nature Biotechnology, vol. 25, no. 1, pp. 27–28, 2007.
[43]	L. Shi, W. D. Jones, R. V. Jensen et al., “Reply to MAQC papers over the cracks,” Nature Biotechnology, vol. 25, pp. 28–29, 2007.
