Application of Multiple Testing Techniques in Big Data Analysis
Abstract:
When hypothesis tests are run on big data, multiple testing procedures are needed to control false positives. Many such procedures exist. This paper applies them to real data, compares the strengths and weaknesses of each algorithm, and identifies the settings each method suits, in order to give data analysts theoretical guidance. The paper first explains why multiple testing correction is necessary and introduces the related concepts. It then describes two classes of methods: those that control the family-wise error rate (FWER) and those that control the false discovery rate (FDR). Finally, it applies these methods to large-scale gene data to decide whether each gene is expressed. The experimental results show that the FDR-controlling methods outperform the FWER-controlling methods and that, among the FDR-controlling methods, the q-value method gives the best results. The reason is that the q-value method takes prior information about the null hypothesis into account and controls the false discovery rate well, so it achieves high accuracy and statistical power.
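The contrast between the two classes of methods can be illustrated with a minimal sketch. It assumes the standard textbook forms of the Bonferroni and Holm adjustments (FWER control) and of the Benjamini-Hochberg step-up adjustment (FDR control). The function names and the six toy p-values are hypothetical and are not taken from the paper's gene data. The q-value method is not implemented here.

```python
def bonferroni(pvals):
    """FWER control: multiply each p-value by the number of tests m."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals):
    """FWER control (step-down): the k-th smallest p-value is scaled by
    (m - k + 1), then a running maximum keeps the result monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank is 0-based, so factor is m - rank
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """FDR control (step-up): the k-th smallest p-value is scaled by m / k,
    then a running minimum from the largest p-value downward keeps the
    result monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, m * pvals[i] / (rank + 1))
        adjusted[i] = running_min
    return adjusted


def count_rejections(adjusted, alpha=0.05):
    """Number of hypotheses rejected at level alpha."""
    return sum(1 for p in adjusted if p <= alpha)


# Hypothetical toy p-values, chosen only to make the differences visible.
pvals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2]

for name, method in [("Bonferroni", bonferroni),
                     ("Holm", holm),
                     ("Benjamini-Hochberg", benjamini_hochberg)]:
    print(name, count_rejections(method(pvals)))
```

On this toy input, at level 0.05, Bonferroni rejects 2 hypotheses, Holm rejects 3, and Benjamini-Hochberg rejects 5. The more permissive error criterion yields more discoveries, which matches the direction of the comparison reported in the abstract. It says nothing about the size of the effect on real data.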