Biophysics 2020
Breast Cancer Stage Prediction Based on Gene Expression Data
Abstract:
Treatment and prognosis for breast cancer are largely determined by the stage of the disease, so accurately identifying a patient's stage is particularly important. This article explores methods for predicting breast cancer stage from patients' gene expression data. Because late-stage samples were scarce, we oversampled the data set: late-stage samples were drawn at random with replacement until they matched the number of early-stage samples, yielding a stage-balanced data set. A random forest model built on the balanced samples predicted stage with an accuracy of 96.75%, a sensitivity of 97.5%, and a specificity of 89.3%. Compared with k-nearest neighbors (kNN) and support vector machine (SVM) models, the random forest model achieved a markedly higher AUC (Area Under Curve) value. Ten-fold cross-validation of the random forest model gave an average accuracy of 96.71%. These results indicate that the random forest model has good predictive performance. We then performed functional enrichment analysis on the 200 genes with the highest random forest importance scores; most of the enriched pathways are related to breast cancer. This suggests that the selected gene expression data are meaningful for predicting stage and may provide a basis for future work on the treatment and prognosis of breast cancer.
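The oversampling step and the sensitivity and specificity figures can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `oversample_minority` and `sensitivity_specificity` are hypothetical, the toy expression values are invented for illustration, and treating the late stage as the positive class is an assumption, since the abstract does not say which class is positive. The sketch also assumes that the late-stage set is redrawn in full with replacement, which is one reading of the description.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, minority_label, seed=0):
    """Balance two classes by redrawing the minority class with replacement.

    Assumption: the whole minority set is resampled to the majority size;
    the abstract does not say whether the original samples are kept as-is.
    """
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    majority = [(s, y) for s, y in zip(samples, labels) if y != minority_label]
    if not minority:
        raise ValueError("no samples carry the minority label")
    # Draw as many minority samples (with replacement) as there are majority samples.
    resampled = rng.choices(minority, k=len(majority))
    balanced = majority + [(s, minority_label) for s in resampled]
    rng.shuffle(balanced)
    return [s for s, _ in balanced], [y for _, y in balanced]


def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity, specificity) for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: six early-stage and two late-stage expression profiles (invented values).
profiles = [[0.1, 1.2], [0.3, 0.9], [0.2, 1.1], [0.4, 1.0],
            [0.2, 0.8], [0.1, 1.0], [1.5, 0.2], [1.7, 0.1]]
stages = ["early"] * 6 + ["late"] * 2

balanced_profiles, balanced_stages = oversample_minority(profiles, stages, "late", seed=42)
print(Counter(balanced_stages))  # both classes now have six samples

# Toy predictions, only to show how the two metrics are computed.
truth = ["late", "late", "early", "early"]
predicted = ["late", "early", "early", "early"]
print(sensitivity_specificity(truth, predicted, positive="late"))
```

The balanced lists returned by `oversample_minority` would then be passed to whichever classifier is being trained; the classifier itself is outside the scope of this sketch.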