%0 Journal Article
%T Prediction of Breast Cancer Stage Based on Gene Expression Data
%A 程佩文
%A 石涵钰
%A 夏心语
%A 陈园园
%J Biophysics
%P 29-37
%@ 2330-1694
%D 2020
%I Hans Publishing
%R 10.12677/BIPHY.2020.83003
%X
The stage of breast cancer largely determines its treatment plan and prognosis, so accurately identifying a patient’s stage is particularly important. This article explores methods for predicting breast cancer stage from patients’ gene expression data. To balance the data set, we oversampled the less numerous late-stage samples by random sampling with replacement until they matched the early-stage samples in number. A random forest model built on the balanced samples predicted the stage with an accuracy of 96.75%, a sensitivity of 97.5%, and a specificity of 89.3%. Compared with k-nearest neighbors (kNN) and support vector machine (SVM) models, the random forest model achieved a clearly higher AUC (area under the curve). Under ten-fold cross-validation, the random forest model reached an average accuracy of 96.71%. These results indicate that the random forest model has good predictive performance. We then performed functional enrichment analysis on the 200 genes with the highest importance scores in the random forest; most of the enriched pathways were related to breast cancer. This suggests that the selected gene expression data are meaningful for predicting stage and may provide a basis for future breast cancer treatment and prognosis.