Accurate prediction of cancer stage is crucial for effective treatment planning. In this study, we aimed to develop an XGBoost (eXtreme Gradient Boosting) model that combines gene expression data with clinical and demographic variables to predict specific lung cancer stages in patients. Using the Wilcoxon Rank Test for feature selection, we identified the genes most strongly associated with lung cancer stage. Our model achieved an overall accuracy of 82% in classifying lung cancer stages from patients’ gene expression data. These findings demonstrate the potential of gene expression analysis and machine learning techniques to improve the accuracy of lung cancer stage prediction and to aid personalized treatment decisions.
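The feature-selection step described above can be illustrated with a minimal sketch. It assumes that the Wilcoxon test in question is the two-sample rank-sum variant and that stages are collapsed into two hypothetical groups (0 and 1); the gene names, expression values, and function names are invented for illustration and are not taken from the study. The classifier itself (XGBoost) is not shown.

```python
import math
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_z(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Wilcoxon rank-sum z-score (normal approximation, no tie correction)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n_a])  # rank sum of the first group
    mean_w = n_a * (n_a + n_b + 1) / 2.0
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    return (w - mean_w) / sd_w


def select_top_genes(expression: Dict[str, List[float]],
                     labels: List[int], k: int) -> List[str]:
    """Rank genes by |z| between the two label groups and keep the top k."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(rank_sum_z(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: three genes measured in six patients.
expression = {
    "GENE_A": [1.0, 1.2, 0.9, 3.1, 3.4, 2.9],  # clearly separates groups
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no separation
    "GENE_C": [0.5, 2.5, 0.7, 2.4, 0.6, 2.6],  # weak separation
}
labels = [0, 0, 0, 1, 1, 1]
top_genes = select_top_genes(expression, labels, k=2)
print(top_genes)
```

In a full pipeline, the selected genes would then be passed, together with the clinical and demographic variables, to a gradient-boosted classifier; a multi-stage setting would also need either pairwise comparisons or a multi-group rank test in place of the two-group comparison assumed here.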