In addition to non-normality, censoring is one of the defining characteristics of survival data. All traditional survival analysis procedures and models take censoring into account. However, no studies have examined the effect of censoring levels on survival data analysis. The main objective of this paper is to examine the effect of censoring levels on survival data analysis in the context of big data. Datasets of sizes n = 10,000, n = 50,000 and n = 100,000 were simulated, each at censoring levels of p = 0.1, p = 0.5 and p = 0.9. For comparison, small/moderate-sized survival datasets were also simulated. Censoring level had little effect on small/moderate-sized datasets but a marked effect on big datasets, as shown by plots of the survivor function. Visually, it was evident that as the level of censoring increases, there is a tendency to overestimate survival prospects. Model fit was much better for small/moderate-sized datasets than for big datasets. This supports the view of many researchers that traditional statistical survival models are inferior when handling big data. Surprisingly, at the high censoring level (p = 0.9), the model fit was much better for both small/moderate-sized and big datasets.
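The simulation design described above can be illustrated with a minimal sketch. The abstract does not state the data-generating model or the software used, so everything below is an assumption made for illustration: event times are exponential, censoring times are independent and exponential with a rate chosen so that the expected censoring proportion equals p, and the survivor function is estimated with a plain Kaplan-Meier product. The function names `simulate_censored`, `kaplan_meier` and `survival_at` are hypothetical and not taken from the paper.

```python
import math
import random


def simulate_censored(n, p, rate=1.0, seed=0):
    """Simulate n (observed_time, event_flag) pairs.

    Assumption: event and censoring times are independent exponentials.
    With event rate `rate` and censoring rate `rate * p / (1 - p)`,
    the probability that an observation is censored equals p (0 < p < 1).
    """
    rng = random.Random(seed)
    censor_rate = rate * p / (1.0 - p)
    data = []
    for _ in range(n):
        t = rng.expovariate(rate)         # latent event time
        c = rng.expovariate(censor_rate)  # latent censoring time
        data.append((min(t, c), t <= c))  # flag is True when the event is observed
    return data


def kaplan_meier(data):
    """Return [(time, S(time)), ...] at each distinct observed event time."""
    data = sorted(data)
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations sharing the same time (handles ties).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


def survival_at(curve, t):
    """Step-function lookup: estimated S(t) from a Kaplan-Meier curve."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s


if __name__ == "__main__":
    n = 10_000
    for p in (0.1, 0.5, 0.9):
        data = simulate_censored(n, p, seed=1)
        censored = sum(1 for _, event in data if not event) / n
        s_hat = survival_at(kaplan_meier(data), math.log(2))
        print(f"p={p}: observed censoring={censored:.3f}, "
              f"KM S(ln 2)={s_hat:.3f} (true value 0.5)")
```

Under this independent-censoring assumption the Kaplan-Meier estimator remains consistent, so the sketch only shows how datasets at a target censoring level can be generated and summarised. It is not expected to reproduce the overestimation pattern reported in the paper, which depends on the authors' own simulation and modelling choices.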