The growing volume of data in the environmental sciences demands analysis and interpretation. Among the challenges raised by this “data deluge”, developing efficient strategies for knowledge discovery is an important one. Here, statistical tools and techniques from computational intelligence are applied to analyze large data sets from meteorology and climate science. Our approach produces a geographical mapping of the statistical property that meteorologists can readily interpret. The data analysis comprises two main knowledge-extraction steps, applied successively to reduce the complexity of the original data set. The goal is to identify a much smaller subset of climatic variables that may still describe, or even predict, the probability of occurrence of an extreme event. The first step applies a class comparison technique: p-value estimation. The second step builds a decision tree (DT) from the available data and the p-value analysis. The DT is used as a predictive model, identifying the climate variables that are most statistically significant for precipitation intensity. The methodology is applied to study the climatic causes of an extreme precipitation event that occurred in the states of Alagoas and Pernambuco (Brazil) in June 2010.
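The two-step idea (screen variables by p-value, then build a tree on the survivors) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a permutation test as the p-value estimator, uses a single entropy-based split as a stand-in for a full decision tree, and runs on synthetic data. The variable names (`moisture_flux`, `noise_a`, `noise_b`), the threshold `ALPHA`, and all helper functions are hypothetical.

```python
import math
import random
from statistics import mean


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in class means.

    Assumption: a permutation test stands in for whichever class
    comparison test is actually used to estimate p-values.
    """
    rng = random.Random(seed)

    def mean_diff(lbls):
        a = [v for v, l in zip(values, lbls) if l == 1]
        b = [v for v, l in zip(values, lbls) if l == 0]
        return abs(mean(a) - mean(b))

    observed = mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def best_split(values, labels):
    """Return (information gain, threshold) of the best binary split."""
    base = entropy(labels)
    best = (0.0, None)
    xs = sorted(set(values))
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[0]:
            best = (gain, thr)
    return best


# Synthetic data (hypothetical): label 1 = extreme event, 0 = normal period.
rng = random.Random(42)
labels = [0] * 20 + [1] * 20
data = {
    "moisture_flux": [rng.gauss(4.0 * l, 1.0) for l in labels],  # informative
    "noise_a": [rng.gauss(0.0, 1.0) for _ in labels],            # uninformative
    "noise_b": [rng.gauss(0.0, 1.0) for _ in labels],            # uninformative
}

# Step 1: keep only the variables whose classes differ significantly.
ALPHA = 0.01
p_values = {name: permutation_p_value(v, labels) for name, v in data.items()}
selected = [name for name, p in p_values.items() if p < ALPHA]

# Step 2: among the survivors, the root of the tree is the variable
# whose best split yields the highest information gain.
splits = {name: best_split(data[name], labels) for name in selected}
root_variable = max(splits, key=lambda name: splits[name][0])
root_gain, root_threshold = splits[root_variable]

print("p-values:", p_values)
print("selected:", selected)
print("root split:", root_variable, "at threshold", root_threshold)
```

In a real analysis the p-values would be computed per variable and per grid point, which is what makes a geographical map of significance possible. The single split shown here would then be replaced by a full tree grown recursively on the retained variables.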