Wireless sensor networks (WSNs) are a rapidly emerging technology with great potential in many ubiquitous applications. Although these sensors can be inexpensive, they are often relatively unreliable when deployed in harsh environments characterized by a vast amount of noisy and uncertain data, such as urban traffic control, earthquake zones, and battlefields. The data gathered by distributed sensors, which serve as the eyes and ears of the system, are delivered to a decision center or a gateway sensor node that interprets situational information from the data streams. Although many other machine learning techniques have been extensively studied, real-time data mining of high-speed and nonstationary data streams represents one of the most promising WSN solutions. This paper proposes a novel stream mining algorithm with a programmable mechanism for handling missing data. Experimental results from both synthetic and real-life data show that the new model is superior to standard algorithms.

1. Introduction

It is anticipated that wireless sensor networks (WSNs) will enable the technology of today to be employed in future applications ranging from tracking, monitoring, and spying systems to various other technologies likely to improve aspects of everyday life. WSNs offer an inexpensive way to collect data over a distributed environment that may be harsh in nature, such as biochemical contamination sites, seismic zones, terrain subject to extreme weather, and battlegrounds. The sensors employed in WSNs, which are miniature embedded computing devices, continue to produce large volumes of streaming data obtained from their environment until the end of their lifetime. It is known that when the battery power in such sensors is exhausted, the likelihood of erroneous data being generated will grow rapidly [1]. Both uncertain environmental factors and the low cost of the sensors may contribute to intermittent transmission loss and inaccurate measurements.
Even when they seldom occur, errors and noise in data streams sensed by a large number of sensors may be misinterpreted as outliers; they frequently trigger false alarms that might either lead to undesirable consequences in critical applications or reduce measurement sensitivity. Data classification is a popular data mining technique used to determine the predefined classes (verdicts) to which unseen data freshly obtained from a WSN map, thereby providing situational information about current events in an environment covered by a dense network of sensors. At the core of the classification technique is a decision
References
[1]
S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos, “Online outlier detection in sensor data using non-parametric models,” in Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06), pp. 187–198, Seoul, Korea, September 2006.
[2]
B. Lantow, “Impact of wireless sensor network data on business data processing,” in Proceedings of the Forum Poster Session in Conjunction with Business Informatics Research (BIR '05), Skövde, Sweden, 2005.
[3]
M. Bahrepour, N. Meratnia, Z. Taghikhaki, and P. Havinga, “Sensor fusion-based activity recognition for Parkinson patients,” in Sensor Fusion—Foundation and Applications, pp. 171–191, InTech.
[4]
K. P. Lam, M. Höynck, B. Dong et al., “Occupancy detection through an extensive environmental sensor network in an open-plan office building,” in Proceedings of the 11th International IBPSA Conference, pp. 1452–1459, Glasgow, Scotland, July 2009.
[5]
Y. Ding and J. S. Simonoff, “An investigation of missing data methods for classification trees applied to binary response data,” Journal of Machine Learning Research, vol. 11, pp. 131–170, 2010.
[6]
K. Lakshminarayan, S. A. Harp, and T. Samad, “Imputation of missing data in industrial databases,” Applied Intelligence, vol. 11, no. 3, pp. 259–275, 1999.
[7]
R. J. Little and D. B. Rubin, Statistical Analysis with Missing Data, Wiley, New York, NY, USA, 1987.
[8]
A. Farhangfar, L. Kurgan, and J. Dy, “Impact of imputation of missing values on classification error for discrete data,” Pattern Recognition, vol. 41, no. 12, pp. 3692–3705, 2008.
[9]
W. Nick Street and Y. Kim, “A streaming ensemble algorithm (SEA) for large-scale classification,” in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 377–382, San Francisco, Calif, USA, August 2001.
[10]
H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining concept-drifting data streams using ensemble classifiers,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–235, Washington, DC, USA, August 2003.
[11]
S. Hashemi and Y. Yang, “Flexible decision tree for data stream classification in the presence of concept change, noise and missing values,” Data Mining and Knowledge Discovery, vol. 19, no. 1, pp. 95–131, 2009.
[12]
Y. Zhang, N. Meratnia, and P. Havinga, “Outlier detection techniques for wireless sensor networks: a survey,” IEEE Communications Surveys and Tutorials, vol. 12, no. 2, pp. 159–170, 2010.
[13]
D. Janakiram, V. A. Mallikarjuna Reddy, and A. V. U. P. Kumar, “Outlier detection in wireless sensor networks using Bayesian belief networks,” in Proceedings of the 1st International Conference on Communication System Software and Middleware, pp. 1–6, 2006.
[14]
Y. Hang and S. Fong, “Stream mining over fluctuating network traffic at variable data rates,” in Proceedings of the 6th International Conference on Advanced Information Management and Service (IMS '10), pp. 436–441, Seoul, Korea, November 2010.
[15]
The Second International Knowledge Discovery and Data Mining Tools Competition, Sponsored by the American Association for Artificial Intelligence (AAAI) Epsilon Data Mining Laboratory Paralyzed Veterans of America (PVA), http://www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html.
[16]
A. Frank and A. Asuncion, UCI Machine Learning Repository, Irvine, Calif, USA, University of California, School of Information and Computer Science, 2010, http://archive.ics.uci.edu/ml.