%0 Journal Article
%T Efficient Computation of k-Medians over Data Streams Under Memory Constraints
%A Zhi-Hong Chong
%A Jeffrey Xu Yu
%A Zhen-Jie Zhang
%A Xue-Min Lin
%A Wei Wang
%A Ao-Ying Zhou
%A
Zhi-Hong Chong
%A Jeffrey Xu Yu
%A Zhen-Jie Zhang
%A Xue-Min Lin
%A Wei Wang
%A and Ao-Ying Zhou
%J 计算机科学技术学报
%D 2006
%I
%X In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1 )-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches. This work is supported by the National High Technology Development 863 Program of China under Grant No. 2002AA413310 and ARC Discovery Grant, Australia under Grant Nos. DP0346004 and DP0345710. Zhi-Hong Chong received his B.S. degree in computer science from Nanjing Meteorological Institute and M.S. degrees in economics from Institute of Fiscal Studies, Ministry of Finance, P.R. China, in 1991, 1999, respectively. He is currently a Ph.D. candidate in Fudan University, China. His research interests cover data mining, data streams. He has published several research papers in these areas in major international conferences and reputable journals. Jeffrey Xu Yu received his B.E., M.E. and Ph.D. degrees in computer science, from the University of Tsukuba, Japan, in 1985, 1987 and 1990, respectively. Jeffrey Xu Yu was a research fellow (Apr. 1990–Mar. 1991) and a faculty member (Apr. 1991–July 1992) in the Institute of Information Sciences and Electronics, University of Tsukuba. From July 1992 to June 2000, he was a lecturer in the Department of Computer Science, The Australian National University. Currently, he is an associate professor in the Department of Systems Engineering and Engineering Management, the Chinese University of Hong Kong. Jeffrey Xu Yu is a member of ACM, and a member of IEEE Computer Society. His interests cover XML database, data mining, data warehouse and on-line analytical processing, web-technology, quer
%K data streams
%K k-medians
%K cluster
%K data mining
数据流
%K 数据采集
%K 中间值
%K 数据存储
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=5B3AB970F71A803DEACDC0559115BFCF0A068CD97DD29835&cid=8240383F08CE46C8B05036380D75B607&jid=F57FEF5FAEE544283F43708D560ABF1B&aid=6039E0481F830ED20AE7339B98DCCA46&yid=37904DC365DD7266&vid=659D3B06EBF534A7&iid=0B39A22176CE99FB&sid=03E56C113B4E5A88&eid=D0182A31A5EB14BA&journal_id=1000-9000&journal_name=计算机科学技术学报&referenced_num=0&reference_num=27