Research on a Density-Based Clustering Algorithm for Mixed-Attribute Data with Automatic Determination of Cluster Centers

DOI: 10.16383/j.aas.2015.c150062, PP. 1798-1813

Keywords: data mining, mixed attributes, data clustering, density, mixed distance measure

Abstract:

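Mixed-attribute data are widespread, yet most existing mixed-attribute clustering algorithms suffer from low clustering quality, heavy dependence on parameters, and an inability to determine the number of clusters and the cluster centers automatically and accurately. To address these problems, this paper proposes a density-based clustering algorithm for mixed-attribute data that determines cluster centers automatically. By analyzing the characteristics of mixed-attribute data, the algorithm divides such data into three types (numeric-dominant, categorical-dominant, and balanced) and selects a suitable distance measure for each type. From the distribution of density and distance computed for every point in the data set, the analysis yields the following rule: a data point that has high density and lies relatively far from any point of higher density is the most likely to be a cluster center. Singular points are identified with a linear regression model and residual analysis, and the paper argues theoretically that these singular points are the cluster centers, so the centers are determined automatically. Particle swarm optimization (PSO) is used to search for the optimal value of dc. Given the parameter dc, the density of any data object and its minimum distance to a point of higher density can be computed; the center of each cluster is then determined by the automatic center-determination method, and every remaining point is assigned to the cluster of its nearest neighbor of higher density, which completes the clustering. Finally, the proposed algorithm is compared with several existing mixed-attribute clustering algorithms on multiple data sets, and the results show that it achieves higher clustering quality.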
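The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the mixed distance is a weighted sum of scaled Euclidean distance (numeric attributes) and Hamming distance (categorical attributes) with a hypothetical weight `w`; density is a simple count of neighbors closer than `dc`; centers are taken to be points whose standardized residual from a linear fit of distance on density exceeds a hypothetical threshold `k_sigma`; and `dc` is passed in by hand rather than searched for with PSO. The function names are likewise hypothetical.

```python
import numpy as np

def mixed_distance(num, cat, w=0.5):
    """Pairwise distances: w * scaled Euclidean + (1 - w) * Hamming (assumed form)."""
    d_num = np.linalg.norm(num[:, None, :] - num[None, :, :], axis=2)
    if d_num.max() > 0:
        d_num = d_num / d_num.max()          # scale numeric part to [0, 1]
    d_cat = (cat[:, None, :] != cat[None, :, :]).mean(axis=2)
    return w * d_num + (1.0 - w) * d_cat

def density_and_delta(dist, dc):
    """Density rho, distance delta to the nearest denser point, and that point's index."""
    n = dist.shape[0]
    rho = (dist < dc).sum(axis=1) - 1        # neighbors within dc, excluding self
    order = np.argsort(-rho, kind="stable")  # descending density, ties by index
    delta = np.zeros(n)
    parent = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()         # densest point: use its largest distance
            continue
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j
    return rho, delta, parent, order

def pick_centers(rho, delta, k_sigma=2.0):
    """Fit delta ~ a * rho + b; large positive standardized residuals mark centers."""
    a, b = np.polyfit(rho, delta, 1)
    resid = delta - (a * rho + b)
    scale = resid.std() or 1.0               # avoid division by zero
    return np.flatnonzero(resid / scale > k_sigma)

def cluster_mixed(num, cat, dc, w=0.5, k_sigma=2.0):
    dist = mixed_distance(num, cat, w)
    rho, delta, parent, order = density_and_delta(dist, dc)
    centers = pick_centers(rho, delta, k_sigma)
    if centers.size == 0:                    # fallback: densest point as sole center
        centers = order[:1]
    labels = np.full(dist.shape[0], -1)
    labels[centers] = np.arange(len(centers))
    for i in order:                          # denser points are labeled first
        if labels[i] == -1:
            if parent[i] >= 0:
                labels[i] = labels[parent[i]]
            else:
                labels[i] = np.argmin(dist[i, centers])
    return labels, centers
```

On small data sets the standardized residuals cannot grow large, so `k_sigma` would likely need to be set lower than the default used here; in the paper's approach the choice of `dc` is delegated to PSO instead of being fixed by hand.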
