Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2012

An Improved Text Clustering Method Based on Part of Speech and Center Points

pp. 996-1001

Shi Kansheng (施侃晟), Liu Haitao (刘海涛), Song Wentao (宋文涛)

Keywords: text clustering, k-means, part-of-speech features, sample average similarity, isolated points

Abstract:

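The k-means algorithm is sensitive to its initial points and easily becomes trapped in local optima. To address this, an improved text clustering method based on part of speech and center points (STICS) is proposed. The method improves the semantic representation of texts, optimizes the selection of center points, and eliminates the negative influence of isolated points (outliers), and thereby obtains better clustering results. STICS takes into account how much features with different parts of speech contribute to a text, and represents each text with a weighted vector space model. To select centers, it first measures the sample average similarity of every sample and then chooses the sample with the largest sample average similarity as the first cluster center. Experimental results show that the proposed method achieves better clustering results.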
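
The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the part-of-speech weights, the similarity measure, nor the rule for detecting isolated points, so the `POS_WEIGHTS` values, the use of cosine similarity, the `isolation_threshold` parameter, and all function names below are assumptions made for illustration only. The sketch covers the weighted vector representation, the sample average similarity, the choice of the first center, and a simple threshold-based flagging of isolated points; it does not attempt the selection of later centers, which the abstract does not describe.

```python
import math
from collections import Counter

# Hypothetical part-of-speech weights (noun, verb, adjective).
# The abstract does not state the values STICS actually uses.
POS_WEIGHTS = {"n": 1.0, "v": 0.6, "a": 0.3}


def weighted_vector(tagged_tokens, pos_weights=POS_WEIGHTS, default=0.1):
    """Build a sparse term vector in which each occurrence of a term
    contributes the weight of its part-of-speech tag."""
    vec = Counter()
    for term, pos in tagged_tokens:
        vec[term] += pos_weights.get(pos, default)
    return dict(vec)


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed similarity measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sample_average_similarity(vectors):
    """For each sample, the mean similarity to all other samples."""
    n = len(vectors)
    if n < 2:
        return [0.0] * n
    sums = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[i], vectors[j])
            sums[i] += s
            sums[j] += s
    return [s / (n - 1) for s in sums]


def first_center_and_isolated(vectors, isolation_threshold=0.05):
    """Return the index of the sample with the largest average similarity
    (the first cluster center) and the indices of samples whose average
    similarity falls below the assumed threshold (treated as isolated)."""
    avg = sample_average_similarity(vectors)
    first = max(range(len(avg)), key=avg.__getitem__)
    isolated = [i for i, a in enumerate(avg) if a < isolation_threshold]
    return first, isolated, avg
```

Each document is passed in as a list of `(term, pos_tag)` pairs, converted with `weighted_vector`, and the resulting list of vectors is handed to `first_center_and_isolated`. The pairwise loop is quadratic in the number of documents, which is acceptable for a small illustration but would need an approximation or sampling step on a large corpus.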
