Document Feature Selection Based on a Minimal Term-Frequency Threshold

pp. 531-537

Keywords: text categorization, feature selection, information gain, mutual information, χ² statistic

Abstract:

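To reduce the impact of content-irrelevant feature terms on text categorization systems, we analyzed terms unrelated to document content and found that such terms generally have low term frequency, so filtering out low-frequency features with a minimal term-frequency threshold markedly reduces the number of irrelevant features. On this basis, we propose document-frequency evaluation functions based on a minimal term-frequency threshold. Selecting features with these functions effectively reduces content-irrelevant noise features and improves classification quality. Experimental results show that several evaluation functions based on document frequency with a minimal term-frequency threshold improve classification accuracy, to varying degrees, over the same functions based on ordinary document frequency. The improvement is largest for mutual information: the macro-averaged F1 is 40% higher than with the term-frequency method and 15% to 30% higher than with the ordinary document-frequency method.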
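The abstract does not give the evaluation functions themselves, so the following is only a minimal sketch of the idea under stated assumptions: a document counts toward a term's document frequency only if the term occurs in it at least `min_tf` times, and these thresholded counts are then plugged into the common pointwise mutual information estimate log(A·N / ((A+C)·(A+B))). The function names, the `min_tf` parameter, and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def thresholded_doc_freq(docs, min_tf=2):
    """Count, per term, the documents in which it occurs at least min_tf times.

    Assumption: this is one plausible reading of "document frequency with a
    minimal term-frequency threshold"; the paper's exact definition may differ.
    """
    df = Counter()
    for tokens in docs:
        for term, count in Counter(tokens).items():
            if count >= min_tf:
                df[term] += 1
    return df


def mutual_information(term, docs, labels, target, min_tf=2):
    """Pointwise MI between a term and a class, using thresholded presence.

    A = docs of class `target` containing the term (>= min_tf occurrences)
    B = docs of other classes containing the term
    C = docs of class `target` not containing the term
    """
    n = len(docs)
    a = b = c = 0
    for tokens, label in zip(docs, labels):
        present = tokens.count(term) >= min_tf
        if present and label == target:
            a += 1
        elif present:
            b += 1
        elif label == target:
            c += 1
    if a == 0:
        return float("-inf")  # term never meets the threshold in this class
    return math.log(a * n / ((a + c) * (a + b)))


# Hypothetical toy corpus: two "sports" and two "finance" documents.
docs = [
    ["ball", "ball", "goal", "the"],
    ["ball", "goal", "goal", "click"],
    ["stock", "stock", "market", "ball"],
    ["market", "market", "stock", "click"],
]
labels = ["sports", "sports", "finance", "finance"]

plain_df = thresholded_doc_freq(docs, min_tf=1)   # ordinary document frequency
strict_df = thresholded_doc_freq(docs, min_tf=2)  # thresholded document frequency
mi_ball = mutual_information("ball", docs, labels, "sports", min_tf=2)
```

In this toy corpus the incidental term "click" appears once in each of two documents, so it has an ordinary document frequency of 2 but drops to 0 once `min_tf=2` is applied, which mirrors the abstract's observation that content-irrelevant terms tend to have low term frequency.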

