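A Feature Selection Method Based on Category Weighting and Variance Statistics

Keywords: text classification, imbalanced datasets, feature selection method, category weighting, variance statistics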

Abstract:

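To improve the accuracy and stability of imbalanced text classification, a joint feature selection method based on category weighting and variance statistics is proposed. First, considering how the number of documents in each category affects feature selection, a category weighting strategy is given to strengthen the features of small categories. Second, building on an examination of how well features discriminate between categories, a category variance statistics strategy is designed to highlight features that carry rich category information. Finally, the two strategies are combined into a new joint feature selection algorithm. Experiments on two imbalanced corpora, Reuters-21578 and the Fudan University corpus, show that the algorithm is effective; in particular, its classification results on small categories are far better than those of popular general-purpose algorithms such as IG, CHI, and DFICF.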
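
The abstract does not give the formulas, so the following is only a minimal sketch of how a category weight and a variance statistic might be combined into one score. Everything specific here is an assumption for illustration, not the paper's definition: the weight `log(1 + N / N_c)`, the use of per-category document rates, the weighted squared deviation as the joint score, and the function names `joint_scores` and `select_features`.

```python
import math
from collections import Counter, defaultdict


def joint_scores(docs, labels):
    """Score each term by a category-weighted variance of its per-category document rate.

    Hypothetical scoring rule, not the formula from the paper.
    """
    class_sizes = Counter(labels)
    n_docs = len(docs)
    # Assumed weighting: categories with fewer documents get larger weights.
    weights = {c: math.log(1 + n_docs / n) for c, n in class_sizes.items()}

    # df[term][c] = number of documents of category c that contain the term
    df = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term][label] += 1

    scores = {}
    for term, per_class in df.items():
        # Fraction of each category's documents that contain the term.
        rates = {c: per_class[c] / class_sizes[c] for c in class_sizes}
        mean_rate = sum(rates.values()) / len(rates)
        # Weighted squared deviation: large when the term's rate differs across categories.
        scores[term] = sum(
            weights[c] * (rates[c] - mean_rate) ** 2 for c in class_sizes
        )
    return scores


def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = joint_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy imbalanced corpus: three "grain" documents, one "oil" document.
docs = [["grain", "price"], ["grain", "export"], ["grain", "price"], ["oil", "price"]]
labels = ["grain", "grain", "grain", "oil"]
top_terms = select_features(docs, labels, 2)
```

In this toy corpus, terms that occur in only one category (`grain`, `oil`) score higher than `price`, which is spread across both. The actual weighting and variance definitions should be taken from the full text.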
