Application Research of Computers (计算机应用研究), 2012
Improved TFIDF feature extraction algorithm based on semantic association and information gain
Abstract:
Both the traditional and the improved term frequency-inverse document frequency (TFIDF) algorithms ignore differences in how terms are distributed across categories during feature extraction. Because they also do not consider semantic relationships within particular categories, the selected feature words cannot describe the content of a document correctly and accurately. To select features more accurately, this paper builds on previous improvements by introducing the semantic association of words to analyze the semantics of the text, redesigning the weight equation, and proposing a new TFIDF algorithm that combines semantic information with information gain. The proposed algorithm compensates for the lack of semantic information in purely statistical methods. Experimental results show that the improved algorithm effectively improves text classification accuracy.
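
As a rough illustration of why category distribution matters, the following Python sketch computes classic TFIDF weights and then rescales them by each term's information gain over the class labels. It is not the weight equation proposed in the paper, which the abstract does not state: the multiplicative combination, the function names, and the toy corpus are all assumptions made for illustration, and the semantic association component is left out entirely.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF: one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )


def information_gain(term, docs, labels):
    """Reduction in class entropy from knowing whether `term` is present."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(docs)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Hypothetical toy corpus: two "finance" and two "sport" documents.
docs = [
    ["stock", "market", "price"],
    ["market", "trade", "price"],
    ["match", "goal", "team"],
    ["team", "goal", "price"],
]
labels = ["finance", "finance", "sport", "sport"]

plain = tf_idf(docs)
# Assumed combination for illustration only: TF-IDF weight times information gain.
weighted = [
    {t: w * information_gain(t, docs, labels) for t, w in doc_weights.items()}
    for doc_weights in plain
]
```

In this toy corpus, plain TFIDF ranks "stock" above "market" in the first document because "stock" is rarer, while the information-gain rescaling ranks "market" higher because it appears in every finance document and in no sport document.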