|
Content Oriented Automatic Text Categorization

Keywords: The aim of this paper is to propose deep parallelism may be established between and an Automatic

Abstract: The project is to implement a web spam classifier which, given a web page, analyzes its features and tries to determine whether the page is spam or not. The efficiency of the classifier will be compared with the results of spam detection on text datasets using a Naive Bayes classifier. Text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e. words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with NLP and clustering algorithms. In consideration of the distribution of relevant documents in the collection, we propose a new simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In the experiments, tf.rf shows a consistently better performance, while other supervised term weighting methods based on information theory or statistical metrics perform the worst in all experiments. On the other hand, the popularly used tf.idf method has not shown a uniformly good performance across different data sets.
|
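The abstract names the tf.rf weighting scheme but does not spell out its formula. The sketch below is illustrative only and assumes the commonly cited definition of relevance frequency, rf = log2(2 + a / max(1, c)), where a is the number of positive-category documents containing the term and c is the number of negative-category documents containing it. The function names, the toy documents, and the "spam"/"ham" labels are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def relevance_frequency(docs, labels, positive_label):
    """Return {term: rf}, assuming rf = log2(2 + a / max(1, c)).

    a = number of positive-category documents containing the term
    c = number of negative-category documents containing the term
    """
    pos_df = Counter()  # document frequency in the positive category
    neg_df = Counter()  # document frequency in the negative category
    for tokens, label in zip(docs, labels):
        target = pos_df if label == positive_label else neg_df
        target.update(set(tokens))  # count each term once per document
    vocab = set(pos_df) | set(neg_df)
    return {t: math.log2(2 + pos_df[t] / max(1, neg_df[t])) for t in vocab}


def tf_rf_vector(tokens, rf):
    """Weight raw term frequencies by rf; terms unseen in training are dropped."""
    tf = Counter(tokens)
    return {t: count * rf[t] for t, count in tf.items() if t in rf}


# Hypothetical toy corpus: two "spam" and two "ham" documents.
docs = [
    ["cheap", "pills", "buy", "now"],
    ["buy", "cheap", "watches"],
    ["meeting", "agenda", "now"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

rf = relevance_frequency(docs, labels, positive_label="spam")
vector = tf_rf_vector(["cheap", "cheap", "meeting"], rf)
print(vector)
```

In this toy corpus, "cheap" appears only in the positive category, so under the assumed formula it receives a higher rf than "meeting", which appears only in the negative category. A term spread evenly across both categories, such as "now", falls in between.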