现代图书情报技术 (New Technology of Library and Information Service), 2008
An Algorithm for Noise Reduction in Web Pages Based on a Group of Content-related Rules
Abstract:
This paper presents a new algorithm for eliminating noise in Web pages based on a group of content-related rules. First, the authors present an algorithm that peels off noise by iteratively comparing the tables at the same level of the page's table tree. Next, they present an algorithm for evaluating the similarity between the topic of an anchor text and the content of the page. Because the new algorithm takes semantic factors of Web pages into consideration to some extent, it achieves higher accuracy than purely rule-based algorithms while having lower time complexity. Experimental results indicate that the algorithm is very effective at purifying large collections of Web pages.
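The abstract does not specify how the similarity between an anchor text's topic and the page content is computed. As a rough illustration only, the sketch below assumes a plain term-frequency cosine similarity and a fixed cutoff. The function names (tokenize, cosine_similarity, split_links), the word-level tokenizer, and the threshold value are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase word tokens. The paper's own
    # segmentation method is not described in the abstract.
    return re.findall(r"\w+", text.lower())


def cosine_similarity(text_a, text_b):
    # Cosine similarity between the term-frequency vectors of two texts.
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_links(anchor_texts, page_content, threshold=0.1):
    # Partition anchor texts into (related, noise) by their similarity
    # to the page content. The threshold is an assumed value.
    related, noise = [], []
    for anchor in anchor_texts:
        if cosine_similarity(anchor, page_content) >= threshold:
            related.append(anchor)
        else:
            noise.append(anchor)
    return related, noise
```

Under these assumptions, a link whose anchor text shares no terms with the page body scores 0 and is treated as noise, while a link that repeats the page's topic terms is kept. A real system for Chinese pages would need a word segmenter in place of the regular-expression tokenizer.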