Application Research of Computers (计算机应用研究), 2010
Detection and elimination of similar Web pages based on text structure and extraction of long sentences
Abstract:
Based on the characteristics of similar Web pages and the text structure of Web pages, this paper proposes a dynamic, layered, and robust algorithm for detecting and eliminating similar Web pages. The method represents the text of each Web page as a text structure tree, then applies a dynamic algorithm to extract text features and a layered fingerprint algorithm to compute similarity. Feature extraction relies on a long-sentence extraction algorithm, which ensures robustness. Experimental results show that the method accurately detects both completely similar and partially similar Web pages.
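The pipeline summarized above (structure the text, extract long sentences as features, fingerprint them, compare) can be illustrated with a minimal sketch. This is not the paper's implementation: the text structure tree is flattened to a plain list of section strings, MD5 stands in for whatever fingerprint function the authors use, Jaccard overlap stands in for their similarity measure, and every name here (`split_sentences`, `long_sentence_features`, `fingerprint`, `similarity`, the parameter `k`) is hypothetical.

```python
import hashlib
import re


def split_sentences(text):
    """Split text on Chinese and Western sentence-ending punctuation."""
    parts = re.split(r"[。！？.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def long_sentence_features(text, k=1):
    """Return the k longest sentences of one section (assumed feature choice)."""
    sentences = split_sentences(text)
    # Sort by length descending; ties are broken alphabetically for determinism.
    return sorted(sentences, key=lambda s: (-len(s), s))[:k]


def fingerprint(sentence):
    """Hash a whitespace-stripped, lowercased sentence (MD5 is an assumption)."""
    normalized = re.sub(r"\s+", "", sentence).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def section_fingerprints(section, k=1):
    """Lower layer: the fingerprint set of a single section."""
    return {fingerprint(s) for s in long_sentence_features(section, k)}


def page_fingerprints(sections, k=1):
    """Upper layer: the union of all section-level fingerprint sets."""
    return set().union(*(section_fingerprints(s, k) for s in sections))


def similarity(page_a, page_b, k=1):
    """Jaccard overlap of page-level fingerprints; 0.0 if both pages are empty."""
    fa, fb = page_fingerprints(page_a, k), page_fingerprints(page_b, k)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0
```

Because only the longest sentences of each section are fingerprinted, inserting a short line (a navigation label or an advertisement, for example) into a section leaves its fingerprints unchanged, which is one plausible reading of the robustness claim. A score of 1.0 would then indicate completely similar pages, and an intermediate score partially similar ones.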