Application Research of Computers (计算机应用研究), 2010

Detection and elimination of similar Web pages based on text structure and extraction of long sentences
(Chinese title: 基于正文结构和长句提取的网页去重算法)

HUANG Ren (黄仁), FENG Sheng (冯胜), YANG Ji-yun (杨吉云), LIU Yu (刘宇), AO Min (敖民)

Keywords: detection and elimination of similar Web pages, text structure tree, extraction of long sentences, layer fingerprint

Abstract:

Based on the similarity characteristics and the text structure of Web pages, this paper proposes a dynamic, layered, and robust algorithm for detecting and eliminating similar Web pages. The method represents the text of each Web page as a text structure tree. It then applies a dynamic algorithm to extract text features and a layer fingerprint algorithm to calculate similarity. Feature extraction relies on the extraction of long sentences, which ensures robustness. Experimental results show that the method accurately detects both completely similar and partly similar Web pages.
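The abstract does not give the algorithm's details, so the following Python sketch is only an illustration of the general idea and not the authors' implementation. It assumes a hypothetical `Node` class for the text structure tree, a hypothetical `long_sentences` helper that keeps the `k` longest sentences of each node, MD5 hashes as fingerprints, and an averaged per-layer Jaccard overlap as the similarity score. All of these names and choices are assumptions made for illustration.

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical text structure tree (page, section, paragraph)."""
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def long_sentences(text: str, k: int = 2) -> list[str]:
    """Return the k longest sentences of `text` (assumed feature choice)."""
    parts = [s.strip() for s in re.split(r"[.!?。！？]+", text)]
    sentences = [s for s in parts if s]
    return sorted(sentences, key=len, reverse=True)[:k]


def layer_fingerprints(root: Node, k: int = 2) -> list[set[str]]:
    """Collect one set of sentence hashes per tree depth (layer)."""
    layers: list[set[str]] = []
    frontier = [root]
    while frontier:
        prints: set[str] = set()
        for node in frontier:
            for sentence in long_sentences(node.text, k):
                prints.add(hashlib.md5(sentence.encode("utf-8")).hexdigest())
        layers.append(prints)
        # Move one level down the tree.
        frontier = [child for node in frontier for child in node.children]
    return layers


def similarity(a: Node, b: Node, k: int = 2) -> float:
    """Average Jaccard overlap over layers that contain any fingerprint."""
    la, lb = layer_fingerprints(a, k), layer_fingerprints(b, k)
    scores = []
    for i in range(max(len(la), len(lb))):
        x = la[i] if i < len(la) else set()
        y = lb[i] if i < len(lb) else set()
        union = x | y
        if union:  # skip layers that are empty on both sides
            scores.append(len(x & y) / len(union))
    return sum(scores) / len(scores) if scores else 1.0
```

In this sketch, editing only the short sentences of a paragraph leaves its fingerprint unchanged, which is one plausible reading of why long-sentence extraction helps robustness. Pages that share only some paragraphs receive a score between 0 and 1, so a threshold (not specified in the abstract) would be needed to decide what counts as partly similar.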
