%0 Journal Article
%T Detection and elimination of similar Web pages based on text structure and extraction of long sentences
基于正文结构和长句提取的网页去重算法
%A HUANG Ren
%A FENG Sheng
%A YANG Ji-yun
%A LIU Yu
%A AO Min
%A 黄仁
%A 冯胜
%A 杨吉云
%A 刘宇
%A 敖民
%J 计算机应用研究
%D 2010
%X Based on the similarity characteristics and the text structure of Web pages, this paper proposes a dynamic, layered, and robust algorithm for detecting and eliminating similar Web pages. The method represents the text of each Web page as a text structure tree, then applies a dynamic algorithm to extract text features and a layer fingerprint algorithm to calculate similarity. Feature extraction relies on the extraction of long sentences, which makes the algorithm robust. Experimental results show that the method accurately detects both completely similar and partly similar Web pages.
%K detection and elimination of similar Web pages
%K text structure tree
%K extraction of long sentences
%K layer fingerprint
%K 网页去重
%K 正文结构树
%K 长句提取
%K 层次指纹
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=5B3AB970F71A803DEACDC0559115BFCF0A068CD97DD29835&cid=8240383F08CE46C8B05036380D75B607&jid=A9D9BE08CDC44144BE8B5685705D3AED&aid=0FD8DF9FCFBF9827CCB35D7B9C9A75CB&yid=140ECF96957D60B2&vid=DB817633AA4F79B9&iid=DF92D298D3FF1E6E&sid=B6521CEA65B8A16F&eid=C9E61AB37F867E3C&journal_id=1001-3695&journal_name=计算机应用研究&referenced_num=1&reference_num=7