%0 Journal Article
%T Indent Shape Based Approach for Mining Repeated Patterns of HTML Documents
基于缩进轮廓的HTML文档重复模式挖掘方法
%A ZHU Yan-xu
%A WANG Huai-min
%A SHI Dian-xi
%A YIN Gang
%A YUAN Lin
%A LI Xiang
%A 朱沿旭
%A 王怀民
%A 史殿习
%A 尹刚
%A 袁霖
%A 李翔
%J 计算机科学
%D 2011
%I
%X Mining repeated patterns is the key to finding the encoding templates of Web pages, which is the basis for automatic Web data extraction and Web content mining. Existing approaches such as tree matching and string matching can detect repeated patterns with high precision, but their performance remains a challenge when processing massive numbers of Web pages. To improve performance, this paper presents a novel indent shape based approach for mining repeated patterns in HTML documents. First, the approach defines the indent shape model, a simplified abstraction of an HTML document consisting of the indent and first tag of each line. Then, it detects repeated patterns indirectly by identifying tandem repeated waves in the indent shape. Extensive experiments show that our approach achieves better performance than existing approaches.
%K Mining repeated patterns
%K Web data extraction
%K Web content mining
%K Indent shape
%K Tandem repeated waves
重复模式挖掘,Web数据抽取,Web内容挖掘,缩进轮廓,串联重复波段
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=5B3AB970F71A803DEACDC0559115BFCF0A068CD97DD29835&cid=8240383F08CE46C8B05036380D75B607&jid=64A12D73428C8B8DBFB978D04DFEB3C1&aid=2AC66EB828B1E43509FBC75AD84F0729&yid=9377ED8094509821&vid=16D8618C6164A3ED&iid=5D311CA918CA9A03&sid=31611641D4BB139F&eid=BBF7D98F9BEDEC74&journal_id=1002-137X&journal_name=计算机科学&referenced_num=0&reference_num=0