Indent Shape Based Approach for Mining Repeated Patterns of HTML Documents
基于缩进轮廓的HTML文档重复模式挖掘方法

Keywords: Mining repeated patterns, Web data extraction, Web content mining, Indent shape, Tandem repeated waves
重复模式挖掘,Web数据抽取,Web内容挖掘,缩进轮廓,串联重复波段


Abstract:

Mining repeated patterns is the key to finding the encoding templates of Web pages, which in turn is the basis for automatic Web data extraction and Web content mining. Existing approaches such as tree matching and string matching can detect repeated patterns with high precision, but their performance remains a challenge when processing massive numbers of Web pages. To improve performance, this paper presents a novel indent-shape-based approach for mining repeated patterns in HTML documents. First, the approach defines the indent shape model, a simplified abstraction of an HTML document that consists of the indent and the first tag of each line. Then, it detects repeated patterns indirectly by identifying tandem repeated waves in the indent shape. Extensive experiments show that the approach achieves better performance than existing approaches.
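The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the formal definition of the indent shape or of a tandem repeated wave, so the sketch assumes that the HTML is already pretty-printed with one element per line, that the shape is a list of (indent width, first tag) pairs, and that a "wave" is any run of shape entries repeated back to back. The function names `indent_shape` and `tandem_repeats` and the brute-force search are hypothetical choices made for illustration.

```python
import re

# Matches the first tag name on a line, including a leading "/" for end tags.
TAG_RE = re.compile(r"<\s*(/?[A-Za-z][A-Za-z0-9]*)")


def indent_shape(html):
    """Return one (indent_width, first_tag) pair per non-blank line.

    Assumption: the input is already indented, one element per line.
    Lines that do not start with a tag are labelled "#text".
    """
    shape = []
    for line in html.splitlines():
        stripped = line.lstrip(" \t")
        if not stripped:
            continue
        indent = len(line) - len(stripped)
        match = TAG_RE.match(stripped)
        tag = match.group(1).lower() if match else "#text"
        shape.append((indent, tag))
    return shape


def tandem_repeats(shape, min_copies=2):
    """Naive search for back-to-back repeated runs in the shape.

    Returns (start, period, copies) triples. This brute-force scan is an
    illustrative stand-in, not the detection method used in the paper.
    """
    found = []
    n = len(shape)
    i = 0
    while i < n:
        best = None
        for period in range(1, (n - i) // min_copies + 1):
            unit = shape[i:i + period]
            copies = 1
            # Count how many times the unit repeats immediately after itself.
            while shape[i + copies * period:i + (copies + 1) * period] == unit:
                copies += 1
            covered = copies * period
            if copies >= min_copies and (best is None or covered > best[1] * best[2]):
                best = (i, period, copies)
        if best:
            found.append(best)
            i += best[1] * best[2]  # skip past the repeated region
        else:
            i += 1
    return found


sample = """<ul>
  <li>
    <a href="/a">A</a>
  </li>
  <li>
    <a href="/b">B</a>
  </li>
  <li>
    <a href="/c">C</a>
  </li>
</ul>"""

print(tandem_repeats(indent_shape(sample)))  # prints [(1, 3, 3)]
```

In the sample, the three list items produce the same three-entry shape (li, a, /li) three times in a row starting at shape index 1, so the repeated record is found without building a DOM tree or comparing tag strings in full.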
