


A New Way of Extracting the Topic Information in Web Pages Based on DIV Tag-tree
基于DIV标签树的网页主题信息抽取方法

Keywords: topic information extraction, DIV tag-tree, STU-DIV model-tree, topic correlation, Cut-Tree (pruning) algorithm


Abstract:

As CSS+DIV layout has become the dominant way of structuring web pages, efficiently extracting topic information from such pages has become an urgent task for professional search engines. This paper proposes a new method for extracting topic information from web pages based on the DIV tag-tree. The method first uses DIV tags to split an HTML document into a DIV forest. It then filters the noise nodes out of the DIV tag-trees and builds STU-DIV model-trees. Finally, it prunes the DIV tag-trees that are irrelevant to the topic information, using topic correlation analysis and the Cut-Tree algorithm. An analysis of several news web pages shows that the method can extract topic information from web pages efficiently.
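The pipeline outlined in the abstract (DIV forest, noise filtering, topic-based pruning) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the STU-DIV model-tree, the topic correlation measure, or the Cut-Tree algorithm, so the sketch substitutes two simple stand-ins that are purely assumptions. A DIV counts as noise when it has no text or when more than half of its text sits inside links, and topic correlation is the fraction of topic words found in a DIV tree. All names (`DivNode`, `DivForestBuilder`, `extract_topic_text`, the thresholds) are hypothetical.

```python
import re
from html.parser import HTMLParser


class DivNode:
    """One DIV element: its direct text, link-text size, and child DIVs."""

    def __init__(self):
        self.text = []       # text fragments whose nearest DIV ancestor is this node
        self.link_chars = 0  # characters of that text that sit inside <a> tags
        self.children = []

    def full_text(self):
        parts = [" ".join(self.text)] + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)


class DivForestBuilder(HTMLParser):
    """Split an HTML document into a forest of top-level DIV trees."""

    def __init__(self):
        super().__init__()
        self.forest = []   # top-level DIV trees
        self.stack = []    # currently open DIVs
        self.in_link = 0   # nesting depth of open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            node = DivNode()
            siblings = self.stack[-1].children if self.stack else self.forest
            siblings.append(node)
            self.stack.append(node)
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        data = data.strip()
        if data and self.stack:  # text outside any DIV is ignored
            self.stack[-1].text.append(data)
            if self.in_link:
                self.stack[-1].link_chars += len(data)


def char_counts(node):
    """Return (all text characters, characters inside links) for a subtree."""
    total = sum(len(t) for t in node.text)
    links = node.link_chars
    for child in node.children:
        t, l = char_counts(child)
        total += t
        links += l
    return total, links


def is_noise(node, max_link_ratio=0.5):
    # Assumed heuristic: empty or link-dominated DIVs are noise.
    total, links = char_counts(node)
    return total == 0 or links / total > max_link_ratio


def drop_noise(node):
    """Recursively remove noise DIVs below the given node."""
    node.children = [c for c in node.children if not is_noise(c)]
    for child in node.children:
        drop_noise(child)


def topic_correlation(node, topic_words):
    # Assumed measure: share of topic words that occur in the subtree text.
    words = set(re.findall(r"\w+", node.full_text().lower()))
    return len(words & topic_words) / len(topic_words) if topic_words else 0.0


def extract_topic_text(html, topic, threshold=0.3):
    """Return the text of each top-level DIV tree that survives pruning."""
    builder = DivForestBuilder()
    builder.feed(html)
    topic_words = set(re.findall(r"\w+", topic.lower()))
    kept = []
    for tree in builder.forest:
        if is_noise(tree):
            continue
        drop_noise(tree)
        if topic_correlation(tree, topic_words) >= threshold:
            kept.append(tree.full_text())
    return kept
```

Calling `extract_topic_text(page_html, "bank interest rates")` on a news page would return the text of the DIV trees that mention the topic words, while link-only navigation blocks and unrelated footers are dropped. Taking the topic words from the page title is one plausible choice, but the abstract does not say how the topic is obtained.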

