Journal of Software (软件学报), 2009

Extraction Model Based on Web Format Information Quantity in Blog Post and Comment Extraction
基于网页格式信息量的博客文章和评论抽取模型

CAO Dong-Lin, LIAO Xiang-Wen, XU Hong-Bo, BAI Shuo
曹冬林,廖祥文,许洪波,白硕

Keywords: blog information extraction, minimal main text subtree, effective information ratio, Web format information, vision information, information quantity of separate position
博客信息抽取,最小正文子树,有效信息率,网页格式信息,视觉信息,切分位置信息量

Abstract:

Drawing on information theory, this paper presents a blog information extraction model based on the information quantity of Web page format information. First, the visual information of the blog page is combined with its effective text information to locate the main text, which carries the theme of the page. Second, the format information of the page is used to calculate the information quantity of each block, and the minimal information quantity among candidate separating positions is used to detect the boundary between the post and the comments within the main text. The model is language-insensitive and can be applied to blogs written in different natural languages. Experimental results show that the method achieves high precision both in locating the main text and in separating the post from the comments.
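
The abstract does not give the paper's formulas, so the following is only a rough illustration of the underlying information-theoretic idea, not the authors' model. It assumes that each block of the main text has been reduced to a format signature (the strings such as "div.post" are hypothetical) and scores each block by its Shannon self-information, estimated from how often its signature occurs among the blocks.

```python
import math
from collections import Counter


def block_information(signatures):
    """Return the self-information, in bits, of each block's format signature.

    Assumption for illustration: the probability of a signature is its
    relative frequency among the blocks passed in.
    """
    counts = Counter(signatures)
    total = len(signatures)
    return [-math.log2(counts[s] / total) for s in signatures]


# Hypothetical signatures: one post block followed by three comment blocks.
blocks = ["div.post", "div.comment", "div.comment", "div.comment"]
info = block_information(blocks)

for signature, bits in zip(blocks, info):
    print(f"{signature}: {bits:.3f} bits")
```

In this toy input the repeated comment signature is frequent and therefore carries little information, while the single post block carries more. How the paper turns such quantities into a boundary decision is not stated in the abstract.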
