Computer Science (计算机科学), 2011

Parsing DOM Tree Reversely and Extracting Web Main Page Information
逆序解析DOM树及网页正文信息提取

ZHANG Rui-xue, SONG Ming-qiu, GONG Yan-lei
张瑞雪,宋明秋,公衍磊

Keywords: DOM tree, Web content extraction, structural similarity, reverse parsing
DOM树，网页正文提取，结构相似性，逆序解析


Abstract:

Extracting the main content from an HTML Web page usually means parsing the HTML, traversing the whole DOM tree, and then extracting the data from it. This approach separates parsing from extraction and therefore limits speed. In fact, parsing the whole DOM tree is unnecessary. We propose an algorithm that parses the DOM tree in reverse order. Combining it with the theory of DOM similarity and the traditional method of parsing the DOM, we parse the DOM tree in both forward and reverse order while, at the same time, locating and retrieving the other targets. Because the method parses only part of the DOM tree, it reduces parsing time; because it does not have to traverse the whole tree to find the target information, it also reduces search time. Overall, the method substantially improves extraction speed. At the end of the paper, we demonstrate the superiority of this method.
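
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of avoiding a full parse, not the authors' method. It assumes the target element can be recognized by a known opening-tag prefix (the hypothetical `marker` argument), scans the raw HTML backward from the end with `str.rfind`, and then parses forward only until that one element closes. The names `extract_from_end`, `_SubtreeText`, and `_Done` are illustrative.

```python
from html.parser import HTMLParser


class _Done(Exception):
    """Raised to stop parsing as soon as the target element closes."""


class _SubtreeText(HTMLParser):
    """Collects the text of the first element fed to it, then stops."""

    # Void elements never have a closing tag, so they must not change depth.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.depth -= 1
            if self.depth == 0:
                raise _Done  # target subtree is complete; skip the rest

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_from_end(html, marker):
    """Find the last occurrence of `marker` (assumed to be the start of the
    target's opening tag) and return the text of that element only."""
    start = html.rfind(marker)  # backward scan from the end of the page
    if start == -1:
        return None
    parser = _SubtreeText()
    try:
        parser.feed(html[start:])  # parse forward from the target only
    except _Done:
        pass
    return " ".join(" ".join(parser.chunks).split())


page = (
    '<html><body><div id="nav">Home</div>'
    '<div id="main"><p>Hello <b>world</b></p></div>'
    '<div id="footer">Contact</div></body></html>'
)
print(extract_from_end(page, '<div id="main"'))
```

In this sketch everything before the target is never tokenized and everything after it is skipped once the closing tag is reached, which mirrors the abstract's claim that only part of the tree needs to be parsed. The sketch relies on well-formed nesting inside the target element.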
