Parsing DOM Tree Reversely and Extracting Web Main Page Information

Keywords: DOM tree, Web content extraction, structural similarity, reverse parsing

Abstract:

Extracting the main content of an HTML Web page usually means parsing the HTML, traversing the entire DOM tree, and then extracting the target data from it in a separate step. This approach treats parsing and extraction as two separate processes, which limits its speed. In fact, parsing the whole DOM tree is unnecessary. We propose an algorithm that parses the DOM tree in reverse order. Combining it with the theory of DOM structural similarity and the traditional method of DOM parsing, we parse the DOM tree in both normal and reverse order and, at the same time, locate and extract the other targets. On the one hand, the method parses only part of the DOM tree, which reduces parsing time. On the other hand, it does not need to visit the whole tree to find the target information, which reduces search time. Overall, the method is considerably faster. The paper concludes with a proof of the method's superiority.
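The abstract gives no implementation details, so the following Python sketch is only an illustration of the general idea of scanning tags from the end of a document and stopping once the target block is found. It is not the paper's algorithm. The names `tags_in_reverse` and `extract_last_block`, the `class="main"` marker, and the regex-based tokenizer are all assumptions made for this example, and the sketch ignores comments, scripts, and malformed markup.

```python
import re

# Simplified tag pattern (assumption): real HTML needs a proper tokenizer.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>")


def tags_in_reverse(html):
    """Yield (start, end, is_closing, name, attrs) for each tag, last tag first."""
    pos = len(html)
    while True:
        start = html.rfind("<", 0, pos)  # nearest '<' to the left of pos
        if start == -1:
            return
        match = TAG_RE.match(html, start)
        if match:
            yield (start, match.end(), bool(match.group(1)),
                   match.group(2).lower(), match.group(3))
        pos = start


def extract_last_block(html, tag="div", marker='class="main"'):
    """Scan backward and return (inner_html, start_offset) of the last <tag>
    whose attribute text contains `marker`, or (None, None) if none is found.
    Nothing before start_offset is ever tokenized."""
    pending_closers = []  # offsets of closing tags still waiting for their opener
    for start, end, is_closing, name, attrs in tags_in_reverse(html):
        if name != tag:
            continue
        if is_closing:
            pending_closers.append(start)
        elif pending_closers:
            close_start = pending_closers.pop()  # innermost unmatched closer
            if marker in attrs:
                return html[end:close_start], start
    return None, None


if __name__ == "__main__":
    page = ('<html><body><div class="nav">menu</div>'
            '<div class="main"><p>Hello</p></div>'
            '<div class="footer">f</div></body></html>')
    inner, offset = extract_last_block(page)
    print(inner, offset)
```

Because the scan stops at the opening tag of the matched block, the portion of the page before `start_offset` is never examined, which is the kind of saving the abstract attributes to partial parsing.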
