%0 Journal Article
%T Parsing DOM Tree Reversely and Extracting Web Main Page Information
逆序解析DOM树及网页正文信息提取
%A ZHANG Rui-xue
%A SONG Ming-qiu
%A GONG Yan-lei
%A 张瑞雪
%A 宋明秋
%A 公衍磊
%J 计算机科学
%D 2011
%I
%X Extracting the main content of an HTML Web page usually involves parsing the HTML, traversing the whole DOM tree, and then extracting the data according to its distribution in the tree. This approach treats parsing and extraction as two separate processes, which limits its speed. In fact, parsing the whole DOM tree is unnecessary. We propose an algorithm that parses the DOM tree in reverse order. Combining the theory of DOM similarity with the traditional method of DOM parsing, we parse the DOM tree in both normal and reverse order and, at the same time, locate and extract the other targets. Because this method parses only part of the DOM tree, it reduces parsing time; because it does not need to visit the whole tree to find the target information, it also reduces search time. Overall, the method substantially improves extraction speed. At the end of the paper, we prove the superiority of this method.
%K DOM tree
%K Web content extracting
%K Structural similarity
%K Parsing reversely
DOM树,网页正文提取,结构相似性,逆序解析
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=5B3AB970F71A803DEACDC0559115BFCF0A068CD97DD29835&cid=8240383F08CE46C8B05036380D75B607&jid=64A12D73428C8B8DBFB978D04DFEB3C1&aid=4321C878BB49ABC55A47462A170F946A&yid=9377ED8094509821&vid=16D8618C6164A3ED&iid=E158A972A605785F&sid=527AEE9F3446633A&eid=D0130CB18500EA84&journal_id=1002-137X&journal_name=计算机科学&referenced_num=0&reference_num=10