%0 Journal Article
%T A New Way of Extracting the Topic Information in Web Pages Based on DIV Tag-tree
%A OUYANG Liu-Bo (欧阳柳波)
%A YANG Zhu (杨柱)
%A YI Xian (易显)
%J 计算机系统应用 (Computer Systems & Applications)
%D 2010
%X As CSS+DIV layout has become the dominant way of structuring web pages, efficiently extracting the topic information from such pages has become an urgent task for specialized search engines. This paper proposes a method for extracting topic information from web pages based on the DIV tag-tree. The method first uses the DIV tags to split an HTML file into a DIV forest. It then filters the noise nodes out of the DIV tag-trees and builds STU-DIV model-trees. Finally, it prunes the DIV tag-trees that are irrelevant to the topic by means of topic correlation analysis and a Cut-Tree algorithm. An analysis of several news web pages shows that the method can efficiently extract the topic information in web pages.
%K extraction of topic information
%K DIV tag-tree
%K STU-DIV model-tree
%K topic correlation
%K Cut-Tree algorithm
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=5B3AB970F71A803DEACDC0559115BFCF0A068CD97DD29835&cid=8240383F08CE46C8B05036380D75B607&jid=D4F6864C950C88FFCE5B6C948A639E39&aid=A674ADBDE78411BDB593B79DCB7D622A&yid=140ECF96957D60B2&vid=2A8D03AD8076A2E3&iid=94C357A881DFC066&sid=23104246A5FCFCEF&eid=E0F6F365E4766526&journal_id=1003-3254&journal_name=计算机系统应用&referenced_num=0&reference_num=9