%0 Journal Article
%T A General Approach to Extracting Topical Information in HTML Pages
%O Original title (Chinese): 一种通用HTML网页主题信息提取方法
%A Xu Wen
%A Du Yuncheng
%A Li Yuqin
%A Shi Shuicai
%J New Technology of Library and Information Service (现代图书情报技术)
%D 2007
%X This paper studies how to extract topical content from Web pages built on different kinds of templates and introduces a new DOM-based extraction method. The approach transforms each HTML document into a DOM tree, extracts the topical content, and deletes topic-unrelated content. The result is an HTML document that contains only the topical information.
%K DOM
%K Information extraction
%K Blocking
%K Relevance
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=B5EDD921F3D863E289B22F36E70174A7007B5F5E43D63598017D41BB67247657&cid=E46382710BF131B2&jid=24AADBCD0D5373C73F37F78D10E2F717&aid=589FA9B026E74CCD&yid=A732AF04DDA03BB3&vid=0B39A22176CE99FB&iid=CA4FD0336C81A37A&sid=1371F55DA51B6E64&eid=BE33CC7147FEFCA4&journal_id=1003-3513&journal_name=现代图书情报技术&referenced_num=4&reference_num=8