%0 Journal Article
%T A General Approach to Extracting Topical Information in HTML Pages (一种通用HTML网页主题信息提取方法)
%A Xu Wen (许文)
%A Du Yuncheng (都云程)
%A Li Yuqin (李渝勤)
%A Shi Shuicai (施水才)
%J New Technology of Library and Information Service (现代图书情报技术)
%D 2007
%X By studying how to extract topical content from Web pages built on different kinds of templates, this paper introduces a new DOM-based extraction method. The approach transforms HTML documents into DOM trees, extracts the topical content, and deletes topic-unrelated content. The result is an HTML document that contains only the topical information.
%K DOM
%K information extraction (信息提取)
%K block segmentation (分块)
%K relevance (相关度)
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=B5EDD921F3D863E289B22F36E70174A7007B5F5E43D63598017D41BB67247657&cid=E46382710BF131B2&jid=24AADBCD0D5373C73F37F78D10E2F717&aid=589FA9B026E74CCD&yid=A732AF04DDA03BB3&vid=0B39A22176CE99FB&iid=CA4FD0336C81A37A&sid=1371F55DA51B6E64&eid=BE33CC7147FEFCA4&journal_id=1003-3513&journal_name=现代图书情报技术&referenced_num=4&reference_num=8