New Technology of Library and Information Service (现代图书情报技术), 2007
A General Approach to Extracting Topical Information in HTML Pages
Abstract:
This paper studies how to extract topical content from Web pages built on different kinds of templates and introduces a new extraction method based on the Document Object Model (DOM). The approach transforms an HTML document into a DOM tree, extracts the topical content from the tree, and deletes topic-unrelated content. The result is an HTML document that contains only the topical information.
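
The abstract does not state the rules used to separate topical from topic-unrelated nodes, so the following is only a minimal sketch of the general idea (parse to a tree, prune, serialize), not the paper's method. The noise-tag list, the link-density heuristic, the threshold value, and all names (`Node`, `TreeBuilder`, `prune`, `extract_topical_html`) are assumptions introduced for illustration. It uses only the Python standard library.

```python
import html
from html.parser import HTMLParser

# Assumptions for this sketch (not taken from the paper):
NOISE_TAGS = {"script", "style", "nav", "footer", "form", "iframe"}
BLOCK_TAGS = {"div", "ul", "ol", "table", "td", "section", "aside"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
LINK_DENSITY_LIMIT = 0.5  # hypothetical threshold


class Node:
    """A very small DOM-like element node."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, attrs or [], parent
        self.children = []  # each child is a Node or a text string


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from HTML source."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never hold children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def text_of(node):
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node.children)


def link_text_len(node):
    if isinstance(node, str):
        return 0
    if node.tag == "a":
        return len(text_of(node))
    return sum(link_text_len(child) for child in node.children)


def prune(node):
    """Delete subtrees that this sketch treats as topic-unrelated."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in NOISE_TAGS:
            continue
        total = len(text_of(child))
        if (child.tag in BLOCK_TAGS and total
                and link_text_len(child) / total > LINK_DENSITY_LIMIT):
            continue  # block made up mostly of link text, e.g. a menu
        prune(child)
        kept.append(child)
    node.children = kept


def serialize(node):
    if isinstance(node, str):
        return html.escape(node, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#document":
        return inner
    attrs = "".join(' {}="{}"'.format(k, html.escape(v or ""))
                    for k, v in node.attrs)
    if node.tag in VOID_TAGS:
        return "<{}{}>".format(node.tag, attrs)
    return "<{}{}>{}</{}>".format(node.tag, attrs, inner, node.tag)


def extract_topical_html(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    prune(builder.root)
    return serialize(builder.root)
```

Calling `extract_topical_html(page_source)` returns HTML in which blocks such as navigation menus and scripts have been removed under these assumed rules. A real system would need template-aware rules in place of the fixed tag list and threshold.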