%0 Journal Article
%T Research and Implementation of Text Information Extraction Based on WEB
%A 刘三星
%J Hans Journal of Data Mining
%P 69-74
%@ 2163-1468
%D 2015
%I Hans Publishing
%R 10.12677/HJDM.2015.54010
%X Building on traditional information extraction theory and methods, this paper implements a web page text extraction method based on XML tag features. The characteristics of typical web pages are studied. HTML pages are first normalized and converted to XML. In keeping with the characteristics of XML, the character encoding is then converted from GB to UTF and the result is standardized. Finally, drawing on the properties of XML tags, the web page text is extracted according to its tags.
%K Internet
%K Information Extraction
%K HTML
%K XML
%K Text Information Extraction
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=16411