

基于WEB网页文本信息抽取研究与实现
Research and Implementation of Text Information Extraction Based on WEB

DOI: 10.12677/HJDM.2015.54010, PP. 69-74

Keywords: Internet, Information Extraction, HTML, XML, Text Information Extraction


Abstract:

Building on traditional information extraction theory and methods, this paper implements a web page text extraction method based on XML tag features. After studying the characteristics of typical web pages, the method first standardizes each HTML page and converts it to XML. In keeping with the requirements of XML, the page's internal character encoding is then converted from GB to UTF and normalized. Finally, the text of the page is extracted tag by tag, using the properties of the XML tags.
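The pipeline described in the abstract (encoding conversion, HTML standardization into XML, tag-based extraction) can be illustrated with a minimal sketch. This is not the authors' implementation, and everything specific in it is an assumption: the function and class names are hypothetical, `gb18030` stands in for "GB" because it is a superset of GB2312 and GBK, Python's standard `html.parser` stands in for whatever standardization tool the paper used, and the choice of `title`, `h1`, and `p` as text-bearing tags is arbitrary. HTML's implicit end-tag rules (for example, a new `<p>` closing the previous one) are not handled.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def gb_to_utf8(raw: bytes) -> bytes:
    """Re-encode a GB-family page as UTF-8 (gb18030 is an assumed choice)."""
    return raw.decode("gb18030", errors="replace").encode("utf-8")


class HtmlToXml(HTMLParser):
    """Builds a well-formed element tree from loosely written HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.builder.start("root", {})  # synthetic root keeps the tree single-rooted
        self.stack = []                 # names of currently open elements

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)       # close void elements immediately
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return                      # ignore stray closing tags
        while self.stack:               # close unclosed children first
            open_tag = self.stack.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self) -> ET.Element:
        self.close()                    # flush any buffered text
        while self.stack:               # close whatever the page left open
            self.builder.end(self.stack.pop())
        self.builder.end("root")
        return self.builder.close()


def html_to_xml(raw: bytes) -> ET.Element:
    """GB-encoded HTML bytes -> well-formed XML tree."""
    parser = HtmlToXml()
    parser.feed(gb_to_utf8(raw).decode("utf-8"))
    return parser.finish()


def extract_text(root: ET.Element, tags=("title", "h1", "p")) -> list:
    """Collect whitespace-normalized text from the selected tags."""
    results = []
    for elem in root.iter():
        if elem.tag in tags:
            text = " ".join("".join(elem.itertext()).split())
            if text:
                results.append(text)
    return results


if __name__ == "__main__":
    # A small GB2312 page with an unclosed <b> and missing </html>.
    page = ("<html><head><title>新闻标题</title></head>"
            "<body><p>第一段<b>文本</p><p>第二段<br>结束</body>").encode("gb2312")
    tree = html_to_xml(page)
    print(ET.tostring(tree, encoding="unicode"))
    print(extract_text(tree))
```

Building the tree with `TreeBuilder` rather than patching the markup as a string means the result is well-formed by construction, so it can be serialized as UTF-8 XML with `ET.tostring` and queried by tag name.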

