%0 Journal Article
%T Study on general extracting method of Web topic text
%A PU Qiang
%A LI Xin
%A LIU Qi-he
%A YANG Guo-wei
%A 蒲强
%A 李鑫
%A 刘启和
%A 杨国纬
%J 计算机应用 (Journal of Computer Applications)
%D 2007
%X This paper proposes a simple, efficient, and general method for extracting Chinese topic text from Web pages, with the goal of building a large Chinese text corpus. The method relies only on the length of Chinese text runs and on sequences of punctuation marks, combined with a few discrimination rules, and it extracts the required text accurately without analyzing HTML tags. Experiments show that the extraction is fast and accurate enough to meet the requirements of constructing a large Chinese text corpus.
%K Web text
%K text extracting
%K text corpus
%K topic
%K text length
%K method
%K generality
%K speed
%K experiment
%K tag analysis
%K HTML
%K Web page
%K discrimination rules
%K symbol sequence
%K punctuation
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=5B3AB970F71A803DEACDC0559115BFCF0A068CD97DD29835&cid=8240383F08CE46C8B05036380D75B607&jid=831E194C147C78FAAFCC50BC7ADD1732&aid=14E16CBD27430512730BF41D7320BBD6&yid=A732AF04DDA03BB3&vid=DB817633AA4F79B9&iid=B31275AF3241DB2D&sid=C919C6DD1115AFC0&eid=D3EC5D34434DACC5&journal_id=1001-9081&journal_name=计算机应用&referenced_num=0&reference_num=4