%0 Journal Article
%T L-Tree Match: A New Data Extraction Model and Algorithm for Huge Text Stream with Noises
%A Xu-Bin Deng
%A Yang-Yong Zhu
%J Journal of Computer Science and Technology
%D 2005
%X In this paper, a new method, named L-tree match, is presented for extracting data from complex data sources. First, based on the data extraction logic presented in this work, a new data extraction model is constructed in which model components are structurally correlated via a generalized template. Second, a database-populating mechanism is built, along with some object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Third, top-down and bottom-up strategies are combined to design a new extraction algorithm that can extract data from data sources with optional, unordered, nested, and/or noisy components. Lastly, this method is applied to extract accurate data from biological documents amounting to 100GB for the first online integrated biological data warehouse of China.
%K data extraction
%K data model
%K extraction algorithm
%K regular expression
%K wrapper
%K tree matching
%K data extraction model
%K logic
%K database
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=5B3AB970F71A803DEACDC0559115BFCF0A068CD97DD29835&cid=8240383F08CE46C8B05036380D75B607&jid=F57FEF5FAEE544283F43708D560ABF1B&aid=B6AAD06A177D0F6AF9E2F852ECE3680E&yid=2DD7160C83D0ACED&vid=A04140E723CB732E&iid=B31275AF3241DB2D&sid=50FF665B2730AEEC&eid=1F7317C17A9AF4FA&journal_id=1000-9000&journal_name=计算机科学技术学报&referenced_num=3&reference_num=12