|
Extracting Semi-Structured Information Based on Subtrees

Keywords: Trees

Abstract: This paper proposes a new web content structure based on visual representation. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. The method extracts each record from the data region and identifies whether it is a flat or a nested record based on visual information: the area the record covers and the number of data items it contains. The next step extracts the data items from these records and transfers them into a database. The paper presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. It simulates how a user understands web layout structure based on visual perception. Compared with existing techniques, our approach is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure.
|
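The abstract says that records are labelled flat or nested from two visual cues, the area covered and the number of data items, but it does not state the decision rule. The Python sketch below is only an illustration under assumed conditions: the `Record` class, the `classify_records` function, the median-based comparison, and the `ratio` threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Record:
    """A candidate data record described only by visual cues (hypothetical type)."""
    width: float      # rendered width in pixels
    height: float     # rendered height in pixels
    item_count: int   # number of data items found inside the record

    @property
    def area(self) -> float:
        return self.width * self.height


def classify_records(records, ratio=1.5):
    """Label each record 'flat' or 'nested'.

    Assumed rule (not from the paper): a record is 'nested' when both its
    area and its item count exceed `ratio` times the median of the region.
    """
    if not records:
        return []
    med_area = median(r.area for r in records)
    med_items = median(r.item_count for r in records)
    labels = []
    for r in records:
        is_nested = (r.area > ratio * med_area
                     and r.item_count > ratio * med_items)
        labels.append("nested" if is_nested else "flat")
    return labels


# Made-up data region: three similar records and one much larger one.
region = [
    Record(600, 80, 4),
    Record(600, 80, 4),
    Record(600, 240, 12),
    Record(600, 85, 4),
]
print(classify_records(region))
```

With this made-up region, only the third record is both much larger than its neighbours and richer in data items, so it is the only one the assumed rule labels as nested. A real system would need a rule derived from the paper's own definitions of area and data items.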