|
Record-Level Information Extraction from a Web Page based on Visual Features

Abstract: Web databases contain a huge amount of structured data that is accessible only through their query interfaces. Query results are presented in dynamically generated web pages, usually in the form of data records, for human use. Automatically extracting data records from query result pages is a decisive problem for web data integration applications such as comparison shopping sites and meta-search engines. A number of approaches to query result extraction have been proposed. As the structures of web pages become more complex, these approaches start to fail. Query result pages usually also contain other types of information in addition to query results, e.g., advertisements and navigation bars. Most existing approaches do not remove such irrelevant content, which may affect the accuracy of data record extraction. We have observed that query results are usually displayed in regular visual patterns and that terms used in a query often reappear in the query results.
|
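The second observation above, that query terms often reappear in query results, can be illustrated with a minimal sketch. This is not the method proposed in the paper; it only assumes that a page has already been split into candidate text blocks, and the names `tokenize`, `query_term_score`, and `rank_blocks` are hypothetical helpers introduced here for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def query_term_score(block_text, query):
    """Return the fraction of distinct query terms that reappear in a block."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    block_terms = set(tokenize(block_text))
    return len(query_terms & block_terms) / len(query_terms)


def rank_blocks(blocks, query):
    """Sort candidate blocks by descending query-term score (stable sort)."""
    return sorted(blocks, key=lambda b: query_term_score(b, query), reverse=True)


# Hypothetical candidate blocks taken from a query result page.
blocks = [
    "Home | Deals | Contact us",                    # navigation bar
    "Canon EOS 2000D digital camera, 24 MP, $399",  # data record
    "Sponsored: cheap flights to Rome",             # advertisement
]
ranked = rank_blocks(blocks, "canon digital camera")
```

Under these assumptions, blocks that echo the query rank above navigation bars and advertisements. A realistic extractor would presumably combine such a score with the visual regularity of the records rather than rely on term overlap alone.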