|
Cleaning Various Noise Patterns in Web Pages for Web Data Extraction

Keywords: Noise Patterns, Information Retrieval, Case-Based Reasoning, Noise, Neural Network, Data Patterns

Abstract: Cleaning Web pages before mining is critical for improving the performance of information retrieval and information extraction. With the amount of information available on the Internet growing exponentially, an effective technique that lets users separate useful information from unnecessary information is urgently required. We therefore investigate removing the various noisy data patterns in Web pages, rather than extracting the relevant content, in order to obtain the main content. In this paper, we propose an approach, NoiseEliminator, that detects multiple noise patterns and removes them from the Web pages of any Web site. The approach builds on the basic idea of Case-Based Reasoning (CBR): it finds noise patterns within the mixture patterns (data and noise together) of the current Web page by matching them against similar noise patterns kept in the case base. We also apply the back-propagation neural network algorithm to classify the noise patterns, data patterns, and mixture patterns in the current Web page; the classification result is then used to remove the noise patterns. We have implemented and evaluated our method on several commercial Web sites and news Web sites to measure its performance and improvement. Experimental results show the effectiveness of the approach.
|
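The case-matching idea described in the abstract can be illustrated with a minimal sketch. This is not the NoiseEliminator implementation: the abstract does not specify how patterns are represented or compared, so the feature strings, the `Block` type, the Jaccard similarity measure, and the 0.6 threshold below are all assumptions chosen for illustration. The neural network classification step is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A page fragment described by a hypothetical set of string features."""
    name: str
    features: frozenset


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def remove_noise(blocks, case_base, threshold=0.6):
    """Keep only the blocks that do not resemble any stored noise case.

    The threshold is an assumed value, not one taken from the paper.
    """
    kept = []
    for block in blocks:
        # Best similarity between this block and any known noise pattern.
        best = max((jaccard(block.features, case) for case in case_base),
                   default=0.0)
        if best < threshold:
            kept.append(block)
    return kept


# Hypothetical case base: feature sets of previously seen noise patterns.
CASE_BASE = [
    frozenset({"tag:div", "class:nav", "links:many", "text:short"}),
    frozenset({"tag:div", "class:ad", "img:banner", "links:external"}),
]

# Hypothetical blocks of the current page.
PAGE = [
    Block("menu", frozenset({"tag:div", "class:nav", "links:many",
                             "text:short"})),
    Block("article", frozenset({"tag:div", "class:story", "text:long",
                                "links:few"})),
    Block("banner", frozenset({"tag:div", "class:ad", "img:banner",
                               "links:external", "text:short"})),
]

clean = remove_noise(PAGE, CASE_BASE)
print([b.name for b in clean])  # prints ['article']
```

In this toy page the menu matches a stored case exactly and the banner matches one with similarity 0.8, so both are dropped, while the article shares only one feature with each case and is kept. A set-based similarity is used here only because it is simple to verify by hand; a real system would need features and a matching rule suited to its page structure.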