International Journal of Engineering Sciences and Emerging Technologies, 2013
Abstract: The World Wide Web is today the most popular information provider. A website is a collection of web pages, and web pages usually carry information for users. Websites are designed with common templates and content. Templates give readers easy access to the content through consistent structures, even though the templates are not explicitly announced. Current template extraction techniques degrade the performance of web applications such as search engines because of irrelevant terms in templates. In this work, we present a new method for extracting templates from a large number of web documents generated from heterogeneous templates. We cluster the web documents based on the similarity of their underlying template structures so that the template for each cluster is extracted simultaneously.
Template Extraction from Heterogeneous Web Pages
Trupti B. Mane, Prof. Girish P. Potdar
International Journal of Advanced Computer Research, 2012
Abstract: The World Wide Web (WWW) is getting a lot of attention as it becomes a huge repository of information. A web page is deployed on a website by its web template system. These templates can be used by any individual or organization to set up a website. Templates also give readers easy access to the contents, guided by consistent structures. Hence, template detection techniques are emerging as web templates become more and more important. Earlier systems assume that all documents are guaranteed to conform to a common template, and template extraction is done under that assumption. However, this is not feasible in real applications. Our focus is on extracting templates from heterogeneous web pages. Because of the large variety of web documents, an unknown number of templates must be managed. This can be achieved by clustering web documents with a good partitioning method. The correctness of the extracted templates depends on the quality of the clustering.
RoadRunner for Heterogeneous Web Pages Using Extended MinHash
A Suresh Babu, P. Premchand, A. Govardhan
International Journal of Database Management Systems, 2012
Abstract: The Internet presents a large amount of useful information that is usually formatted for human users, which makes it hard to extract relevant data from diverse sources. Therefore, robust, flexible Information Extraction (IE) systems that transform web pages into program-friendly structures such as a relational database will become essential. IE produces structured data ready for post-processing. RoadRunner will be used to extract information from template web pages. In this paper, we present a novel algorithm for extracting templates from a large number of web documents generated from heterogeneous templates. The proposed system focuses on information extraction from heterogeneous web pages. We cluster the web documents based on their common template structures so that the template for each cluster is extracted simultaneously. The resulting clusters are given as input to the RoadRunner system.
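The clustering step this abstract describes can be illustrated with a minimal sketch. It uses plain MinHash over each page's set of tag paths, not the extended MinHash variant named in the title, and every name in it (`PathCollector`, `tag_paths`, `minhash_signature`, `estimated_jaccard`, `cluster`, the `threshold` value) is a hypothetical choice made for illustration rather than anything taken from the paper.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def minhash_signature(items, num_hashes=64):
    """One minimum per seeded hash function; an empty set maps to all zeros."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            (int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(),
                                digest_size=8).digest(), "big")
             for item in items),
            default=0))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def cluster(pages, threshold=0.5):
    """Greedy clustering: a page joins the first cluster whose first
    member looks similar enough, otherwise it starts a new cluster."""
    signatures = [minhash_signature(tag_paths(page)) for page in pages]
    clusters = []  # each cluster is a list of page indices
    for index, signature in enumerate(signatures):
        for members in clusters:
            if estimated_jaccard(signatures[members[0]], signature) >= threshold:
                members.append(index)
                break
        else:
            clusters.append([index])
    return clusters
```

Pages that share a template produce nearly the same tag paths and therefore similar signatures, so each resulting cluster could then be handed to a wrapper-induction tool. The greedy single-pass grouping is only the simplest stand-in for whatever partitioning method the authors use.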
Automatically Extracting Academic Papers from Web Pages Using Conditional Random Fields Model
Wei Liu, Jianxun Zeng
Journal of Software, 2011, DOI: 10.4304/jsw.6.8.1409-1416
Abstract: A huge number of academic papers (including research reports) are being released on web pages. It is important to extract these papers in a structured way for many popular applications, such as science and technology information retrieval and digital libraries. However, few investigations have addressed academic paper extraction. This paper proposes a unified approach for automatically extracting academic papers from web pages based on the Conditional Random Fields (CRF) model. In the proposed approach, academic paper extraction and semantic labeling are performed simultaneously by employing the CRF model. Experimental results show that our approach achieves significantly better extraction results.
The Development and the Evaluation of a System for Extracting Events from Web Pages
Mihai-Constantin AVORNICULUI, Silviu Claudiu POPA, Constantin AVORNICULUI
Informatica Economica Journal, 2010
Abstract: Centralizing information about a particular kind of event is primarily useful for running news services. These services should provide updated information, if possible in real time, on a specific type of event. Extracting these events involves the automatic analysis of the linguistic structure of documents to determine the possible sequences in which the events occur. This analysis yields structured and semi-structured documents from which the individual events can be extracted automatically. To measure the quality of such a system, a methodology is introduced that describes the stages of decomposing an event extraction system into components, defines quality attributes and properties for these components, and finally introduces metrics for evaluation.
A New Way of Extracting the Topic Information in Web Pages Based on DIV Tag-tree
OU YANG Liu-Bo, YANG Zhu, YI Xian
Computer Systems & Applications (计算机系统应用), 2010
Abstract: Since the CSS+DIV layout model has become the major trend in the structural layout of web pages, efficiently extracting the topic information from such pages has become an urgent task for professional search engines. This paper puts forward a new way of extracting the topic information in web pages based on the DIV tag-tree. It first divides an HTML file into a DIV forest using the DIV tags. It then filters the noise nodes in the DIV tag-trees and sets up STU-DIV model trees. Finally, it prunes the DIV tag-trees that are irrelevant to the topic information using Topic Correlation Analysis and a Cut-Tree Algorithm. An analysis of several news web pages shows that this method can efficiently extract the topic information in web pages.
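The first step in this abstract, splitting an HTML file into a forest of DIV trees, might look roughly like the sketch below. It covers only that step. The noise filtering, the STU-DIV model and the Cut-Tree Algorithm are not specified in the abstract and are not attempted here. The names `DivNode`, `DivForestBuilder`, `build_div_forest` and `total_text` are hypothetical, and the text-length helper is only an assumed example of a signal a later pruning stage could use.

```python
from html.parser import HTMLParser


class DivNode:
    """One <div> element: its attributes, nested divs and own text."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)
        self.children = []
        self.text = []


class DivForestBuilder(HTMLParser):
    """Builds one tree per top-level <div>; other tags are ignored,
    but their text is credited to the innermost enclosing <div>."""

    def __init__(self):
        super().__init__()
        self.roots = []
        self.open_divs = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        node = DivNode(attrs)
        if self.open_divs:
            self.open_divs[-1].children.append(node)
        else:
            self.roots.append(node)
        self.open_divs.append(node)

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()

    def handle_data(self, data):
        stripped = data.strip()
        if self.open_divs and stripped:
            self.open_divs[-1].text.append(stripped)


def build_div_forest(html):
    builder = DivForestBuilder()
    builder.feed(html)
    builder.close()
    return builder.roots


def total_text(node):
    """Characters of text in this div and all nested divs."""
    own = sum(len(piece) for piece in node.text)
    return own + sum(total_text(child) for child in node.children)
```

For example, `build_div_forest(page)` returns a list of root nodes, and comparing `total_text(root)` across roots gives a crude view of which trees carry most of the page's text. Text outside any DIV is dropped, and script or style content inside a DIV is counted as text, which a real implementation would need to handle.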
Extracting Information by Mining Structures of Web Pages
LI Yuan, GENG Hua, ZHANG Meng, PAN Jin-Gui
Computer Science (计算机科学), 2006
Abstract: To simplify the task of obtaining information from the vast number of information sources available on the WWW, we have developed two different methods for fine-grained information extraction. This paper first describes the principles of the two methods, which work by mining the structures of Web pages, and then compares their advantages and disadvantages. Finally, we test the performance of the two methods and analyze the experimental results.
A Methodology for Template Extraction from Heterogeneous Web Pages
Vidya Kadam, Prakash. R. Devale
Indian Journal of Computer Science and Engineering, 2012
Abstract: The World Wide Web is a vast and most useful collection of information. To achieve high productivity in publishing, web pages are automatically evaluated using common templates with contents. Templates are considered harmful because they compromise the relevance judgement of many web information retrieval and web mining methods, such as clustering and classification, and badly impact the performance and resources of tools that process web pages. Thus, template detection techniques have received a lot of attention as a way to improve the performance of search engines and of the clustering and classification of web documents. In this paper, we present an approach to detect and extract templates from heterogeneous web documents and cluster them into different groups. The pages belonging to each group should possess the same structure. This saves the time needed to find the best templates from a large number of web documents and also saves the memory required to find the best template structure.
Semantic Extraction from List Web Pages
Ismail Jellouli, Mohammed El Mohajir
International Journal of Computer Science Issues, 2012
Abstract: Extracting structured information from web pages is a problem that has many applications and has gained increased interest in recent years. We propose an approach that achieves extraction and semantic description of the data contained in a list web page. Our approach is fully automatic and is based on a seed ontology that contains minimal information about the domain. It uses an instance-based classifier to characterize the attributes of the ontology. In contrast to existing methods, our approach makes no assumption about the design of web pages; it is totally layout independent. Experimental results obtained from different web pages of different web sites in different domains show that our approach is effective.
Cleaning Various Noise Patterns in Web Pages for Web Data Extraction
Thanda Htwe
International Journal of Network and Mobile Technologies, 2010
Abstract: Cleaning Web pages before mining is critical for improving the performance of information retrieval and information extraction. With the exponentially growing amount of information available on the Internet, an effective technique for users to discern useful information from unnecessary information is urgently required. We therefore investigate removing various noisy data patterns from Web pages, instead of extracting relevant content, to obtain the main content information. In this paper, we propose an approach, NoiseEliminator, that detects multiple noise patterns and removes them from the Web pages of any Web site. Our approach is based on the basic idea of Case-Based Reasoning (CBR): it finds noise patterns among the mixture (data and noise together) patterns in the current Web page by matching similar noise patterns kept in the case base. We also apply a back-propagation neural network algorithm to classify the noise patterns, data patterns and mixture patterns in the current Web page. The classification result of the neural network is used to remove noise patterns. We have implemented our method on several commercial Web sites and news Web sites to evaluate the performance and improvement of our approach. Experimental results show the effectiveness of the approach.
