oalib
A Methodology for Template Extraction from Heterogeneous Web Pages
Vidya Kadam, Prakash R. Devale
Indian Journal of Computer Science and Engineering, 2012
Abstract: The World Wide Web is a vast and highly useful collection of information. To achieve high productivity in publishing, web pages are automatically evaluated using common templates with contents. Templates are considered harmful because they compromise the relevance judgement of many web information retrieval and web mining methods, such as clustering and classification, and badly impact the performance and resources of tools that process web pages. Thus, template detection techniques have received a lot of attention as a way to improve the performance of search engines and the clustering and classification of web documents. In this paper, we present an approach to detect and extract templates from heterogeneous web documents and cluster them into different groups. The pages belonging to each group should possess the same structure. This saves the time needed to find the best templates from a large number of web documents and also saves the memory required to find the best template structure.
Template Extraction from Heterogeneous Web Pages
Trupti B. Mane, Prof. Girish P. Potdar
International Journal of Advanced Computer Research, 2012
Abstract: The World Wide Web (WWW) is getting a lot of attention as it becomes a huge repository of information. A web page gets deployed on a website by its web template system. Those templates can be used by any individual or organization to set up their website. The templates also give readers easy access to the contents, guided by consistent structures. Hence template detection techniques are emerging as web templates become more and more important. Earlier systems assume that all documents are guaranteed to conform to a common template, and template extraction is done under that assumption. However, this is not feasible in real applications. Our focus is on extracting templates from heterogeneous web pages. Because of the large variety of web documents, there is a need to manage an unknown number of templates. This can be achieved by clustering web documents with a good partitioning method. The correctness of the extracted templates depends on the quality of the clustering.
An Approach for Content Retrieval from Web Pages Using Clustering Techniques
R. Manjula, A. Chilambuchelvan
Circuits and Systems (CS), 2016, DOI: 10.4236/cs.2016.79230
Abstract: Mining content from an information database poses challenges for industry experts and researchers because of the overcrowded information in huge data. In web searching, the information retrieved is often not appropriate: it gives ambiguous information for the user query, and the user cannot get relevant information within the stipulated time. To overcome these issues, we propose a new information retrieval methodology, EPCRR, which provides the most exact information to the user by means of a collaborative clustered automated filter. The filter makes use of the collaborative data set and works on prediction, giving the highest ranking to the exact data retrieved. Retrieval works on the basis of recommendation of data, which consists of the relevant data set with the highest priority from the cluster of data that is in high usage. In this work, we make use of an automated wrapper, which works similarly to a meta crawler and obtains the content in the semantic usage data format. Information obtained from the user by the agent is ranked based on the Enabled Pile clustered data with respect to the metadata information from the agent and end user. The top-ranked information is given to the end user within the stipulated time, and the remaining top information is moved to the data repository for future use. The data collected remains stable based on the user preference and follows an intelligent system approach, in which the user can choose any information in any instance and be provided with a suitably high range of exact content. We find that the proposed algorithm produces better results than existing work and costs less online computation time.
A Novel Method for Extracting Information from Web Pages with Multiple Presentation Templates
Qingzhong Li, Yanhui Ding, An Feng, Yongquan Dong
Journal of Software, 2010, DOI: 10.4304/jsw.5.5.506-513
Abstract: Web information extraction is a key part of web data integration. With the needs of e-commerce websites and the development of web design, web pages with multiple presentation templates have arisen. Current web information extraction systems are usually based on a single presentation template, so web pages with multiple presentation templates can’t be extracted efficiently. This paper focuses on the extraction problem for web pages with multiple presentation templates. Four different kinds of this problem are considered, and a novel method based on path entropy, presentation regularity and ontology knowledge is presented. The experiment indicates that this method is very promising and that it achieves excellent recall and precision.
Template Extraction from Heterogeneous Web Pages Using Text Clustering
T. L. N. Divya, G. Loshma, Dr. Nagaratna P Hegde
International Journal of Computer Trends and Technology, 2012
Abstract: Nowadays most information is stored in text databases. This information consists of large collections of documents from heterogeneous web pages. To extract templates from these documents, we use algorithms that measure the similarity of the underlying template structures in the documents, and we cluster the web documents based on that similarity so that a template is extracted for each cluster. The algorithms previously used for this are RTDM, Text-Hash and Text-Max, but they take considerable time and space. In this paper we use the WaveK-Means algorithm to find the similarity between web pages. This algorithm consumes less space and time than RTDM, Text-Hash and Text-Max. Our experimental results with real-life data sets confirm the effectiveness and robustness of our algorithm.
A Comparison of Techniques for Sampling Web Pages
Eda Baykan, Monika Henzinger, Stefan F. Keller, Sebastian De Castelberg, Markus Kinzler
Computer Science, 2009
Abstract: As the World Wide Web is growing rapidly, it is getting increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively one has to resort to other techniques like sampling to determine the properties of the web. A uniform random sample of the web would be useful to determine the percentage of web pages in a specific language, on a topic or in a top level domain. Unfortunately, no approach has been shown to sample the web pages in an unbiased way. Three promising web sampling algorithms are based on random walks. They each have been evaluated individually, but making a comparison on different data sets is not possible. We directly compare these algorithms in this paper. We performed three random walks on the web under the same conditions and analyzed their outcomes in detail. We discuss the strengths and the weaknesses of each algorithm and propose improvements based on experimental results.
Clustering Techniques
LINDEN, R.
Salesian Journal on Information Systems, 2009
Abstract: This tutorial describes clustering methods that allow for the extraction of interesting characteristics directly from the data, separating them into functional groups or inserting them into a hierarchy, for further study.
RoadRunner for Heterogeneous Web Pages Using Extended MinHash
A. Suresh Babu, P. Premchand, A. Govardhan
International Journal of Database Management Systems, 2012
Abstract: The Internet presents a large amount of useful information, which is usually formatted for its users; this makes it hard to extract relevant data from diverse sources. There is therefore a significant need for robust, flexible Information Extraction (IE) systems that transform web pages into program-friendly structures such as a relational database. IE produces structured data ready for post-processing. RoadRunner is used to extract information from template web pages. In this paper, we present a novel algorithm for extracting templates from a large number of web documents that are generated from heterogeneous templates. The proposed system focuses on information extraction from heterogeneous web pages. We cluster the web documents based on their common template structures so that the template for each cluster is extracted simultaneously. The resulting clusters are given as input to the RoadRunner system.
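Note: the clustering step mentioned in the abstract above relies on comparing the template structure of pages. One way to picture this is plain MinHash, which estimates the Jaccard similarity of two sets from short signatures. The sketch below is not the authors' Extended MinHash: the function names, the use of SHA-1 as a seeded hash, and the representation of a page as a set of tag paths are all assumptions made for illustration.

```python
import hashlib


def _seeded_hash(seed, item):
    """Hypothetical helper: a 64-bit hash of `item`, varied by `seed`."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    """Return one minimum hash value per seeded hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [
        min(_seeded_hash(seed, item) for item in items)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Assumed page representation: the set of tag paths found in each page.
page_a = {"html/body/div", "html/body/div/h1", "html/body/div/p", "html/body/footer"}
page_b = {"html/body/div", "html/body/div/h1", "html/body/div/p", "html/body/div/img"}
page_c = {"html/body/table", "html/body/table/tr", "html/body/table/tr/td"}

sig_a = minhash_signature(page_a)
sig_b = minhash_signature(page_b)
sig_c = minhash_signature(page_c)

# Pages a and b share 3 of 5 distinct paths; pages a and c share none.
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

Pages whose estimated similarity exceeds a chosen threshold could then be placed in the same cluster; the threshold and the clustering procedure itself are left open here.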
An Overview of Web Data Extraction Techniques
Devika K, Subu Surendran
International Journal of Scientific Engineering and Technology, 2013
Abstract: Web pages are usually generated for visualization, not for data exchange. Each page may contain several groups of structured data. Web pages are generated by plugging data values into predefined templates. Manual data extraction from semi supervised web pages is a difficult task. This paper focuses on the study of various automatic web data extraction techniques. There are two main types of technique: one is based on wrapper induction, the other on automatic extraction. Wrapper induction uses a set of extraction rules, which are learnt from multiple pages containing similar data records.
A Fast Algorithm for Multiple Templates Locating Based on Templates Clustering
WEI Yanfeng, PENG Silong
Journal of Image and Graphics (中国图象图形学报), 2004
Abstract: To locate all instances of multiple templates in one image, a fast and more effective multiple-template locating algorithm based on clustering and synthesizing templates is proposed. The algorithm can process multiple templates even if only some of them are similar to each other, but all the templates must be almost the same size. First, a hierarchical clustering algorithm with feedback is applied to cluster the templates into categories. Within each category, a mathematical model is applied to synthesize the templates into a mother template. Second, the mother template of each category is used for searching and matching in the translation space, and each matched mother template then guides the checking of all its son templates. Edge maps are extracted for clustering, synthesizing, and matching. Partial Hausdorff distance matching with a fast algorithm is suggested for the mother template searching and matching procedure. The algorithm is tested with different multiple templates on an integrated circuit micro-image database. The results show that the scheme is efficient and effective for the task of multiple-template matching and locating.
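Note: the partial Hausdorff distance named in the abstract above can be sketched as follows. The directed partial distance from point set A to point set B takes, for each point of A, the distance to its nearest point in B, and returns the K-th smallest of those values rather than the maximum, so a few outlying edge points do not dominate the match. This is a brute-force illustration only, not the fast algorithm the abstract refers to; the function name and the `fraction` parameter are assumptions.

```python
import math


def directed_partial_hausdorff(points_a, points_b, fraction=0.8):
    """K-th ranked nearest-neighbour distance from points_a to points_b.

    K = ceil(fraction * len(points_a)); fraction=1.0 gives the ordinary
    directed Hausdorff distance. Brute force: O(len(a) * len(b)).
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Distance from each point of A to its nearest neighbour in B, sorted.
    nearest = sorted(
        min(math.dist(a, b) for b in points_b) for a in points_a
    )
    k = max(1, math.ceil(fraction * len(nearest)))
    return nearest[k - 1]


# Assumed toy edge points: the image contains the template plus one outlier.
template_edges = [(0, 0), (1, 0), (2, 0), (3, 0)]
image_edges = [(0, 0), (1, 0), (2, 0), (3, 0), (100, 0)]

# Ranking at 80% ignores the single outlier; ranking at 100% does not.
print(directed_partial_hausdorff(image_edges, template_edges, fraction=0.8))
print(directed_partial_hausdorff(image_edges, template_edges, fraction=1.0))
```

Choosing the fraction trades robustness against strictness: a lower value tolerates more occlusion or noise in the edge map but also accepts weaker matches.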