Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2014

A Similarity-Based Algorithm for Generating URL Filtering Rules in Web Corpus Crawling

pp. 631-637

Chen Huihui (陈荟慧), Shu Yunxing (舒云星), Lin Li (林丽)

Keywords: URL similarity, Web corpus crawling, URL filtering, corpus classification


Abstract:

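Web corpora are an important component of corpora, but the cost of visiting redundant URLs degrades the quality and efficiency of large-scale corpus crawling. Efficient URL filtering rules can improve both. Because files are unevenly distributed across a website's virtual directories, a method for generating URL filtering rules is proposed to locate the regions where target files cluster. The method uses regular expressions to replace URL elements with wildcards, merges identical elements and partitions the URLs into subsets, computes the similarity between the URLs within each subset, and builds a virtual directory tree from the URLs with high similarity. URL filtering rules and classification rules for corpus crawling are then generated from this tree. The paper describes the virtual directory tree generation algorithm in detail and experimentally compares how different similarity thresholds affect the generated tree and the URL filtering results.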
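The pipeline summarized in the abstract can be illustrated with a rough sketch. This is not the authors' algorithm: the abstract does not give the wildcard patterns, the similarity formula, or the tree construction, so the digit-only wildcard, the position-wise similarity measure, the greedy grouping, and every function name below (`wildcard`, `url_pattern`, `similarity`, `generate_rules`, `accept`) are assumptions made for illustration. The sketch also flattens the virtual directory tree into a single group of similar patterns.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def wildcard(element: str) -> str:
    """Generalize one path element: digit runs become a digit wildcard (assumed rule)."""
    pieces = re.split(r"(\d+)", element)  # odd indices hold the digit runs
    return "".join(r"\d+" if i % 2 else re.escape(p) for i, p in enumerate(pieces))


def url_pattern(url: str) -> tuple:
    """Turn a URL path into a tuple of generalized elements (query string ignored)."""
    elements = [e for e in urlsplit(url).path.split("/") if e]
    return tuple(wildcard(e) for e in elements)


def similarity(a: tuple, b: tuple) -> float:
    """Assumed measure: share of positions whose generalized elements agree."""
    if not a and not b:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def generate_rules(urls, threshold=0.6):
    """Return regex rules for the most populated group of similar URL patterns."""
    counts = Counter(url_pattern(u) for u in urls)  # merge identical patterns
    clusters = []  # each cluster is a list of patterns; the first is its representative
    for pattern, _ in counts.most_common():
        for cluster in clusters:
            if similarity(pattern, cluster[0]) >= threshold:
                cluster.append(pattern)
                break
        else:
            clusters.append([pattern])
    if not clusters:
        return []
    # Pick the cluster that covers the most input URLs.
    best = max(clusters, key=lambda c: sum(counts[p] for p in c))
    return ["^/" + "/".join(p) + "$" for p in best]


def accept(url: str, rules) -> bool:
    """Keep a URL only if its path matches one of the generated rules."""
    path = urlsplit(url).path
    return any(re.match(rule, path) for rule in rules)
```

Given a seed list in which most URLs look like `/news/2014/0612/1001.html`, `generate_rules` would produce a rule that accepts other dated news pages and rejects unrelated paths such as `/ads/banner7.gif`. Raising `threshold` splits the patterns into more, tighter groups; lowering it merges looser variants into one group, which mirrors the threshold trade-off the abstract says the experiments examine.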

