Chang C H, Kayed M, Girgis M R, et al. A Survey of Web Information Extraction Systems. IEEE Trans on Knowledge and Data Engineering, 2006, 18(10): 1411-1428
[2]
Wang H C, Ruan S H, Tang Q J. The Implementation of a Web Crawler URL Filter Algorithm Based on Caching // Proc of the 2nd International Workshop on Computer Science and Engineering. Qingdao, China, 2009: 453-456
[3]
Broder A Z, Najork M, Wiener J L. Efficient URL Caching for World Wide Web Crawling // Proc of the 12th International Conference on World Wide Web. Budapest, Hungary, 2003: 679-689
[4]
Qu C, Wang B Z, Wei P P. Efficient Focused Crawling Strategy Using Combination of Link Structure and Content Similarity // Proc of the IEEE International Symposium on Information Technology in Medicine and Education. Xiamen, China, 2008: 1045-1048
[5]
Kong Y Y, Shi H J. Deep Web Data Region Identification Based on Similar URL. Computer Engineering, 2012, 38(2): 48-50 (in Chinese) (孔燕燕,施化吉.基于相似URL的深层网数据区域识别.计算机工程, 2012, 38(2): 48-50)
[6]
Nie T Z, Wang Z H, Kou Y, et al. Crawling Result Pages for Data Extraction Based on URL Classification // Proc of the 7th Web Information Systems and Applications. Huhehot, China, 2010: 79-84
[7]
Wang J Y, Lochovsky F H. Data-Rich Section Extraction from HTML Pages // Proc of the 3rd International Conference on Web Information Systems Engineering. Singapore, Singapore, 2002: 313-322
[8]
Yang S H, Lin H L, Han Y B. Automatic Data Extraction from Template-Generated Web Pages. Journal of Software, 2008, 19(2): 209-223 (in Chinese) (杨少华,林海略,韩燕波.针对模板生成网页的一种数据自动抽取方法.软件学报, 2008, 19(2): 209-223)
[9]
Reis D C, Golgher P B, Silva A S, et al. Automatic Web News Extraction Using Tree Edit Distance // Proc of the 13th International Conference on World Wide Web. New York, USA, 2004: 502-511
[10]
Wong W C, Fu A W C. Finding Structure and Characteristics of Web Documents for Classification // Proc of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. Dallas, USA, 2000: 96-105
[11]
Srikantaiah K C, Suraj M, Venugopal K R, et al. Similarity Based Dynamic Web Data Extraction and Integration System from Search Engine Result Pages for Web Content Mining. ACEEE International Journal on Information Technology, 2013, 3(1): 42-49
[12]
Hu L M, Zhang Z B, Xu W D, et al. Improved Crawler Algorithm Based on Hierarchical Structure Preservation. Application Research of Computers, 2013, 30(8): 2381-2385 (in Chinese) (胡廉民,张泽斌,徐威迪,等.基于分层结构保留的增量网络爬虫算法.计算机应用研究, 2013, 30(8): 2381-2385)
[13]
Zhang M, Sun M. Design and Implementation of Qualified Spider Based on Heritrix. Computer Applications and Software, 2013, 30(4): 33-35 (in Chinese) (张 敏,孙 敏.基于 Heritrix 限定爬虫的设计与实现.计算机应用与软件, 2013, 30(4): 33-35)
[14]
Chang B B, Yu S W. The Technology and Application of Corpus. Foreign Languages Research, 2009, (5): 43-51 (in Chinese) (常宝宝,俞士汶.语料库技术及其应用.外语研究, 2009, (5): 43-51)