
Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2015

Multi-Record Web Page Extraction Based on Hierarchical Features of the DOM Tree

DOI: 10.16451/j.cnki.issn1003-6059.201502004, PP. 125-131

Chen Qiaoling (陈巧灵), Liao Xiangwen (廖祥文), Wei Jingjing (魏晶晶), Chen Guolong (陈国龙)

Keywords: information extraction, multi-record web page, extraction algorithm


Abstract:

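Existing methods for extracting records from multi-record web pages typically analyze the Document Object Model (DOM) tree as a whole along its vertical structure. The structural similarity they compute is generally low, so they fail to identify record regions correctly. This paper proposes a record extraction method based on the hierarchical features of the DOM tree. The method exploits the different roles played by nodes at different levels of the DOM tree to analyze the tree horizontally, transforming the problem of finding similar subtrees into that of finding similar sub-blocks within a node block. Finally, it applies a bidirectional expansion search for non-overlapping repeated sub-blocks to separate the records. Experiments show that the method can extract pages that existing extractors cannot handle, and extraction results on multiple data sources confirm its effectiveness.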
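
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of searching one DOM level for non-overlapping repeated sub-blocks by expanding in both directions. Everything in it is an assumption made for illustration: the function name `find_repeated_blocks`, the representation of sibling nodes as a flat list of tag names, and the use of exact equality in place of whatever similarity measure the paper defines.

```python
def find_repeated_blocks(tags):
    """Return (start, end) index pairs of consecutive, non-overlapping
    repeats of one sub-block in a list of sibling tag names.

    Hypothetical simplification: blocks match only if their tag
    sequences are exactly equal; the paper's similarity measure is
    not specified in the abstract.
    """
    n = len(tags)
    best_key = None
    best = None  # (start, period, count)
    for period in range(1, n // 2 + 1):
        for seed in range(0, n - 2 * period + 1):
            block = tags[seed:seed + period]
            # A seed needs at least one immediate repeat to its right.
            if tags[seed + period:seed + 2 * period] != block:
                continue
            # Expand to the left one block at a time.
            start = seed
            while start - period >= 0 and tags[start - period:start] == block:
                start -= period
            # Expand to the right one block at a time.
            end = seed + 2 * period
            while end + period <= n and tags[end:end + period] == block:
                end += period
            # Prefer the widest coverage, then the shortest block.
            key = (end - start, -period)
            if best_key is None or key > best_key:
                best_key = key
                best = (start, period, (end - start) // period)
    if best is None:
        return []
    start, period, count = best
    return [(start + i * period, start + (i + 1) * period) for i in range(count)]


# Children of a hypothetical list container: a heading, three records
# that each consist of a <div> followed by a <p>, and a trailing rule.
siblings = ["h1", "div", "p", "div", "p", "div", "p", "hr"]
records = find_repeated_blocks(siblings)
```

Each returned pair delimits one candidate record among the siblings. Because the search works on a single level of children, it never compares whole subtrees, which is the shift from vertical to horizontal analysis that the abstract describes.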

