A Multi-Level Web Page Clustering Method Based on Web Page Block Features
2015
Abstract: Exploiting the structural features of web pages, this paper proposes a multi-level web page clustering method. The method first segments each page into blocks and then clusters the pages by their block features. Adjusting the page-similarity threshold during clustering yields three levels of clusters: pages from the same website; pages from the same website with the same structure; and pages from the same website with the same structure that were generated from the same template. Compared with existing web page clustering methods, the method provides multi-level results that meet different clustering needs, and the authors report a substantial improvement in both clustering accuracy and efficiency.
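The threshold idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the block features, the similarity measure, the clustering procedure, or the threshold values, so everything below is an assumption made for illustration: pages are reduced to hypothetical sets of block-feature strings, similarity is Jaccard overlap, clustering is a greedy single pass against each cluster's first member, and the thresholds 0.3, 0.6, and 0.9 are arbitrary.

```python
from typing import Dict, FrozenSet, List

# Hypothetical representation: a page is the set of features of its blocks.
Page = FrozenSet[str]


def jaccard(a: Page, b: Page) -> float:
    """Jaccard similarity of two feature sets (an assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(pages: Dict[str, Page], threshold: float) -> List[List[str]]:
    """Greedy single-pass clustering (assumed procedure): a page joins the
    first cluster whose first member is at least `threshold` similar to it,
    otherwise it starts a new cluster."""
    clusters: List[List[str]] = []
    for name, feats in pages.items():
        for members in clusters:
            if jaccard(feats, pages[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Toy data: three site-level, two structure-level, one template-level feature.
SITE_A = {"A:logo", "A:nav", "A:footer"}
SITE_B = {"B:logo", "B:nav", "B:footer"}
PAGES: Dict[str, Page] = {
    "a_list_1": frozenset(SITE_A | {"list", "list-item", "tpl:1"}),
    "a_list_2": frozenset(SITE_A | {"list", "list-item", "tpl:1"}),
    "a_list_3": frozenset(SITE_A | {"list", "list-item", "tpl:2"}),
    "a_detail": frozenset(SITE_A | {"article", "comments", "tpl:3"}),
    "b_list": frozenset(SITE_B | {"list", "list-item", "tpl:9"}),
}

# Illustrative thresholds only; a higher threshold gives a finer level.
LEVELS = {"site": 0.3, "structure": 0.6, "template": 0.9}
RESULT = {level: cluster(PAGES, t) for level, t in LEVELS.items()}

if __name__ == "__main__":
    for level, groups in RESULT.items():
        print(level, groups)
```

On this toy data the lowest threshold separates site A from site B, the middle one additionally splits the list pages from the detail page, and the highest one also separates `a_list_3`, which uses a different template from the other two list pages.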