- 2016
Deduplication of cloud disk resource search results based on meta-information
Abstract: Unlike traditional deduplication methods, which compute the text similarity of web pages, cloud disk resources consist mainly of multimedia files and offer only very limited meta-information for deduplicating search results. To address this problem, this work builds on a search engine system constructed for cloud disk resource data. An analysis of the characteristics of cloud disk resource meta-information shows that, besides the resource name, the file extension, the occupied storage size, and the user ownership of a resource are effective features for identifying duplicate records. On this basis, a normalization method for these features is given, and an unsupervised method is then used for deduplication. Experimental results show that the method can effectively deduplicate cloud disk resource search results.
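To make the idea concrete, the following is a minimal sketch of meta-information-based deduplication using the four features named in the abstract. It is not the paper's method: the `Record` type, the per-feature similarity functions, the weights, the threshold, and the greedy grouping step are all assumptions chosen for illustration, and treating a shared owner as evidence of duplication is likewise an assumption. No labeled data is used, so the procedure is unsupervised in that limited sense.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    """Hypothetical search-result record holding only meta-information."""
    name: str        # resource name
    extension: str   # file extension, e.g. "mp4"
    size_bytes: int  # occupied storage size
    owner: str       # user account the resource belongs to


def similarity(a: Record, b: Record) -> float:
    """Weighted sum of four normalized feature similarities (each in [0, 1])."""
    # Name: case-insensitive string similarity from the standard library.
    name_sim = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    # Extension: exact match after dropping case and a leading dot.
    ext_a = a.extension.lower().lstrip(".")
    ext_b = b.extension.lower().lstrip(".")
    ext_sim = 1.0 if ext_a == ext_b else 0.0
    # Size: ratio of the smaller size to the larger one.
    larger = max(a.size_bytes, b.size_bytes)
    size_sim = 1.0 if larger == 0 else min(a.size_bytes, b.size_bytes) / larger
    # Owner: exact match (direction of this signal is an assumption).
    owner_sim = 1.0 if a.owner == b.owner else 0.0
    # Illustrative weights only; they are not taken from the paper.
    return 0.4 * name_sim + 0.2 * ext_sim + 0.3 * size_sim + 0.1 * owner_sim


def deduplicate(records: list[Record], threshold: float = 0.85) -> list[Record]:
    """Greedy grouping: keep a record only if it is not similar to any kept one."""
    kept: list[Record] = []
    for rec in records:
        if all(similarity(rec, k) < threshold for k in kept):
            kept.append(rec)
    return kept


if __name__ == "__main__":
    results = [
        Record("Movie.2015.HD", "mp4", 1_000_000, "alice"),
        Record("movie.2015.hd", ".MP4", 1_000_000, "bob"),
        Record("Lecture notes", "pdf", 20_000, "carol"),
    ]
    for rec in deduplicate(results):
        print(rec)
```

In this example the second record matches the first on name, extension, and size, so it is treated as a duplicate and dropped, while the third record is kept. The greedy pass depends on input order and compares each record against every kept record, so it only suits the short result lists a single query returns.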