%0 Journal Article
%T Deduplicating search results of cloud disk resources using meta-information
%A LIU Chi (刘驰)
%A YAN Hong-fei (闫宏飞)
%J Journal of Shandong University (Natural Science) (山东大学学报(理学版))
%D 2016
%R 10.6040/j.issn.1671-9352.1.2015.060
%X Unlike classical duplicate-detection methods, which compute the text similarity of web pages, cloud disk resources consist mainly of multimedia data files and offer only very limited meta-information for deduplicating search results. To address this problem, this work builds on a search engine system constructed for cloud disk resource data. An analysis of the characteristics of cloud disk resource meta-information shows that, besides the resource name, the file extension, the file size, and the user ownership of a resource are effective features for identifying duplicate records. On this basis, the paper gives normalization methods for these features and then applies an unsupervised method to perform deduplication. Experimental results show that the method can effectively deduplicate cloud disk resource search results.
%K search engine
%K cloud disk resources
%K meta-information
%K deduplication
%U http://lxbwk.njournal.sdu.edu.cn/CN/10.6040/j.issn.1671-9352.1.2015.060