%0 Journal Article
%T Semi-Structured Entity Resolution Algorithm
%A 韦海浪
%A 李贵
%A 李征宇
%A 韩子扬
%A 曹科研
%J Hans Journal of Data Mining
%P 1-15
%@ 2163-1468
%D 2020
%I Hans Publishing
%R 10.12677/HJDM.2020.101001
%X Entity resolution identifies similar or identical records within one or more datasets. For semi-structured data whose schema is unknown, this paper proposes an entity resolution algorithm based on string similarity: each record is split into substrings, edit similarity measures the association between substrings, and maximum weighted bipartite matching then measures the association between records. Because this method has high time complexity, it is costly for entity resolution on large Web datasets. The paper therefore also proposes an algorithm based on set similarity: each record is treated as the set of its attribute values, each attribute value is an element of that set and is represented by an array of tokens, a signature is built for each record from these token arrays, and other records that match the signature are retrieved as candidates. An optimized maximum matching algorithm then selects the truly similar records. Experiments on real datasets show that the proposed methods are more effective than traditional methods.
%K Entity Resolution
%K Edit Similarity
%K Set Similarity
%K Maximum Weighted Bipartite Matching
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=33406