%0 Journal Article
%T Web Entity Resolution Algorithm Based on Schema-Aware Meta-Blocking Technology
%A 韦海浪
%A 李贵
%A 李征宇
%A 韩子扬
%A 曹科研
%J Hans Journal of Data Mining
%P 16-29
%@ 2163-1468
%D 2020
%I Hans Publishing
%R 10.12677/HJDM.2020.101002
%X Entity resolution (ER) identifies records that refer to the same entity within one or more data sources. Because directly comparing every pair of records across multiple data sources is computationally expensive, blocking is usually adopted. Since the schemas of most Web data sources are unknown, meta-blocking techniques are commonly used; they reduce the chance of missing possible matches but increase the chance of placing non-matching entity records in the same block. To address this, an attribute-matching induction method based on locality-sensitive hashing is proposed, which partitions matching attributes across multiple large Web data sets and removes redundant comparisons between attributes. A meta-blocking technique based on an aggregation-entropy weighted graph is then used to improve the blocking quality for Web data sources, remove redundant comparisons between entity records within blocks, and reduce the complexity of the algorithm. Finally, experiments on real data sets verify the effectiveness of the algorithm.
%K Entity Resolution
%K Aggregation Entropy
%K Meta-Blocking
%K Locally Sensitive Hash
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=33407