%0 Journal Article
%T 基于布尔匹配规则的实体解析方法
Entity Resolution Based on Boolean Matching Rules
%A 褚良旭
%A 李贵
%A 李征宇
%A 韩子扬
%A 曹科研
%J Hans Journal of Data Mining
%P 121-133
%@ 2163-1468
%D 2021
%I Hans Publishing
%R 10.12677/HJDM.2021.112012
%X 实体解析(ER)是数据集成和数据清洗的一个重要步骤。判断记录是否相似可以通过记录的属性(属性值)是否相似来判断。基于规则的实体解析方法,通过制定规则来将每个属性(属性值)的相似度都进行比较(属性匹配规则),为了减小其求解的搜索空间,属性匹配规则将每个属性都采用相同的相似度算法和阈值来进行比较,这导致实体解析的精度不高。为了提高精度,本文提出一种基于布尔匹配规则的改进的实体解析规则生成算法,与传统的基于属性匹配规则和机器学习的实体解析方法相比,改进的实体匹配规则算法精度更高。本文首先提出一种基于语法约束的布尔匹配规则。在此基础上,本文提出了一种规则合成(Rule Evolution)算法,他可以根据输入的实例验证规则,并自动合成对整个数据集有效的ER规则。在真实数据集和合成数据集上的实验结果表明,我们的方法具有很高的准确性,本文提出的规则在有效性上优于其他可解释规则(如低深度的决策树,其他基于规则的实体解析)。
Entity resolution (ER) is an important step in data integration and data cleaning. Whether two records are similar can be determined by whether their attributes (attribute values) are similar. Rule-based ER methods formulate rules that compare the similarity of each attribute (attribute matching rules). To reduce the search space, attribute matching rules compare every attribute with the same similarity algorithm and threshold, which leads to low ER accuracy. To improve accuracy, this paper proposes an improved ER rule generation algorithm based on Boolean matching rules; compared with traditional ER methods based on attribute matching rules and machine learning, the improved algorithm achieves higher accuracy. The paper first proposes a Boolean matching rule based on grammatical constraints. Building on this rule, it then proposes a Rule Evolution algorithm, which validates rules against the input examples and automatically synthesizes ER rules that are effective for the entire data set. Experimental results on real and synthetic data sets show that the method is highly accurate, and that the proposed rules are more effective than other interpretable rules (such as low-depth decision trees and other rule-based entity resolution methods).
%K 实体解析,布尔匹配规则,属性匹配规则,数据集成
%K Entity Resolution
%K Boolean Matching Rules
%K Attribute Matching Rules
%K Data Integration
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=41977