|
基于布尔匹配规则的实体解析方法
|
Abstract:
实体解析(ER)是数据集成和数据清洗的一个重要步骤。判断记录是否相似可以通过记录的属性(属性值)是否相似来判断。基于规则的实体解析方法,通过制定规则来将每个属性(属性值)的相似度都进行比较(属性匹配规则),为了减小其求解的搜索空间,属性匹配规则将每个属性都采用相同的相似度算法和阈值来进行比较,这导致实体解析的精度不高。为了提高精度,本文提出一种基于布尔匹配规则的改进的实体解析规则生成算法,与传统的基于属性匹配规则和机器学习的实体解析方法相比,改进的实体匹配规则算法精度更高。本文首先提出一种基于语法约束的布尔匹配规则。在此基础上,本文提出了一种规则合成(Rule Evolution)算法,他可以根据输入的实例验证规则,并自动合成对整个数据集有效的ER规则。在真实数据集和合成数据集上的实验结果表明,我们的方法具有很高的准确性,本文提出的规则在有效性上优于其他可解释规则(如低深度的决策树,其他基于规则的实体解析)。
Entity resolution (ER) is an important step in data integration and data cleaning. Judging whether the records are similar can be judged by whether the attributes (attribute values) of the records are similar. The rule-based entity analysis method compares the similarity of each attribute (attribute value) by formulating rules (attribute matching rules). In order to reduce the search space for its solution, the attribute matching rules adopt the same for each attribute. The similarity algorithm and threshold are compared, which leads to low accuracy of entity analysis. In order to improve the accuracy, this paper proposes an improved entity parsing rule generation algorithm based on Boolean matching rules. Compared with the traditional entity parsing methods based on attribute matching rules and machine learning, the improved entity matching rule algorithm has higher ac-curacy. This article first proposes a Boolean matching rule based on grammatical constraints. Then, based on the proposed Boolean formula rules, this paper proposes a Rule Evolution algorithm, which can verify the rules according to the input examples and automatically synthesize the effec-tive ER rules for the entire data set. Experimental results on real data sets and synthetic data sets show that our method has high accuracy. The rules proposed in this paper are better than other in-terpretable rules (such as low-depth decision trees, other rule-based Entity resolution).
[1] | Elmagarmid, A.K., Ipeirotis, P.G. and Verykios, V.S. (2007) Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19, 1-16. https://doi.org/10.1109/TKDE.2007.250581 |
[2] | Jain, A.K., Murty, M.N. and Flynn, P.J. (1999) Data Clustering: A Review. ACM Computing Surveys, 31, 264-323.
https://doi.org/10.1145/331499.331504 |
[3] | Winkler, W. (2006) Overview of Record Linkage and Current Re-search Directions. Technical Report, Statistical Research Division, U.S. Bureau of the Census, Washington DC. |
[4] | Fellegi, I. and Sunter, A. (1969) A Theory for Record Linkage. Journal of the American Statistical Association, 64, 1183-1210. https://doi.org/10.1080/01621459.1969.10501049 |
[5] | Bilenko, M. and Mooney, R.J. (2003) Adaptive Duplicate Detection Using Learnable String Similarity Measures. Proceedings of the 9th ACM SIGKDD Inter-national Conference on Knowledge Discovery and Data Mining, Washington DC, August 2003, 39-48. https://doi.org/10.1145/956750.956759 |
[6] | Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J. and Zhu, X. (2014) Corleone: Hands-off Crowdsourcing for Entity Matching. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, June 2014, 601-612. https://doi.org/10.1145/2588555.2588576 |
[7] | Cohen, W.W. (2000) Data Integration Using Similarity Joins and a Word-Based Information Representation Language. ACM Transactions on Information Systems, 18, 288-321. https://doi.org/10.1145/352595.352598 |
[8] | Singla, P. and Domingos, P. (2006) Entity Resolution with Markov Logic. 6th International Conference on Data Mining, Hong Kong, 18-22 December 2006, 572-582. https://doi.org/10.1109/ICDM.2006.65 |
[9] | Firmani, D., Saha, B. and Srivastava, D. (2016) Online Entity Resolu-tion Using an Oracle. Proceedings of the VLDB Endowment, 9, 384-395. https://doi.org/10.14778/2876473.2876474 |
[10] | Wang, J., Kraska, T., Franklin, M.J. and Feng, J. (2012) Crowder: Crowd Sourcing Entity Resolution. Proceedings of the VLDB Endowment, 5, 1483-1494. https://doi.org/10.14778/2350229.2350263 |
[11] | Elmagarmid, A.K., Ilyas, I.F., Ouzzani, M., Quiané-Ruiz, J.-A., Tang, N. and Yin, S. (2014) NADEEF/ER: Generic and Interactive Entity Resolution. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, June 2014, 1071-1074. https://doi.org/10.1145/2588555.2594511 |
[12] | Wang, J., Li, G., Yu, J.X. and Feng, J. (2011) Entity Matching: How Similar Is Similar. Proceedings of the VLDB Endowment, 4, 622-633. https://doi.org/10.14778/2021017.2021020 |
[13] | Talburt, J. (2010) Entity Resolution and Information Quality. Morgan Kaufmann, Burlington.
https://doi.org/10.1016/B978-0-12-381972-7.00003-8 |
[14] | Naumann, F. and Herschel, M. (2010) An Introduction to Duplicate Detection. Synthesis Lectures on Data Management, 2, 1-87. https://doi.org/10.2200/S00262ED1V01Y201003DTM003 |
[15] | Whang, S. and Garcia-Molina, H. (2010) Entity Resolution with Evolving Rules. Proceedings of the VLDB Endowment, 3, 1326-1337. https://doi.org/10.14778/1920841.1921004 |
[16] | Alur, R., Bodik, R. and Dallal, E. (2015) Syntax-Guided Synthe-sis. Dependable Software Systems Engineering, 40, 1-25. |
[17] | Solar-Lezama, A. (2009) The Sketching Approach to Program Synthesis. Asian Symposium on Programming Languages and Systems, Seoul, 14-16 December 2009, 4-13. https://doi.org/10.1007/978-3-642-10672-9_3 |
[18] | Singh, R., Meduri, V., Elmagarmid, A., Madden, S. and Papot-ti, P. (2017) Synthesizing Entity Matching Rules by Examples. Proceedings of the VLDB Endowment, 11, 189-202. https://doi.org/10.14778/3149193.3149199 |
[19] | Ganzinger, H., Hagen, G., Nieuwenhuis, R., Oliveras, A. and Tinelli, C. (2004) Dpll(T): Fast Decision Procedures. International Conference on Computer Aided Verification, Boston, 13-17 July 2004, 175-188.
https://doi.org/10.1007/978-3-540-27813-9_14 |