|
一种基于树模型的关联实体解析方法
|
Abstract:
在大数据时代,Web数据呈现多样性和关联性,在实体解析(Entity Resolution)中体现为解析的数据集往往包含多个实体集,实体集之间具有关联关系。这种关系导致解析一个实体集的结果可以使另一个实体集的解析受益,这种具有关联关系的实体解析称为关联实体解析(Related Entity Resolution)。本文针对一对多类型关联实体的实体解析问题提出了关联树模型,并引申出相似节点、相似树、相似性传递等概念。我们提出了一种基于树的一对多关联实体解析方法。初始时依据关联实体的关联关系构建关联树;将本节点的属性相似度和关联子节点的部分属性相似度结合起来判断节点是否匹配;基于深度优先原则遍历关联树的每一个节点,依据节点的实体解析结果筛选出满足相似传递性的部分子节点,在遍历完叶子节点的过程中,生成部分相似子树,再对根节点的子节点集中节点进行相似匹配,寻找其他相似子树。本文提出一种相似树索引来表示关联树的匹配结果。用房地产大数据通过实验验证文中提出的关联树搜索算法比已有的关联实体识别算法在一对多关联实体上效率更高。
In the era of big data, Web data is featured with obvious diversity and relevance. In Entity Resolution, it is reflected in the parsed data set that often contains multiple entity sets, and there is an association relationship between entities. Based on that relationship, the result of parsing one entity set can benefit the parsing of another entity set, and that kind of entity resolution with an associated relationship is called Related Entity Resolution. In this paper, focusing on the one-to-many types of related entities, the concept of relevance tree was proposed, and concepts such as similar nodes, similar trees, and similarity transfer were further derived. A relevance tree search algorithm was proposed in the study. Initially, a relevance tree was constructed according to the association relationship of the associated entities. Then, based on the depth-first principle, it traversed each node of the relevance tree, screened out some sub-nodes that meet the similarity transitivity based on the entity analysis results of the node, and continued to deepen the relevance tree until the leaf nodes were all traversed and subtrees of partial similarity were generated. After that, it matched the nodes in the sub-node set of the root node to find other similar subtrees. Based on the above, the similar tree index was proposed to represent the matching result of the relevance tree. It proposed a relevance tree merging algorithm, which merged nodes of the three types of relevance subtrees that may appear in the relevance tree based on the similarity tree index on the basis of maintaining the relationship between entities, and generated the “neat” relevance tree. Through experiments, it verified that compared with the existing related entity recognition algorithm, the relevance tree search algorithm proposed in the paper achieved higher efficiency on the one-to-many real estate related data set.
[1] | Elmagarmid, A.K., Ipeirotis, P.G. and Verykios, V.S. (2007) Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19, 1-16. https://doi.org/10.1109/TKDE.2007.250581 |
[2] | Sun, Y. and Han, J. (2012) 2013 Mining Heterogeneous Information Networks: A Structural Analysis Approach. ACM SIGKDD Explora-tions Newsletter, 14, 20-28. https://doi.org/10.1145/2481244.2481248 |
[3] | Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E. and Widom, J. (2009) Swoosh: A Generic Approach to Entity Resolution. The VLDB Journal, 18, 255-276. https://doi.org/10.1007/s00778-008-0098-x |
[4] | Efthymiou, V., Stefanidis, K. and Christophides, V. (2015) Big Data entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web. 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, 29 October-1 November2015, 401-410. https://doi.org/10.1109/BigData.2015.7363781 |
[5] | Papadakis, G., Ioannou, E., Niederée, C. and Fankhauser, P. (2011) Efficient Entity Resolution for Large Heterogeneous InformationSpaces. Proceedings of the 4th ACM Internation-al Conference on Web Search and Data Mining (WSDM), Hong Kong (China), 9-12 February 2011, 535-544. https://doi.org/10.1145/1935826.1935903 |
[6] | Christen, P. (2012) A Survey of Indexing Techniques for Scala-ble Record Linkage and Deduplication. IEEE Transactions on Knowledge and Data Engineering, 24, 1537-1555. https://doi.org/10.1109/TKDE.2011.127 |
[7] | Fellegi, I.P. and Sunter, A.B. (1969) A Theory for Record Link-age. Journal of the American Statistical Association, 64, 1183-1210. https://doi.org/10.1080/01621459.1969.10501049 |
[8] | Galhardas, H., Florescu, D., Shasha, D., Simon, E. and Saita, C. (2001) Declarative Data Cleaning: Language, Model and Algorithms. Proceedings of 27th International Con-ference on Very Large Data Bases, Rome, 11-14 September 2001, 371-380. |
[9] | Kenig, B. and Gal, A. (2013) Mfib-locks: An Effective Blocking Algorithm for Entityresolution. Information Systems, 38, 908-926. https://doi.org/10.1016/j.is.2012.11.008 |
[10] | Papadakis, G., Ioannou, E., Palpanas, T., Niederée, C. and Nejdl, W. (2013) A Blocking Framework for Entity Resolution in Highly heterogeneous Information Spaces. IEEE Transactions on Knowledge and Data Engineering, 25, 2665-2682. https://doi.org/10.1109/TKDE.2012.150 |
[11] | Bhattacharya, I. and Getoor, L. (2007) Collective Entity Resolution in Relational Data. ACM Transactions on Knowledge Discovery from Data, 1, 5-es. https://doi.org/10.1145/1217299.1217304 |
[12] | Arasu, A., Ré, C. and Suciu, D. (2009) Large-Scale Deduplication with Constraints Using Dedupalog. 2009 IEEE 25th International Conference on Data Engineering, Shanghai, 29 March-2 April 2009, 952-963.
https://doi.org/10.1109/ICDE.2009.43 |
[13] | Culotta, A. and Mccallum, A. (2005) A Conditional Model of Dedu-plication for Multi-Type Relational Data. Technical Report, University of Massachusetts, Amherst. |
[14] | Dong, X., Halevy, A.Y. and Madhavan, J. (2005) Reference Reconciliation in Complex Information Spaces. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, 14-16 June 2005, 85-96.
https://doi.org/10.1145/1066157.1066168 |
[15] | Singla, P. and Domingos, P. (2004) Multi-Relational Record Linkage. The KDD-2004 Workshop on Multi-Relational Data Mining, Seattle, WA, August 2004, 31-48. |
[16] | Singla, P. and Domingos, P. (2006) Entity Resolution with Markov Logic. 6th International Conference on Data Mining, Hong Kong (China), 18-22 December 2006, 572-582. https://doi.org/10.1109/ICDM.2006.65 |
[17] | Whang, S.E. and Garcia-Molina, H. (2012) Joint Entity Resolution. 2012 IEEE 28th International Conference on Data Engineering, Ar-lington, 1-5 April 2012, 294-305. https://doi.org/10.1109/ICDE.2012.119 |
[18] | 褚良旭, 李贵, 李征宇, 韩子扬, 曹科研. 一种基于属性显著度的实体解析算法[J]. 数据挖掘, 2021, 11(2): 27-37.
https://doi.org/10.12677/HJDM.2021.112004 |