Hans Journal of Data Mining 2020

Web Entity Resolution Algorithm Based on Schema-Aware Meta-Blocking Technology

DOI: 10.12677/HJDM.2020.101002, PP. 16-29

韦海浪, 李贵, 李征宇, 韩子扬, 曹科研

Keywords: Entity Resolution, Aggregation Entropy, Meta-Blocking, Locality-Sensitive Hashing


Abstract:

Entity resolution (ER) identifies records that refer to the same entity within one or more data sources. Because directly comparing every pair of records across multiple data sources is computationally expensive, blocking is usually adopted. Since the schemas of most Web data sources are unknown, meta-blocking techniques are commonly used; they reduce the chance of missing true matches but increase the chance of placing non-matching entity records in the same block. To address this, an attribute-match induction method based on locality-sensitive hashing is proposed, which partitions the attributes of multiple large Web data sets into matching groups and removes redundant comparisons among attributes. A meta-blocking technique based on an aggregation-entropy weighted graph is then used to improve the blocking quality for Web data sources, remove redundant comparisons between entity records within blocks, and reduce the complexity of the algorithm. Finally, experiments on real data sets verify the effectiveness of the algorithm.
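
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the two generic building blocks it names, not the authors' implementation. It assumes each attribute is represented by the tokens of its values, uses MinHash signatures with banding as the locality-sensitive hashing step to propose candidate attribute matches, and computes the Shannon entropy of each attribute as one plausible ingredient of an entropy-based weight. All function names, the attribute names, and the parameters `num_hashes` and `bands` are hypothetical.

```python
import hashlib
import math
from collections import Counter, defaultdict
from itertools import combinations


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution of one attribute."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def _hash(token, seed):
    # Deterministic 64-bit hash; the seed selects one of the hash functions.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature of a non-empty collection of tokens."""
    token_set = set(tokens)
    return tuple(min(_hash(t, seed) for t in token_set)
                 for seed in range(num_hashes))


def lsh_candidate_pairs(attr_tokens, num_hashes=32, bands=8):
    """Return attribute pairs whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, tokens in attr_tokens.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            # Attributes with the same slice of the signature share a bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Hypothetical attributes from two Web sources (illustrative data only).
attrs = {
    "src1.title": ["entity", "resolution", "web", "data", "blocking"],
    "src2.name": ["entity", "resolution", "web", "data", "blocking"],
    "src1.price": ["99", "120", "45", "99"],
}
pairs = lsh_candidate_pairs(attrs)
weights = {name: shannon_entropy(toks) for name, toks in attrs.items()}
print(pairs)
print(weights)
```

Only attributes that land in the same bucket are compared, which is how the hashing step avoids comparing every attribute pair. How the per-attribute entropies are aggregated into edge weights of the blocking graph is specific to the paper and is not reproduced here.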
