全部 标题 作者
关键词 摘要

OALib Journal期刊
ISSN: 2333-9721
费用:99美元

查看量下载量

相关文章

更多...

BMGSJoin:一种基于MapReduce的图相似度连接算法*

DOI: 10.16451/j.cnki.issn1003-6059.201505011, PP. 472-480

Keywords: 图相似度连接,MapReduce,布隆过滤器

Full-Text   Cite this paper   Add to My Lib

Abstract:

图相似度连接在数据挖掘领域应用广泛,尤其是在数据预处理阶段,可用于数据清理、近复本检测等,其研究具有十分重要的意义.针对基于编辑距离约束的图相似度连接问题进行研究,返回两个图集合中所有编辑距离不超过给定阈值的图对.基于分布式编程框架MapReduce,设计采用“过滤-验证”框架的MGSJoin算法,利用基于路径的q-gram签名实现非解候选对的过滤,计数过滤.鉴于该算法键值对数量庞大的潜在问题,引入BloomFilter技术对算法进行改进并设计BMGSJoin算法.实验结果表明,提出的两种图相似度连接算法能较大地改善现有算法的效率和可扩展性,并能较好地应对当前大数据挖掘分析的需求.

References

[1]  Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 2008, 51(1): 107-113
[2]  Deng D, Li G L, Hao S, et al. MassJoin: A MapReduce-Based Method for Scalable String Similarity Joins // Proc of the 30th IEEE International Conference on Data Engineering. Chicago, USA, 2014: 340-351
[3]  Bloom B H. Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 1970, 13(7): 422-426
[4]  Broder A, Mitzenmacher M. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 2004, 1(4): 485-509
[5]  Koutris P. Bloom Filters in Distributed Query Execution [EB/OL]. [2014-01-23]. http://courses.cs.washington.edu/courses/cse544/11wi/projects/koutris.pdf
[6]  Cohen S, Matias Y. Spectral Bloom Filters // Proc of the ACM SIGMOD International Conference on Management of Data. San Diego, USA, 2003: 241-252
[7]  Lin X M, Wang W. Set and String Similarity Queries: A Survey. Chinese Journal of Computers, 2011, 34(10): 1853-1862 (in Chinese)(林学民, 王 炜.集合和字符串的相似度查询.计算机学报, 2011, 34(10): 1853-1862
[8]  Pang J, Gu Y, Xu J, et al. Research Advance on Similarity Join Queries. Journal of Frontiers of Computer Science and Technology, 2013, 7(1): 1-13 (in Chinese)(庞 俊,谷 峪,许 嘉,等.相似性连接查询技术研究进展.计算机科学与探索, 2013, 7(1): 1-13)
[9]  Wang G, Wang B, Yang X C, et al. Efficiently Indexing Large Sparse Graphs for Similarity Search. IEEE Trans on Knowledge and Data Engineering, 2012, 24(3): 440-451
[10]  Cohen J. Graph Twiddling in a MapReduce World. Computing in Science & Engineering, 2009, 11(4): 29-41
[11]  Lattanzi S, Moseley B, Suri S, et al. Filtering: A Method for Solving Graph Problems in MapReduce // Proc of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures. San Jose, USA, 2011: 85-94
[12]  Suri S, Vassilvitskii S. Counting Triangles and the Curse of the Last Reducer // Proc of the 20th International Conference on World Wide Web. Hyderabad, India, 2011: 607-614
[13]  Kang U, Tsourakakis C E, Appel A P, et al. HADI: Mining Radii of Large Graphs. ACM Transactions on Knowledge Discovery from Data, 2011, 5(2): 8.1-8.24
[14]  Bahmani B, Chakrabarti K, Xin D. Fast Personalized Pagerank on MapRreduce // Proc of the ACM SIGMOD International Conference on Management of data. Athens, Greece, 2011: 973-984
[15]  Kang U, Tong H H, Sun J M, et al. Gbase: A Scalable and General Graph Management System // Proc of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, USA, 2011: 1091-1099
[16]  Kang U, Tsourakakis C E, Faloutsos C. Pegasus: A Peta-Scale Graph Mining System Implementation and Observations // Proc of the 9th IEEE International Conference on Data Mining. Miami, USA, 2009: 229-238
[17]  Bahmani B, Kumar R, Vassilvitskii S. Densest Subgraph in Streaming and MapReduce. Proceedings of the VLDB Endowment, 2012, 5(5): 454-465
[18]  Afrati F N, Fotakis D, Ullman J D. Enumerating Subgraph Instances Using Map-Reduce // Proc of the 29th IEEE International Conference on Data Engineering. Brisbane, Australia, 2013: 62-73
[19]  Yan X F, Han J W. gSpan: Graph-Based Substructure Pattern Mining // Proc of the IEEE International Conference on Data Mining. Maebashi, Japan, 2002: 721-724
[20]  Yan X F, Yu P S, Han J W. Graph Indexing: A Frequent Structure-Based Approach // Proc of the ACM SIGMOD International Conference on Management of Data. Paris, France, 2004: 335-346
[21]  He H H, Singh A K. Closure-Tree: An Index Structure for Graph Queries // Proc of the 22nd International Conference on Data Engineering. Atlanta, USA, 2006: 38.1-38.12
[22]  Williams D W, Huan J, Wang W. Graph Database Indexing Using Structured Graph Decomposition // Proc of the 23rd International Conference on Data Engineering. Istanbul, Turkey, 2007: 976-985
[23]  Tian Y Y, Patel J M. Tale: A Tool for Approximate Large Graph Matching // Proc of the 24th IEEE International Conference on Data Engineering. Cancun, Mexico, 2008: 963-972
[24]  Yan X F, Yu P S, Han J W. Substructure Similarity Search in Graph Databases // Proc of the ACM SIGMOD International Conference on Management of Data. Baltimore, USA, 2005: 766-777
[25]  Zhao X, Xiao C, Lin X M, et al. Efficient Graph Similarity Joins with Edit Distance Constraints // Proc of the 28th IEEE International Conference on Data Engineering. Washington, USA, 2012: 834-845
[26]  Sanfeliu A, Fu K S. A Distance Measure between Attributed Relational Graphs for Pattern Recognition. IEEE Trans on Systems, Man and Cybernetics, 1983, SMC-13(3): 353-362
[27]  Bunke H, Allermann G. Inexact Graph Matching for Structural Pattern Recognition. Pattern Recognition Letters, 1983, 1(4): 245-253
[28]  Zeng Z P, Tung A K H, Wang J Y, et al. Comparing Stars: On Approximating Graph Edit Distance // Proc the 35th International Conference on Very Large Data Bases. Lyon, France, 2009: 25-36

Full-Text

Contact Us

service@oalib.com

QQ:3279437679

WhatsApp +8615387084133