Computer Science and Application 2020

Research and Application of a Spark-Based Hierarchical Clustering Algorithm
Based on the Hierarchical Clustering Algorithm Research and Application of Spark

DOI: 10.12677/CSA.2020.105085, PP. 824-831

刘卫华, 史婷婷, 许学添, 邹同浩

Keywords: Spark, Clustering Algorithm, Optimization and Improvement, Big Data Processing


Abstract:

In an era of rapid informatization, data are generated in enormous volumes. Unless they are properly organized and classified, they cannot be searched or used quickly, conveniently, and accurately. As information security science and technology develop, the demand for organizing and classifying these data keeps growing, yet traditional clustering algorithms can no longer meet current data processing needs. Optimizing the existing algorithms or designing new ones has therefore become urgent. At the same time, the hardware of a single computer cannot meet the demands of processing and classifying massive data sets. To address these problems, this paper optimizes a clustering algorithm on top of the Spark distributed computing framework. Apache Spark, a big data processing framework, extends the range of computing models that can be used and provides a parallel, in-memory computing framework. By caching intermediate results in memory, it reduces the number of repeated disk I/O operations and so better serves iterative computation, interactive queries, and other workloads. The optimized clustering algorithm improves the computational efficiency of data analysis, processing, and classification, which is the aim of this study.
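
The caching point in the abstract can be illustrated with a minimal sketch in plain Python. This is neither Spark nor the authors' implementation: `LazyDataset`, `make_loader`, `kmeans_1d`, and `run_demo` are hypothetical names, the loader's read counter stands in for disk I/O, and the toy `cache()` method only mimics the role that `RDD.cache()` plays in Spark. A one-dimensional k-means loop is used as the iterative workload purely for illustration.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, loader):
        self._loader = loader    # callable that simulates one read from disk
        self._use_cache = False
        self._cached = None

    def cache(self):
        """Mark the dataset to be kept in memory after its first read."""
        self._use_cache = True
        return self

    def collect(self):
        """Return the data, re-reading from the source unless it is cached."""
        if not self._use_cache:
            return self._loader()
        if self._cached is None:
            self._cached = self._loader()
        return self._cached


def make_loader(points, counter):
    """Build a loader that counts how often the simulated disk is read."""
    def load():
        counter["reads"] += 1
        return list(points)
    return load


def kmeans_1d(dataset, centroids, iterations):
    """Iterative 1-D k-means: every iteration needs one full pass over the data."""
    for _ in range(iterations):
        points = dataset.collect()
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def run_demo():
    """Run the same 5-iteration job with and without caching and count reads."""
    points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    plain, cached = {"reads": 0}, {"reads": 0}
    result_plain = kmeans_1d(LazyDataset(make_loader(points, plain)), [0.0, 5.0], 5)
    result_cached = kmeans_1d(
        LazyDataset(make_loader(points, cached)).cache(), [0.0, 5.0], 5
    )
    return result_plain, result_cached, plain["reads"], cached["reads"]
```

Both runs in `run_demo()` return the same centroids. The uncached run reads the source once per iteration, while the cached run reads it once in total, which is the saving the abstract attributes to keeping intermediate results in memory.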

