In an era of rapid informatization, data are generated in enormous volumes. Unless they are properly organized and categorized, they cannot be searched and used quickly, conveniently, and accurately. As information security science and technology develops, the demand for organizing and classifying such data keeps growing, yet traditional clustering algorithms can no longer meet current data processing needs. Optimizing and improving existing algorithms, or constructing new ones, has therefore become an urgent task. At the same time, the hardware of a single computer cannot meet the demands of processing and classifying massive data. To address these problems, this paper optimizes and improves a clustering algorithm on top of Spark's distributed computing framework. Apache Spark, a big data processing framework, extends the range of usable computing models and provides a parallel, in-memory computing framework. By caching intermediate results in memory, it reduces the number of repeated disk I/O operations and thereby better serves iterative computation, interactive queries, and other workloads. The optimized clustering algorithm improves the computational efficiency of analyzing, processing, and classifying data, which is the purpose of this study.
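The benefit of keeping data in memory across iterations can be illustrated with a minimal sketch. It is not the algorithm of this paper: it assumes plain Lloyd's k-means as the iterative clustering method, uses pure Python rather than Spark, and the names `CountingSource`, `RAW`, and `kmeans` are hypothetical. In Spark, the analogous step would be calling `cache()` or `persist()` on the dataset before iterating.

```python
import io

# Hypothetical stand-in for a file on disk: one 2-D point per line.
RAW = "1.0,1.0\n1.2,0.8\n0.9,1.1\n8.0,8.0\n8.2,7.9\n7.9,8.1\n"


class CountingSource:
    """Simulated storage that counts how many full scans are performed."""

    def __init__(self, text):
        self.text = text
        self.scans = 0

    def read_points(self):
        self.scans += 1
        return [tuple(float(x) for x in line.split(","))
                for line in io.StringIO(self.text) if line.strip()]


def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centers[i])))


def kmeans(get_points, centers, iterations):
    """Plain Lloyd's k-means (an assumption); get_points() runs once per iteration."""
    for _ in range(iterations):
        points = get_points()
        groups = [[] for _ in centers]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Keep the old center if a cluster received no points.
        centers = [tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers


init = [(0.0, 0.0), (10.0, 10.0)]

# Without caching: every iteration rescans and reparses the source.
uncached = CountingSource(RAW)
centers_a = kmeans(uncached.read_points, init, iterations=5)

# With caching: scan once, keep the parsed points in memory, reuse them.
cached = CountingSource(RAW)
in_memory = cached.read_points()
centers_b = kmeans(lambda: in_memory, init, iterations=5)

print(uncached.scans, cached.scans)
```

Both runs produce the same cluster centers, but the uncached run scans the source five times while the cached run scans it once. The saving grows with the number of iterations, which is why iterative algorithms gain the most from in-memory caching.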