Research on Data Skew Optimization Methods Based on MapReduce

DOI: 10.12677/sea.2025.142020, pp. 217-227

Keywords: Big Data Processing, Distributed Computing, MapReduce Framework, Data Skew Optimization

Abstract:

This paper addresses data skew, a common problem when the MapReduce framework processes large-scale data, and proposes a sampling-based mapping partitioning optimization method. The method samples the input with the reservoir sampling algorithm to obtain data distribution information, then combines an overall data distribution estimation algorithm with a mapping partitioning algorithm to partition the data evenly. Experimental results show that the method performs well across different degrees of skew: it significantly reduces job execution time, improves partition balance, and raises cluster resource utilization.
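The abstract names reservoir sampling but does not spell out the distribution estimation or mapping partitioning algorithms, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes the simplest choices: Algorithm R for sampling, linear scaling of sampled key counts to estimate the full distribution, and a greedy "heaviest key to least-loaded reducer" assignment. All function names (`reservoir_sample`, `estimate_key_counts`, `assign_keys`, `partition`) are hypothetical.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Algorithm R: keep a uniform random sample of up to k items from a stream."""
    reservoir = []
    seen = 0
    for item in stream:
        seen += 1
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            # The new item replaces a random slot with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir, seen


def estimate_key_counts(sample, total):
    """Assumed estimator: scale sampled key frequencies up to the full data size."""
    if not sample:
        return {}
    scale = total / len(sample)
    return {key: count * scale for key, count in Counter(sample).items()}


def assign_keys(estimated, num_reducers):
    """Assumed greedy plan: heaviest key first, onto the least-loaded reducer."""
    loads = [0.0] * num_reducers
    mapping = {}
    ordered = sorted(estimated.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, weight in ordered:
        target = min(range(num_reducers), key=loads.__getitem__)
        mapping[key] = target
        loads[target] += weight
    return mapping, loads


def partition(key, mapping, num_reducers):
    """Use the planned reducer; fall back to hashing for keys the sample missed."""
    return mapping.get(key, hash(key) % num_reducers)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic skewed input: one "hot" key carries 70% of the records.
    keys = ["hot"] * 7000 + [f"k{i % 50}" for i in range(3000)]
    rng.shuffle(keys)
    sample, total = reservoir_sample(keys, 500, rng)
    mapping, loads = assign_keys(estimate_key_counts(sample, total), 4)
    print("reducer for 'hot':", mapping["hot"])
    print("estimated reducer loads:", [round(x) for x in loads])
```

In this sketch the hot key ends up alone on one reducer while the remaining keys are spread over the others. A single key heavier than the average reducer load still cannot be balanced without splitting it, which this simplified assignment does not attempt.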
