%0 Journal Article
%T Research on Data Skew Optimization Methods Based on MapReduce
%A 王涛
%A 王晓玲
%A 王希胤
%J Software Engineering and Applications
%P 217-227
%@ 2325-2278
%D 2025
%I Hans Publishing
%R 10.12677/sea.2025.142020
%X This paper proposes a sampling-based mapping partitioning optimization method to address data skew, a common problem when the MapReduce framework processes large-scale data. The method applies reservoir sampling to obtain information on the data distribution, then combines an overall data distribution estimation algorithm with a mapping partitioning algorithm to achieve balanced data partitioning. Experimental results show that the proposed method performs well under different degrees of skew, significantly reducing job execution time, improving partition balance, and enhancing cluster resource utilization.
%K Big Data Processing
%K Distributed Computing
%K MapReduce Framework
%K Data Skew Optimization
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=111078