Journal of Jilin University (Science Edition)

• Computer Science •

Optimization Algorithm of Two-Table Data Skew Join Based on MapReduce

ZHAO Yulan

  1. Information Faculty, Business College of Shanxi University, Taiyuan 030031, China
  • Received: 2016-04-14 Online: 2016-11-26 Published: 2016-11-29
  • Contact: ZHAO Yulan E-mail: zhaoyulan24@163.com

Abstract: The Range partition algorithm cannot improve the efficiency of a two-table join when the data set is heavily skewed. To address this problem, we propose an improved data skew join algorithm. The algorithm handles skewed and non-skewed data differently, uses replication and broadcasting to send data to each Reduce node, and completes all join operations in a single round of Map/Reduce tasks. It effectively balances the processing load of each Reduce node, which resolves the impact of heavily skewed data on two-table join performance. A comparison with the traditional partition join algorithm shows that the proposed algorithm is effective.

Key words: optimization of join algorithm, data skew, Range partition algorithm, MapReduce

CLC number:

  • TP311
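
The abstract does not give the algorithm's details, so the following is only a minimal in-memory sketch of one common single-round skew join scheme, not necessarily the author's method. It assumes the skewed keys are known in advance and that the skew lies in the left table. The names `skew_join`, `skewed_keys`, and `num_reducers` are hypothetical. Left records with a skewed key are scattered round-robin over the reducers, matching right records are broadcast to every reducer, and all other records are hash partitioned as usual.

```python
from collections import defaultdict
from itertools import product


def skew_join(left, right, num_reducers, skewed_keys):
    """Simulate a one-round skew-aware equi-join (illustrative sketch only).

    left, right: lists of (key, value) pairs with integer keys.
    skewed_keys: keys assumed to be heavily repeated in `left` (an assumption;
                 the source does not say how skew is detected).
    Returns (joined rows, number of output rows produced per reducer).
    """
    # One bucket per simulated reducer: key -> (left values, right values).
    buckets = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]

    # Map phase, left table: scatter skewed keys, hash-partition the rest.
    next_reducer = 0
    for key, value in left:
        if key in skewed_keys:
            target = next_reducer
            next_reducer = (next_reducer + 1) % num_reducers
        else:
            target = hash(key) % num_reducers
        buckets[target][key][0].append(value)

    # Map phase, right table: broadcast skewed keys to every reducer so each
    # scattered left record still meets all of its matches exactly once.
    for key, value in right:
        if key in skewed_keys:
            targets = range(num_reducers)
        else:
            targets = [hash(key) % num_reducers]
        for target in targets:
            buckets[target][key][1].append(value)

    # Reduce phase: each reducer joins its local left and right values per key.
    output, loads = [], []
    for bucket in buckets:
        produced = 0
        for key, (left_values, right_values) in bucket.items():
            for lv, rv in product(left_values, right_values):
                output.append((key, lv, rv))
                produced += 1
        loads.append(produced)
    return output, loads
```

Calling `skew_join(left, right, 4, {0})` on a left table dominated by key `0` returns the same rows as a plain equi-join. The per-reducer `loads` stay close to each other, whereas pure hash partitioning would send every key-`0` row to a single reducer.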