Journal of Jilin University (Science Edition)

• Computer Science

Distributed K-Means Clustering Algorithm Based on Sampling under MapReduce Framework

YANG Jieming1, WU Qilong1, QU Zhaoyang1, YANG Shuo2, KAN Zhongfeng2, GAO Ye2

  1. School of Information Engineering, Northeast Electric Power University, Jilin 132012, Jilin Province, China; 2. Information and Telecommunication Department, State Grid Jilin Electric Power Supply Company, Jilin 132000, Jilin Province, China
  • Received: 2016-05-18 Online: 2017-01-26 Published: 2017-02-02
  • Contact: WU Qilong E-mail: neduqlw@foxmail.com

Abstract: We propose a distributed K-Means clustering algorithm based on sampling under the MapReduce framework, to address the high time cost of executing K-Means in parallel on massive data. The algorithm uses a sampling method to reduce the size of the data set while preserving its distribution, and optimizes the clustering algorithm under the MapReduce framework. Experimental results show that the algorithm effectively shortens clustering time while maintaining good clustering quality, and that it achieves high execution efficiency and good scalability on large-scale data sets.

Key words: sampling, MapReduce, distributed computing, K-Means clustering algorithm

CLC number: TP391
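
The idea summarized in the abstract (shrink the data set by sampling, then run K-Means as map and reduce steps) can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the abstract does not name the sampling scheme or the MapReduce optimizations, so uniform random sampling is assumed here, and the function names `sample_points`, `map_phase`, `reduce_phase`, and `kmeans_on_sample` are hypothetical. The map and reduce phases are simulated with plain Python functions rather than a Hadoop job.

```python
import random


def sample_points(points, fraction, seed=0):
    """Draw a uniform random sample (an assumed scheme, not the paper's)."""
    rng = random.Random(seed)
    size = max(1, int(len(points) * fraction))
    return rng.sample(points, size)


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    def sq_dist(centroid):
        return sum((p - c) ** 2 for p, c in zip(point, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def map_phase(split, centroids):
    """Map: emit (cluster index, point) for every point in one input split."""
    for point in split:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce: average the points per cluster; empty clusters keep their centroid."""
    sums, counts = {}, {}
    for idx, point in pairs:
        acc = sums.setdefault(idx, [0.0] * len(point))
        for d, value in enumerate(point):
            acc[d] += value
        counts[idx] = counts.get(idx, 0) + 1
    updated = list(centroids)
    for idx, acc in sums.items():
        updated[idx] = tuple(v / counts[idx] for v in acc)
    return updated


def kmeans_on_sample(points, k, fraction=0.1, splits=4, iterations=20, seed=0):
    """Run K-Means on a sample only, one simulated map/reduce round per iteration."""
    sample = sample_points(points, fraction, seed)
    centroids = random.Random(seed).sample(sample, k)  # initial centroids from the sample
    chunks = [sample[i::splits] for i in range(splits)]  # stand-ins for input splits
    for _ in range(iterations):
        pairs = [kv for chunk in chunks for kv in map_phase(chunk, centroids)]
        updated = reduce_phase(pairs, centroids)
        if updated == centroids:  # stop once the centroids no longer move
            break
        centroids = updated
    return centroids
```

Calling `kmeans_on_sample(points, k=2, fraction=0.5)` on a list of coordinate tuples returns `k` centroids computed from the sample alone; in a real deployment each chunk would be processed by a separate mapper, and the final centroids could then be used to label the full data set in one additional pass.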