Journal of Jilin University Science Edition


Distributed K-Means Clustering Algorithm Based onSampling under MapReduce Framework

YANG Jieming1, WU Qilong1, QU Zhaoyang1, YANG Shuo2, KAN Zhongfeng2, GAO Ye2   

  1. School of Information Engineering, Northeast Electric Power University, Jilin 132012, Jilin Province, China; 2. Information and Telecommunication Department, State Grid Jilin Electric Power Supply Company, Jilin 132000, Jilin Province, China
  • Received: 2016-05-18 Online: 2017-01-26 Published: 2017-02-02
  • Contact: WU Qilong E-mail:neduqlw@foxmail.com

Abstract: We propose a sampling-based distributed K-Means clustering algorithm under the MapReduce framework to address the high time cost of running K-Means in parallel on massive data. The algorithm uses sampling to reduce the size of the original data while keeping its distribution unchanged, and optimizes the clustering procedure under the MapReduce framework. Experimental results show that the algorithm effectively reduces clustering time while maintaining good clustering quality, and that it executes efficiently and scales well on large-scale datasets.
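
The general idea summarized in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the abstract does not specify the sampling scheme or the MapReduce optimizations, so uniform random sampling, the choice of the first k sampled points as initial centroids, and all function names (`sample`, `map_phase`, `reduce_phase`, `kmeans`) are assumptions made purely for illustration. The map step assigns each point to its nearest centroid, and the reduce step averages the points assigned to each centroid.

```python
import random
from itertools import chain


def sample(points, fraction, seed=0):
    """Uniform random sample (assumption: the paper's sampler may differ)."""
    rng = random.Random(seed)
    size = max(1, int(len(points) * fraction))
    return rng.sample(points, size)


def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(split, centroids):
    """Mapper: emit (cluster index, point) for every point in one data split."""
    for point in split:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reducer: average the points of each cluster to get new centroids."""
    sums, counts = {}, {}
    for idx, point in pairs:
        acc = sums.setdefault(idx, [0.0] * len(point))
        for d, value in enumerate(point):
            acc[d] += value
        counts[idx] = counts.get(idx, 0) + 1
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(v / counts[i] for v in sums[i]) if i in sums else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, k, fraction=0.5, splits=2, iters=10, seed=0):
    """Sample the data, then iterate map and reduce steps on the sample."""
    data = sample(points, fraction, seed)
    centroids = [tuple(p) for p in data[:k]]  # assumption: naive initialization
    chunks = [data[i::splits] for i in range(splits)]  # simulated input splits
    for _ in range(iters):
        pairs = chain.from_iterable(map_phase(c, centroids) for c in chunks)
        centroids = reduce_phase(pairs, centroids)
    return centroids
```

In a real MapReduce deployment the mappers would run on separate nodes over distributed input splits, and a combiner would usually pre-aggregate partial sums before the shuffle; the sketch keeps everything in one process to show only the data flow.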

Key words: MapReduce, distributed computing, sampling, K-Means clustering algorithm

CLC Number: TP391