Distributed K-Means Clustering Algorithm Based on Sampling under MapReduce Framework

Journal of Jilin University (Science Edition)

YANG Jieming^1, WU Qilong^1, QU Zhaoyang^1, YANG Shuo^2, KAN Zhongfeng^2, GAO Ye²

1. School of Information Engineering, Northeast Electric Power University, Jilin 132012, Jilin Province, China;2. Information and Telecommunication Department, State Grid Jilin Electric Power Supply Company, Jilin 132000, Jilin Province, China

Received:2016-05-18 Online:2017-01-26 Published:2017-02-02
Contact: WU Qilong E-mail:neduqlw@foxmail.com

Abstract

Abstract: We propose a distributed K-Means clustering algorithm based on sampling under the MapReduce framework, to address the high time cost of executing K-Means in parallel on massive data. The algorithm uses sampling to reduce the size of the data set while keeping its distribution unchanged, and optimizes the clustering algorithm under the MapReduce framework. Experimental results show that the algorithm effectively shortens clustering time while maintaining good clustering quality, and that it achieves high execution efficiency and good scalability on large-scale data sets.

Key words: sampling, MapReduce, distributed computing, K-Means clustering algorithm
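
The abstract does not spell out the sampling scheme or the MapReduce optimization, so the following is only a minimal single-process sketch of the general idea, not the authors' implementation. It assumes uniform random sampling without replacement and simulates the map step (assign each point to its nearest centroid) and the reduce step (average the points of each cluster) with plain Python functions. All names (`sample_points`, `map_phase`, `reduce_phase`, `kmeans_on_sample`) and the parameters `fraction` and `iterations` are hypothetical.

```python
import random
from collections import defaultdict


def sample_points(points, fraction, minimum=1, seed=0):
    """Uniform random sample without replacement (assumed sampling scheme)."""
    rng = random.Random(seed)
    size = min(len(points), max(minimum, int(len(points) * fraction)))
    return rng.sample(points, size)


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit a (cluster index, point) pair for every point."""
    return [(nearest(p, centroids), p) for p in points]


def reduce_phase(pairs, centroids):
    """Reduce: average the points of each cluster.

    A cluster that received no points keeps its previous centroid.
    """
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    updated = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if members:
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            updated.append(old)
    return updated


def kmeans_on_sample(points, k, fraction=0.1, iterations=20, seed=0):
    """Run K-Means on a sample of the data instead of the full data set."""
    sample = sample_points(points, fraction, minimum=k, seed=seed)
    centroids = random.Random(seed).sample(sample, k)  # random initial centroids
    for _ in range(iterations):
        updated = reduce_phase(map_phase(sample, centroids), centroids)
        if updated == centroids:  # assignments no longer change the centroids
            break
        centroids = updated
    return centroids
```

In a real MapReduce job the map and reduce functions would run on separate nodes over data splits, and a driver would resubmit the job once per iteration. Here the loop in `kmeans_on_sample` plays the role of that driver.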

CLC number: TP391

YANG Jieming, WU Qilong, QU Zhaoyang, YANG Shuo, KAN Zhongfeng, GAO Ye. Distributed K-Means Clustering Algorithm Based on Sampling under MapReduce Framework[J]. Journal of Jilin University (Science Edition), 2017, 55(01): 109-115.
