Distributed K-Means Clustering Algorithm Based on Sampling under MapReduce Framework

Abstract

We propose a sampling-based distributed K-Means clustering algorithm under the MapReduce framework to address the high time cost of running K-Means in parallel on massive datasets. The algorithm uses sampling to reduce the size of the original data while keeping the data distribution unchanged, and optimizes the clustering algorithm under the MapReduce framework. Experimental results show that the algorithm effectively reduces clustering time while maintaining good clustering quality, and that it executes efficiently and scales well on large datasets.
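The abstract does not specify the sampling scheme or the MapReduce optimizations, so the following is only a minimal single-process sketch of the general idea: sample the data, then run K-Means iterations expressed as a map step (assign each point to its nearest centroid) and a reduce step (average the points per cluster). Uniform random sampling, the squared Euclidean distance, and all function names (`sample_points`, `map_phase`, `reduce_phase`, `kmeans_on_sample`) are assumptions made for illustration, not the authors' method.

```python
import random
from collections import defaultdict


def sample_points(points, fraction, seed=0):
    """Uniform random sample without replacement (an assumed sampling scheme)."""
    rng = random.Random(seed)
    size = max(1, int(len(points) * fraction))
    return rng.sample(points, size)


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map step: emit (cluster index, point) pairs."""
    return [(nearest(p, centroids), p) for p in points]


def reduce_phase(pairs, centroids):
    """Reduce step: average the points of each cluster.

    A cluster that received no points keeps its previous centroid.
    """
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    updated = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if members:
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            updated.append(old)
    return updated


def kmeans_on_sample(points, k, fraction=0.1, iterations=10, seed=0):
    """Cluster a sample of `points` instead of the full dataset."""
    sample = sample_points(points, fraction, seed)
    # Initial centroids are drawn from the sample (hypothetical choice).
    centroids = random.Random(seed).sample(sample, k)
    for _ in range(iterations):
        updated = reduce_phase(map_phase(sample, centroids), centroids)
        if updated == centroids:  # assignments stopped changing
            break
        centroids = updated
    return centroids
```

In an actual MapReduce job the map and reduce steps would run on separate workers over data partitions; here they are plain functions so the data flow is easy to follow.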

Key words: MapReduce, distributed computing, sampling, K-Means clustering algorithm

CLC Number: TP391

YANG Jieming, WU Qilong, QU Zhaoyang, YANG Shuo, KAN Zhongfeng, GAO Ye. Distributed K-Means Clustering Algorithm Based on Sampling under MapReduce Framework [J]. Journal of Jilin University Science Edition, 2017, 55(01): 109-115.


Distributed K-Means Clustering Algorithm Based onSampling under MapReduce Framework

PDF (PC)

Like

Knowledge

Abstract

Cite this article

share this article

References

Related Articles 15

Metrics

Comments

Recommended 10