Random K-nearest neighbor algorithm with learning process

doi:10.13229/j.cnki.jdxbgxb.20220202

Abstract

The traditional K-nearest neighbor (KNN) algorithm is a classic machine learning algorithm, but it has no learning process: to classify a sample it must traverse all training samples, so prediction is slow, and its accuracy is sensitive to the choice of k. To address these shortcomings, this paper proposes two random KNN (RKNN) algorithms with a learning process: SRKNN, which applies Bootstrap sampling to the samples, and ARKNN, which applies Bootstrap sampling to the sample features. Both are Bagging ensemble methods that learn multiple simple KNNs and output the result by voting. The algorithms combine the features of each sample into a combined feature, and each simple KNN is built on that combined feature. The paper focuses on how to select the optimal combination coefficients of the features, and derives selection rules and formulas for the coefficients that give the best classification accuracy. Because learning is introduced when the simple KNNs are constructed, classification no longer traverses all training samples and needs only a binary search, so its classification time complexity is an order of magnitude lower than that of the traditional KNN algorithm. The classification accuracy of the RKNN algorithms is substantially higher than that of the traditional KNN algorithm, and RKNN resolves the difficulty of choosing the value of k. Both theoretical analysis and experimental results verify the effectiveness of the proposed RKNN algorithms.

Keywords: machine learning, K-nearest neighbor algorithm, random K-nearest neighbor, Bagging ensemble learning, AdaBoost
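
The idea summarized in the abstract (simple KNNs built on a one-dimensional combined feature, queried by binary search, and bagged by voting) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `SimpleKNN1D`, `fit_srknn` and `predict_vote` are hypothetical, and the combination coefficients are drawn at random from [0, 1) as a stand-in, not chosen by the paper's optimal selection rules, which the abstract does not state. The sketch follows the SRKNN variant (Bootstrap sampling of samples); an ARKNN-style variant would presumably resample features instead.

```python
import bisect
import random
from collections import Counter


class SimpleKNN1D:
    """Hypothetical base learner: a KNN over a 1-D combined feature."""

    def __init__(self, coeffs, k=3):
        self.coeffs = coeffs  # combination coefficients, one per feature
        self.k = k
        self.keys = []        # sorted combined-feature values
        self.labels = []      # labels aligned with self.keys

    def _project(self, x):
        # Combined feature: linear combination of the original features.
        return sum(c * v for c, v in zip(self.coeffs, x))

    def fit(self, X, y):
        # "Learning" step: project every sample once and sort by the result.
        pairs = sorted(((self._project(x), label) for x, label in zip(X, y)),
                       key=lambda p: p[0])
        self.keys = [p[0] for p in pairs]
        self.labels = [p[1] for p in pairs]
        return self

    def predict(self, x):
        z = self._project(x)
        # Binary search locates the query among the sorted keys.
        hi = bisect.bisect_left(self.keys, z)
        lo = hi - 1
        votes = Counter()
        # Expand left/right from the insertion point to collect k neighbors.
        for _ in range(min(self.k, len(self.keys))):
            take_left = hi >= len(self.keys) or (
                lo >= 0 and z - self.keys[lo] <= self.keys[hi] - z)
            if take_left:
                votes[self.labels[lo]] += 1
                lo -= 1
            else:
                votes[self.labels[hi]] += 1
                hi += 1
        return votes.most_common(1)[0][0]


def fit_srknn(X, y, n_learners=25, k=3, seed=0):
    """Bagging over samples: each learner sees one Bootstrap resample."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        # ASSUMPTION: random coefficients, not the paper's selection rule.
        coeffs = [rng.uniform(0.0, 1.0) for _ in range(d)]
        learner = SimpleKNN1D(coeffs, k)
        learner.fit([X[i] for i in idx], [y[i] for i in idx])
        learners.append(learner)
    return learners


def predict_vote(learners, x):
    """Majority vote over the base learners' predictions."""
    return Counter(m.predict(x) for m in learners).most_common(1)[0][0]
```

In this sketch, each base learner answers a query with one projection over the d features, one binary search over n sorted keys, and k neighbor steps, whereas a brute-force KNN computes n distances in d dimensions per query. Calling `fit_srknn(X, y)` on a list of feature tuples and labels and then `predict_vote(learners, x)` on a new sample exercises the whole pipeline.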

CLC number: TP18

Zhong-liang FU, Xiao-qing CHEN, Wei REN, Yu YAO. Random K-nearest neighbor algorithm with learning process[J]. Journal of Jilin University (Engineering and Technology Edition), 2024, 54(1): 209-220.

Figures and tables: 8 (Tables 1-7, Figure 1)




Dataset	Sample size	Number of features	Classes
Wdbc	569	31	2
BreastC	256	30	2
Pima	768	8	2
SpectData	267	22	2
Ionosphere	351	34	2
Sonar	208	60	2
Tae	151	6	2
TicTac	958	10	2

Dataset	AdaBoost	SVM	KNN	ARKNN	SRKNN
Wdbc	95.85±1.47	96.14±1.26	91.35±1.69	97.06±1.16	96.98±0.94
BreastC	94.03±2.11	94.55±2.23	91.43±3.73	94.55±2.23	93.77±2.26
Pima	74.48±2.36	76.39±1.58	67.04±1.43	74.48±1.66	74.87±1.55
SpectD	83.63±2.27	80.88±3.06	78.13±3.13	79.38±1.01	82.75±2.73
Iono	90.86±2.14	87.24±1.87	87.14±1.81	87.90±1.13	87.24±1.82
Sonar	78.39±5.21	77.90±3.15	81.77±3.68	72.26±3.29	72.26±2.77
Tae	91.11±2.63	87.11±2.59	91.56±5.86	91.56±4.31	86.44±2.89
TicTac	74.03±2.28	65.28±0	72.22±1.44	74.93±1.62	62.95±2.85

Dataset	AdaBoost	SVM	KNN	ARKNN	SRKNN
Wdbc	95.26±1.18	94.44±1.66	90.64±2.08	96.26±1.15	96.08±1.33
BreastC	93.51±1.84	94.03±1.45	89.87±2.52	94.16±2.74	94.03±2.19
Pima	74.96±2.76	77.43±3.00	70.04±1.97	75.65±2.64	76.48±3.00
SpectD	84.25±2.81	80.50±3.59	69.25±3.22	79.50±1.00	82.38±2.53
Iono	89.10±2.10	85.90±1.99	89.14±3.07	89.24±3.33	86.67±2.92
Sonar	79.68±4.03	77.58±3.78	68.39±5.16	75.48±3.21	75.48±3.66
Tae	90.44±3.86	89.11±1.85	92.67±3.87	93.00±4.58	86.67±1.99
TicTac	74.44±3.24	65.28±0.00	85.45±0.98	74.31±2.84	65.07±1.39

Dataset	AdaBoost	SVM	KNN	ARKNN	SRKNN
Wdbc	95.5±1.83	95.38±1.24	87.89±2.01	97.19±1.19	97.19±1.07
BreastC	94.29±2.97	94.94±2.21	86.62±2.47	95.71±1.67	95.45±1.93
Pima	75.78±1.68	76.22±1.66	68.61±0.8	75.78±1.68	75.91±1.72
SpectD	83.88±3.23	81.63±3.92	61.75±5.81	78.25±1.39	83.25±3.63
Iono	90.76±2.09	88.95±3.64	90.14±1.95	88.67±2.81	87.71±2.46
Sonar	76.94±5.2	76.77±3.62	62.9±1.61	77.42±4.5	76.94±5.5
Tae	90.89±2.10	87.56±4.28	88.67±2.52	91.11±2.95	86.44±2.10
TicTac	75.21±1.29	65.28±0	87.71±1.58	73.65±2.84	65.94±1.75

Dataset	AdaBoost	SVM	ARKNN	SRKNN
Wdbc	124	3683	255	262
BreastC	138	1709	94	87
Pima	153	5716	224	216
SpectD	116	7	78	86
Iono	148	9	117	121
Sonar	142	7	88	75
Tae	125	18	46	41
TicTac	119	120	305	315