Journal of Jilin University (Engineering and Technology Edition), 2024, Vol. 54, Issue (1): 209-220. doi: 10.13229/j.cnki.jdxbgxb.20220202

• Computer Science and Technology •

Random K-nearest neighbor algorithm with learning process

Zhong-liang FU, Xiao-qing CHEN, Wei REN, Yu YAO

  1. Chengdu Institute of Computer Applications, University of Chinese Academy of Sciences, Chengdu 610299, China
  • Received: 2022-03-03 Online: 2024-01-30 Published: 2024-03-28
  • About the author: FU Zhong-liang (1967-), male, research professor, Ph.D. Research interests: machine learning, machine vision. E-mail: fzliang@casit.com.cn
  • Funding: National Natural Science Foundation of China (6197010131); Sichuan Province Major Science and Technology Project (2021YFS0019)

Abstract:

The traditional K-nearest neighbor (KNN) algorithm has no learning process: classification must traverse all training samples, so it is slow, and the result is sensitive to the choice of k. To address these drawbacks, this paper proposes two random KNN (RKNN) algorithms with a learning process: SRKNN, which Bootstrap-samples the training samples, and ARKNN, which Bootstrap-samples the sample features. Both are Bagging ensembles: they learn multiple simple KNN classifiers and output the majority vote. Each simple KNN is built on a combined feature obtained by linearly combining the original features. The paper focuses on how to select the optimal combination coefficients of the features, and derives selection rules and formulas for the coefficients that yield the best classification accuracy. Because RKNN introduces learning when constructing each simple KNN, classification no longer traverses all training samples and only requires a binary search, so its classification time complexity is an order of magnitude lower than that of traditional KNN. RKNN also improves classification accuracy substantially over traditional KNN and resolves the difficulty of selecting k. Both theoretical analysis and experimental results verify the effectiveness of the proposed RKNN algorithms.

Key words: machine learning, K-nearest neighbor algorithm, random K-nearest neighbor, Bagging ensemble learning, AdaBoost

CLC number: TP18
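
To make the mechanism described in the abstract concrete, the following is a minimal Python sketch of the two RKNN variants. It is an illustration under stated assumptions, not the authors' implementation: the paper's optimal combination-coefficient formulas are not reproduced on this page, so random coefficients in [-1, 1] stand in for them, and the names Simple1DKNN, RKNN, and all parameters are hypothetical.

```python
import bisect
import random
from collections import Counter


class Simple1DKNN:
    """Base learner: project each sample onto one combined feature
    (a linear combination of the original features), keep the training
    projections sorted, and answer queries with a binary search instead
    of scanning all training samples."""

    def __init__(self, coeffs, k=5):
        self.coeffs = coeffs  # combination coefficients, one per feature
        self.k = k

    def _project(self, x):
        # value of the combined feature: <coeffs, x>
        return sum(c * v for c, v in zip(self.coeffs, x))

    def fit(self, X, y):
        # the "learning process": sort the samples by projection once
        pairs = sorted((self._project(x), label) for x, label in zip(X, y))
        self.proj = [p for p, _ in pairs]
        self.labels = [lab for _, lab in pairs]
        return self

    def predict_one(self, x):
        # O(log n) binary search, then gather the k projections nearest
        # to the query by expanding left/right from the insertion point
        p = self._project(x)
        lo = hi = bisect.bisect_left(self.proj, p)
        votes = []
        while len(votes) < self.k and (lo > 0 or hi < len(self.proj)):
            left = p - self.proj[lo - 1] if lo > 0 else float("inf")
            right = self.proj[hi] - p if hi < len(self.proj) else float("inf")
            if left <= right:
                lo -= 1
                votes.append(self.labels[lo])
            else:
                votes.append(self.labels[hi])
                hi += 1
        return Counter(votes).most_common(1)[0][0]


class RKNN:
    """Bagging ensemble of T simple KNNs with majority voting.
    mode="srknn" Bootstrap-samples the training samples;
    mode="arknn" Bootstrap-samples the features (features never drawn
    get a zero combination coefficient)."""

    def __init__(self, T=20, k=5, mode="srknn", seed=0):
        self.T, self.k, self.mode = T, k, mode
        self.rng = random.Random(seed)

    def fit(self, X, y):
        n, d = len(X), len(X[0])
        self.learners = []
        for _ in range(self.T):
            # stand-in for the paper's optimal-coefficient formulas
            coeffs = [self.rng.uniform(-1.0, 1.0) for _ in range(d)]
            if self.mode == "srknn":
                idx = [self.rng.randrange(n) for _ in range(n)]
                Xt, yt = [X[i] for i in idx], [y[i] for i in idx]
            else:
                drawn = {self.rng.randrange(d) for _ in range(d)}
                coeffs = [c if j in drawn else 0.0
                          for j, c in enumerate(coeffs)]
                Xt, yt = X, y
            self.learners.append(Simple1DKNN(coeffs, self.k).fit(Xt, yt))
        return self

    def predict_one(self, x):
        # Bagging: majority vote over the T base learners
        ballot = Counter(m.predict_one(x) for m in self.learners)
        return ballot.most_common(1)[0][0]


# Toy usage on a synthetic two-class problem
rng = random.Random(1)
X = [[rng.gauss(m, 1.0) for _ in range(5)] for m in (0, 2) for _ in range(100)]
y = [0] * 100 + [1] * 100
model = RKNN(T=20, k=5, mode="arknn").fit(X, y)
print(model.predict_one([2.0] * 5))  # expected: 1
```

The sketch only illustrates the data flow the abstract describes: T sorted one-dimensional projections built at training time, binary-search lookup at query time, and a majority vote across the ensemble. The accuracies reported in Tables 2-4 rely on the paper's exact coefficient selection, not on the random coefficients used here.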

Table 1  Experimental datasets

Dataset      Samples  Features  Classes
Wdbc         569      31        2
BreastC      256      30        2
Pima         768      8         2
SpectData    267      22        2
Ionosphere   351      34        2
Sonar        208      60        2
Tae          151      6         2
TicTac       958      10        2

Table 2  Accuracy of each algorithm on the experimental datasets (k=1) (%)

Dataset   AdaBoost    SVM         KNN         ARKNN       SRKNN
Wdbc      95.85±1.47  96.14±1.26  91.35±1.69  97.06±1.16  96.98±0.94
BreastC   94.03±2.11  94.55±2.23  91.43±3.73  94.55±2.23  93.77±2.26
Pima      74.48±2.36  76.39±1.58  67.04±1.43  74.48±1.66  74.87±1.55
SpectD    83.63±2.27  80.88±3.06  78.13±3.13  79.38±1.01  82.75±2.73
Iono      90.86±2.14  87.24±1.87  87.14±1.81  87.90±1.13  87.24±1.82
Sonar     78.39±5.21  77.90±3.15  81.77±3.68  72.26±3.29  72.26±2.77
Tae       91.11±2.63  87.11±2.59  91.56±5.86  91.56±4.31  86.44±2.89
TicTac    74.03±2.28  65.28±0     72.22±1.44  74.93±1.62  62.95±2.85

Table 3  Accuracy of each algorithm on the experimental datasets (k=3) (%)

Dataset   AdaBoost    SVM         KNN         ARKNN       SRKNN
Wdbc      95.26±1.18  94.44±1.66  90.64±2.08  96.26±1.15  96.08±1.33
BreastC   93.51±1.84  94.03±1.45  89.87±2.52  94.16±2.74  94.03±2.19
Pima      74.96±2.76  77.43±3.00  70.04±1.97  75.65±2.64  76.48±3.00
SpectD    84.25±2.81  80.50±3.59  69.25±3.22  79.50±1.00  82.38±2.53
Iono      89.10±2.10  85.90±1.99  89.14±3.07  89.24±3.33  86.67±2.92
Sonar     79.68±4.03  77.58±3.78  68.39±5.16  75.48±3.21  75.48±3.66
Tae       90.44±3.86  89.11±1.85  92.67±3.87  93.00±4.58  86.67±1.99
TicTac    74.44±3.24  65.28±0.00  85.45±0.98  74.31±2.84  65.07±1.39

Table 4  Accuracy of each algorithm on the experimental datasets (k=5) (%)

Dataset   AdaBoost    SVM         KNN         ARKNN       SRKNN
Wdbc      95.5±1.83   95.38±1.24  87.89±2.01  97.19±1.19  97.19±1.07
BreastC   94.29±2.97  94.94±2.21  86.62±2.47  95.71±1.67  95.45±1.93
Pima      75.78±1.68  76.22±1.66  68.61±0.8   75.78±1.68  75.91±1.72
SpectD    83.88±3.23  81.63±3.92  61.75±5.81  78.25±1.39  83.25±3.63
Iono      90.76±2.09  88.95±3.64  90.14±1.95  88.67±2.81  87.71±2.46
Sonar     76.94±5.2   76.77±3.62  62.9±1.61   77.42±4.5   76.94±5.5
Tae       90.89±2.10  87.56±4.28  88.67±2.52  91.11±2.95  86.44±2.10
TicTac    75.21±1.29  65.28±0     87.71±1.58  73.65±2.84  65.94±1.75

Table 5  Learning time of each algorithm (k=5, T=20) (ms)

Dataset   AdaBoost  SVM   KNN  ARKNN  SRKNN
Wdbc      124       3683  0    255    262
BreastC   138       1709  0    94     87
Pima      153       5716  0    224    216
SpectD    116       7     0    78     86
Iono      148       9     0    117    121
Sonar     142       7     0    88     75
Tae       125       18    0    46     41
TicTac    119       120   0    305    315

Table 6  Classification time per sample (k=5, T=20) (ms)

Dataset   AdaBoost  SVM    KNN    ARKNN  SRKNN
Wdbc      8.218     0.308  0.445  0.053  0.055
BreastC   8.854     0.313  0.430  0.050  0.047
Pima      7.804     0.337  0.865  0.044  0.043
SpectD    8.270     0.467  0.444  0.051  0.046
Iono      8.115     0.325  0.465  0.045  0.056
Sonar     7.958     0.092  0.140  0.013  0.014
Tae       8.939     0.372  0.647  0.057  0.045
TicTac    7.963     0.323  0.932  0.050  0.052
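
A rough reading of Table 6, assuming per-query cost is dominated by comparisons and distance evaluations: on the largest dataset (TicTac, n = 958 samples with d = 10 features per Table 1), traditional KNN computes about n distances of d operations each, while RKNN with T = 20 base learners performs roughly T binary searches:

$$T \cdot \lceil \log_2 n \rceil = 20 \times 10 = 200 \ll n \cdot d = 958 \times 10 = 9580.$$

The measured times agree in order of magnitude: 0.932 ms for KNN versus about 0.05 ms for ARKNN/SRKNN, an 18-19x speedup, consistent with the abstract's claim of a classification time complexity one order of magnitude lower.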

Fig. 1  Classification accuracy for different values of T

Table 7  Accuracy with the two kinds of combination coefficients (k=5, T=20) (%)

          ARKNN                  SRKNN
Dataset   Accurate  Approximate  Accurate  Approximate
Wdbc      96.90     93.51        96.26     93.57
BreastC   95.97     93.12        94.55     93.64
Pima      75.61     74.35        75.43     74.78
SpectD    79.50     79.13        79.50     82.38
Iono      87.52     86.29        88.29     85.71
Sonar     76.13     71.61        72.26     72.58
Tae       88.89     89.56        88.00     86.22
TicTac    75.35     74.44        66.32     67.43