Journal of Jilin University (Engineering and Technology Edition) ›› 2024, Vol. 54 ›› Issue (1): 209-220. doi: 10.13229/j.cnki.jdxbgxb.20220202


Random K-nearest neighbor algorithm with learning process

Zhong-liang FU, Xiao-qing CHEN, Wei REN, Yu YAO

1. Chengdu Institute of Computer Applications, University of Chinese Academy of Sciences, Chengdu 610299, China
Received: 2022-03-03  Online: 2024-01-30  Published: 2024-03-28

Abstract:

The traditional K-nearest neighbor (KNN) algorithm is a classic machine learning algorithm. It has no learning process, must traverse all training samples when classifying, and is therefore time-consuming as well as sensitive to the choice of k. This paper proposes two random KNN (RKNN) algorithms with a learning process: the SRKNN algorithm, based on Bootstrap sampling of the samples, and the ARKNN algorithm, based on Bootstrap sampling of the sample features. Both belong to Bagging ensemble learning: multiple simple KNN classifiers are learned, and the final result is output by voting. Each simple KNN combines the features of the samples into a combined feature and is built on that combined feature. The paper focuses on how to select the optimal combination coefficients of the features, and derives selection rules and formulas for the combined feature that yields the best classification accuracy. Because RKNN introduces learning when constructing each simple KNN, classification no longer requires traversing all training samples but only a binary search, so its classification time complexity is an order of magnitude lower than that of the traditional KNN algorithm. The classification accuracy of RKNN is also significantly higher than that of traditional KNN, and RKNN alleviates the difficulty of choosing k in the KNN algorithm. Both theoretical analysis and experimental results show that the proposed RKNN algorithm is an efficient improvement of the KNN algorithm.
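As a concrete illustration of the construction just described, the following Python sketch bags T simple KNN learners: each one Bootstrap-resamples the training set (the SRKNN variant), projects it onto a single linearly combined feature, and keeps the projections sorted so that classifying a sample needs only a binary search rather than a scan of all training samples. The random unit-vector combination coefficients and the 2k-candidate window around the search position are illustrative assumptions; the paper's analytically derived optimal combination coefficients are not reproduced here.

```python
import bisect
from collections import Counter

import numpy as np


class SimpleRKNN:
    """Bagging of T simple KNNs, each built on one random combined feature."""

    def __init__(self, n_estimators=20, k=5, seed=0):
        self.T, self.k = n_estimators, k
        self.rng = np.random.default_rng(seed)
        self.learners = []  # (w, sorted projections, labels in sorted order)

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        n, d = X.shape
        for _ in range(self.T):
            idx = self.rng.integers(0, n, n)   # Bootstrap resample of the samples
            w = self.rng.normal(size=d)
            w /= np.linalg.norm(w)             # combination coefficients (random here, not the paper's optimum)
            z = X[idx] @ w                     # 1-D combined-feature values
            order = np.argsort(z)
            self.learners.append((w, z[order], y[idx][order]))
        return self

    def predict_one(self, x):
        x = np.asarray(x, dtype=float)
        votes = []
        for w, z, labels in self.learners:
            p = float(x @ w)
            pos = bisect.bisect_left(z, p)     # binary search: O(log n), no full traversal
            lo, hi = max(0, pos - self.k), min(len(z), pos + self.k)
            near = sorted(range(lo, hi), key=lambda i: abs(z[i] - p))[:self.k]
            votes.append(Counter(labels[near]).most_common(1)[0][0])
        return Counter(votes).most_common(1)[0][0]  # vote over the T simple KNNs
```

A call such as SimpleRKNN(n_estimators=20, k=5).fit(X_train, y_train).predict_one(X_test[0]) mirrors the k and T settings used in the experiments below.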

Key words: machine learning, K-nearest neighbor algorithm, random K-nearest neighbor, Bagging ensemble learning, AdaBoost

CLC Number: TP18

Table 1

The experimental data sets

Dataset      Samples   Features   Classes
Wdbc         569       31         2
BreastC      256       30         2
Pima         768       8          2
SpectData    267       22         2
Ionosphere   351       34         2
Sonar        208       60         2
Tae          151       6          2
TicTac       958       10         2

Table 2

Accuracy rates of each method on the experimental data sets (k=1)

Dataset    AdaBoost      SVM           KNN           ARKNN         SRKNN
Wdbc       95.85±1.47    96.14±1.26    91.35±1.69    97.06±1.16    96.98±0.94
BreastC    94.03±2.11    94.55±2.23    91.43±3.73    94.55±2.23    93.77±2.26
Pima       74.48±2.36    76.39±1.58    67.04±1.43    74.48±1.66    74.87±1.55
SpectD     83.63±2.27    80.88±3.06    78.13±3.13    79.38±1.01    82.75±2.73
Iono       90.86±2.14    87.24±1.87    87.14±1.81    87.90±1.13    87.24±1.82
Sonar      78.39±5.21    77.90±3.15    81.77±3.68    72.26±3.29    72.26±2.77
Tae        91.11±2.63    87.11±2.59    91.56±5.86    91.56±4.31    86.44±2.89
TicTac     74.03±2.28    65.28±0.00    72.22±1.44    74.93±1.62    62.95±2.85

Table 3

Accuracy rates of each method on the experimental data sets (k=3)

Dataset    AdaBoost      SVM           KNN           ARKNN         SRKNN
Wdbc       95.26±1.18    94.44±1.66    90.64±2.08    96.26±1.15    96.08±1.33
BreastC    93.51±1.84    94.03±1.45    89.87±2.52    94.16±2.74    94.03±2.19
Pima       74.96±2.76    77.43±3.00    70.04±1.97    75.65±2.64    76.48±3.00
SpectD     84.25±2.81    80.50±3.59    69.25±3.22    79.50±1.00    82.38±2.53
Iono       89.10±2.10    85.90±1.99    89.14±3.07    89.24±3.33    86.67±2.92
Sonar      79.68±4.03    77.58±3.78    68.39±5.16    75.48±3.21    75.48±3.66
Tae        90.44±3.86    89.11±1.85    92.67±3.87    93.00±4.58    86.67±1.99
TicTac     74.44±3.24    65.28±0.00    85.45±0.98    74.31±2.84    65.07±1.39

Table 4

Accuracy rates of each method on the experimental data sets (k=5)

Dataset    AdaBoost      SVM           KNN           ARKNN         SRKNN
Wdbc       95.50±1.83    95.38±1.24    87.89±2.01    97.19±1.19    97.19±1.07
BreastC    94.29±2.97    94.94±2.21    86.62±2.47    95.71±1.67    95.45±1.93
Pima       75.78±1.68    76.22±1.66    68.61±0.80    75.78±1.68    75.91±1.72
SpectD     83.88±3.23    81.63±3.92    61.75±5.81    78.25±1.39    83.25±3.63
Iono       90.76±2.09    88.95±3.64    90.14±1.95    88.67±2.81    87.71±2.46
Sonar      76.94±5.20    76.77±3.62    62.90±1.61    77.42±4.50    76.94±5.50
Tae        90.89±2.10    87.56±4.28    88.67±2.52    91.11±2.95    86.44±2.10
TicTac     75.21±1.29    65.28±0.00    87.71±1.58    73.65±2.84    65.94±1.75

Table 5

Learning time of each method (k=5, T=20)

Dataset    AdaBoost   SVM     KNN   ARKNN   SRKNN
Wdbc       124        3683    0     255     262
BreastC    138        1709    0     94      87
Pima       153        5716    0     224     216
SpectD     116        7       0     78      86
Iono       148        9       0     117     121
Sonar      142        7       0     88      75
Tae        125        18      0     46      41
TicTac     119        120     0     305     315

Table 6

Single sample classification time of each method (k=5, T=20)

Dataset    AdaBoost   SVM     KNN     ARKNN   SRKNN
Wdbc       8.218      0.308   0.445   0.053   0.055
BreastC    8.854      0.313   0.430   0.050   0.047
Pima       7.804      0.337   0.865   0.044   0.043
SpectD     8.270      0.467   0.444   0.051   0.046
Iono       8.115      0.325   0.465   0.045   0.056
Sonar      7.958      0.092   0.140   0.013   0.014
Tae        8.939      0.372   0.647   0.057   0.045
TicTac     7.963      0.323   0.932   0.050   0.052

Fig. 1

Accuracy rates for different T

Table 7

Accuracy rates of two combination coefficients (k=5, T=20)

Dataset    ARKNN                      SRKNN
           Accurate    Approximate    Accurate    Approximate
Wdbc       96.90       93.51          96.26       93.57
BreastC    95.97       93.12          94.55       93.64
Pima       75.61       74.35          75.43       74.78
SpectD     79.50       79.13          79.50       82.38
Iono       87.52       86.29          88.29       85.71
Sonar      76.13       71.61          72.26       72.58
Tae        88.89       89.56          88.00       86.22
TicTac     75.35       74.44          66.32       67.43