Journal of Jilin University (Information Science Edition) ›› 2023, Vol. 41 ›› Issue (6): 1079-1085.


Unbalanced Big Data Classification Algorithm Based on Random Forest Model

WEI Yaming 1 , MENG Yuan 2

  1. Information Department, Xuzhou Central Hospital, Xuzhou 221000, China; 2. Graduate School, Jiangsu Normal University, Xuzhou 221000, China
  • Received: 2022-11-11 Online: 2023-11-30 Published: 2023-12-01
  • About the author: WEI Yaming (born 1991), male, from Xuzhou, Jiangsu; engineer at Xuzhou Central Hospital; research interests include big data and neural networks. (Tel) 86-13785083958, (E-mail) lb23060010@cumt.edu.cn
  • Funding:
    Natural Science Foundation of Jiangsu Province (BK2013573)

Abstract: To address the poor classification performance of current imbalanced big data classification algorithms, an imbalanced big data classification algorithm based on the random forest model is proposed. First, a support vector machine (SVM) is used to filter the imbalanced big data, and the reverse k-nearest neighbor method is then used to detect and eliminate outliers. Incremental principal component analysis removes the singularity of the covariance matrix of the data, and weights are analyzed with the entropy method in order to extract feature information from the imbalanced big data. CART (Classification and Regression Trees) decision trees serve as the base classifiers from which a random forest classifier is built. Finally, the extracted feature information is input into the classifier to classify the imbalanced big data. Experimental results show that the algorithm samples imbalanced big data well and achieves high classification accuracy, stability, and overall performance.

Key words: random forest model, unbalanced big data classification, support vector machine (SVM), reverse k-nearest neighbor method, classification and regression trees (CART) decision tree
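
The outlier-removal step named in the abstract can be illustrated with a minimal sketch of reverse k-nearest-neighbor counting. This is not the authors' implementation: the function names, the Euclidean distance, and the `min_count` threshold are assumptions made only for illustration. The idea shown is that a point which appears in few (or no) other points' k-nearest-neighbor lists is treated as an outlier.

```python
import math


def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean, excluding i)."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    dists.sort()
    return [j for _, j in dists[:k]]


def reverse_knn_counts(points, k):
    """For each point, count how many other points list it among their k nearest neighbours."""
    counts = [0] * len(points)
    for i in range(len(points)):
        for j in knn_indices(points, i, k):
            counts[j] += 1
    return counts


def remove_outliers(points, k=2, min_count=1):
    """Drop points whose reverse k-NN count is below min_count (hypothetical threshold)."""
    counts = reverse_knn_counts(points, k)
    return [p for p, c in zip(points, counts) if c >= min_count]


# Four tightly clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
cleaned = remove_outliers(points, k=2, min_count=1)
```

In this toy data no clustered point has the isolated point among its two nearest neighbors, so its reverse count is zero and it is dropped. The brute-force search is quadratic in the number of points; a spatial index would be needed at the data sizes the abstract refers to.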

CLC number: 

  • TP391
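
The entropy-method weighting mentioned in the abstract can be sketched as follows. This shows the commonly used form of the entropy weight method, assuming a non-negative data matrix with at least two rows and at least one non-constant column; the abstract does not state how the authors normalize the data or apply the resulting weights, so `entropy_weights` and the sample matrix are hypothetical.

```python
import math


def entropy_weights(rows):
    """Entropy-method weights for the columns of a non-negative data matrix.

    Assumes every column has a positive sum and at least one column is not constant.
    """
    n = len(rows)
    m = len(rows[0])
    divergences = []
    for j in range(m):
        col = [row[j] for row in rows]
        total = sum(col)
        # Share of each sample in column j; zero shares contribute nothing to the entropy.
        probs = [x / total for x in col]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        # A column with lower entropy varies more across samples and gets more weight.
        divergences.append(1.0 - entropy)
    s = sum(divergences)
    return [d / s for d in divergences]


# Column 0 is constant (carries no information); column 1 varies across samples.
features = [
    [1.0, 5.0],
    [1.0, 1.0],
    [1.0, 0.5],
    [1.0, 0.1],
]
weights = entropy_weights(features)
```

A constant column has maximal entropy and therefore receives a weight of approximately zero, while the varying column receives nearly all of the weight.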