Unbalanced Big Data Classification Algorithm Based on Random Forest Model

Journal of Jilin University (Information Science Edition), 2023, Vol. 41, Issue (6): 1079-1085.

WEI Yaming¹, MENG Yuan²

1. Information Department, Xuzhou Central Hospital, Xuzhou 221000, Jiangsu, China; 2. Graduate School, Jiangsu Normal University, Xuzhou 221000, Jiangsu, China

Received: 2022-11-11 Online: 2023-11-30 Published: 2023-12-01
About the author: WEI Yaming (born 1991), male, from Xuzhou, Jiangsu, is an engineer at Xuzhou Central Hospital whose research focuses on big data and neural networks. (Tel) 86-13785083958 (E-mail) lb23060010@cumt.edu.cn
Funding:
Natural Science Foundation of Jiangsu Province (BK2013573)

Abstract

To address the poor classification performance of existing algorithms for imbalanced big data, a classification algorithm based on the random forest model is proposed. First, a support vector machine (SVM) is used to filter the imbalanced big data, and the reverse k-nearest neighbor method is used to detect and remove outliers. Incremental principal component analysis then removes the singularity of the covariance matrix of the data, and the entropy method is used to compute weights, from which the feature information of the imbalanced big data is extracted. CART (Classification and Regression Trees) decision trees serve as the base classifiers from which a random forest classifier is built. Finally, the extracted feature information is fed into the classifier to classify the imbalanced big data. Experimental results show that the proposed algorithm samples imbalanced big data well and achieves high classification accuracy, stability, and performance.

Key words: random forest model, unbalanced big data classification, support vector machine (SVM), reverse k-nearest neighbor method, classification and regression trees (CART) decision tree
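The abstract names two preprocessing steps, reverse k-nearest neighbor outlier removal and entropy-based weighting, without giving their formulas. The sketch below illustrates common textbook forms of both, not the authors' implementation. All function names, the neighborhood size `k`, the threshold `min_count`, and the toy data are assumptions made for illustration; the entropy weights assume non-negative feature values and at least two samples.

```python
import math


def reverse_knn_counts(points, k):
    """Count, for each point, how many other points rank it among their k nearest neighbors."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Rank every other point by Euclidean distance to point i.
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in neighbors[:k]:
            counts[j] += 1
    return counts


def remove_outliers(points, k=2, min_count=1):
    """Drop points that fewer than min_count other points select as a neighbor (assumed rule)."""
    counts = reverse_knn_counts(points, k)
    return [p for p, c in zip(points, counts) if c >= min_count]


def entropy_weights(rows):
    """Entropy weight method: rows are samples, columns are non-negative features."""
    n = len(rows)
    divergences = []
    for column in zip(*rows):
        total = sum(column)
        entropy = 1.0  # a column with no mass carries no information
        if total > 0:
            entropy = 0.0
            for value in column:
                p = value / total
                if p > 0:
                    entropy -= p * math.log(p)
            entropy /= math.log(n)  # normalize to [0, 1]
        # Low entropy means the feature discriminates more between samples.
        divergences.append(max(0.0, 1.0 - entropy))
    scale = sum(divergences)
    if scale == 0:
        # Every feature is uniform: fall back to equal weights.
        return [1.0 / len(divergences)] * len(divergences)
    return [d / scale for d in divergences]


# Toy data (hypothetical): four clustered points and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
cleaned = remove_outliers(points, k=2)

# Toy feature table: the first column is constant, the second one varies.
weights = entropy_weights([(1, 1), (1, 2), (1, 3)])
```

In this toy run the isolated point `(10, 10)` is never chosen as a neighbor and is dropped, while the constant feature receives a weight close to zero. The cleaned, weighted features would then be passed on to the dimensionality reduction and random forest stages described in the abstract.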

CLC number: TP391

WEI Yaming, MENG Yuan. Unbalanced Big Data Classification Algorithm Based on Random Forest Model[J]. Journal of Jilin University (Information Science Edition), 2023, 41(6): 1079-1085.
