Journal of Jilin University (Information Science Edition), 2023, Vol. 41, Issue (6): 1112-1119.

Network Intrusion Detection Algorithm for Imbalanced Datasets

XU Zhongyuan 1, YANG Xiuhua 2a, WANG Ye 2b, LI Ling 2b

  1. College of Electrical Information, Changchun University of Architecture, Changchun 130604, China; 2a. Big Data and Network Information Center; 2b. College of Communication Engineering, Jilin University, Changchun 130012, China
  • Received: 2023-07-14 Online: 2023-11-30 Published: 2023-12-01
  • Corresponding author: LI Ling (李玲, b. 1965), female, born in Qiqihar, Heilongjiang; professor and master's supervisor at Jilin University; research interests: cloud computing applications, and applications of machine learning and artificial intelligence algorithms in communications and medicine. Tel: 86-13596491550; E-mail: liling2002@jlu.edu.cn
  • About the author: XU Zhongyuan (徐忠原, b. 1964), male, born in Boli, Heilongjiang; engineer at Changchun University of Architecture; research interests: cloud computing applications and Internet of Things system design. Tel: 86-13908710700; E-mail: 271407906@qq.com
  • Supported by:
    Science and Technology Development Plan Project of Jilin Province (20190302073GX)

Abstract: To address class imbalance in intrusion detection datasets, a network intrusion detection algorithm that combines systematic data preprocessing with hybrid sampling is proposed. Feature values are processed systematically according to their distribution in the dataset. First, for each of the three categorical features "Proto", "Service" and "State", values with few samples are merged to reduce the dimensionality of the one-hot encoding. Then, based on their numerical distributions, 18 numerical features with extreme distributions are log-transformed and subsequently Z-score standardized. A class imbalance handling technique that combines Nearmiss-1 under-sampling with SMOTE (Synthetic Minority Over-sampling Technique) over-sampling is designed: each class of samples in the training set is divided into sub-classes by the "Proto", "Service" and "State" categorical features, and each sub-class is under-sampled or over-sampled in equal proportion. The resulting intrusion detection model, PSSNS-RF (Nearmiss and SMOTE based on Proto, Service, State-Random Forest), achieves a multiclass detection rate of 97.02% on the UNSW-NB15 dataset, resolving the data imbalance problem and significantly improving the detection rate of minority classes.
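The preprocessing and proportional resampling steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions the abstract does not state: the `min_count` threshold and the `"other"` label used when merging rare values, the use of log(1 + x) so that zero-valued features stay finite, the population standard deviation in the Z-score, and the `class_target` size with rounding (and a floor of one sample) when scaling sub-classes. The Nearmiss-1 and SMOTE steps that would then select or synthesize samples up to each target are not shown.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def merge_rare_values(values, min_count, other_label="other"):
    """Replace categorical values seen fewer than min_count times with one shared label.

    min_count and other_label are hypothetical choices, not taken from the paper.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def log_then_zscore(column):
    """Apply log(1 + x) to a non-negative column, then Z-score standardize it.

    log1p is an assumption; the abstract only says a logarithm is applied.
    """
    logged = [math.log1p(x) for x in column]
    mu = mean(logged)
    sigma = pstdev(logged)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in logged]
    return [(x - mu) / sigma for x in logged]


def subclass_targets(subclass_counts, class_target):
    """Scale every sub-class of one class by the same ratio.

    subclass_counts maps a (proto, service, state) tuple to its sample count.
    class_target is a hypothetical desired size for the whole class.
    """
    total = sum(subclass_counts.values())
    ratio = class_target / total
    # Keep at least one sample per sub-class (an assumption of this sketch).
    return {key: max(1, round(n * ratio)) for key, n in subclass_counts.items()}


# Example with made-up values.
protos = ["tcp"] * 5 + ["udp"] * 4 + ["arp", "ospf"]
merged = merge_rare_values(protos, min_count=3)
scaled = log_then_zscore([0, 10, 1000, 100000])
targets = subclass_targets(
    {("tcp", "http", "FIN"): 900, ("udp", "dns", "INT"): 100},
    class_target=500,
)
```

In this reading of "equal proportion", a class that is halved keeps the relative weight of each Proto/Service/State sub-class, so sub-classes with few samples are not wiped out by under-sampling.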

Key words: network intrusion detection, imbalanced dataset, feature selection, network security

CLC number: 

  • TP393.08