Journal of Jilin University (Engineering and Technology Edition) ›› 2021, Vol. 51 ›› Issue (4): 1375-1386. doi: 10.13229/j.cnki.jdxbgxb20200314

Encrypted and compressed traffic classification based on random feature set

Guang-song LI1, Wen-qing LI1, Qing LI2

  1. School of Cyber Security, Information Engineering University, Zhengzhou 450001, China
  2. School of Information Systems Engineering, Information Engineering University, Zhengzhou 450001, China
  • Received: 2020-05-10 Online: 2021-07-01 Published: 2021-07-14
  • Contact: Wen-qing LI E-mail: lgsok@163.com; qingqiujingshui@163.com

Abstract:

When data is encrypted or compressed before transmission over a network, the payload appears nearly random, so existing traffic detection methods struggle to distinguish encrypted traffic from compressed traffic. To address this problem, this paper proposes the ECF randomness feature set, which is built on the differences in randomness between encrypted data and compressed data. Without relying on network protocol information, packet headers, or compression identifiers, mainstream machine learning algorithms trained on these features accurately identify whether data is encrypted or compressed. Experimental results show that the method achieves higher accuracy than existing methods and that it also generalizes and transfers well.

Key words: traffic classification, encrypted traffic, compressed traffic, machine learning

CLC Number: 

  • TP309.7

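Table 1

Sources of raw data

| Category | File type | Source | Count | Size |
|---|---|---|---|---|
| Text | txt | Multi-Domain Sentiment Dataset | 225 | 1.12 GB |
| Images | JPEG | SUN2012 | 35000 | 935 MB |
| Video | avi | UCF101 | 42500 | 852 MB |
| Binary files | exe, dll, lib, elf | Windows 10 and Ubuntu 15 system files | 1910 | 1.33 GB |
| Audio | MP3 | NetEase Music | 198 | 1.03 GB |
| Mixed documents | PDF | Downloaded from the Internet | 472 | 814 MB |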

Table 2

Open-source resources

| No. | Name | Source |
|---|---|---|
| 1 | github | https://github.com/SengGe2019 |
| 2 | Multi-Domain Sentiment Dataset | http://www.cs.jhu.edu/~mdredze/datasets/sentiment |
| 3 | SUN2012 | http://groups.csail.mit.edu/vision/SUN/ |
| 4 | UCF101 | https://www.crcv.ucf.edu/data/UCF101.php |
| 5 | zip | ftp://ftp.info-zip.org/pub/infozip/ |
| 6 | Gzip | http://www.gnu.org/software/gzip/ |
| 7 | cryptoPP-820 | https://www.cryptopp.com/ |

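Table 3

Data sizes after processing and encryption

| File type | Original size | Algorithm | Final size | Total segments |
|---|---|---|---|---|
| Text | 1.12 GB | AES, 3DES, IDEA | 1.12 GB | 951 480 |
| | | RAR | 221 MB | 227 169 |
| | | ZIP | 360 MB | 368 784 |
| | | GZIP | 362 MB | 371 014 |
| Mixed files | 814 MB | AES, 3DES, IDEA | 814 MB | 833 267 |
| | | RAR | 703 MB | 719 866 |
| | | ZIP | 734 MB | 749 290 |
| | | GZIP | 733 MB | 750 763 |
| Binary files | 1.33 GB | AES, 3DES, IDEA | 1.33 GB | 1 403 012 |
| | | RAR | 443 MB | 453 626 |
| | | ZIP | 566 MB | 577 740 |
| | | GZIP | 564 MB | 578 680 |
| Images | 935 MB | AES, 3DES, IDEA | 935 MB | 954 959 |
| | | RAR | 910 MB | 930 238 |
| | | ZIP | 911 MB | 931 171 |
| | | GZIP | 914 MB | 933 469 |
| Video | 852 MB | AES, 3DES, IDEA | 852 MB | 872 607 |
| | | RAR | 840 MB | 859 172 |
| | | ZIP | 839 MB | 858 454 |
| | | GZIP | 841 MB | 860 094 |
| Audio | 1.03 GB | AES, 3DES, IDEA | 1.03 GB | 1 087 750 |
| | | RAR | 1.01 GB | 1 068 870 |
| | | ZIP | 1.02 GB | 1 070 249 |
| | | GZIP | 1.02 GB | 1 074 898 |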

Fig.1

Expected frequencies of the chi-squared statistic

Fig.2

Averages of the α-Renyi entropy

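Table 4

Features of encrypted and compressed data

| ID | Feature | Description | Type |
|---|---|---|---|
| ECF-1 | Chi-square | Measures how far the target data deviates from uniformly distributed data, taking the uniform byte distribution as the expected value. The value is computed as given in Eq. (1). | float |
| ECF-2 | α-Renyi entropy | Measures the randomness of the byte distribution density of the target data, taking the actual byte distribution of the data as the discrete density. The value is computed as given in Eq. (2). | float |
| ECF-3 | Monobit frequency | Tests how closely the proportion of 0s and 1s in the target data resembles that of random data. The value is the p_value of the Frequency Test in the NIST test suite. | float |
| ECF-4 | Block frequency | Splits the target data into sub-blocks and tests how closely the proportion of 0s and 1s in each sub-block resembles that of random data. The value is the p_value of the Frequency Test within a Block in the NIST test suite. | float |
| ECF-5 | Runs | Examines the run lengths of 0s or 1s and how often they alternate, to determine whether the oscillation between runs is too fast or too slow. The value is the p_value of the Runs Test in the NIST test suite. | float |
| ECF-6 | Longest run | Splits the target data into sub-blocks and measures the gap between the longest run of 1s in each sub-block and that of random data. The value is the p_value of the Longest Run Of Ones test in the NIST test suite. | float |
| ECF-7 | Fourier transform | Examines the peak heights of the discrete Fourier transform of the target data to measure how far its periodicity deviates from that of random data. The value is the p_value of the Discrete Fourier Transform (Spectral) Test in the NIST test suite. | float |
| ECF-8 | Non-overlapping matching | Measures how often fixed strings occur in the target data; once a match is found, the search skips the matched data and resumes. The value is the p_value of the Non-overlapping Template Matching Test in the NIST test suite. | float |
| ECF-9 | Serial | Measures the gap between the frequency of fixed strings of different lengths in the target data and in random data. The value is p_value1 of the Serial Test in the NIST test suite. | float |
| ECF-10 | Serial | Measures the gap between the frequency of fixed strings of different lengths in the target data and in random data. The value is p_value2 of the Serial Test in the NIST test suite. | float |
| ECF-11 | Cumulative sums | Tests how far the cumulative sums over different lengths of the target data deviate from the expected value for a random sequence. After mapping (0,1) to (-1,1), the maximum excursion of the cumulative-sum random walk is computed in the forward direction. The value is p_value1 of the Cumulative Sums Test in the NIST test suite. | float |
| ECF-12 | Cumulative sums | Tests how far the cumulative sums over different lengths of the target data deviate from the expected value for a random sequence. After mapping (0,1) to (-1,1), the maximum excursion of the cumulative-sum random walk is computed in the forward direction. The value is p_value2 of the Cumulative Sums Test in the NIST test suite. | float |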
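
The first three features in Table 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: Eq. (1) and Eq. (2) are not reproduced on this page, so the code assumes the textbook Pearson chi-square statistic against a uniform byte distribution and the textbook Renyi entropy of order α. The default `alpha=2.0` and all function names are hypothetical. The monobit p-value follows the Frequency Test definition in NIST SP 800-22.

```python
import math
from collections import Counter


def chi_square_uniform(data: bytes) -> float:
    """Pearson chi-square statistic of the byte counts against a uniform
    distribution over the 256 byte values (assumed form of ECF-1)."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(256))


def renyi_entropy(data: bytes, alpha: float = 2.0) -> float:
    """Renyi entropy of order alpha, in bits, of the empirical byte
    distribution (assumed form of ECF-2; alpha=2.0 is an arbitrary choice)."""
    n = len(data)
    probs = [c / n for c in Counter(data).values()]
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 is the Shannon entropy.
        return -sum(p * math.log2(p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def monobit_p_value(data: bytes) -> float:
    """p-value of the NIST SP 800-22 Frequency (Monobit) Test (ECF-3)."""
    n_bits = 8 * len(data)
    ones = sum(bin(b).count("1") for b in data)
    s_n = 2 * ones - n_bits  # sum of the bits after mapping 0 -> -1, 1 -> +1
    s_obs = abs(s_n) / math.sqrt(n_bits)
    return math.erfc(s_obs / math.sqrt(2))


if __name__ == "__main__":
    segment = bytes(range(256)) * 4  # a perfectly uniform 1 kB toy segment
    features = [
        chi_square_uniform(segment),
        renyi_entropy(segment),
        monobit_p_value(segment),
    ]
    print(features)
```

Each function maps one data segment to one float, which matches the `float` type listed for every feature in Table 4. The remaining nine features would come from the corresponding NIST SP 800-22 tests.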

Fig.3

Test model composition

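Table 5

Parameters and performance of the detection models

| Model | Parameter | D_1kB | D_2kB | D_4kB | D_8kB | D_16kB | D_32kB | D_64kB | detect |
|---|---|---|---|---|---|---|---|---|---|
| Random forest | n_estimators | 400 | 200 | 200 | 200 | 200 | 200 | 200 | 305 |
| | max_depth | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| | min_samples_split | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 2 |
| | min_samples_leaf | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 1 |
| | Test accuracy/% | 71.42 | 77.9 | 86.58 | 92.02 | 97.07 | 98.92 | 99.51 | 73.9 |
| XGBoost | n_estimators | 600 | 600 | 600 | 500 | 400 | 400 | 400 | 400 |
| | max_depth | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| | min_child_weight | 400 | 400 | 400 | 400 | 400 | 400 | 200 | 3 |
| | Test accuracy/% | 71.46 | 78.06 | 86.67 | 92.06 | 97.04 | 98.76 | 99.45 | 73.23 |
| MLP | hidden_layer_sizes | 32,16 | 32,16 | 32,16 | 32,16 | 32,16 | 32,16 | 32,16 | 32,16 |
| | max_iter | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 |
| | Test accuracy/% | 71.43 | 77.86 | 86.58 | 91.92 | 97.01 | 98.7 | 99.31 | 72.75 |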

Fig.4

Comparison between the proposed method and other methods

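Table 6

Results of XGBoost on the D_1kB data

| Class | Precision | Recall | F1 score | Samples |
|---|---|---|---|---|
| Encrypted | 0.67240108 | 0.83703624 | 0.74574024 | 112 209 |
| Compressed | 0.78566991 | 0.59429387 | 0.67671169 | 112 791 |
| Macro average | 0.71535111 | 0.71535111 | 0.71535111 | 225 000 |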
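
The per-class figures in Table 6 can be cross-checked with a few lines of arithmetic. The sketch below recomputes the harmonic mean of precision and recall for each class and compares it with the reported value. It also computes the support-weighted recall, which equals overall accuracy and appears to be what the value 0.71535111 in the last row corresponds to. The function and variable names are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1, number of samples), copied from Table 6
TABLE_6 = {
    "encrypted": (0.67240108, 0.83703624, 0.74574024, 112209),
    "compressed": (0.78566991, 0.59429387, 0.67671169, 112791),
}


def weighted_recall(rows: dict) -> float:
    """Recall averaged over classes, weighted by the number of samples.
    This is the fraction of all samples classified correctly."""
    total = sum(n for _, _, _, n in rows.values())
    return sum(r * n for _, r, _, n in rows.values()) / total


if __name__ == "__main__":
    for name, (p, r, reported, _) in TABLE_6.items():
        print(f"{name}: computed F1 = {f1_score(p, r):.8f}, reported = {reported:.8f}")
    print(f"support-weighted recall = {weighted_recall(TABLE_6):.8f}")
```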

Fig.5

Generalization of models for different data with size 1 kB

Fig.6

Generalization of models for D_1kB

Fig.7

Generalization of models for D_64kB

Table 7

Comparison between different models

| | Random forest | XGBoost | MLP |
|---|---|---|---|
| Main model parameters | n_estimators 40; max_depth 12; min_samples_split 100; min_samples_leaf 50 | n_estimators 600; max_depth 12; min_child_weight 400; reg_alpha 0.15 | hidden_layer_sizes (32, 16); max_iter 500; activation logistic |
| Time required/s | 884.315 | 35.122 | 80.312 |
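
The parameter names in Table 7 coincide with constructor keywords of scikit-learn's `RandomForestClassifier` and `MLPClassifier` and of XGBoost's `XGBClassifier`. The page does not state which libraries were used, so the sketch below is an assumption about how the three models might be instantiated, with values copied from Table 7. Note that Table 5 lists `n_estimators` as 400 for the random forest on D_1kB, while Table 7 lists 40; the code keeps the Table 7 value. `build_models` is a hypothetical helper name.

```python
# Hyperparameters copied from Table 7. The keyword names are assumed to map
# onto scikit-learn and XGBoost constructors; the source does not confirm this.
RF_PARAMS = {
    "n_estimators": 40,  # Table 5 lists 400 for D_1kB
    "max_depth": 12,
    "min_samples_split": 100,
    "min_samples_leaf": 50,
}
XGB_PARAMS = {
    "n_estimators": 600,
    "max_depth": 12,
    "min_child_weight": 400,
    "reg_alpha": 0.15,
}
MLP_PARAMS = {
    "hidden_layer_sizes": (32, 16),
    "max_iter": 500,
    "activation": "logistic",
}


def build_models() -> dict:
    """Instantiate the three classifiers (requires scikit-learn and xgboost)."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from xgboost import XGBClassifier

    return {
        "random_forest": RandomForestClassifier(**RF_PARAMS),
        "xgboost": XGBClassifier(**XGB_PARAMS),
        "mlp": MLPClassifier(**MLP_PARAMS),
    }
```

Each model returned by `build_models()` would then be fitted on a matrix with one row per data segment and one column per ECF feature, with a binary encrypted/compressed label.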

Fig.8

Influence of training data size on detection precision
