Journal of Jilin University (Engineering and Technology Edition) ›› 2022, Vol. 52 ›› Issue (10): 2325-2332. doi: 10.13229/j.cnki.jdxbgxb20210231

• Transportation Engineering · Civil Engineering •

Algorithm for repairing abnormal toll data of expressway based on SSC and XGBoost

Li-li PEI, Zhao-yun SUN, Yu-xi HAN, Wei LI, Yuan-jiao HU

  1. School of Information Engineering, Chang'an University, Xi'an 710064, China
  • Received: 2021-03-15 Online: 2022-10-01 Published: 2022-11-11
  • Contact: Zhao-yun SUN E-mail: peilili@chd.edu.cn; zhaoyunsun@126.com
  • About the author: Li-li PEI (1995-), female, PhD candidate. Research interests: traffic big data analysis and machine learning. E-mail: peilili@chd.edu.cn
  • Funding:
    National Key R&D Program of China, "Comprehensive Transportation and Intelligent Transportation" special project (2018YFB1600202); Chang'an University Doctoral Candidate Innovation Ability Training Project (300203211241); Shaanxi Provincial Department of Science and Technology "Jiebang Guashuai" (open-competition) research project (2022JBGS3-08)

Abstract:

To address anomaly detection and repair in expressway toll data, this paper proposes an anomaly detection algorithm based on similarity coefficients and their sum (SSC, sum of similar coefficients), suited to joint detection across multi-dimensional data, and a repair method based on multidimensional data prediction with XGBoost (eXtreme Gradient Boosting). Both algorithms are applied to real toll data for anomaly detection and repair. The results show that the SSC-based detection algorithm takes the correlation between data dimensions into account and accurately detects anomalies in multi-dimensional data. Compared with the improved Lagrange interpolation algorithm, which handles only single-dimensional data, the XGBoost multivariate prediction algorithm raises R² from 0.9166 to 0.9856. The proposed algorithms are effective and accurate, and can provide high-quality data support for data analysis by expressway management departments.

Key words: traffic information engineering, anomaly detection and repair, sum of similar coefficients, XGBoost, K-means

CLC number:

  • U495

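Table 1 Selected feature factors of the raw toll data

Feature name | Meaning | Unit | Sample data
ID | Record serial number | / | 50968476
InTime | Entry station time | / | 2016/10/26 09:46:50
OutTime | Exit station time | / | 6:28
InStation Name | Entry station name | / | 15
OutStation Name | Exit station name | / | 16
InLoad | Total vehicle weight at entry (entry load) | 100 kg | 40
OutLoad | Total vehicle weight at exit (exit load) | 100 kg | 40
Credit | Amount charged | | 555.75
Last Balance | Balance after the charge | | 2345.5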

Table 2 Statistical analysis of the toll data parameters

Statistic | OutLoad/100 kg | OutStation Name | InLoad/100 kg | InStation Name
Count | 866.00 | 879.00 | 869.00 | 879.00
Mean | 39.4942 | 4.7610 | 39.33 | 4.54
Standard deviation | 21.5662 | 3.9037 | 21.63 | 3.73
Minimum | 10.0000 | 1.0000 | 1.00 | 0.00
25% | 25.0000 | 2.0000 | 25.00 | 2.000000
50% | 35.0000 | 3.0000 | 35.00 | 3.000000
75% | 46.0000 | 7.0000 | 46.00 | 6.000000
Maximum | 1000.0000 | 16.0000 | 83.00 | 20.000000
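The statistics in Table 2 already hint at data problems: the counts differ between columns (866, 869, 879), which suggests missing readings, and the OutLoad maximum of 1000 lies far above its 75% quantile of 46 while the InLoad maximum is only 83. A minimal sketch of how such a per-column summary could be computed is given below. The `describe` helper, the use of the sample standard deviation, the inclusive quantile method, and the input values are all assumptions made for illustration, not the authors' procedure or data.

```python
import statistics


def describe(values):
    """Summarise one numeric column, skipping missing entries (None)."""
    present = [v for v in values if v is not None]
    # Inclusive quartiles interpolate linearly between observed values.
    q1, q2, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),  # sample standard deviation (assumed)
        "min": min(present),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(present),
    }


# Made-up OutLoad readings in units of 100 kg; None marks a missing value.
out_load = [40, 25, None, 35, 46, 1000]
summary = describe(out_load)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A maximum that dwarfs the upper quartile, as in this toy column, is the kind of signal that motivates a dedicated anomaly detection step.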

Fig. 1 Flowchart of the anomaly detection algorithm based on Euclidean distance
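The flowchart itself is not reproduced on this page, so the following is only a rough sketch of what a Euclidean-distance check can look like: compute a cluster centre, measure each record's distance to it, and flag records that lie unusually far away. The single-cluster simplification, the threshold rule (`k` times the median distance), the function name, and the sample records are all assumptions, not the procedure in the figure.

```python
import math
import statistics


def euclidean_outliers(points, k=3.0):
    """Return indices of points farther than k * median distance from the centroid."""
    dims = len(points[0])
    # Centroid of all points (a single cluster is assumed for simplicity).
    centre = [statistics.mean([p[d] for p in points]) for d in range(dims)]
    distances = [math.dist(p, centre) for p in points]
    threshold = k * statistics.median(distances)  # hypothetical cut-off rule
    return [i for i, d in enumerate(distances) if d > threshold]


# Made-up (InLoad, OutLoad) pairs in units of 100 kg; the last one is implausible.
records = [(40, 40), (25, 25), (35, 36), (46, 45), (40, 1000)]
flagged = euclidean_outliers(records)
print(flagged)
```

Note that the extreme record also drags the centroid toward itself, which is one reason a purely distance-based check can be fragile on multi-dimensional data.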

Fig. 2 Flowchart of the anomaly detection algorithm based on the sum of similar coefficients
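The idea behind a sum-of-similar-coefficients score can be sketched as follows: compute a similarity coefficient between every pair of records, sum each record's coefficients against all others, and treat the records with the lowest sums as suspects. This page does not state which similarity coefficient is used, so the sketch assumes cosine similarity; the function names and sample records are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two records (assumed similarity coefficient)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def ssc_scores(records):
    """Sum each record's similarity to every other record."""
    return [
        sum(cosine_similarity(r, other) for j, other in enumerate(records) if j != i)
        for i, r in enumerate(records)
    ]


# Made-up (InLoad, OutLoad) pairs in units of 100 kg; the last one breaks the pattern.
records = [(40, 40), (25, 25), (35, 36), (46, 45), (40, 1000)]
scores = ssc_scores(records)
suspect = min(range(len(records)), key=scores.__getitem__)
print(suspect, [round(s, 3) for s in scores])
```

Because the score compares the relationship between dimensions rather than each dimension separately, a record whose entry and exit loads disagree stands out even when each value alone is not extreme.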

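Table 3 XGBoost model parameters

Model parameter | Explanation
n_estimators | Optimal number of iterations
max_depth | Maximum tree depth
min_child_weight | Minimum sum of sample weights in a leaf node
subsample | Random subsampling ratio of training samples
colsample_bytree | Proportion of columns randomly sampled for each tree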
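As a rough illustration, the five parameters in Table 3 map onto the scikit-learn-style interface of the `xgboost` package, where they are spelled `n_estimators`, `max_depth`, `min_child_weight`, `subsample`, and `colsample_bytree`. The values below are placeholders, not the tuned settings of this study, and `build_regressor` is a hypothetical helper.

```python
# Placeholder values for illustration only; the tuned values are not given here.
xgb_params = {
    "n_estimators": 100,      # number of boosting iterations
    "max_depth": 6,           # maximum depth of each tree
    "min_child_weight": 1,    # minimum sum of sample weights in a leaf node
    "subsample": 1.0,         # fraction of rows sampled for each tree
    "colsample_bytree": 1.0,  # fraction of columns sampled for each tree
}


def build_regressor(params):
    """Create an XGBoost regressor; requires the third-party xgboost package."""
    # Imported lazily so the parameter dict can be inspected without xgboost installed.
    from xgboost import XGBRegressor

    return XGBRegressor(**params)
```

In a repair setting, such a regressor would typically be fitted on records judged normal and then used to predict replacement values for the flagged ones, with the remaining fields as input features.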

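Fig. 3 Clustering results of the raw data

Fig. 4 Results of data detection and cleaning

Fig. 5 Comparison of expressway toll anomaly repair results

Fig. 6 Comparison of high-precision expressway toll anomaly repair results

Fig. 7 Comparison of evaluation metrics for expressway toll data anomaly repair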
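The abstract reports the repair quality as R² (0.9166 for the improved Lagrange interpolation and 0.9856 for XGBoost). For readers unfamiliar with the metric, a minimal sketch of the coefficient of determination is shown below, using the standard definition R² = 1 - SS_res / SS_tot. The function name and the numbers are made up for illustration and are not the study's data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up true values and repaired values (units of 100 kg).
y_true = [40, 25, 35, 46]
y_repaired = [41, 24, 36, 45]
r2 = r2_score(y_true, y_repaired)
print(round(r2, 4))
```

A value closer to 1 means the repaired values track the true values more closely; a repair that always returned the mean of the true values would score 0.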
