Missing Data Filling Method in Time Series Based on Generalized Center Clustering

Journal of Jilin University (Science Edition) ›› 2025, Vol. 63 ›› Issue (4): 1137-1142.

YU Yanpeng, HUI Xianghui

College of Information and Management Science (College of Software), Henan Agricultural University, Zhengzhou 450046, China

Received: 2024-06-04 Online: 2025-07-26 Published: 2025-07-26
Corresponding author: YU Yanpeng E-mail: yuyanpeng19870@163.com

Abstract

Abstract: Filling missing values in time series usually relies on predictions from existing data, and the complexity and uncertainty of time series often introduce errors into those predictions. To ensure the effectiveness of data filling, we propose a time series missing data filling method based on generalized center clustering. First, we calculate the distances between objects and classes and between classes, quantifying the relative positions of data points and cluster centers to obtain the spatial relationships among the data. Second, we use the information bottleneck algorithm to cluster the generalized centers in this space, assigning time series data sets that contain missing data to the same class. Finally, we calculate the cluster radius, divide the outlier data produced by generalized center clustering into usable and weakly usable randomly damaged data, set a fluctuation threshold, and compare the randomly damaged data that fall within the threshold, as strings, with the unified attribute values of the cluster to fill the missing data in the time series. The experimental results show that the method achieves high normalized mutual information and hit rate in the clustering process and keeps the data replenishment rate above 80% when filling missing data, indicating that it can effectively improve the integrity of time series data.

Key words: generalized center clustering, time series, missing data filling, information bottleneck, randomly damaged data, replenishment rate
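
The final step of the abstract (cluster radius, fluctuation threshold, filling) can be illustrated with a minimal sketch. It is not the authors' implementation: the cluster labels are assumed to come from an upstream clustering step (the information bottleneck clustering is not reproduced here), the distance is assumed to be a root-mean-square difference over observed positions, the fluctuation threshold is assumed to be a multiplier on the cluster radius, and the string comparison against unified attribute values is replaced by copying the cluster center's values. The function names `cluster_centers_and_radii` and `fill_missing` and the toy data are hypothetical.

```python
import numpy as np

def cluster_centers_and_radii(data, labels):
    """Mean series per cluster and the largest RMS distance of a member to it."""
    ks = np.unique(labels)
    centers = np.stack([data[labels == k].mean(axis=0) for k in ks])
    radii = np.array([
        np.sqrt(((data[labels == k] - centers[j]) ** 2).mean(axis=1)).max()
        for j, k in enumerate(ks)
    ])
    return centers, radii

def fill_missing(series, centers, radii, threshold=1.0):
    """Fill NaNs from the nearest center when the series lies within
    threshold * radius of that center on its observed positions (assumed rule)."""
    filled = series.copy()
    for i, row in enumerate(series):
        observed = ~np.isnan(row)
        if observed.all() or not observed.any():
            continue  # nothing to fill, or nothing to compare against
        diffs = centers[:, observed] - row[observed]
        dists = np.sqrt((diffs ** 2).mean(axis=1))
        k = int(np.argmin(dists))
        if dists[k] <= threshold * radii[k]:
            filled[i, ~observed] = centers[k, ~observed]
    return filled

# Toy data: four complete series with labels assumed to be given upstream.
complete = np.array([[1.0, 1.0, 1.0], [1.2, 1.0, 0.8],
                     [5.0, 5.0, 5.0], [5.2, 4.8, 5.0]])
labels = np.array([0, 0, 1, 1])
incomplete = np.array([[1.1, np.nan, 0.9],
                       [5.0, 5.0, np.nan],
                       [3.0, np.nan, 3.0]])  # far from both clusters

centers, radii = cluster_centers_and_radii(complete, labels)
filled = fill_missing(incomplete, centers, radii, threshold=1.5)

# Replenishment rate: share of originally missing entries that were filled.
missing_before = np.isnan(incomplete).sum()
rate = 1.0 - np.isnan(filled).sum() / missing_before
```

In this toy run the third series lies outside the threshold of both clusters and is left unfilled, which is why the replenishment rate stays below 100%; a larger threshold fills more entries at the cost of borrowing values from less similar clusters.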

CLC number: TP391

YU Yanpeng, HUI Xianghui. Missing Data Filling Method in Time Series Based on Generalized Center Clustering[J]. Journal of Jilin University (Science Edition), 2025, 63(4): 1137-1142.
