Journal of Jilin University (Science Edition), 2025, Vol. 63, Issue 4: 1137-1142.

Missing Data Filling Method in Time Series Based on Generalized Center Clustering

YU Yanpeng, HUI Xianghui

  1. College of Information and Management Science (College of Software), Henan Agricultural University, Zhengzhou 450046, China
  • Received: 2024-06-04 Online: 2025-07-26 Published: 2025-07-26
  • Corresponding author: YU Yanpeng E-mail: yuyanpeng19870@163.com

Abstract: Missing values in time series are usually filled by prediction from the existing data, but the complexity and uncertainty of time series often introduce errors into those predictions. To ensure the quality of data filling, we propose a missing data filling method for time series based on generalized center clustering. Firstly, we calculate the distances between objects and classes, and between classes, to quantify the relative positions of data points and cluster centers and obtain the spatial relationships among the data. Secondly, we use the information bottleneck algorithm to cluster the generalized centers in this space, dividing time series datasets that contain missing data into the same class. Finally, we calculate the cluster radius, divide the outliers produced by generalized center clustering into usable and weakly usable randomly damaged data, set a fluctuation threshold, and perform a string comparison between the randomly damaged data that fall within the fluctuation threshold and the unified attribute values in the cluster, thereby filling the missing data in the time series. The experimental results show that the method achieves high normalized mutual information and hit rate during clustering, and keeps the data replenishment rate above 80% when filling missing data, indicating that it can effectively improve the completeness of time series data.

Key words: generalized center clustering, time series, missing data filling, information bottleneck, randomly damaged data, replenishment rate

CLC number:

  • TP391
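
The abstract describes a pipeline of distance computation, clustering, cluster radius, and a fluctuation threshold that gates which damaged records are filled. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes cluster centers are already given (it does not perform information bottleneck clustering), replaces the string comparison step with a simple copy of the center value, and defines the radius as the mean member-to-center distance. The names `masked_distance`, `fill_from_centers`, and the `fluctuation` parameter are hypothetical.

```python
import math


def masked_distance(series, center):
    """Euclidean distance over observed (non-None) positions,
    rescaled as if every position had been observed."""
    pairs = [(x, c) for x, c in zip(series, center) if x is not None]
    if not pairs:
        return math.inf
    squared = sum((x - c) ** 2 for x, c in pairs)
    return math.sqrt(squared * len(series) / len(pairs))


def fill_from_centers(data, centers, fluctuation=0.5):
    """Assign each series to its nearest center, then fill missing values
    only for series whose distance is within radius * (1 + fluctuation).
    Returns the filled data and the fraction of missing values filled."""
    # Step 1: nearest-center assignment (assumed stand-in for the clustering step).
    assigned = []
    for series in data:
        dists = [masked_distance(series, c) for c in centers]
        k = min(range(len(centers)), key=dists.__getitem__)
        assigned.append((k, dists[k]))

    # Step 2: cluster radius, assumed here to be the mean member distance.
    radius = []
    for k in range(len(centers)):
        member_dists = [d for kk, d in assigned if kk == k]
        radius.append(sum(member_dists) / len(member_dists) if member_dists else 0.0)

    # Step 3: fill missing values for series inside the fluctuation threshold.
    filled, n_missing, n_filled = [], 0, 0
    for series, (k, dist) in zip(data, assigned):
        within = dist <= radius[k] * (1 + fluctuation)
        row = []
        for x, c in zip(series, centers[k]):
            if x is None:
                n_missing += 1
                if within:
                    n_filled += 1
                    row.append(c)  # copy the center value (simplification)
                else:
                    row.append(None)  # too far from the cluster: leave unfilled
            else:
                row.append(x)
        filled.append(row)

    rate = n_filled / n_missing if n_missing else 1.0
    return filled, rate


# Toy data: None marks a missing observation.
centers = [[1.0, 1.0, 1.0], [10.0, 10.0, 10.0]]
data = [
    [1.0, None, 1.0],
    [1.0, 1.0, 1.0],
    [10.0, 10.0, None],
    [10.0, 10.0, 10.0],
    [4.0, None, 4.0],  # far from both centers
]
filled, rate = fill_from_centers(data, centers)
```

In this toy run the last series lies outside the threshold of its nearest cluster, so its gap is left unfilled; the returned `rate` plays the role of a replenishment rate under these simplified assumptions.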