Journal of Jilin University (Engineering and Technology Edition), 2023, Vol. 53, Issue (7): 2109-2114. doi: 10.13229/j.cnki.jdxbgxb.20220348

• Computer Science and Technology •

基于时间序列模型的文本数据压缩存储算法

翁渊瀚1,2,李南1

  1. 南京航空航天大学 经济与管理学院,南京 211000
    2.南京工业大学 理学院,南京 211800
  • 收稿日期:2022-03-31 出版日期:2023-07-01 发布日期:2023-07-20
  • 通讯作者: 李南 E-mail:wengyuanhan2022@163.com;linan20220202@163.com
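  • About the author: WENG Yuan-han (1983-), male, associate research fellow, doctoral candidate. Research interests: knowledge management and technological innovation. E-mail: wengyuanhan2022@163.com
  • Funding:
    National Natural Science Foundation of China (71473119)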

Text data compression and storage algorithm based on time series model

Yuan-han WENG1,2, Nan LI1

  1. College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211000, China
  2. School of Science, Nanjing Technology University, Nanjing 211800, China
  • Received: 2022-03-31 Online: 2023-07-01 Published: 2023-07-20
  • Corresponding author: Nan LI. E-mail: wengyuanhan2022@163.com; linan20220202@163.com

摘要:

为了降低文本数据历史数据量,提升文本数据压缩存储效率,提出一种基于时间序列模型的文本数据压缩存储算法。采用小波阈值去噪方法估计并消除文本数据的误差和噪声;从文本数据特征角度,通过细节描述特征,设定特征类型之间的组合和继承关系,组建时间序列模型。将经过预处理的文本数据采用时间序列模型转换为结构近似的二进制编码字节,通过异或操作对结果中的冗余部分进行压缩处理,同时将压缩的数据存储到对应的数据库中,最终完成文本数据压缩存储。仿真实验结果表明,本文算法可以有效提升压缩性能,获取更优的文本数据压缩存储结果。

关键词: 时间序列模型, 文本数据, 压缩存储算法, 小波阈值去噪方法, 非线性函数, 预处理

Abstract:

To reduce the volume of historical text data and improve the efficiency of text data compression and storage, a text data compression and storage algorithm based on a time series model is proposed. A wavelet threshold denoising method is used to estimate and eliminate errors and noise in the text data. From the perspective of text data features, the features are described in detail, and the combination and inheritance relationships between feature types are defined to build a time series model. The time series model converts the preprocessed text data into binary-coded bytes with a similar structure; an XOR operation then compresses the redundant part of the result, and the compressed data are stored in the corresponding database, which completes the compression and storage of the text data. Simulation results show that the proposed algorithm effectively improves compression performance and achieves better text data compression and storage results.

Key words: time series model, text data, compression and storage algorithm, wavelet threshold denoising method, nonlinear function, preprocessing

CLC number:

  • TP393

Fig. 1

Flow chart of wavelet threshold denoising
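
The abstract names wavelet threshold denoising as the preprocessing step but does not give the wavelet, the threshold rule, or the decomposition depth. The following is only a minimal sketch under stated assumptions: a one-level Haar transform, soft thresholding, and a caller-supplied threshold. None of these choices is confirmed by the source, and all function names are hypothetical.

```python
import math

# Assumption: one-level Haar wavelet; the paper does not state which wavelet it uses.
_S = 1 / math.sqrt(2)


def haar_forward(signal):
    """Split an even-length sequence into approximation and detail coefficients."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the sequence from approximation and detail coefficients."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


def soft_threshold(coeffs, t):
    """Shrink every coefficient toward zero by t; magnitudes below t become zero."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


def denoise(signal, threshold):
    """Threshold only the detail coefficients, then reconstruct the signal."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, threshold))
```

Small differences between neighbouring samples fall below the threshold and are smoothed away, while large jumps survive with their size reduced by the threshold. Hard thresholding, or a multi-level decomposition, would be equally plausible readings of the abstract.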

Fig. 2

Text data compression flow based on the time series model
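
The abstract states that structurally similar binary-coded bytes are compressed with an XOR operation, without specifying the encoding. As an illustration only, the sketch below assumes fixed-length byte records, XORs each record with its predecessor so that unchanged bytes become zero, and then run-length encodes the zero runs. The run-length step and all function names are assumptions, not the authors' published scheme.

```python
def xor_delta(records):
    """XOR each fixed-length record with the previous one; the first is kept as-is."""
    if not records:
        return []
    prev = bytes(len(records[0]))  # all-zero reference for the first record
    out = []
    for rec in records:
        if len(rec) != len(prev):
            raise ValueError("all records must have the same length")
        out.append(bytes(a ^ b for a, b in zip(rec, prev)))
        prev = rec
    return out


def xor_undelta(deltas):
    """Invert xor_delta by XOR-ing each delta with the previously restored record."""
    if not deltas:
        return []
    prev = bytes(len(deltas[0]))
    out = []
    for d in deltas:
        rec = bytes(a ^ b for a, b in zip(d, prev))
        out.append(rec)
        prev = rec
    return out


def rle_zeros(data):
    """Encode each run of zero bytes as (0x00, run length <= 255); copy other bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0:
            run = 1
            while i + run < len(data) and data[i + run] == 0 and run < 255:
                run += 1
            out += bytes([0, run])
            i += run
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def unrle_zeros(data):
    """Invert rle_zeros."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0:
            out += bytes(data[i + 1])  # bytes(n) yields n zero bytes
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

With three 18-byte records such as `b"2023-07-01 temp=21"`, `b"2023-07-01 temp=22"`, `b"2023-07-01 temp=22"`, the second delta shrinks to 3 bytes and the third to 2 bytes, because only the differing byte carries information. The gain depends entirely on how similar consecutive records are.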

Fig. 3

Comparison of text data compression performance test results for different algorithms

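Table 1

Comparison of average network energy consumption test results for different algorithms

Running time/min    Average network energy consumption/J
                    Proposed method    Method of Ref. [3]    Method of Ref. [4]
60                  0.25               0.28                  0.32
120                 0.32               0.36                  0.40
180                 0.36               0.41                  0.45
240                 0.40               0.46                  0.50
300                 0.44               0.49                  0.54
360                 0.47               0.53                  0.57
420                 0.53               0.58                  0.62
480                 0.58               0.63                  0.68
540                 0.62               0.68                  0.75
600                 0.67               0.72                  0.79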

Fig. 4

Comparison of average network delay test results for different algorithms

1 陆立华, 杜承烈.复杂式网络用户隐私数据多层分类存储仿真[J].计算机仿真, 2020, 37(3): 405-408, 439.
Lu Li-hua, Du Cheng-lie. Multi-layer classification storage simulation of complex network user privacy data[J]. Computer Simulation, 2020, 37(3): 405-408, 439.
2 徐嘉懿, 邓雪原.面向运维阶段的多源异构BIM数据存储方法研究[J]. 建筑技术, 2020, 51(5): 529-533.
Xu Jia-yi, Deng Xue-yuan. Research on multi-source heterogeneous BIM data storage method for operation and maintenance stage[J]. Architecture Technology, 2020, 51(5): 529-533.
3 王鹤, 李石强, 于华楠, 等.基于分布式压缩感知和边缘计算的配电网电能质量数据压缩存储方法[J]. 电工技术学报, 2020, 35(21): 4553-4564.
Wang He, Li Shi-qiang, Yu Hua-nan, et al. Compression acquisition method for power quality data of distribution network based on distributed compressed sensing and edge computing[J]. Transactions of China Electrotechnical Society, 2020, 35(21): 4553-4564.
4 徐敬华, 高铭宇, 苟华伟, 等.基于非规则分块压缩的3D打印稀疏矩阵存储与重构方法[J]. 计算机学报, 2020, 43(11): 2203-2215.
Xu Jing-hua, Gao Ming-yu, Gou Hua-wei, et al. Storage and reconstruction method of sparse matrix for 3D printing based on irregular block compression [J]. Chinese Journal of Computers, 2020, 43(11): 2203-2215.
5 屈松林, 刘林.基于波形字典的铁路空口监测数据压缩算法[J]. 计算机应用研究, 2020, 37(增刊2): 266-269, 244.
Qu Song-lin, Liu Lin. Data compression algorithm for railway air interface monitoring based on waveform dictionary [J]. Computer Application Research, 2020, 37(Sup.2): 266-269, 244.
6 李建鑫, 陈鸿, 王晋祺.基于机器视觉轮廓提取的平滑处理算法[J]. 电子技术应用, 2021, 47(4): 116-120, 131.
Li Jian-xin, Chen Hong, Wang Jin-qi. A smoothing algorithm based on machine vision contour extraction [J]. Application of Electronic Technique, 2021, 47(4): 116-120, 131.
7 陈晨, 陶建锋, 郑桂妹.基于MIMO雷达的极化平滑降维酉ESPRIT算法[J]. 信号处理, 2021, 37(4): 616-623.
Chen Chen, Tao Jian-feng, Zheng Gui-mei. Unitary ESPRIT algorithm of polarization smoothing dimension reduction based on MIMO radar[J]. Journal of Signal Processing, 2021, 37(4): 616-623.
8 李志军, 张鸿鹏, 王亚楠.排列熵——CEEMD分解下的新型小波阈值去噪谐波检测方法[J].电机与控制学报, 2020, 24(12): 120-129.
Li Zhi-jun, Zhang Hong-peng, Wang Ya-nan, et al. Wavelet threshold denoising harmonic detection method based on permutation entropy——CEEMD decomposition[J]. Electric Machines and Control, 2020, 24(12): 120-129.
9 宿常鹏, 王雪梅, 许哲, 等.基于新阈值函数的小波阈值去噪方法研究[J].战术导弹技术, 2020(3): 66-72.
Su Chang-peng, Wang Xue-mei, Xu Zhe, et al. Research on wavelet threshold de-noising method based on new threshold function[J]. Tactical Missile Technology, 2020(3): 66-72.
10 康明, 韩森坪, 杨洪杰, 等.基于天然气组分红外光谱图的数据预处理方法研究[J]. 红外技术, 2021, 43(8): 804-808.
Kang Ming, Han Sen-ping, Yang Hong-jie, et al. Data preprocessing method for infrared spectra analysis of natural gas components[J]. Infrared Technology, 2021, 43(8): 804-808.
11 张海涛, 汤儒峰, 李祝莲, 等. 基于阵列探测技术的激光测距数据预处理方法[J]. 红外与激光工程, 2020, 49(8): 89-98.
Zhang Hai-tao, Tang Ru-feng, Li Zhu-lian, et al. Preprocessing method of laser ranging data based on array detection technology[J]. Infrared and Laser Engineering, 2020, 49(8): 89-98.
12 吴翌琳, 南金伶.互联网企业广告收入预测研究——基于低频数据的神经网络和时间序列组合模型[J]. 统计研究, 2020, 37(5): 94-103.
Wu Yi-lin, Nan Jin-ling. Forecasting of advertisement income of internet companies——based on neural network and time series model for low-frequency data[J]. Statistical Research, 2020, 37(5): 94-103.
13 吴晓峰, 林晓言, 靳雅楠. 基于时间序列模型遴选的集成组合预测模型[J]. 统计与决策, 2021, 37(9): 5-8.
Wu Xiao-feng, Lin Xiao-yan, Jin Ya-nan. Integrated combination prediction model based on time series model selection[J]. Statistics and Decision, 2021, 37(9): 5-8.
14 李绕波, 袁希平, 甘淑, 等. 基于特征点和关键点提取的点云数据压缩方法[J]. 激光与红外, 2021, 51(9): 1129-1136.
Li Rao-bo, Yuan Xi-ping, Gan Shu, et al. Point cloud data compression method based on feature point and key point extraction[J]. Laser & Infrared, 2021, 51(9): 1129-1136.
15 赵东保, 孟俊贞, 刘文玉.群组相似轨迹的特征点映射数据压缩方法[J]. 测绘科学,2020, 45(3): 143-149.
Zhao Dong-bao, Meng Jun-zhen, Liu Wen-yu. Feature points mapping data compression method for multiple similar trajectories[J]. Science of Surveying and Mapping, 2020, 45(3): 143-149.