吉林大学学报(工学版) ›› 2022, Vol. 52 ›› Issue (4): 874-884.doi: 10.13229/j.cnki.jdxbgxb20200877

• 计算机科学与技术 • 上一篇    

Rhombus sketch:自适应和准确的流数据sketch

魏晓辉1,2(),苗艳微1,王兴旺1,2()   

  1. 1.吉林大学 计算机科学与技术学院,长春 130012
    2.吉林大学 符号计算与知识工程教育部重点实验室,长春 130012
  • 收稿日期:2020-11-18 出版日期:2022-04-01 发布日期:2022-04-20
  • 通讯作者: 王兴旺 E-mail:weixh@jlu.edu.cn;xww@jlu.edu.cn
  • 作者简介:魏晓辉(1972-),男,教授,博士生导师. 研究方向:分布式计算,集群计算和网络安全. E-mail:weixh@jlu.edu.cn
  • 基金资助:
    国家重点研发计划项目(2016YFB0201503);国家自然科学基金项目(61772228);吉林省自然科学基金自由探索项目(2020122208JC);吉林省教育厅科研项目(JJKH20211105KJ)

Rhombus sketch: adaptive and more accurate sketch for streaming data

Xiao-hui WEI1,2(),Yan-wei MIAO1,Xing-wang WANG1,2()   

  1. 1.College of Computer Science and Technology,Jilin University,Changchun 130012,China
    2.Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education,Jilin University,Changchun 130012,China
  • Received:2020-11-18 Online:2022-04-01 Published:2022-04-20
  • Contact: Xing-wang WANG E-mail:weixh@jlu.edu.cn;xww@jlu.edu.cn

摘要:

记录数据出现的次数(频数估计)是流数据管理的一类重要任务。在该类任务的解决方案中,sketch是内存使用效率较高的数据结构之一。目前的多层sketch方案,通常使用相同长度的计数器记录数据能够到达的最高层级。然而,数据频数的高度偏斜会导致部分计数器的高位内存被浪费。另一方面,由于数据分布是不断变化的,静态的sketch数据结构会造成内存分配不合理的问题。为了解决上述问题,本文提出一种弹性层级结构Rhombus sketch,能够按需将计数器分配给数据,并通过估计数据的分布动态调整内存分配。实验结果表明,相比于其他典型sketch方法,特别是在内存紧张的情况下,Rhombus sketch的准确性有明显的提高。

关键词: 计算机应用, 数据流, 近似算法, 弹性, sketch

Abstract:

Estimating the frequency of each data (frequency estimation) is a critical task in stream data management. Sketch is one of the most memory-efficient structure for solving this kind of task. Existing multi-layer sketches often use counters of the same size to record the highest layer that data could be reached. However, many counters represent very small values, which is a waste of memory. On the other hand, multi-layer static sketches cause the problem of unreasonable memory allocation due to the change of data distribution. To address the issues, this paper proposes an elastic hierarchical structure, named Rhombus sketch. Rhombus sketch can assign an appropriate number of counters to each data on demand, and adjust memory allocation by estimating the data distribution. Experimental results show that, compared with other typical sketches, the accuracy of Rhombus sketch is greatly improved, especially for tight memory.

Key words: computer application, data streams, approximate algorithms, elasticity, sketch

中图分类号: 

  • TP399

表1

Rhombus sketch算法变量及含义"

变量名变量代表含义

Mi

Ki

n

m

flag

fi(e)

LF

CF

δ

u

d

rand

计数结构每层初始分配的计数器个数

记录结构每层计数器的个数

R1记录数据个数

L2的数据数量

内存调整标志

不同函数下元素e的标识

溢出标志

标识数据是否出现在内存调整后

计数部分计数器的位数

调整内存判定标准

hash函数个数

0~1之间随机数

图1

Rhombus sketch结构图"

图2

事务数据集ARE的比较"

图3

CAIDA数据集ARE的比较"

图4

事务数据集AAE的比较"

图5

CAIDA数据集AAE的比较"

图6

不同内存ARE的比较"

图7

不同内存AAE的比较"

图8

不同偏度数据集评价标准性能比的比较"

图9

不同偏度数据集插入吞吐量的比较"

图10

不同偏度数据集查询吞吐量的比较"

图11

不同内存准确率的比较"

图12

实时频数分布(50 kB)"

图13

实时频数分布(200 kB)"

1 Rehman M H, Liew C S, Abbas A, et al. Big data reduction methods: a survey[J]. Data Science and Engineering, 2016, 1: 265-284.
2 Li T, Chen S G, Ling Y B. Fast and compact per-flow traffic measurement through randomized counter sharing[C]∥Proceedings IEEE INFOCOM, Shanghai, China, 2012: 1622-1634.
3 Bertino Elisa. Introduction to data security and privacy[J]. Data Science and Engineering, 2016, 1: 125-126.
4 Mirylenka K, Cormode G, Palpanas T, et al. Conditional heavy hitters: detecting interesting correlations in data streams[J]. The VLDB Journal, 2015, 24: 395-414.
5 Roy P, Khan A, Alonso G. Augmented sketch: faster and more accurate stream processing[C]∥ Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA, 2016: 1449-1463.
6 Pandey P, Bender M A, Johnson R, et al. A general-purpose counting filter: making every bit count[C]∥Proceedings of the ACM International Conference on Management of Data, Chicago, USA: 2017: 775-787.
7 Chen J C, Zhang Q. Bias-aware sketches[J]. Proceedings of the VLDB Endowment, 2017, 10(9): 961-972.
8 Gong J Z, Tian D Y, Yang T, et al. SSS: an accurate and fast algorithm for finding top-k hot items in data streams[C]∥IEEE International Conference on Big Data and Smart Computing (BigComp), Shanghai, China, 2018:106-113.
9 Qi J H, Li W J, Yang T, et al. Cuckoo counter: a novel framework for accurate per-flow frequency estimation in network measurement[C]∥ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), Cambridge, United Kingdom, 2019:1-7.
10 Zhou Y, Yang T, Jiang T, et al. Cold filter: a meta-framework for faster and more accurate stream processing[C]∥Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 2018: 741-756.
11 Yang T, Zhou Y, Jin H, et al. Pyramid sketch: a sketch framework for frequency estimation of data streams[J]. Proceedings of the VLDB Endowment, 2017, 10(11): 1442-1453.
12 Yang T, Gao S, Sun Z Y, et al. Diamond sketch: accurate per-flow measurement for big streaming data[J]. IEEE Transactions on Parallel and Distributed Systems, 2019, 30(12): 2650-2662.
13 Li Y L, Miao R, Kim C H, et al. FlowRadar: a better netflow for data centers [C]∥ USENIX Symposium on Networked Systems Design and Implementation, Santa Clara, CA, USA:2016: 311-324.
14 Cormode G, Muthukishnam S. An improved data stream summary: the count-min sketch and its applications[J]. Journal of Algorithms, 2005, 55(1): 58-75.
15 Estan C, Varghese G. New directions in traffic measurement and accounting[J]. ACM SIGCOMM Computer Communication Review, 2002, 32(4): 323-336.
16 Kumar A, Sung M H, Xu J, et al. Data streaming algorithms for efficient and accurate estimation of flow size distribution[C]∥Proceedings of the International Conference on Measurements and Modeling of Computer Systems, New York, NY, USA, 2004: 177-188.
17 Ying S S, Korn F, Saha B, et al. Treescope: finding structural anomalies in semi-structure data[J].Proceedings of the VLDB Endowment, 2015, 8(12): 1904-1907.
18 Regents of the University of California. The CAIDA anonymized internet traces 2018[DB/OL].[2020-10-22].
19 Real-life transactional dataset[DB/OL].[2020-10-22].
20 dataset Webdocs. [DB/OL].[2020-10-22].
[1] 毛琳,任凤至,杨大伟,张汝波. 双向特征金字塔全景分割网络[J]. 吉林大学学报(工学版), 2022, 52(3): 657-665.
[2] 王雪,李占山,吕颖达. 基于多尺度感知和语义适配的医学图像分割算法[J]. 吉林大学学报(工学版), 2022, 52(3): 640-647.
[3] 欧阳继红,郭泽琪,刘思光. 糖尿病视网膜病变分期双分支混合注意力决策网络[J]. 吉林大学学报(工学版), 2022, 52(3): 648-656.
[4] 王学智,李清亮,李文辉. 融合迁移学习的土壤湿度预测时空模型[J]. 吉林大学学报(工学版), 2022, 52(3): 675-683.
[5] 康苏明,张叶娥. 基于Hadoop的跨社交网络局部时序链路预测算法[J]. 吉林大学学报(工学版), 2022, 52(3): 626-632.
[6] 曲优,李文辉. 基于锚框变换的单阶段旋转目标检测方法[J]. 吉林大学学报(工学版), 2022, 52(1): 162-173.
[7] 赵宏伟,霍东升,王洁,李晓宁. 基于显著性检测的害虫图像分类[J]. 吉林大学学报(工学版), 2021, 51(6): 2174-2181.
[8] 刘洲洲,张倩昀,马新华,彭寒. 基于优化离散差分进化算法的压缩感知信号重构[J]. 吉林大学学报(工学版), 2021, 51(6): 2246-2252.
[9] 王生生,陈境宇,卢奕南. 基于联邦学习和区块链的新冠肺炎胸部CT图像分割[J]. 吉林大学学报(工学版), 2021, 51(6): 2164-2173.
[10] 刘福寿,魏琦,徐文婷,谭国金. 基于弹性波传播和谱单元法的桁架结构损伤检测[J]. 吉林大学学报(工学版), 2021, 51(6): 2087-2095.
[11] 孙东明,胡亮,邢永恒,王峰. 基于文本融合的物联网触发动作编程模式服务推荐方法[J]. 吉林大学学报(工学版), 2021, 51(6): 2182-2189.
[12] 林俊聪,雷钧,陈萌,郭诗辉,高星,廖明宏. 基于电影视觉特性的动态多目标实时相机规划[J]. 吉林大学学报(工学版), 2021, 51(6): 2154-2163.
[13] 任丽莉,王志军,闫冬梅. 结合黏菌觅食行为的改进多元宇宙算法[J]. 吉林大学学报(工学版), 2021, 51(6): 2190-2197.
[14] 姚引娣,贺军瑾,李杨莉,谢荡远,李英. 自构建改进型鲸鱼优化BP神经网络的ET0模拟计算[J]. 吉林大学学报(工学版), 2021, 51(5): 1798-1807.
[15] 赵宏伟,张子健,李蛟,张媛,胡黄水,臧雪柏. 基于查询树的双向分段防碰撞算法[J]. 吉林大学学报(工学版), 2021, 51(5): 1830-1837.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!