
Journal of Jilin University (Information Science Edition) ›› 2023, Vol. 41 ›› Issue (2): 367-373.


Duplicate Data Elimination of Network Single-Channel Based on Minimum Hash

WU Jianfei 1 , ZHOU Luming 1 , LIU Xiaoqiang 2

  1. Cancer Hospital Affiliated to Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430079, China
  2. College of Applied Engineering, Henan University of Science and Technology, Sanmenxia 472000, China
  • Received: 2022-04-24 Online: 2023-04-13 Published: 2023-04-17
  • Corresponding author: ZHOU Luming (born 1975), male, from Wuhan; assistant engineer at Tongji Medical College, Huazhong University of Science and Technology; works mainly on medical informatization and network management. (Tel) 86-18986027399 (E-mail) 1286768904@qq.com
  • About the author: WU Jianfei (born 1985), male, from Xiantao, Hubei; engineer at the Cancer Hospital Affiliated to Tongji Medical College, Huazhong University of Science and Technology; works mainly on medical informatization and network management. (Tel) 86-13971100924 (E-mail) wjf9631@126.com
  • Funding:
    Key Scientific Research Fund Project of the Henan Provincial Department of Education (22B413007)

Abstract: Eliminating duplicate data is an indispensable step in keeping a network running efficiently, but the process is easily disturbed by factors such as signal strength, network devices, and router performance. To address this, a duplicate data elimination algorithm for a single network channel based on minimum hash is proposed. First, the hash function of the hashing algorithm is used to cluster the single-channel network data. Next, a supervised discriminant projection algorithm reduces the dimensionality of the clustered data. Finally, algebraic signatures are used to pre-estimate the data so that the computational overhead between data items is minimized, and a minimum hash tree is constructed to generate check values; while the deduplication tags are updated, a double-layer elimination mechanism completely removes the duplicate data in the single channel. Experimental results show that the algorithm has a short execution time and low computation and storage overhead.

Key words: hash function, original cluster center, nearby local graph, constrained objective function, algebraic signature, hash tree, network channel

CLC number: 

  • TP391
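
For readers unfamiliar with minimum hashing, the general idea behind hash-based duplicate removal can be illustrated with a minimal sketch. This is not the authors' algorithm: the clustering, supervised discriminant projection, algebraic signature, and hash tree steps named in the abstract are not reproduced here. The sketch assumes the textbook MinHash technique over byte shingles, and all function names, the shingle length `k`, and the signature length `num_hashes` are hypothetical choices made only for illustration. It groups payloads by MinHash signature and then confirms exact duplicates with a full digest, loosely mirroring a two-layer check.

```python
import hashlib


def shingles(payload: bytes, k: int = 4) -> set[bytes]:
    """Return the set of k-byte substrings of payload (hypothetical helper)."""
    if len(payload) <= k:
        return {payload}
    return {payload[i:i + k] for i in range(len(payload) - k + 1)}


def minhash_signature(payload: bytes, num_hashes: int = 32) -> tuple[int, ...]:
    """Textbook MinHash: for each salted hash, keep the minimum over all shingles."""
    items = shingles(payload)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in items))
    return tuple(signature)


def estimated_jaccard(sig_a: tuple[int, ...], sig_b: tuple[int, ...]) -> float:
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(packets: list[bytes]) -> list[bytes]:
    """Keep the first occurrence of each payload, preserving arrival order.

    Layer 1 (assumption): group payloads by MinHash signature.
    Layer 2 (assumption): confirm an exact duplicate with a SHA-256 digest,
    so payloads that merely share a signature are not dropped.
    """
    buckets: dict[tuple[int, ...], set[bytes]] = {}
    kept = []
    for payload in packets:
        bucket = buckets.setdefault(minhash_signature(payload), set())
        digest = hashlib.sha256(payload).digest()
        if digest in bucket:
            continue  # exact duplicate already seen on this channel
        bucket.add(digest)
        kept.append(payload)
    return kept
```

Calling `deduplicate([b"hello world", b"hello world", b"hello there"])` keeps one copy of each distinct payload. The digest check in the second layer is a deliberate design choice in this sketch: a signature match alone only indicates high similarity, not identity.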