Journal of Jilin University (Information Science Edition) ›› 2025, Vol. 43 ›› Issue (4): 844-850.


基于GRNN算法的数字化信息资源过滤去重方法

张灵运   

  1. 陕西学前师范学院图书馆,西安710100
  • 收稿日期:2023-08-22 出版日期:2025-08-15 发布日期:2025-08-15
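  • About the author: ZHANG Lingyun (1991- ), female, native of Xi'an, librarian at Shaanxi Xueqian Normal University, mainly engaged in research on library and information data, (Tel) 86-13619266269 (E-mail) zhanhlingyun082@163.com.
  • Supported by:
    Xi'an Social Science Planning Fund Project, 2023 (23YZ69)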

Digital Information Resource Filtering and Deduplication Method Based on GRNN Algorithm

 ZHANG Lingyun   

  1. Library, Shaanxi Xueqian Normal University, Xi’an 710100, China
  • Received:2023-08-22 Online:2025-08-15 Published:2025-08-15

摘要: 由于资源过滤去重是保证数字化图书馆高效运行中不可缺少的环节,但其过程易受冗余数据、资源类型和客户群体差异等问题的干扰,为此,提出基于GRNN(General Regression Neural Network)算法的数字化信息资源过滤去重方法。首先采用GRNN算法检测数字化信息资源中的异常值, 并通过PSO-LSSVM(Particle Swarm Optimization-Least Squares Support Vector Machine)过滤异常值, 避免异常数据对去重过程产生干扰。然后采用局部敏感哈希算法将资源数据转换成二进制哈希码,通过检测哈希码之间的汉明距离相似度完成数字化信息资源的过滤去重。实验结果表明,该方法用时短,并且去重精度和去重率较高。

关键词: 广义回归神经网络, 异常数据的过滤, 二进制哈希码, 哈希函数, 汉明距离

Abstract: Resource filtering and deduplication are essential to the efficient operation of digital libraries, but the process is susceptible to interference from redundant data, differing resource types, and differences among user groups. To address this, a digital information resource filtering and deduplication method based on the GRNN (General Regression Neural Network) algorithm is proposed. First, the GRNN algorithm is used to detect outliers in digital information resources, and the outliers are filtered out with PSO-LSSVM (Particle Swarm Optimization-Least Squares Support Vector Machine) so that abnormal data do not interfere with deduplication. Then, a locality-sensitive hashing algorithm converts the resource data into binary hash codes, and filtering and deduplication are completed by measuring the Hamming distance similarity between hash codes. The experimental results show that the method runs quickly and achieves high deduplication precision and a high deduplication rate.
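The hashing stage described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each resource has already been turned into a numeric feature vector, uses sign random projection as one common locality-sensitive hashing scheme, and the names `lsh_codes`, `hamming`, `deduplicate`, `n_bits`, and `max_distance` are hypothetical choices made only for this example.

```python
import numpy as np


def lsh_codes(features, n_bits=32, seed=0):
    """Map each feature vector to a binary code, one bit per random hyperplane.

    Assumption: sign random projection stands in for the locality-sensitive
    hash; the paper's actual hash functions are not given in the abstract.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((features.shape[1], n_bits))
    # A bit is 1 when the vector lies on the non-negative side of a hyperplane.
    return (features @ planes >= 0).astype(np.uint8)


def hamming(a, b):
    """Number of bit positions in which two binary codes differ."""
    return int(np.count_nonzero(a != b))


def deduplicate(features, n_bits=32, max_distance=2, seed=0):
    """Return indices of resources kept after near-duplicate removal.

    A resource is dropped when its code lies within `max_distance` bits of
    the code of a resource that was already kept (hypothetical threshold).
    """
    codes = lsh_codes(features, n_bits, seed)
    kept = []
    for i, code in enumerate(codes):
        if all(hamming(code, codes[j]) > max_distance for j in kept):
            kept.append(i)
    return kept
```

Calling `deduplicate` on a feature matrix returns the row indices that survive. The pairwise comparison against kept items is quadratic in the worst case, which is acceptable for a sketch; a production system would typically bucket codes first.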

Key words: generalized regression neural network, filtering of abnormal data, binary hash code, hash function, Hamming distance

CLC number:

  • TP391