Journal of Jilin University (Engineering and Technology Edition), 2023, Vol. 53, Issue 4: 1181-1186. doi: 10.13229/j.cnki.jdxbgxb.20220087

• Computer Science and Technology •

Fast outlier mining algorithm in uncertain data set based on spectral clustering

Yao-long KANG1, Li-lu FENG2, Jing-an ZHANG3, Su-e CAO1

  1. School of Computer and Network Engineering, Shanxi Datong University, Datong 037009, China
  2. School of Education Science and Technology, Shanxi Datong University, Datong 037009, China
  3. Computer Network Center, Shanxi Datong University, Datong 037009, China
  • Received: 2022-01-22 Online: 2023-04-01 Published: 2023-04-20
  • About the author: Yao-long KANG (1979-), male, associate professor. Research interests: data mining, big data technology. E-mail: yaolong_king@126.com
  • Funding:
    National Natural Science Foundation of China (71601101); Datong City Platform Base Program Project (2020196); Shanxi Academy of Social Sciences (Development Research Center of Shanxi Provincial People's Government) 2021 Annual Planning General Project (YWYB202153); Shanxi Datong University Basic Research Project (2022K9)

Abstract:

Existing outlier mining algorithms do not extract the relevant data features before mining, which leads to long mining time, poor mining results, and low mining performance. To address these problems, a fast outlier mining algorithm for uncertain data sets based on spectral clustering is proposed. The algorithm first calculates the similarity of the data from unequal-length sequences and uses the partial least squares method to extract features from the uncertain data set. It then applies the spectral clustering algorithm to the extracted features to obtain the outlier index of the data. Finally, it uses the outlier index to mine the outliers in the uncertain data set. The experimental results show that the algorithm mines outliers with relatively short mining time, good mining results, and high mining performance.

Key words: spectral clustering algorithm, uncertain data set, data outliers, fast mining, partial least squares method
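The first and last steps named in the abstract can be illustrated with a minimal sketch. The abstract does not say how similarity between unequal-length sequences is computed or how the outlier index is defined, so everything below is an assumption: dynamic time warping (DTW) stands in for the similarity measure, a Gaussian kernel turns distances into the affinity graph that spectral clustering would normally start from, and a degree-based isolation score stands in for the outlier index. The partial least squares step and the eigendecomposition are omitted, and the names `dtw_distance`, `affinity_matrix`, `outlier_index`, and `sigma` are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences of any length."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def affinity_matrix(seqs, sigma=1.0):
    """Gaussian affinity W[i][j] = exp(-d_ij^2 / (2 sigma^2)), with a zero diagonal."""
    k = len(seqs)
    w = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            d = dtw_distance(seqs[i], seqs[j])
            w[i][j] = w[j][i] = math.exp(-d * d / (2.0 * sigma * sigma))
    return w


def outlier_index(w):
    """Assumed score (not the paper's definition): 1 - degree / max degree.

    Points that are weakly connected in the affinity graph score close to 1.
    """
    degrees = [sum(row) for row in w]
    top = max(degrees)
    if top == 0.0:  # every point is isolated, so no point stands out
        return [0.0] * len(degrees)
    return [1.0 - d / top for d in degrees]


# Toy data: three similar sequences of different lengths and one that differs.
seqs = [[1, 2, 3], [1, 2, 2, 3], [1, 1, 2, 3], [9, 8, 9, 7, 9]]
scores = outlier_index(affinity_matrix(seqs, sigma=2.0))
suspect = max(range(len(seqs)), key=lambda i: scores[i])
```

On this toy input the last sequence receives the highest score, because its affinity to every other sequence is close to zero. A fuller version would replace the degree-based score with one derived from the spectral embedding.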

CLC number:

  • TP274

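Table 1

Mining time of different mining algorithms

Data volume        Mining time / s
/ 10^4 records     Proposed algorithm   Algorithm of Ref. [3]   Algorithm of Ref. [5]
10                 45                   40                      50
20                 45                   52                      80
30                 45                   74                      126
40                 55                   100                     186
50                 70                   162                     238
60                 100                  208                     299
70                 174                  285                     363
80                 223                  357                     395
90                 298                  417                     436
100                349                  489                     525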
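The time comparison in Table 1 can be made explicit with a short calculation. The sketch below computes, for each data volume, the fraction of mining time the proposed algorithm saves relative to the algorithms of Refs. [3] and [5]. The numbers are the values as read from Table 1, and the names `table1` and `reduction` are hypothetical.

```python
# Mining times in seconds as read from Table 1.
# Key: data volume in units of 10^4 records.
# Value: (proposed algorithm, algorithm of Ref. [3], algorithm of Ref. [5]).
table1 = {
    10: (45, 40, 50),
    20: (45, 52, 80),
    30: (45, 74, 126),
    40: (55, 100, 186),
    50: (70, 162, 238),
    60: (100, 208, 299),
    70: (174, 285, 363),
    80: (223, 357, 395),
    90: (298, 417, 436),
    100: (349, 489, 525),
}


def reduction(ours, other):
    """Fraction of time saved relative to the other algorithm (negative if slower)."""
    return (other - ours) / other


for volume, (ours, ref3, ref5) in table1.items():
    print(
        f"{volume:>3}  vs Ref. [3]: {reduction(ours, ref3):6.1%}"
        f"  vs Ref. [5]: {reduction(ours, ref5):6.1%}"
    )
```

With the values as read here, the proposed algorithm is slower than the algorithm of Ref. [3] only at the smallest data volume (45 s against 40 s) and faster in every other row.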

Fig. 1

Mining accuracy of different mining algorithms

Fig. 2

False alarm rate of different mining algorithms

References:

1 Li Wang-yan, Yu Tong. Research on computer technology-based water conservancy project management information: comment on "Water Conservancy Project Management"[J]. Yellow River, 2020, 42(7): 168.
2 Lyu Jiu-heng, Wang Jian-ling, Pan Li-jia, et al. Application characteristics of abdominal acupuncture based on data mining technique[J]. Acupuncture Research, 2020, 45(3): 237-242.
3 Du Xu-sheng, Yu Jiong, Chen Jia-ying, et al. Outlier detection algorithm based on neighborhood system density difference measurement[J]. Application Research of Computers, 2020, 37(7): 1969-1973.
4 Du Xu-sheng, Yu Jiong, Ye Le-le, et al. Outlier detection algorithm based on graph random walk[J]. Journal of Computer Applications, 2020, 40(5): 1322-1328.
5 Zhao Xiao-yong, Wang Ning-ning, Wang Lei. Research of outlier ensemble mining based on active learning[J]. Computer Engineering and Applications, 2020, 56(12): 112-117.
6 Wu Xin-yu, Li Xin-dan, Ma Chao-qun. Measuring VaR based on the information content of option and high-frequency data[J]. Chinese Journal of Management Science, 2021, 29(8): 13-23.
7 Shen Chen-xi, Du Chen-hui, Li Zhen-yu, et al. Differentiation between authentic and adulterated Ziziphi Spinosae Semen by 1H NMR spectroscopy combined with partial least squares[J]. Food Science, 2020, 41(8): 275-281.
8 Xu Sheng-lan, Sicao Ming-zhe, Wan Can, et al. Ensemble spectral clustering algorithm for load profiles considering dual-scale similarities[J]. Automation of Electric Power Systems, 2020, 44(22): 152-160.
9 Wang Qiu-ping, Ding Cheng, Wang Xiao-feng. A hybrid data clustering algorithm based on improved krill herd algorithm and KHM clustering[J]. Control and Decision, 2020, 35(10): 2449-2458.
10 Wang Xiao-hui, Song Xue-kun, Wang Xiao-chuan. Local outlier mining algorithm for heterogeneous data based on neighborhood density[J]. Computer Simulation, 2021, 38(7): 281-285.