Journal of Jilin University (Engineering and Technology Edition), 2019, Vol. 49, Issue 4: 1293-1300. doi: 10.13229/j.cnki.jdxbgxb20180478


Nonlinear feature selection method based on dynamic change of selected features

Wan-fu GAO1, Ping ZHANG2, Liang HU1

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  2. College of Software, Jilin University, Changchun 130012, China
  • Received: 2018-05-15 Online: 2019-07-01 Published: 2019-07-16
  • Contact: Liang HU E-mail: gaowf16@mails.jlu.edu.cn; hul@jlu.edu.cn
  • About the author: Wan-fu GAO (1990-), male, doctoral candidate. Research interests: machine learning, data mining. E-mail: gaowf16@mails.jlu.edu.cn
  • Supported by:
    National Key R&D Program of China (2017YFA0604500); National Science and Technology Support Program (2014BAH02F00); National Natural Science Foundation of China (61701190); Jilin Province Program for Young and Middle-aged Leading Talents and Teams in Science and Technology Innovation (20170519017JH); Jilin Province Provincial-University Joint Demonstration Project (SXGJSF2017-4); Jilin Province Key Science and Technology R&D Project (20180201103GX)

Abstract:

Information theory is widely used in feature selection. Most existing information-theoretic methods score candidate features by linear cumulative summation. To address the shortcomings of that approach, and to account for how the dynamic change of the already-selected features affects selection, this paper proposes a nonlinear feature selection method. The proposed method is compared with seven competitive feature selection methods using three classifiers on eight real-world data sets. The experimental results show that it performs well in terms of both average and highest classification accuracy and has a strong classification advantage.

Key words: artificial intelligence, feature selection, information theory, dynamic change, nonlinear methods, classification
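To make the "cumulative summation" idea in the abstract concrete, the following is a minimal sketch of a greedy, information-theoretic selector that uses a linear criterion in the style of mRMR (relevance minus mean redundancy). It is an illustration only and is not the nonlinear method proposed in the paper, whose scoring formula does not appear on this page. The function names, the toy data, and the choice of an mRMR-style score are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def greedy_select(features, labels, k):
    """Greedy forward selection with a linear (cumulative-sum) criterion.

    Assumed mRMR-style score of candidate f given the selected set S:
        I(f; C) - (1/|S|) * sum over s in S of I(f; s)
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:

        def score(name):
            relevance = mutual_information(features[name], labels)
            if not selected:
                return relevance
            # Redundancy is accumulated as a plain sum over selected features.
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            )
            return relevance - redundancy / len(selected)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data, made up for illustration: f2 is an exact copy of f1,
# f3 is less relevant to the label but largely non-redundant with f1.
labels = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "f1": [0, 0, 1, 1, 0, 0, 1, 0],
    "f2": [0, 0, 1, 1, 0, 0, 1, 0],
    "f3": [0, 0, 1, 0, 0, 1, 1, 1],
}
print(greedy_select(features, labels, 2))
```

On this toy data the duplicate feature f2 is penalized by its redundancy with f1, so the second pick is f3 even though f3 alone is less informative about the label.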

CLC number:

  • TP301

Table 1  Description of the data sets

Data set Samples Features Classes Type
Semeion 1593 256 10 discrete
Movement_libras 360 90 15 continuous
WarpPIE10P 210 2420 10 continuous
TOX_171 171 5748 4 continuous
SMK_CAN_187 187 19993 2 continuous
Isolet 1560 617 26 continuous
USPS 9298 256 10 continuous
RELATHE 1427 4322 2 discrete

Table 2  Average accuracy (%) of the eight algorithms with the NB classifier

Data sets CIFE CMIM DISR mRMR IWFS MRI JMIM NDCSF
Semeion 44.98±7.48(+) 51.32±10.54(=) 44.94±7.15(+) 46.94±8.07(+) 44.28±7.81(+) 48.8±9.01(+) 46.68±8.99(+) 51.95±10.52
Movement_libras 45.9±8.02(+) 52.9±11.06(=) 46.31±7.93(+) 49.9±9.52(+) 48.67±8.24(+) 52.71±10.91(=) 51.79±10.38(=) 52.23±11.38
WarpPIE10P 58.23±7.78(+) 66.53±13.11(+) 55.15±10.71(+) 62.85±10.96(+) 60.89±9.59(+) 59.85±10.14(+) 60.81±11.67(+) 71.63±14.37
TOX_171 57.42±1.86(=) 61.87±2.86(-) 53.76±3.44(+) 54.97±1.78(+) 50.99±1.96(+) 59.37±3.31(-) 55.11±1.56(+) 57.76±3.07
SMK_CAN_187 65.69±1.79(+) 62.37±1.72(+) 65.04±2.35(+) 62.00±1.45(+) 63.37±3.81(+) 64.22±1.7(+) 62.00±1.68(+) 67.66±2.40
Isolet 55.53±13.47(+) 58.26±14.33(+) 44.46±10.38(+) 46.87±10.6(+) 49.37±10.55(+) 56.55±13.67(+) 51.57±11.24(+) 59.43±17.51
USPS 66.4±8.8(+) 76.21±12.55(+) 67.4±9.95(+) 71.72±10.55(+) 67.63±9.02(+) 70.93±10.41(+) 70.75±10.22(+) 76.11±12.43
RELATHE 60.84±0.98(+) 65.89±3.76(+) 63.56±3.01(+) 65.85±3.67(+) 67.71±2.82(+) 65.47±2.64(=) 66.66±2.5(+) 67.59±4.6
Average 56.87 61.92 55.08 57.64 56.61 59.74 58.17 63.05

Table 3  Average accuracy (%) of the eight algorithms with the SVM classifier

Data sets CIFE CMIM DISR mRMR IWFS MRI JMIM NDCSF
Semeion 60.67±11.8(+) 67.1±15.89(+) 56.26±11.36(+) 60.3±12.32(+) 58.9±11.59(+) 61.5±12.93(+) 61.96±13.59(+) 67.97±16.84
Movement_libras 66.06±15.4(+) 70.81±17.81(+) 60.1±12.54(+) 66.99±16.03(+) 67.77±15.53(+) 70.13±17.35(+) 70.87±17.73(+) 72.59±18.12
WarpPIE10P 81.02±14.99(+) 84.86±17.04(+) 76.1±12.86(+) 81.61±15.81(+) 83.71±16.96(+) 80.29±14.94(+) 81.52±16.53(+) 87.18±18.24
TOX_171 59.6±5.26(+) 65.15±5.04(-) 58.36±3.02(+) 59.87±3.07(+) 58.16±3.87(+) 57.83±2.74(+) 56.35±2.48(+) 63.01±5.91
SMK_CAN_187 62.51±2.32(+) 61.48±1.29(+) 64.46±1.19(+) 62.69±1.61(+) 63.56±2.87(+) 61.77±1.96(+) 61.89±1.68(+) 67.39±1.81
Isolet 59.41±14.09(+) 68.07±18.22(-) 60.3±14.89(+) 60.45±14.78(+) 58.28±13.81(+) 65.46±16.63(=) 60.06±14.07(+) 65.41±18.65
USPS 80.12±13.7(+) 84.12±15.24(+) 78.74±13.77(+) 81.35±13.98(+) 80.29±13.35(+) 81.55±14.1(+) 81.92±14.41(+) 85.84±15.9
RELATHE 69.66±1.55(+) 73.25±2.89(+) 72.74±3.22(+) 75.14±3.75(=) 74.67±2.88(+) 71.03±1.85(+) 72.19±2.06(+) 74.65±3.84
Average 67.38 71.86 65.88 68.55 68.17 68.7 68.35 73.01

Table 4  Average accuracy (%) of the eight algorithms with the 3NN classifier

Data sets CIFE CMIM DISR mRMR IWFS MRI JMIM NDCSF
Semeion 55.99±13.38(+) 62.06±18.63(=) 49.39±14.08(+) 54.81±14.32(+) 53.76±12.87(+) 56.11±14.21(+) 55.88±15.46(+) 62.56±19.21
Movement_libras 61.27±15.1(+) 64.97±16.88(+) 56.46±12.54(+) 62.38±15.56(+) 62.64±13.92(+) 65.00±17.22(+) 64.01±16.32(+) 66.7±17.23
WarpPIE10P 77.29±14.62(+) 81.95±17.77(+) 73.97±13.91(+) 78.62±16.94(+) 77.63±16.12(+) 76.68±15.42(+) 78.78±17.68(+) 84.58±18.81
TOX_171 48.92±3.01(+) 61.27±6.79(-) 57.11±5.6(=) 54.93±4.89(+) 53.43±3.97(+) 60.05±6.37(-) 59.1±5.66(-) 57.12±5.86
SMK_CAN_187 60.76±2.81(+) 59.76±2.25(+) 62.39±1.9(+) 62.71±2.29(+) 60.53±2.54(+) 59.72±2.62(+) 59.4±2.79(+) 67.32±2.44
Isolet 46.42±10.66(+) 60.37±17.53(-) 53.62±14.75(+) 53.09±13.97(+) 45.79±10.54(+) 58.08±16.23(-) 52.36±13.5(+) 55.66±16.86
USPS 75.53±15.16(+) 80.56±17.25(+) 74.36±15.06(+) 76.97±15.96(+) 75.88±14.68(+) 77.48±15.76(+) 77.97±16.06(+) 82.41±17.91
RELATHE 62.94±3.19(+) 64.95±4.54(=) 62.05±3.57(+) 65.27±4.67(=) 57.89±2.5(+) 63.46±3.86(+) 63.23±4.01(+) 65.06±5.47
Average 61.14 66.99 61.17 63.60 60.94 64.57 63.84 67.68

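Fig. 1  Accuracy of the classifiers on Isolet

Fig. 2  Accuracy of the classifiers on Movement_libras

Fig. 3  Accuracy of the classifiers on RELATHE

Fig. 4  Accuracy of the classifiers on Semeion

Fig. 5  Accuracy of the classifiers on SMK_CAN_187

Fig. 6  Accuracy of the classifiers on TOX_171

Fig. 7  Accuracy of the classifiers on USPS

Fig. 8  Accuracy of the classifiers on WarpPIE10P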

Table 5  Comparison of the highest accuracy (%) of the eight algorithms on the classifiers

Data sets CIFE CMIM DISR mRMR IWFS MRI JMIM NDCSF
Semeion 64.81 75.75 63.28 65.20 63.47 67.53 69.60 78.01
Movement_libras 66.33 71.89 61.89 69.33 67.81 71.85 71.52 72.89
WarpPIE10P 80.33 89.78 80.06 85.11 85.67 83.17 87.06 93.39
TOX_171 59.93 68.15 60.54 59.69 57.63 62.86 59.64 63.98
SMK_CAN_187 66.04 64.61 65.94 65.11 65.36 65.34 64.61 71.32
Isolet 65.24 75.90 64.53 64.96 60.04 73.82 63.35 76.39
USPS 84.46 90.54 83.63 85.96 84.23 85.79 87.46 91.65
RELATHE 66.08 72.76 70.24 73.58 68.91 69.42 70.22 75.46
Average 69.15 76.17 68.76 71.12 69.14 72.47 71.68 77.89
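As a quick consistency check on the Average row of Table 5, the sketch below recomputes the column means for three of the eight algorithms from the per-data-set values listed in the table. The variable and function names are chosen for this example only, and the check assumes that the Average row is the plain arithmetic mean over the eight data sets.

```python
# Highest accuracies (%) copied from Table 5, in row order:
# Semeion, Movement_libras, WarpPIE10P, TOX_171, SMK_CAN_187, Isolet, USPS, RELATHE
table5 = {
    "CIFE": [64.81, 66.33, 80.33, 59.93, 66.04, 65.24, 84.46, 66.08],
    "CMIM": [75.75, 71.89, 89.78, 68.15, 64.61, 75.90, 90.54, 72.76],
    "NDCSF": [78.01, 72.89, 93.39, 63.98, 71.32, 76.39, 91.65, 75.46],
}

# Values from the "Average" row of Table 5 for the same columns.
reported_average = {"CIFE": 69.15, "CMIM": 76.17, "NDCSF": 77.89}


def column_mean(values):
    """Arithmetic mean of one table column."""
    return sum(values) / len(values)


for name, values in table5.items():
    mean = column_mean(values)
    print(f"{name}: computed {mean:.2f}, reported {reported_average[name]:.2f}")
```

The same check can be extended to the remaining columns by adding their values from the table.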