Journal of Jilin University (Engineering and Technology Edition), 2024, Vol. 54, Issue (10): 2969-2977. doi: 10.13229/j.cnki.jdxbgxb.20221633

• Computer Science and Technology •

Generative adversarial autoencoder integrated voting algorithm based on mass spectral data

Feng-feng ZHOU1, Tao YU1, Yu-si FAN2

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  2. College of Software, Jilin University, Changchun 130012, China
  • Received: 2022-12-27 Online: 2024-10-01 Published: 2024-11-22
  • About the author: Feng-feng ZHOU (1977-), male, professor, Ph.D. Research interest: health informatics. E-mail: FengfengZhou@gmail.com
  • Funding:
    National Natural Science Foundation of China (U19A2061); Jilin Provincial Department of Science and Technology Fund (20210509055RQ)

Abstract:

Mass spectrometry is commonly used for disease prevention and diagnosis, but the large number of features in mass spectrometry data and their wide variation across diseases make multi-disease diagnosis complex and difficult. To address this problem, this paper proposes msDAGVote, a generative adversarial autoencoder integrated voting algorithm for mass spectrometry data. msDAGVote uses a generative adversarial network based on dual autoencoders as its feature extraction framework. After the network has been trained on the input mass spectrometry data, the generator sub-network is used for feature construction. Finally, an integrated voting feature selection algorithm screens the constructed features, and the resulting optimal feature subset is used for multi-disease diagnosis. In an evaluation on mass spectrometry datasets covering 10 different disease types, the features extracted by msDAGVote outperform those of the comparison methods: the algorithm significantly reduces the number of features required for classification while retaining excellent diagnostic classification performance, with a classification AUC above 0.98 on six datasets and above 0.87 on the remaining, more challenging datasets.

Key words: computer application, bioinformatics, mass spectrometry, feature engineering, feature selection, generative adversarial network, dual autoencoder

CLC number:

  • TP391
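
The integrated voting step named in the abstract can be illustrated with a minimal sketch. It assumes a simple rule that is not confirmed by this page: each selector (RF, GB, Ada and SHAP, the four combined in the msDAGVote variant of Table 2) nominates its top-k features, and a feature is kept when at least `min_votes` selectors nominate it. The importance scores, `k`, `min_votes`, and the function names `top_k` and `vote_select` are hypothetical; the paper's actual voting strategy is the one shown in its Fig. 3 and may differ.

```python
from collections import Counter
from typing import Dict, List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring features (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def vote_select(score_table: Dict[str, Sequence[float]],
                k: int, min_votes: int) -> List[int]:
    """Keep features that at least `min_votes` selectors place in their top k."""
    votes: Counter = Counter()
    for scores in score_table.values():
        votes.update(top_k(scores, k))
    return sorted(i for i, v in votes.items() if v >= min_votes)


# Hypothetical importance scores for six constructed features.
# These numbers are invented for illustration, not taken from the paper.
scores = {
    "RF":   [0.30, 0.05, 0.25, 0.10, 0.20, 0.10],
    "GB":   [0.40, 0.02, 0.08, 0.30, 0.15, 0.05],
    "Ada":  [0.10, 0.10, 0.35, 0.05, 0.30, 0.10],
    "SHAP": [0.25, 0.05, 0.30, 0.05, 0.25, 0.10],
}

selected = vote_select(scores, k=3, min_votes=3)
print(selected)  # [0, 2, 4]
```

Requiring agreement among several selectors is one way to discard features that only a single ranking method favours, which fits the reduction in feature counts that the page reports.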

Fig. 1  msDAGVote algorithm workflow

Fig. 2  Network structure of the feature construction module

Fig. 3  Voting strategy

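Table 1  Datasets

Dataset    Disease type            Samples  Features  Class sizes (unlabeled in source)
ST000284   Rectal cancer           148      113       84 / 64
ST000355   Breast cancer 1         211      128       76 / 135
ST000356   Breast cancer 2         134      101       31 / 103
ST000385   Lung adenocarcinoma     165      2,331     75 / 90
ST000388   Lung cancer             94       43,044    29 / 65
MTBLS354   Pneumonia               239      12,041    97 / 142
MTBLS352   Prediabetes             210      2,724     98 / 112
ST000383   Type 2 diabetes         56       106       12 / 44
Feng       Coronary heart disease  102      4,948     43 / 59
MTBLS408   Psoriasis               90       2,015     45 / 45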

Fig. 4  Preliminary feature screening results

Table 2  AUC comparison of integrated voting strategies

Dataset    GB      Ada     RF      SHAP    msDAGVote (Var+T-test+LinearSVC+DT)  msDAGVote (RF+GB+Ada+SHAP)
ST000284   0.8416  0.7964  0.7285  0.8607  0.8507                               0.8835
ST000355   1.0000  1.0000  1.0000  1.0000  1.0000                               1.0000
ST000356   1.0000  1.0000  1.0000  1.0000  1.0000                               1.0000
ST000385   0.8500  0.7352  0.7778  0.8421  0.8185                               0.8704
ST000388   0.8615  0.8231  0.7821  0.8462  0.8718                               0.8846
MTBLS354   0.9736  0.9691  0.9764  0.9601  0.9792                               0.9801
MTBLS352   0.8273  0.7909  0.8667  0.8614  0.8477                               0.8841
ST000383   0.8889  1.0000  0.8148  1.0000  1.0000                               1.0000
Feng       0.8194  0.8194  0.9583  0.9412  0.9583                               1.0000
MTBLS408   0.9506  0.8395  0.8642  1.0000  0.9630                               1.0000

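Table 3  Comparison of the number of features selected by the integrated voting strategies

Dataset    GB, Ada, RF, SHAP, msDAGVote (Var+T-test+LinearSVC+DT)  msDAGVote (RF+GB+Ada+SHAP)
ST000284   572918195 (column breaks lost)                          30
ST000355   2, 2, 1, 4, 5                                           4
ST000356   1, 1, 1, 1, 1                                           1
ST000385   681125414 (column breaks lost)                          56
ST000388   5355332912 (column breaks lost)                         39
MTBLS354   2, 4, 2, 3, 2                                           3
MTBLS352   3621795780 (column breaks lost)                         66
ST000383   169567108 (column breaks lost)                          67
Feng       5, 1, 3, 1, 1                                           3
MTBLS408   843453250 (column breaks lost)                          48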

Fig. 5  Comparison of integrated voting strategies

Fig. 6  AUC comparison between msDAGVote and no feature engineering

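Table 4  Comparison of the number of features between msDAGVote and no feature engineering

Dataset    No feature engineering  msDAGVote
ST000284   113                     30
ST000355   128                     4
ST000356   101                     1
ST000385   2,331                   56
ST000388   43,044                  39
MTBLS354   12,041                  3
MTBLS352   2,724                   66
ST000383   106                     67
Feng       4,948                   3
MTBLS408   2,015                   48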

Table 5  Comparison of msDAGVote with other papers

           msDAGVote                             Compared literature
Dataset    Precision  Recall  Accuracy  AUC      AUC
ST000284   0.6667     0.6154  0.7000    0.8835   0.820
ST000355   1.0000     0.9643  0.9767    1.0000   0.978
ST000356   0.9524     0.9524  0.9259    1.0000   0.990
ST000385   0.8235     0.7778  0.7879    0.8704   0.660
ST000388   0.8125     1.0000  0.8421    0.8846   0.570
MTBLS354   0.9667     1.0000  0.9792    0.9801   0.980
MTBLS352   0.7391     0.7727  0.7381    0.8841   0.800
ST000383   0.9000     1.0000  0.9167    1.0000   0.925
Feng       0.9000     0.7500  0.8095    1.0000   0.980
MTBLS408   0.9000     1.0000  0.9444    1.0000   0.945
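
Table 5 reports precision, recall and accuracy next to AUC, and some rows pair an AUC of 1 with a lower accuracy (the Feng row, for instance). The sketch below shows, on invented toy data, how such a combination can arise: AUC measures only how well scores rank positives above negatives, while the other three metrics depend on a decision threshold. The 0.5 threshold, the labels, the scores, and the function names are all assumptions for illustration; this page does not state which classifier or threshold the paper used.

```python
from typing import Dict, Sequence


def threshold_metrics(y_true: Sequence[int], y_score: Sequence[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Precision, recall and accuracy after cutting scores at `threshold`."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented): every positive outranks every negative,
# but one positive scores below the 0.5 threshold.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.6, 0.45, 0.4, 0.3, 0.2]

metrics = threshold_metrics(y_true, y_score)
metrics["auc"] = auc(y_true, y_score)
print(metrics)
```

Here the ranking is perfect, so the AUC is 1.0, yet one positive sample is missed at the fixed threshold, which lowers recall and accuracy.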

Fig. 7  Comparison with other papers

1 Chen Xue-yun, Xu Tao, Huang Xiao-qiao. Detection method of medical cell image generation based on conditional generative adversarial network[J]. Journal of Jilin University (Engineering and Technology Edition), 2021, 51(4): 1414-1419.
2 Ouyang Ji-hong, Guo Ze-qi, Liu Si-guang. Dual-branch hybrid attention decision net for diabetic retinopathy classification[J]. Journal of Jilin University (Engineering and Technology Edition), 2022, 52(3): 648-656.
3 Zhou Feng-feng, Zhang Yi-chi. Unsupervised feature engineering algorithm BioSAE based on sparse autoencoder[J]. Journal of Jilin University (Engineering and Technology Edition), 2022, 52(7): 1645-1656.
4 Zhou Feng-feng, Zhang Ya-qi. Binding prediction algorithm of HLA-Ⅰ and polypeptides based on pre-trained model ProtBert[J]. Journal of Jilin University (Science Edition), 2023, 61(3): 651-657.
5 Zhou Feng-feng, Zhang Jin-kai. Graph attention network with local and global attention mechanisms for learning single-sample omics data representations[J]. Journal of Jilin University (Science Edition), 2023, 61(6): 1351-1357.
6 He Zhi-qiao, Han Yan-jun, Jia Jing-yi. Advances in ambient ionization mass spectrometry and its application in food detection[J]. Food Research and Development, 2022, 43(8): 216-224.
7 Shen X T, Shao W, Wang C C, et al. Deep learning-based pseudo-mass spectrometry imaging analysis for precision medicine[J]. Briefings in Bioinformatics, 2022, 23(5): bbac331.
8 Cadow J, Manica M, Mathis R, et al. On the feasibility of deep learning applications using raw mass spectrometry data[J]. Bioinformatics, 2021, 37(): i245-i253.
9 Mittal P, Condina M R, Klingler-Hoffmann M, et al. Cancer tissue classification using supervised machine learning applied to MALDI mass spectrometry imaging[J]. Cancers, 2021, 13(21): 5388.
10 Chen D P, Bryden W A, Wood R. Detection of tuberculosis by the analysis of exhaled breath particles with high-resolution mass spectrometry[J]. Scientific Reports, 2020, 10(1): No.7647.
11 Rumelhart D E, Hinton G E, Williams R J. Learning representations by back-propagating errors[J]. Nature, 1986, 323(6088): 533-536.
12 Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
13 Lundberg S M, Erion G, Chen H, et al. From local explanations to global understanding with explainable AI for trees[J]. Nature Machine Intelligence, 2020, 2(1): 56-67.
14 Evans E D, Duvallet C, Chu N D, et al. Predicting human health from biofluid-based metabolomics using machine learning[J]. Scientific Reports, 2020, 10(1): No.17635.