Journal of Jilin University (Engineering and Technology Edition), 2025, Vol. 55, Issue (6): 2089-2096. doi: 10.13229/j.cnki.jdxbgxb.20231196

• Computer Science and Technology •

Feature representation algorithm for imbalanced classification of multi-omics cancer data

Feng-feng ZHOU1,2, Zhe GUO1, Yu-si FAN2

  1. School of Artificial Intelligence, Jilin University, Changchun 130012, China
  2. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received: 2023-11-03 Online: 2025-06-01 Published: 2025-07-23
  • About the author: ZHOU Feng-feng (1977-), male, professor, Ph.D. Research interest: health big data. E-mail: fengfengzhou@gmail.com
  • Funding:
    Jilin Province Outstanding Young and Middle-aged Talents (Team) Project for Science and Technology Innovation and Entrepreneurship (Innovation Category) (20210509055RQ); National Natural Science Foundation of China (62072212); National Natural Science Foundation of China (U19A2061); Jilin Province Big Data Intelligent Computing Laboratory Project (20180622002JC)

Abstract:

To address the complex structure of cancer data, the difficulty of prediction, class imbalance, and patient privacy protection, a feature representation algorithm, ImFeatures, is proposed; it resolves the imbalance in cancer data and enriches the sample structure. Two types of omics data, cancer transcriptome and methylation data, are combined as the real samples. After feature selection by logistic regression (LR) and random forest (RF), the negative samples are randomly partitioned and combined with equal numbers of positive samples, then fed into the proposed feature representation model, which generates representation samples that have learned the key feature information and thereby improve predictive ability. Experiments on 11 common cancer datasets show that, after feature representation, the accuracy (Acc) of the proposed algorithm combining feature selection and feature representation exceeds 80.00% on every dataset and exceeds 95.00% on five of them, so the algorithm can effectively improve the accuracy of cancer prediction.

Key words: computer application technology, feature representation, bioinformatics, multi-omics data, feature selection, machine learning
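The balancing step described in the abstract (negative samples randomly partitioned and combined with equal numbers of positive samples) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `balanced_subsets`, the `seed` parameter, and the choice to drop a trailing partial chunk are all assumptions, since the abstract does not say how leftover negatives are handled.

```python
import random


def balanced_subsets(positives, negatives, seed=0):
    """Pair every positive sample with random, disjoint chunks of negatives.

    Hypothetical helper: each returned subset holds all positives plus an
    equally sized chunk of the shuffled negatives.
    """
    if not positives:
        raise ValueError("need at least one positive sample")
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = len(positives)
    subsets = []
    # Walk the shuffled negatives in chunks of len(positives). A trailing
    # chunk smaller than that is dropped (an assumption) so that every
    # subset stays exactly balanced.
    for start in range(0, len(shuffled) - size + 1, size):
        chunk = shuffled[start:start + size]
        subsets.append((list(positives), chunk))
    return subsets
```

With 8 positives and 68 negatives, the counts listed for the ACC dataset, this yields 8 balanced subsets and leaves 4 negatives unused.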

CLC number:

  • TP399

Fig. 1

Flowchart of the ImFeatures algorithm

Fig. 2

Network structure of the feature representation module

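Table 1

Hyperparameter settings of the feature representation model

| Parameter | Value |
|---|---|
| Encoder layers | 3 |
| Encoder nodes per layer | 512, 256, 128 |
| Encoder activation per layer | ReLU, ReLU, ReLU |
| Generator layers | 4 |
| Generator nodes per layer | 128, 256, 512, 1024 |
| Generator activation per layer | ReLU, ReLU, ReLU, Tanh |
| Discriminator layers | 4 |
| Discriminator nodes per layer | 512, 256, 128, 1 |
| Discriminator activation per layer | Leaky ReLU, Leaky ReLU, Leaky ReLU, Sigmoid |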
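The layer settings in Table 1 can be written down as a small configuration and sanity-checked. The sketch below is only an illustration: the class `MLPSpec` and its `param_count` method are hypothetical names, the layers are assumed to be plain fully connected layers with biases, and the input dimensions used in the usage note are assumptions because this page does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLPSpec:
    """Layer widths and activations of one fully connected sub-network."""
    name: str
    units: tuple
    activations: tuple

    def __post_init__(self):
        # Table 1 lists exactly one activation per layer.
        if len(self.units) != len(self.activations):
            raise ValueError(f"{self.name}: expected one activation per layer")

    def param_count(self, input_dim):
        """Weights plus biases of a dense stack fed `input_dim` features."""
        total, fan_in = 0, input_dim
        for fan_out in self.units:
            total += fan_in * fan_out + fan_out
            fan_in = fan_out
        return total


# Values copied from Table 1.
ENCODER = MLPSpec("encoder", (512, 256, 128), ("relu", "relu", "relu"))
GENERATOR = MLPSpec("generator", (128, 256, 512, 1024),
                    ("relu", "relu", "relu", "tanh"))
DISCRIMINATOR = MLPSpec("discriminator", (512, 256, 128, 1),
                        ("leaky_relu", "leaky_relu", "leaky_relu", "sigmoid"))
```

For example, `ENCODER.param_count(1024)` counts the parameters under the assumption of a 1024-dimensional input, and `GENERATOR.param_count(128)` assumes the generator is fed the encoder's 128-dimensional output.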

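Table 2

Dataset information

| Dataset | Methylation features | Transcriptome features | Samples | Positive samples | Negative samples | Imbalance ratio Lr (N/P) |
|---|---|---|---|---|---|---|
| ACC | 394 014 | 60 483 | 76 | 8 | 68 | 8.50 |
| BLCA | 382 024 | 60 483 | 444 | 166 | 278 | 1.67 |
| BRCA | 363 736 | 60 483 | 1112 | 298 | 814 | 2.73 |
| CHOL | 378 735 | 60 483 | 53 | 8 | 45 | 5.63 |
| COAD | 374 805 | 60 483 | 459 | 165 | 294 | 1.78 |
| KICH | 391 298 | 60 483 | 58 | 8 | 50 | 6.25 |
| KIRC | 382 720 | 60 483 | 274 | 17 | 257 | 15.12 |
| KIRP | 383 105 | 60 483 | 123 | 47 | 76 | 1.62 |
| LUAD | 369 182 | 60 483 | 352 | 7 | 345 | 4.24 |
| LUSC | 370 619 | 60 483 | 639 | 122 | 517 | 5.37 |
| MESO | 381 223 | 60 483 | 465 | 73 | 392 | 1.13 |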
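The imbalance ratio column of Table 2 is the number of negative samples divided by the number of positive samples (N/P). A short check on three rows of the table, with `imbalance_ratio` as a hypothetical helper name:

```python
def imbalance_ratio(n_negative, n_positive):
    """Imbalance ratio Lr = N / P as defined in the Table 2 header."""
    if n_positive <= 0:
        raise ValueError("need at least one positive sample")
    return n_negative / n_positive


# (positive, negative) counts copied from the ACC, BRCA and KIRC rows.
counts = {"ACC": (8, 68), "BRCA": (298, 814), "KIRC": (17, 257)}
ratios = {name: round(imbalance_ratio(neg, pos), 2)
          for name, (pos, neg) in counts.items()}
```

Rounded to two decimals, these reproduce the printed values 8.50, 2.73 and 15.12.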

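Table 3

Model prediction results before and after feature representation

Accuracy (Acc):

| Dataset | KNN (before) | SVM (before) | LR (before) | NB (before) | KNN (ImFeatures) | SVM (ImFeatures) | LR (ImFeatures) | NB (ImFeatures) |
|---|---|---|---|---|---|---|---|---|
| ACC | 0.8167 | 0.6833 | 0.8167 | 0.6333 | 0.8833 | 0.8833 | 0.9500 | 0.9333 |
| BLCA | 0.7051 | 0.7049 | 0.6805 | 0.7141 | 0.8526 | 0.7743 | 0.7738 | 0.7172 |
| BRCA | 0.6329 | 0.7180 | 0.5904 | 0.6205 | 0.8191 | 0.7393 | 0.7091 | 0.6366 |
| CHOL | 0.6167 | 0.6667 | 0.8000 | 0.9333 | 1.0000 | 0.9333 | 1.0000 | 1.0000 |
| COAD | 0.6329 | 0.7180 | 0.5904 | 0.6205 | 0.8191 | 0.7393 | 0.7091 | 0.6366 |
| KICH | 1.0000 | 1.0000 | 1.0000 | 0.9333 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| KIRC | 0.8200 | 0.8200 | 0.7467 | 0.8600 | 0.9267 | 0.9267 | 0.9667 | 0.9333 |
| KIRP | 0.8719 | 0.7877 | 0.8825 | 0.7877 | 0.9778 | 0.8930 | 0.9357 | 0.8298 |
| LUAD | 0.7088 | 0.7094 | 0.6806 | 0.7747 | 0.8690 | 0.7993 | 0.8034 | 0.7788 |
| LUSC | 0.6648 | 0.6648 | 0.6032 | 0.7464 | 0.8285 | 0.8423 | 0.8563 | 0.8083 |
| MESO | 0.7500 | 0.7225 | 0.7633 | 0.8542 | 0.8158 | 0.8025 | 0.8158 | 0.8550 |

Area under the ROC curve (AUC):

| Dataset | KNN (before) | SVM (before) | LR (before) | NB (before) | KNN (ImFeatures) | SVM (ImFeatures) | LR (ImFeatures) | NB (ImFeatures) |
|---|---|---|---|---|---|---|---|---|
| ACC | 0.8500 | 0.7500 | 0.8500 | 0.6500 | 0.9000 | 0.9000 | 0.9500 | 0.9000 |
| BLCA | 0.7051 | 0.7046 | 0.6807 | 0.7143 | 0.8524 | 0.7741 | 0.7737 | 0.7173 |
| BRCA | 0.6331 | 0.7182 | 0.5904 | 0.6207 | 0.8192 | 0.7395 | 0.7093 | 0.6367 |
| CHOL | 0.6000 | 0.6500 | 0.8500 | 0.9000 | 1.0000 | 0.9000 | 1.0000 | 1.0000 |
| COAD | 0.6331 | 0.7182 | 0.5904 | 0.6207 | 0.8192 | 0.7395 | 0.7093 | 0.6367 |
| KICH | 1.0000 | 1.0000 | 1.0000 | 0.9500 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| KIRC | 0.8167 | 0.8167 | 0.7500 | 0.8667 | 0.9333 | 0.9167 | 0.9667 | 0.9333 |
| KIRP | 0.8689 | 0.7822 | 0.8789 | 0.7856 | 0.9778 | 0.8911 | 0.9356 | 0.8278 |
| LUAD | 0.7087 | 0.7107 | 0.6822 | 0.7762 | 0.8700 | 0.8007 | 0.8043 | 0.7803 |
| LUSC | 0.6671 | 0.6657 | 0.6038 | 0.7486 | 0.8281 | 0.8438 | 0.8562 | 0.8100 |
| MESO | 0.7554 | 0.7286 | 0.7643 | 0.8589 | 0.8196 | 0.8071 | 0.8161 | 0.8571 |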

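Table 4

Prediction results on single-omics and multi-omics data

| Data | Accuracy | Sensitivity | Specificity | Precision | Recall | Matthews correlation coefficient | Area under the ROC curve |
|---|---|---|---|---|---|---|---|
| BRCA-Transcriptomics | 0.6040 | 0.4063 | 0.8022 | 0.6671 | 0.4063 | 0.2251 | 0.6042 |
| BRCA-Methylation | 0.5966 | 0.4893 | 0.7041 | 0.6235 | 0.4893 | 0.1981 | 0.5967 |
| BRCA-Multi-omics | 0.6329 | 0.4648 | 0.8013 | 0.7013 | 0.4648 | 0.2831 | 0.6331 |
| LUAD-Transcriptomics | 0.6120 | 0.4640 | 0.7600 | 0.6572 | 0.4640 | 0.2350 | 0.6120 |
| LUAD-Methylation | 0.6454 | 0.4562 | 0.8352 | 0.7379 | 0.4562 | 0.3165 | 0.6457 |
| LUAD-Multi-omics | 0.7088 | 0.6390 | 0.7783 | 0.7453 | 0.6390 | 0.4234 | 0.7087 |
| LUSC-Transcriptomics | 0.5930 | 0.2584 | 0.9279 | 0.7889 | 0.2584 | 0.2519 | 0.5932 |
| LUSC-Methylation | 0.5761 | 0.3187 | 0.8363 | 0.5570 | 0.3187 | 0.1476 | 0.5775 |
| LUSC-Multi-omics | 0.6648 | 0.3752 | 0.9590 | 0.9200 | 0.3752 | 0.4155 | 0.6671 |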
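The metrics reported in Table 4 all follow from a binary confusion matrix. The sketch below uses the standard textbook definitions; `binary_metrics` and its helper `_ratio` are hypothetical names, and the page does not show the authors' own evaluation code. Sensitivity and recall are the same quantity, which is why those two columns are identical in the table. In every row of Table 4 the printed AUC also agrees, to within rounding, with (sensitivity + specificity) / 2, so the sketch returns that value as `balanced_accuracy`; whether the authors computed AUC this way is not stated.

```python
import math


def _ratio(numerator, denominator):
    """Division that returns 0.0 instead of failing on an empty class."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)   # true positive rate, same as recall
    specificity = _ratio(tn, tn + fp)   # true negative rate
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": _ratio(tp, tp + fp),
        "recall": sensitivity,
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }
```

For instance, `binary_metrics(tp=40, fn=10, fp=5, tn=45)` describes a classifier that finds 40 of 50 positives and rejects 45 of 50 negatives.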

Fig. 3

Results of the proposed model compared with other feature representation methods

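Table 5

Performance of feature representation in federated learning

| Dataset | Algorithm | Accuracy | Sensitivity | Specificity | Precision | Recall |
|---|---|---|---|---|---|---|
| BRCA | Base | 0.6329 | 0.4648 | 0.8013 | 0.7013 | 0.4648 |
| BRCA | SMOTE | 0.6667 | 0.5847 | 0.7483 | 0.6956 | 0.5847 |
| BRCA | ImFeatures | 0.8138 | 0.8152 | 0.8122 | 0.8135 | 0.8152 |
| BRCA | SMOTE+ImFeatures | 0.8315 | 0.8187 | 0.8442 | 0.8399 | 0.8187 |
| LUAD | Base | 0.7006 | 0.6230 | 0.7793 | 0.7447 | 0.6230 |
| LUAD | SMOTE | 0.6800 | 0.6230 | 0.7370 | 0.7159 | 0.6230 |
| LUAD | ImFeatures | 0.8483 | 0.8190 | 0.8770 | 0.8676 | 0.8190 |
| LUAD | SMOTE+ImFeatures | 0.8282 | 0.8023 | 0.8533 | 0.8536 | 0.8023 |
| LUSC | Base | 0.6648 | 0.3838 | 0.9457 | 0.8944 | 0.3838 |
| LUSC | SMOTE | 0.6028 | 0.3162 | 0.8895 | 0.7700 | 0.3162 |
| LUSC | ImFeatures | 0.8078 | 0.6324 | 0.9857 | 0.9778 | 0.6324 |
| LUSC | SMOTE+ImFeatures | 0.8147 | 0.6590 | 0.9714 | 0.9578 | 0.6590 |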
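Table 5 uses SMOTE (synthetic minority over-sampling technique) as the oversampling baseline. The core idea, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in plain Python as below. This is a simplified illustration rather than the configuration used in the paper: `smote_like` is a hypothetical name, and the values of `k` and `seed` are assumptions.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority samples.
        nearest = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: squared_distance(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(nearest)]
        weight = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            [b + weight * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point is a convex combination of two real minority samples, it never leaves the bounding box of the minority class.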

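Table 6

Comparison with other similar research methods

| Dataset | OCF method: Acc | OCF method: AUC | OCF method: G-means | Proposed method: Acc | Proposed method: AUC | Proposed method: G-means |
|---|---|---|---|---|---|---|
| GSE122497 | 0.9946 | 0.9956 | 0.9956 | 0.9976 | 0.9987 | 0.9986 |
| GSE106817 | 0.9827 | 0.9871 | 0.9871 | 0.9957 | 0.9976 | 0.9976 |
| GSE137140 | 0.9973 | 0.9987 | 0.9987 | 0.9973 | 0.9971 | 0.9971 |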
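Table 6 reports G-means alongside Acc and AUC. Under the usual definition for imbalanced classification, the G-mean is the geometric mean of sensitivity and specificity; the sketch below assumes that definition, since this page does not spell out how the authors computed it.

```python
import math


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("rates must lie in [0, 1]")
    return math.sqrt(sensitivity * specificity)
```

Unlike accuracy, this score drops to zero whenever a classifier ignores one class entirely, which is why it is often reported for imbalanced data.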