Journal of Jilin University (Engineering and Technology Edition), 2025, Vol. 55, Issue (6): 2089-2096. doi: 10.13229/j.cnki.jdxbgxb.20231196


Feature representation algorithm for imbalanced classification of multi-omics cancer data

Feng-feng ZHOU1,2, Zhe GUO1, Yu-si FAN2

  1. School of Artificial Intelligence, Jilin University, Changchun 130012, China
  2. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received: 2023-11-03  Online: 2025-06-01  Published: 2025-07-23

Abstract:

Cancer data pose several problems at once: complex data structure, difficult prediction, class imbalance, and the need to protect patient privacy. This paper proposes ImFeatures, a feature representation algorithm that addresses the imbalance in cancer data and enriches the sample structure. Two omics types, transcriptome and methylation data, are combined as the real samples. After feature selection by logistic regression and random forest, the negative samples are randomly divided and combined with equal numbers of positive samples. The proposed feature representation model then generates sample representations that capture the key feature information, which improves the predictive ability of the downstream model. Experiments on 11 common cancer datasets show that, after feature representation, the accuracy (Acc) of the combined feature selection and feature representation algorithm exceeds 80.00% on every dataset and exceeds 95.00% on five cancer types, so the approach effectively improves cancer prediction accuracy.
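
The pairing of negative and positive samples described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the negative class is shuffled, cut into chunks the size of the positive class, and each chunk is paired with the full positive set. The helper name `balanced_subsets`, the fixed seed, and the decision to drop a leftover partial chunk are all assumptions made for the example.

```python
import random


def balanced_subsets(positives, negatives, seed=0):
    """Pair the positive set with random, equally sized chunks of negatives.

    Hypothetical helper: the chunking rule is an assumption, not taken
    from the paper.
    """
    if not positives:
        raise ValueError("need at least one positive sample")
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = len(positives)
    subsets = []
    # Assumption: a trailing chunk smaller than the positive set is dropped.
    for start in range(0, len(shuffled) - size + 1, size):
        chunk = shuffled[start:start + size]
        subsets.append((list(positives), chunk))
    return subsets


# Example with the ACC counts from Table 2: 8 positive and 68 negative samples.
pos = [f"P{i}" for i in range(8)]
neg = [f"N{i}" for i in range(68)]
subsets = balanced_subsets(pos, neg)
print(len(subsets))  # 8 balanced subsets of 8 + 8 samples; 4 negatives unused
```

Each resulting subset is balanced, so a model trained on it sees positives and negatives in equal numbers.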

Key words: computer application technology, feature representation, bioinformatics, multi-omics data, feature selection, machine learning

CLC Number: 

  • TP399

Fig.1

ImFeatures algorithm procedure

Fig.2

Feature representation module network structure

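Table 1

Feature representation model hyperparameter settings

Parameter                        Value
Encoder layers                   3
Encoder nodes per layer          512, 256, 128
Encoder activation per layer     ReLU, ReLU, ReLU
Generator layers                 4
Generator nodes per layer        128, 256, 512, 1024
Generator activations            ReLU, ReLU, ReLU, Tanh
Discriminator layers             4
Discriminator nodes per layer    512, 256, 128, 1
Discriminator activations        Leaky ReLU, Leaky ReLU, Leaky ReLU, Sigmoid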
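
To make the sizes in Table 1 concrete, the sketch below writes the three layer stacks as plain data and counts their trainable parameters. It assumes every layer is fully connected with a bias term, which the table does not state. The input widths used in the example (1024 for the encoder and discriminator, 128 for the generator) are also assumptions, chosen only so that the stacks chain together; `dense_param_count` is a hypothetical helper.

```python
# Layer widths and activations as listed in Table 1.
ENCODER = [(512, "ReLU"), (256, "ReLU"), (128, "ReLU")]
GENERATOR = [(128, "ReLU"), (256, "ReLU"), (512, "ReLU"), (1024, "Tanh")]
DISCRIMINATOR = [
    (512, "LeakyReLU"),
    (256, "LeakyReLU"),
    (128, "LeakyReLU"),
    (1, "Sigmoid"),
]


def dense_param_count(input_dim, layers):
    """Weights plus biases of a stack of fully connected layers.

    Assumption: each (width, activation) entry is a dense layer with bias.
    """
    total, fan_in = 0, input_dim
    for width, _activation in layers:
        total += fan_in * width + width
        fan_in = width
    return total


# Assumed input widths, for illustration only.
for name, in_dim, layers in [
    ("encoder", 1024, ENCODER),
    ("generator", 128, GENERATOR),
    ("discriminator", 1024, DISCRIMINATOR),
]:
    print(name, dense_param_count(in_dim, layers))
```

Under these assumptions the three networks are of similar size, roughly 0.7 million parameters each.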

Table 2

Dataset information

Dataset  Methylation features  Transcriptome features  Samples  Positive  Negative  Imbalance ratio Lr (N/P)
ACC      394014                60483                   76       8         68        8.50
BLCA     382024                60483                   444      166       278       1.67
BRCA     363736                60483                   1112     298       814       2.73
CHOL     378735                60483                   53       8         45        5.63
COAD     374805                60483                   459      165       294       1.78
KICH     391298                60483                   58       8         50        6.25
KIRC     382720                60483                   274      17        257       15.12
KIRP     383105                60483                   123      47        76        1.62
LUAD     369182                60483                   35273454.24
LUSC     370619                60483                   639      122       517       5.37
MESO     381223                60483                   465      73        392       1.13
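
The imbalance ratio column in Table 2 is the number of negative samples divided by the number of positive samples (N/P). A short worked check on a few rows, with `imbalance_ratio` as an illustrative helper name:

```python
def imbalance_ratio(n_negative, n_positive):
    """Negative-to-positive ratio (N/P), rounded to two decimals."""
    if n_positive <= 0:
        raise ValueError("n_positive must be positive")
    return round(n_negative / n_positive, 2)


# (positive, negative) counts copied from Table 2.
counts = {
    "ACC": (8, 68),
    "BLCA": (166, 278),
    "BRCA": (298, 814),
    "KICH": (8, 50),
}
for name, (pos, neg) in counts.items():
    print(name, imbalance_ratio(neg, pos))
```

A ratio of 1 would mean a perfectly balanced dataset; ACC, with 8.50 negatives per positive, is strongly imbalanced.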

Table 3

Model prediction results before and after feature representation

Dataset  Before feature representation     ImFeatures
         KNN     SVM     LR      NB        KNN     SVM     LR      NB
ACC      0.8167  0.6833  0.8167  0.6333    0.8833  0.8833  0.9500  0.9333
BLCA     0.7051  0.7049  0.6805  0.7141    0.8526  0.7743  0.7738  0.7172
BRCA     0.6329  0.7180  0.5904  0.6205    0.8191  0.7393  0.7091  0.6366
CHOL     0.6167  0.6667  0.8000  0.9333    1.0000  0.9333  1.0000  1.0000
COAD     0.6329  0.7180  0.5904  0.6205    0.8191  0.7393  0.7091  0.6366
KICH     1.0000  1.0000  1.0000  0.9333    1.0000  1.0000  1.0000  1.0000
KIRC     0.8200  0.8200  0.7467  0.8600    0.9267  0.9267  0.9667  0.9333
KIRP     0.8719  0.7877  0.8825  0.7877    0.9778  0.8930  0.9357  0.8298
LUAD     0.7088  0.7094  0.6806  0.7747    0.8690  0.7993  0.8034  0.7788
LUSC     0.6648  0.6648  0.6032  0.7464    0.8285  0.8423  0.8563  0.8083
MESO     0.7500  0.7225  0.7633  0.8542    0.8158  0.8025  0.8158  0.8550

Dataset  Before feature representation     ImFeatures
         KNN     SVM     LR      NB        KNN     SVM     LR      NB
ACC      0.8500  0.7500  0.8500  0.6500    0.9000  0.9000  0.9500  0.9000
BLCA     0.7051  0.7046  0.6807  0.7143    0.8524  0.7741  0.7737  0.7173
BRCA     0.6331  0.7182  0.5904  0.6207    0.8192  0.7395  0.7093  0.6367
CHOL     0.6000  0.6500  0.8500  0.9000    1.0000  0.9000  1.0000  1.0000
COAD     0.6331  0.7182  0.5904  0.6207    0.8192  0.7395  0.7093  0.6367
KICH     1.0000  1.0000  1.0000  0.9500    1.0000  1.0000  1.0000  1.0000
KIRC     0.8167  0.8167  0.7500  0.8667    0.9333  0.9167  0.9667  0.9333
KIRP     0.8689  0.7822  0.8789  0.7856    0.9778  0.8911  0.9356  0.8278
LUAD     0.7087  0.7107  0.6822  0.7762    0.8700  0.8007  0.8043  0.7803
LUSC     0.6671  0.6657  0.6038  0.7486    0.8281  0.8438  0.8562  0.8100
MESO     0.7554  0.7286  0.7643  0.8589    0.8196  0.8071  0.8161  0.8571

Table 4

Prediction results of single-omics and multi-omics data

Data                  Accuracy  Sensitivity  Specificity  Precision  Recall  MCC     AUC
BRCA-Transcriptomics  0.6040    0.4063       0.8022       0.6671     0.4063  0.2251  0.6042
BRCA-Methylation      0.5966    0.4893       0.7041       0.6235     0.4893  0.1981  0.5967
BRCA-Multi-omics      0.6329    0.4648       0.8013       0.7013     0.4648  0.2831  0.6331
LUAD-Transcriptomics  0.6120    0.4640       0.7600       0.6572     0.4640  0.2350  0.6120
LUAD-Methylation      0.6454    0.4562       0.8352       0.7379     0.4562  0.3165  0.6457
LUAD-Multi-omics      0.7088    0.6390       0.7783       0.7453     0.6390  0.4234  0.7087
LUSC-Transcriptomics  0.5930    0.2584       0.9279       0.7889     0.2584  0.2519  0.5932
LUSC-Methylation      0.5761    0.3187       0.8363       0.5570     0.3187  0.1476  0.5775
LUSC-Multi-omics      0.6648    0.3752       0.9590       0.9200     0.3752  0.4155  0.6671
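
The first six columns of Table 4 can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions; the paper's own formulas are not part of this page, so treating them as identical is an assumption. The function name `binary_metrics` and the example counts are made up for illustration.

```python
import math


def _ratio(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)  # true positive rate
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "recall": sensitivity,  # recall is the same quantity as sensitivity
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because sensitivity and recall are both TP / (TP + FN), the two columns carry the same values in every row of the table.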

Fig.3

Comparison of the proposed model with other feature representation methods

Table 5

Performance of feature representation in federated learning

Dataset  Algorithm         Accuracy  Sensitivity  Specificity  Precision  Recall
BRCA     Base              0.6329    0.4648       0.8013       0.7013     0.4648
         SMOTE             0.6667    0.5847       0.7483       0.6956     0.5847
         ImFeatures        0.8138    0.8152       0.8122       0.8135     0.8152
         SMOTE+ImFeatures  0.8315    0.8187       0.8442       0.8399     0.8187
LUAD     Base              0.7006    0.6230       0.7793       0.7447     0.6230
         SMOTE             0.6800    0.6230       0.7370       0.7159     0.6230
         ImFeatures        0.8483    0.8190       0.8770       0.8676     0.8190
         SMOTE+ImFeatures  0.8282    0.8023       0.8533       0.8536     0.8023
LUSC     Base              0.6648    0.3838       0.9457       0.8944     0.3838
         SMOTE             0.6028    0.3162       0.8895       0.7700     0.3162
         ImFeatures        0.8078    0.6324       0.9857       0.9778     0.6324
         SMOTE+ImFeatures  0.8147    0.6590       0.9714       0.9578     0.6590
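
Table 5 uses SMOTE both as a baseline and in combination with ImFeatures. As a reminder of what that baseline does, the sketch below generates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified illustration, not the implementation used in the paper: the base sample is drawn at random for every synthetic point, and the name `smote_sketch`, the default `k`, and the seed are assumptions.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic points on segments between minority neighbours.

    Simplified SMOTE-style oversampling (illustrative helper).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority samples (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Four minority points on the unit square, oversampled with six new points.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_synthetic=6, k=2)
print(len(new_points))
```

Every synthetic point lies on a line segment between two real minority samples, so oversampling fills in the minority region rather than duplicating existing samples.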

Table 6

Comparison results with other similar research methods

Dataset    OCF method                  Proposed method
           Acc     AUC     G-means     Acc     AUC     G-means
GSE122497  0.9946  0.9956  0.9956      0.9976  0.9987  0.9986
GSE106817  0.9827  0.9871  0.9871      0.9957  0.9976  0.9976
GSE137140  0.9973  0.9987  0.9987      0.9973  0.9971  0.9971
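
Table 6 reports G-means alongside Acc and AUC. This page does not define the measure; the sketch below assumes the common definition, the geometric mean of sensitivity and specificity, and applies it to the LUSC ImFeatures row of Table 5 as a worked example. The helper name `g_mean` is illustrative.

```python
import math


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sensitivity * specificity)


# Sensitivity and specificity of the LUSC ImFeatures row in Table 5.
value = g_mean(0.6324, 0.9857)
print(f"{value:.4f}")
```

Unlike accuracy, this measure stays low when either class is predicted poorly, which makes it a common choice for imbalanced data.
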