Journal of Jilin University (Medicine Edition), 2022, Vol. 48, Issue (2): 426-435. doi: 10.13481/j.1671-587X.20220220

• Clinical research •

Construction of a postoperative prediction model for pancreatic cancer based on the random forest algorithm

李承圣1, 包绮晗1, 郝晓燕2, 潘庆忠3, 王素珍1, 石福艳1

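  1. Department of Biostatistics, School of Public Health, Weifang Medical University, Weifang, Shandong 261053, China
  2. Department of Internal and External Sciences, School of Nursing, Weifang Medical University, Weifang, Shandong 261053, China
  3. Department of Mathematics, School of Public Health, Weifang Medical University, Weifang, Shandong 261053, China
  • Received: 2021-07-21 Online: 2022-03-28 Published: 2022-05-10
  • Corresponding author: 王素珍 (Suzhen WANG) E-mail: wangsz@wfmc.edu.cn
  • About the author: 李承圣 (Chengsheng LI, born 1998), male, from Rizhao, Shandong Province; master's student; research focus: health statistics.
  • Funding:
    National Natural Science Foundation of China, General Program (81872719); National Natural Science Foundation of China, Young Scientists Fund (81803337)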

Establishment of prediction model for postoperative pancreatic cancer based on random forest algorithm

Chengsheng LI1, Qihan BAO1, Xiaoyan HAO2, Qingzhong PAN3, Suzhen WANG1, Fuyan SHI1

  1. Department of Biostatistics, School of Public Health, Weifang Medical University, Weifang 261053, China
  2. Department of Internal and External Sciences, School of Nursing, Weifang Medical University, Weifang 261053, China
  3. Department of Mathematics, School of Public Health, Weifang Medical University, Weifang 261053, China
  • Received: 2021-07-21 Online: 2022-03-28 Published: 2022-05-10
  • Contact: Suzhen WANG E-mail: wangsz@wfmc.edu.cn

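Abstract (translated from the Chinese): Objective

To build a model for predicting 5-year postoperative survival in patients with pancreatic cancer using the random forest algorithm, and to provide guidance for postoperative prognostic assessment in these patients.

Methods

A total of 42,618 records of pancreatic cancer patients and their outcomes that met the study requirements were collected from the Surveillance, Epidemiology, and End Results (SEER) database of the US National Cancer Institute; after data screening, 4,020 patient records were included. The subjects were randomly divided into a training set and a test set. Feature variables were selected using the χ² test, multivariate logistic regression analysis, and random forest variable importance ranking. The synthetic minority oversampling technique (SMOTE) was used to turn the training set into a balanced data set, and the prediction model was built on the balanced data with the random forest algorithm. On the test set, the model was evaluated on four metrics, the G-mean, sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC), and compared with logistic regression analysis, support vector machine, decision tree, and artificial neural network algorithms.

Results

The original training set contained 2,814 samples, of which 196 were patients who survived ≥5 years, accounting for about 1/13, so the data set was imbalanced. After adjustment with SMOTE, the two classes were approximately balanced. The variables finally included in the model were age, race, primary tumor site, tumor differentiation grade, radiotherapy status, T stage, N stage, marital status, tumor size, and positive lymph node ratio. The random forest model for postoperative pancreatic cancer patients had a G-mean of 0.830 and an AUC of 0.833 [P<0.05, 95% confidence interval (CI) (0.784, 0.876)], outperforming logistic regression analysis, support vector machine, decision tree, and artificial neural network.

Conclusion

The random forest model predicted 5-year postoperative survival in patients with pancreatic cancer better than the other common machine learning methods compared, and can give clinicians a basis for improving the prognosis and survival of these patients.

Keywords: pancreatic neoplasms, random forest algorithm, prediction model, prognosis, model comparison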

Abstract: Objective

To establish a model for predicting 5-year postoperative survival in patients with pancreatic cancer using the random forest algorithm, and to provide guidance for postoperative prognostic evaluation in these patients.

Methods

The Surveillance, Epidemiology, and End Results (SEER) Program database was used to collect 42,618 records of pancreatic cancer patients and their prognosis that met the requirements of this study. After data screening, 4,020 patient records were finally included. The subjects were randomly divided into a training set and a test set, and feature variables were selected by means of the χ² test, multivariate logistic regression analysis, and random forest variable importance ranking. The synthetic minority oversampling technique (SMOTE) was used to adjust the training set to a balanced data set, and the prediction model was constructed on the balanced data set using the random forest algorithm. Using the test set, the model was evaluated on four indexes, the G-mean, sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC), and compared with logistic regression analysis, support vector machine, decision tree, and artificial neural network algorithms.
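
As a rough illustration of the oversampling step described above, the following is a minimal sketch of the interpolation idea behind SMOTE, not the authors' implementation. The function name `smote_sketch`, its parameters, and the toy points are hypothetical. It handles numeric features only; how the study treated categorical variables such as race or marital status is not stated in the abstract.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (hypothetical helper).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy usage: four minority points in two numeric features (made-up values).
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
toy_synthetic = smote_sketch(toy_minority, n_new=6, k=2, seed=1)
```

Synthetic samples would be generated for the training set only, so that the test set keeps the original class distribution, which matches the abstract's description of balancing the training set and evaluating on the test set.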

Results

The original training set contained 2,814 samples, of which 196 were patients with a survival time ≥5 years, accounting for about 1/13, so it was an unbalanced data set. The balanced data set was obtained after adjustment by SMOTE, and the numbers of samples in the two classes were basically balanced. The final variables included in the model were age, race, primary tumor location, tumor differentiation degree, radiotherapy or not, T stage, N stage, marital status, tumor size, and positive lymph node ratio. The G-mean of the pancreatic cancer prediction model based on the random forest algorithm was 0.830, and the AUC was 0.833 [P<0.05, 95% confidence interval (CI) (0.784, 0.876)], which was better than logistic regression analysis, support vector machine, decision tree, and artificial neural network.
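
For readers unfamiliar with the G-mean, it is commonly defined as the geometric mean of sensitivity and specificity, which penalizes a model that does well on only one class of an unbalanced data set. The sketch below assumes that common definition (the abstract does not give the formula); the function names and the toy labels are hypothetical and are not the study's data.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, G-mean).

    G-mean is taken here as sqrt(sensitivity * specificity), the usual
    definition; this is an assumption about the metric in the abstract.
    """
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Toy labels (1 = survived >= 5 years); made-up values for illustration.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
toy_sens, toy_spec, toy_gmean = sensitivity_specificity_gmean(toy_true, toy_pred)
```

Because the G-mean is a product, a classifier that labels every patient as the majority class gets a sensitivity of 0 and therefore a G-mean of 0, even though its plain accuracy on an unbalanced test set would look high.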

Conclusion

The prediction model of pancreatic cancer based on the random forest algorithm performs better than other common machine learning methods in predicting the 5-year postoperative survival of patients with pancreatic cancer. It can provide a basis for clinicians to improve the prognosis and survival of patients with pancreatic cancer.

Key words: Pancreatic neoplasms, Random forest algorithm, Prediction model, Prognosis, Model comparison

CLC number:

  • R735.9