Journal of Jilin University (Medicine Edition) ›› 2022, Vol. 48 ›› Issue (2): 426-435. doi: 10.13481/j.1671-587X.20220220

• Research in clinical medicine •

Establishment of prediction model for postoperative pancreatic cancer based on random forest algorithm

Chengsheng LI1, Qihan BAO1, Xiaoyan HAO2, Qingzhong PAN3, Suzhen WANG1, Fuyan SHI1

  1. Department of Biostatistics, School of Public Health, Weifang Medical University, Weifang 261053, China
  2. Department of Internal and External Sciences, School of Nursing, Weifang Medical University, Weifang 261053, China
  3. Department of Mathematics, School of Public Health, Weifang Medical University, Weifang 261053, China
  • Received: 2021-07-21 Online: 2022-03-28 Published: 2022-05-10
  • Contact: Suzhen WANG E-mail: wangsz@wfmc.edu.cn

Abstract: Objective

To establish a random forest prediction model of 5-year postoperative survival in patients with pancreatic cancer, and to provide guidance for evaluating their postoperative prognosis.

Methods

Records of 42,618 patients with pancreatic cancer, together with their prognostic data, that met the requirements of this study were collected from the Surveillance, Epidemiology, and End Results (SEER) Program database. After data screening, 4,020 patients were finally included. The patients were randomly divided into a training set and a test set. Characteristic variables were selected by the χ² test, multivariate logistic regression analysis, and random forest variable importance ranking. The Synthetic Minority Oversampling Technique (SMOTE) was used to adjust the training set to a balanced data set, and the prediction model was constructed on this balanced data set with the random forest algorithm. On the test set, the model was evaluated against logistic regression analysis, support vector machine, decision tree, and artificial neural network algorithms using four evaluation indexes: G-mean, sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC).
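The oversampling step can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is not the authors' implementation. The function name `smote`, the parameter `k`, and the toy data are assumptions made for illustration, and the sketch handles numeric features only; how the study treated categorical predictors is not stated in the abstract.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Illustrative helper (hypothetical name and signature), numeric features only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made up): two numeric features per minority-class patient.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
majority_count = 12  # hypothetical size of the majority class

# Generate enough synthetic samples to match the majority class size.
new_points = smote(minority, n_new=majority_count - len(minority), k=2)
balanced_minority = minority + new_points
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class. Oversampling is applied to the training set only, as in the abstract, so the test set keeps its original class distribution.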

Results

The original training set contained 2,814 samples, of which 196 were patients with a survival time ≥5 years, a ratio of about 1:13 to the remaining 2,618 samples; the data set was therefore unbalanced. After adjustment by SMOTE, a balanced data set was obtained in which the numbers of samples in the two classes were approximately equal. The variables finally included in the model were age, race, primary tumor location, degree of tumor differentiation, radiotherapy or not, T stage, N stage, marital status, tumor size, and lymph node positive rate. The random forest prediction model for pancreatic cancer had a G-mean of 0.830 and an AUC of 0.833, both better than those of logistic regression analysis, support vector machine, decision tree, and artificial neural network.
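A short sketch shows why G-mean is a useful index for a data set this unbalanced. It assumes the usual definition, the square root of sensitivity times specificity, which the abstract does not spell out. The class counts are taken from the training set described above; the always-negative classifier and the function names are hypothetical, chosen only to illustrate the point.

```python
import math


def rates(tp, fn, tn, fp):
    """Sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sensitivity * specificity)


# Class counts from the original training set reported in the abstract.
n_total = 2814
n_pos = 196                 # survival time >= 5 years
n_neg = n_total - n_pos     # remaining samples

# Hypothetical classifier that predicts "< 5 years" for every patient.
sens, spec = rates(tp=0, fn=n_pos, tn=n_neg, fp=0)
accuracy = n_neg / n_total
trivial_g = g_mean(sens, spec)
```

Here `accuracy` is about 0.93 even though the classifier never identifies a single long-term survivor, while `trivial_g` is 0. G-mean is high only when both classes are predicted well, which is why it suits an evaluation where the minority class is the one of clinical interest.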

Conclusion

The random forest prediction model performs better than other common machine learning methods in predicting 5-year survival in patients with pancreatic cancer. It can provide a basis for clinicians to improve the prognosis and survival of these patients.

Key words: Pancreatic neoplasms, Random forest algorithm, Prediction model, Prognosis, Model comparison

CLC Number: 

  • R735.9