Journal of Jilin University (Medicine Edition), 2024, Vol. 50, Issue (4): 1044-1054. doi: 10.13481/j.1671-587X.20240419

• Research in clinical medicine •

CatBoost algorithm and Bayesian network model analysis based on risk prediction of cardiovascular and cerebrovascular diseases

Aimin WANG1, Fenglin WANG1, Yiming HUANG1, Yaqi XU1, Wenjing ZHANG1, Xianzhu CONG1, Weiqiang SU1, Suzhen WANG1, Mengyao GAO1, Shuang LI1, Yujia KONG1, Fuyan SHI1, Enxue TAO2

  1. Department of Health Statistics, School of Public Health, Shandong Second Medical University, Weifang 261053, China
  2. School of Basic Medical Sciences, Shandong Second Medical University, Weifang 261053, China
  • Received: 2023-09-02 Online: 2024-07-28 Published: 2024-08-01
  • Contact: Fuyan SHI, Enxue TAO E-mail: shifuyan@126.com; sdwftex@163.com

Abstract:

Objective To screen the main characteristic variables affecting the incidence of cardiovascular and cerebrovascular diseases, to construct a Bayesian network model of the incidence risk based on the top 10 characteristic variables, and to provide a reference for predicting that risk.
Methods A total of 315 896 participants and related variables from the UK Biobank database were included. Feature selection was performed with the categorical boosting (CatBoost) algorithm, and the participants were randomly divided into a training set and a test set at a ratio of 7∶3. A Bayesian network model was constructed with the max-min hill-climbing (MMHC) algorithm.
Results The prevalence of cardiovascular and cerebrovascular diseases in this study was 28.8%. The top 10 variables selected by the CatBoost algorithm were age, body mass index (BMI), low-density lipoprotein cholesterol (LDL-C), total cholesterol (TC), the triglyceride-glucose (TyG) index, family history, apolipoprotein A/B ratio, high-density lipoprotein cholesterol (HDL-C), smoking status, and gender. For the CatBoost model, the area under the receiver operating characteristic (ROC) curve (AUC) was 0.770 and the accuracy was 0.764 on the training set; on the validation set, the AUC was 0.759 and the accuracy was 0.763. The clinical efficacy analysis showed a threshold range of 0.06-0.85 for the training set and 0.09-0.81 for the validation set. The Bayesian network analysis indicated that age, gender, smoking status, family history, BMI, and apolipoprotein A/B ratio were directly related to the incidence of cardiovascular and cerebrovascular diseases and were significant risk factors. The TyG index, HDL-C, LDL-C, and TC indirectly affected the risk of cardiovascular and cerebrovascular diseases through their impact on BMI and apolipoprotein A/B ratio.
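As a reading aid for the 7∶3 split and the AUC and accuracy figures reported above, the following is a minimal Python sketch of how such quantities are typically computed. It is not the authors' code: the function names `split_7_3`, `auc`, and `accuracy`, the 0.5 classification threshold, and the four toy labels and scores are all assumptions made for illustration, and the CatBoost model itself is not reproduced here.

```python
import random


def split_7_3(items, seed=0):
    """Shuffle a copy of `items` and split it 70/30 (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = (7 * len(shuffled)) // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def auc(labels, scores):
    """AUC as the probability that a random positive case is scored higher
    than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of cases classified correctly at an assumed probability threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


if __name__ == "__main__":
    # Toy values only; these are not data from the study.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    train, held_out = split_7_3(range(10))
    print(len(train), len(held_out))
    print(auc(toy_labels, toy_scores), accuracy(toy_labels, toy_scores))
```

The pairwise definition of AUC used here is quadratic in the number of cases, which is fine for a toy example; for a cohort of the size described, a rank-based implementation from a standard library would be the practical choice.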
Conclusion Controlling BMI, apolipoprotein A/B ratio, and smoking behavior can reduce the risk of cardiovascular and cerebrovascular diseases. The Bayesian network model can be used to predict the risk of cardiovascular and cerebrovascular disease incidence.
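To illustrate what "direct" and "indirect" relationships mean when a Bayesian network is used for risk prediction, here is a minimal Python sketch of inference by enumeration on a deliberately simplified network. Everything in it is an assumption: the structure (TyG index → BMI → disease, smoking → disease) is a hypothetical four-node subset of the relationships described above, all variables are reduced to binary, every probability is invented, and the MMHC structure-learning step is not shown. It is not the fitted network from the study.

```python
import itertools

# Hypothetical conditional probability tables (invented values, not study results).
P_TYG_HIGH = 0.3                                  # P(TyG index high)
P_SMOKER = 0.2                                    # P(smoker)
P_BMI_HIGH_GIVEN_TYG = {True: 0.6, False: 0.3}    # P(BMI high | TyG high?)
P_CVD_GIVEN = {                                   # P(disease | BMI high?, smoker?)
    (True, True): 0.50,
    (True, False): 0.40,
    (False, True): 0.35,
    (False, False): 0.20,
}


def joint(tyg, bmi, smoker, cvd):
    """Joint probability from the network factorisation
    P(tyg) * P(smoker) * P(bmi | tyg) * P(cvd | bmi, smoker)."""
    p = P_TYG_HIGH if tyg else 1 - P_TYG_HIGH
    p *= P_SMOKER if smoker else 1 - P_SMOKER
    p_bmi = P_BMI_HIGH_GIVEN_TYG[tyg]
    p *= p_bmi if bmi else 1 - p_bmi
    p_cvd = P_CVD_GIVEN[(bmi, smoker)]
    p *= p_cvd if cvd else 1 - p_cvd
    return p


def prob_cvd(evidence):
    """P(disease | evidence), summing the joint over all unobserved variables."""
    names = ("tyg", "bmi", "smoker")
    numerator = 0.0
    denominator = 0.0
    for values in itertools.product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # skip assignments that contradict the evidence
        with_cvd = joint(cvd=True, **assignment)
        numerator += with_cvd
        denominator += with_cvd + joint(cvd=False, **assignment)
    return numerator / denominator


if __name__ == "__main__":
    print(prob_cvd({"tyg": True}), prob_cvd({"tyg": False}))
    print(prob_cvd({"bmi": True, "tyg": True}), prob_cvd({"bmi": True, "tyg": False}))
```

In this toy network the TyG index changes the predicted risk only because it shifts the distribution of BMI; once BMI is observed, the TyG index adds no further information. That is the sense in which a variable acts indirectly through another node.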

Key words: Cardiovascular and cerebrovascular disease, CatBoost algorithm, Bayesian network, Risk inference

CLC Number: 

  • R54