Journal of Jilin University (Medicine Edition) ›› 2024, Vol. 50 ›› Issue (4): 1044-1054. doi: 10.13481/j.1671-587X.20240419

• Clinical Research •

CatBoost algorithm and Bayesian network model analysis based on risk prediction of cardiovascular and cerebrovascular diseases

Aimin WANG1, Fenglin WANG1, Yiming HUANG1, Yaqi XU1, Wenjing ZHANG1, Xianzhu CONG1, Weiqiang SU1, Suzhen WANG1, Mengyao GAO1, Shuang LI1, Yujia KONG1, Fuyan SHI1, Enxue TAO2

  1. Department of Health Statistics, School of Public Health, Shandong Second Medical University, Weifang 261053, China
  2. School of Basic Medical Sciences, Shandong Second Medical University, Weifang 261053, China
  • Received: 2023-09-02 Online: 2024-07-28 Published: 2024-08-01
  • Contact: Fuyan SHI, Enxue TAO E-mail: shifuyan@126.com; sdwftex@163.com
  • About the author: Aimin WANG (2000-), male, from Linyi, Shandong Province, master's degree candidate, mainly engaged in research on health statistics.
  • Supported by:
    National Natural Science Foundation of China (81803337); Scientific Research Project of the National Bureau of Statistics (2018LY79); Natural Science Foundation of the Shandong Provincial Department of Science and Technology (ZR2019MH034); Youth Innovation Talent Introduction and Education Program for Colleges and Universities of the Shandong Provincial Department of Education (2019-6-156); Doctoral Start-up Fund of Weifang Medical University (2017BSQD51)

Abstract:

Objective: To identify the main feature variables affecting the incidence of cardiovascular and cerebrovascular diseases, to construct a Bayesian network model of incidence risk from the 10 top-ranked feature variables, and to provide a reference for predicting the risk of cardiovascular and cerebrovascular diseases. Methods: A total of 315 896 participants and related variables were included from the UK Biobank database. Feature selection was performed with the categorical boosting (CatBoost) algorithm, all participants were randomly divided into a training set and a test set at a ratio of 7:3, and a Bayesian network model was constructed with the max-min hill-climbing (MMHC) algorithm. Results: The prevalence of cardiovascular and cerebrovascular diseases in the study population was 28.8%. The top 10 variables selected by the CatBoost algorithm were age, body mass index (BMI), low-density lipoprotein cholesterol (LDL-C), total cholesterol (TC), the triglyceride-glucose (TyG) index, family history, apolipoprotein A/B ratio, high-density lipoprotein cholesterol (HDL-C), smoking status, and gender. For the CatBoost model, the area under the receiver operating characteristic (ROC) curve (AUC) was 0.770 and the accuracy was 0.764 in the training set; in the validation set, the AUC was 0.759 and the accuracy was 0.763. In the clinical efficacy analysis, the threshold range was 0.06-0.85 for the training set and 0.09-0.81 for the validation set. In the Bayesian network model, age, gender, smoking status, family history, BMI, and apolipoprotein A/B ratio were directly related to cardiovascular and cerebrovascular diseases and were important risk factors for them, whereas the TyG index, HDL-C, LDL-C, and TC affected disease risk indirectly through their effects on BMI and the apolipoprotein A/B ratio. Conclusion: Controlling BMI, apolipoprotein A/B ratio, and smoking behavior can reduce the risk of cardiovascular and cerebrovascular diseases. The Bayesian network model can be used to predict the risk of these diseases.
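
The evaluation step summarized in the abstract (a random 7:3 split, then AUC and accuracy on each subset) can be sketched in plain Python as below. This is an illustrative sketch only, not the authors' code: the function names `split_7_3`, `roc_auc`, and `accuracy`, the fixed seed, and the 0.5 classification cutoff are all assumptions, since the abstract does not report the software, seed, or cutoff used.

```python
import random


def split_7_3(n_rows, seed=0):
    """Shuffle row indices and return (train_idx, test_idx) in a 7:3 ratio.

    The seed is a hypothetical choice made here for reproducibility.
    """
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = (7 * n_rows) // 10  # integer arithmetic avoids float rounding
    return idx[:cut], idx[cut:]


def roc_auc(y_true, scores):
    """AUC as the probability that a random case outscores a random non-case.

    A tie between a case and a non-case counts as one half. This pairwise
    form is quadratic in sample size, so it suits small illustrations only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of rows classified correctly at an assumed probability cutoff."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)
```

In use, a fitted classifier's predicted probabilities for the training rows and for the held-out rows would each be passed to `roc_auc` and `accuracy`, giving one pair of figures per subset, as the abstract reports.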

Key words: Cardiovascular and cerebrovascular disease, CatBoost algorithm, Bayesian network, Risk inference

CLC Number:

  • R54