Journal of Jilin University (Medicine Edition) ›› 2025, Vol. 51 ›› Issue (4): 1039-1051. doi: 10.13481/j.1671-587X.20250420

• Clinical Research •

Construction of diagnostic model for Alzheimer’s disease and immune analysis based on bioinformatics and machine learning

Linrui XU (徐林瑞)1, Yiyu ZHANG (张译予)1, Jiaqi CUI (崔家齐)1, Xianzhu CONG (丛显铸)1, Shuang LI (李爽)1, Jiayu GE (葛佳瑜)1, Yujia KONG (孔雨佳)1, Suzhen WANG (王素珍)1, Fuyan SHI (石福艳)1, Jinrong WANG (王金荣)2

  1. Department of Health Statistics, School of Public Health, Shandong Second Medical University, Weifang 261053, China
  2. Department of Traditional Chinese Medicine, Affiliated Hospital, Shandong Second Medical University, Weifang 261041, China
  • Received: 2024-10-16 Accepted: 2025-05-08 Online: 2025-07-28 Published: 2025-08-25
  • Contact: Suzhen WANG, Jinrong WANG E-mail: wangsz@sdsmu.edu.cn; wjrwkl@126.com
  • About the first author: Linrui XU (2000-), male, from Tai'an, Shandong Province; master's student, mainly engaged in research on health statistics.
  • Supported by:
    National Natural Science Foundation of China, General Program (81872719); National Natural Science Foundation of China, Young Scientists Fund (81803337); Natural Science Foundation of Shandong Province (Department of Science and Technology), General Program (ZR2023MH034); Weifang Traditional Chinese Medicine Research Project, Category II (WFZYY2024-2-001)

Abstract:

Objective: To screen Alzheimer’s disease (AD)-related genes and construct a diagnostic model for AD using bioinformatics and machine learning (ML) algorithms, to explore the immunological characteristics of patients with AD, and to provide novel biomarkers for AD diagnosis. Methods: The AD-related gene expression dataset GSE125583 was downloaded from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) were identified by differential expression analysis. Gene Ontology (GO) functional enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) signaling pathway enrichment analyses were performed to explore the biological functions and signaling pathways of the DEGs. A protein-protein interaction (PPI) network was constructed, and hub genes were screened using Cytoscape software combined with three ML algorithms: least absolute shrinkage and selection operator (LASSO) regression, eXtreme Gradient Boosting (XGBoost), and random forest (RF). The screened hub genes were used to build an AD diagnostic model with RF, and the features were ranked by importance. The performance of the model and of the key genes was evaluated on a test set. Single-sample gene set enrichment analysis (ssGSEA) was used to compare immune cell infiltration between the AD group and the control group. Results: Differential analysis identified 1,287 DEGs. GO functional enrichment analysis showed that the DEGs were primarily involved in biological functions related to neural signaling, synapses, and vesicles. KEGG signaling pathway enrichment analysis showed that the DEGs were mainly enriched in ion transport, neurotransmitter, and ligand-gated channel pathways. The three ML algorithms jointly identified nine overlapping hub genes. In the AD diagnostic model, the four key genes with the highest diagnostic performance were adenylate cyclase-activating polypeptide 1 (ADCYAP1), brain-derived neurotrophic factor (BDNF), platelet-derived growth factor receptor β (PDGFRB), and C-X-C motif chemokine receptor 4 (CXCR4), with corresponding areas under the receiver operating characteristic (ROC) curve (AUC) of 0.852, 0.795, 0.820, and 0.756, respectively. The model achieved an AUC of 0.828, an accuracy of 81.25%, a sensitivity of 84.40%, and a specificity of 71.43%. Immune cell infiltration analysis showed higher infiltration of macrophages, monocytes, various natural killer (NK) cells, and lymphocytes in AD tissue. Among these, NK cells/natural killer T (NKT) cells and plasmacytoid dendritic cells were significantly correlated with the four key genes (P<0.05). Conclusion: The feature genes screened with bioinformatics and ML algorithms show a certain diagnostic ability for AD. Genes such as ADCYAP1 may serve as potential biomarkers for AD diagnosis, which is of significance for the early prevention and treatment of AD.

Key words: Bioinformatics, Machine learning, Alzheimer’s disease, Diagnostic model, Adenylate cyclase-activating polypeptide 1 gene
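
The model metrics reported in the abstract (AUC, accuracy, sensitivity, specificity) follow from standard definitions. The sketch below, in plain Python, shows how such values can be computed from a confusion matrix and from classifier scores. The counts and scores are hypothetical and are not taken from the study. The names `ConfusionCounts` and `auc` are illustrative, and the AUC uses the pairwise (Mann-Whitney) definition, which is an assumption about the calculation and not necessarily what the authors' software did.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ConfusionCounts:
    """Counts from a binary test set (AD = positive, control = negative)."""
    tp: int  # AD samples predicted as AD
    fn: int  # AD samples predicted as control
    tn: int  # control samples predicted as control
    fp: int  # control samples predicted as AD


def sensitivity(c: ConfusionCounts) -> float:
    """Proportion of true AD samples that the model detects."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Proportion of true controls that the model correctly rejects."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Proportion of all samples classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical test-set counts, for illustration only (not the study's data).
counts = ConfusionCounts(tp=40, fn=10, tn=15, fp=5)
print(f"sensitivity = {sensitivity(counts):.4f}")
print(f"specificity = {specificity(counts):.4f}")
print(f"accuracy    = {accuracy(counts):.4f}")

# Hypothetical predicted probabilities for AD and control samples.
print(f"AUC         = {auc([0.9, 0.8, 0.4], [0.5, 0.3]):.4f}")
```

Because sensitivity and specificity are computed on different subsets of the test set, accuracy is their average weighted by class size. This is why a model evaluated on a test set with more AD samples than controls can report an accuracy closer to its sensitivity than to its specificity.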

CLC Number:

  • R749.16