Journal of Jilin University (Science Edition) ›› 2023, Vol. 61 ›› Issue (3): 651-657.


• Corresponding author: ZHOU Fengfeng, E-mail: ffzhou@jlu.edu.cn

Binding Prediction Algorithm of HLA-Ⅰ and Polypeptides Based on Pre-trained Model ProtBert

ZHOU Fengfeng1,2, ZHANG Yaqi1   

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, China;
    2. Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
  • Received: 2022-01-11 Online: 2023-05-26 Published: 2023-05-26


Abstract: Existing HLA class Ⅰ (HLA-Ⅰ) molecule-polypeptide binding affinity prediction algorithms rely on traditional sequence scoring functions for feature construction. To break through the limitations of constructing amino acid sequence features with classical machine learning algorithms, we proposed ProHLAⅠ, a binding prediction algorithm for HLA-Ⅰ and polypeptides based on the pre-trained protein model ProtBert. Exploiting the compositional commonality between the language of living organisms and natural text, the algorithm treated amino acid sequences as sentences, and extracted features of the HLA-Ⅰ and polypeptide sequences by integrating the structural advantages of the pre-trained ProtBert model, BiLSTM encoding, and the attention mechanism, thereby realizing site-independent polypeptide binding prediction for HLA-Ⅰ. The experimental results show that the model achieves the best performance on two independent test sets.
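The pipeline described in the abstract (pre-trained embeddings, BiLSTM encoding, an attention-weighted summary, and a binding classifier) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: a plain `nn.Embedding` layer stands in for the pre-trained ProtBert encoder so the code runs without downloading weights, and all class names, layer dimensions, and the example 34-residue-pseudo-sequence-plus-9-mer input layout are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ProHLA1Sketch(nn.Module):
    """Illustrative sketch of a ProtBert + BiLSTM + attention binding predictor.

    In the paper, per-residue embeddings come from the pre-trained ProtBert
    model; here a trainable Embedding layer is a stand-in for that encoder.
    """

    def __init__(self, vocab_size=25, emb_dim=64, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # stand-in for ProtBert
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # per-residue attention score
        self.classifier = nn.Linear(2 * hidden, 1)  # binding vs. non-binding

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer residue codes for the concatenated
        # HLA-Ⅰ sequence and candidate peptide
        h, _ = self.bilstm(self.embed(tokens))      # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over residues
        ctx = (w * h).sum(dim=1)                    # attention-weighted summary
        return torch.sigmoid(self.classifier(ctx)).squeeze(-1)

model = ProHLA1Sketch()
# Hypothetical batch: 34-residue HLA pseudo-sequence + 9-mer peptide = 43 tokens
batch = torch.randint(0, 25, (4, 43))
probs = model(batch)
print(probs.shape)  # torch.Size([4]), one binding probability per pair
```

Because the attention summary is a weighted sum over all positions, the same network handles peptides of different lengths bound to different HLA-Ⅰ alleles, which is what makes a site-independent predictor possible.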

Key words: HLA-Ⅰ binding peptide prediction, natural language processing, attention mechanism, BERT model, bidirectional long short-term memory (BiLSTM) model
