Journal of Jilin University (Science Edition) ›› 2023, Vol. 61 ›› Issue (3): 651-657.


• Corresponding author: ZHOU Fengfeng, E-mail: ffzhou@jlu.edu.cn

Binding Prediction Algorithm of HLA-Ⅰ and Polypeptides Based on Pre-trained Model ProtBert

ZHOU Fengfeng1,2, ZHANG Yaqi1   

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, China;
    2. Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
  • Received: 2022-01-11 Online: 2023-05-26 Published: 2023-05-26


Abstract: Existing HLA class Ⅰ (HLA-Ⅰ) molecule-polypeptide binding affinity prediction algorithms rely on traditional sequence scoring functions for feature construction. To break through the limitations of constructing amino acid sequence features with classical machine learning algorithms, we proposed ProHLAⅠ, a binding prediction algorithm for HLA-Ⅰ and polypeptides based on the pre-trained protein model ProtBert. Exploiting the compositional commonality between the language of living organisms and natural text, the algorithm treated amino acid sequences as sentences, and extracted features of the HLA-Ⅰ and polypeptide sequences by integrating the structural advantages of the pre-trained ProtBert model, BiLSTM encoding, and the attention mechanism, thereby realizing site-independent polypeptide binding prediction for HLA-Ⅰ. The experimental results show that the model achieves the best performance on two independent test sets.
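The pipeline described in the abstract (pre-trained embeddings, BiLSTM encoding, an attention-weighted summary, and a binding classifier) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: a plain `nn.Embedding` layer stands in for the pre-trained ProtBert encoder so the code runs without downloading weights, and all class names, layer dimensions, and the example 34-residue-pseudo-sequence-plus-9-mer input layout are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ProHLA1Sketch(nn.Module):
    """Illustrative sketch of a ProtBert + BiLSTM + attention binding predictor.

    In the paper, per-residue embeddings come from the pre-trained ProtBert
    model; here a trainable Embedding layer is a stand-in for that encoder.
    """

    def __init__(self, vocab_size=25, emb_dim=64, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # stand-in for ProtBert
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # per-residue attention score
        self.classifier = nn.Linear(2 * hidden, 1)  # binding vs. non-binding

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer residue codes for the concatenated
        # HLA-Ⅰ sequence and candidate peptide
        h, _ = self.bilstm(self.embed(tokens))      # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over residues
        ctx = (w * h).sum(dim=1)                    # attention-weighted summary
        return torch.sigmoid(self.classifier(ctx)).squeeze(-1)

model = ProHLA1Sketch()
# Hypothetical batch: 34-residue HLA pseudo-sequence + 9-mer peptide = 43 tokens
batch = torch.randint(0, 25, (4, 43))
probs = model(batch)
print(probs.shape)  # torch.Size([4]), one binding probability per pair
```

Because the attention summary is a weighted sum over all positions, the same network handles peptides of different lengths bound to different HLA-Ⅰ alleles, which is what makes a site-independent predictor possible.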

Key words: HLA-Ⅰ binding peptide prediction, natural language processing, attention mechanism, BERT model, bidirectional long short-term memory (BiLSTM) model
