Journal of Jilin University (Engineering and Technology Edition) ›› 2017, Vol. 47 ›› Issue (2): 639-646. doi: 10.13229/j.cnki.jdxbgxb201702040


HIV-1 protease cleavage site prediction based on feature selection and support vector machine

YUAN Zhe-ming1, 2, ZHANG Hong-yang1, 2, CHEN Yuan1

  1. Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha 410128, China;
  2. Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, China
  • Received: 2015-12-15 Online: 2017-03-20 Published: 2017-03-20
  • Corresponding author: CHEN Yuan (1987-), male, lecturer. Research interest: big data analysis. E-mail: chenyuan0510@126.com
  • About the author: YUAN Zhe-ming (1971-), male, professor, doctoral supervisor. Research interests: big data analysis, bioinformatics. E-mail: zhmyuan@sina.com
  • Supported by:
    Specialized Research Fund for the Doctoral Program of Higher Education (20124320110002); Changsha Science and Technology Program (K1406018-21).

Abstract: To improve the prediction accuracy of HIV-1 protease cleavage sites, a cleavage site prediction model based on feature selection and a support vector machine (SVM) is proposed. First, an HIV-1 protease cleavage site dataset of 5830 samples is analyzed, and the feature vector for the cleavage sites is selected with an automatic termination method built on the minimal redundancy maximal relevance (mRMR) principle. Then, the feature vector is fed to an SVM for learning and training to build a classification model for HIV-1 protease cleavage sites. Finally, simulation tests are carried out with the MATLAB 2014 simulation toolbox. The results show that, with the fewest features, the proposed model predicts cleavage sites more accurately than the reference models and the results reported in the literature, and that the selected features are readily interpretable and biologically meaningful.

Key words: biophysics, cleavage site prediction, feature selection, minimal redundancy maximal relevance (mRMR), support vector machine (SVM)
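
The feature selection step named in the abstract can be illustrated with a minimal sketch of greedy mRMR on discrete features. This is not the authors' implementation: the difference-form score (relevance minus mean redundancy), the stopping rule (halt when no remaining feature scores above zero), the function names, and the toy data are all assumptions made for illustration. The abstract does not specify how the automatic termination is defined, and the SVM training step is omitted here.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(columns, labels):
    """Greedy mRMR: pick the feature with the highest relevance minus mean redundancy.

    columns: dict mapping a feature name to its list of discrete values.
    Stops when no remaining feature has a positive score (an assumed
    termination rule, not necessarily the one used in the paper).
    """
    relevance = {f: mutual_information(v, labels) for f, v in columns.items()}
    selected = []
    remaining = list(columns)  # keeps insertion order, so ties break deterministically

    def score(f):
        if not selected:
            return relevance[f]
        redundancy = sum(
            mutual_information(columns[f], columns[s]) for s in selected
        ) / len(selected)
        return relevance[f] - redundancy

    while remaining:
        best = max(remaining, key=score)
        if score(best) <= 0:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 8 samples, binary class label (cleaved / not cleaved).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "informative": [0, 0, 0, 0, 1, 1, 1, 1],  # tracks the label exactly
    "duplicate": [0, 0, 0, 0, 1, 1, 1, 1],    # fully redundant with "informative"
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],        # independent of the label
}
selected = mrmr_select(columns, labels)
print(selected)
```

On this toy input the sketch keeps only "informative": the duplicate adds one bit of relevance but also one bit of redundancy, and the noise feature carries no information about the label, so the search stops after the first pick. In a pipeline like the one the abstract describes, the selected columns would then be passed to an SVM classifier.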

CLC number:

  • Q6