[1] Ma Xin, Guo Jing, Sun Xiao. Prediction of RNA-binding residues in proteins using random forest[J]. Journal of Southeast University (Natural Science Edition), 2012, 42(1): 50-54. [doi:10.3969/j.issn.1001-0505.2012.01.010]

Prediction of RNA-binding residues in proteins using random forest

Journal of Southeast University (Natural Science Edition) [ISSN: 1001-0505 / CN: 32-1178/N]

Volume:
42
Issue:
No. 1, 2012
Pages:
50-54
Section:
Automation
Publication date:
2012-01-18

Article Info

Title:
Prediction of RNA-binding residues in proteins using random forest
Author(s):
Ma Xin (1, 2), Guo Jing (1), Sun Xiao (1)
(1) State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, China
(2) College of Golden Audit, Nanjing Audit University, Nanjing 210029, China
Keywords:
random forest; position specific scoring matrix (PSSM); nested cross-validation; RNA-binding residue
CLC number:
TP181
DOI:
10.3969/j.issn.1001-0505.2012.01.010
Abstract:
A classification model is constructed to predict RNA-binding residues in protein sequences using the random forest (RF) algorithm. For feature extraction, in addition to function-related structural features (secondary structure information) and orthogonal binary encoding of the sequence, a novel feature named PSSM-PP (position specific scoring matrix combined with physicochemical properties) is proposed. PSSM-PP captures both the evolutionary conservation of the protein sequence and the physicochemical properties of amino acids that are related to protein-RNA binding. Because the sample data set is large, the fast RF algorithm was chosen to build the model. The resulting classifier achieves an overall accuracy (ACC) of 87.02%, a specificity (SP) of 95.62%, a sensitivity (SE) of 51.16%, and a Matthews correlation coefficient (MCC) of 0.5336. A web server for predicting RNA-binding residues was also built and is freely available.
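The four measures reported in the abstract (ACC, SE, SP, MCC) can be computed from the counts of a binary confusion matrix. The following is a minimal sketch assuming the standard textbook definitions; it is not the authors' evaluation code. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return ACC, SE, SP and MCC from confusion-matrix counts.

    tp/fn: binding residues predicted as binding / non-binding.
    tn/fp: non-binding residues predicted as non-binding / binding.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total          # overall accuracy
    se = tp / (tp + fn)              # sensitivity (recall on binding residues)
    sp = tn / (tn + fp)              # specificity (recall on non-binding residues)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SE": se, "SP": sp, "MCC": mcc}


# Hypothetical counts for illustration only (NOT the paper's data):
# an imbalanced set with 100 binding and 200 non-binding residues.
metrics = binary_metrics(tp=50, fp=10, tn=190, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this illustrative case the accuracy is driven mostly by the larger non-binding class, while MCC also reflects the binding residues that were missed, which is one reason MCC is commonly reported alongside ACC for imbalanced data.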

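Memo:
Biographies: Ma Xin (born 1982), female, PhD candidate, lecturer; Sun Xiao (corresponding author), male, PhD, professor, doctoral supervisor, xsun@seu.edu.cn.
Foundation item: National Natural Science Foundation of China (Nos. 61073141, 60971099).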
Last Update: 2012-01-20