grDNA-Prot: Prediction of DNA-Binding Proteins Based on Amino Acid Physicochemical Properties and Support Vector Machine
Abstract:
DNA-binding proteins play an important role in various intra- and extracellular activities. This paper proposes grDNA-Prot, a new method for predicting DNA-binding proteins. Protein sequence information is described by the composition frequencies of the 20 amino acids together with a cylindrical graphical representation built on the 531 amino acid physicochemical property indices in the AAindex database. Three feature selection methods are then used to choose the optimal features, and a support vector machine (SVM) model for identifying DNA-binding proteins is built with 5-fold cross-validation. To test the effectiveness of the method, its accuracy is compared with that of other methods on an independent test dataset. The results indicate that hydrophobicity (H), physicochemical properties (P), and alpha and turn properties (A) are the main amino acid physicochemical properties that distinguish DNA-binding proteins from non-DNA-binding proteins.
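As a rough illustration of the first feature group mentioned in the abstract (the frequencies of the 20 amino acids), the following minimal Python sketch computes a 20-dimensional composition vector from a sequence. It is not the authors' implementation: the function name `amino_acid_composition`, the residue ordering, the choice to ignore non-standard residues, and the toy sequence are all assumptions made for illustration. The cylindrical graphical representation, the AAindex-based features, feature selection, and the SVM are not reproduced here.

```python
from collections import Counter
from typing import List

# The 20 standard residues as one-letter codes; this ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the relative frequency of each of the 20 standard amino acids."""
    seq = sequence.upper()
    # Non-standard symbols (for example X or B) are ignored in this sketch.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, used only to show the call.
features = amino_acid_composition("MKRLLAAKDE")
print(len(features))
```

A vector like this would typically be concatenated with the other descriptors before feature selection and classifier training.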