PLOS ONE 2012

Prediction of Protein Domain with mRMR Feature Selection and Analysis

DOI: 10.1371/journal.pone.0039308

Bi-Qing Li, Le-Le Hu, Lei Chen, Kai-Yan Feng, Yu-Dong Cai, Kuo-Chen Chou

Abstract:

Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, effective methods for predicting protein domains from sequence information alone are highly desirable, because they would facilitate protein structure prediction and speed up functional annotation. However, although many efforts have been made in this regard, predicting protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, about 28–40% higher than those achieved by the existing method on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, which is consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
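
The mRMR ranking and IFS steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, and the difference form of the mRMR score (relevance minus mean redundancy) are assumptions made for illustration, and a leave-one-out 1-nearest-neighbour evaluator stands in for the random forest so that the example needs only the Python standard library.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mrmr_rank(features, labels):
    """Greedy mRMR ordering; `features` is a list of feature columns.

    Each step picks the feature maximising relevance to the labels
    minus its mean mutual information with already selected features.
    """
    relevance = [mutual_information(col, labels) for col in features]
    remaining = list(range(len(features)))
    ranked = []
    while remaining:
        def score(j):
            if not ranked:
                return relevance[j]
            redundancy = sum(mutual_information(features[j], features[k])
                             for k in ranked) / len(ranked)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def loo_accuracy(features, labels, subset):
    """Leave-one-out 1-NN accuracy (Hamming distance) on a feature subset.

    Stand-in evaluator; the paper uses a random forest here.
    """
    n = len(labels)
    correct = 0
    for i in range(n):
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: sum(features[f][i] != features[f][j]
                                        for f in subset))
        correct += labels[nearest] == labels[i]
    return correct / n


def incremental_feature_selection(features, labels, ranked,
                                  evaluate=loo_accuracy):
    """Evaluate the top-k ranked features for k = 1..N; keep the best k."""
    best_acc, best_k = -1.0, 0
    for k in range(1, len(ranked) + 1):
        acc = evaluate(features, labels, ranked[:k])
        if acc > best_acc:  # strict: ties keep the smaller feature set
            best_acc, best_k = acc, k
    return ranked[:best_k], best_acc


# Hypothetical toy data: feature 1 duplicates feature 0, feature 2 is new.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
features = [a, list(a), b]
labels = [2 * x + y for x, y in zip(a, b)]

ranked = mrmr_rank(features, labels)
selected, accuracy = incremental_feature_selection(features, labels, ranked)
print(ranked, selected, accuracy)
```

In this toy case all three features are equally relevant to the labels, so the redundancy term is what pushes the duplicated feature to the end of the ranking, and IFS then stops at the smallest prefix that reaches the best accuracy. Any evaluator with the same signature, for example one wrapping a random forest with cross-validation, could be passed as `evaluate`.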
