The ability to catalytically cleave protein substrates after synthesis is fundamental for all forms of life. Accordingly, site-specific proteolysis is one of the most important post-translational modifications. The key to understanding the physiological role of a protease is to identify its natural substrate(s). Knowledge of the substrate specificity of a protease can dramatically improve our ability to predict its target protein substrates, but this information must be used effectively if protein substrates are to be identified efficiently by in silico approaches. To address this problem, we present PROSPER, an integrated feature-based server for in silico identification of protease substrates and their cleavage sites for twenty-four different proteases. PROSPER combines established specificity information for these proteases (derived from the MEROPS database) with a machine learning approach to predict protease cleavage sites using different but complementary sequence and structure characteristics. Features used by PROSPER include the local amino acid sequence profile, predicted secondary structure, solvent accessibility and predicted native disorder. Thus, for proteases with known amino acid specificity, PROSPER provides a convenient, pre-prepared tool for identifying protein substrates of these enzymes. Systematic prediction analysis for the twenty-four proteases included thus far showed that these features strongly improve cleavage site prediction, as evidenced by their contribution to the identification of known cleavage sites in substrates of these enzymes. In comparison with two state-of-the-art prediction tools, PoPS and SitePrediction, PROSPER achieves greater accuracy and coverage.
To our knowledge, PROSPER is the first comprehensive server capable of predicting cleavage sites of multiple proteases within a single substrate sequence using machine learning techniques. It is freely available at http://lightning.med.monash.edu.au/PROSPER/.
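To illustrate how sequence and predicted structural features of the kind listed above might be combined into a single input vector for a machine learning model, the following is a minimal sketch and not PROSPER's actual implementation. The window size (P4 to P4' in the Schechter and Berger nomenclature), the one-hot sequence encoding, the three-state secondary structure labels, and the function names `window_features` and `candidate_sites` are all assumptions made for illustration. The per-residue secondary structure, solvent accessibility and disorder values are assumed to come from external predictors.

```python
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SS_STATES = "HEC"  # helix, strand, coil (assumed three-state labels)
PER_RESIDUE = len(AMINO_ACIDS) + len(SS_STATES) + 2  # 25 values per position


def one_hot(symbol: str, alphabet: str) -> List[float]:
    """One-hot encode a symbol; symbols outside the alphabet give all zeros."""
    return [1.0 if symbol == a else 0.0 for a in alphabet]


def window_features(
    sequence: str,
    secondary: str,
    accessibility: Sequence[float],
    disorder: Sequence[float],
    site: int,
    flank: int = 4,  # assumed window: `flank` residues on each side of the bond
) -> List[float]:
    """Encode the residues around the peptide bond between site-1 and site.

    `site` is the 0-based index of the P1' residue (hypothetical convention).
    """
    features: List[float] = []
    for pos in range(site - flank, site + flank):
        if 0 <= pos < len(sequence):
            features += one_hot(sequence[pos], AMINO_ACIDS)
            features += one_hot(secondary[pos], SS_STATES)
            features += [accessibility[pos], disorder[pos]]
        else:
            # Window positions beyond either terminus are zero-padded.
            features += [0.0] * PER_RESIDUE
    return features


def candidate_sites(
    sequence: str,
    secondary: str,
    accessibility: Sequence[float],
    disorder: Sequence[float],
    flank: int = 4,
) -> Dict[int, List[float]]:
    """Build one feature vector for every peptide bond in the sequence."""
    return {
        site: window_features(sequence, secondary, accessibility, disorder, site, flank)
        for site in range(1, len(sequence))
    }


if __name__ == "__main__":
    # Toy input: the sequence and all predicted values are made up.
    seq = "MDEVDGAK"
    ss = "CCCCCCCC"
    acc = [0.5] * len(seq)
    dis = [0.1] * len(seq)
    vectors = candidate_sites(seq, ss, acc, dis)
    print(len(vectors), "candidate bonds,", len(vectors[5]), "features each")
```

Each vector produced this way could then be passed to a classifier or regressor trained on known cleavage sites, with one model per protease. Which learning algorithm and which feature subset to use are left open here.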