BMC Bioinformatics 2007
Predicting active site residue annotations in the Pfam database

Abstract: We have created a large database of predicted active site residues. On comparing our active site predictions to those found in UniProtKB, Catalytic Site Atlas, PROSITE and MEROPS, we find that we make many novel predictions. On investigating the small subset of predictions made by these databases that are not predicted by us, we found these sequences did not meet our strict criteria for prediction. We assessed the sensitivity and specificity of our methodology and estimate that only 3% of our predicted sequences are false positives.

We have predicted 606110 active site residues, of which 94% are not found in UniProtKB, and have increased the active site annotations in Pfam by more than 200-fold. Although implemented for Pfam, the tool we have developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download. Our active site predictions are re-calculated at each Pfam release to ensure they are comprehensive and up to date. They provide one of the largest available databases of active site annotation.

Enzymes play a considerable role in controlling the flow of metabolites within a cell; they catalyze virtually all of the reactions that make and modify the molecules required in biological pathways. Only a small number of residues within an enzyme are directly involved in catalysis, and the structure and chemical properties of these residues (termed the active site) determine the chemistry of the enzyme. For this reason, active site residues are highly conserved.

Pfam [1] is a database of 8296 protein families (as of Pfam release 20.0). Only ~0.4% of the sequences contained within the enzymatic Pfam families (i.e. those families that contain at least one characterized catalytic site) have the active site residues experimentally determined.
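
The general idea of transferring experimentally determined active site residues to other sequences through a family alignment can be illustrated with a minimal sketch. This is not the authors' tool, and the strict prediction criteria mentioned in the abstract are not specified in this section. The sketch assumes a much simpler rule: a sequence receives a prediction only if it carries exactly the same residue as the annotated sequence at every annotated alignment column. All function names, the gap characters, and the toy alignment are hypothetical.

```python
GAP_CHARS = {"-", "."}  # assumed gap symbols; adjust to the alignment format in use


def column_of_residue(aligned_seq, residue_index):
    """Return the alignment column of the residue_index-th (0-based) ungapped residue."""
    seen = -1
    for col, char in enumerate(aligned_seq):
        if char not in GAP_CHARS:
            seen += 1
            if seen == residue_index:
                return col
    raise ValueError("residue index lies beyond the end of the sequence")


def residue_at_column(aligned_seq, col):
    """Return (0-based ungapped index, residue) at an alignment column, or None for a gap."""
    char = aligned_seq[col]
    if char in GAP_CHARS:
        return None
    index = sum(1 for c in aligned_seq[:col] if c not in GAP_CHARS)
    return index, char


def transfer_active_sites(alignment, source_id, source_sites):
    """Transfer annotated sites from one aligned sequence to the others.

    alignment    -- dict mapping sequence id to its gapped, aligned sequence
    source_id    -- id of the sequence with experimentally determined sites
    source_sites -- 0-based ungapped residue indices of those sites
    Returns a dict mapping each accepted sequence id to its predicted site indices.
    """
    source = alignment[source_id]
    columns = [column_of_residue(source, i) for i in source_sites]
    expected = [source[c].upper() for c in columns]

    predictions = {}
    for seq_id, seq in alignment.items():
        if seq_id == source_id:
            continue
        hits = [residue_at_column(seq, c) for c in columns]
        # Assumed criterion: every annotated column must hold the identical residue.
        if all(h is not None and h[1].upper() == e for h, e in zip(hits, expected)):
            predictions[seq_id] = [h[0] for h in hits]
    return predictions


# Toy alignment (invented for illustration only).
alignment = {
    "known": "MK-HDAS",
    "cand1": "MKAHD-S",
    "cand2": "-KAQDAS",
}
# The annotated sites in "known" are the H (index 2) and the S (index 5).
print(transfer_active_sites(alignment, "known", [2, 5]))
```

In this toy case only `cand1` receives a prediction, because `cand2` has a different residue at the column of the annotated histidine. A real implementation would need further checks against false positives, which this sketch does not attempt.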
There are families within Pfam which we know are catalytic, yet the residues that perform catalysis have not been characterized for any of the sequences within th