|
BMC Bioinformatics 2005
Protein subcellular localization prediction for Gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines

Abstract: We have developed a system for predicting the subcellular localization of proteins for Gram-negative bacteria based on amino acid subalphabets and a combination of multiple support vector machines. The recall of the extracellular site and overall recall of our predictor reach 86.0% and 89.8%, respectively, in 5-fold cross-validation. To the best of our knowledge, these are the most accurate results for predicting subcellular localization in Gram-negative bacteria.

Clustering 20 amino acids into a few groups by the proposed greedy algorithm provides a new way to extract features from protein sequences to cover more adjacent amino acids and hence reduce the dimensionality of the input vector of protein features. It was observed that a good amino acid grouping leads to an increase in prediction performance. Furthermore, a proper choice of a subset of complementary support vector machines constructed by different features of proteins maximizes the prediction accuracy.

Subcellular localization is a key functional attribute of a protein. Since cellular functions are often localized in specific compartments, predicting the subcellular localization of unknown proteins may be used to obtain useful information about their functions and to select proteins for further study. Moreover, studying the subcellular localization of proteins is also helpful in understanding disease mechanisms and for developing novel drugs.

As a result of large-scale genome sequencing efforts in recent years, protein data has accumulated in public data banks at an increasing rate. Analyzing protein data to extract useful knowledge is thus essential for projects like automatic annotation.
It is desirable to have an automated and reliable system for predicting subcellular localization of proteins from amino acid sequences.

A number of efforts [1-21] have been made to predict protein subcellular localization. Most of these prediction methods can be classified into two categories: one based on the recognition
|