|
BMC Bioinformatics 2008
A genetic approach for building different alphabets for peptide and protein classification

Abstract: The new approach has been tested on three peptide classification problems: HIV-protease, recognition of T-cell epitopes, and prediction of peptides that bind human leukocyte antigens. The tests demonstrate that training a pool of classifiers on reduced alphabets, created using a Genetic Algorithm, yields an improvement over other state-of-the-art feature extraction methods. The validity of the novel strategy for creating reduced alphabets is demonstrated by the performance improvement obtained by the proposed approach with respect to other reduced alphabet-based methods on the tested problems.

In the literature, several feature extraction approaches [1] have been proposed for the representation of peptides (e.g. orthonormal encoding, n-grams, ...); some of them have been used for building ensembles of classifiers based on the perturbation of features (i.e. each classifier is trained using a different feature set). Nanni and Lumini [2] proposed an ensemble of classifiers in which each classifier is trained using a different physicochemical property of the amino acids; the best physicochemical properties to combine are selected by Sequential Forward Floating Selection [3]. The same feature extraction is also used in [4] to train a machine learning approach for protein subcellular localization. A system for the recognition of T-cell epitopes, based on the combination of two Support Vector Machines (SVMs), is presented in [5]. The first SVM is trained using information on amino acid positions, while the second SVM is trained using information extracted from the sparse indicator vector and the BLOSUM50 matrix. In particular, [6] proposes an ensemble of SVM classifiers in which each classifier is trained using a different N-peptide composition with reduced amino acid alphabets for larger values of N.
The authors report that the ensemble of SVMs outperforms a stand-alone SVM trained using the well-known 2-peptide composition with th
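
To make the idea of an N-peptide composition over a reduced alphabet concrete, the following is a minimal sketch. The five-group alphabet, the group symbols and the function names are assumptions chosen for illustration only; they are not the alphabets used in [6] or those produced by the Genetic Algorithm in this paper.

```python
from itertools import product

# Hypothetical 5-group reduced alphabet (illustrative assumption, not from the paper).
# Each group symbol stands for a set of the 20 standard amino acids.
GROUPS = {
    "a": "AVLIMC",  # aliphatic / hydrophobic
    "b": "FWYH",    # aromatic
    "c": "STNQ",    # polar
    "d": "KRDE",    # charged
    "e": "GP",      # small / special
}

# Lookup table: amino acid letter -> group symbol.
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def reduce_sequence(peptide):
    """Rewrite a peptide over the reduced alphabet."""
    return "".join(RESIDUE_TO_GROUP[aa] for aa in peptide)


def n_peptide_composition(peptide, n=2):
    """Return the normalised frequency of every length-n word over the
    reduced alphabet, in a fixed (lexicographic) feature order."""
    reduced = reduce_sequence(peptide)
    keys = ["".join(p) for p in product(sorted(GROUPS), repeat=n)]
    counts = dict.fromkeys(keys, 0)
    # Slide a window of width n over the reduced sequence.
    for i in range(len(reduced) - n + 1):
        counts[reduced[i:i + n]] += 1
    total = max(len(reduced) - n + 1, 1)  # avoid division by zero
    return [counts[k] / total for k in keys]


features = n_peptide_composition("AVKG", n=2)
```

With five groups, N = 3 gives 5^3 = 125 features, against 20^3 = 8000 for the full amino acid alphabet, which is why reduced alphabets make larger values of N practical. Each feature vector of this kind could then be used to train one classifier of an ensemble.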
|