%0 Journal Article
%T Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction
%A Noel M O'Boyle
%A David S Palmer
%A Florian Nigsch
%A John B O Mitchell
%J Chemistry Central Journal
%D 2008
%I BioMed Central
%R 10.1186/1752-153X-2-21
%X Starting from an initial set of 203 descriptors, the WAAC algorithm selected a PLS model with 68 descriptors which has an RMSE on an external test set of 46.6°C and R2 of 0.51. The number of components chosen for the model was 49, which was close to optimal for this feature selection. The selected SVM model has 28 descriptors (cost of 5, ε of 0.21) and an RMSE of 45.1°C and R2 of 0.54. This model outperforms a kNN model (RMSE of 48.3°C, R2 of 0.47) for the same data and has similar performance to a Random Forest model (RMSE of 44.5°C, R2 of 0.55). However, it is much less prone to bias at the extremes of the range of melting points, as shown by the slope of the line through the residuals: -0.43 for WAAC/SVM, -0.53 for Random Forest. With a careful choice of objective function, the WAAC algorithm can be used to optimise machine learning and regression models that suffer from overfitting. Where model parameters also need to be tuned, as is the case with support vector machine and partial least squares models, it can optimise these simultaneously. The moving probabilities used by the algorithm are easily interpreted in terms of the best and current models of the ants, and the winnowing procedure promotes the removal of irrelevant descriptors. Quantitative Structure-Activity and Structure-Property Relationship (QSAR and QSPR) models are based upon the idea, first proposed by Hansch [1], that a molecular property can be related to physicochemical descriptors of the molecule. A QSAR model for prediction must be able to generalise well to give accurate predictions on unseen test data.
Although it is true in general that the more descriptors used to build a model, the better the model fits the training set data, such a model typically has very poor predictive ability when presented with unseen test data, a phenomenon known as overfitting [2]. Feature selection refers to the problem of selecting a subset of the descriptors that can be used to build a model with optimal predictive ability.
%U http://journal.chemistrycentral.com/content/2/1/21