|
BMC Bioinformatics 2010
Class prediction for high-dimensional class-imbalanced data

Abstract: Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers.

Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when dealing with high-dimensional data. Researchers using class-imbalanced data should be careful in assessing the predictive accuracy of the classifiers and, unless the class imbalance is mild, they should always use an appropriate method for dealing with the class imbalance problem.

High-throughput technologies measure simultaneously tens of thousands of variables for each of the observations included in the study; data produced by these technologies are often called high-dimensional, because the number of variables greatly exceeds the number of observations. Microarrays are high-dimensional tools commonly used in the biomedical field; they measure the expression of genes [1] or miRNAs [2], the presence of DNA copy number alterations [3] or of variation at a single site in DNA [4], across the entire genome of a subject.

Microarrays are frequently used for class prediction (classification).
In these studies the goal is to develop a rule based on the measurements (variables) obtained from the microarrays from samples (observations) that belong to distinct and well-defined groups (classes); these rules can be used to predict
|