|
BMC Bioinformatics 2007
Supervised group Lasso with applications to microarray data analysis

Abstract: We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we first divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method. The supervised group Lasso consists of two steps. In the first step, we identify important genes within each cluster using the Lasso method. In the second step, we select important clusters using the group Lasso. Tuning parameters are determined using V-fold cross validation at both steps to allow for further flexibility. Prediction performance is evaluated using leave-one-out cross validation. We apply the proposed method to disease classification and survival analysis with microarray data.

We analyze four microarray data sets using the proposed approach: two cancer data sets with binary cancer occurrence as outcomes and two lymphoma data sets with survival outcomes. The results show that the proposed approach is capable of identifying a small number of influential gene clusters and important genes within those clusters, and has better prediction performance than existing methods.

Developments in microarray techniques make it possible to profile gene expression on a whole-genome scale and to study associations between gene expression and the occurrence or progression of common diseases such as cancer or heart disease. Considerable effort has been devoted to identifying genes that have influential effects on diseases. Such studies can lead to a better understanding of the genetic causation of diseases and to better predictive models. Analysis of microarray data is challenging because of the large number of genes surveyed, the small sample sizes, and the presence of cluster structure.
Here the clusters are composed of co-regulated genes with coordinated functions. Without causing confusion, we use the phrases "clusters" and "
|
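The two-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several simplifying assumptions: a squared-error loss stands in for the classification and survival models, the cluster labels are supplied directly rather than obtained by K-means with the Gap method, the penalties `lam1` and `lam2` are fixed rather than chosen by V-fold cross validation, and the second step is assumed to operate only on the genes retained in the first step. All function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b

def group_lasso_ista(X, y, groups, lam, n_iter=500):
    """Proximal gradient for (1/2n)||y - Xb||^2 + lam * sum_g sqrt(p_g) ||b_g||_2."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y) / n)
        for g in np.unique(groups):
            idx = groups == g
            norm = np.linalg.norm(z[idx])
            t = step * lam * np.sqrt(idx.sum())
            # Group soft-thresholding: a whole cluster is shrunk or zeroed together.
            z[idx] = 0.0 if norm <= t else z[idx] * (1.0 - t / norm)
        b = z
    return b

def supervised_group_lasso(X, y, clusters, lam1, lam2):
    """Step 1: Lasso within each cluster. Step 2: group Lasso across clusters."""
    keep = np.zeros(X.shape[1], dtype=bool)
    for c in np.unique(clusters):
        idx = np.flatnonzero(clusters == c)
        b = lasso_ista(X[:, idx], y, lam1)
        keep[idx[b != 0]] = True  # genes selected within this cluster
    beta = np.zeros(X.shape[1])
    if keep.any():
        # Assumption: step 2 uses only the genes that survived step 1.
        beta[keep] = group_lasso_ista(X[:, keep], y, clusters[keep], lam2)
    return beta

# Synthetic illustration: 3 clusters of 4 "genes"; only cluster 0 drives the outcome.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 12))
clusters = np.repeat(np.arange(3), 4)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(60)
beta = supervised_group_lasso(X, y, clusters, lam1=0.1, lam2=0.1)
```

The second step penalizes each cluster's coefficient vector through its Euclidean norm, so a cluster is either removed as a whole or kept with all of its surviving genes, which is what allows selection at the cluster level.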