%0 Journal Article %T Identification of Robust Pathway Markers for Cancer through Rank-Based Pathway Activity Inference %A Navadon Khunlertgit %A Byung-Jun Yoon %J Advances in Bioinformatics %D 2013 %I Hindawi Publishing Corporation %R 10.1155/2013/618461 %X One important problem in translational genomics is the identification of reliable and reproducible markers that can be used to discriminate between different classes of a complex disease, such as cancer. The typical small-sample setting makes the prediction of such markers very challenging, and various approaches have been proposed to address this problem. For example, it has been shown that pathway markers, which aggregate the gene activities in the same pathway, tend to be more robust than gene markers. Furthermore, the use of gene expression ranking has been demonstrated to be robust to batch effects and to lead to more interpretable results. In this paper, we propose an enhanced pathway activity inference method that uses gene ranking to predict the pathway activity in a probabilistic manner. The main focus of this work is on identifying robust pathway markers that can ultimately lead to robust classifiers with reproducible performance across datasets. Simulation results based on multiple breast cancer datasets show that the proposed inference method identifies better pathway markers that can predict breast cancer metastasis with higher accuracy. Moreover, the identified pathway markers can lead to better classifiers with more consistent classification performance across independent datasets. 1. Introduction Advances in microarray and sequencing technologies have enabled the measurement of genome-wide expression profiles, which have spawned a large number of studies aiming to achieve accurate diagnosis and prognosis based on gene expression profiles [1–4].
For example, there has been a significant amount of work on identifying markers and building classifiers that can be used to predict breast cancer metastasis [2, 4]. Many existing methods have directly employed gene expression data without any knowledge of the interrelations between genes. As a result, the predicted gene markers often lack interpretability, and many of them are not reproducible in other independent datasets. To overcome this problem, several different approaches have been proposed. For example, recent work by Geman et al. [3] proposed an approach that utilizes the relative expression between genes, rather than their absolute expression values. It was shown that the resulting markers are easier to interpret, robust to chip-to-chip variations, and more reproducible across datasets. Another possible way to address the aforementioned problem is to interpret the gene expression data at a “modular” level through data integration [5–11]. These methods utilize additional data %U http://www.hindawi.com/journals/abi/2013/618461/