PLOS ONE  2012 

Algebraic Comparison of Partial Lists in Bioinformatics

DOI: 10.1371/journal.pone.0036540

Abstract:

The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling a biological phenotype through a classification or regression model. Because of resampling protocols or meta-analysis comparisons, sets of alternative feature lists (possibly of different lengths) are often obtained instead of a single list. Here we introduce a method, based on permutations, for studying the variability between lists (“list stability”) when the lists have unequal length. We provide algorithms that evaluate stability either for lists embedded in the full feature set or for lists restricted to the features occurring in the partial lists. The method is demonstrated by finding and comparing gene profiles on a large prostate cancer dataset consisting of two cohorts of patients from different countries, for a total of 455 samples.
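
To make the idea of a permutation-based distance between partial lists of unequal length concrete, here is a minimal sketch. It is not the algorithm of the paper: it assumes a Canberra-type distance on ranks and replaces any exact average over permutations with a Monte Carlo average over random completions of each partial list. The names `complete`, `canberra` and `expected_distance`, and the parameters `n_samples` and `seed`, are hypothetical and chosen only for illustration. Passing an explicit `universe` corresponds to embedding the lists in the full feature set; leaving it out restricts the comparison to the features occurring in the two lists.

```python
import random


def complete(partial, universe, rng):
    """Extend a partial ranked list to a full ranking of `universe` by
    appending the missing features in random order (illustrative assumption)."""
    present = set(partial)
    missing = [f for f in universe if f not in present]
    rng.shuffle(missing)
    return list(partial) + missing


def canberra(full_a, full_b):
    """Canberra distance between two full rankings of the same features,
    computed on 1-based rank positions."""
    rank_b = {f: i + 1 for i, f in enumerate(full_b)}
    return sum(
        abs((i + 1) - rank_b[f]) / ((i + 1) + rank_b[f])
        for i, f in enumerate(full_a)
    )


def expected_distance(list_a, list_b, universe=None, n_samples=200, seed=0):
    """Monte Carlo estimate of the mean Canberra distance over random
    completions of two partial lists, which may differ in length.

    If `universe` is None, only features occurring in the two lists are used;
    otherwise `universe` must contain every feature of both lists.
    """
    if universe is None:
        universe = sorted(set(list_a) | set(list_b))
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += canberra(
            complete(list_a, universe, rng),
            complete(list_b, universe, rng),
        )
    return total / n_samples


if __name__ == "__main__":
    top3 = ["g1", "g2", "g3"]
    top5 = ["g2", "g1", "g7", "g3", "g9"]
    print(expected_distance(top3, top5))
    print(expected_distance(top3, top5, universe=[f"g{i}" for i in range(1, 11)]))
```

A stability score for a whole set of lists could then be taken as the mean of `expected_distance` over all pairs of lists, again as an assumption rather than the paper's definition.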
