Transcriptomics meta-analysis aims to reuse existing data to derive novel biological hypotheses, and is motivated by the public availability of a large number of independent studies. Current methods break each study down into multiple comparisons between phenotypes (e.g. disease vs. healthy) according to its experimental design, and then compute the overlap between the resulting differential expression signatures. While useful, this methodology yields multiple independent phenotype comparisons per study, so connections are established not between studies but between the subsets of studies that correspond to phenotype comparisons. We propose a rank-based statistical meta-analysis framework that establishes global connections between transcriptomics studies without breaking them down into sets of phenotype comparisons. Using a rank product method, our framework extracts global features from each study, corresponding to genes that are consistently among the most expressed or most differentially expressed genes in that study. These features are then statistically modelled via a term frequency-inverse document frequency (TF-IDF) model, which is used to connect studies. Our framework is fast and parameter-free; when applied to large collections of Homo sapiens and Streptococcus pneumoniae transcriptomics studies, it outperforms similarity-based approaches in retrieving related studies, as evaluated against a Medical Subject Headings gold standard. Finally, we highlight via case studies how the framework can be used to derive novel biological hypotheses regarding related studies and the genes that drive those connections. Our proposed statistical framework shows that it is possible to perform a meta-analysis of transcriptomics studies with arbitrary experimental designs by deriving global expression features rather than decomposing studies into multiple phenotype comparisons.
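The two steps named above, rank-product feature extraction followed by TF-IDF weighting, can be illustrated with a minimal sketch. It is not the framework's implementation: the function names, the toy data, and in particular the fixed cutoff `k` are assumptions made for illustration (the framework is described as parameter-free, so it presumably selects features by statistical significance rather than by a cutoff). Each study is treated as a "document" whose "terms" are its feature genes, and the term frequency is taken to be binary.

```python
import math


def rank_product(expr):
    """Geometric mean of per-sample ranks for each gene.

    expr maps gene -> list of expression values, one per sample.
    Rank 1 is the most expressed gene in a sample; ties are broken
    arbitrarily by sort order (a simplification).
    """
    genes = list(expr)
    n_samples = len(next(iter(expr.values())))
    log_sum = {g: 0.0 for g in genes}
    for j in range(n_samples):
        ordered = sorted(genes, key=lambda g: expr[g][j], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            log_sum[g] += math.log(rank)
    return {g: math.exp(log_sum[g] / n_samples) for g in genes}


def top_features(expr, k):
    """Hypothetical selection rule: the k genes with the smallest rank product."""
    rp = rank_product(expr)
    return set(sorted(rp, key=rp.get)[:k])


def tfidf_vectors(study_features):
    """Binary term frequency times IDF = log(N / document frequency)."""
    n = len(study_features)
    df = {}
    for feats in study_features.values():
        for g in feats:
            df[g] = df.get(g, 0) + 1
    return {
        s: {g: math.log(n / df[g]) for g in feats}
        for s, feats in study_features.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Toy usage with made-up gene names and feature sets.
features = {
    "study_A": {"g1", "g2"},
    "study_B": {"g1", "g3"},
    "study_C": {"g4", "g5"},
}
vectors = tfidf_vectors(features)
similarity_ab = cosine(vectors["study_A"], vectors["study_B"])
similarity_ac = cosine(vectors["study_A"], vectors["study_C"])
```

In this sketch a gene that appears as a feature in every study receives an IDF of zero, so only genes specific to a subset of studies contribute to the connection between two studies.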