We present an analytic framework based on Self-Organizing Map (SOM) machine learning to study large-scale patient data sets. The power of the approach is demonstrated in a case study using gene expression data of more than 200 mature aggressive B-cell lymphoma patients. The method portrays each sample with individual resolution, characterizes the subtypes, disentangles the expression patterns into distinct modules, extracts their functional context using enrichment techniques, and enables investigation of the similarity relations between the samples. The method also allows outliers caused by contamination to be detected and corrected. Based on our analysis, we propose a refined classification of B-cell lymphoma into four molecular subtypes that differ in their functional and clinical characteristics.