Current microarray data mining methods such as clustering, classification, and association analysis rely heavily on statistical and machine learning algorithms to analyze large sets of gene expression data. In recent years, there has been growing interest in methods that attempt to discover patterns from multiple related data sources. Gene expression data and the corresponding literature data are one such example. This paper suggests a new approach to microarray data mining that combines text mining (TM) and information extraction (IE). TM is concerned with identifying patterns in natural language text, and IE is concerned with locating specific entities, relations, and facts in text. The paper surveys the state of the art of data mining methods for microarray data analysis, shows the limitations of current methods, and outlines how text mining could address these limitations.
1. Introduction
DNA microarrays facilitate the simultaneous measurement of the expression levels of thousands of genes [1, 2]. As a result, this high-throughput technology has led to an increased amount of gene expression data. Microarrays have been used for a variety of studies, including gene coregulation studies, gene function identification studies, identification of pathways and gene regulatory networks, predictive toxicology, clinical diagnosis, and sequence variance studies. For a complete description of microarrays and their analytical tasks, see [3–5]. Current microarray data mining methods such as clustering, classification, and association analysis are based on statistical and machine learning algorithms. Most of these techniques are purely data driven and do not incorporate significant amounts of biological knowledge.
Considering the statistically ill-defined nature of microarray data (many more variables than observations) and the massive body of existing biological knowledge, it is imperative that we exploit that knowledge for the analysis and interpretation of microarray data. Text mining techniques constitute a promising technology for automating the incorporation of scientific knowledge into the microarray data mining process. Applying domain knowledge is fundamental in any scientific discovery process. In biology, domain knowledge is available in vast collections of literature in natural language form, such as abstracts [6] and full-text journal articles [7, 8], and also as textual annotations in databases such as SwissProt [9] and GenBank [10]. For example, the biological abstract database PubMed
References
[1]
M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science, vol. 270, no. 5235, pp. 467–470, 1995.
[2]
J. L. DeRisi, V. R. Iyer, and P. O. Brown, “Exploring the metabolic and genetic control of gene expression on a genomic scale,” Science, vol. 278, no. 5338, pp. 680–686, 1997.
[3]
“The chipping forecast I,” Supplement to Nature Genetics, vol. 21, no. 1, 1999.
[4]
“The chipping forecast II,” Supplement to Nature Genetics, vol. 32, 2002.
[5]
D. Berrar, W. Dubitzky, and M. Granzow, Eds., A Practical Approach to Microarray Data Analysis, Kluwer Academic, Boston, Mass, USA, 2002.
[6]
National Library of Medicine, “PubMed literature abstract database,” http://www.ncbi.nlm.nih.gov/pubmed.
[7]
National Library of Medicine, “PubMed Central full text repository,” http://www.pubmedcentral.gov.
[8]
BioMed Central, “The open access publisher,” http://www.biomedcentral.com.
[9]
Swiss Institute of Bioinformatics, “SwissProt protein sequence database,” http://us.expasy.org/sprot.
[10]
National Library of Medicine, “GenBank sequence database,” 2012, http://www.ncbi.nlm.nih.gov/genbank.
[11]
M. A. Hearst, “Untangling text data mining,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL '99), University of Maryland, 1999.
[12]
J. Cowie and W. Lehnert, “Information extraction,” Communications of the ACM, vol. 39, no. 1, pp. 80–91, 1996.
[13]
P. Jackson and I. Moulinier, Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization, John Benjamins, Amsterdam, The Netherlands, 2002.
[14]
C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, Mass, USA, 1999.
[15]
J. Allen, Natural Language Understanding, The Benjamin/Cummings, Menlo Park, Calif, USA, 1995.
[16]
J. Natarajan, D. Berrar, C. J. Hack, and W. Dubitzky, “Knowledge discovery in biology and biotechnology texts: a review of techniques, evaluation strategies, and applications,” Critical Reviews in Biotechnology, vol. 25, no. 1-2, pp. 31–52, 2005.
[17]
D. K. Slonim, P. Tamayo, J. P. Mesirov, T. R. Golub, and E. S. Lander, “Class prediction and discovery using gene expression data,” in Proceedings of the 4th Annual International Conference on Computational Molecular Biology (RECOMB '00), pp. 263–272, Universal Academy Press, Tokyo, Japan, April 2000.
[18]
E. P. Xing, M. Jordan, and R. M. Karp, “Feature selection for high-dimensional genomic microarray data,” in Proceedings of the 18th International Conference on Machine Learning, pp. 601–608, 2001.
[19]
E.-J. Yeoh, M. E. Ross, S. A. Shurtleff et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling,” Cancer Cell, vol. 1, no. 2, pp. 133–143, 2002.
[20]
H. C. King and A. A. Sinha, “Gene expression profile analysis by DNA microarrays: promise and pitfalls,” Journal of the American Medical Association, vol. 286, no. 18, pp. 2280–2288, 2001.
[21]
G. M. O'Neill, D. R. Catchpoole, and E. A. Golemis, “From correlation to causality: microarrays, cancer, and cancer treatment,” BioTechniques, vol. 34, no. 3, pp. S64–S71, 2003.
[22]
M. West, C. Blanchette, H. Dressman et al., “Predicting the clinical status of human breast cancer by using gene expression profiles,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 20, pp. 11462–11467, 2001.
[23]
B. Vogelstein and K. W. Kinzler, “The multistep nature of cancer,” Trends in Genetics, vol. 9, no. 4, pp. 138–141, 1993.
[24]
M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 25, pp. 14863–14868, 1998.
[25]
P. Tamayo, D. Slonim, J. Mesirov et al., “Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 6, pp. 2907–2912, 1999.
[26]
M. Granzow, D. Berrar, W. Dubitzky, A. Schuster, F. J. Azuaje, and R. Eils, “Tumor classification by gene expression profiling: comparison and validation of five clustering methods,” ACM SIGBIO Newsletter, vol. 21, no. 1, pp. 16–22, 2001.
[27]
C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional space,” in Proceedings of the 8th International Conference on Database Theory (ICDT '01), pp. 420–434, 2001.
[28]
C. Tilstone, “Vital statistics,” Nature, vol. 424, no. 6949, pp. 610–612, 2003.
[29]
F. Azuaje, “Clustering-based approaches to discovering and visualising microarray data patterns,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 31–42, 2003.
[30]
S. Raychaudhuri, J. T. Chang, P. D. Sutphin, and R. B. Altman, “Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature,” Genome Research, vol. 12, no. 1, pp. 203–214, 2002.
[31]
T.-K. Jenssen, A. Lægreid, J. Komorowski, and E. Hovig, “A literature network of human genes for high-throughput analysis of gene expression,” Nature Genetics, vol. 28, no. 1, pp. 21–28, 2001.
[32]
C. Sabatti, “Statistical issues in microarray analysis,” Current Genomics, vol. 3, no. 1, pp. 7–12, 2002.
[33]
D. Berrar, C. S. Downes, and W. Dubitzky, “Multiclass cancer classification using gene expression profiling and probabilistic neural networks,” in Proceedings of the Pacific Symposium on Biocomputing, vol. 8, pp. 5–16, 2003.
[34]
A. A. Alizadeh, M. B. Eisen, R. E. Davis et al., “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature, vol. 403, no. 6769, pp. 503–511, 2000.
[35]
“KDD Cup 2002 task 2 of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,” http://www.biostat.wisc.edu/~craven/kddcup/index.html.
[36]
J. Natarajan, D. Berrar, W. Dubitzky et al., “Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line,” BMC Bioinformatics, vol. 7, no. 1, p. 373, 2006.
[37]
P. G. Febbo, M. G. Mulligan, D. A. Slonina et al., “Literature Lab: a method of automated literature interrogation to infer biology from microarray analysis,” BMC Genomics, vol. 8, article 461, 2007.
[38]
J. Khan, J. S. Wei, M. Ringnér et al., “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
[39]
S. Ramaswamy, P. Tamayo, R. Rifkin et al., “Multiclass cancer diagnosis using tumor gene expression signatures,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 26, pp. 15149–15154, 2001.
[40]
D. Berrar, M. Granzow, W. Dubitzky et al., “New insights in clinical impact of molecular genetic data by knowledge-driven data mining,” in Proceedings of the 2nd International Conference on Systems Biology, pp. 275–281, Omni press, 2001.
[41]
J. T. Chang and R. B. Altman, “Extracting and characterizing gene-drug relationships from the literature,” Pharmacogenetics, vol. 14, no. 9, pp. 577–586, 2004.