BMC Bioinformatics 2006
A combined approach to data mining of textual and structured data to identify cancer-related targets
Abstract: The data mining method identified previously undetected targets. Our combined strategy applied to each cancer type identified a minimum of 375 proteins expressed within the extracellular space and/or attached to the plasma membrane. The method led to the recognition of human cancer-related hydrolases (on average, ~35 per cancer type), among which were prostatic acid phosphatase, prostate-specific antigen, and sulfatase 1.
The combined data mining of several databases overcame many of the limitations of querying a single database and enabled the facile identification of gene products. In the case of cancer-related targets, it produced a list of putative extracellular, hydrolytic enzymes that merit additional study as candidates for cancer radioimaging and radiotherapy. The proposed data mining strategy is of a general nature and can be applied to other biological databases for understanding biological functions and diseases.
Recent advances in genomics and associated high-throughput technologies have resulted in the exponential growth of biological databases. These consist of annotated genomic databases such as those at NCBI Genomic Biology [1], Ensembl [2] and UCSC Genome Bioinformatics [3]; specialized primary databases of proteins including UniProt (the universal protein resource) [4] and the RCSB Protein Data Bank (PDB, the database of protein structures) [5]; and derived databases such as EMBL-EBI InterPro (database of protein families, domains and functional sites) [6]. In parallel with structured data, the corpus of scientific literature (textual data) has been expanding rapidly. Structured and textual data are fertile grounds in the bioinformatics community for the development of data mining tools to identify key entities (genes/proteins) involved in biological processes and provide important biological insights.
The combination of these two resources has resulted in knowledge bases that represent derived information on interactions among entities (see referenc