Chemical entities are ubiquitous throughout the biomedical literature, and text-mining systems that can efficiently identify those entities are required. Due to the lack of available corpora and data resources, the community has focused its efforts on the development of gene and protein named entity recognition systems, but with the release of ChEBI and the availability of an annotated corpus, the identification of chemical entities can now be addressed. We developed a machine-learning-based method for chemical entity recognition and a lexical-similarity-based method for chemical entity resolution and compared them with Whatizit, a popular dictionary-based method. Our methods outperformed the dictionary-based method in all tasks, yielding an improvement in F-measure of 20% for the entity recognition task, 2–5% for the entity resolution task, and 15% for the combined entity recognition and resolution task.
1. Background
Biomedical literature provides extensive information that is not covered in other knowledge resources, and the amount of information published in articles and patents is growing at a fast pace. As a result, the manual analysis and annotation of the literature is a tedious, time-consuming, and costly process. Fortunately, this process has been addressed by text-mining systems that have already been shown to be helpful in speeding up some of its steps [1]. Normally, the first step of a text-mining system is the identification of named entities in text. This crucial step includes the tasks of named entity recognition and entity resolution. Named entity recognition comprises the identification of the text boundaries that delimit a string referring to a target category, such as chemicals [2]. Entity resolution takes as input the strings identified in the previous task and finds exactly which chemical each string corresponds to by mapping it to a reference database entry.
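To make the distinction between the two tasks concrete, the following minimal sketch separates recognition (finding text boundaries) from resolution (mapping a string to a database entry). It is an illustration under stated assumptions, not the method evaluated in this work: a toy lexicon lookup stands in for the recognizer, a generic string-similarity ratio from the Python standard library stands in for the lexical similarity measure, and the database entries, identifiers, function names, and the 0.8 threshold are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference database (toy identifiers, not real ChEBI entries).
REFERENCE_DB = {
    "DB:0001": "ethanol",
    "DB:0002": "glucose",
    "DB:0003": "acetic acid",
}


def recognize(text, surface_forms):
    """Recognition: return (start, end, string) for each mention found in text.

    Assumption: a case-insensitive lexicon lookup replaces a trained model.
    """
    # Try longer forms first so multi-word names are not split.
    forms = sorted(surface_forms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


def resolve(mention, database=REFERENCE_DB, threshold=0.8):
    """Resolution: map a mention to the most similar database entry.

    Returns the identifier of the best match, or None when no entry
    reaches the (hypothetical) similarity threshold.
    """
    best_id, best_score = None, 0.0
    for entry_id, name in database.items():
        score = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        if score > best_score:
            best_id, best_score = entry_id, score
    return best_id if best_score >= threshold else None


text = "Glucose is fermented to ethanol."
mentions = recognize(text, REFERENCE_DB.values())
resolved = [(start, end, string, resolve(string)) for start, end, string in mentions]
```

Keeping the two steps as separate functions mirrors the task definitions above: the output of recognition (character offsets and strings) is exactly the input that resolution consumes, so each step can be evaluated on its own or in combination.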
Most efforts in entity recognition and resolution have focused on the identification of protein and gene named entities in the literature. The performance of systems tackling such tasks has been measured in competitions such as the BioCreative challenge [3, 4], the TREC Genomics Track [5], and the NLPBA challenge [6]. However, few efforts have been made on the recognition and resolution of other terminologies, partly due to the lack of annotated corpora and the high costs associated with their generation. One such case is chemical terminology, a field that suffers from the lack of available corpora but can benefit immensely from text mining. For example, chemical
References
[1]
M. Krauthammer and G. Nenadic, “Term identification in the biomedical literature,” Journal of Biomedical Informatics, vol. 37, no. 6, pp. 512–526, 2004.
[2]
P. Zweigenbaum, D. Demner-Fushman, H. Yu, and K. B. Cohen, “Frontiers of biomedical text mining: current progress,” Briefings in Bioinformatics, vol. 8, no. 5, pp. 358–375, 2007.
[3]
L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia, “Overview of BioCreAtIvE: critical assessment of information extraction for biology,” BMC Bioinformatics, vol. 6, supplement 1, article S1, 2005.
[4]
A. A. Morgan, Z. Lu, X. Wang et al., “Overview of BioCreative II gene normalization,” Genome Biology, vol. 9, supplement 2, article S3, 2008.
[5]
W. Hersh and E. Voorhees, “TREC genomics special issue overview,” Information Retrieval, vol. 12, no. 1, pp. 1–15, 2008.
[6]
J.-D. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier, “Introduction to the bio-entity recognition task at JNLPBA,” in Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pp. 70–75, 2004.
[7]
R. A.-A. Erhardt, R. Schneider, and C. Blaschke, “Status of text-mining techniques applied to biomedical text,” Drug Discovery Today, vol. 11, no. 7-8, pp. 315–325, 2006.
[8]
M. Krallinger, A. Valencia, and L. Hirschman, “Linking genes to literature: text mining, information extraction, and retrieval applications for biology,” Genome Biology, vol. 9, supplement 2, article S8, 2008.
[9]
D. L. Banville, “Mining chemical structural information from the drug literature,” Drug Discovery Today, vol. 11, no. 1-2, pp. 35–42, 2006.
[10]
I. Spasic, S. Ananiadou, J. McNaught, and A. Kumar, “Text mining and ontologies in biomedicine: making sense of raw text,” Briefings in Bioinformatics, vol. 6, no. 3, pp. 239–251, 2005.
[11]
W. J. Wilbur, G. F. Hazard, G. Divita et al., “Analysis of biomedical text for chemical names: a comparison of three methods,” in Proceedings of the AMIA Symposium, pp. 176–180, 1999.
[12]
M. Narayanaswamy, K. E. Ravikumar, and K. Vijay-Shanker, “A biological named entity recognizer,” in Proceedings of the Pacific Symposium on Biocomputing, vol. 438, pp. 427–438, 2003.
[13]
J. D. Wren, “A scalable machine-learning approach to recognize chemical names within large text databases,” BMC Bioinformatics, vol. 7, supplement 2, article S3, 2006.
[14]
P. Tomasulo, “ChemIDplus-super source for chemical and drug information,” Medical Reference Services Quarterly, vol. 21, no. 1, pp. 53–59, 2002.
[15]
R. Klinger, C. Kolářik, J. Fluck, M. Hofmann-Apitius, and C. M. Friedrich, “Detection of IUPAC and IUPAC-like chemical names,” Bioinformatics, vol. 24, no. 13, pp. i268–i276, 2008.
[16]
P. Corbett and A. Copestake, “Cascaded classifiers for confidence-based chemical named entity recognition,” BMC Bioinformatics, vol. 9, supplement 11, article S4, 2008.
[17]
D. Rebholz-Schuhmann, M. Arregui, S. Gaudan, H. Kirsch, and A. Jimeno, “Text processing through web services: calling Whatizit,” Bioinformatics, vol. 24, no. 2, pp. 296–298, 2008.
[18]
K. Degtyarenko, P. de Matos, M. Ennis et al., “ChEBI: a database and ontology for chemical entities of biological interest,” Nucleic Acids Research, vol. 36, no. 1, pp. D344–D350, 2008.
[19]
R. T.-H. Tsai, S.-H. Wu, W.-C. Chou et al., “Various criteria in the evaluation of biomedical named entity recognition,” BMC Bioinformatics, vol. 7, article 92, 2006.
[20]
J. D. Ferreira and F. M. Couto, “Semantic similarity for automatic classification of chemical compounds,” PLoS Computational Biology, vol. 6, no. 9, Article ID e1000937, 2010.
[21]
J. Lafferty, A. McCallum, and F. Pereira, “Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the 18th International Conference on Machine Learning, pp. 282–289, 2001.
[22]
L. Smith, L. K. Tanabe, R. Ando et al., “Overview of BioCreative II gene mention recognition,” Genome Biology, vol. 9, supplement 2, pp. 1–19, 2008.
[23]
A. K. McCallum, MALLET: A Machine Learning for Language Toolkit, 2002.
[24]
P. Corbett, C. Batchelor, and S. Teufel, “Annotation of chemical named entities,” in Proceedings of the BioNLP 2007: Biological, Translational, and Clinical Language Processing, pp. 57–64, 2007.
[25]
C. Pesquita, C. Stroe, I. Cruz, and F. M. Couto, “BLOOMS on AgreementMaker: results for OAEI 2010,” in Proceedings of the ISWC Workshop on Ontology Matching, pp. 134–141, 2010.
[26]
F. M. Couto, M. J. Silva, and P. M. Coutinho, “Finding genomic ontology terms in text using evidence content,” BMC Bioinformatics, vol. 6, supplement 1, article S21, 2005.