%0 Journal Article
%T Chemical Entity Recognition and Resolution to ChEBI
%A Tiago Grego
%A Catia Pesquita
%A Hugo P. Bastos
%A Francisco M. Couto
%J ISRN Bioinformatics
%D 2012
%R 10.5402/2012/619427
%X Chemical entities are ubiquitous throughout the biomedical literature, and text-mining systems that can efficiently identify those entities are required. Due to the lack of available corpora and data resources, the community has focused its efforts on the development of gene and protein named entity recognition systems, but with the release of ChEBI and the availability of an annotated corpus, this task can now be addressed. We developed a machine-learning-based method for chemical entity recognition and a lexical-similarity-based method for chemical entity resolution and compared them with Whatizit, a popular dictionary-based method. Our methods outperformed the dictionary-based method in all tasks, yielding an improvement in F-measure of 20% for the entity recognition task, 2–5% for the entity resolution task, and 15% for the combined entity recognition and resolution task.
1. Background
Biomedical literature provides extensive information that is not covered in other knowledge resources, and the amount of information published in articles and patents is growing at a fast pace; manual analysis and annotation of the literature is therefore a tedious, time-consuming, and costly process. Fortunately, this process has been addressed by text-mining systems, which have already been shown to speed up some of its steps [1]. Normally, the first step of a text-mining system is the identification of named entities in text. This crucial step comprises two tasks: named entity recognition and entity resolution. Named entity recognition identifies the text boundaries that delimit a string referring to a target category, such as chemicals [2]. Entity resolution takes the strings identified in the previous task as input and determines exactly which chemical each string corresponds to by mapping it to a reference database entry.
Most efforts in entity recognition and resolution have focused on the identification of protein and gene named entities in the literature. The performance of systems tackling such tasks has been measured in competitions such as the BioCreative challenge [3, 4], the TREC Genomics Track [5], and the NLPBA challenge [6]. However, few efforts have been made on the recognition and resolution of other terminologies, partly due to the lack of annotated corpora and the high costs associated with their generation. One such case is chemical terminology, a field that suffers from the lack of available corpora but can benefit immensely from text mining. For example, chemical
%U http://www.hindawi.com/journals/isrn.bioinformatics/2012/619427/