|
BMC Bioinformatics 2008
Corpus annotation for mining biomedical events from literature

Abstract: We have completed a new type of semantic annotation, event annotation, which adds to the existing annotations in the GENIA corpus. The corpus had already been annotated with parts of speech (POS), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design an annotation scheme that meets the specific requirements of text annotation, (2) to achieve biology-oriented annotation that reflects biologists' interpretation of text, and (3) to ensure homogeneous annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to the successful completion of a large-scale annotation.

The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for Natural Language Processing (NLP)-based text mining (TM) in the bio-medical domain.

Due to the ever-increasing number of scientific articles in the bio-medical domain, Text Mining (TM) has been recognized as one of the key technologies for future bio-medical research [1-8]. In particular, since the limits of simple TM techniques that treat text as a bag of words have become apparent, there has been increased interest in more sophisticated, Natural Language Processing (NLP)-based TM. NLP as a field is concerned with the computer processing of the structure of sentences and texts.
Recently, advanced NLP software which uses grammatical knowledge and/or machine learning techniques has been increasingly applied to TM for the bio-medical domain [9-21].

For NLP techniques to be successfully applied to text in the bio-medical domain, we first have to construct resources specifically designed for NLP in this domain. Since vocabularies are highly depend
|