|
BMC Bioinformatics 2008
Multi-label literature classification based on the Gene Ontology graphAbstract: In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community.Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.A thrust in bioinformatics is to acquire and transform contemporary knowledge from biomedical literature into computable forms, so that computers can be used to efficiently organize, retrieve and discover the knowledge. The Gene Ontology (GO) [1] is a controlled vocabulary used to represent molecular biology concepts, which is the de facto standard for annotating genes/proteins. The concepts in GO, referred to as GO terms, are organized in directed acyclic graphs (DAGs) to reflect hierarchical relationships among concepts. Currently, the process of extracting biological concepts from biomedical literature to annotate genes/proteins i
|