%0 Journal Article
%T Thesaurus-based disambiguation of gene symbols
%A Bob JA Schijvenaars
%A Barend Mons
%A Marc Weeber
%A Martijn J Schuemie
%A Erik M van Mulligen
%A Hester M Wain
%A Jan A Kors
%J BMC Bioinformatics
%D 2005
%I BioMed Central
%R 10.1186/1471-2105-6-149
%X We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set. The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications. The amount of information in the life sciences is staggering and growing exponentially. One of the largest biomedical resources of textual scientific information, the Medline database, currently contains over 14 million abstracts, with an estimated increase in size of more than one article per minute. Scientists are faced with an overload of information, which is particularly pressing in the biological field where high-throughput experiments in genomics and proteomics generate new data at an unprecedented rate. 
More often than not, interpretation of these data requires the digestion and integration of information contained in many thousands of articles and other information sources, a daunting task clearly beyond the capacity of human reading and comprehension. Recently, a number of information retrieval systems have been proposed to extract and relate pertinent b
%U http://www.biomedcentral.com/1471-2105/6/149