%0 Journal Article
%T What the papers say: Text mining for genomics and systems biology
%A Nathan Harmston
%A Wendy Filsell
%A Michael PH Stumpf
%J Human Genomics
%D 2010
%I BioMed Central
%R 10.1186/1479-7364-5-1-17
%X The scientific literature provides an important source of knowledge generated by the research community; it does not become defunct five years after publication and it is not just something to promote the authors' careers. While large amounts of data relating to biological systems are stored in public repositories, an even larger amount can be found in a semi-structured form in the literature (see Figure 1). This knowledge is potentially very useful in a variety of genomics and systems biology contexts [1]. For example, manually curated and literature-derived protein-protein interaction data-sets are typically used as gold standards by the systems biology community and it is standard practice to extract parameters for mechanistic models from the literature. Manual curation lacks the scalability to deal with the ever-increasing numbers of papers being published [2,3] and suffers from inter-annotator disagreement: different curators may interpret a piece of text in different ways. This means that a single paper needs to be annotated at least twice if the reliability of the proposed annotations is in any way to be calculated. The increase in the numbers of papers being published also means that it is becoming harder for researchers to stay up to date with the relevant literature in their field. This has an impact on their ability to generate meaningful and testable hypotheses, with some even suggesting that this is becoming a bottleneck in the scientific discovery process [4]. These issues have motivated a sustained interest in the application of text mining (TM) techniques by both the industrial [5] and academic [6] communities to address some of these problems. TM refers to the process of extracting information encoded in text by authors through the use of techniques from a variety of fields such as information retrieval (IR), machine learning (ML), natural language processing (NLP), statistics and computational linguistics (CL) [7].
The use of these techniques leads t
%K data mining
%K systems medicine
%K literature processing
%K hypothesis generation
%U http://www.humgenomics.com/content/5/1/17