Centuries of biological knowledge are contained in the massive body of scientific literature, written to be read by humans but too large for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but they require special development for application in the biological realm because of the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life-wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript briefly discusses the key steps in applying information extraction tools to enhance biodiversity science.

1. Introduction

Biologists are expected to answer large-scale questions that address processes occurring across broad spatial and temporal scales, such as the effects of climate change on species [1, 2]. This motivates the development of a new type of data-driven discovery focusing on scientific insights and hypothesis generation through the novel management and analysis of preexisting data [3, 4]. Data-driven discovery presumes that a large, virtual pool of data will emerge across a wide spectrum of the life sciences, matching that already in place for the molecular sciences.
It is argued that the availability of such a pool will allow biodiversity science to join the other “Big” (i.e., data-centric) sciences such as astronomy and high-energy particle physics. Managing large amounts of heterogeneous data for this Big New Biology will require a cyberinfrastructure that organizes an open pool of biological data. To assess the resources needed to establish the cyberinfrastructure for biology, it is necessary to understand the nature of biological data. To become a part of the cyberinfrastructure, data must be ready to enter a digital data pool. This means data must be digital, normalized, and standardized. Biological data sets are heterogeneous in format, size, degree of digitization, and openness [4, 7, 8]. The distribution of
W. E. Bradshaw and C. M. Holzapfel, “Genetic shift in photoperiodic response correlated with global warming,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 25, pp. 14509–14511, 2001.
A. Hey, The Fourth Paradigm: Data-Intensive Scientific Discovery, 2009, http://iw.fh-potsdam.de/fileadmin/FB5/Dokumente/forschung/tagungen/i-science/TonyHey_-__eScience_Potsdam__Mar2010____complete_.pdf.
Key Perspectives Ltd, “Data dimensions: disciplinary differences in research data sharing, reuse and long term viability,” Digital Curation Centre, 2010, http://scholar.google.com/scholar?hl=en&q=Data+Dimensions:+disciplinary+differences+in+research+data-sharing,+reuse+and+long+term+viability.++&btnG=Search&as_sdt=0,22&as_ylo=&as_vis=0#0.
M. Kalfatovic, “Building a global library of taxonomic literature,” in 28th Congresso Brasileiro de Zoologia Biodiversidade e Sustentabilidade, 2010, http://www.slideshare.net/Kalfatovic/building-a-global-library-of-taxonomic-literature.
X. Tang and P. Heidorn, “Using automatically extracted information in species page retrieval,” 2007, http://scholar.google.com/scholar?hl=en&q=Tang+Heidorn+2007+using+automatically+extracted&btnG=Search&as_sdt=0,22&as_ylo=&as_vis=0#0.
H. Cui, P. Selden, and D. Boufford, “Semantic annotation of biosystematics literature without training examples,” Journal of the American Society for Information Science and Technology, vol. 61, pp. 522–542, 2010.
Y. Miyao, K. Sagae, R. Sætre, T. Matsuzaki, and J. Tsujii, “Evaluating contributions of natural language parsers to protein-protein interaction extraction,” Bioinformatics, vol. 25, no. 3, pp. 394–400, 2009.
K. Humphreys, G. Demetriou, and R. Gaizauskas, “Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures,” in Proceedings of the Pacific Symposium on Biocomputing (PSB '00), vol. 513, pp. 505–513, 2000.
X. Zhou, X. Zhang, and X. Hu, “Dragon toolkit: incorporating auto-learned semantic knowledge into large-scale text retrieval and mining,” in Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '07), pp. 197–201, October 2007.
D. Rebholz-Schuhmann, H. Kirsch, M. Arregui, S. Gaudan, M. Riethoven, and P. Stoehr, “EBIMed—text crunching to gather facts for proteins from Medline,” Bioinformatics, vol. 23, no. 2, pp. e237–e244, 2007.
S. Pyysalo and T. Salakoski, “Lexical adaptation of link grammar to the biomedical sublanguage: a comparative evaluation of three approaches,” BMC Bioinformatics, vol. 7, supplement 3, article S2, 2006.
R. Abascal and J. A. Sánchez, “X-tract: structure extraction from botanical textual descriptions,” in Proceedings of the String Processing & Information Retrieval Symposium & International Workshop on Groupware, pp. 2–7, IEEE Computer Society, Cancun, Mexico, September 1999.
H. Cui, “CharaParser for fine-grained semantic annotation of organism morphological descriptions,” Journal of the American Society for Information Science and Technology, vol. 63, no. 4, pp. 738–754, 2012.
R. Leaman and G. Gonzalez, “BANNER: an executable survey of advances in biomedical named entity recognition,” in Proceedings of the Pacific Symposium on Biocomputing (PSB '08), pp. 652–663, Kona, Hawaii, USA, January 2008.
M. Schröder, “Knowledge-based processing of medical language: a language engineering approach,” in Proceedings of the 16th German Conference on Artificial Intelligence (GWAI '92), vol. 671, pp. 221–234, Bonn, Germany, August-September 1992.
A. Kornai, K. Mohiuddin, and S. D. Connell, “Recognition of cursive writing on personal checks,” in Proceedings of the 5th International Workshop on Frontiers in Handwriting Recognition, pp. 373–378, Citeseer, Essex, UK, 1996.
C. Freeland, “Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing,” in BioSystematics Berlin, 2011, http://www.slideshare.net/chrisfreeland/digitization-and-enhancement-of-biodiversity-literature-through-ocr-scientific-names-mapping-and-crowdsourcing.
A. Willis, D. King, D. Morse, A. Dil, C. Lyal, and D. Roberts, “From XML to XML: the why and how of making the biodiversity literature accessible to researchers,” in Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC '10), pp. 1237–1244, European Language Resources Association (ELRA), Valletta, Malta, May 2010.
T. Rees, “TAXAMATCH, a ‘fuzzy’ matching algorithm for taxon names, and potential applications in taxonomic databases,” in Proceedings of TDWG, 2008, p. 35, http://www.tdwg.org/fileadmin/2008conference/documents/Proceedings2008.pdf#page=35.
G. Sautter, K. Böhm, and D. Agosti, “Semi-automated XML markup of biosystematic legacy literature with the GoldenGATE editor,” in Proceedings of the Pacific Symposium on Biocomputing (PSB '07), pp. 391–402, World Scientific, 2007.
G. A. Pavlopoulos, E. Pafilis, M. Kuhn, S. D. Hooper, and R. Schneider, “OnTheFly: a tool for automated document-based text annotation, data linking and network generation,” Bioinformatics, vol. 25, no. 7, pp. 977–978, 2009.
W. M. Dahdul, J. P. Balhoff, J. Engeman et al., “Evolutionary characters, phenotypes and ontologies: curating data from the systematic biology literature,” PLoS ONE, vol. 5, no. 5, Article ID e10708, 2010.
M. Wood, S. Lydon, V. Tablan, D. Maynard, and H. Cunningham, “Populating a database from parallel texts using ontology-based information extraction,” in Natural Language Processing and Information Systems, vol. 3136, pp. 357–365, 2004.
H. Yu, W. Kim, V. Hatzivassiloglou, and W. J. Wilbur, “Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles,” Journal of Biomedical Informatics, vol. 40, no. 2, pp. 150–159, 2007.
J. D. Wren and H. R. Garner, “Heuristics for identification of acronym-definition patterns within text: towards an automated construction of comprehensive acronym-definition dictionaries,” Methods of Information in Medicine, vol. 41, no. 5, pp. 426–434, 2002.
M. Wood, S. Lydon, V. Tablan, D. Maynard, and H. Cunningham, “Using parallel texts to improve recall in IE,” in Proceedings of Recent Advances in Natural Language Processing (RANLP '03), pp. 505–512, Borovetz, Bulgaria, 2003.
H. Cui and P. B. Heidorn, “The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions,” Journal of the American Society for Information Science and Technology, vol. 58, no. 1, pp. 133–149, 2007.
H. Cui, S. Singaram, and A. Janning, “Combine unsupervised learning and heuristic rules to annotate morphological characters,” Proceedings of the American Society for Information Science and Technology, vol. 48, no. 1, pp. 1–9, 2011.