A Survey on Models and Query Languages for Temporally Annotated RDF
Anastasia Analyti,Ioannis Pachoulakis
International Journal of Advanced Computer Sciences and Applications , 2012,
Abstract: In this paper, we provide a survey on the models and query languages for temporally annotated RDF. In most of the works, a temporally annotated RDF ontology is essentially a set of RDF triples associated with temporal constraints, where, in the simplest case, a temporal constraint is a validity temporal interval. However, a temporally annotated RDF ontology may also be a set of triples connecting resources with a specific lifespan, where each of these triples is also associated with a validity temporal interval. Further, a temporal RDF ontology may be a set of triples connecting resources as they stand at specific time points. Several query languages for temporally annotated RDF have been proposed, most of which extend SPARQL or translate to SPARQL. Some of the works provide experimental results while the rest are purely theoretical.
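The simplest model named in the abstract above, a set of triples each carrying a validity interval, can be illustrated with a minimal Python sketch. The class `TemporalTriple`, the function `valid_at`, the `ex:` resource names, and the use of integer years with closed intervals are all assumptions made for illustration; they are not taken from the surveyed works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalTriple:
    """An RDF-style triple annotated with a closed validity interval (hypothetical model)."""
    subject: str
    predicate: str
    obj: str
    start: int  # first year in which the triple holds (inclusive)
    end: int    # last year in which the triple holds (inclusive)


def valid_at(triples, year):
    """Return the plain (s, p, o) triples whose validity interval contains `year`."""
    return [
        (t.subject, t.predicate, t.obj)
        for t in triples
        if t.start <= year <= t.end
    ]


# Illustrative data only: the resources and years are made up.
ONTOLOGY = [
    TemporalTriple("ex:alice", "ex:worksFor", "ex:acme", 2005, 2009),
    TemporalTriple("ex:alice", "ex:worksFor", "ex:globex", 2010, 2012),
]

snapshot_2008 = valid_at(ONTOLOGY, 2008)
print(snapshot_2008)
```

Filtering by a time point in this way yields an ordinary, non-temporal set of triples, which corresponds to the abstract's third reading of a temporal ontology as resources "as they stand at specific time points".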
Automatic Extraction of Tagset Mappings from Parallel-Annotated Corpora
John Hughes,Clive Souter,Eric Atwell
Computer Science , 1995,
Abstract: This paper describes some of the recent work of project AMALGAM (automatic mapping among lexico-grammatical annotation models). We are investigating ways to map between the leading corpus annotation schemes in order to improve their reusability. Collation of all the included corpora into a single large annotated corpus will allow a more detailed language model to be developed for tasks such as speech and handwriting recognition. In particular, we focus here on a method of extracting mappings from corpora that have been annotated according to more than one annotation scheme.
Extracting a bilingual semantic grammar from FrameNet-annotated corpora
Dana Dannélls,Normunds Grūzītis
Computer Science , 2014,
Abstract: We present the creation of an English-Swedish FrameNet-based grammar in Grammatical Framework. The aim of this research is to make existing framenets computationally accessible for multilingual natural language applications via a common semantic grammar API, and to facilitate the porting of such a grammar to other languages. In this paper, we describe the abstract syntax of the semantic grammar while focusing on its automatic extraction possibilities. We have extracted a shared abstract syntax from ~58,500 annotated sentences in Berkeley FrameNet (BFN) and ~3,500 annotated sentences in Swedish FrameNet (SweFN). The abstract syntax defines 769 frame-specific valence patterns that cover 77.8% of the examples in BFN and 74.9% in SweFN belonging to the shared set of 471 frames. As a side result, we provide a unified method for comparing semantic and syntactic valence patterns across framenets.
LDC Arabic Treebanks and Associated Corpora: Data Divisions Manual
Mona Diab,Nizar Habash,Owen Rambow,Ryan Roth
Computer Science , 2013,
Abstract: The Linguistic Data Consortium (LDC) has developed hundreds of data corpora for natural language processing (NLP) research. Among these are a number of annotated treebank corpora for Arabic. Typically, these corpora consist of a single collection of annotated documents. NLP research, however, usually requires multiple data sets for the purposes of training models, developing techniques, and final evaluation. Therefore it becomes necessary to divide the corpora used into the required data sets (divisions). This document details a set of rules that have been defined to enable consistent divisions for old and new Arabic treebanks (ATB) and related corpora.
Data formats for phonological corpora
Laurent Romary,Andreas Witt
Computer Science , 2011,
Abstract: The goal of the present chapter is to explore the possibility of providing the research (but also the industrial) community that commonly uses spoken corpora with a stable portfolio of well-documented standardised formats that allow a high re-use rate of annotated spoken resources and, as a consequence, better interoperability across tools used to produce or exploit such resources.
Querying Databases of Annotated Speech
Steve Cassidy,Steven Bird
Computer Science , 2002,
Abstract: Annotated speech corpora are databases consisting of signal data along with time-aligned symbolic `transcriptions'. Such databases are typically multidimensional, heterogeneous and dynamic. These properties present a number of tough challenges for representation and query. The temporal nature of the data adds an additional layer of complexity. This paper presents and harmonises two independent efforts to model annotated speech databases, one at Macquarie University and one at the University of Pennsylvania. Various query languages are described, along with illustrative applications to a variety of analytical problems. The research reported here forms a part of several ongoing projects to develop platform-independent open-source tools for creating, browsing, searching, querying and transforming linguistic databases, and to disseminate large linguistic databases over the internet.
John Newman
Taiwan Journal of Linguistics , 2008,
Abstract: Despite the abundance of electronic corpora now available to researchers, corpora of natural speech are still relatively rare and relatively costly. This paper suggests reasons why spoken corpora are needed, despite the formidable problems of construction. The multiple purposes of such corpora and the involvement of very different kinds of language communities in such projects mean that there is no one single blueprint for the design, markup, and distribution of spoken corpora. A number of different spoken corpora are reviewed to illustrate a range of possibilities for the construction of spoken corpora.
Metaphor Identification in Large Texts Corpora
Yair Neuman, Dan Assaf, Yohai Cohen, Mark Last, Shlomo Argamon, Newton Howard, Ophir Frieder
PLOS ONE , 2013, DOI: 10.1371/journal.pone.0062343
Abstract: Identifying metaphorical language-use (e.g., sweet child) is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification. The algorithms are variations of the same core algorithm. We evaluate the algorithms on two corpora of Reuters and the New York Times articles. The paper presents the most comprehensive study of metaphor identification in terms of scope of metaphorical phrases and annotated corpora size. Algorithms’ performance in identifying linguistic phrases as metaphorical or literal has been compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus.
Corpora and concordancers on the nl.ijs.si server
Tomaž Erjavec
Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave , 2013,
Abstract: The paper presents the monolingual and parallel corpora which can be accessed through two concordancers on the server nl.ijs.si. Twelve monolingual corpora contain Slovene language texts, one contains Japanese and one English texts, and comprise reference corpora, such as Gigafida for written contemporary Slovene, IMP for historical Slovene, and GOS for spoken Slovene, and specialised corpora, such as the corpus of texts from the informatics domain and the corpus of Slovene tweets. The five parallel corpora contain Slovene texts sentence-aligned with, variously, English, Japanese, French, German, and Italian from domains such as EU law, literature and journalism. Although most of the corpora have been produced in the past, they have now been newly annotated, some have been extended with additional texts, and a few are completely new. The texts in the corpora are supplied with meta-data, while their word tokens are either manually or automatically annotated with at least lemmas and morphosyntactic descriptions. Most of the corpora are freely available through two web concordancers, the noSketch Engine and CUWI. These two corpus analysis tools support searching large annotated corpora, various types of search result display, the possibility to filter the searches according to meta-data, and saving the search results locally. In addition to the corpora and concordancers, the paper also discusses some issues pertaining to such a corpus-linguistic infrastructure, and concludes with directions for further work.
Using the Annotated Bibliography as a Resource for Indicative Summarization
Min-Yen Kan,Judith L. Klavans,Kathleen R. McKeown
Computer Science , 2002,
Abstract: We report on a language resource consisting of 2000 annotated bibliography entries, which is being analyzed as part of our research on indicative document summarization. We show how annotated bibliographies cover certain aspects of summarization that have not been well-covered by other summary corpora, and motivate why they constitute an important form to study for information retrieval. We detail our methodology for collecting the corpus, and overview our document feature markup that we introduced to facilitate summary analysis. We present the characteristics of the corpus, methods of collection, and show its use in finding the distribution of types of information included in indicative summaries and their relative ordering within the summaries.
