oalib
All listed articles are free for downloading (OA Articles)
A PDTB-Styled End-to-End Discourse Parser  [PDF]
Ziheng Lin, Hwee Tou Ng, Min-Yen Kan
Computer Science , 2010,
Abstract: We have developed a full discourse parser in the Penn Discourse Treebank (PDTB) style. Our trained parser first identifies all discourse and non-discourse relations, locates and labels their arguments, and then classifies their relation types. When appropriate, the attribution spans to these relations are also determined. We present a comprehensive evaluation from both component-wise and error-cascading perspectives.
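The staged design the abstract describes (identify connectives, locate and label arguments, classify relation types) can be sketched as a simple pipeline. The connective lexicon, heuristics, and sense table below are illustrative assumptions standing in for the trained components of the actual parser:

```python
# Minimal sketch of a staged PDTB-style discourse parsing pipeline.
# Each stage consumes the previous stage's output; a real system uses
# trained classifiers at every step, not the toy rules shown here.

EXPLICIT_CONNECTIVES = {"because", "however", "but", "although"}  # tiny illustrative lexicon

def identify_connectives(tokens):
    """Stage 1: flag tokens that may function as discourse connectives."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in EXPLICIT_CONNECTIVES]

def label_arguments(tokens, conn_idx):
    """Stage 2: heuristically split the sentence into Arg1/Arg2 around the connective."""
    return {"arg1": tokens[:conn_idx], "arg2": tokens[conn_idx + 1:]}

def classify_sense(connective):
    """Stage 3: map the connective to a coarse PDTB sense class (illustrative table)."""
    senses = {"because": "Contingency", "however": "Comparison",
              "but": "Comparison", "although": "Comparison"}
    return senses.get(connective.lower(), "Expansion")

def parse(sentence):
    tokens = sentence.split()
    relations = []
    for i in identify_connectives(tokens):
        args = label_arguments(tokens, i)
        relations.append({"connective": tokens[i],
                          "sense": classify_sense(tokens[i]), **args})
    return relations

rels = parse("He stayed home because it rained")
```

The error-cascading evaluation the abstract mentions follows naturally from this shape: a mistake in stage 1 propagates to stages 2 and 3.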
Compacting the Penn Treebank Grammar  [PDF]
Alexander Krotov, Mark Hepple, Robert Gaizauskas, Yorick Wilks
Computer Science , 1999,
Abstract: Treebanks, such as the Penn Treebank (PTB), offer a simple approach to obtaining a broad coverage grammar: one can simply read the grammar off the parse trees in the treebank. While such a grammar is easy to obtain, a square-root rate of growth of the rule set with corpus size suggests that the derived grammar is far from complete and that much more treebanked text would be required to obtain a complete grammar, if one exists at some limit. However, we offer an alternative explanation in terms of the underspecification of structures within the treebank. This hypothesis is explored by applying an algorithm to compact the derived grammar by eliminating redundant rules -- rules whose right hand sides can be parsed by other rules. The size of the resulting compacted grammar, which is significantly less than that of the full treebank grammar, is shown to approach a limit. However, such a compacted grammar does not yield very good performance figures. A version of the compaction algorithm taking rule probabilities into account is proposed, which is argued to be more linguistically motivated. Combined with simple thresholding, this method can be used to give a 58% reduction in grammar size without significant change in parsing performance, and can produce a 69% reduction with some gain in recall, but a loss in precision.
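The redundancy test at the heart of the compaction algorithm — a rule is redundant if its right-hand side can be parsed by the other rules — can be sketched as a small chart parse over the symbol sequence. The grammar fragment below is an illustrative example, not drawn from the paper:

```python
# Sketch of the grammar-compaction redundancy test: a flat rule like
# NP -> DT JJ NN is redundant if the remaining rules (e.g. NP -> DT Nom,
# Nom -> JJ NN) can already parse its right-hand side to the same LHS.

def covers(body, i, j, chart):
    """Can the rule body tile the span [i, j), each symbol covering a sub-span?"""
    if not body:
        return i == j
    head, rest = body[0], body[1:]
    for k in range(i + 1, j + 1):
        if head in chart[i][k] and covers(rest, k, j, chart):
            return True
    return False

def parseable(target_lhs, rhs, rules):
    """True if `rhs` can be reduced to `target_lhs` using `rules` (naive chart parse)."""
    n = len(rhs)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, sym in enumerate(rhs):
        chart[i][i + 1].add(sym)           # each symbol covers its own position
    changed = True
    while changed:                          # saturate the chart
        changed = False
        for span in range(1, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for lhs, body in rules:
                    if lhs not in chart[i][j] and covers(body, i, j, chart):
                        chart[i][j].add(lhs)
                        changed = True
    return target_lhs in chart[0][n]

rules = [("NP", ("DT", "Nom")), ("Nom", ("JJ", "NN")), ("NP", ("DT", "JJ", "NN"))]
flat_rule = rules[2]
others = [r for r in rules if r != flat_rule]
redundant = parseable(flat_rule[0], list(flat_rule[1]), others)
```

Running this test over every rule, and deleting rules found redundant, is the basic compaction loop; the probabilistic variant in the paper additionally weighs rule frequencies before deletion.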
The biomedical discourse relation bank
Rashmi Prasad, Susan McRoy, Nadya Frid, Aravind Joshi, Hong Yu
BMC Bioinformatics , 2011, DOI: 10.1186/1471-2105-12-188
Abstract: We have developed the Biomedical Discourse Relation Bank (BioDRB), in which we have annotated explicit and implicit discourse relations in 24 open-access full-text biomedical articles from the GENIA corpus. Guidelines for the annotation were adapted from the Penn Discourse TreeBank (PDTB), which has discourse relations annotated over open-domain news articles. We introduced new conventions and modifications to the sense classification. We report reliable inter-annotator agreement of over 80% for all sub-tasks. Experiments for identifying the sense of explicit discourse connectives show the connective itself as a highly reliable indicator for coarse sense classification (accuracy 90.9% and F1 score 0.89). These results are comparable to results obtained with the same classifier on the PDTB data. With more refined sense classification, there is degradation in performance (accuracy 69.2% and F1 score 0.28), mainly due to sparsity in the data. The size of the corpus was found to be sufficient for identifying the sense of explicit connectives, with classifier performance stabilizing at about 1900 training instances. Finally, the classifier performs poorly when trained on PDTB and tested on BioDRB (accuracy 54.5% and F1 score 0.57).Our work shows that discourse relations can be reliably annotated in biomedical text. Coarse sense disambiguation of explicit connectives can be done with high reliability by using just the connective as a feature, but more refined sense classification requires either richer features or more annotated data. The poor performance of a classifier trained in the open domain and tested in the biomedical domain suggests significant differences in the semantic usage of connectives across these domains, and provides robust evidence for a biomedical sublanguage for discourse and the need to develop a specialized biomedical discourse annotated corpus. 
The results of our cross-domain experiments are consistent with related work on identifying connectives.
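The abstract's finding that the connective alone is a highly reliable feature for coarse sense classification amounts to a majority-sense baseline: for each connective, predict the sense it most often carries in training. A minimal sketch, with an invented toy training set:

```python
# Sketch of the "connective as the only feature" baseline: learn each
# connective's majority sense from (connective, sense) training pairs.
from collections import Counter, defaultdict

def train_majority_sense(pairs):
    """pairs: iterable of (connective, sense). Returns connective -> majority sense."""
    counts = defaultdict(Counter)
    for conn, sense in pairs:
        counts[conn.lower()][sense] += 1
    return {conn: c.most_common(1)[0][0] for conn, c in counts.items()}

def predict(model, connective, default="Expansion"):
    """Look up the majority sense; fall back to a default for unseen connectives."""
    return model.get(connective.lower(), default)

train = [("because", "Contingency"), ("because", "Contingency"),
         ("but", "Comparison"), ("but", "Contingency"), ("but", "Comparison")]
model = train_majority_sense(train)
```

The degradation the abstract reports for fine-grained senses fits this picture: once a connective maps to many refined senses, a per-connective lookup no longer separates them and richer features or more data are needed.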
The Index Thomisticus Treebank Project: Annotation, Parsing and Valency Lexicon  [PDF]
Barbara McGillivray, Marco Passarotti, Paolo Ruffolo
Traitement Automatique des Langues , 2010,
Abstract: We present an overview of the Index Thomisticus Treebank project (IT-TB). The IT-TB consists of around 60,000 tokens from the Index Thomisticus by Roberto Busa SJ, an 11-million-token Latin corpus of the texts by Thomas Aquinas. We briefly describe the annotation guidelines, shared with the Latin Dependency Treebank (LDT). The application of data-driven dependency parsers on IT-TB and LDT data is reported on. We present training and parsing results on several datasets and provide evaluation of learning algorithms and techniques. Furthermore, we introduce the IT-TB valency lexicon extracted from the treebank. We report on quantitative data of the lexicon and provide some statistical measures on subcategorisation structures.
Rule-based Automatic Annotating for the Discourse of English Complicated Sentences (基于规则的英语复句关联词自动标注技术)
Shen Chunyan (申春艳), Wang Huilin (王惠临)
New Technology of Library and Information Service (现代图书情报技术) , 2008,
Abstract: This paper introduces finite state transducer technology and, drawing on the design thinking behind the Penn Treebank, simplifies complicated English sentences through rule analysis combined with POS tagging, recognition of discourse connectives, punctuation, vocabulary mapping, and chunking. Final results are expressed in the form of propositions.
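The core device the abstract names, a finite state transducer, consumes input symbols while emitting output symbols as it moves between states. The tiny machine below, which rewrites connective tokens to discourse tags and copies everything else, is purely illustrative and not the authors' rule set:

```python
# Sketch of a finite state transducer: a transition table mapping
# (state, input symbol) -> (next state, output symbol). Inputs with no
# matching transition are copied through unchanged.

def run_fst(transitions, start, tokens):
    """Run the transducer over `tokens`; return (final state, output tokens)."""
    state, out = start, []
    for tok in tokens:
        nxt, emit = transitions.get((state, tok.lower()), (state, tok))
        out.append(emit)
        state = nxt
    return state, out

T = {("q0", "although"): ("q1", "<CONC>"),   # concessive connective opens a clause
     ("q1", ","): ("q0", "<BOUND>")}          # comma closes the subordinate clause
state, tags = run_fst(T, "q0", ["Although", "it", "rained", ",", "we", "left"])
```

Chaining several such transducers — one per level (POS, connectives, punctuation, chunks) — is the usual way rule-based simplification pipelines of this kind are composed.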
Two Languages - One Annotation Scenario? Experience from the Prague Dependency Treebank
Silvie Cinková, Eva Hajičová, Jarmila Panevová, Petr Sgall
The Prague Bulletin of Mathematical Linguistics , 2008, DOI: 10.2478/v10108-009-0001-y
Abstract: This paper compares the two FGD-based annotation scenarios for Czech and for English, with the Czech as the basis. We discuss the secondary predication expressed by infinitive and its functions in Czech and English, respectively. We give a few examples of English constructions that do not have direct counterparts in Czech (e.g., tough movement and causative constructions with make, get, and have), as well as some phenomena central in English but much less employed in Czech (object raising or control in adjectives as nominal predicates), and, last, structures more or less parallel both in their function and distribution, whose respective annotation differs due to significant differences in the respective linguistic traditions (verbs of perception).
Entity-Augmented Distributional Semantics for Discourse Relations  [PDF]
Yangfeng Ji, Jacob Eisenstein
Computer Science , 2014,
Abstract: Discourse relations bind smaller linguistic elements into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked sentences. A more subtle challenge is that it is not enough to represent the meaning of each sentence of a discourse relation, because the relation may depend on links between lower-level elements, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted not only from the distributional representations of the sentences, but also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.
One Vector is Not Enough: Entity-Augmented Distributional Semantics for Discourse Relations  [PDF]
Yangfeng Ji, Jacob Eisenstein
Computer Science , 2014,
Abstract: Discourse relations bind smaller linguistic units into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked arguments. A more subtle challenge is that it is not enough to represent the meaning of each argument of a discourse relation, because the relation may depend on links between lower-level components, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted from the distributional representations of the arguments, and also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.
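The upward composition this abstract (and the earlier version above) describes — each node's vector is computed from its children's vectors up the parse tree — can be sketched with a trivial stand-in composition function. Averaging here replaces the learned composition matrices of the actual model, and the embeddings are invented:

```python
# Sketch of upward composition over a parse tree: a leaf gets its word
# embedding; an internal node gets a function of its children's vectors.
# Elementwise averaging stands in for the learned composition the paper uses.

def compose(vecs):
    """Combine child vectors (toy version: elementwise average)."""
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def upward(node, embeddings):
    """node: word string (leaf) or tuple of child subtrees. Returns node vector."""
    if isinstance(node, str):
        return embeddings[node]
    return compose([upward(child, embeddings) for child in node])

emb = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "slept": [1.0, 1.0]}
tree = (("the", "cat"), "slept")   # ((the cat) slept)
vec = upward(tree, emb)
```

The paper's novel step is the additional downward pass that produces vectors for entity mentions, so that coreferent mentions in the two arguments can contribute to the relation prediction alongside the argument vectors themselves.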
Bagging and Boosting a Treebank Parser  [PDF]
John C. Henderson, Eric Brill
Computer Science , 2000,
Abstract: Bagging and boosting, two effective machine learning techniques, are applied to natural language parsing. Experiments using these techniques with a trainable statistical parser are described. The best resulting system provides roughly as large of a gain in F-measure as doubling the corpus size. Error analysis of the result of the boosting technique reveals some inconsistent annotations in the Penn Treebank, suggesting a semi-automatic method for finding inconsistent treebank annotations.
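Bagging, as applied here, trains the same learner on bootstrap resamples of the treebank and combines the resulting parsers' outputs by voting. A minimal sketch with a toy memorizing learner standing in for the statistical parser:

```python
# Sketch of bagging: train one model per bootstrap resample of the data,
# then combine predictions by majority vote. The toy "learner" memorizes
# the majority label per input; a real application trains a full parser.
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def train_toy(data):
    """Toy learner: majority label per input value."""
    counts = {}
    for x, y in data:
        counts.setdefault(x, Counter())[y] += 1
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}

def bag(data, n_models=5, seed=0):
    rng = random.Random(seed)
    return [train_toy(bootstrap(data, rng)) for _ in range(n_models)]

def vote(models, x, default=None):
    """Majority vote among models that have seen x."""
    votes = Counter(m[x] for m in models if x in m)
    return votes.most_common(1)[0][0] if votes else default

data = [("a", 1), ("a", 1), ("a", 0), ("b", 0), ("b", 0)]
models = bag(data)
```

The paper's side observation follows from this setup: examples on which the ensemble disagrees persistently are good candidates for inconsistent annotations in the underlying treebank.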
Annotation graphs as a framework for multidimensional linguistic data analysis  [PDF]
Steven Bird, Mark Liberman
Computer Science , 1999,
Abstract: In recent work we have presented a formal framework for linguistic annotation based on labeled acyclic digraphs. These `annotation graphs' offer a simple yet powerful method for representing complex annotation structures incorporating hierarchy and overlap. Here, we motivate and illustrate our approach using discourse-level annotations of text and speech data drawn from the CALLHOME, COCONUT, MUC-7, DAMSL and TRAINS annotation schemes. With the help of domain specialists, we have constructed a hybrid multi-level annotation for a fragment of the Boston University Radio Speech Corpus which includes the following levels: segment, word, breath, ToBI, Tilt, Treebank, coreference and named entity. We show how annotation graphs can represent hybrid multi-level structures which derive from a diverse set of file formats. We also show how the approach facilitates substantive comparison of multiple annotations of a single signal based on different theoretical models. The discussion shows how annotation graphs open the door to wide-ranging integration of tools, formats and corpora.
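The annotation-graph idea — labeled arcs of a directed acyclic graph over shared, optionally time-anchored nodes, so that overlapping layers (words, ToBI, coreference, ...) coexist in one structure — can be sketched as a small data structure. Field names here follow the general idea, not the exact formalism of the paper:

```python
# Sketch of an annotation graph: time-anchored nodes shared by all
# annotation layers, and labeled arcs (src, dst, layer, label) between them.
# Overlapping spans are just arcs over the same anchors.

class AnnotationGraph:
    def __init__(self):
        self.anchors = {}   # node id -> time offset (None if unanchored)
        self.arcs = []      # (src, dst, layer, label)

    def add_anchor(self, node, time=None):
        self.anchors[node] = time

    def add_arc(self, src, dst, layer, label):
        self.arcs.append((src, dst, layer, label))

    def layer(self, name):
        """All arcs belonging to one annotation level."""
        return [a for a in self.arcs if a[2] == name]

g = AnnotationGraph()
for n, t in [(0, 0.0), (1, 0.4), (2, 0.9)]:
    g.add_anchor(n, t)
g.add_arc(0, 1, "word", "hello")
g.add_arc(1, 2, "word", "world")
g.add_arc(0, 2, "segment", "greeting")   # spans both words: hierarchy via shared anchors
```

Because every layer refers to the same anchor set, comparing two annotations of one signal reduces to comparing arc sets, which is the substantive-comparison point the abstract makes.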
Copyright © 2008-2017 Open Access Library. All rights reserved.