oalib
Search Results: 1 - 10 of 100 matches for " "
All listed articles are free for downloading (OA Articles)
Metaphor Identification in Large Texts Corpora  [PDF]
Yair Neuman, Dan Assaf, Yohai Cohen, Mark Last, Shlomo Argamon, Newton Howard, Ophir Frieder
PLOS ONE , 2013, DOI: 10.1371/journal.pone.0062343
Abstract: Identifying metaphorical language use (e.g., "sweet child") is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification, all variations of the same core algorithm. We evaluate the algorithms on two corpora of Reuters and New York Times articles. The paper presents the most comprehensive study of metaphor identification to date in terms of the scope of metaphorical phrases and the size of the annotated corpora. The algorithms' performance in classifying linguistic phrases as metaphorical or literal was compared against human judgment. Overall, the algorithms outperform the state-of-the-art algorithm, reaching 71% precision and a 27% average improvement in prediction over the base rate of metaphors in the corpus.
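The abstract does not spell out the core algorithm, but a common baseline for adjective-noun pairs like "sweet child" treats metaphor as a selectional-preference violation. Below is a minimal sketch of that idea; the LITERAL_ARGS and NOUN_CLASS lexicons are toy assumptions, not the authors' resources.

```python
# A minimal sketch of metaphor detection as selectional-preference
# violation, in the spirit of the adjective-noun case ("sweet child").
# The two lexicons are hypothetical toy data; the abstract does not
# describe the paper's actual core algorithm.

# For each adjective, the semantic classes of nouns it modifies literally.
LITERAL_ARGS = {
    "sweet": {"FOOD", "DRINK"},
    "bright": {"LIGHT_SOURCE", "COLOR"},
}

# A toy mapping from nouns to semantic classes.
NOUN_CLASS = {
    "cake": "FOOD",
    "child": "PERSON",
    "lamp": "LIGHT_SOURCE",
    "student": "PERSON",
}

def is_metaphorical(adjective: str, noun: str) -> bool:
    """Flag an adjective-noun pair as metaphorical when the noun's class
    falls outside the adjective's literal selectional preferences."""
    literal = LITERAL_ARGS.get(adjective)
    cls = NOUN_CLASS.get(noun)
    if literal is None or cls is None:
        return False  # abstain on unknown words rather than guess
    return cls not in literal

print(is_metaphorical("sweet", "cake"))     # False: literal
print(is_metaphorical("sweet", "child"))    # True: metaphorical
print(is_metaphorical("bright", "student")) # True: metaphorical
```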
Building of Networks of Natural Hierarchies of Terms Based on Analysis of Texts Corpora  [PDF]
Dmitry Lande
Computer Science , 2014,
Abstract: A technique for building networks of natural hierarchies of terms, based on the analysis of a chosen text corpus, is presented. The technique rests on the methodology of horizontal visibility graphs. A language network was constructed and investigated, formed from electronic preprints on arXiv on information retrieval topics.
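The horizontal visibility criterion the abstract names is standard: two points in a numeric series are linked when every value between them is lower than both. A minimal sketch, assuming the terms of a corpus have already been mapped to a sequence of weights (the abstract does not detail that mapping):

```python
# A minimal sketch of a horizontal visibility graph (HVG): nodes i and j
# are linked when every value strictly between them is lower than both
# endpoints. The toy series of term weights is assumed data.

def horizontal_visibility_graph(series):
    """Return the HVG edge list for a numeric sequence."""
    edges = []
    n = len(series)
    for i in range(n):
        for j in range(i + 1, n):
            # i "sees" j horizontally if no intermediate value blocks the view
            if all(series[k] < min(series[i], series[j]) for k in range(i + 1, j)):
                edges.append((i, j))
    return edges

weights = [3, 1, 2, 5, 1, 4]  # e.g., weights of successive terms
print(horizontal_visibility_graph(weights))
# [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
```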
SPOKEN CORPORA: RATIONALE AND APPLICATION
John Newman
Taiwan Journal of Linguistics , 2008,
Abstract: Despite the abundance of electronic corpora now available to researchers, corpora of natural speech are still relatively rare and relatively costly. This paper suggests reasons why spoken corpora are needed, despite the formidable problems of construction. The multiple purposes of such corpora and the involvement of very different kinds of language communities in such projects mean that there is no one single blueprint for the design, markup, and distribution of spoken corpora. A number of different spoken corpora are reviewed to illustrate a range of possibilities for the construction of spoken corpora.
Segmenting DNA sequence into `words'  [PDF]
Wang Liang
Computer Science , 2012,
Abstract: This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, we find that the length of most DNA 'words' is 12 to 15 bp by analyzing the genomes of 12 model species. We then design an unsupervised, probability-based approach to segment the DNA sequences. A benchmark for the segmentation method is also proposed.
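A minimal sketch of the dynamic-programming segmentation such an n-gram model enables; the word probabilities below are toy stand-ins for the statistics the paper learns without supervision:

```python
# A minimal sketch of probability-based sequence segmentation via dynamic
# programming, in the spirit of n-gram word segmentation. WORD_LOGP is
# toy data; the paper learns such statistics from genomes in an
# unsupervised way, which is not reproduced here.
import math

MAX_WORD = 15                  # abstract: most DNA "words" are 12-15 bp
UNKNOWN_LOGP = math.log(1e-8)  # assumed smoothing for unseen substrings

WORD_LOGP = {                  # hypothetical learned "word" probabilities
    "ATG": math.log(0.05),
    "GATTACA": math.log(0.02),
    "TTT": math.log(0.03),
}

def segment(seq: str):
    """Return the most probable segmentation under a unigram word model."""
    n = len(seq)
    best = [0.0] + [float("-inf")] * n  # best[i]: log-prob of seq[:i]
    back = [0] * (n + 1)                # back[i]: start of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD), i):
            score = best[j] + WORD_LOGP.get(seq[j:i], UNKNOWN_LOGP)
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n
    while i > 0:
        words.append(seq[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("ATGGATTACATTT"))  # ['ATG', 'GATTACA', 'TTT']
```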
Corpora and concordancers on the nl.ijs.si server  [PDF]
Tomaž Erjavec
Slovenščina 2.0: Empirične, Aplikativne in Interdisciplinarne Raziskave , 2013,
Abstract: The paper presents the monolingual and parallel corpora which can be accessed through two concordancers on the server nl.ijs.si. Twelve monolingual corpora contain Slovene texts, one Japanese, and one English; they comprise reference corpora, such as Gigafida for contemporary written Slovene, IMP for historical Slovene, and GOS for spoken Slovene, as well as specialised corpora, such as a corpus of texts from the informatics domain and a corpus of Slovene tweets. The five parallel corpora contain Slovene texts sentence-aligned with, variously, English, Japanese, French, German, and Italian, from domains such as EU law, literature, and journalism. Although most of the corpora were produced in earlier projects, they have now been newly annotated, some have been extended with additional texts, and a few are completely new. The texts in the corpora are supplied with metadata, while their word tokens are annotated, manually or automatically, with at least lemmas and morphosyntactic descriptions. Most of the corpora are freely available through two web concordancers, the noSketch Engine and CUWI. These two corpus analysis tools support searching large annotated corpora, various types of search result display, filtering searches by metadata, and saving search results locally. In addition to the corpora and concordancers, the paper discusses some issues pertaining to such a corpus-linguistic infrastructure and concludes with directions for further work.
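At their core, such concordancers generalize the keyword-in-context (KWIC) display to large annotated corpora. A minimal sketch, assuming plain whitespace tokenization and a fixed context window:

```python
# A minimal keyword-in-context (KWIC) sketch -- the basic operation that
# concordancers such as the noSketch Engine and CUWI generalize to large
# annotated corpora. Whitespace tokenization and a 3-token window are
# simplifying assumptions.

def kwic(tokens, keyword, window=3):
    """Yield (left context, keyword, right context) for each hit."""
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield left, tok, right

text = "the corpus of Slovene tweets complements the reference corpus of written Slovene"
for left, kw, right in kwic(text.split(), "corpus"):
    print(f"{left:>30} | {kw} | {right}")
```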
Corpora and cognitive linguistics
Newman, John;
Revista Brasileira de Linguística Aplicada , 2011, DOI: 10.1590/S1984-63982011000200010
Abstract: Corpora are a natural source of data for cognitive linguists, since corpora, more than any other source of data, reflect "usage" - a notion which is often claimed to be of critical importance to the field of cognitive linguistics. Corpora are relevant to all the main topics of interest in cognitive linguistics: metaphor, polysemy, synonymy, prototypes, and constructional analysis. I consider each of these topics in turn and offer suggestions about which methods of analysis can be profitably used with available corpora to explore these topics further. In addition, I consider how the design and content of currently used corpora need to be rethought if corpora are to provide all the types of usage data that cognitive linguists require.
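As one concrete example of such a usage-based method, collocation strength can be measured with pointwise mutual information over corpus co-occurrence; the toy corpus below is assumed for illustration:

```python
# A minimal sketch of one corpus method of the kind surveyed above:
# scoring word co-occurrence with pointwise mutual information (PMI),
# a standard usage-based measure for collocation and near-synonym
# comparison. The three-sentence corpus is toy data.
import math

sentences = [
    "strong tea is bitter".split(),
    "powerful engine roars".split(),
    "strong coffee and strong tea".split(),
]
n = len(sentences)

def p(*words):
    """Fraction of sentences containing all the given words."""
    return sum(all(w in s for w in words) for s in sentences) / n

def pmi(w1, w2):
    """Pointwise mutual information over sentence-level co-occurrence."""
    return math.log(p(w1, w2) / (p(w1) * p(w2)))

print(pmi("strong", "tea"))       # > 0: "strong tea" collocates here
print(pmi("powerful", "engine"))  # > 0 in this toy corpus
```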
Automatically Segmenting Oral History Transcripts  [PDF]
Ryan Shaw
Computer Science , 2015,
Abstract: Dividing oral histories into topically coherent segments can make them more accessible online. People regularly make judgments about where coherent segments can be extracted from oral histories. But making these judgments can be taxing, so automated assistance is potentially attractive to speed the task of extracting segments from open-ended interviews. When different people are asked to extract coherent segments from the same oral histories, they often do not agree about precisely where such segments begin and end. This low agreement makes the evaluation of algorithmic segmenters challenging, but there is reason to believe that for segmenting oral history transcripts, some approaches are more promising than others. The BayesSeg algorithm performs slightly better than TextTiling, while TextTiling does not perform significantly better than a uniform segmentation. BayesSeg might be used to suggest boundaries to someone segmenting oral histories, but this segmentation task needs to be better defined.
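TextTiling, one of the baselines compared here, proposes boundaries where lexical similarity between adjacent blocks of text dips. A simplified sketch of that idea follows; the block size and raw word counts are simplifying assumptions, and Hearst's full algorithm adds smoothing and depth scoring.

```python
# A simplified sketch of the idea behind TextTiling: slide a boundary
# between adjacent blocks of sentences and propose segment breaks where
# the lexical cosine similarity across the boundary dips. Block size and
# raw word counts are simplifying assumptions.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundary_scores(sentences, block=2):
    """Similarity across each candidate boundary; low scores suggest breaks."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for i in range(block, len(bags) - block + 1):
        left = sum(bags[i - block:i], Counter())
        right = sum(bags[i:i + block], Counter())
        scores.append((i, cosine(left, right)))
    return scores

sents = [
    "the farm raised dairy cattle",
    "milk from the cattle was sold in town",
    "the war changed everything in town",
    "soldiers came through the town that year",
]
for i, score in boundary_scores(sents):
    print(f"boundary before sentence {i}: similarity {score:.2f}")
```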
Corpora for computational linguistics
Constantin Orasan, Le An Ha, Richard Evans, Laura Hasler
Ilha do Desterro , 2008,
Abstract: Since the mid-1990s, corpora have become very important for computational linguistics. This paper offers a survey of how they are currently used in different fields of the discipline, with particular emphasis on anaphora and coreference resolution, automatic summarisation, and term extraction. Their influence on other fields is also briefly discussed.
TexComp - A Text Complexity Analyzer for Student Texts  [PDF]
T. Kakkonen
Computer Science , 2012,
Abstract: This paper describes a method for providing feedback about the degree of complexity that is present in particular texts. Both the method and the software tool called TexComp are designed for use during the assessment of student compositions (such as essays and theses). The method is based on a cautious approach to the application of readability and lexical diversity formulas for reasons that are analyzed in detail in this paper. We evaluated the tool by using USE and BAWE, two corpora of texts that originate from students who use English as a medium of instruction.
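The abstract does not list the formulas TexComp implements, but the Flesch Reading Ease score and the type-token ratio are representative of the readability and lexical diversity measures it applies cautiously. A minimal sketch, with a rough vowel-group syllable heuristic:

```python
# A minimal sketch of two measures of the kind TexComp applies cautiously:
# Flesch Reading Ease and the type-token ratio (a basic lexical diversity
# index). The syllable counter is a crude vowel-group heuristic, and the
# abstract does not specify which formulas TexComp actually implements.
import re

def syllables(word: str) -> int:
    """Approximate syllable count as runs of vowels (a rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sents = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syls = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sents) - 84.6 * (syls / len(words))

def type_token_ratio(text: str) -> float:
    """Distinct word forms divided by total word tokens."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return len(set(words)) / len(words)

essay = "The method is simple. The method is also fast and robust."
print(f"Flesch: {flesch_reading_ease(essay):.1f}")
print(f"TTR:    {type_token_ratio(essay):.2f}")
```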
About the creation of a parallel bilingual corpora of web-publications  [PDF]
D. V. Lande,V. V. Zhygalo
Computer Science , 2008,
Abstract: An algorithm for creating parallel text corpora is presented. The algorithm is based on the use of key words in text documents and on their automated translation. Key words were singled out using Russian and Ukrainian morphological dictionaries, as well as dictionaries of noun translations between Russian and Ukrainian. In addition, empiric-statistic rules were used to calculate the weights of terms in the documents. The algorithm was implemented as a program complex integrated into the InfoStream content-monitoring system. As a result, a parallel bilingual corpus of web publications containing about 30 thousand documents was created.
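A minimal sketch of the keyword-matching idea: translate each Ukrainian document's keywords with a noun-translation dictionary and pair documents by translated-keyword overlap. The toy dictionary and plain overlap score stand in for the paper's morphological dictionaries and empiric-statistic term weights:

```python
# A minimal sketch of keyword-based document pairing across languages.
# UK_TO_RU is a hypothetical noun-translation dictionary, and the simple
# set-overlap score stands in for the paper's term weighting.

UK_TO_RU = {
    "новини": "новости",
    "уряд": "правительство",
    "банк": "банк",
    "податок": "налог",
}

def best_match(uk_keywords, ru_docs):
    """Pair a Ukrainian document (its keyword set) with the Russian
    document sharing the most translated keywords."""
    translated = {UK_TO_RU.get(w) for w in uk_keywords} - {None}
    return max(ru_docs, key=lambda doc: len(translated & set(doc)))

ru_docs = [
    {"новости", "спорт"},
    {"правительство", "налог", "банк"},
]
print(best_match({"уряд", "податок"}, ru_docs))  # the government/tax document
```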