|
Best friends or just faking it? Corpus-based extraction of Slovene-Croatian translation equivalents and false friends

Keywords: automatic bilingual lexicon extraction, distributional semantics, closely related languages, cognates, false friends

Abstract: In this paper we present a corpus-based approach to the automatic extraction of translation equivalents and false friends for Slovene and Croatian, a pair of closely related languages. While taking advantage of the orthographic similarities between the two languages, the approach relies on a straightforward but powerful assumption of distributional semantics, which stipulates that words with a similar meaning tend to be used in similar contexts in both languages. On the one hand, this phenomenon enables us to quickly generate a Slovene-Croatian bilingual lexicon from minimal knowledge sources, namely weakly comparable web corpora. On the other hand, it can also be used to identify cognates that only seem similar on the surface but are in fact used to express different concepts in the two languages. The presented approach is language-independent and therefore attractive for natural language processing tasks that often lack the necessary lexical resources and cannot afford to build them by hand. It is also useful in lexicography and language pedagogy, where it can be used to highlight the lexical characteristics specific to a given language pair or domain.
|
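The idea summarized in the abstract, combining orthographic similarity with distributional similarity, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a small seed lexicon (`seed`) that maps context words from one language into the other, uses `difflib.SequenceMatcher` as a stand-in for an orthographic similarity measure, and uses raw co-occurrence counts with cosine similarity. The function names, the thresholds, and the toy corpora with placeholder tokens (`x1`, `y1`, and so on) are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def context_vector(word, sentences, window=2, seed=None):
    """Count words co-occurring with `word` within `window` tokens.

    If `seed` is given, each context word is mapped through it into the
    other language; context words missing from `seed` are dropped.
    """
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                ctx = tokens[j]
                if seed is not None:
                    ctx = seed.get(ctx)
                if ctx is not None:
                    counts[ctx] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def orthographic_similarity(a, b):
    """Surface similarity in [0, 1]; a stand-in for any string measure."""
    return SequenceMatcher(None, a, b).ratio()


def classify_pair(sl_word, hr_word, sl_sents, hr_sents, seed,
                  ortho_threshold=0.8, context_threshold=0.3):
    """Label a word pair; both thresholds are illustrative assumptions."""
    ortho = orthographic_similarity(sl_word, hr_word)
    sl_vec = context_vector(sl_word, sl_sents, seed=seed)  # mapped into HR space
    hr_vec = context_vector(hr_word, hr_sents)
    ctx = cosine(sl_vec, hr_vec)
    if ortho < ortho_threshold:
        label = "not cognate"
    elif ctx >= context_threshold:
        label = "translation equivalent"
    else:
        label = "false friend candidate"
    return label, ortho, ctx


# Synthetic toy corpora: placeholder tokens, not real Slovene or Croatian.
seed = {"x1": "y1", "x2": "y2", "x3": "y3", "x4": "y4"}
sl_sents = [["x1", "friend", "x2"], ["x1", "friend", "x2"], ["x3", "faker", "x4"]]
hr_sents = [["y1", "friend", "y2"], ["y5", "faker", "y6"]]

for word in ("friend", "faker"):
    label, ortho, ctx = classify_pair(word, word, sl_sents, hr_sents, seed)
    print(word, label, round(ortho, 2), round(ctx, 2))
```

In this toy setting, a pair that looks alike and shares contexts is labelled a translation equivalent, while a pair that looks alike but occurs in different contexts is flagged as a false friend candidate. A realistic system would need large corpora, weighted context features, and tuned thresholds.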