Gigafida and slWaC: topic comparison

Keywords: Slovenian language, reference corpus, Web corpus, topic modeling

Abstract:

This article examines two issues: (a) the incorporation of Internet texts into existing reference corpora, set against the existence of dedicated web corpora, and (b) the two most recent corpora of Slovenian texts: the Gigafida corpus, which consists mainly of printed texts and, to a lesser extent, web texts, and the slWaC corpus, which is compiled entirely from web texts. First, similarities and differences between the two corpora are identified using topic modelling; the same method is then applied to the individual taxonomic categories of the Gigafida corpus. The first part of the analysis showed that compilers of reference corpora do not yet follow a consistent practice when incorporating Internet texts into corpora that are meant to give an overall picture of a language. When compilers do decide to include web texts, the range of included genres is generally broad. The second part of the analysis showed significant thematic variation between the Gigafida and slWaC corpora and identified the most typical themes covered by each of the six parts of the Gigafida corpus.
