%0 Journal Article
%T Gigafida and slWaC: topic comparison
%A Nataša Logar Berginc
%A Nikola Ljubešić
%J Slovenščina 2.0 : Empirične, Aplikativne in Interdisciplinarne Raziskave
%D 2013
%I Trojina, Institute for Applied Slovene Studies
%X The article addresses two issues: (a) the incorporation of Internet texts into existing reference corpora and how this compares with the existence of web corpora, and (b) the two most recent corpora of Slovenian-language texts: the Gigafida corpus, which consists mainly of printed texts and to a lesser extent of web texts, and the slWaC corpus, which is compiled entirely from web texts. First, similarities and differences between the two corpora are identified using topic modelling, and then the same method is applied to the individual taxonomic categories of the Gigafida corpus. The first part of the analysis showed that reference corpus compilers are currently still inconsistent in how they incorporate Internet texts into corpora that are meant to give an overall picture of a language. When compilers do decide to include web texts, the range of included genres is generally broad. The second part of the analysis showed significant thematic variation between the Gigafida and slWaC corpora and identified the most typical themes covered by each of the six parts of the Gigafida corpus.
%K Slovenian language
%K reference corpus
%K web corpus
%K topic modeling
%U http://www.trojina.org/slovenscina2.0/arhiv/2013/1/Slo2.0_2013_1_05.pdf