-  2019 

25 years of Hašek

Keywords: Hašek, spellchecking, learning, Google, n-gram systems

Abstract:

Hašek is a Croatian online spellchecker that has operated continuously since March 21, 1994, and is now available at https://ispravi.me/. In its 25 years of operation, Hašek has processed nearly 30 million texts, which form a corpus of more than 7 billion tokens. By comparison, all books ever published in Croatian form a corpus of fewer than 20 billion tokens. As a web-embedded tool, Hašek has taken advantage of many web-based services, including learning. Thanks to this learning capability, its dictionary has grown from an initial 100 thousand to more than 2 million word types. Another aspect of learning was the creation and regular updating of the Croatian n-gram system. Unlike Google, whose n-gram systems are based on the WaC (Web as Corpus) approach and cut-off criteria, the Croatian n-grams were extracted from the processed texts by a lexical criterion: each n-gram constituent must be confirmed by the spellchecker as valid in Croatian spelling. This difference in approach made the Croatian n-gram system comparable in size to the largest Google n-gram systems. Unfortunately, Croatian decision makers did not recognize the advantages of online spellchecking for rapid breakthroughs into much more sophisticated areas of language technology, with some consequences mentioned in the paper.
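
The contrast between the lexical criterion and a frequency cut-off, as described in the abstract, can be illustrated with a minimal sketch. This is not the Hašek implementation: the function names (`lexical_ngrams`, `cutoff_ngrams`), the whitespace tokenization, the toy dictionary, and the `min_count` threshold are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable, Iterator, Set, Tuple


def ngrams(tokens: list, n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous n-token window of the token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def lexical_ngrams(texts: Iterable[str], dictionary: Set[str], n: int) -> Counter:
    """Keep an n-gram only if every constituent is in the dictionary.

    The dictionary stands in for the spellchecker's validity check
    (an assumption; the real check is not described in the abstract).
    """
    counts: Counter = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive tokenization, assumed
        for gram in ngrams(tokens, n):
            if all(token in dictionary for token in gram):
                counts[gram] += 1
    return counts


def cutoff_ngrams(texts: Iterable[str], n: int, min_count: int) -> Counter:
    """Keep an n-gram only if it occurs at least min_count times."""
    counts: Counter = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(ngrams(tokens, n))
    return Counter({g: c for g, c in counts.items() if c >= min_count})


# Toy data: "xyzq" plays the role of a misspelled or invalid token.
toy_dictionary = {"dobar", "dan", "svima"}
toy_texts = ["dobar dan svima", "dobar dan xyzq"]

lexical = lexical_ngrams(toy_texts, toy_dictionary, n=2)
cutoff = cutoff_ngrams(toy_texts, n=2, min_count=2)
```

In this toy case the lexical filter retains the rare but valid bigram `("dan", "svima")` and rejects `("dan", "xyzq")`, while the cut-off filter keeps only the bigram seen twice. Retaining valid low-frequency n-grams is one plausible reason a lexically filtered system could grow large relative to its corpus.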
