全部 标题 作者
关键词 摘要

OALib Journal期刊
ISSN: 2333-9721
费用:99美元

查看量下载量

相关文章

更多...

A MODEL FOR OVERLAPPING TRIGRAM TECHNIQUE FOR TELUGU SCRIPT

Keywords: canonical structure , Text categorization , trigram , bigram , conjuncts

Full-Text   Cite this paper   Add to My Lib

Abstract:

N-grams are consecutive overlapping N-character sequences formed from an input stream. N-grams are used as alternatives to word-based retrieval in a number of systems. In this paper we propose a model applicable to categorization of Telugu document. Telugu is an official language derived from ancient Brahmi script and also the official language of the state of Andhra Pradesh. Brahmi based script is noted for complex conjunct formations. The canonical structure is described as ((C) C) CV. The structure evolves any character from a set of basic syllables known as vowels and consonants where consonant, vowel (CV) core is the basic unit optionally preceded by one or two consonants. A huge set of characters that resemble the phonetic nature with an equivalent character shape are derived from the canonical structure. Words formed from this set evolved into a large corpus. Stringent grammar rules in word formation are part of this corpus. Certain word combinations result in the formation of single word is to be addressed where the last character of the first word and first character of the successive word are combined. Keeping in view of these complexities we propose a trigram based system that provides a reasonable alternative to a word based system in achieving document categorization for the language Telugu.

Full-Text

Contact Us

service@oalib.com

QQ:3279437679

WhatsApp +8615387084133