
Search Results: 1 - 10 of 12 matches for "Tokenization"
All listed articles are free for downloading (OA Articles)
Advantages of Using a Spell Checker in Text Mining Pre-Processes
Jhonathan Quillo-Espino, Rosa María Romero-González, Alberto Lara-Guevara
Journal of Computer and Communications (JCC), 2018, DOI: 10.4236/jcc.2018.611004
Abstract: The aim of this work was to analyze the behavior of the text mining process when a spell checker was integrated as an extra pre-process during its first stage. Different models were analyzed, and the most complete one was chosen, treating the pre-processes as the initial part of the text mining process. Algorithms for the Spanish language were developed and adapted, and the methodology was tested through the analysis of 2363 words. A notation capable of removing special and unwanted characters was created. The execution times of each algorithm were analyzed to test the efficiency of the text mining pre-process with and without orthographic revision; the total time was shorter with the spell checker than without it. The key difference between this work and existing related studies is that it is the first in which a spell checker is used in the text mining pre-processes.
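
To make the pipeline concrete, here is a minimal Python sketch of a text-mining pre-process with an integrated spell-check step. The toy lexicon, the regex-based character removal and the use of difflib are illustrative assumptions, not the paper's Spanish-language algorithms.

    # Minimal sketch: special-character removal, tokenization, spell correction.
    import re
    import difflib

    DICTIONARY = ["mining", "text", "process", "data", "word"]  # toy lexicon

    def remove_special_chars(text: str) -> str:
        # Keep letters and whitespace only; drop punctuation and digits.
        return re.sub(r"[^a-zA-Z\s]", " ", text)

    def tokenize(text: str) -> list[str]:
        return text.lower().split()

    def spell_correct(token: str) -> str:
        # Replace an out-of-vocabulary token with its closest dictionary entry.
        if token in DICTIONARY:
            return token
        matches = difflib.get_close_matches(token, DICTIONARY, n=1, cutoff=0.7)
        return matches[0] if matches else token

    def preprocess(text: str) -> list[str]:
        return [spell_correct(t) for t in tokenize(remove_special_chars(text))]

    print(preprocess("Texr minning is a proccess!"))
    # ['text', 'mining', 'is', 'a', 'process']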
Preprocessing and Morphological Analysis in Text Mining
Krishna Kumar Mohbey, Sachin Tiwari
International Journal of Electronics Communication and Computer Engineering, 2011
Abstract: This paper is based on the preprocessing activities performed by software or language translators before mining algorithms are applied to large data. Text mining is an important area of data mining and plays a vital role in extracting useful information from huge databases or data warehouses. Before text mining or information extraction can be applied, preprocessing is essential, because the given data or dataset contains noisy, incomplete, inconsistent, dirty and unformatted data. In this paper we collect the necessary requirements for preprocessing. Once the preprocessing task is complete, useful knowledge can easily be extracted using a mining strategy. The paper also covers lexical analysis of the data, such as tokenization and stemming, and semantic analysis, such as phrase recognition and parsing, and describes how stemming, tokenization and parsing are applied.
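
As a concrete illustration of two of the steps named above, the following Python sketch tokenizes a string and applies a toy suffix-stripping stemmer. The suffix list is an illustrative stand-in for a real stemmer such as Porter's; the paper's own procedures are not reproduced here.

    # Minimal sketch: tokenization followed by naive suffix stripping.
    import re

    SUFFIXES = ("ing", "edly", "ed", "es", "s")  # order matters: longest first

    def tokenize(text: str) -> list[str]:
        return re.findall(r"[a-z]+", text.lower())

    def stem(token: str) -> str:
        for suffix in SUFFIXES:
            # Only strip when a reasonable stem remains.
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    text = "Mining algorithms extracted useful patterns from preprocessed texts."
    print([stem(t) for t in tokenize(text)])
    # ['min', 'algorithm', 'extract', 'useful', 'pattern', 'from', 'preprocess', 'text']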
IRS for Computer Character Sequences Filtration: a new software tool and algorithm to support the IRS at tokenization process
Ahmad Al Badawi, Qasem Abu Al-Haija
International Journal of Advanced Computer Sciences and Applications, 2013
Abstract: Tokenization is the task of chopping a character stream up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A new software tool and algorithm to support the IRS (information retrieval system) in the tokenization process are presented. The proposed tool filters out four computer character sequences: IP addresses, Web URLs, dates, and email addresses, using pattern matching algorithms and filtration methods. After this process, the IRS can start a new tokenization process on the retrieved text, which will be free of these sequences.
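
A minimal Python sketch of this kind of pattern-based filtration, assuming simplified regular expressions for the four sequence types (the authors' actual patterns are not given here):

    # Minimal sketch: strip IPs, URLs, dates and emails before tokenization.
    import re

    PATTERNS = [
        r"\b\d{1,3}(?:\.\d{1,3}){3}\b",   # IPv4 address
        r"https?://\S+",                  # web URL
        r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",   # date such as 21/05/2013
        r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",  # email address
    ]

    def filter_sequences(text: str) -> str:
        for pattern in PATTERNS:
            text = re.sub(pattern, " ", text)
        return text

    sample = "Mail root@example.com from 10.0.0.1 on 21/05/2013 via http://example.com/x"
    print(filter_sequences(sample).split())
    # ['Mail', 'from', 'on', 'via']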
Plagiarism in solutions of programming tasks in distance learning
Krzysztof Barteczko
EduAction: Electronic Education Magazine, 2012
Abstract: Source code plagiarism in students' solutions of programming tasks is a serious problem, especially important in distance learning. Naturally it should be prevented, but publicly available code plagiarism detection tools are not fully suited to this purpose. This paper proposes a specific approach to detecting code duplicates. The approach is based on adapting the detection process to the characteristics of the programming tasks, and comprises newly developed detection tools that can be configured and tuned to fit the individual features of a programming task. Particular attention is paid to the possibility of automatically eliminating duplicate codes from the set of all solutions. As a minimum, this requires the rejection of false-positive duplicates, even for simple, schematic tasks. A case study in the use of the tools is presented in this context. The discussion is illustrated by applying the proposed tools to duplicate detection in a set of actual, real-life code written in the Java programming language.
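
One common building block for such detection is token-level similarity with identifier normalization, so that renaming variables does not hide a copied structure. The Python sketch below illustrates the idea; it is not the author's Java-oriented tooling.

    # Minimal sketch: similarity of two code snippets after identifier masking.
    import difflib
    import re

    def normalize(source: str) -> list[str]:
        # Replace every non-keyword identifier with a placeholder token.
        tokens = re.findall(r"\w+|[^\w\s]", source)
        keywords = {"int", "for", "return", "if", "else", "while"}
        return [t if t in keywords or not t[0].isalpha() else "ID" for t in tokens]

    def similarity(a: str, b: str) -> float:
        return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

    original = "int sum = 0; for (int i = 0; i < n; i++) sum += i;"
    renamed  = "int total = 0; for (int k = 0; k < m; k++) total += k;"
    print(round(similarity(original, renamed), 2))  # 1.0: identical after renaming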
Part of speech Tagging in Manipuri with Hidden Markov Model
Kh Raju Singha, Bipul Syam Purkayastha, Kh Dhiren Singha
International Journal of Computer Science Issues, 2012
Abstract: Part-of-speech tagging in Manipuri is a very complex task, as Manipuri is highly agglutinative in nature. There is not enough tagged corpus for Manipuri to be used in any statistical analysis of the language. In this tagging model we use the tagged output of a Manipuri rule-based tagger as the tagged corpus. The present paper expounds part-of-speech tagging in Manipuri by applying a stochastic model called the Hidden Markov Model.
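
For readers unfamiliar with HMM tagging, the following Python sketch shows Viterbi decoding on a toy model. The tags, probabilities and English words are invented for illustration; the paper's Manipuri model is trained from the rule-based tagger's output.

    # Minimal Viterbi sketch for HMM part-of-speech tagging.
    def viterbi(words, tags, start_p, trans_p, emit_p):
        # V[i][t] = probability of the best tag sequence ending in tag t at
        # word i (raw probabilities for clarity; real code uses log space).
        V = [{t: start_p[t] * emit_p[t].get(words[0], 1e-9) for t in tags}]
        back = [{}]
        for i in range(1, len(words)):
            V.append({})
            back.append({})
            for t in tags:
                prev = max(tags, key=lambda p: V[i - 1][p] * trans_p[p][t])
                V[i][t] = V[i - 1][prev] * trans_p[prev][t] * emit_p[t].get(words[i], 1e-9)
                back[i][t] = prev
        last = max(tags, key=lambda t: V[-1][t])
        path = [last]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))

    tags = ["N", "V"]
    start_p = {"N": 0.7, "V": 0.3}
    trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
    emit_p = {"N": {"dogs": 0.6, "cats": 0.4}, "V": {"chase": 0.9, "run": 0.1}}
    print(viterbi(["dogs", "chase", "cats"], tags, start_p, trans_p, emit_p))
    # ['N', 'V', 'N']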
An Approach for Extracting the Keyword Using Frequency and Distance of the Word Calculations
Ashwini Madane, Devendra Thakore
International Journal of Soft Computing & Engineering, 2012
Abstract: A significant word used in indexing or cataloguing is regarded as a keyword. Keywords provide a concise and precise high-level summarization of a document. They therefore constitute an important feature for document retrieval, classification, topic search and other tasks, even when full-text search is available. Keywords are useful tools, as they give the shortest summary of the document. A keyword is identified by finding the relevance of the word, with or without a prior vocabulary of the document or web page. Extracting keywords manually is an extremely difficult and time-consuming process; it is almost impossible to do even for the articles published in a single conference. There is therefore a need for an automated process that extracts keywords from documents. This paper examines linguistic, non-linguistic and various other approaches to keyword extraction, but applies a simple statistical approach.
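
As an illustration of the frequency-and-distance idea in the title, the Python sketch below scores each word by its frequency and the span between its first and last occurrence. The scoring formula is an assumption made for illustration, not the one derived in the paper.

    # Minimal sketch: keyword scoring from frequency and positional spread.
    import re
    from collections import defaultdict

    def keyword_scores(text: str, top: int = 3) -> list[tuple[str, float]]:
        words = re.findall(r"[a-z]+", text.lower())
        positions = defaultdict(list)
        for i, w in enumerate(words):
            positions[w].append(i)
        scores = {}
        for w, pos in positions.items():
            freq = len(pos)
            spread = pos[-1] - pos[0] + 1  # distance covered across the text
            scores[w] = freq * spread / len(words)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

    doc = "text mining extracts patterns; text mining preprocessing helps text search"
    print(keyword_scores(doc))
    # [('text', 2.7), ('mining', 1.0), ('extracts', 0.1)]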
SxPipe 2: An Architecture for Surface Preprocessing of Raw Corpora
Benoît Sagot, Pierre Boullier
Traitement Automatique des Langues, 2009
Abstract: This article introduces SxPipe 2, a modular and customizable chain that applies a cascade of surface processing steps to raw corpora. These steps are a necessary preliminary to parsing, but they can also be used to prepare other tasks. Developed for French and for other languages, SxPipe 2 includes, among others, various modules for named-entity recognition in raw text, a sentence segmenter and tokenizer, a spelling corrector and compound-word recognizer, and an original context-free pattern recognizer used by several specialized grammars (numbers, impersonal constructions, etc.). We describe the theoretical foundations of these modules, their implementation for French, and a quantitative evaluation of some of them.
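
A minimal Python sketch of two of the listed modules, sentence segmentation and tokenization, with toy regex rules standing in for SxPipe 2's actual implementations:

    # Minimal sketch: sentence segmentation, then per-sentence tokenization.
    import re

    def split_sentences(text: str) -> list[str]:
        # Break after ., ! or ? followed by whitespace and a capital letter.
        return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

    def tokenize(sentence: str) -> list[str]:
        # Separate words from punctuation marks.
        return re.findall(r"\w+|[^\w\s]", sentence)

    text = "SxPipe processes raw text. It segments sentences, then tokenizes them!"
    for sentence in split_sentences(text):
        print(tokenize(sentence))
    # ['SxPipe', 'processes', 'raw', 'text', '.']
    # ['It', 'segments', 'sentences', ',', 'then', 'tokenizes', 'them', '!']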
Arabic-Chinese and Chinese-Arabic Phrase-Based Statistical Machine Translation Systems
Mossa Ghurab, Yueting Zhuang, Jiangqin Wu, Maan Younis Abdullah
Information Technology Journal, 2010
Abstract: Designs for Arabic-to-Chinese and Chinese-to-Arabic translation systems are presented. The core of each system implements a standard phrase-based statistical machine translation architecture. The corpus data used for the systems was collected from the United Nations website and various news websites; we focus on its acquisition, as it forms the training data of the Arabic-Chinese and Chinese-Arabic statistical machine translation systems. We trained statistical machine translation systems for the two language pairs, which revealed interesting clues to the challenges ahead. Models are then softly integrated into the statistical machine translation architecture so that they can interact with other models without modifying the basic architecture. As a result, phrase translation probabilities are learned directly rather than derived heuristically.
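
The standard baseline for learning phrase translation probabilities directly is relative-frequency estimation over extracted phrase pairs, p(f|e) = count(e, f) / count(e). The following Python sketch uses invented toy phrase pairs, not the paper's UN corpus.

    # Minimal sketch: relative-frequency phrase translation probabilities.
    from collections import Counter

    # (source phrase, target phrase) pairs as they would be extracted from
    # word-aligned training sentences.
    extracted_pairs = [
        ("the house", "la maison"),
        ("the house", "la maison"),
        ("the house", "une maison"),
        ("the", "la"),
    ]

    pair_counts = Counter(extracted_pairs)
    source_counts = Counter(src for src, _ in extracted_pairs)

    def phrase_prob(src: str, tgt: str) -> float:
        return pair_counts[(src, tgt)] / source_counts[src]

    print(phrase_prob("the house", "la maison"))   # 0.666... (2/3)
    print(phrase_prob("the house", "une maison"))  # 0.333... (1/3)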
Myanmar Language Search Engine
Pann Yu Mon, Yoshiki Mikami
International Journal of Computer Science Issues, 2011
Abstract: With the enormous growth of the World Wide Web, search engines play a critical role in retrieving information from the borderless Web. Although many search engines are available for the major languages, they are not as proficient for less computerized languages, including Myanmar. The main reason is that those search engines do not take the specific features of those languages into account. A search engine capable of searching Web documents written in those languages is highly needed, especially as more and more websites come up with localized content in multiple languages. In this study, the design and architecture of a language-specific search engine for the Myanmar language is proposed. The main features of the system are: (1) it can search Myanmar Web pages in multiple encodings; (2) it is designed to comply with the specific features of the Myanmar language. Finally, an experiment has been done to verify whether the system meets the design requirements.
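
One simple way to handle pages in multiple encodings is to try candidate codecs on the raw bytes and keep the decoding that yields the most codepoints in the Myanmar Unicode block (U+1000-U+109F). This Python sketch is an illustrative assumption; the paper's actual encoding handling (e.g. legacy font encodings) may differ considerably.

    # Minimal sketch: pick the codec whose output looks most like Myanmar text.
    CANDIDATE_CODECS = ["utf-8", "utf-16", "latin-1"]

    def myanmar_score(text: str) -> int:
        # Count characters falling in the Myanmar Unicode block.
        return sum(1 for ch in text if 0x1000 <= ord(ch) <= 0x109F)

    def decode_page(raw: bytes) -> str:
        best_text, best_score = "", -1
        for codec in CANDIDATE_CODECS:
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue
            score = myanmar_score(text)
            if score > best_score:
                best_text, best_score = text, score
        return best_text

    raw = "မြန်မာ".encode("utf-8")  # "Myanmar" in Myanmar script
    print(decode_page(raw) == "မြန်မာ")  # True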
Text Categorization Using Activation Based Term Set
M. Pushpa, K. Nirmala
International Journal of Computer Science Issues, 2012
Abstract: Text classification is a challenging field in the current scenario and has great importance in text categorization applications. Documents may be classified or categorized according to their subjects or according to their attributes. There is a need to categorize a collection of text documents into mutually exclusive categories by extracting the concepts or features, using a supervised learning paradigm and different classification algorithms. In this paper we present a naive approach to classification using a semi-supervised text classification methodology with the help of activation term sets. Such frequent term sets can be discovered based on David Merrill's First Principles of Instruction (FPI) techniques. The system uses a pre-defined category group, providing it with a proper training set based on the activation of FPI. We made an attempt to classify documents using the FPI methodology; the algorithm involves text tokenization, text categorization and text analysis.
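
A minimal Python sketch of categorizing a document by which pre-defined term set it activates most strongly. The categories and term sets here are toy assumptions; the paper's activation term sets, derived from FPI, are not modeled.

    # Minimal sketch: tokenize, then pick the category with the most
    # activated terms.
    import re

    TERM_SETS = {
        "sports": {"match", "team", "score", "league"},
        "computing": {"data", "mining", "token", "algorithm"},
    }

    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"[a-z]+", text.lower()))

    def categorize(text: str) -> str:
        tokens = tokenize(text)
        # Activation = number of a category's terms present in the document.
        return max(TERM_SETS, key=lambda c: len(TERM_SETS[c] & tokens))

    print(categorize("The mining algorithm tokenizes the data first."))  # computing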