Automatic Arabic Document Classification Based on the HRWiTD Algorithm

DOI: 10.4236/jsea.2018.114011, PP. 167-179

Keywords: Automatic Text Classification, Confusion Matrix, SPA, Machine Learning Algorithms

Abstract:

Documents contain a large amount of valuable knowledge on many subjects, and documents on the Internet are now available from a wide range of sources. Automatic, rapid and accurate classification of these documents with little human interaction has therefore become necessary. In this paper, we introduce a new algorithm, the highest repetition of words in a text document (HRWiTD), for the automatic classification of Arabic text. The corpus is divided into a training set and a test set, to which the proposed classification technique is applied. The training set is analyzed for learning, and the learned data are stored in the Learning Dataset file. For each word, the category in which the word has the highest repetition is assigned as the category of that word in the Learning Dataset file. This file contains each distinct word found in the texts of the training set, together with its highest repetition value and the corresponding category. For each text in the test set, every word is assigned to a category by looking it up in the Learning Dataset file, and the category that receives the largest number of words is assigned as the predicted category of the text. The classification accuracy of the HRWiTD algorithm is evaluated with the confusion matrix method. The HRWiTD algorithm has been applied to convergent samples from six categories of Arabic news from the SPA (Saudi Press Agency), where it achieves an accuracy of 86.84%. In addition, we ran the most popular machine learning algorithms on the same corpus: C5.0, KNN, SVM, NB and C4.5 achieve classification accuracies of 52.86%, 52.38%, 51.90%, 51.90% and 30%, respectively. Thus, the HRWiTD algorithm gives better classification accuracy than the most popular machine learning algorithms on the selected domain.
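The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`train`, `predict`, `confusion_matrix`, `accuracy`), the whitespace tokenization, the tie-breaking behavior, the choice to count every word occurrence in a test text, and the in-memory dictionary standing in for the Learning Dataset file are all assumptions made for illustration. Any Arabic preprocessing the paper may apply is not stated in the abstract and is omitted here.

```python
from collections import Counter, defaultdict


def train(train_set):
    """Build the learning dataset: word -> (category, highest repetition).

    train_set is an iterable of (text, category) pairs.
    Assumption: words are split on whitespace, with no further preprocessing.
    """
    repetitions = defaultdict(Counter)  # word -> {category: repetition count}
    for text, category in train_set:
        for word in text.split():
            repetitions[word][category] += 1
    # Keep, for each distinct word, the category with the highest repetition.
    # Assumption: on a tie, the category seen first in the training set wins.
    return {
        word: max(per_category.items(), key=lambda item: item[1])
        for word, per_category in repetitions.items()
    }


def predict(text, learning_dataset):
    """Return the category that receives the largest number of words.

    Assumption: every occurrence of a known word counts as one vote;
    words missing from the learning dataset are ignored.
    """
    votes = Counter(
        learning_dataset[word][0]
        for word in text.split()
        if word in learning_dataset
    )
    if not votes:
        return None  # no known word, so no category can be predicted
    return votes.most_common(1)[0][0]


def confusion_matrix(pairs):
    """Count (actual, predicted) pairs as matrix[actual][predicted]."""
    matrix = defaultdict(Counter)
    for actual, predicted in pairs:
        matrix[actual][predicted] += 1
    return matrix


def accuracy(matrix):
    """Accuracy = diagonal sum of the confusion matrix / total count."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row[actual] for actual, row in matrix.items())
    return correct / total if total else 0.0
```

A caller would build the learning dataset once with `train`, call `predict` on each test text, and pass the resulting `(actual, predicted)` pairs to `confusion_matrix` and `accuracy`. The toy category names used to exercise the sketch are hypothetical and are not the six SPA categories.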
