Automatic text categorization is one of the key techniques in information retrieval and data mining. Classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few achieve satisfactory efficiency. In this paper, we present a method that combines the Latent Dirichlet Allocation (LDA) algorithm with the Support Vector Machine (SVM). LDA is first used to generate a reduced-dimensional topic representation that serves as the feature set in the vector space model (VSM). This step reduces the number of features dramatically while retaining the necessary semantic information. The SVM is then employed to classify the data based on the generated features. We evaluate the algorithm on the 20 Newsgroups and Reuters-21578 datasets. The experimental results show that classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure, and does so in a much shorter time. Our process improves greatly upon previous work in this field and shows strong potential to streamline classification for a wide range of applications.
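The two-stage idea described above (LDA topic proportions as low-dimensional features, followed by an SVM) can be illustrated with a minimal sketch. This is not the implementation evaluated in the paper: it assumes scikit-learn, a linear SVM, a tiny made-up corpus in place of 20 Newsgroups or Reuters-21578, and an arbitrary topic count. The names `docs`, `labels` and `build_lda_svm` are hypothetical.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for a real benchmark dataset.
docs = [
    "the team won the football match with a late goal",
    "the striker scored a goal in the final match",
    "the coach praised the team after the football game",
    "the new processor improves laptop battery life",
    "the laptop ships with a faster processor and more memory",
    "software update improves memory use on the laptop",
]
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]


def build_lda_svm(n_topics=2, seed=0):
    """Chain word counts -> LDA topic proportions -> linear SVM.

    Assumption: a linear kernel and n_topics=2 are illustrative choices,
    not settings taken from the paper.
    """
    return make_pipeline(
        CountVectorizer(),  # bag-of-words term counts (high-dimensional)
        LatentDirichletAllocation(n_components=n_topics, random_state=seed),
        LinearSVC(),  # classifier trained on the topic features
    )


model = build_lda_svm()
model.fit(docs, labels)

# Document-topic matrix produced by the first two stages:
# one row per document, one column per topic.
topic_features = model[:-1].transform(docs)
predictions = model.predict(docs)

print(topic_features.shape)
print(list(predictions))
```

The SVM here sees only `n_topics` features per document instead of one feature per vocabulary term, which is the source of the speed-up the abstract refers to; on a real corpus the topic count would need to be tuned.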