A Locally Instance-Weighted Naive Bayes Text Classification Algorithm Based on the Distance Correlation Coefficient
Abstract:
Naive Bayes is simple and efficient and is widely used in text classification. The method assumes that attributes are conditionally independent given the class, an assumption that rarely holds in practice. In addition, in the big-data era text data often exhibit nonlinear structure, which classical naive Bayes algorithms fit poorly. To address these problems, this paper proposes a locally instance-weighted naive Bayes classification algorithm based on the distance correlation coefficient. First, the distance correlation coefficient between each attribute and the class is computed and embedded as an attribute weight in the document distance measure, yielding a new distance metric. Second, the distances between the training samples and the test sample are measured and used for instance selection and instance weighting, and a locally instance-weighted naive Bayes text classifier is built. Finally, the algorithm is evaluated experimentally on 15 text datasets from the WEKA platform. The results show that the proposed algorithm achieves higher classification accuracy than three classical naive Bayes text classifiers.
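The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so several choices here are assumptions made only for illustration. These are a 0/1 distance on class labels, a weighted Euclidean document distance, selection of the k nearest training documents, instance weights of 1/(1 + d), and a Laplace-smoothed multinomial model. The function names and the toy term-frequency matrix are hypothetical.

```python
import numpy as np

def _double_center(d):
    """Double-center a pairwise distance matrix (row, column, grand means)."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def dcor_from_distances(a, b):
    """Sample distance correlation from two n x n pairwise distance matrices."""
    A, B = _double_center(a), _double_center(b)
    dcov2 = max((A * B).mean(), 0.0)            # clip tiny negative rounding error
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0

def attribute_weights(X, y):
    """Distance correlation between each term-frequency column and the class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    b = (y[:, None] != y[None, :]).astype(float)  # assumption: 0/1 label distance
    return np.array([dcor_from_distances(np.abs(col[:, None] - col[None, :]), b)
                     for col in X.T])

def predict_local_nb(X_train, y_train, x_test, attr_w, k=5):
    """Classify one document with a naive Bayes model built on its neighbourhood."""
    X_train, y_train = np.asarray(X_train, dtype=float), np.asarray(y_train)
    x_test = np.asarray(x_test, dtype=float)
    classes = np.unique(y_train)
    # Assumption: attribute-weighted Euclidean distance between documents.
    d = np.sqrt((attr_w * (X_train - x_test) ** 2).sum(axis=1))
    nn = np.argsort(d)[:k]                      # instance selection: k nearest
    inst_w = 1.0 / (1.0 + d[nn])                # assumption: weight decays with distance
    Xk, yk = X_train[nn], y_train[nn]
    scores = []
    for c in classes:
        wc = inst_w * (yk == c)                 # instance weights restricted to class c
        prior = (wc.sum() + 1.0) / (inst_w.sum() + len(classes))
        counts = wc @ Xk                        # weighted term frequencies in class c
        cond = (counts + 1.0) / (counts.sum() + Xk.shape[1])  # Laplace smoothing
        scores.append(np.log(prior) + x_test @ np.log(cond))
    return classes[int(np.argmax(scores))]

# Hypothetical toy data: 6 documents, 3 terms, 2 classes.
X_train = np.array([[3, 0, 1], [2, 0, 0], [4, 1, 1],
                    [0, 3, 1], [0, 2, 0], [1, 4, 1]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
w = attribute_weights(X_train, y_train)
print(w)
print(predict_local_nb(X_train, y_train, [3, 0, 1], w, k=4))
```

In this toy example the third term is distributed identically in both classes, so its distance correlation with the class is close to zero and it contributes almost nothing to the document distance, whereas the first two terms receive large weights.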