An Improved Conditional Probability Distribution Distance Measure Based on Feature Selection and Weighting
Abstract:
To improve the accuracy of recognizing differences between instances with nominal attributes, and thereby the accuracy of classification algorithms, an improved conditional probability distribution distance measure based on feature selection and weighting is proposed that takes full account of the dependencies among attributes. First, a feature selection mechanism is constructed using symmetric uncertainty. Second, on this basis, the information gain ratio between each attribute and the class is calculated to obtain per-attribute weights, and a weighted distance is computed. Finally, simulation experiments are conducted on 19 datasets with the K-Nearest Neighbors algorithm. The results indicate that the proposed distance measure effectively improves the performance of classification algorithms.
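The three-step pipeline summarized above (symmetric-uncertainty feature selection, gain-ratio weighting, and a conditional probability distribution distance) can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the selection threshold, and the L1 form of the distance over class-conditional distributions are assumptions made for the sketch.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def symmetric_uncertainty(x, y):
    """SU(X,Y) = 2 I(X;Y) / (H(X) + H(Y)); defined as 0 when both entropies vanish."""
    hx, hy = entropy(x), entropy(y)
    return 0.0 if hx + hy == 0 else 2.0 * mutual_information(x, y) / (hx + hy)

def gain_ratio(x, c):
    """Information gain ratio of attribute X w.r.t. class C: I(X;C) / H(X)."""
    hx = entropy(x)
    return 0.0 if hx == 0 else mutual_information(x, c) / hx

def select_features(data, labels, threshold=0.0):
    """Keep attributes whose SU with the class exceeds a threshold
    (an assumed stand-in for the paper's selection mechanism)."""
    m = len(data[0])
    return [i for i in range(m)
            if symmetric_uncertainty([row[i] for row in data], labels) > threshold]

def _cond_dist(col, labels, value, classes):
    """Empirical class distribution P(c | attribute = value)."""
    idx = [j for j, v in enumerate(col) if v == value]
    n = len(idx)
    return {c: (sum(1 for j in idx if labels[j] == c) / n if n else 0.0)
            for c in classes}

def cpd_distance(inst_a, inst_b, data, labels, weights, selected):
    """Weighted conditional-probability-distribution distance between two
    nominal instances: sum over selected attributes i of
    w_i * sum_c |P(c | a_i) - P(c | b_i)| (L1 form assumed here)."""
    classes = sorted(set(labels))
    d = 0.0
    for i in selected:
        col = [row[i] for row in data]
        pa = _cond_dist(col, labels, inst_a[i], classes)
        pb = _cond_dist(col, labels, inst_b[i], classes)
        d += weights[i] * sum(abs(pa[c] - pb[c]) for c in classes)
    return d
```

In use, one would select attributes with `select_features`, set `weights[i] = gain_ratio(attribute_i, labels)`, and plug `cpd_distance` into a K-NN classifier as its metric.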
[1] | Ayats, H.A., Cellier, P. and Ferré, S. (2024) Concepts of Neighbors and Their Application to Instance-Based Learning on Relational Data. International Journal of Approximate Reasoning, 164, Article ID: 109059. https://doi.org/10.1016/j.ijar.2023.109059 |
[2] | Aha, D.W., Kibler, D. and Albert, M.K. (1991) Instance-Based Learning Algorithms. Machine Learning, 6, 37-66. https://doi.org/10.1007/bf00153759 |
[3] | El Hindi, K. (2013) Specific-Class Distance Measures for Nominal Attributes. AI Communications, 26, 261-279. https://doi.org/10.3233/aic-130565 |
[4] | Short, R. and Fukunaga, K. (1981) The Optimal Distance Measure for Nearest Neighbor Classification. IEEE Transactions on Information Theory, 27, 622-627. https://doi.org/10.1109/tit.1981.1056403 |
[5] | Quang, L.S. and Bao, H.T. (2004) A Conditional Probability Distribution-Based Dissimilarity Measure for Categorial Data. In: Dai, H., Srikant, R. and Zhang, C., Eds., Advances in Knowledge Discovery and Data Mining, Springer, 580-589. https://doi.org/10.1007/978-3-540-24775-3_69 |
[6] | Ienco, D., Pensa, R.G. and Meo, R. (2012) From Context to Distance: Learning Dissimilarity for Categorical Data Clustering. ACM Transactions on Knowledge Discovery from Data, 6, 1-25. https://doi.org/10.1145/2133360.2133361 |
[7] | Myles, J.P. and Hand, D.J. (1990) The Multi-Class Metric Problem in Nearest Neighbour Discrimination Rules. Pattern Recognition, 23, 1291-1297. https://doi.org/10.1016/0031-3203(90)90123-3 |
[8] | Guyon, I. and Elisseeff, A. (2003) An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157-1182. |
[9] | Gong, F. (2021) Research on the Improvement and Application of the Inverted Specific-Class Distance Measure. Ph.D. Thesis, China University of Geosciences, Wuhan. (In Chinese) |
[10] | Li, C. (2012) Research on Distance Measures for Nominal Attributes and Their Applications. Ph.D. Thesis, China University of Geosciences, Wuhan. (In Chinese) |
[11] | Qiu, C., Jiang, L. and Li, C. (2015) Not Always Simple Classification: Learning SuperParent for Class Probability Estimation. Expert Systems with Applications, 42, 5433-5440. https://doi.org/10.1016/j.eswa.2015.02.049 |
[12] | Gong, F., Jiang, L., Zhang, H., Wang, D. and Guo, X. (2020) Gain Ratio Weighted Inverted Specific-Class Distance Measure for Nominal Attributes. International Journal of Machine Learning and Cybernetics, 11, 2237-2246. https://doi.org/10.1007/s13042-020-01112-8 |