A KPCA-Based Undersampling Algorithm for Imbalanced Data
Abstract:
In real-world classification tasks, imbalanced data are often nonlinearly distributed, and traditional sampling methods handle this nonlinearity poorly, which degrades classification performance. To address this problem, this paper proposes an undersampling method based on kernel principal component analysis (KPCA). The method uses a nonlinear kernel function to map the original data into a suitable high-dimensional space in which they become linearly separable, then selectively removes redundant majority-class samples according to each sample's scores on the kernel principal components, thereby achieving undersampling. Nine datasets with different balance rates were preprocessed with the proposed undersampling method and then classified with a Logistic Regression classifier. In the experiments, the proposed method obtained the highest score on 7, 8, and 9 of the nine datasets in terms of Accuracy, F1-measure, and AUC, respectively, indicating good classification performance on imbalanced datasets.
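The abstract does not state the exact rule used to pick which majority-class samples to remove, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes an RBF kernel, a composite score formed by weighting each kernel principal component by its share of the retained eigenvalues, and a thinning rule that keeps evenly spaced samples along the sorted scores so that samples with near-identical scores are treated as redundant. The names `kpca_undersample`, `n_keep`, `gamma`, and `n_components` are hypothetical and chosen for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kpca_scores(X, gamma=1.0, n_components=2):
    """Scores of the samples in X on the leading kernel principal components."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    # Center the kernel matrix, i.e. center the data in feature space.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigh returns eigenvalues in ascending order; reverse to put the largest first.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals = np.maximum(eigvals[::-1][:n_components], 1e-12)
    eigvecs = eigvecs[:, ::-1][:, :n_components]
    # For the fitted samples, the projection equals eigenvector * sqrt(eigenvalue).
    return eigvecs * np.sqrt(eigvals), eigvals


def kpca_undersample(X, y, majority_label, n_keep, gamma=1.0, n_components=2):
    """Thin the majority class with KPCA scores (assumed selection rule)."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    scores, eigvals = kpca_scores(X[maj_idx], gamma, n_components)
    # Assumption: composite score = components weighted by eigenvalue share.
    composite = scores @ (eigvals / eigvals.sum())
    # Assumption: keep evenly spaced samples along the sorted composite score,
    # dropping the samples in between as redundant.
    order = np.argsort(composite)
    picks = np.unique(np.linspace(0, len(order) - 1, n_keep).round().astype(int))
    keep = np.sort(np.concatenate([maj_idx[order[picks]], min_idx]))
    return X[keep], y[keep]


# Synthetic 9:1 imbalanced data, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(2.5, 0.5, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

X_res, y_res = kpca_undersample(X, y, majority_label=0, n_keep=20)
print(X_res.shape, np.bincount(y_res))
```

The resampled `X_res` and `y_res` could then be passed to a classifier such as scikit-learn's `LogisticRegression` and scored with Accuracy, F1-measure, and AUC. The kernel, its `gamma` value, and the number of components would all need tuning per dataset.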