Unsupervised Clustering Method for Set-Valued Data and Its Application
Abstract:
Classification is a fundamental problem in data mining and machine learning. Most classification methods, however, handle only vector-valued samples and pay little attention to the set-valued samples that are common in practice. This paper proposes Wk-means, an unsupervised clustering algorithm based on the Wasserstein distance: an entropy-regularized optimal transport model measures the distance between set-valued data points, and this distance is combined with a clustering scheme to obtain a method applicable to set-valued data. To verify its effectiveness, experiments are first conducted on several public data sets. The results confirm that Wk-means performs well on set-valued data with many samples, categories, and features, and statistical tests show that it differs significantly from the other algorithms compared. Wk-means is then applied to the Fuyang River water quality data set, where it classifies water quality categories more accurately and runs more efficiently than traditional data clustering algorithms. Wk-means thus performs well in classifying set-valued water quality data and can provide valuable decision support for environmental monitoring and management.
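To make the idea of clustering set-valued samples with an entropy-regularized optimal transport distance concrete, the following is a minimal sketch and not the paper's Wk-means implementation. Several things here are assumptions that the abstract does not specify: the function names `sinkhorn_cost` and `cluster_sets_by_ot` are hypothetical, each set is given uniform weights, the ground cost is squared Euclidean, the regularization strength `eps` is taken relative to the largest ground cost, and cluster centers are updated as medoids with a farthest-first initialization.

```python
import numpy as np


def sinkhorn_cost(X, Y, eps=0.1, n_iter=200):
    """Entropy-regularized OT cost between two point sets (uniform weights assumed)."""
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    # Squared Euclidean ground cost between every pair of points.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    # Assumption: eps is relative to the largest cost, which keeps the kernel away from 0.
    scale = C.max() or 1.0
    K = np.exp(-C / (eps * scale))
    u = np.ones_like(a)
    for _ in range(n_iter):
        # Sinkhorn scaling: alternately match the column and row marginals.
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # approximate transport plan
    return float((P * C).sum())


def cluster_sets_by_ot(sets, k, n_iter=20):
    """Cluster a list of point sets using pairwise Sinkhorn costs and medoid updates."""
    n = len(sets)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = sinkhorn_cost(sets[i], sets[j])
    # Assumption: deterministic farthest-first initialization of the medoids.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(int(D[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members):
                # The medoid is the member with the smallest total cost to its cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids


# Toy data: each sample is itself a set of 2-D points, and sets may differ in size.
rng = np.random.default_rng(0)
toy_sets = [rng.normal(0.0, 0.5, (8, 2)) for _ in range(3)]
toy_sets += [rng.normal(10.0, 0.5, (6, 2)) for _ in range(3)]
toy_labels, toy_medoids = cluster_sets_by_ot(toy_sets, k=2)
```

Medoids are used here only because they need nothing beyond the pairwise cost matrix; a centroid-style update closer to classical k-means would require computing a barycenter of each cluster, which this sketch does not attempt.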